Troubleshooting

The non-obvious failures hit while standing up Ultron Infra and onboarding its first example app (Penvoice), each written as Symptom → Cause → Fix. They are reproduced from the build notes; trust them over intuition.

Keycloak unreachable / serving Traefik’s default cert

Symptom. A host service (e.g. host-Docker Keycloak fronted by host nginx) is silently bypassed; the browser gets Traefik’s default self-signed cert instead of a response from the expected backend.

Cause. k3s ServiceLB (klipper) hijacks host ports 80/443 via iptables DNAT — it intercepts public traffic before it reaches a host nginx socket. So nginx never sees the request and Traefik answers instead.

Fix. Don’t run host services on 80/443 alongside k3s. Make Traefik the sole edge and route everything through Kubernetes Ingress (this is why the host Keycloak container + host nginx were retired).
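One rough way to confirm this diagnosis is to check which certificate the public port actually serves. The sketch below assumes openssl is installed and relies on the generated Traefik default certificate carrying the common name TRAEFIK DEFAULT CERT; the function names and the hostname auth.example.com are hypothetical placeholders, not part of Ultron Infra.

```shell
# Hypothetical helpers; names are illustrative only.

# Print the subject of the certificate served for a hostname on port 443.
served_cert_subject() {
  host="$1"
  openssl s_client -connect "${host}:443" -servername "${host}" </dev/null 2>/dev/null \
    | openssl x509 -noout -subject
}

# Read a certificate subject on stdin; succeed if it is Traefik's generated default cert.
is_traefik_default_cert() {
  grep -q "TRAEFIK DEFAULT CERT"
}

# Usage (placeholder hostname):
#   served_cert_subject auth.example.com | is_traefik_default_cert \
#     && echo "Traefik answered: the host service is being bypassed"
#
# To list the ServiceLB pods that hold the host ports:
#   kubectl get pods -A -o wide | grep svclb
```

If the check succeeds while the host nginx is configured with its own certificate, the request never reached nginx, which matches the cause described above.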

CNPG backups to Oracle fail with `NotImplemented`

Symptom. Backups/WAL uploads to Oracle Object Storage via CNPG fail; the error mentions NotImplemented for a checksum.

Cause. botocore ≥ 1.36 sends checksum trailers by default, which Oracle’s S3-compatible API rejects with NotImplemented. The region must also be set explicitly.

Fix. On the CNPG cluster, set these environment variables:

AWS_REQUEST_CHECKSUM_CALCULATION=when_required
AWS_RESPONSE_CHECKSUM_VALIDATION=when_required
AWS_DEFAULT_REGION=af-johannesburg-1
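As a sketch of where these variables could live, the helper below prints a JSON merge patch that puts them under .spec.env of the CNPG Cluster resource. This assumes the variables are applied through the Cluster's env list; the function name is hypothetical, and <name> and <ns> are placeholders. A merge patch replaces the whole env list, so any existing entries would need to be included as well.

```shell
# Hypothetical helper: print a JSON merge patch that sets the three variables
# under .spec.env of a CNPG Cluster. A merge patch replaces the entire list.
cnpg_env_patch() {
  cat <<'EOF'
{"spec":{"env":[
  {"name":"AWS_REQUEST_CHECKSUM_CALCULATION","value":"when_required"},
  {"name":"AWS_RESPONSE_CHECKSUM_VALIDATION","value":"when_required"},
  {"name":"AWS_DEFAULT_REGION","value":"af-johannesburg-1"}
]}}
EOF
}

# Usage (<name> and <ns> are placeholders):
#   kubectl patch clusters.postgresql.cnpg.io <name> -n <ns> \
#     --type merge -p "$(cnpg_env_patch)"
```

If the Cluster is managed declaratively, the same three entries belong in its manifest instead, since a live patch would be overwritten on the next sync.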

CNPG backups fail with a signature / header error

Symptom. Backups fail SigV4 signing or header parsing, even though the keys look correct.

Cause. The Oracle Customer Secret Key halves were swapped. The access key is clean hex; the secret key contains +/=. Putting the secret (with a /) into the access-key slot inserts a / into the SigV4 Credential and breaks header parsing.

Fix. Access key = the clean hex string; secret key = the one with +/=. Don’t swap them.
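The rule above can be turned into a quick sanity check before the values go into a Secret. This is a minimal sketch based only on the shapes described here (pure hex versus a string containing +, / or =); the function name is hypothetical and the sample strings in the usage comments are made up.

```shell
# Hypothetical helper: guess which half of an Oracle Customer Secret Key a string is.
# Prints "access" for pure hex, "secret" if it contains +, / or =, otherwise "unknown".
classify_key_half() {
  case "$1" in
    "")             echo unknown ;;
    *[+/=]*)        echo secret ;;
    *[!0-9a-fA-F]*) echo unknown ;;
    *)              echo access ;;
  esac
}

# Usage (made-up values):
#   classify_key_half "0123abcdef"    # prints: access
#   classify_key_half "aB3+xy/Z="     # prints: secret
```

A value classified as secret sitting in the access-key slot is the swap described above.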

Certificate stuck `READY=False`

Symptom. A cert-manager Certificate never becomes ready:

kubectl get certificate -A
# READY=False, stuck on the HTTP-01 challenge

Cause. Let’s Encrypt’s HTTP-01 challenge can’t reach the host. Either the public DNS A record points at the Tailscale 100.x IP instead of the node’s public IP, or port 80 is closed (host firewall / VCN).

Fix. Point the ingress hostname’s A record at the node’s public IP, and open 80/443 to the internet (6443 stays private). Re-check:

kubectl describe certificate <name> -n <ns>
kubectl get challenges -A
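The DNS half of this failure can be checked from any machine. The sketch below tests whether an IPv4 address falls in 100.64.0.0/10, the range Tailscale assigns addresses from; the function name and the hostname app.example.com are hypothetical, and the usage assumes dig and curl are available.

```shell
# Hypothetical helper: succeed if an IPv4 address is in 100.64.0.0/10
# (first octet 100, second octet 64 to 127), the range Tailscale uses.
is_tailscale_ip() {
  first=${1%%.*}
  rest=${1#*.}
  second=${rest%%.*}
  [ "$first" = "100" ] && [ "$second" -ge 64 ] 2>/dev/null && [ "$second" -le 127 ]
}

# Usage (placeholder hostname; assumes the first answer line is an A record):
#   ip=$(dig +short A app.example.com | head -n1)
#   is_tailscale_ip "$ip" && echo "A record points at a Tailscale address: $ip"
#
# From outside the tailnet, check that port 80 answers at all:
#   curl -sS -o /dev/null -w '%{http_code}\n' http://app.example.com/
```

Any HTTP status from the curl call means port 80 is reachable; a timeout or connection refusal points at the host firewall or VCN rules.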

CNPG `barmanObjectStore` deprecation

Symptom. Deprecation warnings on the in-tree backup config; backups break after a CNPG operator upgrade.

Cause. The in-tree barmanObjectStore is deprecated — it works through CNPG 1.29 and is removed in 1.30.

Fix. Stay on ≤ 1.29 for now; migrate to the Barman Cloud plugin before upgrading the operator past 1.29.
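A small guard can make the version cutoff explicit before an operator upgrade. This sketch encodes only the 1.29 boundary stated above; the function name is hypothetical, and the usage assumes the operator runs as the cnpg-controller-manager Deployment in the cnpg-system namespace with a version tag on its image, which may not match a Helm-based install.

```shell
# Hypothetical helper: succeed if an operator version (e.g. "1.30.0" or "v1.30.0")
# is newer than the 1.29 line, i.e. past the point where the in-tree
# barmanObjectStore is expected to be gone.
past_barman_cutoff() {
  v=${1#v}
  major=${v%%.*}
  rest=${v#*.}
  minor=${rest%%.*}
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -gt 29 ]; }
}

# Usage (assumed namespace and Deployment name; image must carry a version tag):
#   image=$(kubectl get deploy -n cnpg-system cnpg-controller-manager \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
#   past_barman_cutoff "${image##*:}" && echo "Migrate to the Barman Cloud plugin first"
```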

busybox `nslookup` shows NXDOMAIN (red herring)

Symptom. A debug nslookup from a busybox pod returns NXDOMAIN for a service, suggesting broken DNS.

Cause. busybox’s nslookup mishandles short names / search-domain expansion. The DNS is usually fine.

Fix. Query the FQDN instead (e.g. prometheus-operated.monitoring.svc.cluster.local), or use a different tool. Don’t chase a DNS outage off the busybox result alone.
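To avoid the short-name problem entirely, the FQDN can be built from the service name and namespace. A minimal sketch, assuming the default cluster domain cluster.local; the function name and the pod name dns-debug are hypothetical.

```shell
# Hypothetical helper: build a Service FQDN from its name and namespace.
# The third argument overrides the cluster domain (default: cluster.local).
svc_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# Usage: run the lookup from a throwaway busybox pod using the full name.
#   kubectl run dns-debug --rm -it --restart=Never --image=busybox -- \
#     nslookup "$(svc_fqdn prometheus-operated monitoring)"
```

If the FQDN resolves while the short name does not, the problem is the busybox resolver behaviour described above rather than cluster DNS.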

OIDC redirect breaks on Vercel previews

Symptom. The web app’s login (OIDC redirect to Keycloak) fails on a Vercel preview deployment — stuck behind an auth wall.

Cause. Vercel Deployment Protection puts a preview auth wall in front of the deployment, which intercepts the OIDC redirect flow.

Fix. Disable Deployment Protection for the preview environment so the OIDC redirect can complete.

Canary stalls on pod readiness

Symptom. A canary rollout sits at step 1 indefinitely; the new ReplicaSet shows 0 Ready pods. The metric gate isn’t the blocker.

Cause. The new pods are failing readiness (/readyz) — usually a bad config repoint, a missing/wrong Secret, or an unreachable dependency.

Fix. This is the canary working as designed — stable pods keep serving. Diagnose the new pods, fix the config, push again:

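# Watch the rollout; the canary ReplicaSet is the one showing 0 Ready pods
kubectl argo rollouts get rollout penvoice-api -n penvoice --watch
# Readiness probe failures appear under Events (this selector also matches the stable pods)
kubectl describe pod -n penvoice -l app=penvoice-api
# Recent logs from the matching pods
kubectl logs -n penvoice -l app=penvoice-api --tail=50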

See Operate rollouts for promote/abort/retry.
