# Without AI — Part 7: Troubleshooting

**Track:** Advanced (No AI)

**Time for this section:** Ongoing (this is a reference you return to)

Troubleshooting is not a step you complete; it is a skill you build over time. This section teaches the methodology and documents the real issues encountered building KiteStacks, with full explanations of how each was diagnosed and fixed.

---

## The Troubleshooting Mindset

Before running any command, form a hypothesis. Before Googling, read the error.

**The diagnostic loop:**

1. **Observe**: what exactly is failing? URL? Error message? Which service?
2. **Hypothesize**: what could cause this? List 2–3 possibilities.
3. **Test**: run the simplest command that proves or disproves your hypothesis.
4. **Narrow**: eliminate possibilities until one remains.
5. **Fix**: apply the fix.
6. **Verify**: confirm the fix worked.
7. **Document**: write down what broke and what fixed it.

The most common mistake: jumping to step 5 without completing steps 2–4.

---

## Diagnostic Commands to Know Cold

```bash
# Container status
docker ps                            # All running containers
docker ps -a                         # All containers (including stopped)
docker inspect <container>           # Full container config and state

# Logs
docker logs <container>              # All logs
docker logs --tail 50 <container>    # Last 50 lines
docker logs -f <container>           # Follow live
docker logs --since 5m <container>   # Last 5 minutes

# Network
docker exec <container> curl -s http://other-container:port/health
docker exec <container> nslookup other-container
docker exec <container> ss -tlnp
docker network inspect kitestacks

# Disk and resources
docker system df                     # Docker disk usage
docker stats --no-stream             # One-shot resource usage
df -h                                # Host disk usage
free -h                              # Host RAM

# DNS and HTTP from host
curl -sv https://grafana.kitestacks.com   # -v = verbose (shows headers, TLS)
dig grafana.kitestacks.com                # DNS lookup
```

---

## Real Issues Encountered Building KiteStacks

### Issue 1 — SSO: `invalid_grant` on OAuth Login (50% of the time)

**Symptom:** Clicking "Sign in with Authentik" in Grafana, Kavita, etc.
sometimes worked and sometimes showed `invalid_grant: The provided authorization grant is invalid`. This happened roughly 50% of the time, with no correlation to time of day.

**Observation:** The error appeared specifically after the authorization code redirect, during the token exchange step.

**Hypothesis:**

1. Authentik configuration is wrong (but then it would fail 100% of the time).
2. Network issue (but HTTP 400 means the request reached Authentik).
3. The authorization code created in step 1 (`/authorize`) is not found in step 2 (the token exchange).

**Testing:**

```bash
# Check if both Authentik instances have the same database
# (run the same query against each instance's Postgres and compare)
docker exec authentik psql -U authentik -h "$KSCLOUD1_IP" \
  -c "SELECT count(*) FROM authentik_providers_oauth2_authorizationcode;"
# Monk's Authentik:     count = 3
# kscloud1's Authentik: count = 1
# Different! Step 1 created the code in one DB, step 2 looked in the other.
```

**Root cause:** Two Authentik instances, two separate Postgres databases. Cloudflare routes `/authorize` and `/application/o/token/` independently, so the two requests can hit different hosts.

**Fix:** Migrate both Authentik instances to a single shared Postgres, hosted on kscloud1 and bound to the Tailscale IP only.

```bash
# 1. Dump monk's Authentik DB
docker exec authentik-postgres pg_dump -U authentik authentik --clean --if-exists \
  > /tmp/authentik_dump.sql

# 2. Restore to kscloud1's new shared Postgres
scp /tmp/authentik_dump.sql kenpat@100.123.x.x:/tmp/
ssh kenpat@100.123.x.x "docker exec -i authentik-postgres psql -U authentik -d authentik \
  < /tmp/authentik_dump.sql"

# 3. Update monk's Authentik .env to point to kscloud1's Tailscale IP
AUTHENTIK_POSTGRESQL__HOST=100.123.x.x
AUTHENTIK_REDIS__HOST=100.123.x.x

# 4. Remove monk's local Postgres and Redis
docker stop authentik-postgres authentik-redis
# Stop, don't delete (keep data as backup)

# 5. Restart monk's Authentik
docker compose up -d
```

**Verification:** Logged in from a browser with DevTools open, watching the Network tab. `/authorize` returned 302 with a code.
`/token` returned 200 with a JWT. Done.

**Lesson:** Stateful services with active-active routing need shared state. Any session, token, or code stored in one instance's database is invisible to the other instance.

---

### Issue 2 — Phantom Third Connector in Cloudflare Dashboard

**Symptom:** Cloudflare Tunnel showed 3 active connectors when only 2 were expected (monk + kscloud1). Which was the third?

**Investigation:**

```bash
# Check running Docker containers for cloudflared
docker ps | grep cloudflared
# Shows: one cloudflared container (expected)

# Check for non-Docker cloudflared processes
ps aux | grep cloudflared
# Shows: TWO processes!
#   /usr/bin/cloudflared        (system-installed, running as a systemd service)
#   /usr/local/bin/cloudflared  (Docker container)
```

**Root cause:** A cloudflared systemd service was installed separately from the Docker container. Both connected to the same tunnel with the same token, registering as separate connectors.

```bash
# Verify the systemd service
sudo systemctl status cloudflared

# Fix: disable the systemd service
sudo systemctl stop cloudflared
sudo systemctl disable cloudflared

# Verify only one connector process remains
ps aux | grep cloudflared
```

**Verification:** Cloudflare dashboard refreshed to show 2 connectors within 30 seconds.

**Lesson:** A service installed via package manager AND in Docker is a recipe for duplicate processes. Check both `docker ps` and `ps aux` when troubleshooting unexpected behavior.

---

### Issue 3 — Karakeep SSO "Redirect URI Error"

**Symptom:** After configuring Authentik OAuth2 for Karakeep, clicking "Sign in" showed "Redirect URI Error: The provided redirect_uri does not match any of the allowed redirect URIs" from Authentik.

**Investigation:**

```bash
# Check what redirect URI was used in the OAuth2 request
# Read from Authentik's logs
docker logs authentik --tail 100 | grep "redirect_uri"
# Shows: redirect_uri=https://links.kitestacks.com/api/auth/callback/authentik
```

**Root cause:** Karakeep uses NextAuth.js internally with provider ID `custom`. NextAuth constructs callback URLs as `/api/auth/callback/<provider-id>`. The provider ID is `custom`, not `authentik`, so the callback is `/api/auth/callback/custom`, not `/api/auth/callback/authentik`.

**Fix:**

```bash
# Update Authentik's OAuth2 provider for Karakeep in the shared Postgres
docker exec -it authentik-postgres psql -U authentik -d authentik

BEGIN;
UPDATE authentik_providers_oauth2_oauth2provider
SET _redirect_uris = '["https://links.kitestacks.com/api/auth/callback/custom"]'
WHERE name = 'Karakeep';
COMMIT;

-- Verify
SELECT name, _redirect_uris
FROM authentik_providers_oauth2_oauth2provider
WHERE name = 'Karakeep';
\q
```

Restart Authentik on both hosts:

```bash
docker compose restart authentik authentik-worker
# Wait for healthy before testing
```

**Lesson:** When you get a redirect URI mismatch, always check what URI the APP is actually sending, not what you think it should send. The app's logs or the browser DevTools Network tab show the actual request.

---

### Issue 4 — Kavita OIDC Config Gets Wiped on Restart

**Symptom:** Configured Kavita's OIDC settings by editing `kavita.db` directly (using sqlite3). Settings looked correct in the DB. After `docker compose restart kavita`, the OIDC config was reset to empty/disabled.

**Investigation:**

```bash
# Check the ServerSetting row before and after restart
docker exec -it kavita sqlite3 /kavita/config/kavita.db \
  "SELECT Value, RowVersion FROM ServerSetting WHERE \"Key\"=40;"
# Before restart: {"enabled":true,"authority":"...","clientId":"kavita",...}, RowVersion=8
# After restart:  {"enabled":false,"authority":"","clientId":"","clientSecret":"",...}, RowVersion=10
# RowVersion incremented by 2: Kavita wrote to the row twice during startup
```

**Root cause:** Kavita validates and resets `ServerSetting` rows during startup from its own defaults. Any value that does not pass Kavita's internal validation (including OIDC config with the wrong format) gets reset to defaults. Direct SQL writes do not go through Kavita's validation pipeline, so they get overwritten.

**Fix:** Use Kavita's own Settings UI via SSH port forwarding to bypass Cloudflare and reach kscloud1's Kavita directly:

```bash
# Forward kscloud1's Kavita port to localhost
ssh -L 5099:localhost:5000 -i ~/.ssh/id_ed25519_kscloud1 kenpat@100.123.x.x -N &

# Now visit http://localhost:5099 in browser
# Log in with your Kavita credentials
# Settings → OIDC → configure there
# Click Save → changes survive restart
```

**Verification:** After saving in the UI, checked that `RowVersion` was not incrementing on restart.

**Lesson:** Do not write directly to application databases unless you know the app does not reinitialize those values on startup. Use the application's own APIs or UI.

**Critical detail:** The Authority URL MUST have a trailing slash: `https://auth.kitestacks.com/application/o/kavita/`. Without it you get an "issuer does not match" error, because Authentik's `openid-configuration` returns an `issuer` field that includes the trailing slash, and Kavita compares the two exactly.

---

### Issue 5 — SSO Login Fails After monk Reconnects

**Symptom:** When monk went offline and came back, SSO logins failed for 5–10 minutes with `invalid_grant`, then started working again.

**Investigation:** Timeline reconstruction:

- T+0: monk goes offline (power or network)
- T+0: kscloud1 handles all traffic solo. SSO works fine, codes stored in shared DB
- T+5min: monk comes back online, cloudflared reconnects
- T+5min to T+8min: monk's Authentik is still starting (container startup takes ~3–4 min)
- During this window: Cloudflare routes some `/authorize` to kscloud1, some `/token` to monk
- Monk's Authentik hasn't finished starting, so it responds with errors or invalid state

**Root cause:** The OAuth2 authorization code has a 1-minute TTL (default). Monk's Authentik takes 3–5 minutes to fully start. During startup, Cloudflare is already routing traffic to monk's cloudflared (which is running), but monk's Authentik is not ready. Codes created on kscloud1 expire before monk's Authentik is healthy enough to serve them.

**Fix:** Increase the OAuth2 code TTL from 1 minute to 10 minutes:

```bash
docker exec -it authentik-postgres psql -U authentik -d authentik

UPDATE authentik_providers_oauth2_oauth2provider
SET access_code_validity = '00:10:00';
\q
```

Restart both Authentik instances. Codes now have a 10-minute window, which is enough for monk to finish starting before the code expires.

**Alternative/additional fix:** Add a health check to monk's cloudflared or Authentik that keeps cloudflared from accepting traffic until Authentik is healthy.

---

### Issue 6 — kscloud1 SSH Key Auth Broken After Long Absence

**Symptom:** After not connecting to kscloud1 for several weeks, `ssh kenpat@kscloud1` returned "Permission denied (publickey)".

**Investigation:**

```bash
ssh -v -i ~/.ssh/id_ed25519_kscloud1 kenpat@100.123.x.x
# Verbose output showed: offered key was not accepted
# No other errors: the key was being offered but rejected
```

**Root cause:** The `authorized_keys` file on kscloud1 had somehow been reset or corrupted (possibly from a VPS maintenance event or snapshot restore).

**Fix:** Use Hetzner's console (a web-based terminal that does not require SSH):

1. Hetzner dashboard → Server → Console
2. Log in as root (reset the root password via the Hetzner UI if needed)
3. Restore the public key:

```bash
# On kscloud1 via Hetzner console
mkdir -p /home/kenpat/.ssh
cat >> /home/kenpat/.ssh/authorized_keys << 'EOF'
ssh-ed25519 AAAA... your-public-key-here
EOF
chmod 700 /home/kenpat/.ssh
chmod 600 /home/kenpat/.ssh/authorized_keys
chown -R kenpat:kenpat /home/kenpat/.ssh
```

**Lesson:** Always keep your public key backed up. Cloud providers (Hetzner, AWS, DigitalOcean) all have web-based console access for exactly this situation. Never rely only on SSH for access to a remote server.

---

### Issue 7 — ufw Blocking Docker Container to Host Port

**Symptom:** The portal homepage on kscloud1 showed "0%" and "Offline" for the System Status widget. On monk it showed real values.

**Investigation:**

```bash
# Test the metrics API directly from inside the homepage container on kscloud1
docker exec homepage-backup curl -s http://host.docker.internal:8000/api/metrics
# No response after timeout

# Test from host directly
curl -s http://localhost:8000/api/metrics
# Returns real metrics immediately

# Check ufw rules
sudo ufw status verbose
# default deny incoming, and no specific rule for port 8000
```

**Root cause:** The `kitestacks-metrics-api` container runs with `network_mode: host`, so it listens directly on the host's port 8000. When `homepage-backup` calls `host.docker.internal:8000`, the kernel sees the source IP as a Docker bridge address (`172.x.x.x`), and ufw's `default deny incoming` blocks it. Docker's iptables bypass (which lets published ports work despite ufw) does not apply here because the destination is a port on the host itself, not a container-published port.

**Fix:**

```bash
sudo ufw allow from 172.16.0.0/12 to any port 8000 proto tcp
sudo ufw status verbose   # Verify rule added
```

`172.16.0.0/12` covers 172.16.x.x through 172.31.x.x, the range that Docker's default bridge subnets are normally allocated from.
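
To sanity-check which source addresses a rule such as `172.16.0.0/12` actually matches, the CIDR arithmetic can be reproduced in a few lines of shell. This is only an illustrative sketch: the helper names `ip_to_int` and `in_cidr` and the sample addresses are assumptions made for the example, not part of the KiteStacks setup or of ufw.

```bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The function body is a subshell so the IFS change does not leak.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeed (exit 0) if address $1 falls inside CIDR block $2.
in_cidr() (
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")   # network part, e.g. 172.16.0.0
  bits=${2#*/}                 # prefix length, e.g. 12
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
)

# Hypothetical sample addresses: two bridge-style IPs and two outside the block.
for addr in 172.17.0.2 172.31.255.254 172.32.0.1 192.168.1.10; do
  if in_cidr "$addr" 172.16.0.0/12; then
    echo "$addr: matched by 172.16.0.0/12"
  else
    echo "$addr: NOT matched by 172.16.0.0/12"
  fi
done
```

If a container's bridge address comes out as not matched (for example because a network was created with a custom subnet), the ufw rule above would need an additional `allow from` entry for that subnet.
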

**Verification:**

```bash
docker exec homepage-backup curl -s http://host.docker.internal:8000/api/metrics
# Now returns: {"cpu_percent": 4.2, "ram_percent": 71.3, ...}
```

---

## General Troubleshooting Cheatsheet

| Symptom | First Commands to Run |
|---------|----------------------|
| Container won't start | `docker logs <container>` |
| Container starts then crashes | `docker logs --tail 30 <container>` |
| Can't reach service from browser | `docker exec cloudflared curl -s http://<container>:<port>` |
| SSL/TLS error in browser | `curl -sv https://yourdomain.com` (check Cloudflare is resolving) |
| SSO failing with invalid_grant | Check both Authentik instances point to same shared Postgres |
| Database error | Check data directory permissions: `ls -la ./data/` |
| Port already in use | `sudo ss -tlnp \| grep :<port>` |
| Out of disk space | `df -h` and `docker system df` |
| Out of RAM | `free -h` and `docker stats --no-stream` |
| Can't ping between containers | `docker network inspect kitestacks` |
| Forgejo 502 | `docker logs forgejo` (likely a DB connection issue) |
| Authentik won't start | Check it can reach `$KSCLOUD1_TAILSCALE:5432` (Tailscale up?) |
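
As a small companion to the "Out of disk space" row, the first-pass check can be turned into a filter that flags only the filesystems worth looking at. This is a minimal sketch under stated assumptions: the function name `disks_over` and the 85% threshold are hypothetical choices for illustration, and it relies only on the POSIX output format of `df -P`.

```bash
# Print every filesystem whose usage is at or above a threshold (percent).
# Reads `df -P` formatted text on stdin so it can also be tried on saved output.
disks_over() {
  awk -v limit="$1" 'NR > 1 {
    use = $5                 # Capacity column, e.g. "91%"
    sub(/%/, "", use)
    if (use + 0 >= limit) print $6 " is " use "% full"
  }'
}

# Typical use on a live host (85 is an arbitrary example threshold):
df -P | disks_over 85
```

An empty result means nothing crossed the threshold, in which case `docker system df` is the next place to look.
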