kitestacks-homelab/homelab-mastery/build-guide/without-ai/07-troubleshooting.md
kenpat 1e8319ee75 docs: comprehensive homelab-mastery rewrite with full build guides
Complete documentation suite for KiteStacks covering all 11 services across
2-host active-active architecture. Includes beginner track (with AI, 8 files)
and advanced track (without AI, 7 files) with time estimates, real troubleshooting
cases, and command-by-command explanations. Updates certifications roadmap to
reflect July 7 2026 A+ Core 2 exam goal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 01:08:43 -05:00


Without AI — Part 7: Troubleshooting

Track: Advanced (No AI)
Time for this section: Ongoing (this is a reference you return to)

Troubleshooting is not a step you complete — it is a skill you build over time. This section teaches the methodology and documents the real issues encountered building KiteStacks, with full explanations of how each was diagnosed and fixed.


The Troubleshooting Mindset

Before running any command, form a hypothesis. Before Googling, read the error.

The diagnostic loop:

  1. Observe — what exactly is failing? URL? Error message? Which service?
  2. Hypothesize — what could cause this? List 2–3 possibilities
  3. Test — run the simplest command to prove or disprove your hypothesis
  4. Narrow — eliminate possibilities until one remains
  5. Fix — apply the fix
  6. Verify — confirm the fix worked
  7. Document — write what broke and what fixed it

The most common mistake: jumping to step 5 without completing steps 2–4.


Diagnostic Commands to Know Cold

# Container status
docker ps                          # All running containers
docker ps -a                       # All containers (including stopped)
docker inspect <container>         # Full container config and state

# Logs
docker logs <container>            # All logs
docker logs <container> --tail 50  # Last 50 lines
docker logs <container> -f         # Follow live
docker logs <container> --since 5m # Last 5 minutes

# Network
docker exec <container> curl -s http://other-container:port/health
docker exec <container> nslookup other-container
docker exec <container> ss -tlnp
docker network inspect kitestacks

# Disk and resources
docker system df                   # Docker disk usage
docker stats --no-stream           # One-shot resource usage
df -h                              # Host disk usage
free -h                            # Host RAM

# DNS and HTTP from host
curl -sv https://grafana.kitestacks.com  # -v = verbose (shows headers, TLS)
dig grafana.kitestacks.com               # DNS lookup

Real Issues Encountered Building KiteStacks

Issue 1 — SSO: invalid_grant on OAuth Login (50% of the time)

Symptom: Clicking "Sign in with Authentik" in Grafana, Kavita, etc. sometimes worked and sometimes showed invalid_grant: The provided authorization grant is invalid. Happened roughly 50% of the time. No correlation to time of day.

Observation: The error appeared specifically after the authorization code redirect, during the token exchange step.

Hypothesis:

  1. Authentik configuration wrong (but then it would fail 100% of the time)
  2. Network issue (but HTTP 400 means request reached Authentik)
  3. The authorization code created by /authorize is not found during the /token exchange

Testing:

# Check whether both Authentik instances see the same database.
# Run the query once per host, pointing -h at that host's Postgres, then compare the counts.
docker exec authentik psql -U authentik -h "$KSCLOUD1_IP" -c "SELECT count(*) FROM authentik_providers_oauth2_authorizationcode;"
# Monk's Authentik: count = 3
# kscloud1's Authentik: count = 1
# Different! /authorize stored the code in one DB, /token looked in the other.

Root cause: Two Authentik instances, two separate Postgres databases. Cloudflare routes /authorize and /application/o/token/ independently — they can hit different hosts.

Fix: Migrate both Authentik instances to a single shared Postgres, hosted on kscloud1, bound to the Tailscale IP only.

# 1. Dump monk's Authentik DB
docker exec authentik-postgres pg_dump -U authentik authentik --clean --if-exists \
  > /tmp/authentik_dump.sql

# 2. Restore to kscloud1's new shared Postgres
scp /tmp/authentik_dump.sql kenpat@100.123.x.x:/tmp/
ssh kenpat@100.123.x.x "docker exec -i authentik-postgres psql -U authentik -d authentik \
  < /tmp/authentik_dump.sql"

# 3. Edit monk's Authentik .env so Postgres and Redis point to kscloud1's Tailscale IP
AUTHENTIK_POSTGRESQL__HOST=100.123.x.x
AUTHENTIK_REDIS__HOST=100.123.x.x

# 4. Stop monk's local Postgres and Redis
docker stop authentik-postgres authentik-redis   # Stop, don't delete (keep data as backup)

# 5. Restart monk's Authentik
docker compose up -d

Verification: Logged in from a browser with DevTools open, watching Network tab. /authorize returned 302 with a code. /token returned 200 with a JWT. Done.

Lesson: Stateful services with active-active routing need shared state. Any session, token, or code stored in one instance's database is invisible to the other instance.
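
The count comparison from the testing step can be wrapped in a small helper so it is easy to rerun after the migration. This is a minimal sketch, not the method used in the original diagnosis: `compare_counts`, `MONK_DB_HOST`, and `KSCLOUD1_DB_HOST` are hypothetical names, and the usage comment assumes `psql` can reach both databases from wherever it runs.

```shell
#!/usr/bin/env bash
# Compare a row count taken from each host's database.
# Prints "match" and returns 0 when equal; prints "MISMATCH" and returns 1 otherwise.
compare_counts() {
  local a="$1" b="$2"
  if [ "$a" = "$b" ]; then
    echo "match ($a)"
  else
    echo "MISMATCH: first=$a second=$b"
    return 1
  fi
}

# Hypothetical usage (both host variables are placeholders, not from the guide):
#   q='SELECT count(*) FROM authentik_providers_oauth2_authorizationcode;'
#   a=$(psql -U authentik -h "$MONK_DB_HOST" -tAc "$q")
#   b=$(psql -U authentik -h "$KSCLOUD1_DB_HOST" -tAc "$q")
#   compare_counts "$a" "$b"
```

Once both instances share one Postgres, the two queries hit the same table, so a persistent mismatch would suggest one instance is still pointed at its old database.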


Issue 2 — Phantom Third Connector in Cloudflare Dashboard

Symptom: Cloudflare Tunnel showed 3 active connectors when only 2 were expected (monk + kscloud1). Which was the third?

Investigation:

# Check running Docker containers for cloudflared
docker ps | grep cloudflared
# Shows: one cloudflared container — expected

# Check for non-Docker cloudflared processes
ps aux | grep cloudflared
# Shows: TWO processes!
# /usr/bin/cloudflared (system-installed, running as a systemd service)
# /usr/local/bin/cloudflared (Docker container)

Root cause: A cloudflared systemd service was installed separately from the Docker container. Both connected to the same tunnel with the same token, registering as separate connectors.

# Verify the systemd service
sudo systemctl status cloudflared

# Fix: disable the systemd service
sudo systemctl stop cloudflared
sudo systemctl disable cloudflared

# Verify only one connector process remains
ps aux | grep cloudflared

Verification: Cloudflare dashboard refreshed to show 2 connectors within 30 seconds.

Lesson: A service installed via package manager AND in Docker is a recipe for duplicate processes. Check both docker ps and ps aux when troubleshooting unexpected behavior.
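
The process check can be made repeatable with a small counter that reads `ps` output on stdin. This is a sketch under simple assumptions: `count_procs` is a hypothetical helper name, and it drops any line containing "grep" so the search itself is not counted.

```shell
#!/usr/bin/env bash
# Count lines on stdin that mention a process name, ignoring grep's own entries.
count_procs() {
  local name="$1"
  grep -v 'grep' | grep -c -- "$name"
}

# Usage on a host:
#   ps aux | count_procs cloudflared
# Compare the result with the number of cloudflared containers:
#   docker ps --filter name=cloudflared --quiet | wc -l
```

If the process count is higher than the container count, something outside Docker (such as a systemd unit) is probably running its own copy.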


Issue 3 — Karakeep SSO "Redirect URI Error"

Symptom: After configuring Authentik OAuth2 for Karakeep, clicking "Sign in" showed "Redirect URI Error: The provided redirect_uri does not match any of the allowed redirect URIs" from Authentik.

Investigation:

# Check what redirect URI was used in the OAuth2 request
# Read from Authentik's logs
docker logs authentik --tail 100 | grep "redirect_uri"
# Shows: redirect_uri=https://links.kitestacks.com/api/auth/callback/authentik

Root cause: Karakeep uses NextAuth.js internally with provider ID custom. NextAuth constructs callback URLs as /api/auth/callback/<provider-id>. The provider ID is custom, not authentik.

So the callback Karakeep uses is /api/auth/callback/custom, while the Authentik provider had been registered with /api/auth/callback/authentik.

Fix:

# Update Authentik's OAuth2 provider for Karakeep in the shared Postgres
docker exec -it authentik-postgres psql -U authentik -d authentik

BEGIN;
UPDATE authentik_providers_oauth2_oauth2provider
SET _redirect_uris = '["https://links.kitestacks.com/api/auth/callback/custom"]'
WHERE name = 'Karakeep';
COMMIT;

-- Verify
SELECT name, _redirect_uris FROM authentik_providers_oauth2_oauth2provider WHERE name = 'Karakeep';
\q

Restart Authentik on both hosts:

docker compose restart authentik authentik-worker
# Wait for healthy before testing

Lesson: When you get a redirect URI mismatch, always check what URI the APP is actually sending — not what you think it should send. The app's logs or browser DevTools Network tab show the actual request.
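
To see exactly which URI the app sends, copy the authorize request URL from the DevTools Network tab and pull out its `redirect_uri` parameter. The helper below is a minimal bash sketch for that: `extract_redirect_uri` is a hypothetical name, and it assumes the parameter is percent-encoded in the query string in the usual way.

```shell
#!/usr/bin/env bash
# Print the percent-decoded redirect_uri query parameter of an authorize URL.
# Returns 1 if the URL carries no redirect_uri parameter.
extract_redirect_uri() {
  local url="$1" query pair value
  local -a pairs
  query="${url#*\?}"                      # everything after the first "?"
  IFS='&' read -r -a pairs <<< "$query"   # split the query into key=value pairs
  for pair in "${pairs[@]}"; do
    case "$pair" in
      redirect_uri=*)
        value="${pair#redirect_uri=}"
        value="${value//+/ }"             # "+" encodes a space in query strings
        printf '%b\n' "${value//%/\\x}"   # turn %XX into \xXX and let printf decode it
        return 0
        ;;
    esac
  done
  return 1
}

# Usage: paste the URL copied from DevTools
#   extract_redirect_uri 'https://auth.kitestacks.com/application/o/authorize/?client_id=...&redirect_uri=...'
```

The decoded value has to match one of the provider's allowed redirect URIs exactly, so compare it character by character with what is registered in Authentik.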


Issue 4 — Kavita OIDC Config Gets Wiped on Restart

Symptom: Configured Kavita's OIDC settings by editing kavita.db directly (using sqlite3). Settings looked correct in the DB. After docker compose restart kavita, the OIDC config was reset to empty/disabled.

Investigation:

# Check the ServerSetting row before and after restart
docker exec -it kavita sqlite3 /kavita/config/kavita.db \
  "SELECT Value, RowVersion FROM ServerSetting WHERE \"Key\"=40;"
# Before restart: {"enabled":true,"authority":"...","clientId":"kavita",...}, RowVersion=8
# After restart: {"enabled":false,"authority":"","clientId":"","clientSecret":"",...}, RowVersion=10
# RowVersion incremented by 2 — Kavita wrote to the row twice during startup

Root cause: Kavita validates and resets ServerSetting rows during startup from its own defaults. Any value that does not pass Kavita's internal validation (including OIDC config with the wrong format) gets reset to defaults. Direct SQL writes do not go through Kavita's validation pipeline, so they get overwritten.

Fix: Use Kavita's own Settings UI via SSH port forwarding to bypass Cloudflare and reach kscloud1's Kavita directly:

# Forward kscloud1's Kavita port to localhost
ssh -L 5099:localhost:5000 -i ~/.ssh/id_ed25519_kscloud1 kenpat@100.123.x.x -N &
# Now visit http://localhost:5099 in browser
# Log in with your Kavita credentials
# Settings → OIDC → configure there
# Click Save → changes survive restart

Verification: After saving in the UI, checked RowVersion was not incrementing on restart.

Lesson: Do not write directly to application databases unless you know the app does not reinitialize those values on startup. Use the application's own APIs or UI.

Critical detail: The Authority URL MUST have a trailing slash: https://auth.kitestacks.com/application/o/kavita/. Without it you get an "issuer does not match" error, because Authentik's openid-configuration returns an issuer field that includes the trailing slash, and Kavita compares them exactly.
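
The trailing-slash requirement can be checked before touching Kavita by comparing the Authority you plan to enter with the issuer in the discovery document. This is a sketch only: `issuer_from_discovery` and `issuer_matches` are hypothetical helper names, and the `sed` extraction assumes the issuer value contains no escaped quotes.

```shell
#!/usr/bin/env bash
# Extract the "issuer" value from an OpenID discovery document read on stdin.
# Uses sed instead of jq to avoid an extra dependency.
issuer_from_discovery() {
  sed -n 's/.*"issuer"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed only when the configured Authority equals the issuer byte for byte.
issuer_matches() {
  [ "$1" = "$2" ]
}

# Usage:
#   authority='https://auth.kitestacks.com/application/o/kavita/'
#   iss=$(curl -s "${authority}.well-known/openid-configuration" | issuer_from_discovery)
#   issuer_matches "$authority" "$iss" && echo "issuer OK" || echo "issuer differs: $iss"
```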


Issue 5 — SSO Login Fails After monk Reconnects

Symptom: When monk went offline and came back, SSO logins failed for 5–10 minutes with invalid_grant, then started working again.

Investigation: Timeline reconstruction:

  • T+0: monk goes offline (power or network)
  • T+0: kscloud1 handles all traffic solo — SSO works fine, codes stored in shared DB
  • T+5min: monk comes back online, cloudflared reconnects
  • T+5min to T+8min: monk's Authentik is still starting (container startup takes ~3–4 min)
  • During this window: Cloudflare routes some /authorize to kscloud1, some /token to monk
  • Monk's Authentik hasn't finished starting — it responds with errors or invalid state

Root cause: The OAuth2 authorization code has a 1-minute TTL (default). Monk's Authentik takes 3–5 minutes to fully start. During startup, Cloudflare is already routing traffic to monk's cloudflared (which is running), but monk's Authentik is not ready.

Codes created on kscloud1 expire before monk's Authentik is healthy enough to serve them.

Fix: Increase the OAuth2 code TTL from 1 minute to 10 minutes:

docker exec -it authentik-postgres psql -U authentik -d authentik

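-- Authentik stores this validity as a duration string (for example 'minutes=10').
-- With no WHERE clause this applies to every OAuth2 provider.
UPDATE authentik_providers_oauth2_oauth2provider
SET access_code_validity = 'minutes=10';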

\q

Restart both Authentik instances. Now codes have a 10-minute window — enough for monk to finish starting before the code expires.

Alternative/additional fix: Add a health check to monk's cloudflared or Authentik that keeps cloudflared from accepting traffic until Authentik is healthy.
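
One way to approximate that gating is to start cloudflared on monk only after Authentik answers a readiness probe. The retry loop below is a minimal sketch: `wait_until` is a hypothetical helper, and the probe URL and port in the usage comment are assumptions about a typical Authentik deployment that you should verify against your own setup.

```shell
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: wait_until <max_attempts> <delay_seconds> <command...>
wait_until() {
  local max="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$max" ]; do
    if "$@"; then
      return 0            # command succeeded, stop waiting
    fi
    i=$((i + 1))
    if [ "$i" -le "$max" ]; then
      sleep "$delay"      # pause before the next attempt
    fi
  done
  return 1                # every attempt failed
}

# Hypothetical usage on monk (URL and port are assumptions):
#   wait_until 60 5 curl -fs -o /dev/null http://localhost:9000/-/health/ready/ \
#     && docker compose up -d cloudflared
```

With 60 attempts and a 5-second delay the loop waits up to about five minutes, which roughly matches the startup time described above.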


Issue 6 — kscloud1 SSH Key Auth Broken After Long Absence

Symptom: After not connecting to kscloud1 for several weeks, ssh kenpat@kscloud1 returned "Permission denied (publickey)".

Investigation:

ssh -v -i ~/.ssh/id_ed25519_kscloud1 kenpat@100.123.x.x
# Verbose output showed: offered key was not accepted
# No other errors — key was being offered but rejected

Root cause: The authorized_keys file on kscloud1 had somehow been reset or corrupted (possibly from a VPS maintenance event or snapshot restore).

Fix: Use Hetzner's console (web-based terminal that does not require SSH):

  1. Hetzner dashboard → Server → Console
  2. Log in as root (reset root password via Hetzner UI if needed)
  3. Restore the public key:
# On kscloud1 via Hetzner console
mkdir -p /home/kenpat/.ssh
cat >> /home/kenpat/.ssh/authorized_keys << 'EOF'
ssh-ed25519 AAAA... your-public-key-here
EOF
chmod 700 /home/kenpat/.ssh
chmod 600 /home/kenpat/.ssh/authorized_keys
chown -R kenpat:kenpat /home/kenpat/.ssh

Lesson: Always keep your public key backed up. Cloud providers (Hetzner, AWS, DigitalOcean) all have web-based console access for exactly this situation. Never rely only on SSH for access to a remote server.
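
After restoring the key, it is worth confirming that the permissions match what the fix set, since sshd's strict mode checks can reject keys when the directory or file is too permissive. This is a minimal sketch, assuming GNU `stat` (as found on typical Linux hosts); `check_ssh_perms` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Report whether an .ssh directory and its authorized_keys file
# have the conventional 700 / 600 modes.
check_ssh_perms() {
  local dir="$1" dir_mode file_mode
  [ -f "$dir/authorized_keys" ] || { echo "missing: $dir/authorized_keys"; return 1; }
  dir_mode=$(stat -c '%a' "$dir")                    # octal mode of the directory
  file_mode=$(stat -c '%a' "$dir/authorized_keys")   # octal mode of the key file
  if [ "$dir_mode" = "700" ] && [ "$file_mode" = "600" ]; then
    echo "ok"
  else
    echo "bad modes: dir=$dir_mode file=$file_mode"
    return 1
  fi
}

# Usage on kscloud1:
#   check_ssh_perms /home/kenpat/.ssh
```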


Issue 7 — ufw Blocking Docker Container to Host Port

Symptom: The portal homepage on kscloud1 showed "0%" and "Offline" for the System Status widget. On monk it showed real values.

Investigation:

# Test the metrics API directly from inside the homepage container on kscloud1
docker exec homepage-backup curl -s http://host.docker.internal:8000/api/metrics
# No response after timeout

# Test from host directly
curl -s http://localhost:8000/api/metrics
# Returns real metrics immediately

# Check ufw rules
sudo ufw status verbose
# default deny incoming — no specific rule for port 8000

Root cause: The kitestacks-metrics-api container runs with network_mode: host. When homepage-backup calls host.docker.internal:8000, the kernel sees the source IP as the Docker bridge network (172.x.x.x). ufw's default deny incoming blocks it.

Docker's iptables bypass (which lets published ports work despite ufw) does not apply here: this is container-to-host traffic that arrives on the host's INPUT chain, where ufw's rules apply, not traffic forwarded to a container's published port.

Fix:

sudo ufw allow from 172.16.0.0/12 to any port 8000 proto tcp
sudo ufw status verbose   # Verify rule added

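172.16.0.0/12 spans 172.16.0.0 through 172.31.255.255, which covers Docker's default 172.x bridge subnets. If a network was created with a different range, check its subnet with docker network inspect before relying on this rule.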

Verification:

docker exec homepage-backup curl -s http://host.docker.internal:8000/api/metrics
# Now returns: {"cpu_percent": 4.2, "ram_percent": 71.3, ...}
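
To confirm that a container's address is actually covered by the ufw rule, the CIDR match can be checked with a little bit arithmetic. This is a bash sketch with hypothetical helper names (`ip_to_int`, `in_cidr`); it assumes plain dotted-quad IPv4 input without leading zeros.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 when the address in $1 falls inside the CIDR block in $2 (e.g. 172.16.0.0/12).
in_cidr() {
  local ip="$1" net="${2%/*}" bits="${2#*/}" mask
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))   # network mask for the prefix length
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Usage: look up the bridge subnet, then test an address from it
#   docker network inspect -f '{{range .IPAM.Config}}{{.Subnet}}{{end}}' kitestacks
#   in_cidr 172.18.0.5 172.16.0.0/12 && echo "covered by the ufw rule"
```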

General Troubleshooting Cheatsheet

| Symptom | First Commands to Run |
| --- | --- |
| Container won't start | `docker logs <container>` |
| Container starts then crashes | `docker logs <container> --tail 30` |
| Can't reach service from browser | `docker exec cloudflared curl -s http://<service>:<port>` |
| SSL/TLS error in browser | `curl -sv https://yourdomain.com` (check Cloudflare is resolving) |
| SSO failing with invalid_grant | Check both Authentik instances point to same shared Postgres |
| Database error | Check data directory permissions: `ls -la ./data/` |
| Port already in use | `sudo ss -tlnp` |
| Out of disk space | `df -h` and `docker system df` |
| Out of RAM | `free -h` and `docker stats --no-stream` |
| Can't ping between containers | `docker network inspect kitestacks` |
| Forgejo 502 | `docker logs forgejo` (likely DB connection issue) |
| Authentik won't start | Check it can reach `$KSCLOUD1_TAILSCALE:5432` (Tailscale up?) |