kitestacks-homelab/homelab-mastery/build-guide/without-ai/07-troubleshooting.md
kenpat 1e8319ee75 docs: comprehensive homelab-mastery rewrite with full build guides
Complete documentation suite for KiteStacks covering all 11 services across
2-host active-active architecture. Includes beginner track (with AI, 8 files)
and advanced track (without AI, 7 files) with time estimates, real troubleshooting
cases, and command-by-command explanations. Updates certifications roadmap to
reflect July 7 2026 A+ Core 2 exam goal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 01:08:43 -05:00


# Without AI — Part 7: Troubleshooting
**Track:** Advanced (No AI)
**Time for this section:** Ongoing (this is a reference you return to)
Troubleshooting is not a step you complete — it is a skill you build over time.
This section teaches the methodology and documents the real issues encountered
building KiteStacks, with full explanations of how each was diagnosed and fixed.
---
## The Troubleshooting Mindset
Before running any command, form a hypothesis. Before Googling, read the error.
**The diagnostic loop:**
1. **Observe** — what exactly is failing? URL? Error message? Which service?
2. **Hypothesize** — what could cause this? List 2-3 possibilities
3. **Test** — run the simplest command to prove or disprove your hypothesis
4. **Narrow** — eliminate possibilities until one remains
5. **Fix** — apply the fix
6. **Verify** — confirm the fix worked
7. **Document** — write what broke and what fixed it
The most common mistake: jumping to step 5 without completing steps 2-4.
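
Step 7 is the one most often skipped, so it helps to make it cheap. A minimal sketch of a helper that appends a dated entry to a notes file; the function name `log_incident` and the entry layout are assumptions for illustration, not something from the KiteStacks repo.

```bash
#!/usr/bin/env bash
# log_incident FILE SYMPTOM CAUSE FIX
# Hypothetical helper: appends one dated Markdown entry per incident.
log_incident() {
  local file symptom cause fix
  file=$1; symptom=$2; cause=$3; fix=$4
  {
    printf '### %s\n' "$(date +%Y-%m-%d)"
    printf '%s\n' "- Symptom: $symptom"
    printf '%s\n' "- Root cause: $cause"
    printf '%s\n' "- Fix: $fix"
  } >> "$file"
}

# Usage (the file path is only an example):
# log_incident ~/incidents.md "invalid_grant on login" "two separate DBs" "shared Postgres"
```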
---
## Diagnostic Commands to Know Cold
```bash
# Container status
docker ps # All running containers
docker ps -a # All containers (including stopped)
docker inspect <container> # Full container config and state
# Logs
docker logs <container> # All logs
docker logs <container> --tail 50 # Last 50 lines
docker logs <container> -f # Follow live
docker logs <container> --since 5m # Last 5 minutes
# Network
docker exec <container> curl -s http://other-container:port/health
docker exec <container> nslookup other-container
docker exec <container> ss -tlnp
docker network inspect kitestacks
# Disk and resources
docker system df # Docker disk usage
docker stats --no-stream # One-shot resource usage
df -h # Host disk usage
free -h # Host RAM
# DNS and HTTP from host
curl -sv https://grafana.kitestacks.com # -v = verbose (shows headers, TLS)
dig grafana.kitestacks.com # DNS lookup
```
---
## Real Issues Encountered Building KiteStacks
### Issue 1 — SSO: `invalid_grant` on OAuth Login (50% of the time)
**Symptom:** Clicking "Sign in with Authentik" in Grafana, Kavita, etc. sometimes
worked and sometimes showed `invalid_grant: The provided authorization grant is invalid`.
Happened roughly 50% of the time. No correlation to time of day.
**Observation:** The error appeared specifically after the authorization code redirect,
during the token exchange step.
**Hypothesis:**
1. Authentik configuration wrong (but then it would fail 100% of the time)
2. Network issue (but HTTP 400 means request reached Authentik)
3. The authorization code created at `/authorize` (step 1) is not found during the token exchange (step 2)
**Testing:**
```bash
# Check whether both Authentik instances see the same database by running the
# same count against each instance's Postgres (shown here for kscloud1)
docker exec authentik psql -U authentik -h "$KSCLOUD1_IP" -c "SELECT count(*) FROM authentik_providers_oauth2_authorizationcode;"
# Monk's Authentik: count = 3
# kscloud1's Authentik: count = 1
# Different! Step 1 created the code in one DB, step 2 looked in the other.
```
**Root cause:** Two Authentik instances, two separate Postgres databases. Cloudflare
routes `/authorize` and `/application/o/token/` independently — they can hit different hosts.
**Fix:** Migrate both Authentik instances to a single shared Postgres, hosted on kscloud1,
bound to the Tailscale IP only.
```bash
# 1. Dump monk's Authentik DB
docker exec authentik-postgres pg_dump -U authentik authentik --clean --if-exists \
> /tmp/authentik_dump.sql
# 2. Restore to kscloud1's new shared Postgres
scp /tmp/authentik_dump.sql kenpat@100.123.x.x:/tmp/
ssh kenpat@100.123.x.x "docker exec -i authentik-postgres psql -U authentik -d authentik \
< /tmp/authentik_dump.sql"
# 3. Update monk's Authentik .env to point to kscloud1's Tailscale IP.
#    These two lines belong in the .env file; they are not commands to run:
AUTHENTIK_POSTGRESQL__HOST=100.123.x.x
AUTHENTIK_REDIS__HOST=100.123.x.x
# 4. Stop monk's local Postgres and Redis (stop, don't delete: the data stays as a backup)
docker stop authentik-postgres authentik-redis
# 5. Restart monk's Authentik
docker compose up -d
```
**Verification:** Logged in from a browser with DevTools open, watching Network tab.
`/authorize` returned 302 with a code. `/token` returned 200 with a JWT. Done.
**Lesson:** Stateful services with active-active routing need shared state. Any session,
token, or code stored in one instance's database is invisible to the other instance.
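
A quick way to catch this class of problem is to compare the database host each instance is configured with. The sketch below assumes each host's Authentik `.env` has been copied somewhere readable and that the host is set through `AUTHENTIK_POSTGRESQL__HOST` in that file; the function names and file paths are hypothetical.

```bash
#!/usr/bin/env bash
# env_value FILE KEY: print the value of KEY from a KEY=VALUE env file
# (last assignment wins if the key appears more than once).
env_value() {
  sed -n "s/^$2=//p" "$1" | tail -n 1
}

# same_db_host FILE_A FILE_B: succeed only if both files name the same,
# non-empty AUTHENTIK_POSTGRESQL__HOST.
same_db_host() {
  local a b
  a=$(env_value "$1" AUTHENTIK_POSTGRESQL__HOST)
  b=$(env_value "$2" AUTHENTIK_POSTGRESQL__HOST)
  if [ -n "$a" ] && [ "$a" = "$b" ]; then
    echo "OK: both instances use Postgres at $a"
  else
    echo "MISMATCH: '$a' vs '$b'"
    return 1
  fi
}

# Usage (example paths): same_db_host ./monk.env ./kscloud1.env
# A missing key is reported as a mismatch, which is the safe default here.
```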
---
### Issue 2 — Phantom Third Connector in Cloudflare Dashboard
**Symptom:** Cloudflare Tunnel showed 3 active connectors when only 2 were expected
(monk + kscloud1). Which was the third?
**Investigation:**
```bash
# Check running Docker containers for cloudflared
docker ps | grep cloudflared
# Shows: one cloudflared container — expected
# Check for non-Docker cloudflared processes
ps aux | grep cloudflared
# Shows: TWO processes!
# /usr/bin/cloudflared (system-installed, running as a systemd service)
# /usr/local/bin/cloudflared (Docker container)
```
**Root cause:** A cloudflared systemd service was installed separately from the Docker
container. Both connected to the same tunnel with the same token, registering as separate connectors.
**Fix:** Stop and disable the systemd service so that only the Docker container connects:
```bash
# Verify the systemd service
sudo systemctl status cloudflared
# Fix: disable the systemd service
sudo systemctl stop cloudflared
sudo systemctl disable cloudflared
# Verify only one connector process remains
ps aux | grep cloudflared
```
**Verification:** Cloudflare dashboard refreshed to show 2 connectors within 30 seconds.
**Lesson:** A service installed via package manager AND in Docker is a recipe for duplicate
processes. Check both `docker ps` and `ps aux` when troubleshooting unexpected behavior.
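
The duplicate-process check can be turned into a small counter. This is a sketch, assuming a Linux host with procps `ps`; the function name `count_named` is made up for illustration. Matching whole command names also avoids counting the `grep` process itself, which plain `ps aux | grep cloudflared` does.

```bash
#!/usr/bin/env bash
# count_named NAME: read one command name per line on stdin and print how many
# lines equal NAME exactly.
count_named() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps that from
  # looking like a failure to the caller.
  grep -cxF -- "$1" || true
}

# Usage: containerized processes also appear in the host's process list,
# so one cloudflared per host is the expected result.
# ps -eo comm= | count_named cloudflared
```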
---
### Issue 3 — Karakeep SSO "Redirect URI Error"
**Symptom:** After configuring Authentik OAuth2 for Karakeep, clicking "Sign in"
showed "Redirect URI Error: The provided redirect_uri does not match any of the
allowed redirect URIs" from Authentik.
**Investigation:**
```bash
# Check which redirect URI Karakeep actually sent in the OAuth2 request,
# as recorded in Authentik's logs
docker logs authentik --tail 100 | grep "redirect_uri"
# Shows: redirect_uri=https://links.kitestacks.com/api/auth/callback/custom
```
**Root cause:** Karakeep uses NextAuth.js internally with provider ID `custom`.
NextAuth constructs callback URLs as `/api/auth/callback/<provider-id>`, so Karakeep
sends `/api/auth/callback/custom`. The Authentik provider had been configured with
`/api/auth/callback/authentik`, which does not match.
**Fix:**
```bash
# Update Authentik's OAuth2 provider for Karakeep in the shared Postgres
docker exec -it authentik-postgres psql -U authentik -d authentik
BEGIN;
UPDATE authentik_providers_oauth2_oauth2provider
SET _redirect_uris = '["https://links.kitestacks.com/api/auth/callback/custom"]'
WHERE name = 'Karakeep';
COMMIT;
-- Verify
SELECT name, _redirect_uris FROM authentik_providers_oauth2_oauth2provider WHERE name = 'Karakeep';
\q
```
Restart Authentik on both hosts:
```bash
docker compose restart authentik authentik-worker
# Wait for healthy before testing
```
**Lesson:** When you get a redirect URI mismatch, always check what URI the APP is
actually sending — not what you think it should send. The app's logs or browser DevTools
Network tab show the actual request.
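
The mismatch is easy to reproduce on paper. The sketch below builds the callback URL from the `/api/auth/callback/<provider-id>` pattern and compares it to a list of allowed URIs; it assumes strict, exact-string matching on the Authentik side, and both function names are hypothetical.

```bash
#!/usr/bin/env bash
# callback_url BASE_URL PROVIDER_ID
# Builds <base>/api/auth/callback/<provider-id>; a trailing slash on BASE_URL is dropped.
callback_url() {
  printf '%s/api/auth/callback/%s\n' "${1%/}" "$2"
}

# uri_allowed URI ALLOWED...: succeed if URI equals one of the allowed values exactly.
uri_allowed() {
  local uri allowed
  uri=$1
  shift
  for allowed in "$@"; do
    [ "$uri" = "$allowed" ] && return 0
  done
  return 1
}

# Usage:
# sent=$(callback_url https://links.kitestacks.com custom)
# uri_allowed "$sent" "https://links.kitestacks.com/api/auth/callback/authentik" \
#   || echo "redirect URI mismatch: app sends $sent"
```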
---
### Issue 4 — Kavita OIDC Config Gets Wiped on Restart
**Symptom:** Configured Kavita's OIDC settings by editing `kavita.db` directly
(using sqlite3). Settings looked correct in the DB. After `docker compose restart kavita`,
the OIDC config was reset to empty/disabled.
**Investigation:**
```bash
# Check the ServerSetting row before and after restart
docker exec -it kavita sqlite3 /kavita/config/kavita.db \
"SELECT Value, RowVersion FROM ServerSetting WHERE \"Key\"=40;"
# Before restart: {"enabled":true,"authority":"...","clientId":"kavita",...}, RowVersion=8
# After restart: {"enabled":false,"authority":"","clientId":"","clientSecret":"",...}, RowVersion=10
# RowVersion incremented by 2 — Kavita wrote to the row twice during startup
```
**Root cause:** Kavita validates and resets `ServerSetting` rows during startup from
its own defaults. Any value that does not pass Kavita's internal validation (including
OIDC config with the wrong format) gets reset to defaults. Direct SQL writes do not
go through Kavita's validation pipeline, so they get overwritten.
**Fix:** Use Kavita's own Settings UI via SSH port forwarding to bypass Cloudflare
and reach kscloud1's Kavita directly:
```bash
# Forward kscloud1's Kavita port to localhost
ssh -L 5099:localhost:5000 -i ~/.ssh/id_ed25519_kscloud1 kenpat@100.123.x.x -N &
# Now visit http://localhost:5099 in browser
# Log in with your Kavita credentials
# Settings → OIDC → configure there
# Click Save → changes survive restart
```
**Verification:** After saving in the UI, restarted Kavita and confirmed that the OIDC values persisted and `RowVersion` no longer incremented.
**Lesson:** Do not write directly to application databases unless you know the app does not
reinitialize those values on startup. Use the application's own APIs or UI.
**Critical detail:** The Authority URL MUST have a trailing slash:
`https://auth.kitestacks.com/application/o/kavita/`
Without it: "issuer does not match" error, because Authentik's `openid-configuration`
returns an `issuer` field that includes the trailing slash, and Kavita compares them exactly.
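
Because the comparison is exact, it is worth normalizing the Authority URL before pasting it into Kavita. A small sketch with hypothetical helper names; the discovery URL in the usage comment follows the standard OIDC `/.well-known/openid-configuration` path under the Authority URL.

```bash
#!/usr/bin/env bash
# with_trailing_slash URL: print URL with exactly one trailing slash.
with_trailing_slash() {
  local url
  url=$1
  while [ "${url%/}" != "$url" ]; do
    url=${url%/}
  done
  printf '%s/\n' "$url"
}

# issuer_matches AUTHORITY ISSUER: exact string comparison, no normalization.
issuer_matches() {
  [ "$1" = "$2" ]
}

# Usage: compare the configured value against the issuer Authentik advertises.
# curl -s https://auth.kitestacks.com/application/o/kavita/.well-known/openid-configuration
# issuer_matches "$(with_trailing_slash "$AUTHORITY")" "$ISSUER_FROM_DISCOVERY"
```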
---
### Issue 5 — SSO Login Fails After monk Reconnects
**Symptom:** When monk went offline and came back, SSO logins failed for 5-10 minutes
with `invalid_grant`, then started working again.
**Investigation:**
Timeline reconstruction:
- T+0: monk goes offline (power or network)
- T+0: kscloud1 handles all traffic solo — SSO works fine, codes stored in shared DB
- T+5min: monk comes back online, cloudflared reconnects
- T+5min to T+8min: monk's Authentik is still starting (container startup takes ~3-4 min)
- During this window: Cloudflare routes some `/authorize` to kscloud1, some `/token` to monk
- Monk's Authentik hasn't finished starting — it responds with errors or invalid state
**Root cause:** The OAuth2 authorization code has a 1-minute TTL (default). Monk's Authentik
takes 3-5 minutes to fully start. During startup, Cloudflare is already routing traffic to
monk's cloudflared (which is running), but monk's Authentik is not ready.
Codes created on kscloud1 expire before monk's Authentik is healthy enough to serve them.
**Fix:** Increase the OAuth2 code TTL from 1 minute to 10 minutes:
```bash
docker exec -it authentik-postgres psql -U authentik -d authentik
-- No WHERE clause: this changes the code lifetime for every OAuth2 provider
UPDATE authentik_providers_oauth2_oauth2provider
SET access_code_validity = '00:10:00';
\q
```
Restart both Authentik instances. Now codes have a 10-minute window — enough for monk
to finish starting before the code expires.
**Alternative/additional fix:** Add a health check to monk's cloudflared or Authentik
that keeps cloudflared from accepting traffic until Authentik is healthy.
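
One way to approximate that gate without touching the compose file is a retry loop that starts cloudflared only after Authentik answers a health probe. This is a sketch under assumptions: `wait_until` is a hypothetical helper, the health URL and port in the usage comment must be checked against your Authentik version, and cloudflared on monk must not be set to start on its own.

```bash
#!/usr/bin/env bash
# wait_until TRIES DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or TRIES attempts are used up.
wait_until() {
  local tries delay i
  tries=$1; delay=$2; i=0
  shift 2
  while [ "$i" -lt "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage on monk (URL and port are assumptions, see above):
# wait_until 60 5 curl -fsS http://localhost:9000/-/health/ready/ \
#   && docker start cloudflared
```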
---
### Issue 6 — kscloud1 SSH Key Auth Broken After Long Absence
**Symptom:** After not connecting to kscloud1 for several weeks, `ssh kenpat@kscloud1`
returned "Permission denied (publickey)".
**Investigation:**
```bash
ssh -v -i ~/.ssh/id_ed25519_kscloud1 kenpat@100.123.x.x
# Verbose output showed: offered key was not accepted
# No other errors — key was being offered but rejected
```
**Root cause:** The `authorized_keys` file on kscloud1 had somehow been reset or corrupted
(possibly from a VPS maintenance event or snapshot restore).
**Fix:** Use Hetzner's console (web-based terminal that does not require SSH):
1. Hetzner dashboard → Server → Console
2. Log in as root (reset root password via Hetzner UI if needed)
3. Restore the public key:
```bash
# On kscloud1 via Hetzner console
mkdir -p /home/kenpat/.ssh
cat >> /home/kenpat/.ssh/authorized_keys << 'EOF'
ssh-ed25519 AAAA... your-public-key-here
EOF
chmod 700 /home/kenpat/.ssh
chmod 600 /home/kenpat/.ssh/authorized_keys
chown -R kenpat:kenpat /home/kenpat/.ssh
```
**Lesson:** Always keep your public key backed up. Cloud providers (Hetzner, AWS, DigitalOcean)
all have web-based console access for exactly this situation. Never rely only on SSH for
access to a remote server.
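
Before and after restoring the key, it helps to confirm whether the server's `authorized_keys` actually contains it. A minimal sketch with a hypothetical function name; it compares only the base64 key body, so differing comments or options on the line do not matter. The `.pub` path in the usage comment is assumed from the private key name used above.

```bash
#!/usr/bin/env bash
# key_is_authorized PUBKEY_FILE AUTHORIZED_KEYS_FILE
# Succeeds if the key body (second field of the .pub file) appears in the
# authorized_keys file.
key_is_authorized() {
  local blob
  blob=$(awk 'NF >= 2 { print $2; exit }' "$1")
  [ -n "$blob" ] && grep -qF -- "$blob" "$2"
}

# Usage (run where both files are readable):
# key_is_authorized ~/.ssh/id_ed25519_kscloud1.pub /home/kenpat/.ssh/authorized_keys \
#   && echo "key present" || echo "key missing"
```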
---
### Issue 7 — ufw Blocking Docker Container to Host Port
**Symptom:** The portal homepage on kscloud1 showed "0%" and "Offline" for the System Status
widget. On monk it showed real values.
**Investigation:**
```bash
# Test the metrics API directly from inside the homepage container on kscloud1
docker exec homepage-backup curl -s http://host.docker.internal:8000/api/metrics
# No response after timeout
# Test from host directly
curl -s http://localhost:8000/api/metrics
# Returns real metrics immediately
# Check ufw rules
sudo ufw status verbose
# default deny incoming — no specific rule for port 8000
```
**Root cause:** The `kitestacks-metrics-api` container runs with `network_mode: host`,
so it listens on the host's own port 8000. When `homepage-backup` calls
`host.docker.internal:8000`, the packet reaches the host with a source IP from the Docker
bridge network (`172.x.x.x`), and ufw's `default deny incoming` drops it.
Docker's iptables rules (which let published ports work despite ufw) do not apply here
because this is container-to-host traffic handled by the host's INPUT chain, not forwarded
traffic to a published container port.
**Fix:**
```bash
sudo ufw allow from 172.16.0.0/12 to any port 8000 proto tcp
sudo ufw status verbose # Verify rule added
```
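`172.16.0.0/12` covers 172.16.x.x through 172.31.x.x, which includes Docker's default 172.x bridge subnets. A custom network outside that range (for example in 192.168.x.x) would need its own rule.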
**Verification:**
```bash
docker exec homepage-backup curl -s http://host.docker.internal:8000/api/metrics
# Now returns: {"cpu_percent": 4.2, "ram_percent": 71.3, ...}
```
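
To confirm that a given container's address is actually covered by the rule, the range check can be done by hand on the first two octets. A sketch with a hypothetical function name; the `docker inspect` template in the usage comment prints the container's IP on each attached network.

```bash
#!/usr/bin/env bash
# in_172_16_slash12 IP: succeed if IP falls inside 172.16.0.0/12,
# i.e. first octet 172 and second octet 16 through 31.
in_172_16_slash12() {
  local o1 o2 rest
  o1=${1%%.*}
  rest=${1#*.}
  o2=${rest%%.*}
  # Reject anything whose second octet is empty or not a plain number.
  case "$o2" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$o1" = "172" ] && [ "$o2" -ge 16 ] && [ "$o2" -le 31 ]
}

# Usage:
# ip=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' homepage-backup)
# in_172_16_slash12 "$ip" && echo "covered by the ufw rule" || echo "NOT covered: $ip"
```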
---
## General Troubleshooting Cheatsheet
| Symptom | First Commands to Run |
|---------|----------------------|
| Container won't start | `docker logs <container>` |
| Container starts then crashes | `docker logs <container> --tail 30` |
| Can't reach service from browser | `docker exec cloudflared curl -s http://<service>:<port>` |
| SSL/TLS error in browser | `curl -sv https://yourdomain.com` (check Cloudflare is resolving) |
| SSO failing with invalid_grant | Check both Authentik instances point to same shared Postgres |
| Database error | Check data directory permissions: `ls -la ./data/` |
| Port already in use | `sudo ss -tlnp \| grep :<port>` |
| Out of disk space | `df -h` and `docker system df` |
| Out of RAM | `free -h` and `docker stats --no-stream` |
| Can't ping between containers | `docker network inspect kitestacks` |
| Forgejo 502 | `docker logs forgejo` — likely DB connection issue |
| Authentik won't start | Check it can reach `$KSCLOUD1_TAILSCALE:5432` (Tailscale up?) |
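
For the disk-space row, a small filter over `df -P` output makes the full filesystems stand out. This is a sketch with a hypothetical function name; it assumes the POSIX `df -P` column layout and mount points without spaces.

```bash
#!/usr/bin/env bash
# full_mounts THRESHOLD: read `df -P` output on stdin and print the mount point
# and usage of every filesystem at or above THRESHOLD percent.
full_mounts() {
  awk -v limit="$1" '
    NR > 1 {
      pct = $5
      sub(/%/, "", pct)
      if (pct + 0 >= limit + 0) print $6 " " $5
    }'
}

# Usage:
# df -P | full_mounts 90
```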