Complete documentation suite for KiteStacks covering all 11 services across 2-host active-active architecture. Includes beginner track (with AI, 8 files) and advanced track (without AI, 7 files) with time estimates, real troubleshooting cases, and command-by-command explanations. Updates certifications roadmap to reflect July 7 2026 A+ Core 2 exam goal. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Without AI — Part 7: Troubleshooting

**Track:** Advanced (No AI)
**Time for this section:** Ongoing (this is a reference you return to)

Troubleshooting is not a step you complete — it is a skill you build over time.
This section teaches the methodology and documents the real issues encountered
building KiteStacks, with full explanations of how each was diagnosed and fixed.

---

## The Troubleshooting Mindset

Before running any command, form a hypothesis. Before Googling, read the error.

**The diagnostic loop:**
1. **Observe** — what exactly is failing? URL? Error message? Which service?
2. **Hypothesize** — what could cause this? List 2–3 possibilities
3. **Test** — run the simplest command to prove or disprove your hypothesis
4. **Narrow** — eliminate possibilities until one remains
5. **Fix** — apply the fix
6. **Verify** — confirm the fix worked
7. **Document** — write what broke and what fixed it

The most common mistake: jumping to step 5 without completing steps 2–4.

---
## Diagnostic Commands to Know Cold

```bash
# Container status
docker ps                          # All running containers
docker ps -a                       # All containers (including stopped)
docker inspect <container>         # Full container config and state

# Logs
docker logs <container>            # All logs
docker logs <container> --tail 50  # Last 50 lines
docker logs <container> -f         # Follow live
docker logs <container> --since 5m # Last 5 minutes

# Network
docker exec <container> curl -s http://other-container:port/health
docker exec <container> nslookup other-container
docker exec <container> ss -tlnp
docker network inspect kitestacks

# Disk and resources
docker system df                   # Docker disk usage
docker stats --no-stream           # One-shot resource usage
df -h                              # Host disk usage
free -h                            # Host RAM

# DNS and HTTP from host
curl -sv https://grafana.kitestacks.com  # -v = verbose (shows headers, TLS)
dig grafana.kitestacks.com               # DNS lookup
```
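Several of the commands above are usually run together when a single container misbehaves. The following is a minimal sketch of a wrapper that does that in one step. The function name `triage` is hypothetical and not part of KiteStacks; it only chains commands already listed above.

```bash
#!/usr/bin/env bash
# triage CONTAINER: print state, recent logs, and resource usage for one container.
# Hypothetical helper for illustration; it wraps only the docker commands shown above.
triage() {
  local container="$1"
  if [ -z "$container" ]; then
    echo "usage: triage <container>" >&2
    return 2
  fi
  echo "== state =="
  docker inspect --format '{{.State.Status}}' "$container"
  echo "== last 50 log lines =="
  # Merge stderr into stdout so both log streams appear in order.
  docker logs --tail 50 "$container" 2>&1
  echo "== resource usage =="
  docker stats --no-stream "$container"
}

# Example: triage grafana
```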
---

## Real Issues Encountered Building KiteStacks

### Issue 1 — SSO: `invalid_grant` on OAuth Login (50% of the time)

**Symptom:** Clicking "Sign in with Authentik" in Grafana, Kavita, etc. sometimes
worked and sometimes showed `invalid_grant: The provided authorization grant is invalid`.
It happened roughly 50% of the time, with no correlation to time of day.

**Observation:** The error appeared specifically after the authorization code redirect,
during the token exchange step.

**Hypotheses:**
1. Authentik is misconfigured (but then it would fail 100% of the time)
2. Network issue (but an HTTP 400 means the request reached Authentik)
3. The authorization code issued at the redirect step is not found at the token exchange step

**Testing:**
```bash
# Check whether both Authentik instances see the same database:
# run the count against each instance's Postgres and compare the results
docker exec authentik psql -U authentik -h "$KSCLOUD1_IP" -c "SELECT count(*) FROM authentik_providers_oauth2_authorizationcode;"
# monk's Authentik: count = 3
# kscloud1's Authentik: count = 1
# Different! The redirect step created the code in one DB, the token exchange looked in the other.
```
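The comparison above can be made repeatable with a small helper. This is a sketch under assumptions: the function name `compare_counts` is hypothetical, and it only compares two values that you have already collected from each host. Matching counts are a hint rather than proof, since codes are created and expire continuously.

```bash
#!/usr/bin/env bash
# compare_counts A B: report whether two row counts (or any two values) agree.
# Hypothetical helper; it does not talk to Postgres itself.
compare_counts() {
  local a="$1" b="$2"
  if [ "$a" = "$b" ]; then
    echo "MATCH: both report $a"
  else
    echo "DIVERGED: $a vs $b (separate databases?)"
    return 1
  fi
}

# Hypothetical usage, run once per host and paste the numbers in.
# psql's -t (tuples only) and -A (unaligned) flags strip the table decoration:
#   docker exec authentik-postgres psql -U authentik -d authentik -tA \
#     -c "SELECT count(*) FROM authentik_providers_oauth2_authorizationcode;"
#   compare_counts 3 1
```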
**Root cause:** Two Authentik instances, two separate Postgres databases. Cloudflare
routes `/application/o/authorize/` and `/application/o/token/` independently, so the two
requests of one login can hit different hosts.

**Fix:** Migrate both Authentik instances to a single shared Postgres, hosted on kscloud1,
bound to the Tailscale IP only.

```bash
# 1. Dump monk's Authentik DB
docker exec authentik-postgres pg_dump -U authentik authentik --clean --if-exists \
  > /tmp/authentik_dump.sql

# 2. Copy the dump to kscloud1 and restore it into the new shared Postgres
scp /tmp/authentik_dump.sql kenpat@100.123.x.x:/tmp/
ssh kenpat@100.123.x.x "docker exec -i authentik-postgres psql -U authentik -d authentik \
  < /tmp/authentik_dump.sql"

# 3. In monk's Authentik .env, point Postgres and Redis at kscloud1's Tailscale IP
#    (these two lines are .env entries, not shell commands)
AUTHENTIK_POSTGRESQL__HOST=100.123.x.x
AUTHENTIK_REDIS__HOST=100.123.x.x

# 4. Stop monk's local Postgres and Redis (stop, don't delete: the data stays as a backup)
docker stop authentik-postgres authentik-redis

# 5. Restart monk's Authentik
docker compose up -d
```
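After step 3 it is easy to leave one host pointing at the old database. Below is a minimal sketch that compares the Postgres host named in two `.env` files. The function names are hypothetical, and the check is purely textual: if kscloud1 legitimately refers to its own Postgres by a local hostname, the two values will differ even though the database is shared, so treat a mismatch as a prompt to look rather than a verdict.

```bash
#!/usr/bin/env bash
# env_value FILE KEY: print the value of KEY from a dotenv-style file.
# Commented-out lines are ignored; if KEY is set twice, the last one wins.
env_value() {
  local file="$1" key="$2"
  grep -E "^${key}=" "$file" | tail -n 1 | cut -d= -f2-
}

# same_db_host FILE_A FILE_B: succeed only if both files name the same,
# non-empty AUTHENTIK_POSTGRESQL__HOST.
same_db_host() {
  local a b
  a="$(env_value "$1" AUTHENTIK_POSTGRESQL__HOST)"
  b="$(env_value "$2" AUTHENTIK_POSTGRESQL__HOST)"
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Hypothetical usage, with copies of both hosts' files in the current directory:
#   same_db_host monk.env kscloud1.env && echo "same Postgres host"
```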
**Verification:** Logged in from a browser with DevTools open, watching the Network tab.
`/authorize` returned 302 with a code. `/token` returned 200 with a JWT. Done.

**Lesson:** Stateful services with active-active routing need shared state. Any session,
token, or code stored in one instance's database is invisible to the other instance.

---
### Issue 2 — Phantom Third Connector in Cloudflare Dashboard

**Symptom:** Cloudflare Tunnel showed 3 active connectors when only 2 were expected
(monk + kscloud1). Which was the third?

**Investigation:**
```bash
# Check running Docker containers for cloudflared
docker ps | grep cloudflared
# Shows: one cloudflared container, as expected

# Check for non-Docker cloudflared processes
# (the [c] bracket keeps grep from matching its own command line)
ps aux | grep '[c]loudflared'
# Shows: TWO processes!
# /usr/bin/cloudflared (system-installed, running as a systemd service)
# /usr/local/bin/cloudflared (Docker container)
```

**Root cause:** A cloudflared systemd service was installed separately from the Docker
container. Both connected to the same tunnel with the same token, registering as separate connectors.

```bash
# Verify the systemd service
sudo systemctl status cloudflared

# Fix: stop the systemd service and keep it from starting at boot
sudo systemctl stop cloudflared
sudo systemctl disable cloudflared

# Verify only one connector process remains
ps aux | grep '[c]loudflared'
```
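Counting processes by eye is error-prone because `grep` itself can show up in the listing. The sketch below counts only processes whose executable is named `cloudflared`, reading `ps aux`-style output from standard input. The function name `count_procs` is hypothetical, and it assumes the usual `ps aux` layout where the command starts in the 11th column.

```bash
#!/usr/bin/env bash
# count_procs NAME: read `ps aux` output on stdin and count processes whose
# executable (column 11, with any directory stripped) is exactly NAME.
# A `grep NAME` or `awk` helper process is therefore never counted.
count_procs() {
  awk -v name="$1" '
    { cmd = $11; sub(/.*\//, "", cmd) }   # /usr/bin/cloudflared -> cloudflared
    cmd == name { n++ }
    END { print n + 0 }
  '
}

# Hypothetical usage on a host:
#   ps aux | count_procs cloudflared
```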
**Verification:** The Cloudflare dashboard refreshed to show 2 connectors within 30 seconds.

**Lesson:** A service installed via the package manager AND in Docker is a recipe for duplicate
processes. Check both `docker ps` and `ps aux` when troubleshooting unexpected behavior.

---
### Issue 3 — Karakeep SSO "Redirect URI Error"

**Symptom:** After configuring Authentik OAuth2 for Karakeep, clicking "Sign in"
showed "Redirect URI Error: The provided redirect_uri does not match any of the
allowed redirect URIs" from Authentik.

**Investigation:**
```bash
# Check which redirect URI the app actually sent in the OAuth2 request,
# read from Authentik's logs (2>&1 so both log streams reach grep)
docker logs authentik --tail 100 2>&1 | grep "redirect_uri"
# Shows: redirect_uri=https://links.kitestacks.com/api/auth/callback/custom
# Authentik was configured to allow .../api/auth/callback/authentik
```
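The same value can also be read from the browser: the authorize request in the DevTools Network tab carries `redirect_uri` as a URL-encoded query parameter. The sketch below decodes it in plain bash. Function names are hypothetical, and the example URL in the usage comment is made up for illustration (including the `client_id`).

```bash
#!/usr/bin/env bash
# urldecode STRING: turn %XX escapes into characters and '+' into spaces.
urldecode() {
  local s="${1//+/ }"
  # Rewrite each % as \x so printf's %b interprets the hex escapes.
  printf '%b' "${s//%/\\x}"
}

# redirect_uri_from URL: print the decoded redirect_uri query parameter.
# Returns 1 if the URL has no such parameter.
redirect_uri_from() {
  local query="${1#*\?}" pair pairs
  IFS='&' read -r -a pairs <<< "$query"
  for pair in "${pairs[@]}"; do
    case "$pair" in
      redirect_uri=*)
        urldecode "${pair#redirect_uri=}"
        echo
        return 0
        ;;
    esac
  done
  return 1
}

# Hypothetical usage with a URL copied from DevTools:
#   redirect_uri_from 'https://auth.kitestacks.com/application/o/authorize/?client_id=example&redirect_uri=https%3A%2F%2Flinks.kitestacks.com%2Fapi%2Fauth%2Fcallback%2Fcustom'
```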
**Root cause:** Karakeep uses NextAuth.js internally with provider ID `custom`.
NextAuth constructs callback URLs as `/api/auth/callback/<provider-id>`.
The provider ID is `custom`, not `authentik`.

So the callback is `/api/auth/callback/custom`, not `/api/auth/callback/authentik`.

**Fix:**
```bash
# Update Authentik's OAuth2 provider for Karakeep in the shared Postgres
docker exec -it authentik-postgres psql -U authentik -d authentik

# The statements below are typed at the psql prompt, not in the shell
BEGIN;
UPDATE authentik_providers_oauth2_oauth2provider
SET _redirect_uris = '["https://links.kitestacks.com/api/auth/callback/custom"]'
WHERE name = 'Karakeep';
COMMIT;

-- Verify
SELECT name, _redirect_uris FROM authentik_providers_oauth2_oauth2provider WHERE name = 'Karakeep';
\q
```

Restart Authentik on both hosts:
```bash
docker compose restart authentik authentik-worker
# Wait for healthy before testing
```
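"Wait for healthy" can be scripted instead of watched. The following is a minimal sketch of a generic retry loop. The names `wait_until` and `is_healthy` are hypothetical, and the commented usage assumes the Authentik container defines a Docker healthcheck, which the text above implies but does not show.

```bash
#!/usr/bin/env bash
# wait_until ATTEMPTS DELAY_SECONDS COMMAND...: run COMMAND repeatedly until it
# succeeds or ATTEMPTS is exhausted, sleeping DELAY_SECONDS between tries.
wait_until() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}

# Hypothetical usage (assumes a healthcheck is configured for the container):
#   is_healthy() { [ "$(docker inspect --format '{{.State.Health.Status}}' "$1")" = "healthy" ]; }
#   wait_until 30 10 is_healthy authentik
```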
**Lesson:** When you get a redirect URI mismatch, always check what URI the app is
actually sending, not what you think it should send. The app's logs or the browser DevTools
Network tab show the actual request.

---
### Issue 4 — Kavita OIDC Config Gets Wiped on Restart

**Symptom:** Configured Kavita's OIDC settings by editing `kavita.db` directly
(using sqlite3). Settings looked correct in the DB. After `docker compose restart kavita`,
the OIDC config was reset to empty/disabled.

**Investigation:**
```bash
# Check the ServerSetting row before and after restart
docker exec -it kavita sqlite3 /kavita/config/kavita.db \
  "SELECT Value, RowVersion FROM ServerSetting WHERE \"Key\"=40;"
# Before restart: {"enabled":true,"authority":"...","clientId":"kavita",...}, RowVersion=8
# After restart: {"enabled":false,"authority":"","clientId":"","clientSecret":"",...}, RowVersion=10
# RowVersion incremented by 2: Kavita wrote to the row twice during startup
```

**Root cause:** Kavita validates `ServerSetting` rows during startup and resets them from
its own defaults. Any value that does not pass Kavita's internal validation (including
OIDC config in the wrong format) gets reset to defaults. Direct SQL writes do not
go through Kavita's validation pipeline, so they get overwritten.

**Fix:** Use Kavita's own Settings UI via SSH port forwarding to bypass Cloudflare
and reach kscloud1's Kavita directly:

```bash
# Forward kscloud1's Kavita port to localhost
ssh -L 5099:localhost:5000 -i ~/.ssh/id_ed25519_kscloud1 kenpat@100.123.x.x -N &
# Now visit http://localhost:5099 in browser
# Log in with your Kavita credentials
# Settings → OIDC → configure there
# Click Save → changes survive restart
```
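When entering the Authority URL in the UI, it can help to compare it against the issuer that Authentik advertises, since this issue's trailing-slash requirement comes from an exact string match. The sketch below extracts `issuer` from a discovery document and compares it byte for byte. Function names are hypothetical, and the `sed` extraction assumes a simple, unescaped string value rather than doing full JSON parsing.

```bash
#!/usr/bin/env bash
# issuer_from_json: read an OpenID discovery document on stdin and print
# the value of its "issuer" field (simple sed match, not a JSON parser).
issuer_from_json() {
  sed -n 's/.*"issuer"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# authority_matches AUTHORITY ISSUER: exact comparison, so a missing
# trailing slash counts as a mismatch.
authority_matches() {
  [ "$1" = "$2" ]
}

# Hypothetical usage:
#   issuer="$(curl -s https://auth.kitestacks.com/application/o/kavita/.well-known/openid-configuration | issuer_from_json)"
#   authority_matches "https://auth.kitestacks.com/application/o/kavita/" "$issuer" && echo "authority OK"
```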
**Verification:** After saving in the UI, confirmed that `RowVersion` no longer incremented on restart.

**Lesson:** Do not write directly to application databases unless you know the app does not
reinitialize those values on startup. Use the application's own APIs or UI.

**Critical detail:** The Authority URL MUST have a trailing slash:
`https://auth.kitestacks.com/application/o/kavita/`
Without it, you get an "issuer does not match" error, because Authentik's `openid-configuration`
returns an `issuer` field that includes the trailing slash, and Kavita compares them exactly.

---
### Issue 5 — SSO Login Fails After monk Reconnects

**Symptom:** When monk went offline and came back, SSO logins failed for 5–10 minutes
with `invalid_grant`, then started working again.

**Investigation:**
Timeline reconstruction:

- T+0: monk goes offline (power or network)
- T+0: kscloud1 handles all traffic solo; SSO works fine, codes stored in shared DB
- T+5min: monk comes back online, cloudflared reconnects
- T+5min to T+8min: monk's Authentik is still starting (container startup takes ~3–5 min)
- During this window: Cloudflare routes some `/authorize` requests to kscloud1 and some `/token` requests to monk
- monk's Authentik hasn't finished starting, so it responds with errors or invalid state

**Root cause:** The OAuth2 authorization code has a 1-minute TTL (default). monk's Authentik
takes 3–5 minutes to fully start. During startup, Cloudflare is already routing traffic to
monk's cloudflared (which is running), but monk's Authentik is not ready.

Codes created on kscloud1 expire before monk's Authentik is healthy enough to serve them.

**Fix:** Increase the OAuth2 code TTL from 1 minute to 10 minutes:

```bash
docker exec -it authentik-postgres psql -U authentik -d authentik

# At the psql prompt. There is no WHERE clause, so this applies to every OAuth2 provider.
UPDATE authentik_providers_oauth2_oauth2provider
SET access_code_validity = '00:10:00';

\q
```

Restart both Authentik instances. Now codes have a 10-minute window, which is enough for monk
to finish starting before the code expires.

**Alternative/additional fix:** Add a health check to monk's cloudflared or Authentik
that keeps cloudflared from accepting traffic until Authentik is healthy.

---
### Issue 6 — kscloud1 SSH Key Auth Broken After Long Absence

**Symptom:** After not connecting to kscloud1 for several weeks, `ssh kenpat@kscloud1`
returned "Permission denied (publickey)".

**Investigation:**
```bash
ssh -v -i ~/.ssh/id_ed25519_kscloud1 kenpat@100.123.x.x
# Verbose output showed: offered key was not accepted
# No other errors — key was being offered but rejected
```

**Root cause:** The `authorized_keys` file on kscloud1 had somehow been reset or corrupted
(possibly from a VPS maintenance event or snapshot restore).

**Fix:** Use Hetzner's console (web-based terminal that does not require SSH):
1. Hetzner dashboard → Server → Console
2. Log in as root (reset root password via Hetzner UI if needed)
3. Restore the public key:

```bash
# On kscloud1 via Hetzner console
mkdir -p /home/kenpat/.ssh
cat >> /home/kenpat/.ssh/authorized_keys << 'EOF'
ssh-ed25519 AAAA... your-public-key-here
EOF
chmod 700 /home/kenpat/.ssh
chmod 600 /home/kenpat/.ssh/authorized_keys
chown -R kenpat:kenpat /home/kenpat/.ssh
```
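A key can also be rejected when the file is present but its permissions are too open, because sshd's `StrictModes` setting (on by default) refuses such files. The sketch below checks the two modes set above. The function name is hypothetical, and it assumes GNU `stat` (the `-c '%a'` form), as found on typical Linux servers.

```bash
#!/usr/bin/env bash
# check_ssh_perms HOME_DIR: verify that HOME_DIR/.ssh is mode 700 and that
# authorized_keys is mode 600 and non-empty. Prints each problem found and
# returns non-zero if there is at least one.
check_ssh_perms() {
  local ssh_dir="$1/.ssh" keys="$1/.ssh/authorized_keys" status=0
  if [ "$(stat -c '%a' "$ssh_dir" 2>/dev/null)" != "700" ]; then
    echo "bad or missing: $ssh_dir (want 700)"
    status=1
  fi
  if [ "$(stat -c '%a' "$keys" 2>/dev/null)" != "600" ]; then
    echo "bad or missing: $keys (want 600)"
    status=1
  fi
  if [ ! -s "$keys" ]; then
    echo "empty or missing: $keys"
    status=1
  fi
  return "$status"
}

# Hypothetical usage on kscloud1:
#   check_ssh_perms /home/kenpat && echo "permissions look right"
```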
**Lesson:** Always keep your public key backed up. Cloud providers (Hetzner, AWS, DigitalOcean)
all have web-based console access for exactly this situation. Never rely only on SSH for
access to a remote server.

---
### Issue 7 — ufw Blocking Docker Container to Host Port

**Symptom:** The portal homepage on kscloud1 showed "0%" and "Offline" for the System Status
widget. On monk it showed real values.

**Investigation:**
```bash
# Test the metrics API directly from inside the homepage container on kscloud1
docker exec homepage-backup curl -s http://host.docker.internal:8000/api/metrics
# No response after timeout

# Test from host directly
curl -s http://localhost:8000/api/metrics
# Returns real metrics immediately

# Check ufw rules
sudo ufw status verbose
# default deny incoming, and no specific rule for port 8000
```

**Root cause:** The `kitestacks-metrics-api` container runs with `network_mode: host`, so it
listens on the host's own port 8000. When `homepage-backup` calls `host.docker.internal:8000`,
the kernel sees the source IP as the Docker bridge network (`172.x.x.x`), and ufw's
`default deny incoming` blocks it.

Docker's iptables bypass (which lets published ports work despite ufw) does not apply
here, because the traffic is addressed to the host itself rather than forwarded to a
container's published port.

**Fix:**
```bash
sudo ufw allow from 172.16.0.0/12 to any port 8000 proto tcp
sudo ufw status verbose  # Verify rule added
```

`172.16.0.0/12` spans 172.16.x.x through 172.31.x.x, the range Docker's default bridge
networks are normally allocated from.

**Verification:**
```bash
docker exec homepage-backup curl -s http://host.docker.internal:8000/api/metrics
# Now returns: {"cpu_percent": 4.2, "ram_percent": 71.3, ...}
```
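To confirm that a particular container address is actually covered by the `172.16.0.0/12` rule, the membership test can be done with integer arithmetic. This is a minimal sketch in plain bash with hypothetical function names. If `docker network inspect` shows a subnet outside this range, the ufw rule would need a matching source.

```bash
#!/usr/bin/env bash
# ip_to_int A.B.C.D: print the IPv4 address as a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# in_cidr IP NETWORK PREFIX: succeed if IP lies inside NETWORK/PREFIX.
in_cidr() {
  local ip net mask
  ip="$(ip_to_int "$1")"
  net="$(ip_to_int "$2")"
  # Build a mask with PREFIX leading one-bits, e.g. /12 -> 0xFFF00000.
  mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Hypothetical usage with an address taken from `docker network inspect`:
#   in_cidr 172.20.0.5 172.16.0.0 12 && echo "covered by the ufw rule"
```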
---

## General Troubleshooting Cheatsheet

| Symptom | First Commands to Run |
|---------|----------------------|
| Container won't start | `docker logs <container>` |
| Container starts then crashes | `docker logs <container> --tail 30` |
| Can't reach service from browser | `docker exec cloudflared curl -s http://<service>:<port>` |
| SSL/TLS error in browser | `curl -sv https://yourdomain.com` (check that the name resolves to Cloudflare and the TLS handshake completes) |
| SSO failing with `invalid_grant` | Check both Authentik instances point to the same shared Postgres |
| Database error | Check data directory permissions: `ls -la ./data/` |
| Port already in use | `sudo ss -tlnp \| grep :<port>` |
| Out of disk space | `df -h` and `docker system df` |
| Out of RAM | `free -h` and `docker stats --no-stream` |
| Can't ping between containers | `docker network inspect kitestacks` |
| Forgejo 502 | `docker logs forgejo` (likely a DB connection issue) |
| Authentik won't start | Check it can reach `$KSCLOUD1_TAILSCALE:5432` (Tailscale up?) |