docs: clean up runbook and create dedicated debugging guide

kenpat 2026-06-15 15:23:44 -05:00
parent bdec86b16f
commit c4c6b49bf4
2 changed files with 53 additions and 73 deletions

docs/DEBUGGING.md (new file, 44 lines)

@@ -0,0 +1,44 @@
# KiteStacks Homelab - Debugging & Troubleshooting
This document contains solutions and diagnostic steps for known issues that have occurred during the setup and operation of the KiteStacks homelab.
---
## 1. osTicket: New User Activation Emails Not Sending
**Symptom:** When a new user registers for a Help Desk account, they do not receive the activation email.
**Root Cause:** osTicket runs in a Docker container without a local Mail Transfer Agent (MTA) such as Postfix or Sendmail. PHP's built-in `mail()` function hands messages to a local sendmail binary, so with none present the email is never routed and the failure is silent.
**Fix:** You must configure an external SMTP server in the osTicket Admin Panel.
1. Log into the osTicket Staff Control Panel (`/scp/`).
2. Go to **Emails > Emails**.
3. Select the default outbound email address (e.g., `noreply@kitestacks.com`).
4. Scroll down to **SMTP Settings** and configure it to use a real mail provider (e.g., SendGrid, Mailgun, Amazon SES, or Gmail SMTP).
5. Ensure **Authentication Required** is set to **Yes**.
6. Save and send a test email.
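
If the test email from step 6 never arrives, it can help to verify the SMTP credentials outside osTicket first. The following is a minimal sketch using Python's standard `smtplib`, assuming the provider accepts STARTTLS with username/password authentication; the host, port, and credentials passed to `send_test_message` are placeholders, not values from this homelab.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender: str, recipient: str) -> EmailMessage:
    """Build a small plain-text message for checking outbound delivery."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "osTicket SMTP test"
    msg.set_content("If you can read this, the SMTP relay accepted the message.")
    return msg


def send_test_message(host: str, port: int, username: str, password: str,
                      msg: EmailMessage) -> None:
    """Send msg through an authenticated STARTTLS session.

    Raises smtplib.SMTPException (or OSError) if the relay rejects the
    connection, the credentials, or the message.
    """
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.starttls()                 # upgrade to TLS before authenticating
        smtp.login(username, password)  # mirrors "Authentication Required: Yes"
        smtp.send_message(msg)
```

Calling `send_test_message("smtp.example.com", 587, user, password, build_test_message("noreply@kitestacks.com", "you@example.com"))` with real provider values should either deliver the message or raise an exception that names the failing step.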
## 2. Cloudflare Tunnel "Hmm. We're having trouble finding that site"
**Symptom:** A subdomain is correctly configured in Cloudflare Zero Trust, but visiting the site returns a Cloudflare error.
**Root Cause 1:** The internal service is down or restarting. Check the Docker container logs.
**Root Cause 2:** Configuration drift between nodes. Cloudflare balances requests between `monk` and `kscloud1`. If you update a container on `monk` but forget to update it on `kscloud1`, the requests that land on `kscloud1` (roughly half, assuming an even split) will fail or show stale data.
**Fix:** Ensure the Docker containers on both hosts run the same image and configuration, or explicitly configure Cloudflare to route only to the active host for that specific subdomain.
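
One way to spot drift between the two hosts is to compare what each one is actually running. The sketch below assumes you have captured the output of `docker ps --format '{{.Names}} {{.Image}}'` on `monk` and on `kscloud1` as text; the function names are illustrative, and how the output is collected (SSH, copy and paste) is left open.

```python
from __future__ import annotations


def parse_images(docker_ps_output: str) -> dict[str, str]:
    """Map container name -> image from `docker ps --format '{{.Names}} {{.Image}}'`."""
    images = {}
    for line in docker_ps_output.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            name, image = parts
            images[name] = image
    return images


def find_drift(host_a_output: str,
               host_b_output: str) -> dict[str, tuple[str | None, str | None]]:
    """Return containers whose image differs, or that exist on only one host.

    Each value is (image on host A, image on host B); None means the
    container is not running on that host.
    """
    a = parse_images(host_a_output)
    b = parse_images(host_b_output)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }
```

An empty result only means the image references match. Two hosts can still pull different builds behind the same tag, so comparing image digests is a stricter check.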
## 3. Authentik "invalid_grant" or "Code does not exist"
**Symptom:** Logging into a service via Authentik SSO intermittently fails with "invalid_grant" or "Code does not exist".
**Root Cause:** Initially, `monk` and `kscloud1` ran separate Authentik Postgres databases. An authorization code issued by one node could be redeemed on the other, whose database had no record of it, so validation failed.
**Fix:** The Authentik databases are now **Unified** over Tailscale. `monk` points its Postgres and Redis connections to `100.123.254.52` (kscloud1). Do not run local Postgres for Authentik on `monk`. If Tailscale goes down, SSO will fail.
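
Because SSO now depends on the Tailscale link, a quick reachability check from `monk` can separate a network problem from an Authentik problem. This is a minimal sketch that assumes Postgres and Redis listen on their default ports (5432 and 6379) on `kscloud1`; adjust the ports if the deployment differs.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def check_authentik_backends(host: str = "100.123.254.52") -> dict:
    """Check the shared Authentik backends over Tailscale.

    Assumption: default ports, 5432 for Postgres and 6379 for Redis.
    """
    return {
        "postgres": port_open(host, 5432),
        "redis": port_open(host, 6379),
    }
```

A successful TCP connection only shows that the port is reachable over Tailscale. It does not confirm that the credentials or database name in Authentik's configuration are correct.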
## 4. Kavita "Sign in with Authentik" Button Missing
**Symptom:** The Authentik OIDC login button does not appear on the Kavita login screen.
**Root Cause:** Kavita stores OIDC settings in its internal SQLite database (`kavita.db`), not in an environment variable.
**Fix:** The OIDC settings must be configured manually via the Kavita UI (Admin Settings -> OIDC). Direct SQL edits are overwritten by the Kavita container upon restart.
## 5. Portainer Password Reset
**Symptom:** Admin password is lost.
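**Fix:** Stop the Portainer container, then either patch the underlying BoltDB directly (for example from a Go container with `bbolt`) or temporarily pass the `--admin-password` flag to the container entrypoint to reset it. Note that `--admin-password` expects a bcrypt hash, not a plain-text password. Portainer also publishes a `portainer/helper-reset-password` image that resets the admin password when run against the Portainer data volume.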
## 6. Uptime Kuma "Reconnecting to server..." Loop
**Symptom:** The Uptime Kuma UI constantly shows "Reconnecting...".
**Fix:** Uptime Kuma requires WebSockets. Ensure Cloudflare is not caching the HTML/JS, that the Nginx proxy forwards the `Upgrade` and `Connection` headers, and that its proxy timeouts are long enough that idle WebSocket connections are not closed.
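
To tell whether the proxy chain actually completes a WebSocket upgrade, you can inspect the response to a handshake request. The sketch below implements the check defined in RFC 6455: the server must answer with status 101 and a `Sec-WebSocket-Accept` header derived from the client's `Sec-WebSocket-Key`. It assumes you have already captured the status code and response headers from a manual handshake against the public hostname; the function names are illustrative.

```python
from __future__ import annotations

import base64
import hashlib

# Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
_WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def expected_accept(key: str) -> str:
    """Compute the Sec-WebSocket-Accept value for a given Sec-WebSocket-Key."""
    digest = hashlib.sha1((key + _WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def upgrade_succeeded(status: int, headers: dict[str, str], key: str) -> bool:
    """Return True if the response is a valid WebSocket upgrade for key."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    return (
        status == 101
        and lowered.get("upgrade", "").lower() == "websocket"
        and lowered.get("sec-websocket-accept") == expected_accept(key)
    )
```

If a request with upgrade headers comes back as a plain 200 or a 4xx, an intermediate proxy is probably dropping the `Upgrade` header rather than timing out the connection.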
## 7. Forgejo Authentication Failures (LDAP/OIDC)
**Symptom:** Forgejo throws a 500 or redirect error after SSO login.
**Fix:** Ensure the `ROOT_URL` in Forgejo's `app.ini` exactly matches the public domain (`https://gitforge.kitestacks.com/`), and that the Authentik Application Launch URL strictly matches the OAuth redirect URI configured in Forgejo.
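
Since the match has to be exact, a scripted comparison catches a missing trailing slash or a stray `http://` more reliably than reading the file by eye. This is a rough sketch using Python's `configparser`, assuming `app.ini` is simple enough for it to parse. The callback path built by `oauth_callback` follows the `/user/oauth2/<source name>/callback` pattern, and the source name `authentik` in the usage note is a hypothetical example.

```python
from __future__ import annotations

import configparser


def read_root_url(app_ini_text: str) -> str | None:
    """Return ROOT_URL from the [server] section of app.ini, or None if unset."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    # app.ini may begin with keys outside any section, which configparser
    # rejects, so place them under a synthetic [DEFAULT] header.
    parser.read_string("[DEFAULT]\n" + app_ini_text)
    return parser.get("server", "ROOT_URL", fallback=None)


def root_url_matches(app_ini_text: str, public_url: str) -> bool:
    """Exact string comparison, so a missing trailing slash is a mismatch."""
    return read_root_url(app_ini_text) == public_url


def oauth_callback(root_url: str, source_name: str) -> str:
    """Build the redirect URI for an authentication source named source_name."""
    return f"{root_url.rstrip('/')}/user/oauth2/{source_name}/callback"
```

For example, `oauth_callback("https://gitforge.kitestacks.com/", "authentik")` yields the URI to compare against what is registered on the Authentik side.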