- DEBUGGING.md: add issue #9 — native cloudflared systemd running alongside Docker container causes phantom 3rd replica in CF dashboard; fix is to disable systemd service - RUNBOOK.md: correct architecture diagram from 3 connectors to 2 (monk Docker + kscloud1); add warning to disable native cloudflared systemd after containerizing; update failover test procedure with verified 2026-06-16 results (zero downtime confirmed) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# KiteStacks Homelab - Debugging & Troubleshooting

This document contains solutions and diagnostic steps for known issues that have occurred during the setup and operation of the KiteStacks homelab.

---

## 1. osTicket: New User Activation Emails Not Sending

**Symptom:** When a new user registers for a Help Desk account, they do not receive the activation email.

**Root Cause:** osTicket runs in a Docker container without a local Mail Transfer Agent (MTA) like Postfix or Sendmail. By default, PHP's internal `mail()` function silently fails because it cannot route the email.

**Fix:** You must configure an external SMTP server in the osTicket Admin Panel.

1. Log into the osTicket Staff Control Panel (`/scp/`).
2. Go to **Emails > Emails**.
3. Select the default outbound email address (e.g., `noreply@kitestacks.com`).
4. Scroll down to **SMTP Settings** and configure it to use a real mail provider (e.g., SendGrid, Mailgun, Amazon SES, or Gmail SMTP).
5. Ensure **Authentication Required** is set to **Yes**.
6. Save and send a test email.

## 2. Cloudflare Tunnel "Hmm. We're having trouble finding that site"

**Symptom:** A subdomain is correctly configured in Cloudflare Zero Trust, but visiting the site returns a Cloudflare error.

**Root Cause 1:** The internal service is down or restarting. Check the Docker container logs.

**Root Cause 2:** Mismatched replicas. Cloudflare distributes requests between the `monk` and `kscloud1` connectors. If you update a container on `monk` but forget to update it on `kscloud1`, requests routed through `kscloud1` will fail or show stale data.

**Fix:** Ensure Docker containers on both hosts are perfectly mirrored or explicitly configure Cloudflare to route only to the active host for that specific subdomain.

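
One way to spot a missing or mismatched container is to compare what each host is running. The sketch below assumes `ssh monk` and `ssh kscloud1` work from wherever you run it (hypothetical SSH aliases); `only_in_first` and `compare_hosts` are illustrative helper names, not existing tooling. It compares container names and image tags only, so it will not catch two hosts running different builds of the same tag.

```bash
# only_in_first A B: print lines present in list A but missing from list B.
only_in_first() {
  a=$(mktemp) b=$(mktemp)
  printf '%s\n' "$1" | sed '/^$/d' | sort -u > "$a"
  printf '%s\n' "$2" | sed '/^$/d' | sort -u > "$b"
  comm -23 "$a" "$b"
  rm -f "$a" "$b"
}

# compare_hosts: list containers that run on one host but not on the other.
# Assumption: both hosts are reachable over SSH under these names.
compare_hosts() {
  fmt='{{.Names}} {{.Image}}'
  on_monk=$(ssh monk docker ps --format "'$fmt'")
  on_ks=$(ssh kscloud1 docker ps --format "'$fmt'")
  echo "Only on monk:";     only_in_first "$on_monk" "$on_ks"
  echo "Only on kscloud1:"; only_in_first "$on_ks" "$on_monk"
}
```

Running `compare_hosts` after a deploy should print nothing under either heading when the two hosts are in sync.
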
## 3. Authentik "invalid_grant" or "Code does not exist"

**Symptom:** Logging into a service via Authentik SSO randomly fails with "invalid_grant".

**Root Cause:** Initially, `monk` and `kscloud1` ran separate Authentik Postgres databases. An authorization code issued by one node was often redeemed on the other, which had no record of it, so validation failed.

**Fix:** The Authentik databases are now unified over Tailscale. `monk` points its Postgres and Redis connections to `100.123.254.52` (`kscloud1`). Do not run a local Postgres for Authentik on `monk`. If Tailscale goes down, `monk` loses its database connection and SSO requests served by `monk` will fail.

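
A quick reachability check from `monk` can help tell a Tailscale outage apart from an Authentik problem. This is a minimal sketch under two assumptions that the text above does not confirm: `nc` (netcat) with `-z` support is installed, and Postgres and Redis listen on their default ports 5432 and 6379. `port_open` and `check_authentik_backends` are hypothetical names.

```bash
# port_open HOST PORT: succeed if a TCP connection opens within 3 seconds.
port_open() {
  nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# check_authentik_backends: probe the shared database host over Tailscale.
check_authentik_backends() {
  host=100.123.254.52            # kscloud1's Tailscale address, from the text above
  for port in 5432 6379; do      # assumed default Postgres and Redis ports
    if port_open "$host" "$port"; then
      echo "ok: $host:$port reachable"
    else
      echo "FAIL: $host:$port unreachable (is Tailscale up?)"
    fi
  done
}
```

Run `check_authentik_backends` on `monk` when SSO starts failing; two `FAIL` lines suggest the tunnel between the hosts is the problem rather than Authentik itself.
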
## 4. Kavita "Sign in with Authentik" Button Missing

**Symptom:** The Authentik OIDC login button does not appear on the Kavita login screen.

**Root Cause:** Kavita stores OIDC settings in its internal SQLite database (`kavita.db`), not in an environment variable.

**Fix:** The OIDC settings must be configured manually via the Kavita UI (Admin Settings -> OIDC). Direct SQL edits are overwritten by the Kavita container upon restart.

## 5. Portainer Password Reset

**Symptom:** Admin password is lost.

**Fix:** Stop the Portainer container. You must use a Go container with `bbolt` to patch the underlying BoltDB directly, or temporarily pass the `--admin-password` flag to the container entrypoint to reset it.

## 6. Uptime Kuma "Reconnecting to server..." Loop

**Symptom:** The Uptime Kuma UI constantly shows "Reconnecting...".

**Fix:** Uptime Kuma requires WebSockets. Ensure Cloudflare Tunnel does not aggressively cache the HTML/JS and that Nginx proxy timeouts do not close the WebSocket connection prematurely.

## 7. Forgejo Authentication Failures (LDAP/OIDC)

**Symptom:** Forgejo throws a 500 or redirect error after SSO login.

**Fix:** Ensure the `ROOT_URL` in Forgejo's `app.ini` exactly matches the public domain (`https://gitforge.kitestacks.com/`), and that the Authentik Application Launch URL strictly matches the OAuth redirect URI configured in Forgejo.

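
The `ROOT_URL` comparison can be checked mechanically instead of by eye. The sketch below is illustrative only: `root_url` and `check_root_url` are hypothetical helper names, and the location of `app.ini` depends on how the container mounts its data, so the path is passed in as an argument.

```bash
# root_url FILE: print the first ROOT_URL value found in an app.ini-style file.
root_url() {
  sed -n 's/^[[:space:]]*ROOT_URL[[:space:]]*=[[:space:]]*//p' "$1" | head -n 1
}

# check_root_url FILE: compare ROOT_URL against the public domain named above.
check_root_url() {
  expected='https://gitforge.kitestacks.com/'
  actual=$(root_url "$1")
  if [ "$actual" = "$expected" ]; then
    echo "ROOT_URL ok"
  else
    echo "ROOT_URL mismatch: got '$actual', expected '$expected'"
    return 1
  fi
}
```

Call it as `check_root_url /path/to/app.ini`, substituting the real path on your host. A missing trailing slash or an `http://` scheme shows up as a mismatch.
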
## 9. Phantom 3rd Cloudflare Tunnel Replica

**Symptom:** The Cloudflare dashboard shows 3 active replicas when only 2 connectors are expected (`monk` + `kscloud1`).

**Root Cause:** A native `cloudflared` systemd service was installed on `monk` before cloudflared was containerized. It keeps running alongside the Docker container. Both connect with the same tunnel token, so each registers as a separate connector.

**Diagnosis:**

```bash
# Is the native systemd unit present and running?
systemctl status cloudflared
# List cloudflared processes; the [c] keeps grep from matching itself.
ps aux | grep '[c]loudflared'
```

If you see two cloudflared processes (one owned by root via systemd, one by the Docker container user), the systemd service is the phantom connector.

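
The check above can be wrapped in a small script, for example as part of a new-host checklist. This sketch relies on `systemctl is-active` and `systemctl is-enabled`, which print the unit state on stdout. The function names are made up for illustration, and the unit name `cloudflared` is the one used above.

```bash
# needs_disable ACTIVE_STATE ENABLED_STATE
# Succeeds if the native unit is running now or would start at boot.
needs_disable() {
  [ "$1" = "active" ] || [ "$2" = "enabled" ]
}

# check_native_cloudflared: report whether the systemd unit should be disabled.
check_native_cloudflared() {
  active=$(systemctl is-active cloudflared 2>/dev/null)
  enabled=$(systemctl is-enabled cloudflared 2>/dev/null)
  if needs_disable "$active" "$enabled"; then
    echo "native cloudflared unit is ${active:-unknown}/${enabled:-unknown}: stop and disable it"
    return 1
  fi
  echo "no native cloudflared unit running or enabled"
}
```
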
**Fix:**

```bash
sudo systemctl stop cloudflared
sudo systemctl disable cloudflared
```

The phantom replica disappears from the Cloudflare dashboard within ~1 minute. The Docker container (`restart: unless-stopped`) is the intended long-term cloudflared instance on `monk`.

**Prevention:** After containerizing cloudflared, always check for and disable any pre-existing native systemd cloudflared service. Add this check to any new host setup.

---

## 8. Random 502 Errors on New Subdomains (e.g., ntfy)

**Symptom:** Accessing a newly created subdomain (like `ntfy.kitestacks.com`) randomly returns a 502 Bad Gateway error from Cloudflare, even though it works internally.

**Root Cause:** The KiteStacks architecture uses a single Cloudflare Tunnel with multiple connectors (`monk` and `kscloud1`) for active-active high availability. Cloudflare distributes traffic across all active connectors without checking whether the target service exists behind each one. If you deploy a new service (like `ntfy`) only on `monk`, any request that Cloudflare sends to the `kscloud1` connector fails with a 502 because the container does not exist on that node.

**Fix:** For a multi-connector tunnel setup, you **must** deploy the identical service stack on all nodes. Deploy the missing container (e.g., `ntfy`) to `kscloud1` so that both connectors can route the traffic successfully.

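
To confirm that failures are intermittent rather than constant, it can help to request the subdomain repeatedly and count the status codes. This is a minimal sketch, assuming `curl` is installed; `probe` and `tally_codes` are hypothetical helper names.

```bash
# tally_codes: read one HTTP status code per line on stdin,
# print "COUNT CODE" for each distinct code.
tally_codes() {
  sort | uniq -c | awk '{print $1, $2}'
}

# probe URL [N]: request URL N times (default 20) and tally the status codes.
probe() {
  url=$1 n=${2:-20} i=0
  while [ "$i" -lt "$n" ]; do
    # Discard the body; print only the status code (000 on connection failure).
    curl -s -o /dev/null --max-time 10 -w '%{http_code}\n' "$url"
    i=$((i + 1))
  done | tally_codes
}
```

For example, `probe https://ntfy.kitestacks.com 20` prints one line per status code seen. A mix of `200` and `502` is consistent with one connector lacking the service, while only `502` suggests the service is down on every node. Cloudflare chooses the connector, so a clean run does not prove that both nodes are healthy.
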