From e69f236c89e529b9adbf372ae43d88c16342a17a Mon Sep 17 00:00:00 2001
From: kenpat
Date: Tue, 16 Jun 2026 21:35:23 -0500
Subject: [PATCH] docs: document phantom 3rd tunnel replica fix + update runbook for 2-connector arch
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- DEBUGGING.md: add issue #9 — native cloudflared systemd running alongside
  Docker container causes phantom 3rd replica in CF dashboard; fix is to
  disable systemd service
- RUNBOOK.md: correct architecture diagram from 3 connectors to 2 (monk
  Docker + kscloud1); add warning to disable native cloudflared systemd
  after containerizing; update failover test procedure with verified
  2026-06-16 results (zero downtime confirmed)

Co-Authored-By: Claude Sonnet 4.6
---
 RUNBOOK.md        | 27 ++++++++++++++++++---------
 docs/DEBUGGING.md | 24 ++++++++++++++++++++++++
 2 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/RUNBOOK.md b/RUNBOOK.md
index 0fab624..1fd0196 100644
--- a/RUNBOOK.md
+++ b/RUNBOOK.md
@@ -12,10 +12,9 @@ Internet
  │
  └── Cloudflare (DNS + Tunnel)
- │ Active-Active across 3 connectors
- ├── cloudflared on monk (primary home machine)
- ├── cloudflared on kscloud1 (Hetzner VPS, )
- └── cloudflared on T14s (currently OFF)
+ │ Active-Active across 2 connectors
+ ├── cloudflared on monk (primary home machine, Docker container)
+ └── cloudflared on kscloud1 (Hetzner VPS, )
 
 Tailscale overlay network (VPN mesh):
 monk
@@ -327,6 +326,12 @@ networks:
 cd ~/kitestacks-live/docker/cloudflared && docker compose up -d
 ```
 
+> **Important:** After starting the Docker container, check for a pre-existing native cloudflared systemd service and disable it — both will connect with the same token and register as separate phantom replicas in the CF dashboard:
+> ```bash
+> systemctl status cloudflared
+> sudo systemctl stop cloudflared && sudo systemctl disable cloudflared
+> ```
+
 ### 5.2 Authentik (monk side — points to shared DB on kscloud1)
 `~/kitestacks-live/docker/authentik/docker-compose.yml`:
@@ -1121,7 +1126,7 @@ All 9 service directories live under `/opt/kitestacks/docker/` on kscloud1. The
 - `FORGEJO_API_BASE=http://:` for metrics-api (monk's Forgejo over Tailscale)
 - Authentik on kscloud1 uses the same shared DB (it's the host — localhost resolves fine; use `` for consistency)
 
-### 7.1 Deploy cloudflared on kscloud1 (3rd connector)
+### 7.1 Deploy cloudflared on kscloud1 (2nd connector)
 
 Same `docker-compose.yml` as monk — same `TUNNEL_TOKEN`. Cloudflare assigns a new connector ID automatically.
 
@@ -1314,17 +1319,21 @@ Expected: all return 200 (or 301/302 for redirect-based logins).
 - [ ] `https://links.kitestacks.com` → Karakeep login with Authentik → works
 - [ ] `https://kavita.kitestacks.com` → "Sign in with authentik" → works
 
-### Failover test (disconnect monk's internet)
+### Failover test (stop monk's cloudflared)
 
-With monk's home network off (phone hotspot or at a different location):
 ```bash
-for sub in www auth gitforge tasks ai links kavita grafana status; do
+docker stop cloudflared
+sleep 5
+for sub in www auth gitforge tasks ai links kavita grafana status portainer; do
   code=$(curl -sk -o /dev/null -w "%{http_code}" "https://${sub}.kitestacks.com")
   echo "$sub: $code"
 done
+docker start cloudflared
 ```
 
-All 9 should still return 200 (served by kscloud1).
+All subdomains should return 200/302 (served by kscloud1 alone).
+
+**Verified 2026-06-16:** www=200, auth=302, status=302, portainer=200 — zero downtime during monk cloudflared outage. kscloud1 took over immediately.
 
 ### Authentik shared DB health
 
diff --git a/docs/DEBUGGING.md b/docs/DEBUGGING.md
index 810c516..3ac47c5 100644
--- a/docs/DEBUGGING.md
+++ b/docs/DEBUGGING.md
@@ -43,6 +43,30 @@ This document contains solutions and diagnostic steps for known issues that have
 **Symptom:** Forgejo throws a 500 or redirect error after SSO login.
 **Fix:** Ensure the `ROOT_URL` in Forgejo's `app.ini` exactly matches the public domain (`https://gitforge.kitestacks.com/`), and that the Authentik Application Launch URL strictly matches the OAuth redirect URI configured in Forgejo.
 
+## 9. Phantom 3rd Cloudflare Tunnel Replica
+
+**Symptom:** CF dashboard shows 3 active replicas when only 2 connectors are expected (monk + kscloud1).
+
+**Root Cause:** A native `cloudflared` systemd service was installed on monk at some point (before cloudflared was containerized). It continues running alongside the Docker container, both connecting with the same tunnel token and registering as separate connectors.
+
+**Diagnosis:**
+```bash
+systemctl status cloudflared
+ps aux | grep cloudflared
+```
+If you see two cloudflared processes (one owned by root via systemd, one by the Docker container user), the systemd service is the ghost.
+
+**Fix:**
+```bash
+sudo systemctl stop cloudflared
+sudo systemctl disable cloudflared
+```
+The phantom replica disappears from the CF dashboard within ~1 minute. The Docker container (`restart: unless-stopped`) is the correct long-term cloudflared on monk.
+
+**Prevention:** After containerizing cloudflared, always check for and disable any pre-existing native systemd cloudflared service. Add this check to any new host setup.
+
+---
+
 ## 8. Random 502 Errors on New Subdomains (e.g., ntfy)
 **Symptom:** Accessing a newly created subdomain (like `ntfy.kitestacks.com`) randomly returns a 502 Bad Gateway error from Cloudflare, even though it works internally.
 **Root Cause:** The KiteStacks architecture uses a single Cloudflare Tunnel with multiple connectors (`monk` and `kscloud1`) for active-active high availability. Cloudflare load balances traffic across all active connectors blindly. If you deploy a new service (like `ntfy`) only on `monk`, any request that Cloudflare sends to the `kscloud1` connector will fail with a 502 because the container doesn't exist on that node.