diff --git a/docs/disaster-recovery/RUNBOOK.md b/docs/disaster-recovery/RUNBOOK.md
index 83b7372..3672668 100644
--- a/docs/disaster-recovery/RUNBOOK.md
+++ b/docs/disaster-recovery/RUNBOOK.md
@@ -2,23 +2,33 @@
 
 ## Purpose
 
-This document describes how to restore the entire KiteStacks platform if the primary server (Assassin) fails.
+This document describes how to restore the entire KiteStacks platform if a
+host fails. As of 2026-06-10, KiteStacks runs active-active across two hosts
+plus Cloudflare Tunnel, so no single host is a hard dependency for the site
+to stay up.
 
 ## Current Infrastructure
 
 Primary Production:
-- Host: Assassin
-- IP: 192.168.1.205
+- Host: monk
+- LAN IP: 192.168.1.205
 
-Cloud Backup:
-- Host: kscloud1
+Cloud Failover (PERMANENT, active-active - NOT cold standby):
+- Host: kscloud1 (Hetzner VPS)
 - Public IP: 5.78.233.28
+- Tailscale IP: 100.123.254.52
+- Runs a full replica of all 9 services
+
+assassin (T14): retired/OFF, no longer part of the topology.
 
 Domains:
-- www.kitestacks.com
-- gitforge.kitestacks.com
-- www-backup.kitestacks.com
-- git-backup.kitestacks.com
+- www.kitestacks.com (+ ai, auth, gitforge, grafana, kavita, links, status, tasks)
+- www-backup.kitestacks.com / git-backup.kitestacks.com (kscloud1 direct
+  A-records via local Caddy on port 80, separate from the Tunnel)
+
+Cloudflare Tunnel:
+- 3 connectors load-balance ACTIVE-ACTIVE across all 9 *.kitestacks.com
+  hostnames - no primary/backup priority.
 
 ## Recovery Priority
 
@@ -31,20 +41,35 @@ Domains:
 
 ## Current Backup Status
 
-Website Backup:
-- Operational on kscloud1
+Website:
+- Full replica running on kscloud1, served live via the Tunnel.
 
-Forgejo Backup:
-- Operational on kscloud1
+Forgejo:
+- Full replica running on kscloud1, but with a SEPARATE database - repos and
+  commits pushed to monk's Forgejo do NOT appear on kscloud1's Forgejo (and
+  vice versa). Accepted tradeoff for uptime.
+- The portal's Recent Activity widget on BOTH hosts queries monk's Forgejo
+  directly (FORGEJO_API_BASE -> http://100.85.209.116:3006 over Tailscale
+  from kscloud1, http://localhost:3006 on monk) so it stays consistent
+  regardless of which connector serves the page.
 
-Git Repository:
-- Synced to both Forgejo instances
+Authentik:
+- Shared Postgres+Redis hosted on kscloud1, reachable only over Tailscale
+  (100.123.254.52). Both monk's and kscloud1's authentik+worker use this
+  single database/cache - fixes invalid_grant SSO caused by active-active
+  routing splitting an OAuth flow across connectors.
+
+Other stateful apps (kavita, karakeep, openproject, etc.):
+- Fresh/separate data on kscloud1 - may show different/stale data depending
+  on which connector serves a request. Accepted as the cost of guaranteed
+  uptime.
 
 ## Validation Checklist
 
-- [ ] Website accessible
-- [ ] Backup website accessible
-- [ ] Forgejo operational
-- [ ] Backup Forgejo operational
+- [ ] Website accessible (www.kitestacks.com)
+- [ ] kscloud1 replica accessible (www-backup.kitestacks.com)
+- [ ] Forgejo operational (gitforge.kitestacks.com)
+- [ ] kscloud1 Forgejo replica operational (git-backup.kitestacks.com)
+- [ ] Authentik SSO works (auth.kitestacks.com)
 - [ ] Cloudflare DNS verified
-- [ ] Cloudflare Tunnel verified
+- [ ] Cloudflare Tunnel: all 3 connectors healthy
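
The five hostname items in the Validation Checklist above could be probed with a small script. The following is a minimal sketch, not part of the runbook: the hostnames come from the checklist, but the function names, the URL schemes (plain `http` for the two `-backup` hosts, since the diff says Caddy serves them on port 80, and `https` for the rest) and the rule that any 2xx/3xx response counts as "up" are all assumptions. It does not cover the Cloudflare DNS or Tunnel connector items, and a 200 from `auth.kitestacks.com` only shows the page loads, not that an SSO flow completes.

```python
from typing import Callable, Dict, List
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Hostnames from the Validation Checklist. The scheme per host is an
# assumption: http for the kscloud1 direct A-records (Caddy on port 80),
# https for hostnames served through the Tunnel.
CHECKLIST_HOSTS: Dict[str, str] = {
    "www.kitestacks.com": "https",
    "www-backup.kitestacks.com": "http",
    "gitforge.kitestacks.com": "https",
    "git-backup.kitestacks.com": "http",
    "auth.kitestacks.com": "https",
}


def checklist_urls() -> List[str]:
    """Build one URL per checklist hostname."""
    return [f"{scheme}://{host}" for host, scheme in CHECKLIST_HOSTS.items()]


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code
    except (URLError, OSError):
        # DNS failure, refused connection, timeout, TLS error, ...
        return 0


def run_checklist(fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each checklist URL to True when its status is 2xx or 3xx."""
    return {url: 200 <= fetch(url) < 400 for url in checklist_urls()}


if __name__ == "__main__":
    for url, ok in run_checklist().items():
        print(f"[{'x' if ok else ' '}] {url}")
```

Because `fetch` is injectable, the pass/fail logic can be exercised without network access. Under active-active routing a single request only exercises whichever connector happens to serve it, so one passing run does not prove both hosts are healthy.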