KiteStacks Disaster Recovery Runbook
Purpose
This document describes how to restore the KiteStacks platform if a host fails. As of 2026-06-10, KiteStacks runs active-active across two hosts (monk and kscloud1) behind a Cloudflare Tunnel, so the site does not depend on any single host to stay up.
Current Infrastructure
Primary Production:
- Host: monk
- LAN IP: 192.168.1.205
- Tailscale IP: 100.85.209.116
Cloud Failover (PERMANENT, active-active - NOT cold standby):
- Host: kscloud1 (Hetzner VPS)
- Public IP: 5.78.233.28
- Tailscale IP: 100.123.254.52
- Runs a full replica of all 9 services
assassin (T14): retired/OFF, no longer part of the topology.
Domains:
- www.kitestacks.com (+ ai, auth, gitforge, grafana, kavita, links, status, tasks)
- www-backup.kitestacks.com / git-backup.kitestacks.com: direct A-records to kscloud1, served by a local Caddy on port 80 and bypassing the Tunnel
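Because the two backup hostnames bypass the Tunnel, their A-records must keep pointing at kscloud1's public IP. The following is a minimal sketch that checks this from any machine. It assumes the records are DNS-only (not proxied by Cloudflare); if proxying were enabled, the resolver would return Cloudflare edge addresses and the check would report FAIL. The helper names are hypothetical.

```python
import socket

# From the Domains section: these hostnames use direct A-records to kscloud1.
# ASSUMPTION: the records are DNS-only, so they resolve to kscloud1 itself.
KSCLOUD1_PUBLIC_IP = "5.78.233.28"
DIRECT_HOSTS = ["www-backup.kitestacks.com", "git-backup.kitestacks.com"]


def resolve_ipv4(hostname: str) -> set[str]:
    """Return every IPv4 address the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, 80, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return {info[4][0] for info in infos}


def check_direct_records(resolver=resolve_ipv4) -> dict[str, bool]:
    """Map each direct hostname to True if it resolves only to kscloud1."""
    results = {}
    for host in DIRECT_HOSTS:
        try:
            addrs = resolver(host)
        except socket.gaierror:
            addrs = set()  # NXDOMAIN or resolver failure counts as a miss
        results[host] = addrs == {KSCLOUD1_PUBLIC_IP}
    return results


if __name__ == "__main__":
    for host, ok in check_direct_records().items():
        print(f"{'OK  ' if ok else 'FAIL'} {host}")
```

The resolver is passed in as a parameter so the comparison logic can be exercised without network access.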
Cloudflare Tunnel:
- 3 connectors load-balance ACTIVE-ACTIVE across all 9 *.kitestacks.com hostnames - no primary/backup priority.
Recovery Priority
Restore services in this order, highest priority first:
1. Forgejo
2. Website
3. Authentik
4. Monitoring
5. AI Services
6. Knowledge Services
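When several services are down at once, the priority list can be applied mechanically. This is a minimal sketch: the service names are copied from the list, and `restore_order` is a hypothetical helper, not part of any existing KiteStacks tooling.

```python
# Recovery order copied from the Recovery Priority list (highest first).
RECOVERY_PRIORITY = [
    "Forgejo",
    "Website",
    "Authentik",
    "Monitoring",
    "AI Services",
    "Knowledge Services",
]


def restore_order(failed: set[str]) -> list[str]:
    """Return the failed services sorted by recovery priority.

    Services missing from RECOVERY_PRIORITY go last, in alphabetical order.
    """
    rank = {name: i for i, name in enumerate(RECOVERY_PRIORITY)}
    return sorted(failed, key=lambda s: (rank.get(s, len(rank)), s))


if __name__ == "__main__":
    print(restore_order({"Monitoring", "Forgejo", "Authentik"}))
```

For the example input this prints `['Forgejo', 'Authentik', 'Monitoring']`.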
Current Backup Status
Website:
- Full replica running on kscloud1, served live via the Tunnel.
Forgejo:
- Full replica running on kscloud1, but with a SEPARATE database: repos and commits pushed to monk's Forgejo do NOT appear on kscloud1's Forgejo, and vice versa. This is an accepted tradeoff for uptime, but it means kscloud1's Forgejo is not a backup of monk's repositories.
- The portal's Recent Activity widget on BOTH hosts queries monk's Forgejo directly, so it shows the same data regardless of which connector serves the page. FORGEJO_API_BASE is set to http://localhost:3006 on monk and to http://100.85.209.116:3006 (monk over Tailscale) on kscloud1. As a result, the widget depends on monk being up.
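Since the two Forgejo databases diverge, it is worth knowing which repositories exist on only one host before relying on either during recovery. Below is a minimal sketch using the Forgejo (Gitea-compatible) `GET /api/v1/repos/search` endpoint. It assumes anonymous access, so only public repositories are compared; private repositories would need an API token, which this sketch does not handle. It also assumes kscloud1's Forgejo answers on plain HTTP at git-backup.kitestacks.com. It compares repository names only, not commits.

```python
import json
import urllib.request

# monk's Forgejo over Tailscale (the FORGEJO_API_BASE value used on kscloud1).
MONK_FORGEJO = "http://100.85.209.116:3006"
# kscloud1's Forgejo via its direct A-record (ASSUMED plain HTTP on port 80).
KSCLOUD1_FORGEJO = "http://git-backup.kitestacks.com"


def list_repos(base_url: str, page_size: int = 50) -> set[str]:
    """Return the full_name of every repo visible to an anonymous caller."""
    names: set[str] = set()
    page = 1
    while True:
        url = f"{base_url}/api/v1/repos/search?limit={page_size}&page={page}"
        with urllib.request.urlopen(url, timeout=10) as resp:
            batch = json.load(resp).get("data", [])
        if not batch:  # an empty page means we have seen everything
            return names
        names.update(repo["full_name"] for repo in batch)
        page += 1


def diff_repos(monk: set[str], kscloud1: set[str]) -> dict[str, list[str]]:
    """Report repos that exist on only one of the two Forgejo instances."""
    return {
        "only_on_monk": sorted(monk - kscloud1),
        "only_on_kscloud1": sorted(kscloud1 - monk),
    }


if __name__ == "__main__":
    report = diff_repos(list_repos(MONK_FORGEJO), list_repos(KSCLOUD1_FORGEJO))
    print(json.dumps(report, indent=2))
```

Paging continues until an empty page rather than stopping at the first short page, because the server may cap `limit` below the requested `page_size`.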
Authentik:
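- Shared Postgres and Redis hosted on kscloud1, reachable only over Tailscale (100.123.254.52). Both monk's and kscloud1's authentik server and worker use this single database and cache. This fixes the invalid_grant SSO errors caused by active-active routing splitting one OAuth flow across connectors. It also makes kscloud1 a single point of failure for SSO: if kscloud1 is down, Authentik on monk loses its database and cache as well.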
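Both hosts must be able to reach the shared backends over Tailscale, so a quick reachability probe run on monk and on kscloud1 narrows down SSO failures. This is a minimal sketch that assumes Postgres and Redis listen on their default ports (5432 and 6379); adjust `BACKENDS` if the deployment differs. A successful TCP connect only shows the port is open, not that Authentik's credentials work.

```python
import socket

# Shared Authentik backends on kscloud1, reachable only over Tailscale.
# ASSUMPTION: default ports for Postgres (5432) and Redis (6379).
KSCLOUD1_TAILSCALE_IP = "100.123.254.52"
BACKENDS = {"postgres": 5432, "redis": 6379}


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def check_authentik_backends(host: str = KSCLOUD1_TAILSCALE_IP) -> dict[str, bool]:
    """Probe each shared backend; run this on both monk and kscloud1."""
    return {name: tcp_reachable(host, port) for name, port in BACKENDS.items()}


if __name__ == "__main__":
    for name, ok in check_authentik_backends().items():
        print(f"{'OK  ' if ok else 'FAIL'} {name}")
```

If the probe fails only on monk, suspect Tailscale connectivity between the hosts; if it fails on both, suspect the backends on kscloud1 themselves.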
Other stateful apps (kavita, karakeep, openproject, etc.):
- kscloud1 holds fresh, separate data for these apps, so they may show different or stale data depending on which connector serves a request. This is accepted as the cost of guaranteed uptime.
Validation Checklist
- Website accessible (www.kitestacks.com)
- kscloud1 replica accessible (www-backup.kitestacks.com)
- Forgejo operational (gitforge.kitestacks.com)
- kscloud1 Forgejo replica operational (git-backup.kitestacks.com)
- Authentik SSO works (auth.kitestacks.com)
- Cloudflare DNS records verified (www-backup and git-backup A-records point to kscloud1 at 5.78.233.28)
- Cloudflare Tunnel: all 3 connectors healthy
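The HTTP items on this checklist can be scripted. This is a minimal sketch: it treats any final status below 400 (after following redirects) as "accessible", which is an assumed threshold, and it uses plain HTTP for the two backup hostnames because they are served by Caddy on port 80. A 200 from auth.kitestacks.com only shows the login page loads, not that a full SSO round trip works.

```python
import urllib.error
import urllib.request
from typing import Optional

# URLs from the Validation Checklist. The backup hosts use HTTP (Caddy on :80).
CHECKS = {
    "Website": "https://www.kitestacks.com",
    "kscloud1 replica": "http://www-backup.kitestacks.com",
    "Forgejo": "https://gitforge.kitestacks.com",
    "kscloud1 Forgejo replica": "http://git-backup.kitestacks.com",
    "Authentik": "https://auth.kitestacks.com",
}


def http_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the final HTTP status for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:  # server answered with 4xx/5xx
        return err.code
    except OSError:  # URLError, DNS failure, refused connection, timeout
        return None


def is_healthy(status: Optional[int]) -> bool:
    """ASSUMED rule: any status below 400 is healthy; None is unreachable."""
    return status is not None and status < 400


def run_checks(fetch=http_status) -> dict[str, bool]:
    """Run every checklist URL through fetch and classify the result."""
    return {name: is_healthy(fetch(url)) for name, url in CHECKS.items()}


if __name__ == "__main__":
    results = run_checks()
    for name, ok in results.items():
        print(f"{'OK  ' if ok else 'FAIL'} {name}")
    raise SystemExit(0 if all(results.values()) else 1)
```

The non-zero exit code on any failure makes the script usable from cron or a CI job. The DNS and Tunnel connector items still need to be checked separately, for example in the Cloudflare dashboard.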