# KiteStacks Disaster Recovery Runbook

## Purpose

This document describes how to restore the entire KiteStacks platform if a
host fails. As of 2026-06-10, KiteStacks runs active-active across two hosts
plus Cloudflare Tunnel, so no single host is a hard dependency for the site
to stay up.

## Current Infrastructure

Primary Production:
- Host: monk
- LAN IP: <MONK_LAN_IP>

Cloud Failover (PERMANENT, active-active - NOT cold standby):
- Host: kscloud1 (Hetzner VPS)
- Public IP: <KSCLOUD1_PUBLIC_IP>
- Tailscale IP: <KSCLOUD1_TAILSCALE_IP>
- Runs a full replica of all 9 services

assassin (T14): retired/OFF, no longer part of the topology.

Domains:
- www.kitestacks.com (+ ai, auth, gitforge, grafana, kavita, links, status, tasks)
- www-backup.kitestacks.com / git-backup.kitestacks.com (kscloud1 direct
  A-records via local Caddy on port <port>, separate from the Tunnel)

Cloudflare Tunnel:
- 3 connectors load-balance ACTIVE-ACTIVE across all 9 *.kitestacks.com
  hostnames - no primary/backup priority.

## Recovery Priority

1. Forgejo
2. Website
3. Authentik
4. Monitoring
5. AI Services
6. Knowledge Services
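
When several service groups are down at once, the list above fixes the order in which to restore them. A minimal sketch of that ordering follows; the `RECOVERY_PRIORITY` tuple and the `recovery_order` helper are hypothetical names used for illustration, not part of any existing KiteStacks tooling.

```python
# Recovery order copied from the numbered list in this runbook.
# The tuple and helper below are illustrative, not existing tooling.
RECOVERY_PRIORITY = (
    "Forgejo",
    "Website",
    "Authentik",
    "Monitoring",
    "AI Services",
    "Knowledge Services",
)


def recovery_order(failed):
    """Return the failed service groups sorted by recovery priority.

    Names that are not in RECOVERY_PRIORITY sort last and keep their
    original relative order (sorted() is stable).
    """
    rank = {name: i for i, name in enumerate(RECOVERY_PRIORITY)}
    return sorted(failed, key=lambda name: rank.get(name, len(rank)))


if __name__ == "__main__":
    # Prints: ['Forgejo', 'Authentik', 'Monitoring']
    print(recovery_order(["Monitoring", "Forgejo", "Authentik"]))
```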

## Current Backup Status

Website:
- Full replica running on kscloud1, served live via the Tunnel.

Forgejo:
- Full replica running on kscloud1, but with a SEPARATE database - repos and
  commits pushed to monk's Forgejo do NOT appear on kscloud1's Forgejo (and
  vice versa). Accepted tradeoff for uptime.
- The portal's Recent Activity widget on BOTH hosts queries monk's Forgejo
  directly (FORGEJO_API_BASE -> http://<MONK_TAILSCALE_IP>:<port> over Tailscale
  from kscloud1, http://localhost:<port> on monk) so it stays consistent
  regardless of which connector serves the page.
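
The FORGEJO_API_BASE wiring described above could be resolved roughly as follows. This is a sketch under assumptions: only the variable name and the two placeholder URLs come from this runbook; the `forgejo_api_base` helper, the `DEFAULT_API_BASE` table, and the override-then-fallback behaviour are hypothetical.

```python
import os

# Placeholders are kept exactly as written in this runbook. Substitute the
# real values locally and never commit them.
DEFAULT_API_BASE = {
    "monk": "http://localhost:<port>",
    "kscloud1": "http://<MONK_TAILSCALE_IP>:<port>",
}


def forgejo_api_base(host, env=None):
    """Return the Forgejo base URL the portal on `host` should query.

    An explicit FORGEJO_API_BASE in the environment wins. Otherwise fall
    back to the per-host default, so both hosts read monk's Forgejo.
    """
    env = os.environ if env is None else env
    override = env.get("FORGEJO_API_BASE")
    if override:
        # Drop a trailing slash so callers can append "/api/..." safely.
        return override.rstrip("/")
    return DEFAULT_API_BASE[host]
```

Because both entries point at monk, the widget shows the same activity feed whichever connector serves the page, at the cost of the widget failing on kscloud1 if monk or the Tailscale link is down.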

Authentik:
- Shared Postgres+Redis hosted on kscloud1, reachable only over Tailscale
  (<KSCLOUD1_TAILSCALE_IP>). Both monk's and kscloud1's authentik+worker use this
  single database/cache. This fixes the invalid_grant SSO errors caused by
  active-active routing splitting an OAuth flow across connectors.
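
To illustrate why shared state matters here, the toy model below runs the authorize step on one node and the token exchange on another. It is a deliberately simplified sketch: `AuthNode` is a hypothetical stand-in, not authentik's real API, and it assumes the failure comes from the second node not knowing the authorization code issued by the first.

```python
import secrets


class AuthNode:
    """Toy stand-in for one authentik instance (not authentik's real API)."""

    def __init__(self, store):
        self.store = store  # maps authorization code -> user

    def authorize(self, user):
        """Issue a one-time authorization code and remember it."""
        code = secrets.token_urlsafe(8)
        self.store[code] = user
        return code

    def token(self, code):
        """Exchange a code for a token; unknown codes are rejected."""
        # Codes are single-use, so pop rather than get.
        user = self.store.pop(code, None)
        if user is None:
            return {"error": "invalid_grant"}
        return {"access_token": f"token-for-{user}"}


def split_flow(node_a, node_b, user="alice"):
    """Authorize on one node, then redeem the code on the other."""
    return node_b.token(node_a.authorize(user))


# Separate state per host: the second host has never seen the code.
separate = split_flow(AuthNode({}), AuthNode({}))

# Shared state: both hosts read and write the same store.
shared_store = {}
shared = split_flow(AuthNode(shared_store), AuthNode(shared_store))

print(separate)  # {'error': 'invalid_grant'}
print(shared)    # {'access_token': 'token-for-alice'}
```

The dictionary plays the role of the shared Postgres+Redis: once both nodes use the same store, it no longer matters which connector handles each step of the flow.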

Other stateful apps (kavita, karakeep, openproject, etc.):
- Fresh/separate data on kscloud1 - may show different/stale data depending
  on which connector serves a request. Accepted as the cost of guaranteed
  uptime.

## Validation Checklist

- [ ] Website accessible (www.kitestacks.com)
- [ ] kscloud1 replica accessible (www-backup.kitestacks.com)
- [ ] Forgejo operational (gitforge.kitestacks.com)
- [ ] kscloud1 Forgejo replica operational (git-backup.kitestacks.com)
- [ ] Authentik SSO works (auth.kitestacks.com)
- [ ] Cloudflare DNS verified
- [ ] Cloudflare Tunnel: all 3 connectors healthy
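
The hostname items in the checklist above can be probed with a short script. The sketch below is not an existing KiteStacks tool: the `CHECKS` table, the function names, and the `https://` scheme are assumptions, and the two backup hostnames may need the Caddy `<port>` appended. It only tests that each endpoint answers with a status below 400, so a passing Authentik row does not prove that an SSO login completes, and the Cloudflare DNS and connector items still need manual verification.

```python
import urllib.error
import urllib.request

# Checklist label -> URL. The https:// scheme is an assumption.
CHECKS = {
    "Website": "https://www.kitestacks.com",
    "kscloud1 replica": "https://www-backup.kitestacks.com",
    "Forgejo": "https://gitforge.kitestacks.com",
    "kscloud1 Forgejo replica": "https://git-backup.kitestacks.com",
    "Authentik": "https://auth.kitestacks.com",
}


def http_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, ...


def run_checks(checks=CHECKS, probe=http_status):
    """Map each label to True when the probe sees a status below 400."""
    results = {}
    for label, url in checks.items():
        status = probe(url)
        results[label] = status is not None and status < 400
    return results


def format_checklist(results):
    """Render results as Markdown task-list lines."""
    return [f"- [{'x' if ok else ' '}] {label}" for label, ok in results.items()]
```

Calling `print("\n".join(format_checklist(run_checks())))` probes the live hostnames. The `probe` parameter lets the logic be exercised without network access by passing a stub.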