KiteStacks Disaster Recovery Runbook

Purpose

This document describes how to restore the entire KiteStacks platform if a host fails. As of 2026-06-10, KiteStacks runs active-active across two hosts plus Cloudflare Tunnel, so no single host is a hard dependency for the site to stay up.

Current Infrastructure

Primary Production:

  • Host: monk
  • LAN IP: <MONK_LAN_IP>

Cloud Failover (PERMANENT, active-active - NOT cold standby):

  • Host: kscloud1 (Hetzner VPS)
  • Public IP: <KSCLOUD1_PUBLIC_IP>
  • Tailscale IP: <KSCLOUD1_TAILSCALE_IP>
  • Runs a full replica of all 9 services

T14s: Active cluster node (GitOps).

Domains:

  • www.kitestacks.com (+ ai, auth, gitforge, grafana, kavita, links, status, tasks)
  • www-backup.kitestacks.com / git-backup.kitestacks.com (kscloud1 direct A-records via local Caddy on port , separate from the Tunnel)

Cloudflare Tunnel:

  • 3 connectors load-balance ACTIVE-ACTIVE across all 9 *.kitestacks.com hostnames - no primary/backup priority.

Recovery Priority

  1. Forgejo
  2. Website
  3. Authentik
  4. Monitoring
  5. AI Services
  6. Knowledge Services
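
The priority list above can be applied mechanically during an incident. The following is a minimal sketch, assuming the six tier names from the list are used as identifiers; the function name `restore_order` and the idea of passing in a set of failed tiers are illustrative and not part of any existing KiteStacks tooling.

```python
# Recovery tiers in the order given by this runbook (first entry = restore first).
RECOVERY_PRIORITY = [
    "Forgejo",
    "Website",
    "Authentik",
    "Monitoring",
    "AI Services",
    "Knowledge Services",
]


def restore_order(failed):
    """Return the failed tiers sorted by runbook priority.

    Unknown names are rejected so a typo cannot silently drop a tier.
    """
    unknown = set(failed) - set(RECOVERY_PRIORITY)
    if unknown:
        raise ValueError(f"not in the priority list: {sorted(unknown)}")
    return [tier for tier in RECOVERY_PRIORITY if tier in failed]


if __name__ == "__main__":
    # Hypothetical incident: three tiers are down at once.
    print(restore_order({"Monitoring", "Forgejo", "Authentik"}))
    # prints ['Forgejo', 'Authentik', 'Monitoring']
```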

Current Backup Status

Website:

  • Full replica running on kscloud1, served live via the Tunnel.

Forgejo:

  • Full replica running on kscloud1, but with a SEPARATE database - repos and commits pushed to monk's Forgejo do NOT appear on kscloud1's Forgejo (and vice versa). Accepted tradeoff for uptime.
  • The portal's Recent Activity widget on BOTH hosts queries monk's Forgejo directly (FORGEJO_API_BASE -> http://<MONK_TAILSCALE_IP>: over Tailscale from kscloud1, http://localhost: on monk) so it stays consistent regardless of which connector serves the page.
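
The per-host FORGEJO_API_BASE rule described above could be expressed as a small helper. This is only a sketch: the function name `forgejo_api_base` is hypothetical, and the Forgejo port is not stated in this runbook, so it is left as a required argument rather than guessed.

```python
def forgejo_api_base(hostname, monk_tailscale_ip, forgejo_port):
    """Return the FORGEJO_API_BASE value for the portal running on `hostname`.

    Every host points at monk's Forgejo: monk reaches it via localhost,
    any other host reaches it via monk's Tailscale address.
    `forgejo_port` must be supplied by the caller (not given in the runbook).
    """
    if hostname == "monk":
        return f"http://localhost:{forgejo_port}"
    return f"http://{monk_tailscale_ip}:{forgejo_port}"
```

Because both branches resolve to the same Forgejo instance, the Recent Activity widget shows the same data no matter which connector serves the page.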

Authentik:

  • Shared Postgres+Redis hosted on kscloud1, reachable only over Tailscale (<KSCLOUD1_TAILSCALE_IP>). Both monk's and kscloud1's authentik+worker use this single database/cache. This fixes the invalid_grant SSO errors that occurred when active-active routing split a single OAuth flow across connectors.
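
The invalid_grant failure mode described above can be reproduced with a toy model. The sketch below is a deliberately simplified stand-in, not Authentik's actual internals: `AuthBackend` and `run_flow` are hypothetical names, and the only behaviour modelled is that an authorization code can be redeemed once, and only against the store that issued it.

```python
import secrets


class AuthBackend:
    """Stand-in for the database/cache that holds issued authorization codes."""

    def __init__(self):
        self._codes = set()

    def issue_code(self):
        code = secrets.token_hex(8)
        self._codes.add(code)
        return code

    def redeem(self, code):
        # A code is single-use and only known to the store that issued it.
        if code in self._codes:
            self._codes.remove(code)
            return "ok"
        return "invalid_grant"


def run_flow(authorize_backend, token_backend):
    """The authorize step and the token exchange may land on different hosts."""
    code = authorize_backend.issue_code()
    return token_backend.redeem(code)


if __name__ == "__main__":
    # Separate state per host: the flow fails when the two steps are split.
    monk, kscloud1 = AuthBackend(), AuthBackend()
    print(run_flow(monk, kscloud1))  # prints invalid_grant

    # One shared store behind both hosts: any routing works.
    shared = AuthBackend()
    print(run_flow(shared, shared))  # prints ok
```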

Other stateful apps (kavita, karakeep, openproject, etc.):

  • Fresh, separate data on kscloud1, so users may see different or stale data depending on which connector serves a request. Accepted as the cost of guaranteed uptime.

Validation Checklist

  • Website accessible (www.kitestacks.com)
  • kscloud1 replica accessible (www-backup.kitestacks.com)
  • Forgejo operational (gitforge.kitestacks.com)
  • kscloud1 Forgejo replica operational (git-backup.kitestacks.com)
  • Authentik SSO works (auth.kitestacks.com)
  • Cloudflare DNS verified
  • Cloudflare Tunnel: all 3 connectors healthy
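
The HTTP-reachable items in the checklist above could be checked with a short script. This is a sketch under stated assumptions: it assumes every hostname answers over HTTPS at the root path and treats any 2xx or 3xx status as healthy. The names `CHECKS`, `http_status`, and `run_checks` are illustrative. It does not cover the Cloudflare DNS or Tunnel connector items, and a reachable login page does not by itself prove that SSO works end to end.

```python
import urllib.error
import urllib.request

# Hostnames taken from the Validation Checklist in this runbook.
CHECKS = {
    "Website": "www.kitestacks.com",
    "kscloud1 website replica": "www-backup.kitestacks.com",
    "Forgejo": "gitforge.kitestacks.com",
    "kscloud1 Forgejo replica": "git-backup.kitestacks.com",
    "Authentik SSO": "auth.kitestacks.com",
}


def http_status(url, timeout=10):
    """Return the HTTP status code for `url`, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, TLS error


def run_checks(fetch=http_status):
    """Map each checklist label to True (status 2xx/3xx) or False."""
    results = {}
    for label, host in CHECKS.items():
        status = fetch(f"https://{host}/")
        results[label] = status is not None and 200 <= status < 400
    return results
```

Calling `run_checks()` with no arguments performs live requests; passing a stub as `fetch` allows a dry run without touching the network.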