Architecture Decisions — The Why Behind Every Choice

For every technology choice, there was a reason. Understanding the "why" is what separates someone who copied commands from someone who designed a system.

Last Updated: 2026-06-19


Why Docker Instead of Running Services Directly?

Problem: Running 15+ services directly on a Linux host creates dependency conflicts — different Python versions, conflicting library versions, services that break each other on updates.

Options considered:

  • Bare metal: install each app directly on the OS
  • Virtual machines: one VM per service
  • Docker containers: isolated processes with their own dependencies

Decision: Docker

Why:

  • Each container has its own filesystem and runtime, so one service's dependencies can't conflict with another's
  • Starting, stopping, or updating one service doesn't affect others
  • The docker-compose.yml file IS the documentation — it shows exactly what the service needs
  • Portability: move the same compose file to a new machine and it works identically
  • restart: unless-stopped means containers restart automatically after a crash or host reboot, unless you stopped them manually
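
Because the compose file doubles as documentation, it can also be checked like documentation. The sketch below is a minimal example, not part of the KiteStacks repo: it takes a compose file that has already been parsed into a Python dict (loading the YAML itself would need a parser such as PyYAML, which is an assumption here) and lists services that are missing the restart policy. The service names and images are hypothetical.

```python
def services_without_restart_policy(compose: dict) -> list[str]:
    """Return names of services that do not set `restart: unless-stopped`."""
    services = compose.get("services", {})
    return sorted(
        name
        for name, spec in services.items()
        if spec.get("restart") != "unless-stopped"
    )


# Hypothetical two-service stack, written as the dict a YAML parser would produce.
example = {
    "services": {
        "portal": {"image": "nginx:alpine", "restart": "unless-stopped"},
        "metrics-api": {"image": "example/metrics-api:latest"},
    }
}

print(services_without_restart_policy(example))  # ['metrics-api']
```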

What to say in an interview:

"I containerized every service using Docker Compose so each has isolated dependencies and the entire deployment is reproducible from a single YAML file."


Why Cloudflare Tunnel Instead of Port Forwarding?

Problem: How do you make home services accessible from the internet?

Traditional approach: Open ports 80 and 443 on the home router, configure NAT, point DNS to your home IP address.

Problems with that:

  • Your home IP is public (DDoS risk, can be scanned and targeted)
  • Dynamic home IP means DNS breaks every time the ISP changes it
  • Some ISPs block residential ports 80 and 443
  • Router configuration is fragile and varies by hardware

Decision: Cloudflare Tunnel (cloudflared)

Why:

  • cloudflared makes an outbound connection to Cloudflare — no inbound ports needed at all
  • Home IP is never exposed to the public internet
  • Works on nearly any ISP or network, as long as outbound connections to Cloudflare are allowed
  • Cloudflare handles TLS certificates automatically (no Let's Encrypt setup)
  • Free tier covers everything needed
  • Built-in DDoS protection at Cloudflare's edge

The tradeoff: You depend on Cloudflare. If Cloudflare has an outage, your site goes down even if your hardware is fine. Acceptable — Cloudflare's uptime exceeds most home ISPs.


Why Authentik for SSO?

Problem: Eleven services means eleven separate usernames and passwords. Adding a user means eleven admin panels. Removing access means eleven places to deactivate.

Options:

  • No SSO — separate logins per service
  • Authelia: simpler, built mainly around forward-auth at the reverse proxy
  • Authentik — full OIDC provider, more complex to set up
  • Keycloak — enterprise-grade, very heavy on RAM

Decision: Authentik

Why:

  • One account controls access to everything
  • Apps that support native OIDC (Grafana, Kavita, Karakeep, Open WebUI, Portainer, BookStack, Forgejo) get real SSO — user is authenticated inside the app with a JWT, not just at a proxy
  • Access policies per application (Portainer restricted to homelab-admin group only)
  • Self-hosted — user data never leaves your infrastructure

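Why not Authelia: Authelia is built mainly around forward-auth at the reverse proxy. In that mode it gates access before a request reaches the app, and the app learns who the user is only if it can read trusted proxy headers. Authelia also includes an OpenID Connect provider, so this is a difference in focus rather than a hard limit. Authentik is an identity provider first: it issues a signed JWT with the user's email and name, so apps can create user accounts automatically on first login.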
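
To make "a real JWT with user email and name" concrete, here is a minimal sketch of what an app sees inside an OIDC ID token. The token below is fabricated for illustration (made-up claims, placeholder signature), and the function only decodes the payload. It does not verify the signature, which any real application must do with the provider's published keys before trusting the claims.

```python
import base64
import json


def b64url_encode(data: bytes) -> str:
    """Base64url-encode without padding, as JWT segments are."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Restore the stripped padding, then decode."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def unverified_claims(token: str) -> dict:
    """Decode the payload segment only. This does NOT verify the signature."""
    _header, payload, _signature = token.split(".")
    return json.loads(b64url_decode(payload))


# Fabricated token: hypothetical claims and a placeholder instead of a signature.
header = b64url_encode(json.dumps({"alg": "RS256", "typ": "JWT"}).encode())
payload = b64url_encode(
    json.dumps(
        {"sub": "abc123", "email": "user@example.com", "name": "Example User"}
    ).encode()
)
sample_token = f"{header}.{payload}.placeholder-signature"

claims = unverified_claims(sample_token)
print(claims["email"], claims["name"])  # user@example.com Example User
```

With these claims in hand, an app can match or create a local account by email, which is what forward-auth alone does not give it.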


Why a Shared Postgres Instead of Separate Authentik Databases?

Problem: After deploying two Cloudflare Tunnel connectors, users got invalid_grant errors when signing in through SSO — roughly 50% of the time.

Root cause: OAuth2 authorization codes are short-lived rows in a database.

Step 1: /authorize → creates code → stored in monk's Authentik DB
Step 2: /token     → looks for code → hits kscloud1's Authentik DB → NOT FOUND

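Cloudflare routes each HTTP request independently, and with two connectors either host can receive it. Steps 1 and 2 of the OAuth2 flow can therefore land on different hosts. With separate databases, the code exists in one but not the other, which matches the roughly 50% failure rate.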
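
The failure can be reproduced in a few lines. This is a toy simulation, not Authentik's code: each host stores authorization codes in a dict standing in for its database, and every request is routed to a uniformly random host (an assumption, since Cloudflare's real routing is not a coin flip). With separate stores about half the logins fail. With one shared store none do.

```python
import random
import secrets


class AuthHost:
    """Toy stand-in for one Authentik instance and its code table."""

    def __init__(self, name: str, code_store: dict):
        self.name = name
        self.codes = code_store

    def authorize(self, user: str) -> str:
        code = secrets.token_urlsafe(16)
        self.codes[code] = user
        return code

    def token(self, code: str) -> str:
        # Authorization codes are single-use, so a lookup also removes the code.
        user = self.codes.pop(code, None)
        return "access_token" if user is not None else "invalid_grant"


def failure_rate(hosts: list, logins: int = 10_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    failures = 0
    for i in range(logins):
        code = rng.choice(hosts).authorize(f"user{i}")  # step 1: /authorize
        result = rng.choice(hosts).token(code)          # step 2: /token
        failures += result == "invalid_grant"
    return failures / logins


# Separate databases: each host has its own store.
separate = [AuthHost("monk", {}), AuthHost("kscloud1", {})]

# Shared database: both hosts read and write the same store.
shared_store = {}
shared = [AuthHost("monk", shared_store), AuthHost("kscloud1", shared_store)]

rate_separate = failure_rate(separate)
rate_shared = failure_rate(shared)
print(f"separate: {rate_separate:.0%} failed, shared: {rate_shared:.0%} failed")
```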

Options:

  • Sync both databases continuously (complex, slow, conflict-prone)
  • Use sticky sessions (Cloudflare paid feature)
  • Share one database between both Authentik instances

Decision: Single shared Postgres + Redis hosted on kscloud1, accessible only over Tailscale

Why:

  • Both connectors' Authentik instances read and write the same database
  • Authorization codes are always found regardless of which host handles which request
  • Database is bound to kscloud1's Tailscale IP — never reachable from the public internet
  • Simple configuration change: one environment variable pointing to the shared host

The tradeoff: If either kscloud1 or the Tailscale link goes down, monk's Authentik can't connect to the database and fails to start. Rollback: restore local Postgres in monk's compose file.
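
Since monk now depends on a remote database, a quick reachability check is useful before starting or debugging Authentik. The helper below is a minimal sketch using only a TCP connect. The address in the usage comment is a hypothetical Tailscale IP, and a successful connect only shows that the port is open, not that Postgres accepts the credentials.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Example usage (hypothetical Tailscale address of kscloud1):
# can_reach("100.64.0.10", 5432)   # Postgres
# can_reach("100.64.0.10", 6379)   # Redis
```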


Why Tailscale Instead of WireGuard or OpenVPN?

Problem: Need private networking between monk (home) and kscloud1 (Hetzner cloud). The shared Authentik database must not be exposed to the public internet.

Options:

  • WireGuard: manual key exchange, manual routing, hard to configure through NAT
  • OpenVPN: complex, slower, more overhead
  • Tailscale: managed WireGuard, automatic key exchange, works behind NAT

Decision: Tailscale

Why:

  • Works in minutes: install, authenticate, done
  • Handles NAT traversal automatically — monk is behind home router NAT
  • Every device gets a stable address in Tailscale's 100.64.0.0/10 range (100.x.y.z) regardless of location
  • Free for up to 100 devices
  • WireGuard underneath — same encryption, much easier operation
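
Tailscale assigns its addresses from the 100.64.0.0/10 carrier-grade NAT block, so "100.x.x.x" really means 100.64.0.0 through 100.127.255.255. A small check like the one below (a sketch, with made-up example addresses) can confirm that a service is configured with a Tailscale address rather than a LAN or public one.

```python
import ipaddress

# The CGNAT block Tailscale assigns device addresses from.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_ip(address: str) -> bool:
    """Return True if the address falls inside Tailscale's IPv4 range."""
    return ipaddress.ip_address(address) in TAILSCALE_RANGE


print(is_tailscale_ip("100.101.102.103"))  # True
print(is_tailscale_ip("192.168.1.20"))     # False
print(is_tailscale_ip("100.200.0.1"))      # False: 100.x but outside the /10
```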

The tradeoff: You trust Tailscale's coordination servers to manage device authentication. Actual data is encrypted peer-to-peer (Tailscale never sees it), but they control who can join your network. Self-hosted alternative if needed: Headscale.


Why Active-Active Failover Instead of Active-Passive?

The situation: The user travels. When away from home, monk may be unreachable. kscloud1 must keep the site running.

Active-Passive: kscloud1 only starts serving if Cloudflare detects monk as down. Requires health checks, failover rules, and a delay before traffic switches.

Active-Active: Both monk and kscloud1 are always in the Cloudflare Tunnel rotation. Every request may hit either host at any time.

Decision: Active-Active

Why:

  • No failover logic needed — both are always live
  • Instant: if monk goes down, kscloud1 is already handling traffic
  • Free: Cloudflare Tunnel active-active is included; health-check-based failover is paid

The tradeoff: Stateful apps with separate databases (Kavita, Karakeep) may show different data depending on which host answers. Explicitly accepted — the priority is uptime, not data consistency across hosts. Forgejo and Authentik share databases so they are consistent.


Why a Custom Portal Instead of a Pre-Built Dashboard?

Options:

  • Homepage (gethomepage) — nice but limited customization
  • Heimdall — similar limitations
  • Custom static HTML/CSS/JS + nginx — full control, full ownership

Decision: Custom static site

Why:

  • Complete visual control — the cyberpunk theme, layout, every card, every color
  • Static files + nginx are extremely fast and reliable (no Node.js, no build step)
  • nginx proxies the /api/* endpoints to the metrics API without CORS issues
  • No dependency on external frameworks that can change or break

The tradeoff: More work to build and maintain. But you understand every line of it, and you can explain exactly why every piece is there.


Why Python + FastAPI for the Metrics API?

Problem: The portal needs live system stats (CPU, RAM, network), weather, and Forgejo git activity. Static HTML can't provide these.

Decision: Python FastAPI with psutil

Why:

  • psutil reads host system metrics in one line of Python
  • FastAPI auto-generates API documentation and handles async requests well
  • Python is readable — easy to understand and modify
  • async/await means the API doesn't block while waiting for weather API responses
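
The non-blocking point can be shown without FastAPI at all. In this sketch two handlers run on one event loop: a slow one that stands in for the weather call and a fast one that stands in for a stats read. The delays and handler names are invented for illustration, and asyncio.sleep replaces the real HTTP request. The fast handler finishes first even though the slow one started first.

```python
import asyncio


async def handle(name: str, delay: float, finished: list) -> None:
    # asyncio.sleep stands in for awaiting an upstream response.
    await asyncio.sleep(delay)
    finished.append(name)


async def main() -> list:
    finished = []
    await asyncio.gather(
        handle("weather", 0.3, finished),  # slow upstream call, started first
        handle("stats", 0.01, finished),   # quick local read, started second
    )
    return finished


order = asyncio.run(main())
print(order)  # ['stats', 'weather']
```

A blocking call (for example time.sleep, or a synchronous HTTP client) inside an async handler would stall the event loop and lose this benefit.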

Special requirements:

  • network_mode: host — container shares host network namespace so psutil sees real network interfaces, not the container's virtual interface
  • pid: host — container can read the host's /proc filesystem for accurate process stats

Without these settings, psutil would report the container's own network interfaces and process list instead of the laptop's.
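
To see why the network namespace matters: on Linux, per-interface traffic counters come from /proc/net/dev, and that file shows only the interfaces of the namespace the process runs in. The parser below is a rough sketch of reading that format by hand (psutil does the equivalent for you). The sample text is made up, not captured from monk.

```python
def parse_net_dev(text: str) -> dict[str, dict[str, int]]:
    """Parse /proc/net/dev content into per-interface byte counters."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        iface, data = line.split(":", 1)
        fields = data.split()
        counters[iface.strip()] = {
            "bytes_recv": int(fields[0]),  # first receive column
            "bytes_sent": int(fields[8]),  # first transmit column
        }
    return counters


# Illustrative sample only. Inside a bridged container this file would list
# the container's virtual eth0; with network_mode: host it lists the host's NICs.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:  500000     400    0    0    0     0          0         0   250000     300    0    0    0     0       0          0
"""

stats = parse_net_dev(SAMPLE)
print(stats["eth0"])  # {'bytes_recv': 500000, 'bytes_sent': 250000}
```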


Why Forgejo Instead of GitHub or GitLab?

Problem: Need to store all homelab code, configs, and documentation in version control.

Options:

  • GitHub: free, reliable, but your configs and docs are on someone else's server
  • GitLab: self-hostable but heavy (4GB+ RAM for full install)
  • Forgejo: lightweight GitHub-like self-hosted Git, fork of Gitea

Decision: Forgejo

Why:

  • Self-hosted — configs and documentation stay on your infrastructure
  • Very lightweight — uses less than 100MB RAM
  • GitHub-like API and Actions syntax: familiar to anyone who has used GitHub, though not a drop-in replacement for every GitHub tool
  • Full UI with code review, issues, CI/CD (Forgejo Actions)
  • Shows commit history and documentation to anyone you give access to

The tradeoff: You maintain it yourself. If Forgejo goes down, git operations fail. Mitigated by kscloud1 running a replica and the shared Postgres.


Why OSTicket for the Help Desk?

What it replaced: OpenProject (project management tool on tasks.kitestacks.com)

Why OpenProject was removed:

  • OpenProject CE (Community Edition) requires an Enterprise Edition license for SSO
  • The SSO button simply does not appear in CE — it is a hard paywall with no workaround
  • OpenProject is also resource-heavy for what it provides

Why OSTicket:

  • Lightweight and runs well on the existing stack
  • Email integration works (SMTP via Gmail app password — confirmed working)
  • Handles the ticket/task tracking use case without the licensing barrier

Why BookStack for the Wiki?

Problem: Need a place for long-form documentation that's more structured than markdown files.

Decision: BookStack

Why:

  • Clean, organized UI: Shelves → Books → Chapters → Pages hierarchy
  • WYSIWYG editor — easy to write docs without markdown syntax
  • Authentik OIDC SSO works natively
  • API available — docs can be pushed programmatically from scripts or CI
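
As a rough idea of what pushing docs programmatically looks like, the sketch below builds a request against BookStack's REST API, which authenticates with an API token ID and secret in the Authorization header. The base URL, token values, and book ID are placeholders, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_page_request(
    base_url: str,
    token_id: str,
    token_secret: str,
    book_id: int,
    name: str,
    markdown: str,
) -> urllib.request.Request:
    """Build (but do not send) a request that creates a page in a book."""
    body = json.dumps(
        {"book_id": book_id, "name": name, "markdown": markdown}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/pages",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {token_id}:{token_secret}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values: substitute your own wiki URL, API token, and book ID.
req = build_page_request(
    "https://wiki.example.com", "TOKEN_ID", "TOKEN_SECRET",
    book_id=1, name="Architecture Decisions", markdown="# Architecture Decisions",
)
print(req.get_method(), req.full_url)  # POST https://wiki.example.com/api/pages

# To actually send it: urllib.request.urlopen(req)
```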

Key gotcha: Cache directory must be writable by the container user. chown -R abc:users /config/www/framework/cache/ is required after first install.


Why a Shared Postgres for Forgejo?

Problem: With two connectors in active-active, Forgejo on monk and kscloud1 had separate SQLite databases. Repos created on one weren't visible on the other.

Fix: Migrated both Forgejo instances to a single shared PostgreSQL database on kscloud1 (same shared server as Authentik's Postgres). Both connectors now serve identical Forgejo data.

How it was done:

  • forgejo dump --database postgres — exported clean SQL from monk's Forgejo
  • Dropped the schema left by an earlier pgloader attempt (it had the wrong structure), then reloaded the clean SQL
  • Both compose files point to authentik-postgres:5432, database forgejo, user forgejo
  • kscloud1's Forgejo joined the authentik_default Docker network to reach authentik-postgres