# Architecture Decisions: The Why Behind Every Choice

For every technology choice, there was a reason. Understanding the "why" is what separates someone who copied commands from someone who designed a system.

**Last Updated:** 2026-06-19

---

## Why Docker Instead of Running Services Directly?

**Problem:** Running 15+ services directly on a Linux host creates dependency conflicts: different Python versions, conflicting library versions, services that break each other on updates.

**Options considered:**
- Bare metal: install each app directly on the OS
- Virtual machines: one VM per service
- Docker containers: isolated processes with their own dependencies

**Decision:** Docker

**Why:**
- Each container has its own filesystem and runtime, so they can't conflict
- Starting, stopping, or updating one service doesn't affect others
- The `docker-compose.yml` file IS the documentation: it shows exactly what the service needs
- Portability: move the same compose file to a new machine and it works identically
- `restart: unless-stopped` means containers self-heal after a crash or host reboot

**What to say in an interview:**
> *"I containerized every service using Docker Compose so each has isolated dependencies
> and the entire deployment is reproducible from a single YAML file."*

---

## Why Cloudflare Tunnel Instead of Port Forwarding?

**Problem:** How do you make home services accessible from the internet?

**Traditional approach:** Open ports 80 and 443 on the home router, configure NAT, point DNS to your home IP address.

**Problems with that:**
- Your home IP is public (DDoS risk, can be scanned and targeted)
- Dynamic home IP means DNS breaks every time the ISP changes it
- Some ISPs block residential ports 80 and 443
- Router configuration is fragile and varies by hardware

**Decision:** Cloudflare Tunnel (cloudflared)

**Why:**
- cloudflared makes an outbound connection to Cloudflare, so no inbound ports are needed at all
- Home IP is never exposed to the public internet
- Works on any ISP or network that allows outbound connections
- Cloudflare handles TLS certificates automatically (no Let's Encrypt setup)
- Free tier covers everything needed
- Built-in DDoS protection at Cloudflare's edge

**The tradeoff:** You depend on Cloudflare. If Cloudflare has an outage, your site goes down even if your hardware is fine. Acceptable: Cloudflare's uptime exceeds most home ISPs.

---

## Why Authentik for SSO?

**Problem:** Eleven services means eleven separate usernames and passwords. Adding a user means eleven admin panels. Removing access means eleven places to deactivate.

**Options:**
- No SSO: separate logins per service
- Authelia: simpler, primarily a forward-auth proxy
- Authentik: full OIDC provider, more complex to set up
- Keycloak: enterprise-grade, very heavy on RAM

**Decision:** Authentik

**Why:**
- One account controls access to everything
- Apps that support native OIDC (Grafana, Kavita, Karakeep, Open WebUI, Portainer, BookStack, Forgejo) get real SSO: the user is authenticated inside the app with a JWT, not just at a proxy
- Access policies per application (Portainer restricted to `homelab-admin` group only)
- Self-hosted: user data never leaves your infrastructure

**Why not Authelia:** Authelia is primarily a forward-auth proxy. In that mode it gates access at the proxy until the user is authenticated, but the app itself does not receive an OIDC token. Authentik sends a real JWT with user email and name, so apps can create user accounts automatically on first login.

---

## Why a Shared Postgres Instead of Separate Authentik Databases?

**Problem:** After deploying two Cloudflare Tunnel connectors, users got `invalid_grant` errors when signing in through SSO, roughly 50% of the time.

**Root cause:** OAuth2 authorization codes are short-lived rows in a database.

```
Step 1: /authorize → creates code   → stored in monk's Authentik DB
Step 2: /token     → looks for code → hits kscloud1's Authentik DB → NOT FOUND
```

Cloudflare load-balances every HTTP request independently. Steps 1 and 2 of the OAuth2 flow can hit completely different hosts. The code exists in one database but not the other.

**Options:**
- Sync both databases continuously (complex, slow, conflict-prone)
- Use sticky sessions (Cloudflare paid feature)
- Share one database between both Authentik instances

**Decision:** Single shared Postgres + Redis hosted on kscloud1, accessible only over Tailscale

**Why:**
- Both connectors' Authentik instances read and write the same database
- Authorization codes are always found regardless of which host handles which request
- Database is bound to kscloud1's Tailscale IP, so it is never reachable from the public internet
- Simple configuration change: one environment variable pointing to the shared host

**The tradeoff:** If kscloud1 or the Tailscale link goes down, monk's Authentik can't connect to the database and fails to start. Rollback: restore local Postgres in monk's compose file.

---

## Why Tailscale Instead of WireGuard or OpenVPN?

**Problem:** Need private networking between monk (home) and kscloud1 (Hetzner cloud). The shared Authentik database must not be exposed to the public internet.

**Options:**
- WireGuard: manual key exchange, manual routing, hard to configure through NAT
- OpenVPN: complex, slower, more overhead
- Tailscale: managed WireGuard, automatic key exchange, works behind NAT

**Decision:** Tailscale

**Why:**
- Works in minutes: install, authenticate, done
- Handles NAT traversal automatically (monk is behind home router NAT)
- Every device gets a stable `100.x.x.x` IP regardless of location
- Free for up to 100 devices
- WireGuard underneath: same encryption, much easier operation

**The tradeoff:** You trust Tailscale's coordination servers to manage device authentication. Actual data is encrypted peer-to-peer (Tailscale never sees it), but they control who can join your network. Self-hosted alternative if needed: Headscale.

---

## Why Active-Active Failover Instead of Active-Passive?

**The situation:** The user travels. When away from home, monk may be unreachable. kscloud1 must keep the site running.

**Active-Passive:** kscloud1 only starts serving if Cloudflare detects monk as down. Requires health checks, failover rules, and a delay before traffic switches.

**Active-Active:** Both monk and kscloud1 are always in the Cloudflare Tunnel rotation. Every request may hit either host at any time.

**Decision:** Active-Active

**Why:**
- No failover logic needed: both are always live
- Instant: if monk goes down, kscloud1 is already handling traffic
- Free: Cloudflare Tunnel active-active is included; health-check-based failover is paid

**The tradeoff:** Stateful apps with separate databases (Kavita, Karakeep) may show different data depending on which host answers. Explicitly accepted: the priority is uptime, not data consistency across hosts. Forgejo and Authentik share databases so they are consistent.

---

## Why a Custom Portal Instead of a Pre-Built Dashboard?

**Options:**
- Homepage (gethomepage): nice but limited customization
- Heimdall: similar limitations
- Custom static HTML/CSS/JS + nginx: full control, full ownership

**Decision:** Custom static site

**Why:**
- Complete visual control: the cyberpunk theme, layout, every card, every color
- Static files + nginx are extremely fast and reliable (no Node.js, no build step)
- nginx proxies the `/api/*` endpoints to the metrics API without CORS issues
- No dependency on external frameworks that can change or break

**The tradeoff:** More work to build and maintain. But you understand every line of it, and you can explain exactly why every piece is there.

---

## Why Python + FastAPI for the Metrics API?

**Problem:** The portal needs live system stats (CPU, RAM, network), weather, and Forgejo git activity. Static HTML can't provide these.

**Decision:** Python FastAPI with `psutil`

**Why:**
- `psutil` reads host system metrics in one line of Python
- FastAPI auto-generates API documentation and handles async requests well
- Python is readable: easy to understand and modify
- `async/await` means the API doesn't block while waiting for weather API responses

**Special requirements:**
- `network_mode: host`: the container shares the host network namespace so psutil sees real network interfaces, not the container's virtual interface
- `pid: host`: the container can read the host's `/proc` filesystem for accurate process stats

Without these flags, the API would report container-level stats instead of actual laptop stats.

---

## Why Forgejo Instead of GitHub or GitLab?

**Problem:** Need to store all homelab code, configs, and documentation in version control.

**Options:**
- GitHub: free, reliable, but your configs and docs are on someone else's server
- GitLab: self-hostable but heavy (4GB+ RAM for full install)
- Forgejo: lightweight GitHub-like self-hosted Git, fork of Gitea

**Decision:** Forgejo

**Why:**
- Self-hosted: configs and documentation stay on your infrastructure
- Very lightweight: uses less than 100MB RAM
- GitHub-compatible API: tools that work with GitHub also work with Forgejo
- Full UI with code review, issues, CI/CD (Forgejo Actions)
- Shows commit history and documentation to anyone you give access to

**The tradeoff:** You maintain it yourself. If Forgejo goes down, git operations fail. Mitigated by kscloud1 running a replica and the shared Postgres.

---

## Why OSTicket for the Help Desk?

**What it replaced:** OpenProject (project management tool on tasks.kitestacks.com)

**Why OpenProject was removed:**
- OpenProject CE (Community Edition) requires an Enterprise Edition license for SSO
- The SSO button simply does not appear in CE; it is a hard paywall with no workaround
- OpenProject is also resource-heavy for what it provides

**Why OSTicket:**
- Lightweight and runs well on the existing stack
- Email integration works (SMTP via Gmail app password, confirmed working)
- Handles the ticket/task tracking use case without the licensing barrier

---

## Why BookStack for the Wiki?

**Problem:** Need a place for long-form documentation that's more structured than markdown files.

**Decision:** BookStack

**Why:**
- Clean, organized UI: Shelves → Books → Chapters → Pages hierarchy
- WYSIWYG editor: easy to write docs without markdown syntax
- Authentik OIDC SSO works natively
- API available: docs can be pushed programmatically from scripts or CI

**Key gotcha:** Cache directory must be writable by the container user. `chown -R abc:users /config/www/framework/cache/` is required after first install.

---

## Why the Forgejo Shared Postgres?

**Problem:** With two connectors in active-active, Forgejo on monk and kscloud1 had separate SQLite databases. Repos created on one weren't visible on the other.

**Fix:** Migrated both Forgejo instances to a single shared PostgreSQL database on kscloud1 (same shared server as Authentik's Postgres). Both connectors now serve identical Forgejo data.

**How it was done:**
- `forgejo dump --database postgres`: exported clean SQL from monk's Forgejo
- Dropped the pgloader schema (had wrong structure), reloaded the clean SQL
- Both compose files point to `authentik-postgres:5432` database `forgejo`, user `forgejo`
- kscloud1's Forgejo joined the `authentik_default` Docker network to reach authentik-postgres
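
The visibility problem and its fix can be modeled in a few lines. The sketch below is a toy model, not Forgejo code: `Instance`, `separate_dbs`, `shared_db`, and `miss_rate` are hypothetical names, a Python `set` stands in for each database, and every (write host, read host) pairing is assumed to be equally likely, which approximates a load balancer that routes each request independently.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Instance:
    """Toy stand-in for one Forgejo instance; `db` models its database."""
    name: str
    db: set = field(default_factory=set)

    def create_repo(self, repo: str) -> None:
        self.db.add(repo)

    def has_repo(self, repo: str) -> bool:
        return repo in self.db


def separate_dbs() -> list:
    """Each host keeps its own store (the original SQLite-per-host setup)."""
    return [Instance("monk"), Instance("kscloud1")]


def shared_db() -> list:
    """Both hosts point at one store (the shared Postgres setup)."""
    db = set()
    return [Instance("monk", db), Instance("kscloud1", db)]


def miss_rate(instances: list) -> float:
    """Fraction of (write host, read host) pairings where a repo created
    through one host is not visible through the host that serves the read."""
    pairings = list(product(range(len(instances)), repeat=2))
    misses = 0
    for n, (writer, reader) in enumerate(pairings):
        repo = f"demo-repo-{n}"  # fresh name so earlier writes cannot mask a miss
        instances[writer].create_repo(repo)
        if not instances[reader].has_repo(repo):
            misses += 1
    return misses / len(pairings)


if __name__ == "__main__":
    print(miss_rate(separate_dbs()))  # 0.5
    print(miss_rate(shared_db()))     # 0.0
```

With two hosts and separate stores, half of the pairings cross hosts. Under the same assumption, that lines up with the roughly 50% `invalid_grant` rate seen with Authentik before its database was shared.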