kitestacks-homelab/homelab-mastery/architecture/decisions.md

# Architecture Decisions — The Why Behind Every Choice

For every technology choice, there was a reason. Understanding the "why" is what separates
someone who copied commands from someone who designed a system.

**Last Updated:** 2026-06-19

---

## Why Docker Instead of Running Services Directly?

**Problem:** Running 15+ services directly on a Linux host creates dependency conflicts —
different Python versions, conflicting library versions, services that break each other on updates.

**Options considered:**
- Bare metal: install each app directly on the OS
- Virtual machines: one VM per service
- Docker containers: isolated processes with their own dependencies

**Decision:** Docker

**Why:**
- Each container has its own filesystem and runtime — they can't conflict
- Starting, stopping, or updating one service doesn't affect others
- The `docker-compose.yml` file IS the documentation — it shows exactly what the service needs
- Portability: the same compose file, together with its data volumes, brings the service up the same way on a new machine
- `restart: unless-stopped` means containers self-heal after a crash or host reboot

**What to say in an interview:**
> *"I containerized every service using Docker Compose so each has isolated dependencies
> and the entire deployment is reproducible from a single YAML file."*

---

## Why Cloudflare Tunnel Instead of Port Forwarding?

**Problem:** How do you make home services accessible from the internet?

**Traditional approach:** Open ports 80 and 443 on the home router, configure NAT,
point DNS to your home IP address.

**Problems with that:**
- Your home IP is public (DDoS risk, can be scanned and targeted)
- Dynamic home IP means DNS breaks every time the ISP changes it
- Some ISPs block residential ports 80 and 443
- Router configuration is fragile and varies by hardware

**Decision:** Cloudflare Tunnel (cloudflared)

**Why:**
- cloudflared makes an outbound connection to Cloudflare — no inbound ports needed at all
- Home IP is never exposed to the public internet
- Works behind CGNAT and restrictive ISPs, as long as outbound connections to Cloudflare are allowed
- Cloudflare handles TLS certificates automatically (no Let's Encrypt setup)
- Free tier covers everything needed
- Built-in DDoS protection at Cloudflare's edge

**The tradeoff:** You depend on Cloudflare. If Cloudflare has an outage, your site goes down
even if your hardware is fine. That is acceptable: Cloudflare's uptime exceeds that of most
home ISP connections.

---

## Why Authentik for SSO?

**Problem:** Eleven services means eleven separate usernames and passwords. Adding a user
means eleven admin panels. Removing access means eleven places to deactivate.

**Options:**
- No SSO: separate logins per service
- Authelia: simpler, built primarily around forward-auth
- Authentik: full OIDC provider, more complex to set up
- Keycloak: enterprise-grade, very heavy on RAM

**Decision:** Authentik

**Why:**
- One account controls access to everything
- Apps that support native OIDC (Grafana, Kavita, Karakeep, Open WebUI, Portainer, BookStack,
  Forgejo) get real SSO — user is authenticated inside the app with a JWT, not just at a proxy
- Access policies per application (Portainer restricted to `homelab-admin` group only)
- Self-hosted — user data never leaves your infrastructure

**Why not Authelia:** Authelia is built primarily around forward-auth. In that mode it gates
access at the reverse proxy, and the app only learns who the user is if it trusts identity
headers injected by the proxy. Authentik acts as a full OIDC provider and issues a signed JWT
with the user's email and name, so apps can create user accounts automatically on first login.
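
The difference is easiest to see in the token itself. Below is a minimal sketch, using only the standard library, of how the `email` and `name` claims sit inside a JWT payload. The token, issuer URL, and user details are made up for illustration, and the signature segment is a placeholder that is never checked.

```python
import base64
import json


def b64url_encode(data: bytes) -> str:
    # JWTs use URL-safe base64 with the trailing "=" padding removed.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    # Restore the padding that JWT encoding strips before decoding.
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def read_claims(token: str) -> dict:
    """Return the payload claims of a JWT WITHOUT verifying its signature."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    return json.loads(b64url_decode(payload_b64))


# Hypothetical token for illustration only.
header = {"alg": "RS256", "typ": "JWT"}
payload = {
    "iss": "https://auth.example.com/application/o/grafana/",  # assumed issuer URL
    "sub": "user-1234",
    "email": "alice@example.com",
    "name": "Alice Example",
}
token = ".".join([
    b64url_encode(json.dumps(header).encode()),
    b64url_encode(json.dumps(payload).encode()),
    "signature-placeholder",
])

claims = read_claims(token)
print(claims["email"], claims["name"])
```

A real application must verify the signature against the provider's published keys, normally through an OIDC library, before trusting any claim. The sketch skips that step on purpose to keep the payload structure visible.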

---

## Why a Shared Postgres Instead of Separate Authentik Databases?

**Problem:** After deploying two Cloudflare Tunnel connectors, users got `invalid_grant`
errors when signing in through SSO — roughly 50% of the time.

**Root cause:** OAuth2 authorization codes are short-lived rows in a database.

```
Step 1: /authorize → creates code → stored in monk's Authentik DB
Step 2: /token     → looks for code → hits kscloud1's Authentik DB → NOT FOUND
```

Cloudflare load-balances every HTTP request independently. Steps 1 and 2 of the OAuth2
flow can hit completely different hosts. The code exists in one database but not the other.
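
The roughly 50% failure rate falls out of simple counting. The sketch below is a toy model, not Authentik code: `CodeStore` and `failure_rate` are hypothetical names, and it assumes each of the two steps is equally likely to land on either host.

```python
from itertools import product

HOSTS = ["monk", "kscloud1"]


class CodeStore:
    """Stand-in for the table where an Authentik instance keeps authorization codes."""

    def __init__(self):
        self._codes = set()

    def save(self, code):
        self._codes.add(code)

    def redeem(self, code):
        # A code is single-use: redeeming removes it from the store.
        if code in self._codes:
            self._codes.remove(code)
            return True
        return False


def failure_rate(stores):
    """Try every (authorize host, token host) pairing and return the share that fails."""
    outcomes = []
    for n, (authorize_host, token_host) in enumerate(product(HOSTS, repeat=2)):
        code = f"code-{n}"
        stores[authorize_host].save(code)                 # Step 1: /authorize
        outcomes.append(stores[token_host].redeem(code))  # Step 2: /token
    return outcomes.count(False) / len(outcomes)


# Separate databases: each host only knows the codes it issued itself.
separate = {host: CodeStore() for host in HOSTS}

# Shared database: both hosts point at the same store.
shared_store = CodeStore()
shared = {host: shared_store for host in HOSTS}

print(failure_rate(separate))  # 0.5
print(failure_rate(shared))    # 0.0
```

Two of the four possible pairings send `/token` to a host that never saw the code, which matches the observed failure rate. With one shared store, every pairing succeeds.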

**Options:**
- Sync both databases continuously (complex, slow, conflict-prone)
- Use sticky sessions (Cloudflare paid feature)
- Share one database between both Authentik instances

**Decision:** Single shared Postgres + Redis hosted on kscloud1, accessible only over Tailscale

**Why:**
- Both connectors' Authentik instances read and write the same database
- Authorization codes are always found regardless of which host handles which request
- Database is bound to kscloud1's Tailscale IP — never reachable from the public internet
- Simple configuration change: one environment variable pointing to the shared host

**The tradeoff:** If kscloud1 or the Tailscale link goes down, monk's Authentik can't connect
to the database and fails to start. Rollback: restore local Postgres in monk's compose file.
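
One way to make that failure mode visible is a preflight check that runs before Authentik starts. The following is a minimal sketch, not something the current setup is known to use: `wait_for_port` is a hypothetical helper, and it only tests TCP reachability, not whether Postgres accepts the credentials.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Each attempt is capped at `interval` seconds so a dead link cannot hang the check.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers refused connections, unreachable hosts, and timeouts.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Called with kscloud1's Tailscale IP and port `5432`, a `False` result is the signal to fall back to the local Postgres rollback rather than let the container crash-loop.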

---

## Why Tailscale Instead of WireGuard or OpenVPN?

**Problem:** Need private networking between monk (home) and kscloud1 (Hetzner cloud).
The shared Authentik database must not be exposed to the public internet.

**Options:**
- WireGuard: manual key exchange, manual routing, hard to configure through NAT
- OpenVPN: complex, slower, more overhead
- Tailscale: managed WireGuard, automatic key exchange, works behind NAT

**Decision:** Tailscale

**Why:**
- Works in minutes: install, authenticate, done
- Handles NAT traversal automatically — monk is behind home router NAT
- Every device gets a stable `100.x.x.x` address (from the `100.64.0.0/10` range) regardless of location
- Free for up to 100 devices
- WireGuard underneath — same encryption, much easier operation
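
Because the shared database must only listen on a tailnet address, it can help to check a bind address before using it. Here is a small sketch with the standard `ipaddress` module. `is_tailscale_ip` and `check_bind_address` are hypothetical helpers, and the check covers only Tailscale's IPv4 range, `100.64.0.0/10`.

```python
import ipaddress

# Tailscale assigns IPv4 node addresses from the CGNAT block 100.64.0.0/10.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_ip(address: str) -> bool:
    """True if `address` is an IPv4 address inside the Tailscale range."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False  # not an IP address at all
    return ip.version == 4 and ip in TAILSCALE_RANGE


def check_bind_address(address: str) -> None:
    """Refuse to bind a database to anything that is not a tailnet address."""
    if not is_tailscale_ip(address):
        raise ValueError(f"{address} is not a Tailscale address; refusing to bind")


print(is_tailscale_ip("100.101.102.103"))  # True
print(is_tailscale_ip("0.0.0.0"))          # False
print(is_tailscale_ip("100.128.0.1"))      # False: starts with 100 but is outside 100.64.0.0/10
```

The last case is the reason to test against the network range instead of matching on a `100.` prefix.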

**The tradeoff:** You trust Tailscale's coordination servers to manage device authentication.
Traffic is encrypted end-to-end, so Tailscale cannot read it even when it is relayed, but
Tailscale controls who can join your network. Self-hosted alternative if needed: Headscale.

---

## Why Active-Active Failover Instead of Active-Passive?

**The situation:** The user travels. When away from home, monk may be unreachable.
kscloud1 must keep the site running.

**Active-Passive:** kscloud1 only starts serving if Cloudflare detects monk as down.
Requires health checks, failover rules, and a delay before traffic switches.

**Active-Active:** Both monk and kscloud1 are always in the Cloudflare Tunnel rotation.
Every request may hit either host at any time.

**Decision:** Active-Active

**Why:**
- No failover logic needed — both are always live
- Instant: if monk goes down, kscloud1 is already handling traffic
- Free: Cloudflare Tunnel active-active is included; health-check-based failover is paid

**The tradeoff:** Stateful apps with separate databases (Kavita, Karakeep) may show
different data depending on which host answers. Explicitly accepted — the priority is
uptime, not data consistency across hosts. Forgejo and Authentik share databases so
they are consistent.

---

## Why a Custom Portal Instead of a Pre-Built Dashboard?

**Options:**
- Homepage (gethomepage) — nice but limited customization
- Heimdall — similar limitations
- Custom static HTML/CSS/JS + nginx — full control, full ownership

**Decision:** Custom static site

**Why:**
- Complete visual control — the cyberpunk theme, layout, every card, every color
- Static files + nginx are extremely fast and reliable (no Node.js, no build step)
- nginx proxies the `/api/*` endpoints to the metrics API, so the browser sees a single origin and CORS never comes into play
- No dependency on external frameworks that can change or break

**The tradeoff:** More work to build and maintain. But you understand every line of it,
and you can explain exactly why every piece is there.

---

## Why Python + FastAPI for the Metrics API?

**Problem:** The portal needs live system stats (CPU, RAM, network), weather, and
Forgejo git activity. Static HTML can't provide these.

**Decision:** Python FastAPI with `psutil`

**Why:**
- `psutil` reads host system metrics in one line of Python
- FastAPI auto-generates API documentation and handles async requests well
- Python is readable — easy to understand and modify
- `async/await` means the API doesn't block while waiting for weather API responses
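
The non-blocking behavior can be sketched with plain `asyncio`. This is an illustration under stated assumptions, not the portal's actual code: the two upstream calls are simulated with `asyncio.sleep`, and the function names and returned values are placeholders. In the real service the same pattern would presumably live inside FastAPI route handlers that await an HTTP client.

```python
import asyncio
import time


async def fetch_weather() -> dict:
    # Stand-in for a slow HTTP call to a weather API.
    await asyncio.sleep(0.4)
    return {"temp_c": 21}


async def fetch_forgejo_activity() -> dict:
    # Stand-in for a slow HTTP call to the Forgejo API.
    await asyncio.sleep(0.4)
    return {"commits_today": 3}


async def gather_dashboard() -> tuple[dict, float]:
    start = time.perf_counter()
    # Both coroutines wait at the same time, so the total is close to the
    # slowest call and not the sum of both.
    weather, activity = await asyncio.gather(fetch_weather(), fetch_forgejo_activity())
    elapsed = time.perf_counter() - start
    return {"weather": weather, "activity": activity}, elapsed


data, elapsed = asyncio.run(gather_dashboard())
print(data)
print(f"elapsed: {elapsed:.2f}s")
```

Run sequentially, the two calls would take about 0.8 seconds. Awaited together they finish in roughly 0.4 seconds, and the event loop stays free to serve other requests in the meantime.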

**Special requirements:**
- `network_mode: host` — container shares host network namespace so psutil sees real
  network interfaces, not the container's virtual interface
- `pid: host` — container can read the host's `/proc` filesystem for accurate process stats

Without these flags, the API would report container-level stats instead of actual laptop stats.
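
On Linux, psutil builds its network counters from `/proc/net/dev`, and that file reflects whichever network namespace the process runs in. The sketch below parses two sample snapshots to show the difference. It deliberately uses hand-written parsing and no psutil so it runs without dependencies, and the interface names and byte counts are invented for illustration.

```python
def parse_net_dev(text: str) -> dict[str, dict[str, int]]:
    """Parse /proc/net/dev content into {interface: {"rx_bytes", "tx_bytes"}}."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, values = line.split(":", 1)
        fields = values.split()
        counters[name.strip()] = {
            "rx_bytes": int(fields[0]),  # first receive column
            "tx_bytes": int(fields[8]),  # first transmit column
        }
    return counters


HEADER = (
    "Inter-|   Receive                                                |  Transmit\n"
    " face |bytes    packets errs drop fifo frame compressed multicast"
    "|bytes    packets errs drop fifo colls carrier compressed\n"
)

# Hypothetical view from inside a container with its own network namespace.
CONTAINER_VIEW = HEADER + (
    "    lo:    1200   10 0 0 0 0 0 0     1200   10 0 0 0 0 0 0\n"
    "  eth0:   52000  400 0 0 0 0 0 0    31000  350 0 0 0 0 0 0\n"
)

# Hypothetical view with `network_mode: host`: the host's real interfaces.
HOST_VIEW = HEADER + (
    "        lo:     9800    80 0 0 0 0 0 0      9800    80 0 0 0 0 0 0\n"
    "    wlp3s0: 88000000 61000 0 0 0 0 0 0  12000000 40000 0 0 0 0 0 0\n"
    "tailscale0:   450000  3000 0 0 0 0 0 0    380000  2900 0 0 0 0 0 0\n"
    "   docker0:   720000  5100 0 0 0 0 0 0    910000  5300 0 0 0 0 0 0\n"
)

print(sorted(parse_net_dev(CONTAINER_VIEW)))  # ['eth0', 'lo']
print(sorted(parse_net_dev(HOST_VIEW)))       # ['docker0', 'lo', 'tailscale0', 'wlp3s0']
```

Without host networking, the API would only ever see the container's `lo` and `eth0`, so its traffic graphs would describe the container and not the machine.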

---

## Why Forgejo Instead of GitHub or GitLab?

**Problem:** Need to store all homelab code, configs, and documentation in version control.

**Options:**
- GitHub: free, reliable, but your configs and docs are on someone else's server
- GitLab: self-hostable but heavy (4GB+ RAM for full install)
- Forgejo: lightweight GitHub-like self-hosted Git, fork of Gitea

**Decision:** Forgejo

**Why:**
- Self-hosted — configs and documentation stay on your infrastructure
- Very lightweight — uses less than 100MB RAM
- GitHub-compatible API — tools that work with GitHub also work with Forgejo
- Full UI with code review, issues, CI/CD (Forgejo Actions)
- Shows commit history and documentation to anyone you give access to

**The tradeoff:** You maintain it yourself. If Forgejo goes down, git operations fail.
Mitigated by kscloud1 running a replica and the shared Postgres.

---

## Why osTicket for the Help Desk?

**What it replaced:** OpenProject (project management tool on tasks.kitestacks.com)

**Why OpenProject was removed:**
- SSO in OpenProject requires an Enterprise Edition license; the Community Edition (CE) does not include it
- The SSO button simply does not appear in CE: it is a hard paywall with no workaround
- OpenProject is also resource-heavy for what it provides

**Why osTicket:**
- Lightweight and runs well on the existing stack
- Email integration works (SMTP via Gmail app password — confirmed working)
- Handles the ticket/task tracking use case without the licensing barrier

---

## Why BookStack for the Wiki?

**Problem:** Need a place for long-form documentation that's more structured than markdown files.

**Decision:** BookStack

**Why:**
- Clean, organized UI: Shelves → Books → Chapters → Pages hierarchy
- WYSIWYG editor — easy to write docs without markdown syntax
- Authentik OIDC SSO works natively
- API available — docs can be pushed programmatically from scripts or CI

**Key gotcha:** Cache directory must be writable by the container user.
`chown -R abc:users /config/www/framework/cache/` is required after first install.

---

## Why a Shared Postgres for Forgejo?

**Problem:** With two connectors in active-active, Forgejo on monk and kscloud1 had
separate SQLite databases. Repos created on one weren't visible on the other.

**Fix:** Migrated both Forgejo instances to a single shared PostgreSQL database on kscloud1
(same shared server as Authentik's Postgres). Both connectors now serve identical Forgejo data.

**How it was done:**
- Ran `forgejo dump --database postgres` on monk to export clean, Postgres-compatible SQL
- Dropped the schema that pgloader had created (it had the wrong structure), then reloaded the clean SQL
- Pointed both compose files at `authentik-postgres:5432`, database `forgejo`, user `forgejo`
- Attached kscloud1's Forgejo to the `authentik_default` Docker network so it can reach `authentik-postgres`