homelab-mastery/architecture/decisions.md
kenpat ca9e8a7959 init: complete homelab mastery guide
Architecture overview, design decisions, Docker/networking/OAuth2/Linux
concept deep-dives, cert roadmap for cloud engineering track, interview
prep with model answers, and structured learning path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-11 20:08:27 -05:00

Architecture Decisions — The Why Behind Every Choice

For every technology choice, there was a reason. Understanding the "why" is what separates someone who copied commands from someone who designed a system.


Why Docker Instead of Running Services Directly?

Problem: Running 15+ services directly on a Linux host creates dependency hell — different Python versions, conflicting library versions, services affecting each other.

Options considered:

  • Bare metal: install each app directly on the OS
  • Virtual machines: one VM per service
  • Docker containers: isolated processes with their own dependencies

Decision: Docker

Why:

  • Each container has its own filesystem, dependencies, and runtime — they can't conflict
  • Starting/stopping/updating one service doesn't affect others
  • The docker-compose.yml file IS the documentation — it shows exactly what the service needs to run
  • Portability: move the same compose file to a new machine and it works identically
  • Isolation: if Karakeep gets compromised, it can't easily touch Forgejo's data

What you'd say to a hiring manager: "I containerized every service using Docker and Docker Compose so each has isolated dependencies and the entire deployment is reproducible from a single YAML file."


Why Cloudflare Tunnel Instead of Port Forwarding?

Problem: How do you make home services accessible from the internet?

Traditional approach: Open ports 80 and 443 on the home router, configure NAT, and point DNS at the home IP.

Problems with that:

  • Exposes your home IP address publicly (DDoS target, port scans, rough geolocation)
  • A dynamic home IP means DNS breaks every time the IP changes
  • Some ISPs block residential port 80/443
  • Router configuration is error-prone and varies by hardware

Decision: Cloudflare Tunnel (cloudflared)

Why:

  • cloudflared makes an OUTBOUND connection to Cloudflare — no inbound ports needed
  • Home IP never exposed
  • Works regardless of ISP restrictions
  • Cloudflare handles TLS/HTTPS — you don't manage SSL certificates
  • Free tier covers everything needed
  • Bonus: built-in DDoS protection

The trade-off: You depend on Cloudflare. If Cloudflare has an outage, your site goes down even if your hardware is fine. This is acceptable — Cloudflare's uptime is better than most home internet connections.


Why Authentik for SSO Instead of Separate Logins Per App?

Problem: 9 services means 9 different usernames and passwords to manage. Adding a user requires going into 9 admin panels. Removing access means 9 places to deactivate.

Options:

  • Separate logins per service (no SSO)
  • Authelia (simpler, primarily a forward-auth proxy)
  • Authentik (full OIDC provider, more complex)
  • Keycloak (enterprise-grade, very heavy)

Decision: Authentik

Why:

  • One account controls access to everything
  • Apps that support native OIDC (Grafana, Kavita, Open WebUI, Karakeep) get real SSO — the user is authenticated inside the app
  • Can restrict which groups can access which applications (Portainer restricted to homelab-admin group)
  • Self-hosted — user data stays on your infrastructure
  • Authentik supports both native OIDC (for apps that support it) and proxy provider (for apps that don't)
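
The group restriction mentioned above can be pictured as a lookup from application to allowed groups. The following is only a minimal sketch of that idea, not Authentik's actual policy engine: the `APP_GROUPS` table, the `can_access` helper, and the `homelab-users` group are hypothetical names introduced for illustration. Only the Portainer / homelab-admin pairing comes from the text.

```python
# Hypothetical policy table: application name -> groups allowed to open it.
# Only the Portainer / homelab-admin pairing comes from the text above;
# the other entries and the "homelab-users" group are made up for illustration.
APP_GROUPS = {
    "portainer": {"homelab-admin"},
    "grafana": {"homelab-admin", "homelab-users"},
    "kavita": {"homelab-admin", "homelab-users"},
}


def can_access(user_groups: set, app: str) -> bool:
    """Return True if the user shares at least one group with the app's allow-list."""
    allowed = APP_GROUPS.get(app, set())  # unknown apps are denied by default
    return bool(set(user_groups) & allowed)


if __name__ == "__main__":
    print(can_access({"homelab-users"}, "portainer"))  # False
    print(can_access({"homelab-admin"}, "portainer"))  # True
```

Denying unknown applications by default is a design choice of this sketch: forgetting to add an entry fails closed rather than open.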

The trade-off: Authentik is complex to set up and has a significant memory footprint. Authelia would be simpler, but it is designed primarily as a forward-auth proxy, whereas Authentik is a full identity provider that covers both native OIDC and proxy authentication.


Why a Shared Postgres Instead of Separate Authentik Databases?

Problem: After setting up active-active failover, users kept getting invalid_grant errors when signing in through SSO.

Root cause: OAuth2 authorization codes are rows in a database. The flow is:

  1. /authorize → code stored in Database A (monk's Authentik)
  2. /token → looks for code in Database B (kscloud1's Authentik)
  3. Code not found → invalid_grant

Cloudflare Tunnel load-balances between monk and kscloud1 for every HTTP request. Steps 1 and 2 of the OAuth flow can hit different hosts.
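
The failure described above can be reproduced with a toy model. This is a sketch under simplifying assumptions, not Authentik's real code: the `AuthentikNode` class is hypothetical, and a plain dictionary stands in for the Postgres table that holds authorization codes.

```python
import secrets


class AuthentikNode:
    """Toy stand-in for one Authentik instance backed by a code table."""

    def __init__(self, name: str, code_table: dict):
        self.name = name
        self.codes = code_table  # authorization code -> user

    def authorize(self, user: str) -> str:
        """Step 1 (/authorize): issue a one-time code and store it."""
        code = secrets.token_urlsafe(16)
        self.codes[code] = user
        return code

    def token(self, code: str) -> str:
        """Step 2 (/token): redeem the code; a missing code is invalid_grant."""
        user = self.codes.pop(code, None)  # codes are single-use
        if user is None:
            return "invalid_grant"
        return f"access-token-for-{user}"


# Separate databases: each host has its own table.
monk = AuthentikNode("monk", {})
kscloud1 = AuthentikNode("kscloud1", {})
code = monk.authorize("alice")
split_result = kscloud1.token(code)  # the code lives only in monk's table

# Shared database: both hosts point at the same table.
shared = {}
monk = AuthentikNode("monk", shared)
kscloud1 = AuthentikNode("kscloud1", shared)
code = monk.authorize("alice")
shared_result = kscloud1.token(code)

print(split_result)   # invalid_grant
print(shared_result)  # access-token-for-alice
```

With separate tables the second request fails whenever it lands on the other host. With one shared table it succeeds no matter which host answers.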

Options:

  • Sync databases continuously (complex, slow, conflict-prone)
  • Use sticky sessions (Cloudflare paid feature)
  • Share one database (simple, reliable)

Decision: Shared Postgres on kscloud1, accessible only over Tailscale

Why:

  • Both monk's and kscloud1's Authentik instances read and write the same database, so authorization codes are always found
  • Binding Postgres to the Tailscale interface means the database is never exposed to the public internet
  • Simple: one line change in each docker-compose.yml to point to a different host
  • Cost: free (already paying for kscloud1)

The trade-off: If kscloud1 goes down or Tailscale connectivity breaks, monk's Authentik can't reach its database and can't start. Rollback procedure: restore monk's compose file to use a local Postgres.


Why Tailscale Instead of WireGuard or OpenVPN?

Problem: Need private networking between monk (home) and kscloud1 (Hetzner cloud) without exposing the Authentik database to the public internet.

Options:

  • WireGuard: manual key exchange, manual routing, more hands-on to configure
  • OpenVPN: even more complex, slower
  • Tailscale: managed WireGuard, automatic key exchange, works behind NAT

Decision: Tailscale

Why:

  • Works instantly — install, authenticate, done
  • Handles NAT traversal automatically (monk is behind home router NAT)
  • Devices get stable 100.x.y.z IPs (from the 100.64.0.0/10 range) regardless of actual network location
  • Free for up to 100 devices
  • Uses WireGuard under the hood — same encryption, much easier configuration
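
Because Tailscale hands out addresses from the 100.64.0.0/10 block, a quick sanity check can confirm that a configured database host really is a Tailscale address and not a public one. This is a minimal sketch using only the standard library. The `is_tailscale_ip` helper and the sample addresses are hypothetical, not part of the original setup.

```python
import ipaddress

# Tailscale assigns node addresses from the CGNAT block 100.64.0.0/10.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_ip(address: str) -> bool:
    """Return True if the address falls inside the Tailscale address range."""
    return ipaddress.ip_address(address) in TAILSCALE_RANGE


if __name__ == "__main__":
    # Sample addresses are made up for illustration.
    print(is_tailscale_ip("100.101.102.103"))  # True
    print(is_tailscale_ip("100.1.2.3"))        # False: outside 100.64.0.0/10
    print(is_tailscale_ip("192.168.1.10"))     # False: ordinary LAN address
```

Note that not every 100.x address qualifies: the range runs from 100.64.0.0 to 100.127.255.255.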

The trade-off: Tailscale is a managed service — you trust Tailscale's coordination servers. The actual data is encrypted peer-to-peer (Tailscale can't see it), but they control device authentication. Self-hosted alternative: Headscale.


Why Active-Active Instead of Active-Passive Failover?

The context: The user travels. When away from home, monk might be inaccessible (home network down, ISP outage, power loss). kscloud1 should keep the site running.

Active-Passive: kscloud1 only starts serving if monk is detected as down. Cloudflare would need health checks and failover rules.

Active-Active: Both monk and kscloud1 are always in the Cloudflare Tunnel rotation. Every request might hit either host.

Decision: Active-Active

Why:

  • Simpler: no health checks to configure, no failover logic
  • Instant: if monk goes down, kscloud1 is already in rotation and serving traffic (about half of it, if requests are split evenly)
  • Free: Cloudflare Tunnel active-active is free; health-check-based failover requires paid plans

The trade-off: Stateful apps (Forgejo, OpenProject, Kavita) have separate databases on each host. A user might see different data depending on which host answers. This was explicitly accepted: the point is uptime, not data consistency across hosts.
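
The two failover strategies compared in this section can be modeled in a few lines. This is a toy sketch, not Cloudflare's actual routing logic: the function names are hypothetical, and picking a healthy host at random is an assumption made purely for illustration.

```python
import random
from collections import Counter
from typing import Optional


def route_active_passive(healthy: set, primary: str, backup: str) -> Optional[str]:
    """Send everything to the primary; use the backup only if the primary is down."""
    if primary in healthy:
        return primary
    if backup in healthy:
        return backup
    return None


def route_active_active(healthy: set, rng: random.Random) -> Optional[str]:
    """Pick any healthy host for each request (random choice is an assumption)."""
    if not healthy:
        return None
    return rng.choice(sorted(healthy))  # sorted so the seeded run is reproducible


rng = random.Random(0)
both_up = Counter(route_active_active({"monk", "kscloud1"}, rng) for _ in range(1000))
monk_down = Counter(route_active_active({"kscloud1"}, rng) for _ in range(1000))

print(both_up)    # both hosts receive a share of the requests
print(monk_down)  # every request goes to kscloud1, with no failover step needed
```

In the active-passive function, something has to maintain the `healthy` set, which is the health-check machinery the decision above avoids. In the active-active model, a host that stops answering simply drops out of the pool.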


Why nginx for the Portal Instead of a Pre-Built Dashboard?

Options:

  • gethomepage (what was used before) — nice but limited customization
  • Heimdall — similar limitations
  • Custom static site + nginx — full control

Decision: Custom static HTML/CSS/JS + nginx

Why:

  • Complete visual control — the cyberpunk theme, the layout, every pixel
  • Static files served by nginx are extremely fast and reliable
  • Can proxy the metrics API for real-time stats without CORS issues
  • No framework dependencies — no Node.js, no build step, just files

The trade-off: More work to build and maintain than a pre-built dashboard. But you now understand every line of it.


Why Python + FastAPI for the Metrics API?

Problem: The portal needs real-time system stats (CPU, RAM, network), weather, and Forgejo activity. These can't come from static HTML files.

Options:

  • Shell scripts + cron → write stats to a JSON file the frontend reads
  • Node.js + Express
  • Python + FastAPI

Decision: Python FastAPI

Why:

  • Python's psutil library reads system metrics with one line of code
  • FastAPI is modern, fast, and automatically documents the API
  • async/await means the API doesn't block while waiting for weather API responses
  • Python is readable — you can understand and modify the code
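
The async/await point above can be seen without FastAPI at all. The sketch below uses only `asyncio` from the standard library. `fetch_weather` and `read_stats` are hypothetical stand-ins, with `asyncio.sleep` simulating a slow upstream weather call and a fast local read. It is an illustration of the concurrency model, not the portal's actual code.

```python
import asyncio


async def fetch_weather() -> dict:
    """Stand-in for a slow upstream weather API call (hypothetical 0.5 s delay)."""
    await asyncio.sleep(0.5)  # while this waits, the event loop runs other tasks
    return {"source": "weather"}


async def read_stats() -> dict:
    """Stand-in for a quick local stats read."""
    await asyncio.sleep(0.01)
    return {"source": "stats"}


async def main() -> list:
    finished = []

    async def run(coro):
        result = await coro
        finished.append(result["source"])  # record completion order

    # Start the slow call first, then two fast ones, all concurrently.
    await asyncio.gather(run(fetch_weather()), run(read_stats()), run(read_stats()))
    return finished


order = asyncio.run(main())
print(order)  # ['stats', 'stats', 'weather']
```

Even though the weather call starts first, both stats reads finish before it. A blocking implementation would have made them wait the full half second.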

The special requirement: The container needs network_mode: host and pid: host. What each setting provides:

  • network_mode: host: the container sees the host's network interfaces, so it reports real host network throughput rather than container-level traffic
  • pid: host: the container shares the host's PID namespace, so psutil sees the host's processes in /proc instead of only the container's own
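
Network throughput is usually derived from two readings of cumulative byte counters, which is why it matters whose interfaces the container sees. The source does not show how the metrics API computes it, so the following is one common approach, sketched with hypothetical names (`NetSnapshot`, `throughput`) and made-up sample numbers. In practice the snapshots would come from something like `psutil.net_io_counters()`, which reports `bytes_sent` and `bytes_recv`.

```python
from typing import NamedTuple


class NetSnapshot(NamedTuple):
    """Cumulative byte counters at one moment in time."""
    bytes_sent: int
    bytes_recv: int


def throughput(prev: NetSnapshot, curr: NetSnapshot, seconds: float) -> dict:
    """Convert two cumulative snapshots into bytes-per-second rates."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return {
        "upload_bytes_per_s": (curr.bytes_sent - prev.bytes_sent) / seconds,
        "download_bytes_per_s": (curr.bytes_recv - prev.bytes_recv) / seconds,
    }


# Hypothetical readings taken 2 seconds apart.
before = NetSnapshot(bytes_sent=1_000_000, bytes_recv=5_000_000)
after = NetSnapshot(bytes_sent=1_200_000, bytes_recv=5_800_000)
rates = throughput(before, after, 2.0)
print(rates)  # {'upload_bytes_per_s': 100000.0, 'download_bytes_per_s': 400000.0}
```

Without network_mode: host, the same arithmetic would run on the container's own virtual interface and report only the container's traffic.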

Why the Forgejo Repo for Documentation?

You could keep documentation in Notion, Google Docs, or a wiki.

Why Forgejo:

  • It's self-hosted — you own the data
  • Git tracks every change with a timestamp and message
  • The documentation lives alongside the configs it describes
  • Hiring managers can see the commit history and read your documentation directly

What this shows to a hiring manager: You treat documentation like code — version-controlled, structured, maintained.