# Architecture Decisions — The Why Behind Every Choice

For every technology choice, there was a reason. Understanding the "why" is what separates someone who copied commands from someone who designed a system.

---

## Why Docker Instead of Running Services Directly?

**Problem:** Running 15+ services directly on a Linux host creates dependency hell: different Python versions, conflicting library versions, and services affecting each other.

**Options considered:**
- Bare metal: install each app directly on the OS
- Virtual machines: one VM per service
- Docker containers: isolated processes with their own dependencies

**Decision:** Docker

**Why:**
- Each container has its own filesystem, dependencies, and runtime, so they can't conflict
- Starting, stopping, or updating one service doesn't affect the others
- The `docker-compose.yml` file IS the documentation: it shows exactly what the service needs to run
- Portability: move the same compose file (along with its volumes and environment files) to a new machine and it runs the same way
- Isolation: if Karakeep gets compromised, it can't easily touch Forgejo's data

**What you'd say to a hiring manager:** *"I containerized every service using Docker and Docker Compose so each has isolated dependencies and the entire deployment is reproducible from a single YAML file."*

---

## Why Cloudflare Tunnel Instead of Port Forwarding?

**Problem:** How do you make home services accessible from the internet?

**Traditional approach:** Open ports 80 and 443 on the home router, configure NAT, and point DNS at the home IP.

**Problems with that:**
- Exposes your home IP address publicly (DDoS risk, and anyone can scan or geolocate it)
- A dynamic home IP means DNS breaks every time the IP changes
- Some ISPs block inbound ports 80/443 on residential connections
- Router configuration is error-prone and varies by hardware

**Decision:** Cloudflare Tunnel (cloudflared)

**Why:**
- cloudflared makes an OUTBOUND connection to Cloudflare, so no inbound ports are needed
- The home IP is never exposed
- Works even when the ISP blocks inbound ports
- Cloudflare handles TLS/HTTPS at its edge, so you don't manage certificates
- Free tier covers everything needed
- Bonus: built-in DDoS protection

**The trade-off:** You depend on Cloudflare. If Cloudflare has an outage, your site goes down even if your hardware is fine. This is acceptable: Cloudflare's uptime is better than most home internet connections.

---

## Why Authentik for SSO Instead of Separate Logins Per App?

**Problem:** 9 services means 9 different usernames and passwords to manage. Adding a user requires going into 9 admin panels. Removing access means 9 places to deactivate.

**Options:**
- Separate logins per service (no SSO)
- Authelia (simpler, forward-auth focused)
- Authentik (full OIDC provider, more complex)
- Keycloak (enterprise-grade, very heavy)

**Decision:** Authentik

**Why:**
- One account controls access to everything
- Apps that support native OIDC (Grafana, Kavita, Open WebUI, Karakeep) get real SSO: the user is authenticated inside the app
- Access can be restricted per application by group (Portainer is restricted to the homelab-admin group)
- Self-hosted: user data stays on your infrastructure
- Authentik supports both native OIDC (for apps that support it) and a proxy provider (for apps that don't)

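
To make "authenticated inside the app" concrete: with native OIDC the app receives a signed ID token (a JWT) whose payload carries the user's identity and, if configured, group membership. The sketch below builds an unsigned sample token by hand, decodes its payload, and applies a check that mirrors the logic of a group restriction. The claim names (`preferred_username`, `groups`), the helper functions, and the sample users are assumptions for illustration, not this project's code. A real app must verify the token's signature against the provider's published keys before trusting any claim.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, the way JWT segments are encoded."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_sample_token(claims: dict) -> str:
    """Build an UNSIGNED demo token (header.payload.signature), for illustration only."""
    header = {"alg": "none", "typ": "JWT"}
    return ".".join([
        b64url(json.dumps(header).encode()),
        b64url(json.dumps(claims).encode()),
        "",  # empty signature: never accept this in production
    ])


def decode_claims(token: str) -> dict:
    """Decode the payload segment WITHOUT verifying the signature."""
    payload = token.split(".")[1]
    padded = payload + "=" * (-len(payload) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


def can_access(claims: dict, required_group: str) -> bool:
    """Group-based restriction, e.g. an app limited to homelab-admin."""
    return required_group in claims.get("groups", [])


admin = make_sample_token({"preferred_username": "alice", "groups": ["homelab-admin"]})
reader = make_sample_token({"preferred_username": "bob", "groups": ["readers"]})

print(can_access(decode_claims(admin), "homelab-admin"))   # True
print(can_access(decode_claims(reader), "homelab-admin"))  # False
```

With a proxy provider, by contrast, the app never sees a token like this; the proxy decides whether the request gets through at all.
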
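
**The trade-off:** Authentik is complex to set up and has a significant memory footprint. Authelia would be simpler, but it is built mainly around forward-auth behind a reverse proxy; it does offer an OpenID Connect provider, though everything is configured in YAML rather than through an admin UI. Authentik treats both modes as first-class: native OIDC tokens for apps that speak OIDC, and a proxy provider for apps that don't.

---
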
## Why a Shared Postgres Instead of Separate Authentik Databases?

**Problem:** After setting up active-active failover, users kept getting `invalid_grant` errors when signing in through SSO.

**Root cause:** OAuth2 authorization codes are rows in a database. With one database per host, the flow breaks like this:
1. `/authorize` → code stored in Database A (monk's Authentik)
2. `/token` → looks for code in Database B (kscloud1's Authentik)
3. Code not found → `invalid_grant`

Cloudflare Tunnel can route each HTTP request to either monk or kscloud1, so steps 1 and 2 of the OAuth flow can hit different hosts.

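
The three steps above can be reproduced with a toy model. This is a minimal sketch, not Authentik's implementation: `AuthServer`, its methods, and the dictionary standing in for the database are all hypothetical names chosen for the demo.

```python
import secrets


class AuthServer:
    """Toy model of one Authentik instance backed by a code store (the 'database')."""

    def __init__(self, name: str, code_store: dict):
        self.name = name
        self.codes = code_store  # maps authorization code -> user

    def authorize(self, user: str) -> str:
        """Step 1: /authorize issues a one-time code and stores it."""
        code = secrets.token_urlsafe(16)
        self.codes[code] = user
        return code

    def token(self, code: str) -> str:
        """Step 2: /token redeems the code; unknown codes are rejected."""
        user = self.codes.pop(code, None)  # codes are single-use
        if user is None:
            return "invalid_grant"
        return f"access-token-for-{user}"


# Separate databases: each host only knows the codes it issued itself.
monk = AuthServer("monk", code_store={})
kscloud1 = AuthServer("kscloud1", code_store={})
code = monk.authorize("alice")      # request 1 lands on monk
print(kscloud1.token(code))         # request 2 lands on kscloud1: invalid_grant

# Shared database: both hosts read and write the same store.
shared = {}
monk = AuthServer("monk", code_store=shared)
kscloud1 = AuthServer("kscloud1", code_store=shared)
code = monk.authorize("alice")
print(kscloud1.token(code))         # access-token-for-alice
```

With a shared store it no longer matters which host handles each step of the flow.
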

**Options:**
- Sync databases continuously (complex, slow, conflict-prone)
- Use sticky sessions (a paid Cloudflare feature)
- Share one database (simple, reliable)

**Decision:** Shared Postgres on kscloud1, accessible only over Tailscale

**Why:**
- Both monk's and kscloud1's Authentik read and write the same database, so authorization codes are always found
- Binding Postgres to the Tailscale address means the database is never exposed to the public internet
- Simple: a one-line change in each `docker-compose.yml` to point at a different database host
- Cost: free (already paying for kscloud1)

**The trade-off:** kscloud1 becomes a single point of failure for SSO. If kscloud1 goes down or Tailscale connectivity breaks, monk's Authentik can't reach its database and can't start. Rollback procedure: restore monk's compose file to use a local Postgres.

---

## Why Tailscale Instead of WireGuard or OpenVPN?

**Problem:** Need private networking between monk (home) and kscloud1 (Hetzner cloud) without exposing the Authentik database to the public internet.

**Options:**
- WireGuard: manual key exchange, manual routing, tedious to configure
- OpenVPN: even more complex, slower
- Tailscale: managed WireGuard, automatic key exchange, works behind NAT

**Decision:** Tailscale

**Why:**
- Works almost instantly: install, authenticate, done
- Handles NAT traversal automatically (monk is behind home router NAT)
- Devices get stable 100.x.y.z addresses (from the 100.64.0.0/10 range) regardless of actual network location
- Free for up to 100 devices (at the time of writing)
- Uses WireGuard under the hood: same encryption, much easier configuration

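
Because Tailscale addresses come from a known block, it is easy to sanity-check that a service is about to bind to a tailnet address rather than a public or wildcard one. The sketch below is a hypothetical guard, not part of this setup: `is_tailscale_ip`, `check_bind_address`, and the sample addresses are made up for illustration.

```python
import ipaddress

# Tailscale assigns node IPv4 addresses from the CGNAT block 100.64.0.0/10.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_ip(address: str) -> bool:
    """Return True if the address falls inside Tailscale's IPv4 range."""
    return ipaddress.ip_address(address) in TAILSCALE_RANGE


def check_bind_address(address: str) -> str:
    """Refuse bind addresses that would expose a service beyond the tailnet."""
    if address in ("0.0.0.0", "::"):
        raise ValueError("wildcard bind would listen on every interface")
    if not is_tailscale_ip(address):
        raise ValueError(f"{address} is not a Tailscale address")
    return address


print(is_tailscale_ip("100.101.102.103"))  # True
print(is_tailscale_ip("192.168.1.20"))     # False
print(is_tailscale_ip("100.128.0.1"))      # False (outside 100.64.0.0/10)
```

Note that "100.x" alone is not enough: only 100.64.0.0 through 100.127.255.255 belongs to the range.
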

**The trade-off:** Tailscale is a managed service, so you trust Tailscale's coordination servers. The actual data is encrypted end to end between devices (Tailscale can't read it), but Tailscale controls device authentication. Self-hosted alternative: Headscale.

---

## Why Active-Active Instead of Active-Passive Failover?

**The context:** The user travels. When away from home, monk might be unreachable (home network down, ISP outage, power loss). kscloud1 should keep the site running.

**Active-Passive:** kscloud1 only starts serving if monk is detected as down. Cloudflare would need health checks and failover rules.

**Active-Active:** Both monk and kscloud1 are always in the Cloudflare Tunnel rotation. Every request might hit either host.

**Decision:** Active-Active

**Why:**
- Simpler: no health checks to configure, no failover logic
- Instant: if monk goes down, kscloud1 is already serving traffic, so there is nothing to switch over
- Free: Cloudflare Tunnel active-active is free; health-check-based failover requires paid plans

**The trade-off:** Stateful apps (Forgejo, OpenProject, Kavita) have separate databases on each host. A user might see different data depending on which host answers. This was explicitly accepted: the point is uptime, not data consistency across hosts.

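
The trade-off can be seen in a small simulation. This is a sketch under stated assumptions: `Host` and `route` are hypothetical names, and strict alternation between hosts is a simplification for the demo, not a claim about how Cloudflare actually distributes requests. It shows a write landing on one host, a read missing it on the other, and requests still succeeding when monk is down.

```python
import itertools


class Host:
    """One host running a stateful app with its OWN local database."""

    def __init__(self, name: str):
        self.name = name
        self.up = True
        self.db = {}


def route(hosts, picker):
    """Pick the next host that is up; stand-in for the tunnel choosing a connector."""
    for _ in range(len(hosts)):
        host = next(picker)
        if host.up:
            return host
    raise RuntimeError("no host available")


monk, kscloud1 = Host("monk"), Host("kscloud1")
hosts = [monk, kscloud1]
picker = itertools.cycle(hosts)  # simple alternation, an assumption for the demo

# A write lands on whichever host answers the request (monk here).
route(hosts, picker).db["issue-1"] = "created"

# The next read is answered by kscloud1, which never saw the write.
print(route(hosts, picker).db.get("issue-1"))          # None

# Uptime is what this design buys: with monk down, requests still succeed.
monk.up = False
print([route(hosts, picker).name for _ in range(3)])   # ['kscloud1', 'kscloud1', 'kscloud1']
```
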

---

## Why nginx for the Portal Instead of a Pre-Built Dashboard?

**Options:**
- gethomepage (what was used before): nice but limited customization
- Heimdall: similar limitations
- Custom static site + nginx: full control

**Decision:** Custom static HTML/CSS/JS + nginx

**Why:**
- Complete visual control: the cyberpunk theme, the layout, every pixel
- Static files served by nginx are extremely fast and reliable
- nginx can proxy the metrics API under the portal's own origin, so the page gets real-time stats without CORS issues
- No framework dependencies: no Node.js, no build step, just files

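
The CORS point comes down to how browsers define an origin: scheme, host, and port must all match. The sketch below illustrates that comparison; the hostname `portal.example.com`, the `/api/stats` path, and port 8000 are placeholders, not this portal's real values.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> tuple:
    """Return the (scheme, host, port) triple browsers compare for same-origin checks."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def same_origin(page_url: str, request_url: str) -> bool:
    """True when a request from the page would count as same-origin."""
    return origin(page_url) == origin(request_url)


page = "https://portal.example.com/index.html"

# API reached through nginx under the same hostname: same origin, no CORS headers needed.
print(same_origin(page, "https://portal.example.com/api/stats"))   # True

# API called directly on its own port: a different origin, so the browser applies CORS.
print(same_origin(page, "http://portal.example.com:8000/stats"))   # False
```

Proxying the API behind a path on the portal's hostname keeps every request in the first category.
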

**The trade-off:** More work to build and maintain than a pre-built dashboard. But you now understand every line of it.

---

## Why Python + FastAPI for the Metrics API?

**Problem:** The portal needs real-time system stats (CPU, RAM, network), weather, and Forgejo activity. These can't come from static HTML files.

**Options:**
- Shell scripts + cron → write stats to a JSON file the frontend reads
- Node.js + Express
- Python + FastAPI

**Decision:** Python + FastAPI

**Why:**
- Python's `psutil` library reads each system metric with a single call
- FastAPI is modern, fast, and generates interactive API documentation automatically
- `async/await` means the API doesn't block while waiting for weather API responses (as long as the HTTP client is async too)
- Python is readable, so you can understand and modify the code

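
The `async/await` point can be shown without FastAPI at all. In this minimal sketch, `fetch_weather` and `read_stats` are hypothetical stand-ins: the slow weather request is simulated with `asyncio.sleep`, and the returned values are made up. The log records the order in which things happen while both calls are in flight.

```python
import asyncio


async def fetch_weather(log: list) -> dict:
    """Stand-in for a slow upstream weather request (simulated with sleep)."""
    log.append("weather: started")
    await asyncio.sleep(0.2)  # while waiting, the event loop runs other tasks
    log.append("weather: finished")
    return {"temp_c": 21}  # made-up value


async def read_stats(log: list) -> dict:
    """Stand-in for a quick local stats lookup."""
    log.append("stats: finished")
    return {"cpu_percent": 12.5}  # made-up value


async def main() -> list:
    log = []
    # Both handlers are in flight at once, as two concurrent API requests would be.
    await asyncio.gather(fetch_weather(log), read_stats(log))
    return log


print(asyncio.run(main()))
# ['weather: started', 'stats: finished', 'weather: finished']
```

The stats call completes while the weather call is still waiting. A blocking HTTP client inside an `async def` handler would not yield at that point and would stall every other request.
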

**The special requirement:** The container needs `network_mode: host` and `pid: host`. What each one provides:
- `network_mode: host`: the container shares the host's network namespace, so it sees the host's network interfaces and reports real network throughput (not container-level)
- `pid: host`: the container shares the host's PID namespace, so psutil sees the host's processes in `/proc` and reports actual system stats instead of container stats

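
To see why the network namespace matters: on Linux, psutil builds its network counters from `/proc/net/dev`, which lists only the interfaces visible in the current network namespace. The sketch below parses a sample of that file's format. The sample text, its byte counts, and the `parse_net_dev` helper are illustrative, not this project's code. Inside a container with `network_mode: host` the real file lists the host's interfaces; without it, only the container's own.

```python
SAMPLE_PROC_NET_DEV = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:  104857    1024    0    0    0     0          0         0   104857    1024    0    0    0     0       0          0
  eth0: 9876543   65432    0    0    0     0          0         0  1234567   43210    0    0    0     0       0          0
"""


def parse_net_dev(text: str) -> dict:
    """Map interface name -> (bytes received, bytes sent)."""
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        name, _, fields = line.partition(":")
        values = fields.split()
        # Column 0 is received bytes, column 8 is transmitted bytes.
        counters[name.strip()] = (int(values[0]), int(values[8]))
    return counters


print(parse_net_dev(SAMPLE_PROC_NET_DEV))
# {'lo': (104857, 104857), 'eth0': (9876543, 1234567)}
```
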

---

## Why the Forgejo Repo for Documentation?

You could keep documentation in Notion, Google Docs, or a wiki.

**Why Forgejo:**
- It's self-hosted, so you own the data
- Git tracks every change with a timestamp and message
- The documentation lives alongside the configs it describes
- Hiring managers can see the commit history and read your documentation directly

**What this shows to a hiring manager:** You treat documentation like code: version-controlled, structured, maintained.