This repository was archived on 2026-06-19. You can view files and clone it, but you cannot change its state, such as by pushing or by creating new issues, pull requests, or comments.

kenpat ca9e8a7959 init: complete homelab mastery guide

Architecture overview, design decisions, Docker/networking/OAuth2/Linux
concept deep-dives, cert roadmap for cloud engineering track, interview
prep with model answers, and structured learning path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-06-11 20:08:27 -05:00

Architecture Decisions — The Why Behind Every Choice

For every technology choice, there was a reason. Understanding the "why" is what separates someone who copied commands from someone who designed a system.

Why Docker Instead of Running Services Directly?

Problem: Running 15+ services directly on a Linux host creates dependency hell — different Python versions, conflicting library versions, services affecting each other.

Options considered:

Bare metal: install each app directly on the OS
Virtual machines: one VM per service
Docker containers: isolated processes with their own dependencies

Decision: Docker

Why:

Each container has its own filesystem, dependencies, and runtime — they can't conflict
Starting/stopping/updating one service doesn't affect others
The docker-compose.yml file IS the documentation — it shows exactly what the service needs to run
Portability: move the same compose file to a new machine and it works identically
Isolation: if Karakeep gets compromised, it can't easily touch Forgejo's data

What you'd say to a hiring manager: "I containerized every service using Docker and Docker Compose so each has isolated dependencies and the entire deployment is reproducible from a single YAML file."

Why Cloudflare Tunnel Instead of Port Forwarding?

Problem: How do you make home services accessible from the internet?

Traditional approach: Open ports 80 and 443 on the home router, configure NAT, and point DNS at the home IP.

Problems with that:

Exposes your home IP address publicly (DDoS risk, discoverable by port scanners, visible to the ISP)
A dynamic home IP means DNS breaks every time the IP changes
Some ISPs block residential port 80/443
Router configuration is error-prone and varies by hardware

Decision: Cloudflare Tunnel (cloudflared)

Why:

cloudflared makes an OUTBOUND connection to Cloudflare — no inbound ports needed
Home IP never exposed
Works regardless of ISP restrictions
Cloudflare handles TLS/HTTPS — you don't manage SSL certificates
Free tier covers everything needed
Bonus: built-in DDoS protection
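
The outbound-only idea can be illustrated with a toy reverse tunnel. This is a minimal sketch on localhost and not cloudflared's actual protocol: the names connector and edge, the request line, and the reply format are all made up for illustration. The point is that only the edge side listens; the home side dials out and then answers requests that arrive back over the connection it opened.

```python
import asyncio


async def connector(edge_port: int) -> None:
    """Home side: dial OUT to the edge, then answer a request sent back down."""
    reader, writer = await asyncio.open_connection("127.0.0.1", edge_port)
    request = (await reader.readline()).decode().strip()
    writer.write(f"200 OK ({request})\n".encode())
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def main() -> str:
    reply: asyncio.Future[str] = asyncio.get_running_loop().create_future()

    async def edge(reader, writer):
        """Edge side: the only listener. It pushes a visitor's request
        through the connection that the home side opened."""
        writer.write(b"GET /portal\n")  # hypothetical request line
        await writer.drain()
        reply.set_result((await reader.readline()).decode().strip())
        writer.close()
        await writer.wait_closed()

    # Port 0 asks the OS for any free port; the home side never listens.
    server = await asyncio.start_server(edge, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        await connector(port)
        return await asyncio.wait_for(reply, timeout=2)


result = asyncio.run(main())
print(result)  # 200 OK (GET /portal)
```

Because the home side never binds a listening socket, there is nothing for the router to forward, which is why no inbound ports are needed.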

The trade-off: You depend on Cloudflare. If Cloudflare has an outage, your site goes down even if your hardware is fine. This is acceptable — Cloudflare's uptime is better than most home internet connections.

Why Authentik for SSO Instead of Separate Logins Per App?

Problem: 9 services means 9 different usernames and passwords to manage. Adding a user requires going into 9 admin panels. Removing access means 9 places to deactivate.

Options:

Separate logins per service (no SSO)
Authelia (simpler, forward-auth proxy only)
Authentik (full OIDC provider, more complex)
Keycloak (enterprise-grade, very heavy)

Decision: Authentik

Why:

One account controls access to everything
Apps that support native OIDC (Grafana, Kavita, Open WebUI, Karakeep) get real SSO — the user is authenticated inside the app
Can restrict which groups can access which applications (Portainer restricted to homelab-admin group)
Self-hosted — user data stays on your infrastructure
Authentik supports both native OIDC (for apps that support it) and proxy provider (for apps that don't)
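
The group restriction above boils down to a lookup from application to allowed groups. The sketch below is a hypothetical model of that rule, not Authentik's API (Authentik configures this through its own admin interface); the APP_POLICIES table and the can_access helper are invented for illustration.

```python
# Hypothetical policy table: application -> groups allowed to open it.
# An empty set means any authenticated user may access the app.
APP_POLICIES: dict[str, set[str]] = {
    "portainer": {"homelab-admin"},
    "grafana": set(),
    "kavita": set(),
}


def can_access(app: str, user_groups: set[str]) -> bool:
    """Allow access if the app is unrestricted or the user shares a group."""
    allowed = APP_POLICIES.get(app)
    if allowed is None:
        return False  # unknown apps are denied by default
    return not allowed or bool(allowed & user_groups)


print(can_access("portainer", {"homelab-admin"}))  # True
print(can_access("portainer", {"family"}))         # False
print(can_access("grafana", {"family"}))           # True
```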

The trade-off: Authentik is complex to set up and has a significant memory footprint. Authelia would be simpler. But Authelia only does forward-auth proxy — it can't give an app a real JWT. Authentik does both.

Why a Shared Postgres Instead of Separate Authentik Databases?

Problem: After setting up active-active failover, users kept getting invalid_grant errors when signing in through SSO.

Root cause: OAuth2 authorization codes are rows in a database. The flow is:

1. /authorize → code stored in Database A (monk's Authentik)
2. /token → looks for code in Database B (kscloud1's Authentik)
3. Code not found → invalid_grant

Cloudflare Tunnel load-balances between monk and kscloud1 for every HTTP request. Steps 1 and 2 of the OAuth flow can hit different hosts.
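
The failure is easy to reproduce in miniature. The sketch below is a toy model, not Authentik's code: AuthentikNode and its in-memory dict stand in for an Authentik instance and its Postgres table, and the user name is a placeholder. With separate stores the code issued by one node is unknown to the other; with one shared store the same flow succeeds.

```python
import secrets


class AuthentikNode:
    """Toy stand-in for one Authentik instance backed by a code store."""

    def __init__(self, name: str, code_store: dict[str, str]):
        self.name = name
        self.code_store = code_store  # plays the role of the Postgres table

    def authorize(self, user: str) -> str:
        """Step 1: issue a single-use authorization code and persist it."""
        code = secrets.token_urlsafe(16)
        self.code_store[code] = user
        return code

    def token(self, code: str) -> str:
        """Step 2: redeem the code; an unknown code is an invalid_grant."""
        user = self.code_store.pop(code, None)
        if user is None:
            return "invalid_grant"
        return f"access-token-for-{user}"


# Separate databases: each node only knows about the codes it issued.
monk = AuthentikNode("monk", code_store={})
kscloud1 = AuthentikNode("kscloud1", code_store={})
code = monk.authorize("alice")
separate_result = kscloud1.token(code)

# Shared database: both nodes read and write the same store.
shared: dict[str, str] = {}
monk = AuthentikNode("monk", code_store=shared)
kscloud1 = AuthentikNode("kscloud1", code_store=shared)
code = monk.authorize("alice")
shared_result = kscloud1.token(code)

print(separate_result)  # invalid_grant
print(shared_result)    # access-token-for-alice
```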

Options:

Sync databases continuously (complex, slow, conflict-prone)
Use sticky sessions (Cloudflare paid feature)
Share one database (simple, reliable)

Decision: Shared Postgres on kscloud1, accessible only over Tailscale

Why:

Both monk and kscloud1 Authentik read/write the same database — authorization codes always found
Tailscale binding means the database is never exposed to the public internet (security)
Simple: one line change in each docker-compose.yml to point to a different host
Cost: free (already paying for kscloud1)
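
One way to sanity-check the Tailscale-only binding is to confirm that the address Postgres is published on falls inside the 100.64.0.0/10 block that Tailscale assigns from. This is a small sketch; the helper name and the sample addresses are hypothetical.

```python
import ipaddress

# Tailscale hands out addresses from the CGNAT block 100.64.0.0/10.
TAILNET_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_tailnet_address(bind_address: str) -> bool:
    """Return True if a listen address falls inside the Tailscale range."""
    return ipaddress.ip_address(bind_address) in TAILNET_RANGE


# Hypothetical bind addresses for the Postgres port mapping.
print(is_tailnet_address("100.101.102.103"))  # True
print(is_tailnet_address("0.0.0.0"))          # False
print(is_tailnet_address("192.168.1.20"))     # False
```

A bind address of 0.0.0.0 fails the check because it listens on every interface, including any public one.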

The trade-off: If kscloud1 goes down or Tailscale connectivity breaks, monk's Authentik can't reach its database, so it can't start or serve logins. Rollback procedure: restore monk's compose to use a local Postgres.

Why Tailscale Instead of WireGuard or OpenVPN?

Problem: Need private networking between monk (home) and kscloud1 (Hetzner cloud) without exposing the Authentik database to the public internet.

Options:

WireGuard: manual key exchange, manual routing, tedious to configure by hand
OpenVPN: even more complex, slower
Tailscale: managed WireGuard, automatic key exchange, works behind NAT

Decision: Tailscale

Why:

Works instantly — install, authenticate, done
Handles NAT traversal automatically (monk is behind home router NAT)
Devices get stable 100.x.x.x IPs regardless of actual network location
Free for up to 100 devices
Uses WireGuard under the hood — same encryption, much easier configuration

The trade-off: Tailscale is a managed service, so you trust Tailscale's coordination servers. Traffic between devices is encrypted end to end (Tailscale can't read it), but Tailscale controls device authentication. Self-hosted alternative: Headscale.

Why Active-Active Instead of Active-Passive Failover?

The context: The user travels. When away from home, monk might be inaccessible (home network down, ISP outage, power). kscloud1 should keep the site running.

Active-Passive: kscloud1 only starts serving if monk is detected as down. Cloudflare would need health checks and failover rules.

Active-Active: Both monk and kscloud1 are always in the Cloudflare Tunnel rotation. Every request might hit either host.

Decision: Active-Active

Why:

Simpler: no health checks to configure, no failover logic
Instant: if monk goes down, kscloud1 is already serving its share of the traffic and simply takes the rest
Free: Cloudflare Tunnel active-active is free; health-check-based failover requires paid plans

The trade-off: Stateful apps (Forgejo, OpenProject, Kavita) have separate databases on each host. A user might see different data depending on which host answers. This was explicitly accepted: the point is uptime, not data consistency across hosts.
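
Both sides of this trade-off can be shown with a toy model. The sketch below assumes a simple "pick any healthy host" router, which is a simplification of how the tunnel chooses between the two hosts; the Host class, the route helper, and the sample record are invented for illustration.

```python
import random


class Host:
    """Toy host with its own local database, as in the active-active setup."""

    def __init__(self, name: str):
        self.name = name
        self.db: dict[str, str] = {}
        self.up = True


def route(hosts: list[Host], rng: random.Random) -> Host:
    """Pick any healthy host, since either one may answer a request."""
    healthy = [h for h in hosts if h.up]
    if not healthy:
        raise RuntimeError("no healthy hosts")
    return rng.choice(healthy)


monk, kscloud1 = Host("monk"), Host("kscloud1")
hosts = [monk, kscloud1]
rng = random.Random(0)

# A write lands on whichever host answered; the other host never sees it.
monk.db["issue-1"] = "created on monk"
seen_on_kscloud1 = kscloud1.db.get("issue-1")  # None: the data diverges

# Uptime side of the trade-off: with monk down, every request still routes.
monk.up = False
answers = {route(hosts, rng).name for _ in range(20)}

print(seen_on_kscloud1)  # None
print(answers)           # {'kscloud1'}
```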

Why nginx for the Portal Instead of a Pre-Built Dashboard?

Options:

gethomepage (what was used before) — nice but limited customization
Heimdall — similar limitations
Custom static site + nginx — full control

Decision: Custom static HTML/CSS/JS + nginx

Why:

Complete visual control — the cyberpunk theme, the layout, every pixel
Static files served by nginx are extremely fast and reliable
Can proxy the metrics API for real-time stats without CORS issues
No framework dependencies — no Node.js, no build step, just files
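
The CORS point comes down to how browsers define an origin: scheme, host, and port must all match. The sketch below compares origins for a page and two ways of reaching the metrics API; the URLs and the port 8000 are hypothetical, and the helpers only model the comparison rule, not a browser.

```python
from urllib.parse import urlsplit


def origin(url: str) -> tuple[str, str, int]:
    """Return the (scheme, host, port) triple compared for same-origin."""
    parts = urlsplit(url)
    default_port = {"http": 80, "https": 443}[parts.scheme]
    return (parts.scheme, parts.hostname, parts.port or default_port)


def is_same_origin(page_url: str, request_url: str) -> bool:
    return origin(page_url) == origin(request_url)


# Hypothetical URLs: the portal page and two routes to the metrics API.
page = "https://portal.example.com/index.html"
proxied = "https://portal.example.com/api/metrics"  # nginx proxies this path
direct = "http://portal.example.com:8000/metrics"   # API reached directly

print(is_same_origin(page, proxied))  # True
print(is_same_origin(page, direct))   # False
```

When nginx proxies the API under the portal's own origin, the browser treats the request as same-origin and no CORS headers are required.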

The trade-off: More work to build and maintain than a pre-built dashboard. But you now understand every line of it.

Why Python + FastAPI for the Metrics API?

Problem: The portal needs real-time system stats (CPU, RAM, network), weather, and Forgejo activity. These can't come from static HTML files.

Options:

Shell scripts + cron → write stats to a JSON file the frontend reads
Node.js + Express
Python + FastAPI

Decision: Python FastAPI

Why:

Python's psutil library reads system metrics with one line of code
FastAPI is modern, fast, and automatically documents the API
async/await means the API doesn't block while waiting for weather API responses
Python is readable — you can understand and modify the code
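
The async/await benefit can be seen without FastAPI at all. In this sketch, two slow upstream calls are simulated with asyncio.sleep; the function names and the returned values are placeholders, not the portal's real code or data. Because the waits overlap, the combined call takes about as long as the slower of the two.

```python
import asyncio
import time


async def fetch_weather() -> dict:
    """Stand-in for a slow upstream weather API call (simulated with sleep)."""
    await asyncio.sleep(0.2)
    return {"temp_c": 21}  # placeholder value


async def fetch_forgejo_activity() -> dict:
    """Stand-in for a call to the Forgejo API (also simulated)."""
    await asyncio.sleep(0.2)
    return {"recent_commits": 3}  # placeholder value


async def build_dashboard_payload() -> dict:
    # Both waits overlap, so the total is about 0.2 s rather than 0.4 s.
    weather, activity = await asyncio.gather(
        fetch_weather(), fetch_forgejo_activity()
    )
    return {"weather": weather, "forgejo": activity}


start = time.perf_counter()
payload = asyncio.run(build_dashboard_payload())
elapsed = time.perf_counter() - start

print(payload)
print(f"took {elapsed:.2f}s")
```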

The special requirement: The container needs network_mode: host and pid: host. Each one gives the API a host-level view:

network_mode: host: the container sees the host's network interfaces, so it reports real network throughput rather than container-level traffic
pid: host: psutil sees the host's process table through /proc, so process stats describe the whole machine instead of just the container
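
Network throughput is not read directly; it is derived from two readings of cumulative byte counters. The sketch below shows that calculation in isolation. NetSample is a hypothetical stand-in that mirrors the bytes_sent and bytes_recv fields returned by psutil.net_io_counters(), and the counter values are made up.

```python
from typing import NamedTuple


class NetSample(NamedTuple):
    """Cumulative byte counters at one moment (same field names as the
    bytes_sent and bytes_recv values from psutil.net_io_counters())."""
    timestamp: float
    bytes_sent: int
    bytes_recv: int


def throughput(prev: NetSample, curr: NetSample) -> dict[str, float]:
    """Convert two cumulative samples into bytes-per-second rates."""
    interval = curr.timestamp - prev.timestamp
    if interval <= 0:
        raise ValueError("samples must be taken in chronological order")
    return {
        "up_bytes_per_s": (curr.bytes_sent - prev.bytes_sent) / interval,
        "down_bytes_per_s": (curr.bytes_recv - prev.bytes_recv) / interval,
    }


# Made-up counter values two seconds apart, for illustration only.
before = NetSample(timestamp=100.0, bytes_sent=1_000_000, bytes_recv=5_000_000)
after = NetSample(timestamp=102.0, bytes_sent=1_200_000, bytes_recv=6_000_000)

rates = throughput(before, after)
print(rates)  # {'up_bytes_per_s': 100000.0, 'down_bytes_per_s': 500000.0}
```

In the real container the two samples would come from the host's interfaces, which is what network_mode: host makes visible.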

Why the Forgejo Repo for Documentation?

You could keep documentation in Notion, Google Docs, or a wiki.

Why Forgejo:

It's self-hosted — you own the data
Git tracks every change with a timestamp and message
The documentation lives alongside the configs it describes
Hiring managers can see the commit history and read your documentation directly

What this shows to a hiring manager: You treat documentation like code — version-controlled, structured, maintained.
