init: complete homelab mastery guide

Architecture overview, design decisions, Docker/networking/OAuth2/Linux
concept deep-dives, cert roadmap for cloud engineering track, interview
prep with model answers, and structured learning path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kenpat 2026-06-11 20:08:27 -05:00
commit ca9e8a7959
10 changed files with 1534 additions and 0 deletions

199
architecture/decisions.md Normal file

@@ -0,0 +1,199 @@
# Architecture Decisions — The Why Behind Every Choice
For every technology choice, there was a reason. Understanding the "why" is what separates someone who copied commands from someone who designed a system.
---
## Why Docker Instead of Running Services Directly?
**Problem:** Running 15+ services directly on a Linux host creates dependency hell — different Python versions, conflicting library versions, services affecting each other.
**Options considered:**
- Bare metal: install each app directly on the OS
- Virtual machines: one VM per service
- Docker containers: isolated processes with their own dependencies
**Decision:** Docker
**Why:**
- Each container has its own filesystem, dependencies, and runtime — they can't conflict
- Starting/stopping/updating one service doesn't affect others
- The `docker-compose.yml` file IS the documentation — it shows exactly what the service needs to run
- Portability: move the same compose file to a new machine and it works identically
- Isolation: if Karakeep gets compromised, it can't easily touch Forgejo's data
**What you'd say to a hiring manager:** *"I containerized every service using Docker and Docker Compose so each has isolated dependencies and the entire deployment is reproducible from a single YAML file."*
---
## Why Cloudflare Tunnel Instead of Port Forwarding?
**Problem:** How do you make home services accessible from the internet?
**Traditional approach:** Open ports 80 and 443 on the home router, configure NAT, point DNS to the home IP.
**Problems with that:**
- Exposes your home IP address publicly (DDoS risk, port scanning, rough geolocation of your home)
- Dynamic home IP means DNS breaks every time IP changes
- Some ISPs block residential port 80/443
- Router configuration is error-prone and varies by hardware
**Decision:** Cloudflare Tunnel (cloudflared)
**Why:**
- cloudflared makes an OUTBOUND connection to Cloudflare — no inbound ports needed
- Home IP never exposed
- Works regardless of ISP restrictions
- Cloudflare handles TLS/HTTPS — you don't manage SSL certificates
- Free tier covers everything needed
- Bonus: built-in DDoS protection
**The trade-off:** You depend on Cloudflare. If Cloudflare has an outage, your site goes down even if your hardware is fine. This is acceptable — Cloudflare's uptime is better than most home internet connections.
---
## Why Authentik for SSO Instead of Separate Logins Per App?
**Problem:** 9 services means 9 different usernames and passwords to manage. Adding a user requires going into 9 admin panels. Removing access means 9 places to deactivate.
**Options:**
- Separate logins per service (no SSO)
- Authelia (simpler, centered on forward-auth proxying)
- Authentik (full OIDC provider, more complex)
- Keycloak (enterprise-grade, very heavy)
**Decision:** Authentik
**Why:**
- One account controls access to everything
- Apps that support native OIDC (Grafana, Kavita, Open WebUI, Karakeep) get real SSO — the user is authenticated inside the app
- Can restrict which groups can access which applications (Portainer restricted to homelab-admin group)
- Self-hosted — user data stays on your infrastructure
- Authentik supports both native OIDC (for apps that support it) and proxy provider (for apps that don't)
**The trade-off:** Authentik is complex to set up and has a significant memory footprint. Authelia would be simpler, but it is centered on forward-auth proxying, with OpenID Connect support as a secondary feature. Authentik handles both native OIDC and proxy auth as core features.
---
## Why a Shared Postgres Instead of Separate Authentik Databases?
**Problem:** After setting up active-active failover, users kept getting `invalid_grant` errors when signing in through SSO.
**Root cause:** OAuth2 authorization codes are rows in a database. The flow is:
1. `/authorize` → code stored in Database A (monk's Authentik)
2. `/token` → looks for code in Database B (kscloud1's Authentik)
3. Code not found → `invalid_grant`
Cloudflare Tunnel can route each HTTP request to either monk or kscloud1, so steps 1 and 2 of the OAuth flow can hit different hosts.
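The failure can be reproduced with a toy model. This is a minimal sketch, not Authentik's code: `AuthentikNode` and its dict-backed store are hypothetical stand-ins for an Authentik instance and its Postgres table, and the user name is made up.

```python
import secrets


class AuthentikNode:
    """Toy stand-in for one Authentik instance backed by a code store."""

    def __init__(self, name: str, code_store: dict[str, str]) -> None:
        self.name = name
        self.code_store = code_store  # plays the role of the Postgres table

    def authorize(self, user: str) -> str:
        # /authorize: persist a one-time authorization code for this user
        code = secrets.token_urlsafe(16)
        self.code_store[code] = user
        return code

    def token(self, code: str) -> dict[str, str]:
        # /token: the code must exist in *this* node's store
        user = self.code_store.pop(code, None)
        if user is None:
            return {"error": "invalid_grant"}
        return {"access_token": f"token-for-{user}"}


# Separate databases: the code written by monk is invisible to kscloud1.
monk = AuthentikNode("monk", {})
kscloud1 = AuthentikNode("kscloud1", {})
separate_result = kscloud1.token(monk.authorize("alice"))

# Shared database: both nodes hold a reference to the same store.
shared_db: dict[str, str] = {}
monk = AuthentikNode("monk", shared_db)
kscloud1 = AuthentikNode("kscloud1", shared_db)
shared_result = kscloud1.token(monk.authorize("alice"))

print(separate_result)  # {'error': 'invalid_grant'}
print(shared_result)    # {'access_token': 'token-for-alice'}
```

The only difference between the two runs is whether both nodes hold the same store, which is the whole argument for sharing one database.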
**Options:**
- Sync databases continuously (complex, slow, conflict-prone)
- Use sticky sessions (Cloudflare paid feature)
- Share one database (simple, reliable)
**Decision:** Shared Postgres on kscloud1, accessible only over Tailscale
**Why:**
- Both monk and kscloud1 Authentik read/write the same database — authorization codes always found
- Tailscale binding means the database is never exposed to the public internet (security)
- Simple: a one-line change in each `docker-compose.yml` to point to a different host
- Cost: free (already paying for kscloud1)
**The trade-off:** If kscloud1 goes down or Tailscale connectivity breaks, monk's Authentik can't reach its database and won't start. Rollback procedure: restore monk's compose to use a local Postgres.
---
## Why Tailscale Instead of WireGuard or OpenVPN?
**Problem:** Need private networking between monk (home) and kscloud1 (Hetzner cloud) without exposing the Authentik database to the public internet.
**Options:**
- WireGuard: manual key exchange, manual routing, technical to configure
- OpenVPN: even more complex, slower
- Tailscale: managed WireGuard, automatic key exchange, works behind NAT
**Decision:** Tailscale
**Why:**
- Works instantly — install, authenticate, done
- Handles NAT traversal automatically (monk is behind home router NAT)
- Devices get stable 100.x.x.x IPs regardless of actual network location
- Free for up to 100 devices
- Uses WireGuard under the hood — same encryption, much easier configuration
**The trade-off:** Tailscale is a managed service — you trust Tailscale's coordination servers. The actual data is encrypted peer-to-peer (Tailscale can't see it), but they control device authentication. Self-hosted alternative: Headscale.
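Tailscale addresses come from the CGNAT block `100.64.0.0/10`, which is narrower than "100.x.x.x" suggests. A small sketch of a sanity check that a configured database host really is a tailnet address; the function name and the sample addresses are illustrative only.

```python
import ipaddress

# Tailscale assigns IPv4 addresses from the CGNAT block 100.64.0.0/10,
# i.e. 100.64.0.0 through 100.127.255.255.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_ip(address: str) -> bool:
    """Return True if the address falls inside the Tailscale CGNAT range."""
    return ipaddress.ip_address(address) in TAILSCALE_RANGE


print(is_tailscale_ip("100.101.102.103"))  # True
print(is_tailscale_ip("100.200.0.1"))      # False: outside 100.64.0.0/10
print(is_tailscale_ip("203.0.113.10"))     # False: a public address
```

A check like this could run before a service connects to the shared Postgres, to catch a compose file that accidentally points at a public IP.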
---
## Why Active-Active Instead of Active-Passive Failover?
**The context:** The user travels. When away from home, monk might be inaccessible (home network down, ISP outage, power). kscloud1 should keep the site running.
**Active-Passive:** kscloud1 only starts serving if monk is detected as down. Cloudflare would need health checks and failover rules.
**Active-Active:** Both monk and kscloud1 are always in the Cloudflare Tunnel rotation. Every request might hit either host.
**Decision:** Active-Active
**Why:**
- Simpler: no health checks to configure, no failover logic
- Instant: if monk goes down, kscloud1 is already in rotation and serving traffic, so nothing has to be switched over
- Free: Cloudflare Tunnel active-active is free; health-check-based failover requires paid plans
**The trade-off:** Stateful apps (Forgejo, OpenProject, Kavita) have separate databases on each host. A user might see different data depending on which host answers. This was explicitly accepted: the point is uptime, not data consistency across hosts.
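The "no failover logic" point can be illustrated with a short simulation. It assumes a uniformly random choice among connected connectors, which is a simplification: how Cloudflare actually picks a connector is not described here.

```python
import random


def pick_connector(healthy: list[str], rng: random.Random) -> str:
    """Choose a connector for one request from those currently connected."""
    if not healthy:
        raise RuntimeError("no connectors available: the site is down")
    # Assumption: uniform random choice, for illustration only.
    return rng.choice(healthy)


def simulate(healthy: list[str], requests: int, seed: int = 0) -> dict[str, int]:
    """Count how many requests each connected host would serve."""
    rng = random.Random(seed)
    counts = {name: 0 for name in healthy}
    for _ in range(requests):
        counts[pick_connector(healthy, rng)] += 1
    return counts


both_up = simulate(["monk", "kscloud1"], requests=1000)
monk_down = simulate(["kscloud1"], requests=1000)
print(both_up)    # split between the two hosts
print(monk_down)  # {'kscloud1': 1000}
```

When monk drops out, nothing has to detect the failure or promote a standby: the remaining connector simply receives every request.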
---
## Why nginx for the Portal Instead of a Pre-Built Dashboard?
**Options:**
- gethomepage (what was used before) — nice but limited customization
- Heimdall — similar limitations
- Custom static site + nginx — full control
**Decision:** Custom static HTML/CSS/JS + nginx
**Why:**
- Complete visual control — the cyberpunk theme, the layout, every pixel
- Static files served by nginx are extremely fast and reliable
- Can proxy the metrics API for real-time stats without CORS issues
- No framework dependencies — no Node.js, no build step, just files
**The trade-off:** More work to build and maintain than a pre-built dashboard. But you now understand every line of it.
---
## Why Python + FastAPI for the Metrics API?
**Problem:** The portal needs real-time system stats (CPU, RAM, network), weather, and Forgejo activity. These can't come from static HTML files.
**Options:**
- Shell scripts + cron → write stats to a JSON file the frontend reads
- Node.js + Express
- Python + FastAPI
**Decision:** Python FastAPI
**Why:**
- Python's `psutil` library reads system metrics with one line of code
- FastAPI is modern, fast, and automatically documents the API
- `async/await` means the API doesn't block while waiting for weather API responses
- Python is readable — you can understand and modify the code
**The special requirement:** The container needs `network_mode: host` and `pid: host`. What each one provides:
- `network_mode: host`: the container sees the host's network interfaces, so it reports real host throughput rather than traffic on the container's virtual interface
- `pid: host`: the container shares the host's PID namespace, so psutil sees the host's process table through `/proc` instead of only the container's own processes
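Network throughput is not read directly; it is derived from two samples of cumulative byte counters. The sketch below shows that calculation with the standard library only. `NetSample` is a hypothetical container; in the real service the counters would presumably come from `psutil.net_io_counters()`, which exposes `bytes_sent` and `bytes_recv`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetSample:
    """One reading of the cumulative interface counters."""

    timestamp: float  # seconds
    bytes_sent: int
    bytes_recv: int


def throughput(prev: NetSample, curr: NetSample) -> dict[str, float]:
    """Bytes per second between two cumulative counter samples."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return {
        "tx_bytes_per_s": (curr.bytes_sent - prev.bytes_sent) / elapsed,
        "rx_bytes_per_s": (curr.bytes_recv - prev.bytes_recv) / elapsed,
    }


earlier = NetSample(timestamp=0.0, bytes_sent=1_000, bytes_recv=2_000)
later = NetSample(timestamp=2.0, bytes_sent=3_000, bytes_recv=8_000)
print(throughput(earlier, later))
# {'tx_bytes_per_s': 1000.0, 'rx_bytes_per_s': 3000.0}
```

Without `network_mode: host`, the same arithmetic would run on the container's own interface counters and report only the container's traffic.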
---
## Why the Forgejo Repo for Documentation?
You could keep documentation in Notion, Google Docs, or a wiki.
**Why Forgejo:**
- It's self-hosted — you own the data
- Git tracks every change with a timestamp and message
- The documentation lives alongside the configs it describes
- Hiring managers can see the commit history and read your documentation directly
**What this shows to a hiring manager:** You treat documentation like code — version-controlled, structured, maintained.

221
architecture/overview.md Normal file

@@ -0,0 +1,221 @@
# KiteStacks Architecture — Full System Overview
## The Big Picture
```
                      INTERNET
                   ┌──────▼──────┐
                   │ Cloudflare  │  DNS + TLS termination
                   │   (edge)    │  Zero Trust Tunnel
                   └──────┬──────┘
                          │ HTTPS (443) only
         ┌────────────────┼────────────────┐
         │ connector 1    │ connector 2    │ connector 3
         │                │                │
  ┌──────▼──────┐         │         ┌──────▼──────┐
  │    MONK     │         │         │  KSCLOUD1   │
  │  (home PC)  │         │         │ (Hetzner VPS│
  │             │ Active  │         │  5.78.x.x)  │
  │    All 9    │ Active  │         │             │
  │  services   │         │         │    All 9    │
  │             │         │         │  services   │
  └──────┬──────┘         │         └──────┬──────┘
         │                │                │
         └────────────────┼────────────────┘
                    TAILSCALE VPN
                  (100.x.x.x range)
                 ┌────────▼────────┐
                 │ SHARED DB LAYER │
                 │   on kscloud1   │
                 │ Postgres :5432  │
                 │ Redis    :6379  │
                 │   (Tailscale    │
                 │  only, private) │
                 └─────────────────┘
```
---
## Every Service and What It Does
### The Nine Public Services
| Service | Container Name | What It Does | Why It's Here |
|---------|---------------|--------------|---------------|
| **Portal** | `homepage` | The public website (kitestacks.com) — custom nginx serving static HTML/CSS/JS with a cyberpunk theme | Front door to everything. Shows system stats, recent activity, links to all services |
| **Authentik** | `authentik` | Identity provider — handles all logins via OIDC/OAuth2 SSO | Single place to manage all user accounts and access control |
| **Forgejo** | `forgejo` | Self-hosted Git platform (like GitHub but yours) | Store all homelab code, config, and documentation |
| **OpenProject** | `openproject` | Project management (like Jira) | Task tracking, project planning |
| **Open WebUI** | `kite-openwebui` | ChatGPT-like AI chat interface | Access multiple AI models through one interface |
| **Karakeep** | `karakeep` | Bookmark and read-it-later manager | Save links, articles, and content |
| **Kavita** | `kavita` | eBook and manga reader | Personal digital library |
| **Grafana** | `grafana` | Monitoring dashboards | Visualize CPU, RAM, network, uptime across both hosts |
| **Uptime Kuma** | `uptime-kuma` | Status page and uptime monitoring | Monitor that all 9 services are up and alert if they go down |
### The Infrastructure Services (Not Public-Facing)
| Service | What It Does |
|---------|-------------|
| `cloudflared` | Cloudflare Tunnel connector — creates encrypted outbound tunnel to Cloudflare edge |
| `prometheus` | Metrics collection — scrapes system stats from both monk and kscloud1 every 15 seconds |
| `node-exporter` | Exposes host system metrics (CPU, RAM, disk, network) for Prometheus to scrape |
| `kite-litellm` | LLM proxy gateway — routes AI requests to OpenRouter (multiple free models) |
| `portainer` | Docker management UI — visual interface to manage all containers |
| `kitestacks-metrics-api` | Python FastAPI service — serves real-time system stats, weather, and Forgejo activity to the portal |
---
## How Traffic Flows
### When Someone Visits www.kitestacks.com
```
1. Browser sends HTTPS request to www.kitestacks.com
2. DNS resolves to Cloudflare's anycast IP (not your home IP)
3. Cloudflare terminates TLS at its edge; your home router never receives inbound HTTPS
4. Cloudflare routes the request through the tunnel to whichever
cloudflared connector responds first (monk or kscloud1)
5. cloudflared resolves "homepage" via Docker DNS
6. Request hits the nginx container serving the static portal
7. Portal's JavaScript fetches /api/metrics and /api/activity
from the kitestacks-metrics-api container via nginx proxy
8. Page renders with live system stats and recent git activity
```
### When Someone Clicks "Sign In with Authentik"
```
1. App (e.g., Grafana) redirects browser to auth.kitestacks.com/application/o/authorize/
2. Authentik presents login page
3. User enters credentials — Authentik validates against its database
(stored on kscloud1's Postgres, shared over Tailscale)
4. Authentik generates an authorization code and redirects back to Grafana
5. Grafana's backend calls auth.kitestacks.com/application/o/token/
to exchange the code for an access token
6. Authentik validates the code (found in shared DB) and returns a JWT
7. Grafana reads the user's email/name from the JWT and logs them in
```
**The critical detail:** The browser-facing steps (1 to 4) and Grafana's back-channel token call (step 5) can hit different tunnel connectors (monk vs kscloud1). The authorization code issued in step 4 must exist in whichever database step 5 queries. That's why both Authentik instances point to the SAME Postgres on kscloud1; otherwise step 5 returns `invalid_grant` because the code isn't found.
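The two requests in that flow can be sketched with the standard library. The endpoint paths come from the steps above; the client ID, redirect URI, scope list, and secret are hypothetical placeholders, and the parameters follow the generic OAuth2 authorization code grant rather than any Authentik-specific API.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ISSUER = "https://auth.kitestacks.com/application/o"


def build_authorize_url(client_id: str, redirect_uri: str, state: str) -> str:
    """Step 1: the URL the app redirects the browser to."""
    query = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": "openid email profile",  # assumed scope list
        "state": state,
    })
    return f"{ISSUER}/authorize/?{query}"


def build_token_request(
    code: str, client_id: str, client_secret: str, redirect_uri: str
) -> tuple[str, dict[str, str]]:
    """Step 5: endpoint and form body the app's backend POSTs."""
    form = {
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": redirect_uri,
        "client_id": client_id,
        "client_secret": client_secret,
    }
    return f"{ISSUER}/token/", form


# Hypothetical client values, for illustration only.
REDIRECT = "https://grafana.example.com/login/generic_oauth"
url = build_authorize_url("grafana", REDIRECT, state="abc123")
endpoint, form = build_token_request("one-time-code", "grafana", "s3cret", REDIRECT)

print(urlparse(url).path)                   # /application/o/authorize/
print(parse_qs(urlparse(url).query)["response_type"])  # ['code']
print(endpoint)  # https://auth.kitestacks.com/application/o/token/
```

The first request travels through the user's browser and the second is a server-to-server call, which is why the two can arrive through different connectors.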
---
## The Two Hosts in Detail
### Monk (Primary Home Machine)
- **Role:** Primary production host
- **Network:** Home LAN, no open ports on router (Cloudflare Tunnel handles all inbound)
- **Services:** All 9 public services + all infrastructure services
- **Data:** Each service has its own database/storage
- **Authentik DB:** Points to kscloud1's Postgres over Tailscale (100.x.x.x)
### kscloud1 (Hetzner VPS)
- **Role:** Permanent cloud replica — always on, even when monk is off (travel, power outage, etc.)
- **Network:** Public IP, Cloudflare Tunnel connector 3
- **Services:** Full replica of all 9 public services (separate databases except Authentik)
- **Hosts:** The shared Authentik Postgres + Redis (bound to Tailscale interface only)
- **Resources:** 3 vCPU, 3.7 GB RAM — tight but functional
### What's the Same Across Both
- Same Cloudflare Tunnel token (different connector IDs assigned automatically)
- Same Authentik database (shared via Tailscale)
- Same Authentik secret key (required for JWT validation)
- Same kavita.db (one-time sync — users and OIDC config)
### What's Different Across Both
- Forgejo data (separate repos — accepted inconsistency)
- OpenProject data (separate projects)
- Karakeep bookmarks (separate)
- Kavita book files (monk has them, kscloud1 doesn't — covers synced, books not)
---
## The Docker Network
Every container that must be reachable by name joins the `kitestacks` external Docker bridge network (the metrics API is the exception, since it uses host networking):
```bash
docker network create kitestacks
```
This is what makes Cloudflare Tunnel work. The cloudflared container is also on this network, so when Cloudflare tells cloudflared to route `http://grafana:3000`, Docker's internal DNS resolves `grafana` to the grafana container's IP on that network.
Without this shared network, cloudflared can't reach the service containers by name.
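The routing step works like a first-match lookup from public hostname to internal service URL. The sketch below models that lookup in plain Python. The rule table is hypothetical: only `http://grafana:3000` and the `homepage` container name appear in this document, while the Grafana hostname and the portal's port are assumptions.

```python
# Hypothetical ingress table: (public hostname, internal service URL).
INGRESS_RULES = [
    ("www.kitestacks.com", "http://homepage:80"),       # port assumed
    ("grafana.kitestacks.com", "http://grafana:3000"),  # hostname assumed
]

# cloudflared ingress configs end with a catch-all rule such as this one.
FALLBACK = "http_status:404"


def resolve_origin(hostname: str) -> str:
    """Return the internal service for a hostname; first match wins."""
    for rule_host, service in INGRESS_RULES:
        if hostname == rule_host:
            return service
    return FALLBACK


print(resolve_origin("grafana.kitestacks.com"))  # http://grafana:3000
print(resolve_origin("unknown.kitestacks.com"))  # http_status:404
```

The service URLs use container names rather than IPs, so they only resolve when cloudflared and the target container share a Docker network.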
---
## Why No Open Ports on the Router
Traditional homelab: open port 80/443 on home router → NAT to home server → expose home IP.
Problems with that:
- Your home IP is public (DDoS risk, targeted attacks)
- Router configuration is fragile
- ISP can change your IP (dynamic IP)
- Some ISPs block port 80/443
Cloudflare Tunnel approach:
- cloudflared container makes an OUTBOUND connection to Cloudflare
- Cloudflare holds that connection open
- Inbound requests come through Cloudflare, over that existing outbound tunnel
- Your home IP is never exposed
- Works on nearly any network or ISP, since it only needs outbound connectivity to Cloudflare
This is why you can run a public website from a home PC with zero router configuration.
---
## Tailscale — The Private Backbone
Tailscale creates a private overlay network (VPN mesh) across all your devices:
```
monk (100.x.x.x) ←—— encrypted ——→ kscloud1 (100.x.x.x)
monk (100.x.x.x) ←—— encrypted ——→ pixel-6 (100.x.x.x)
```
Used in this project for:
1. **Shared Authentik DB:** kscloud1's Postgres binds to its Tailscale IP, not its public IP. Only devices on the tailnet can connect. Monk points to that address.
2. **Forgejo activity feed:** On kscloud1, the metrics API fetches recent commits from monk's Forgejo via monk's Tailscale IP — so both portal instances show the same activity feed.
3. **SSH/Admin access:** You can SSH into any device on the tailnet from anywhere.
---
## The Monitoring Stack
```
node-exporter (monk) → prometheus (monk) → grafana (monk)
node-exporter (kscloud1) ↗ (scrapes 5.78.x.x:9100)
```
Prometheus scrapes metrics every 15 seconds from:
- `node-exporter:9100` — monk's own node-exporter (via Docker DNS)
- `5.78.x.x:9100` — kscloud1's node-exporter (via public IP, port exposed 0.0.0.0)
Grafana visualizes both, letting you switch between hosts in the instance picker.
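What Prometheus scrapes from node-exporter is plain text, one sample per line. A minimal sketch of reading that format and deriving memory usage from it; the parser handles only unlabelled samples without timestamps, and the sample payload is made up for illustration.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse unlabelled samples from Prometheus text exposition format."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        if "{" in name:
            continue  # skip labelled series such as per-CPU counters
        samples[name] = float(value)
    return samples


def memory_used_percent(samples: dict[str, float]) -> float:
    """Percentage of RAM in use, from two node-exporter gauges."""
    total = samples["node_memory_MemTotal_bytes"]
    available = samples["node_memory_MemAvailable_bytes"]
    return 100.0 * (total - available) / total


# Made-up scrape output in the shape node-exporter serves on :9100/metrics.
SAMPLE = """\
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 4e+09
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 1e+09
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
"""

print(memory_used_percent(parse_metrics(SAMPLE)))  # 75.0
```

In Grafana the same calculation would normally be written as a PromQL expression over these two metrics rather than computed by hand.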
---
## The Portal Architecture
The portal is NOT gethomepage or any pre-built dashboard. It's a custom-built static site:
```
nginx (container: "homepage")
├── / → serves static HTML/CSS/JS from ./public/
└── /api/* → proxy_pass to kitestacks-metrics-api:8000 (host)
kitestacks-metrics-api (network_mode: host, pid: host)
├── GET /api/metrics → psutil reads HOST's CPU/RAM/disk/network
├── GET /api/weather → wttr.in API → current weather by IP geolocation
├── GET /api/activity → Forgejo API → recent commits
└── GET /api/health → {"ok": true}
```
The metrics API runs with `network_mode: host` and `pid: host` so it reads the HOST machine's process table and `/proc` filesystem, not the container's. Without this, it would report container-level stats rather than the host's.
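The shape of the health endpoint can be sketched with the standard library alone. This is a stand-in, not the project's code: the real service uses FastAPI, and `handle_get`, `ApiHandler`, and `serve` are hypothetical names. Only the `/api/health` path, its `{"ok": true}` body, and port 8000 come from the diagram above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_get(path: str) -> tuple[int, dict]:
    """Map a request path to (status code, JSON body)."""
    if path == "/api/health":
        return 200, {"ok": True}
    return 404, {"error": "not found"}


class ApiHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around handle_get."""

    def do_GET(self) -> None:
        status, body = handle_get(self.path)
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Run the server; 8000 matches the proxy_pass target shown above."""
    HTTPServer(("127.0.0.1", port), ApiHandler).serve_forever()


print(json.dumps(handle_get("/api/health")[1]))  # {"ok": true}
```

Keeping the routing in a plain function separate from the HTTP handler makes the endpoint logic testable without starting a server.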