init: complete homelab mastery guide
Architecture overview, design decisions, Docker/networking/OAuth2/Linux concept deep-dives, cert roadmap for cloud engineering track, interview prep with model answers, and structured learning path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
commit
ca9e8a7959
10 changed files with 1534 additions and 0 deletions
199
architecture/decisions.md
Normal file
@@ -0,0 +1,199 @@
# Architecture Decisions: The Why Behind Every Choice

For every technology choice, there was a reason. Understanding the "why" is what separates someone who copied commands from someone who designed a system.

---

## Why Docker Instead of Running Services Directly?

**Problem:** Running 15+ services directly on a Linux host creates dependency hell: different Python versions, conflicting library versions, services affecting each other.

**Options considered:**
- Bare metal: install each app directly on the OS
- Virtual machines: one VM per service
- Docker containers: isolated processes with their own dependencies

**Decision:** Docker

**Why:**
- Each container has its own filesystem, dependencies, and runtime, so they can't conflict
- Starting/stopping/updating one service doesn't affect others
- The `docker-compose.yml` file IS the documentation: it shows exactly what the service needs to run
- Portability: move the same compose file (plus its volumes and environment) to a new machine and the service comes up the same way
- Isolation: if Karakeep gets compromised, it can't easily touch Forgejo's data

**What you'd say to a hiring manager:** *"I containerized every service using Docker and Docker Compose so each has isolated dependencies and the entire deployment is reproducible from a single YAML file."*

---

## Why Cloudflare Tunnel Instead of Port Forwarding?

**Problem:** How do you make home services accessible from the internet?

**Traditional approach:** Open ports 80 and 443 on the home router, configure NAT, point DNS to the home IP.

**Problems with that:**
- Exposes your home IP address publicly (DDoS risk, and anyone can scan or target it)
- A dynamic home IP means DNS breaks every time the IP changes
- Some ISPs block inbound ports 80/443 on residential connections
- Router configuration is error-prone and varies by hardware

**Decision:** Cloudflare Tunnel (cloudflared)

**Why:**
- cloudflared makes an OUTBOUND connection to Cloudflare, so no inbound ports are needed
- The home IP is never exposed
- Works regardless of ISP restrictions on inbound ports
- Cloudflare handles TLS/HTTPS at the edge, so you don't manage certificates for the public hostnames
- Free tier covers everything needed
- Bonus: built-in DDoS protection

**The trade-off:** You depend on Cloudflare. If Cloudflare has an outage, your site goes down even if your hardware is fine. This is acceptable: Cloudflare's uptime is better than most home internet connections.

---

## Why Authentik for SSO Instead of Separate Logins Per App?

**Problem:** Nine services mean nine different usernames and passwords to manage. Adding a user requires going into nine admin panels. Removing access means nine places to deactivate.

**Options:**
- Separate logins per service (no SSO)
- Authelia (simpler, built primarily around forward-auth)
- Authentik (full OIDC provider, more complex)
- Keycloak (enterprise-grade, very heavy)

**Decision:** Authentik

**Why:**
- One account controls access to everything
- Apps that support native OIDC (Grafana, Kavita, Open WebUI, Karakeep) get real SSO: the user is authenticated inside the app
- Can restrict which groups can access which applications (Portainer is restricted to the homelab-admin group)
- Self-hosted: user data stays on your infrastructure
- Authentik supports both native OIDC (for apps that support it) and a proxy provider (for apps that don't)

**The trade-off:** Authentik is complex to set up and has a significant memory footprint. Authelia would be simpler, but it is built primarily around the forward-auth proxy pattern, and its OpenID Connect provider support is more limited. Authentik covers both patterns in one place.

---

## Why a Shared Postgres Instead of Separate Authentik Databases?

**Problem:** After setting up active-active failover, users kept getting `invalid_grant` errors when signing in through SSO.

**Root cause:** OAuth2 authorization codes are rows in a database. The flow is:
1. `/authorize` → code stored in Database A (monk's Authentik)
2. `/token` → looks for code in Database B (kscloud1's Authentik)
3. Code not found → `invalid_grant`

With both hosts connected to the same tunnel, Cloudflare can send any individual HTTP request to either monk or kscloud1, so steps 1 and 2 of the OAuth flow can hit different hosts.

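
The failure mode above can be reproduced in a few lines. This is a toy simulation, not Authentik's code: the `AuthServer` class and its dictionary-backed code store are hypothetical stand-ins for an Authentik instance and its Postgres table.

```python
import secrets


class AuthServer:
    """Toy stand-in for one Authentik instance backed by a code store."""

    def __init__(self, name: str, code_store: dict[str, str]) -> None:
        self.name = name
        self.code_store = code_store  # plays the role of the Postgres table

    def authorize(self, user: str) -> str:
        """Step 1: issue a one-time authorization code and persist it."""
        code = secrets.token_urlsafe(16)
        self.code_store[code] = user
        return code

    def token(self, code: str) -> dict[str, str]:
        """Step 2: redeem the code; unknown codes yield invalid_grant."""
        user = self.code_store.pop(code, None)
        if user is None:
            return {"error": "invalid_grant"}
        return {"access_token": f"token-for-{user}"}


# Separate databases: the code issued by monk is unknown to kscloud1.
monk = AuthServer("monk", code_store={})
kscloud1 = AuthServer("kscloud1", code_store={})
split_result = kscloud1.token(monk.authorize("alice"))

# Shared database: both instances read and write the same store.
shared: dict[str, str] = {}
monk = AuthServer("monk", code_store=shared)
kscloud1 = AuthServer("kscloud1", code_store=shared)
shared_result = kscloud1.token(monk.authorize("alice"))

print(split_result)   # {'error': 'invalid_grant'}
print(shared_result)  # {'access_token': 'token-for-alice'}
```

The only difference between the two runs is whether both instances were handed the same store, which is the whole argument for the shared database.
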

**Options:**
- Sync databases continuously (complex, slow, conflict-prone)
- Use sticky sessions (Cloudflare paid feature)
- Share one database (simple, reliable)

**Decision:** Shared Postgres on kscloud1, accessible only over Tailscale

**Why:**
- Both monk's and kscloud1's Authentik read/write the same database, so authorization codes are always found
- Binding Postgres to the Tailscale interface means the database is never exposed to the public internet
- Simple: a one-line change in each `docker-compose.yml` to point at the shared host
- Cost: free (already paying for kscloud1)

**The trade-off:** kscloud1 becomes a single point of failure for logins. If kscloud1 goes down or Tailscale connectivity breaks, monk's Authentik can't reach its database and can't start. Rollback procedure: restore monk's compose to use a local Postgres.

---

## Why Tailscale Instead of WireGuard or OpenVPN?

**Problem:** Need private networking between monk (home) and kscloud1 (Hetzner cloud) without exposing the Authentik database to the public internet.

**Options:**
- WireGuard: manual key exchange, manual routing, fiddly to configure
- OpenVPN: even more complex, slower
- Tailscale: managed WireGuard, automatic key exchange, works behind NAT

**Decision:** Tailscale

**Why:**
- Fast to set up: install, authenticate, done
- Handles NAT traversal automatically (monk is behind home router NAT)
- Devices get stable 100.x.x.x IPs regardless of actual network location
- Free for up to 100 devices (personal plan, at the time of writing)
- Uses WireGuard under the hood: same encryption, much easier configuration

**The trade-off:** Tailscale is a managed service, so you trust Tailscale's coordination servers. The actual data is encrypted end to end between devices (Tailscale can't see it), but they control device authentication. Self-hosted alternative: Headscale.

---

## Why Active-Active Instead of Active-Passive Failover?

**The context:** The user travels. When away from home, monk might be inaccessible (home network down, ISP outage, power). kscloud1 should keep the site running.

**Active-Passive:** kscloud1 only starts serving if monk is detected as down. Cloudflare would need health checks and failover rules.

**Active-Active:** Both monk and kscloud1 are always in the Cloudflare Tunnel rotation. Every request might hit either host.

**Decision:** Active-Active

**Why:**
- Simpler: no health checks to configure, no failover logic
- Instant: if monk goes down, kscloud1 is already connected and serving traffic, so nothing has to be switched over
- Free: running several connectors on one tunnel costs nothing; health-check-based failover requires paid plans

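
The uptime argument can be made concrete with a small calculation. This is a sketch under two stated assumptions: the two hosts fail independently, and the availability figures below are hypothetical placeholders rather than measurements. It also only applies to services that need nothing from the other host; Authentik logins depend on the shared Postgres on kscloud1, so they do not get this benefit.

```python
def combined_availability(*host_availabilities: float) -> float:
    """Probability that at least one host is up, assuming independent failures."""
    all_down = 1.0
    for availability in host_availabilities:
        all_down *= 1.0 - availability
    return 1.0 - all_down


# Hypothetical figures, not measurements from this homelab.
monk_uptime = 0.98       # home connection: ISP and power outages
kscloud1_uptime = 0.999  # cloud VPS

either_up = combined_availability(monk_uptime, kscloud1_uptime)
print(f"{either_up:.5f}")
```

With these example inputs, the chance that both hosts are down at once is 0.02 × 0.001, which is why adding a second always-on host helps far more than improving either host alone.
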

**The trade-off:** Stateful apps (Forgejo, OpenProject, Kavita) have separate databases on each host. A user might see different data depending on which host answers. This was explicitly accepted: the point is uptime, not data consistency across hosts.

---

## Why nginx for the Portal Instead of a Pre-Built Dashboard?

**Options:**
- gethomepage (what was used before): nice but limited customization
- Heimdall: similar limitations
- Custom static site + nginx: full control

**Decision:** Custom static HTML/CSS/JS + nginx

**Why:**
- Complete visual control: the cyberpunk theme, the layout, every pixel
- Static files served by nginx are extremely fast and reliable
- nginx can proxy the metrics API under the same origin, so the real-time stats need no CORS configuration
- No framework dependencies: no Node.js, no build step, just files

**The trade-off:** More work to build and maintain than a pre-built dashboard. But you now understand every line of it.

---

## Why Python + FastAPI for the Metrics API?

**Problem:** The portal needs real-time system stats (CPU, RAM, network), weather, and Forgejo activity. These can't come from static HTML files.

**Options:**
- Shell scripts + cron → write stats to a JSON file the frontend reads
- Node.js + Express
- Python + FastAPI

**Decision:** Python FastAPI

**Why:**
- Python's `psutil` library reads a system metric with a single call, such as `psutil.cpu_percent()`
- FastAPI is modern, fast, and generates interactive API documentation automatically
- `async/await` means the API doesn't block while waiting for weather API responses
- Python is readable, so you can understand and modify the code

**The special requirement:** The container needs `network_mode: host` and `pid: host`. What each setting provides:
- `network_mode: host`: the container sees the host's network interfaces, so it reports real host throughput rather than the traffic of its own virtual interface
- `pid: host`: the container shares the host's PID namespace, so psutil sees the host's processes in `/proc` rather than only the container's own

---

## Why the Forgejo Repo for Documentation?

You could keep documentation in Notion, Google Docs, or a wiki.

**Why Forgejo:**
- It's self-hosted, so you own the data
- Git tracks every change with a timestamp and message
- The documentation lives alongside the configs it describes
- Hiring managers can see the commit history and read your documentation directly

**What this shows to a hiring manager:** You treat documentation like code: version-controlled, structured, maintained.

221
architecture/overview.md
Normal file
@@ -0,0 +1,221 @@
# KiteStacks Architecture: Full System Overview

## The Big Picture

```
                       INTERNET
                           │
                    ┌──────▼──────┐
                    │ Cloudflare  │  DNS + TLS termination
                    │   (edge)    │  Zero Trust Tunnel
                    └──────┬──────┘
                           │ HTTPS (443) only
          ┌────────────────┼────────────────┐
          │ connector 1    │ connector 2    │ connector 3
          │                │                │
   ┌──────▼──────┐         │         ┌──────▼──────┐
   │    MONK     │         │         │  KSCLOUD1   │
   │  (home PC)  │         │         │ (Hetzner VPS│
   │             │ Active  │         │  5.78.x.x)  │
   │  All 9      │ Active  │         │             │
   │  services   │         │         │  All 9      │
   │             │         │         │  services   │
   └──────┬──────┘         │         └──────┬──────┘
          │                │                │
          └────────────────┼────────────────┘
                     TAILSCALE VPN
                   (100.x.x.x range)
                           │
                  ┌────────▼────────┐
                  │ SHARED DB LAYER │
                  │   on kscloud1   │
                  │ Postgres :5432  │
                  │ Redis    :6379  │
                  │ (Tailscale      │
                  │  only, private) │
                  └─────────────────┘
```

---

## Every Service and What It Does

### The Nine Public Services

| Service | Container Name | What It Does | Why It's Here |
|---------|---------------|--------------|---------------|
| **Portal** | `homepage` | The public website (kitestacks.com): custom nginx serving static HTML/CSS/JS with a cyberpunk theme | Front door to everything. Shows system stats, recent activity, links to all services |
| **Authentik** | `authentik` | Identity provider: handles all logins via OIDC/OAuth2 SSO | Single place to manage all user accounts and access control |
| **Forgejo** | `forgejo` | Self-hosted Git platform (like GitHub but yours) | Store all homelab code, config, and documentation |
| **OpenProject** | `openproject` | Project management (like Jira) | Task tracking, project planning |
| **Open WebUI** | `kite-openwebui` | ChatGPT-like AI chat interface | Access multiple AI models through one interface |
| **Karakeep** | `karakeep` | Bookmark and read-it-later manager | Save links, articles, and content |
| **Kavita** | `kavita` | eBook and manga reader | Personal digital library |
| **Grafana** | `grafana` | Monitoring dashboards | Visualize CPU, RAM, network, uptime across both hosts |
| **Uptime Kuma** | `uptime-kuma` | Status page and uptime monitoring | Monitor that all 9 services are up and alert if they go down |

### The Infrastructure Services (Not Public-Facing)

| Service | What It Does |
|---------|-------------|
| `cloudflared` | Cloudflare Tunnel connector: creates an encrypted outbound tunnel to the Cloudflare edge |
| `prometheus` | Metrics collection: scrapes system stats from both monk and kscloud1 every 15 seconds |
| `node-exporter` | Exposes host system metrics (CPU, RAM, disk, network) for Prometheus to scrape |
| `kite-litellm` | LLM proxy gateway: routes AI requests to OpenRouter (multiple free models) |
| `portainer` | Docker management UI: visual interface to manage all containers |
| `kitestacks-metrics-api` | Python FastAPI service: serves real-time system stats, weather, and Forgejo activity to the portal |

---

## How Traffic Flows

### When Someone Visits www.kitestacks.com

```
1. Browser sends HTTPS request to www.kitestacks.com
2. DNS resolves to Cloudflare's anycast IP (not your home IP)
3. Cloudflare terminates TLS at its edge; your home router never
   receives an inbound HTTPS connection
4. Cloudflare routes the request through the tunnel to one of the
   connected cloudflared connectors (monk or kscloud1)
5. cloudflared resolves "homepage" via Docker DNS
6. Request hits the nginx container serving the static portal
7. Portal's JavaScript fetches /api/metrics and /api/activity
   from the kitestacks-metrics-api container via nginx proxy
8. Page renders with live system stats and recent git activity
```

### When Someone Clicks "Sign In with Authentik"

```
1. App (e.g., Grafana) redirects browser to auth.kitestacks.com/application/o/authorize/
2. Authentik presents login page
3. User enters credentials; Authentik validates against its database
   (stored on kscloud1's Postgres, shared over Tailscale)
4. Authentik generates an authorization code and redirects back to Grafana
5. Grafana's backend calls auth.kitestacks.com/application/o/token/
   to exchange the code for an access token
6. Authentik validates the code (found in shared DB) and returns a JWT
7. Grafana reads the user's email/name from the JWT and logs them in
```

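
To show what "reads the user's email/name from the JWT" means in step 7, here is a minimal sketch using only the standard library. The token is a fake, unsigned one built on the spot, and the claim values are hypothetical. A real client such as Grafana also verifies the token's signature before trusting it; this sketch deliberately skips that and only shows the payload layout.

```python
import base64
import json


def b64url_encode(data: bytes) -> str:
    """Base64url-encode without padding, as JWT segments are written."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def decode_jwt_payload(token: str) -> dict:
    """Return the claims from a JWT WITHOUT verifying its signature."""
    _header_b64, payload_b64, _signature_b64 = token.split(".")
    # Restore the padding that JWT encoding strips off.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# Build a fake, unsigned token with hypothetical claims for the demo.
header = b64url_encode(json.dumps({"alg": "none", "typ": "JWT"}).encode())
payload = b64url_encode(
    json.dumps({"email": "alice@example.com", "name": "Alice"}).encode()
)
fake_token = f"{header}.{payload}."

claims = decode_jwt_payload(fake_token)
print(claims["email"], claims["name"])  # alice@example.com Alice
```
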

**The critical detail:** Steps 1 and 5 can hit different tunnel connectors (monk vs kscloud1), and therefore different Authentik instances. The authorization code from step 4 must exist in whichever database step 5 hits. That's why both Authentik instances point to the SAME Postgres on kscloud1. Otherwise step 5 returns `invalid_grant` because the code isn't found.

---

## The Two Hosts in Detail

### Monk (Primary Home Machine)

- **Role:** Primary production host
- **Network:** Home LAN, no open ports on router (Cloudflare Tunnel handles all inbound)
- **Services:** All 9 public services + all infrastructure services
- **Data:** Each service except Authentik has its own local database/storage
- **Authentik DB:** Points to kscloud1's Postgres over Tailscale (100.x.x.x)

### kscloud1 (Hetzner VPS)

- **Role:** Permanent cloud replica, always on, even when monk is off (travel, power outage, etc.)
- **Network:** Public IP, Cloudflare Tunnel connector 3
- **Services:** Full replica of all 9 public services (separate databases except Authentik)
- **Hosts:** The shared Authentik Postgres + Redis (bound to Tailscale interface only)
- **Resources:** 3 vCPU, 3.7 GB RAM (tight but functional)

### What's the Same Across Both

- Same Cloudflare Tunnel token (different connector IDs assigned automatically)
- Same Authentik database (shared via Tailscale)
- Same Authentik secret key (used for cookie signing, so a session started on one instance is valid on the other)
- Same kavita.db (one-time sync of users and OIDC config)

### What's Different Across Both

- Forgejo data (separate repos, an accepted inconsistency)
- OpenProject data (separate projects)
- Karakeep bookmarks (separate)
- Kavita book files (monk has them, kscloud1 doesn't: covers synced, books not)

---

## The Docker Network

Every container joins the `kitestacks` external Docker bridge network:

```bash
docker network create kitestacks
```

This is what lets cloudflared reach the services by name. The cloudflared container is also on this network, so when a tunnel route points at `http://grafana:3000`, Docker's internal DNS resolves `grafana` to the grafana container's IP on that network.

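
The name-based routing described above boils down to a lookup from public hostname to an internal origin, followed by a DNS lookup of the container name. The sketch below models only the first half. The `INGRESS` table is hypothetical: `grafana.kitestacks.com` and the portal's port 80 are assumptions for illustration, not values taken from the real tunnel configuration.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Hypothetical ingress table: public hostname -> internal origin.
# The origin host is a container name that Docker's DNS resolves
# for any container attached to the shared network.
INGRESS = {
    "www.kitestacks.com": "http://homepage:80",
    "grafana.kitestacks.com": "http://grafana:3000",
}


def origin_for(hostname: str) -> Optional[Tuple[str, int]]:
    """Return (container name, port) for a public hostname, or None."""
    origin = INGRESS.get(hostname.lower())
    if origin is None:
        return None
    parts = urlsplit(origin)
    return parts.hostname, parts.port


print(origin_for("grafana.kitestacks.com"))  # ('grafana', 3000)
print(origin_for("unknown.example.com"))     # None
```

The container name in each origin only means something inside the shared network, which is the reason every service has to join it.
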

Without this shared network, cloudflared can't reach the service containers by name.

---

## Why No Open Ports on the Router

Traditional homelab: open port 80/443 on home router → NAT to home server → expose home IP.

Problems with that:
- Your home IP is public (DDoS risk, targeted attacks)
- Router configuration is fragile
- ISP can change your IP (dynamic IP)
- Some ISPs block port 80/443

Cloudflare Tunnel approach:
- cloudflared container makes an OUTBOUND connection to Cloudflare
- Cloudflare holds that connection open
- Inbound requests come through Cloudflare, over that existing outbound tunnel
- Your home IP is never exposed
- Works on any network or ISP that allows the outbound connection to Cloudflare

This is why you can run a public website from a home PC with zero router configuration.

---

## Tailscale: The Private Backbone

Tailscale creates a private overlay network (VPN mesh) across all your devices:

```
monk (100.x.x.x) ←── encrypted ──→ kscloud1 (100.x.x.x)
monk (100.x.x.x) ←── encrypted ──→ pixel-6 (100.x.x.x)
```

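
Tailscale assigns these addresses from the 100.64.0.0/10 carrier-grade NAT block, so not every address starting with `100.` is a tailnet address. A small guard like the one below can confirm that a bind address for a private service really is a Tailscale one before it is used. The function name and the sample addresses are hypothetical.

```python
import ipaddress

# Tailscale hands out IPv4 addresses from the CGNAT block.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_tailnet_address(addr: str) -> bool:
    """True if addr falls inside the range Tailscale assigns from."""
    return ipaddress.ip_address(addr) in TAILSCALE_RANGE


print(is_tailnet_address("100.101.102.103"))  # True
print(is_tailnet_address("100.1.2.3"))        # False: outside 100.64.0.0/10
print(is_tailnet_address("0.0.0.0"))          # False: would listen on every interface
```
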

Used in this project for:
1. **Shared Authentik DB:** kscloud1's Postgres binds to its Tailscale IP, not its public IP. Only devices on the tailnet can connect. Monk points to that address.
2. **Forgejo activity feed:** On kscloud1, the metrics API fetches recent commits from monk's Forgejo via monk's Tailscale IP, so both portal instances show the same activity feed.
3. **SSH/Admin access:** You can SSH into any device on the tailnet from anywhere.

---

## The Monitoring Stack

```
node-exporter (monk)     → prometheus (monk) → grafana (monk)
node-exporter (kscloud1) ↗  (scrapes 5.78.x.x:9100)
```

Prometheus scrapes metrics every 15 seconds from:
- `node-exporter:9100`: monk's own node-exporter (via Docker DNS)
- `5.78.x.x:9100`: kscloud1's node-exporter (via public IP, port exposed on 0.0.0.0)

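
What Prometheus receives on each scrape is plain text, one sample per line. The sketch below parses a hand-written sample of that format to show its shape. It is not how Prometheus itself parses: labelled series are skipped, and it assumes no trailing timestamp on the line. The values are made up.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse unlabelled samples from Prometheus text exposition output."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        if "{" in name:
            continue  # labelled series are skipped in this sketch
        samples[name] = float(value)
    return samples


# Hand-written sample in the style of node-exporter output; values are made up.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_memory_MemAvailable_bytes 2.147483648e+09
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
"""

metrics = parse_metrics(SAMPLE)
print(metrics["node_load1"])  # 0.42
```
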

Grafana visualizes both, letting you switch between hosts in the instance picker.

---

## The Portal Architecture

The portal is NOT gethomepage or any pre-built dashboard. It's a custom-built static site:

```
nginx (container: "homepage")
├── /       → serves static HTML/CSS/JS from ./public/
└── /api/*  → proxy_pass to kitestacks-metrics-api:8000 (host)

kitestacks-metrics-api (network_mode: host, pid: host)
├── GET /api/metrics   → psutil reads HOST's CPU/RAM/disk/network
├── GET /api/weather   → wttr.in API → current weather by IP geolocation
├── GET /api/activity  → Forgejo API → recent commits
└── GET /api/health    → {"ok": true}
```

The metrics API runs with `network_mode: host` and `pid: host` so it sees the host's network interfaces and process table rather than the container's own isolated view. Without these settings, the network and per-process numbers would describe the container, not the host machine.
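
To make the `/proc` dependency concrete, here is a standard-library sketch of the kind of reading that psutil does on Linux for memory stats. It is a stand-in for illustration, not the project's metrics code: the function names are hypothetical, and the sample text is made up so the example runs anywhere.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields


def memory_used_percent(fields: dict[str, int]) -> float:
    """Used memory as a percentage, based on MemTotal and MemAvailable."""
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return round(100.0 * (total - available) / total, 1)


# Made-up sample in the format of /proc/meminfo.
SAMPLE = """\
MemTotal:        3880000 kB
MemFree:          200000 kB
MemAvailable:     970000 kB
"""

meminfo = parse_meminfo(SAMPLE)
print(memory_used_percent(meminfo))  # 75.0
```

On a Linux host the same parser can be fed the contents of the real `/proc/meminfo` file instead of the sample string.
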