kenpat 1e8319ee75 docs: comprehensive homelab-mastery rewrite with full build guides

Complete documentation suite for KiteStacks covering all 11 services across
2-host active-active architecture. Includes beginner track (with AI, 8 files)
and advanced track (without AI, 7 files) with time estimates, real troubleshooting
cases, and command-by-command explanations. Updates certifications roadmap to
reflect July 7 2026 A+ Core 2 exam goal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-06-19 01:08:43 -05:00


Architecture Decisions — The Why Behind Every Choice

For every technology choice, there was a reason. Understanding the "why" is what separates someone who copied commands from someone who designed a system.

Last Updated: 2026-06-19

Why Docker Instead of Running Services Directly?

Problem: Running this many services directly on one Linux host creates dependency conflicts: different Python versions, conflicting library versions, and services that break each other on updates.

Options considered:

Bare metal: install each app directly on the OS
Virtual machines: one VM per service
Docker containers: isolated processes with their own dependencies

Decision: Docker

Why:

Each container has its own filesystem and runtime — they can't conflict
Starting, stopping, or updating one service doesn't affect others
The docker-compose.yml file IS the documentation — it shows exactly what the service needs
Portability: move the same compose file to a new machine and the service comes up the same way, given the same images and data volumes
restart: unless-stopped brings a container back automatically after a crash or host reboot, unless it was stopped manually
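The restart behaviour in the last bullet can be sketched as a small decision model. This is a simplified reading of Docker's documented restart policies, not Docker's own code; the function names are made up for illustration.

```python
def restarts_after_exit(policy: str, exit_code: int, stopped_by_user: bool) -> bool:
    """Would Docker restart the container right after its main process exits?"""
    if stopped_by_user:
        return False  # a manual `docker stop` is respected by every policy
    if policy in ("always", "unless-stopped"):
        return True  # restart regardless of the exit code
    if policy == "on-failure":
        return exit_code != 0  # only non-zero exits count as failures
    return False  # "no" is the default policy


def starts_on_daemon_boot(policy: str, stopped_by_user: bool) -> bool:
    """Would Docker start the container when the daemon (or host) comes back up?"""
    if policy == "always":
        return True  # started even if it had been stopped manually
    if policy == "unless-stopped":
        return not stopped_by_user  # stays down if it was stopped on purpose
    return False


if __name__ == "__main__":
    print(restarts_after_exit("unless-stopped", exit_code=1, stopped_by_user=False))
    print(starts_on_daemon_boot("unless-stopped", stopped_by_user=True))
```

The difference from always only shows up on the second function: a service that was stopped deliberately for maintenance stays stopped after a reboot.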

What to say in an interview:

"I containerized every service using Docker Compose so each has isolated dependencies and the entire deployment is reproducible from a single YAML file."

Why Cloudflare Tunnel Instead of Port Forwarding?

Problem: How do you make home services accessible from the internet?

Traditional approach: Open ports 80 and 443 on the home router, configure NAT, point DNS to your home IP address.

Problems with that:

Your home IP is public (DDoS risk, can be scanned and targeted)
Dynamic home IP means DNS breaks every time the ISP changes it
Some ISPs block residential ports 80 and 443
Router configuration is fragile and varies by hardware

Decision: Cloudflare Tunnel (cloudflared)

Why:

cloudflared makes an outbound connection to Cloudflare — no inbound ports needed at all
Home IP is never exposed to the public internet
Works on any ISP and behind any NAT or firewall that allows outbound connections to Cloudflare
Cloudflare handles TLS certificates automatically (no Let's Encrypt setup)
Free tier covers everything needed
Built-in DDoS protection at Cloudflare's edge

The tradeoff: You depend on Cloudflare. If Cloudflare has an outage, your site goes down even if your hardware is fine. Acceptable — Cloudflare's uptime exceeds most home ISPs.

Why Authentik for SSO?

Problem: Eleven services means eleven separate usernames and passwords. Adding a user means eleven admin panels. Removing access means eleven places to deactivate.

Options:

No SSO — separate logins per service
Authelia — simpler, forward-auth proxy only
Authentik — full OIDC provider, more complex to set up
Keycloak — enterprise-grade, very heavy on RAM

Decision: Authentik

Why:

One account controls access to everything
Apps that support native OIDC (Grafana, Kavita, Karakeep, Open WebUI, Portainer, BookStack, Forgejo) get real SSO — user is authenticated inside the app with a JWT, not just at a proxy
Access policies per application (Portainer restricted to homelab-admin group only)
Self-hosted — user data never leaves your infrastructure

Why not Authelia: Authelia was considered as a forward-auth proxy only. In that mode it gates access to the app until the user has authenticated, but the app itself does not receive a signed identity token. Authentik acts as a full OIDC provider and issues a JWT containing the user's email and name, so apps can create user accounts automatically on first login.
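To make the "real JWT" point concrete, here is a minimal sketch of what an app does with the identity claims it receives. The sample token, the claim values, and the helper names are all hypothetical, and the sketch deliberately skips signature verification, which a real OIDC client must perform against the provider's published keys.

```python
import base64
import json


def b64url_encode(data: bytes) -> str:
    """Base64url without padding, as used in JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def read_claims(token: str) -> dict:
    """Return the JWT payload WITHOUT verifying the signature (illustration only)."""
    _header, payload, _signature = token.split(".")
    return json.loads(b64url_decode(payload))


def provision_user(users: dict, claims: dict) -> dict:
    """Create the local account on first login and reuse it afterwards."""
    email = claims["email"]
    if email not in users:
        users[email] = {"email": email, "name": claims.get("name", "")}
    return users[email]


# Hypothetical unsigned token carrying the two claims mentioned above.
sample_token = ".".join([
    b64url_encode(json.dumps({"alg": "none", "typ": "JWT"}).encode()),
    b64url_encode(json.dumps({"email": "alice@example.com", "name": "Alice"}).encode()),
    "",
])

if __name__ == "__main__":
    accounts = {}
    print(provision_user(accounts, read_claims(sample_token)))
```

With a forward-auth proxy alone there is no token for the app to read, so this automatic account creation has nothing to work from.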

Why a Shared Postgres Instead of Separate Authentik Databases?

Problem: After deploying two Cloudflare Tunnel connectors, users got invalid_grant errors when signing in through SSO — roughly 50% of the time.

Root cause: OAuth2 authorization codes are short-lived rows in a database.

Step 1: /authorize → creates code → stored in monk's Authentik DB
Step 2: /token     → looks for code → hits kscloud1's Authentik DB → NOT FOUND

Cloudflare load-balances every HTTP request independently. Steps 1 and 2 of the OAuth2 flow can hit completely different hosts. The code exists in one database but not the other.

Options:

Sync both databases continuously (complex, slow, conflict-prone)
Use sticky sessions (Cloudflare paid feature)
Share one database between both Authentik instances

Decision: Single shared Postgres + Redis hosted on kscloud1, accessible only over Tailscale

Why:

Both connectors' Authentik instances read and write the same database
Authorization codes are always found regardless of which host handles which request
Database is bound to kscloud1's Tailscale IP — never reachable from the public internet
Simple configuration change: one environment variable pointing to the shared host
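The failure and the fix can be reproduced in a toy simulation. This is a sketch under simplifying assumptions: each request is routed to a random host, codes are plain dictionary entries, and the class and function names are invented for illustration rather than taken from Authentik.

```python
import random
import secrets


class AuthentikHost:
    """One Authentik instance plus the database it stores authorization codes in."""

    def __init__(self, name: str, code_store: dict):
        self.name = name
        self.codes = code_store

    def authorize(self, user: str) -> str:
        """Step 1: issue a code and store it in this host's database."""
        code = secrets.token_hex(8)
        self.codes[code] = user
        return code

    def token(self, code: str) -> str:
        """Step 2: exchange the code; it must exist in this host's database."""
        user = self.codes.pop(code, None)  # codes are single-use
        return f"access-token-for-{user}" if user else "invalid_grant"


def run_logins(hosts: list, logins: int, rng: random.Random) -> int:
    """Route each request to a random host and count failed logins."""
    failures = 0
    for i in range(logins):
        code = rng.choice(hosts).authorize(f"user{i}")
        if rng.choice(hosts).token(code) == "invalid_grant":
            failures += 1
    return failures


# Before: each host has its own database.
separate = [AuthentikHost("monk", {}), AuthentikHost("kscloud1", {})]
# After: both hosts point at the same database.
shared_db = {}
shared = [AuthentikHost("monk", shared_db), AuthentikHost("kscloud1", shared_db)]

split_failures = run_logins(separate, 1000, random.Random(42))
shared_failures = run_logins(shared, 1000, random.Random(42))

if __name__ == "__main__":
    print(f"separate databases: {split_failures} of 1000 logins failed")
    print(f"shared database:    {shared_failures} of 1000 logins failed")
```

With separate stores roughly half the logins fail, matching the observed error rate; with one shared store none do, because it no longer matters which host answers step 2.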

The tradeoff: If kscloud1 or the Tailscale link goes down, monk's Authentik can't connect to the database and fails to start. Rollback: restore local Postgres in monk's compose file.

Why Tailscale Instead of WireGuard or OpenVPN?

Problem: Need private networking between monk (home) and kscloud1 (Hetzner cloud). The shared Authentik database must not be exposed to the public internet.

Options:

WireGuard: manual key exchange, manual routing, hard to configure through NAT
OpenVPN: complex, slower, more overhead
Tailscale: managed WireGuard, automatic key exchange, works behind NAT

Decision: Tailscale

Why:

Works in minutes: install, authenticate, done
Handles NAT traversal automatically — monk is behind home router NAT
Every device gets a stable 100.x.x.x IP regardless of location
Free for up to 100 devices
WireGuard underneath — same encryption, much easier operation

The tradeoff: You trust Tailscale's coordination servers to manage device authentication. Traffic between devices is end-to-end encrypted, so Tailscale cannot read it, but Tailscale does control who can join your network. Self-hosted alternative if needed: Headscale.
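The 100.x.x.x addresses come from the 100.64.0.0/10 range that Tailscale assigns from. A small check like the one below can guard against accidentally binding a database to a public interface; the helper names and sample addresses are hypothetical and not part of the KiteStacks setup.

```python
import ipaddress

# Tailscale assigns device addresses from the carrier-grade NAT range.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_address(address: str) -> bool:
    """True if the address falls inside 100.64.0.0/10."""
    return ipaddress.ip_address(address) in TAILSCALE_RANGE


def safe_to_bind(address: str) -> bool:
    """Accept loopback or tailnet addresses; reject everything else, including 0.0.0.0."""
    ip = ipaddress.ip_address(address)
    return ip.is_loopback or ip in TAILSCALE_RANGE


if __name__ == "__main__":
    for candidate in ("100.101.102.103", "127.0.0.1", "0.0.0.0", "192.168.1.10"):
        print(candidate, safe_to_bind(candidate))
```

Note that not every 100.x.x.x address qualifies: the range stops at 100.127.255.255.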

Why Active-Active Failover Instead of Active-Passive?

The situation: The user travels. When away from home, monk may be unreachable. kscloud1 must keep the site running.

Active-Passive: kscloud1 only starts serving if Cloudflare detects monk as down. Requires health checks, failover rules, and a delay before traffic switches.

Active-Active: Both monk and kscloud1 are always in the Cloudflare Tunnel rotation. Every request may hit either host at any time.

Decision: Active-Active

Why:

No failover logic needed — both are always live
Instant: if monk goes down, kscloud1 is already handling traffic
Free: Cloudflare Tunnel active-active is included; health-check-based failover is paid

The tradeoff: Stateful apps with separate databases (Kavita, Karakeep) may show different data depending on which host answers. Explicitly accepted — the priority is uptime, not data consistency across hosts. Forgejo and Authentik share databases so they are consistent.

Why a Custom Portal Instead of a Pre-Built Dashboard?

Options:

Homepage (gethomepage) — nice but limited customization
Heimdall — similar limitations
Custom static HTML/CSS/JS + nginx — full control, full ownership

Decision: Custom static site

Why:

Complete visual control — the cyberpunk theme, layout, every card, every color
Static files + nginx are extremely fast and reliable (no Node.js, no build step)
nginx proxies the /api/* endpoints to the metrics API on the same origin, so the browser never makes a cross-origin request and CORS is not an issue
No dependency on external frameworks that can change or break

The tradeoff: More work to build and maintain. But you understand every line of it, and you can explain exactly why every piece is there.

Why Python + FastAPI for the Metrics API?

Problem: The portal needs live system stats (CPU, RAM, network), weather, and Forgejo git activity. Static HTML can't provide these.

Decision: Python FastAPI with psutil

Why:

psutil reads host system metrics in one line of Python
FastAPI auto-generates API documentation and handles async requests well
Python is readable — easy to understand and modify
async/await means the API doesn't block while waiting for weather API responses
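The non-blocking behaviour in the last bullet can be shown with the standard library alone. This is not the actual metrics API: fetch_weather and read_stats are hypothetical stand-ins, the sleep simulates a slow upstream weather service, and the returned values are made up.

```python
import asyncio


async def fetch_weather(log: list) -> dict:
    """Simulate a slow call to an external weather service."""
    log.append("weather: request sent")
    await asyncio.sleep(0.05)  # control returns to the event loop while waiting
    log.append("weather: response received")
    return {"temp_c": 21}  # made-up value


async def read_stats(log: list) -> dict:
    """Simulate a fast local stats read."""
    log.append("stats: served")
    return {"cpu_percent": 12.5}  # made-up value


async def serve_both() -> list:
    """Handle both requests concurrently and return the order of events."""
    log = []
    await asyncio.gather(fetch_weather(log), read_stats(log))
    return log


if __name__ == "__main__":
    for event in asyncio.run(serve_both()):
        print(event)
```

The stats request is answered while the weather request is still waiting, which is the property the portal relies on. It only holds if the weather call uses an async HTTP client; a blocking call inside an async handler would stall the event loop.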

Special requirements:

network_mode: host — container shares host network namespace so psutil sees real network interfaces, not the container's virtual interface
pid: host — container can read the host's /proc filesystem for accurate process stats

Without these settings, psutil would report the container's own network interfaces and process list instead of the real stats of the laptop it runs on.

Why Forgejo Instead of GitHub or GitLab?

Problem: Need to store all homelab code, configs, and documentation in version control.

Options:

GitHub: free, reliable, but your configs and docs are on someone else's server
GitLab: self-hostable but heavy (4GB+ RAM for full install)
Forgejo: lightweight GitHub-like self-hosted Git, fork of Gitea

Decision: Forgejo

Why:

Self-hosted — configs and documentation stay on your infrastructure
Very lightweight — uses less than 100MB RAM
GitHub-like API and workflows: many tools built for GitHub work with Forgejo with little or no change
Full UI with code review, issues, CI/CD (Forgejo Actions)
Shows commit history and documentation to anyone you give access to

The tradeoff: You maintain it yourself. If Forgejo goes down, git operations fail. Mitigated by kscloud1 running a replica and the shared Postgres.

Why osTicket for the Help Desk?

What it replaced: OpenProject (project management tool on tasks.kitestacks.com)

Why OpenProject was removed:

OpenProject Community Edition (CE) does not include SSO; it requires an Enterprise Edition license
The SSO button simply does not appear in CE — it is a hard paywall with no workaround
OpenProject is also resource-heavy for what it provides

Why osTicket:

Lightweight and runs well on the existing stack
Email integration works (SMTP via Gmail app password — confirmed working)
Handles the ticket/task tracking use case without the licensing barrier

Why BookStack for the Wiki?

Problem: Need a place for long-form documentation that's more structured than markdown files.

Decision: BookStack

Why:

Clean, organized UI: Shelves → Books → Chapters → Pages hierarchy
WYSIWYG editor — easy to write docs without markdown syntax
Authentik OIDC SSO works natively
API available — docs can be pushed programmatically from scripts or CI
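As a sketch of what "pushed programmatically" can look like, the snippet below builds (but does not send) a page-creation request. It assumes BookStack's REST API as I understand it: a POST to /api/pages authenticated with an "Authorization: Token id:secret" header. The base URL, token values, and book ID are placeholders, and the endpoint and fields should be checked against the API docs served by your own instance.

```python
import json
import urllib.request


def build_page_request(base_url: str, token_id: str, token_secret: str,
                       book_id: int, name: str, markdown: str) -> urllib.request.Request:
    """Prepare a POST that would create a page from Markdown content."""
    body = json.dumps({"book_id": book_id, "name": name, "markdown": markdown})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/pages",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Token {token_id}:{token_secret}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder values; nothing is sent over the network here.
    request = build_page_request(
        "https://wiki.example.com", "TOKEN_ID", "TOKEN_SECRET",
        book_id=1, name="Architecture Decisions", markdown="# Why Docker?",
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the script easy to dry-run from CI before it touches the wiki.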

Key gotcha: The cache directory must be writable by the container user. Running chown -R abc:users /config/www/framework/cache/ inside the container is required after first install.

Why a Shared Postgres for Forgejo?

Problem: With two connectors in active-active, Forgejo on monk and kscloud1 had separate SQLite databases. Repos created on one weren't visible on the other.

Fix: Migrated both Forgejo instances to a single shared PostgreSQL database on kscloud1 (same shared server as Authentik's Postgres). Both connectors now serve identical Forgejo data.

How it was done:

forgejo dump --database postgres — exported clean SQL from monk's Forgejo
Dropped the schema that pgloader had created (it had the wrong structure) and reloaded the clean SQL
Both compose files point to authentik-postgres:5432 database forgejo, user forgejo
kscloud1's Forgejo joined the authentik_default Docker network to reach authentik-postgres
