Architecture Decisions — The Why Behind Every Choice
For every technology choice, there was a reason. Understanding the "why" is what separates someone who copied commands from someone who designed a system.
Last Updated: 2026-06-19
Why Docker Instead of Running Services Directly?
Problem: Running 15+ services directly on a Linux host creates dependency conflicts — different Python versions, conflicting library versions, services that break each other on updates.
Options considered:
- Bare metal: install each app directly on the OS
- Virtual machines: one VM per service
- Docker containers: isolated processes with their own dependencies
Decision: Docker
Why:
- Each container has its own filesystem and runtime — they can't conflict
- Starting, stopping, or updating one service doesn't affect others
- The docker-compose.yml file IS the documentation: it shows exactly what the service needs
- Portability: move the same compose file to a new machine and it works identically
- restart: unless-stopped means containers self-heal after a crash or host reboot
What to say in an interview:
"I containerized every service using Docker Compose so each has isolated dependencies and the entire deployment is reproducible from a single YAML file."
Why Cloudflare Tunnel Instead of Port Forwarding?
Problem: How do you make home services accessible from the internet?
Traditional approach: Open ports 80 and 443 on the home router, configure NAT, point DNS to your home IP address.
Problems with that:
- Your home IP is public (DDoS risk, can be scanned and targeted)
- Dynamic home IP means DNS breaks every time the ISP changes it
- Some ISPs block residential ports 80 and 443
- Router configuration is fragile and varies by hardware
Decision: Cloudflare Tunnel (cloudflared)
Why:
- cloudflared makes an outbound connection to Cloudflare — no inbound ports needed at all
- Home IP is never exposed to the public internet
- Works on any ISP, any network, any firewall
- Cloudflare handles TLS certificates automatically (no Let's Encrypt setup)
- Free tier covers everything needed
- Built-in DDoS protection at Cloudflare's edge
The tradeoff: You depend on Cloudflare. If Cloudflare has an outage, your site goes down even if your hardware is fine. This is acceptable because Cloudflare's uptime exceeds that of most home ISPs.
Why Authentik for SSO?
Problem: Eleven services means eleven separate usernames and passwords. Adding a user means eleven admin panels. Removing access means eleven places to deactivate.
Options:
- No SSO — separate logins per service
- Authelia — simpler, forward-auth proxy only
- Authentik — full OIDC provider, more complex to set up
- Keycloak — enterprise-grade, very heavy on RAM
Decision: Authentik
Why:
- One account controls access to everything
- Apps that support native OIDC (Grafana, Kavita, Karakeep, Open WebUI, Portainer, BookStack, Forgejo) get real SSO — user is authenticated inside the app with a JWT, not just at a proxy
- Access policies per application (Portainer restricted to homelab-admin group only)
- Self-hosted: user data never leaves your infrastructure
Why not Authelia: Authelia was evaluated here as a forward-auth proxy. In that mode it blocks access to the app until the user authenticates, but the app itself never receives the user's identity. Authentik sends a real JWT with the user's email and name, so apps can create user accounts automatically on first login.
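A minimal sketch of what "receiving user identity" means in practice: the app reads claims out of the token payload. The token below is built locally with made-up claim values and an empty signature, purely for illustration. A real app must verify the signature against the provider's published keys before trusting any claim.

```python
import base64
import json


def b64url_encode(data: bytes) -> str:
    """Base64url-encode without padding, the way JWT segments are encoded."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url JWT segment, restoring the stripped padding."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def read_claims(token: str) -> dict:
    """Return the payload claims of a JWT WITHOUT verifying the signature."""
    _header_b64, payload_b64, _signature = token.split(".")
    return json.loads(b64url_decode(payload_b64))


# Hypothetical token for illustration only; all claim values are made up.
header = {"alg": "none", "typ": "JWT"}
payload = {"sub": "user-123", "email": "alice@example.com", "name": "Alice"}
demo_token = ".".join([
    b64url_encode(json.dumps(header).encode()),
    b64url_encode(json.dumps(payload).encode()),
    "",  # empty signature segment: demo only, never accept this in production
])

claims = read_claims(demo_token)
print(claims["email"], claims["name"])
```

With claims like these, an app has what it needs to create a local account on first login. A forward-auth gate alone gives the app nothing comparable.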
Why a Shared Postgres Instead of Separate Authentik Databases?
Problem: After deploying two Cloudflare Tunnel connectors, users got invalid_grant errors when signing in through SSO, roughly 50% of the time.
Root cause: OAuth2 authorization codes are short-lived rows in a database.
Step 1: /authorize → creates code → stored in monk's Authentik DB
Step 2: /token → looks for code → hits kscloud1's Authentik DB → NOT FOUND
Cloudflare load-balances every HTTP request independently. Steps 1 and 2 of the OAuth2 flow can hit completely different hosts. The code exists in one database but not the other.
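The failure can be reproduced with a toy model. This is a sketch under simplifying assumptions, not Authentik's actual code: each instance is a small class, and a plain dict stands in for its database. The class and method names are hypothetical.

```python
import secrets


class AuthServer:
    """Toy identity-provider instance; a dict stands in for its database."""

    def __init__(self, name: str, code_store: dict):
        self.name = name
        self.code_store = code_store

    def authorize(self, user: str) -> str:
        """Step 1: issue an authorization code and persist it."""
        code = secrets.token_urlsafe(16)
        self.code_store[code] = user
        return code

    def token(self, code: str) -> str:
        """Step 2: redeem the code; a missing row yields invalid_grant."""
        user = self.code_store.pop(code, None)
        if user is None:
            return "invalid_grant"
        return f"access-token-for-{user}"


# Separate databases: the code exists only where it was created.
monk = AuthServer("monk", {})
kscloud1 = AuthServer("kscloud1", {})
code = monk.authorize("alice")
split_result = kscloud1.token(code)

# One shared database: either instance can redeem the code.
shared_store = {}
monk = AuthServer("monk", shared_store)
kscloud1 = AuthServer("kscloud1", shared_store)
code = monk.authorize("alice")
shared_result = kscloud1.token(code)

print(split_result)   # invalid_grant
print(shared_result)  # access-token-for-alice
```

With two hosts and independent routing, step 2 lands on the wrong host about half the time, which matches the observed 50% failure rate.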
Options:
- Sync both databases continuously (complex, slow, conflict-prone)
- Use sticky sessions (Cloudflare paid feature)
- Share one database between both Authentik instances
Decision: Single shared Postgres + Redis hosted on kscloud1, accessible only over Tailscale
Why:
- Both connectors' Authentik instances read and write the same database
- Authorization codes are always found regardless of which host handles which request
- Database is bound to kscloud1's Tailscale IP — never reachable from the public internet
- Simple configuration change: one environment variable pointing to the shared host
The tradeoff: If either kscloud1 or the Tailscale link goes down, monk's Authentik can't connect to the database and fails to start. Rollback: restore local Postgres in monk's compose file.
Why Tailscale Instead of WireGuard or OpenVPN?
Problem: Need private networking between monk (home) and kscloud1 (Hetzner cloud). The shared Authentik database must not be exposed to the public internet.
Options:
- WireGuard: manual key exchange, manual routing, hard to configure through NAT
- OpenVPN: complex, slower, more overhead
- Tailscale: managed WireGuard, automatic key exchange, works behind NAT
Decision: Tailscale
Why:
- Works in minutes: install, authenticate, done
- Handles NAT traversal automatically — monk is behind home router NAT
- Every device gets a stable 100.x.x.x IP regardless of location
- Free for up to 100 devices
- WireGuard underneath — same encryption, much easier operation
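Tailscale assigns those 100.x.x.x addresses from the CGNAT block 100.64.0.0/10. Before binding a private service such as a database to an address, a quick check that it really falls in that block can catch a misconfiguration. The helper name and sample addresses below are hypothetical.

```python
import ipaddress

# Tailscale hands out IPv4 addresses from the CGNAT range 100.64.0.0/10.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_ip(addr: str) -> bool:
    """Return True if addr falls inside the Tailscale IPv4 range."""
    return ipaddress.ip_address(addr) in TAILSCALE_RANGE


# Sample addresses are made up for illustration.
print(is_tailscale_ip("100.101.102.103"))  # True: inside 100.64.0.0/10
print(is_tailscale_ip("100.128.0.1"))      # False: just past the end of the block
print(is_tailscale_ip("192.168.1.10"))     # False: ordinary LAN address
```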
The tradeoff: You trust Tailscale's coordination servers to manage device authentication. Actual data is encrypted peer-to-peer (Tailscale never sees it), but they control who can join your network. Self-hosted alternative if needed: Headscale.
Why Active-Active Failover Instead of Active-Passive?
The situation: The user travels. When away from home, monk may be unreachable. kscloud1 must keep the site running.
Active-Passive: kscloud1 only starts serving if Cloudflare detects monk as down. Requires health checks, failover rules, and a delay before traffic switches.
Active-Active: Both monk and kscloud1 are always in the Cloudflare Tunnel rotation. Every request may hit either host at any time.
Decision: Active-Active
Why:
- No failover logic needed — both are always live
- Instant: if monk goes down, kscloud1 is already handling traffic
- Free: Cloudflare Tunnel active-active is included; health-check-based failover is paid
The tradeoff: Stateful apps with separate databases (Kavita, Karakeep) may show different data depending on which host answers. Explicitly accepted — the priority is uptime, not data consistency across hosts. Forgejo and Authentik share databases so they are consistent.
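A toy model of the routing behaviour described above. It assumes requests are spread at random over whichever connectors are up. This illustrates the idea only and makes no claim about Cloudflare's real selection logic.

```python
import random


def route(connectors: dict, rng: random.Random):
    """Pick any healthy connector at random; None means a total outage."""
    healthy = [name for name, up in connectors.items() if up]
    return rng.choice(healthy) if healthy else None


rng = random.Random(0)  # fixed seed so the run is repeatable
connectors = {"monk": True, "kscloud1": True}

# Both hosts up: traffic lands on either one.
both_up = {route(connectors, rng) for _ in range(100)}

# monk unreachable: kscloud1 is already in rotation, nothing has to switch over.
connectors["monk"] = False
monk_down = {route(connectors, rng) for _ in range(100)}

print(sorted(both_up))
print(sorted(monk_down))
```

The first set also explains the consistency tradeoff: any app with per-host state can be answered by either host on any request.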
Why a Custom Portal Instead of a Pre-Built Dashboard?
Options:
- Homepage (gethomepage) — nice but limited customization
- Heimdall — similar limitations
- Custom static HTML/CSS/JS + nginx — full control, full ownership
Decision: Custom static site
Why:
- Complete visual control — the cyberpunk theme, layout, every card, every color
- Static files + nginx are extremely fast and reliable (no Node.js, no build step)
- nginx proxies the /api/* endpoints to the metrics API without CORS issues
- No dependency on external frameworks that can change or break
The tradeoff: More work to build and maintain. But you understand every line of it, and you can explain exactly why every piece is there.
Why Python + FastAPI for the Metrics API?
Problem: The portal needs live system stats (CPU, RAM, network), weather, and Forgejo git activity. Static HTML can't provide these.
Decision: Python FastAPI with psutil
Why:
- psutil reads host system metrics in one line of Python
- FastAPI auto-generates API documentation and handles async requests well
- Python is readable and easy to understand and modify
- async/await means the API doesn't block while waiting for weather API responses
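The non-blocking point can be shown with plain asyncio. This is a sketch: both coroutines are stand-ins with made-up delays and values, not the real weather or stats calls.

```python
import asyncio

finished = []  # records the order in which the calls complete


async def fetch_weather() -> dict:
    """Stand-in for a slow external weather API call."""
    await asyncio.sleep(0.2)
    finished.append("weather")
    return {"temp_c": 21}


async def read_cpu() -> dict:
    """Stand-in for a fast local stats read."""
    await asyncio.sleep(0.01)
    finished.append("cpu")
    return {"cpu_percent": 12.5}


async def main():
    # Both coroutines run concurrently; results come back in argument order.
    return await asyncio.gather(fetch_weather(), read_cpu())


weather, cpu = asyncio.run(main())
print(finished)  # ['cpu', 'weather']
```

The weather call starts first, but the CPU read finishes first because the event loop keeps working while the slow call waits.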
Special requirements:
- network_mode: host, so the container shares the host network namespace and psutil sees real network interfaces, not the container's virtual interface
- pid: host, so the container can read the host's /proc filesystem for accurate process stats
Without these flags, the API would report container-level stats instead of actual laptop stats.
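A minimal sketch of the kind of collector such an API could wrap. The function name and the returned keys are assumptions, not the project's actual code. The psutil calls themselves (cpu_percent, virtual_memory, net_io_counters) are real. A fake module is passed in so the example runs even without psutil installed.

```python
from types import SimpleNamespace


def collect_metrics(ps=None) -> dict:
    """Read CPU, RAM and network counters from a psutil-like module."""
    if ps is None:
        import psutil as ps  # third-party package: pip install psutil
    mem = ps.virtual_memory()
    net = ps.net_io_counters()
    return {
        "cpu_percent": ps.cpu_percent(interval=None),
        "ram_percent": mem.percent,
        "net_bytes_sent": net.bytes_sent,
        "net_bytes_recv": net.bytes_recv,
    }


# Fake module for a dry run; every value here is made up.
fake_ps = SimpleNamespace(
    cpu_percent=lambda interval=None: 7.5,
    virtual_memory=lambda: SimpleNamespace(percent=42.0),
    net_io_counters=lambda: SimpleNamespace(bytes_sent=1000, bytes_recv=2000),
)

sample = collect_metrics(fake_ps)
print(sample)
```

Calling collect_metrics() with no argument uses the real psutil. Inside a container, the numbers only describe the host when the two flags above are set.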
Why Forgejo Instead of GitHub or GitLab?
Problem: Need to store all homelab code, configs, and documentation in version control.
Options:
- GitHub: free, reliable, but your configs and docs are on someone else's server
- GitLab: self-hostable but heavy (4GB+ RAM for full install)
- Forgejo: lightweight GitHub-like self-hosted Git, fork of Gitea
Decision: Forgejo
Why:
- Self-hosted — configs and documentation stay on your infrastructure
- Very lightweight — uses less than 100MB RAM
- GitHub-compatible API — tools that work with GitHub also work with Forgejo
- Full UI with code review, issues, CI/CD (Forgejo Actions)
- Shows commit history and documentation to anyone you give access to
The tradeoff: You maintain it yourself. If Forgejo goes down, git operations fail. Mitigated by kscloud1 running a replica and the shared Postgres.
Why OSTicket for the Help Desk?
What it replaced: OpenProject (project management tool on tasks.kitestacks.com)
Why OpenProject was removed:
- OpenProject CE (Community Edition) requires an Enterprise Edition license for SSO
- The SSO button simply does not appear in CE — it is a hard paywall with no workaround
- OpenProject is also resource-heavy for what it provides
Why OSTicket:
- Lightweight and runs well on the existing stack
- Email integration works (SMTP via Gmail app password — confirmed working)
- Handles the ticket/task tracking use case without the licensing barrier
Why BookStack for the Wiki?
Problem: Need a place for long-form documentation that's more structured than markdown files.
Decision: BookStack
Why:
- Clean, organized UI: Shelves → Books → Chapters → Pages hierarchy
- WYSIWYG editor — easy to write docs without markdown syntax
- Authentik OIDC SSO works natively
- API available — docs can be pushed programmatically from scripts or CI
Key gotcha: The cache directory must be writable by the container user. Running chown -R abc:users /config/www/framework/cache/ is required after first install.
Why the Forgejo Shared Postgres?
Problem: With two connectors in active-active, Forgejo on monk and kscloud1 had separate SQLite databases. Repos created on one weren't visible on the other.
Fix: Migrated both Forgejo instances to a single shared PostgreSQL database on kscloud1 (same shared server as Authentik's Postgres). Both connectors now serve identical Forgejo data.
How it was done:
- forgejo dump --database postgres exported clean SQL from monk's Forgejo
- Dropped the pgloader schema (had wrong structure), reloaded the clean SQL
- Both compose files point to authentik-postgres:5432, database forgejo, user forgejo
- kscloud1's Forgejo joined the authentik_default Docker network to reach authentik-postgres