Complete documentation suite for KiteStacks covering all 11 services across 2-host active-active architecture. Includes beginner track (with AI, 8 files) and advanced track (without AI, 7 files) with time estimates, real troubleshooting cases, and command-by-command explanations. Updates certifications roadmap to reflect July 7 2026 A+ Core 2 exam goal. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Architecture Decisions — The Why Behind Every Choice

For every technology choice, there was a reason. Understanding the "why" is what separates
someone who copied commands from someone who designed a system.

**Last Updated:** 2026-06-19

---

## Why Docker Instead of Running Services Directly?

**Problem:** Running 15+ services directly on a Linux host creates dependency conflicts —
different Python versions, conflicting library versions, services that break each other on updates.

**Options considered:**
- Bare metal: install each app directly on the OS
- Virtual machines: one VM per service
- Docker containers: isolated processes with their own dependencies

**Decision:** Docker

**Why:**
- Each container has its own filesystem and runtime, so dependencies can't conflict
- Starting, stopping, or updating one service doesn't affect others
- The `docker-compose.yml` file IS the documentation — it shows exactly what the service needs
- Portability: move the same compose file (plus its volumes and secrets) to a new machine and the service comes up the same way
- `restart: unless-stopped` restarts a container after a crash or host reboot, unless it was stopped manually

**What to say in an interview:**
> *"I containerized every service using Docker Compose so each has isolated dependencies
> and the entire deployment is reproducible from a single YAML file."*

---

## Why Cloudflare Tunnel Instead of Port Forwarding?

**Problem:** How do you make home services accessible from the internet?

**Traditional approach:** Open ports 80 and 443 on the home router, configure NAT,
point DNS to your home IP address.

**Problems with that:**
- Your home IP is public (DDoS risk, can be scanned and targeted)
- Dynamic home IP means DNS breaks every time the ISP changes it
- Some ISPs block residential ports 80 and 443
- Router configuration is fragile and varies by hardware

**Decision:** Cloudflare Tunnel (cloudflared)

**Why:**
- cloudflared makes an outbound connection to Cloudflare — no inbound ports needed at all
- Home IP is never exposed to the public internet
- Works on almost any ISP or network, since it only needs outbound connections to Cloudflare
- Cloudflare handles TLS certificates automatically (no Let's Encrypt setup)
- Free tier covers everything needed
- Built-in DDoS protection at Cloudflare's edge

**The tradeoff:** You depend on Cloudflare. If Cloudflare has an outage, your site goes down
even if your hardware is fine. That is acceptable: Cloudflare's uptime exceeds that of most home ISPs.

---

## Why Authentik for SSO?

**Problem:** Eleven services means eleven separate usernames and passwords. Adding a user
means eleven admin panels. Removing access means eleven places to deactivate.

**Options:**
- No SSO — separate logins per service
- Authelia — simpler, forward-auth proxy only
- Authentik — full OIDC provider, more complex to set up
- Keycloak — enterprise-grade, very heavy on RAM

**Decision:** Authentik

**Why:**
- One account controls access to everything
- Apps that support native OIDC (Grafana, Kavita, Karakeep, Open WebUI, Portainer, BookStack,
  Forgejo) get real SSO — user is authenticated inside the app with a JWT, not just at a proxy
- Access policies per application (Portainer restricted to `homelab-admin` group only)
- Self-hosted — user data never leaves your infrastructure

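
To make the JWT point above concrete, here is a minimal Python sketch of what an app gains from receiving a signed token. It is an illustration under stated assumptions, not Authentik's implementation: the shared secret, the HS256 algorithm, and the claim values are all hypothetical, and a real OIDC client would normally verify the provider's published signing key through a maintained library and also check `exp`, `iss`, and `aud`.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: dict, secret: bytes) -> str:
    """Play the identity provider: build a signed header.payload.signature token."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signature = hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(signature)}"


def read_identity(token: str, secret: bytes) -> dict:
    """Play the app: verify the signature, then trust the claims inside."""
    header, payload, signature = token.split(".")
    expected = hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(signature)):
        raise ValueError("signature mismatch")
    return json.loads(b64url_decode(payload))


SECRET = b"demo-secret"  # hypothetical shared secret, for illustration only
token = sign_token({"sub": "42", "email": "user@example.com", "name": "Demo User"}, SECRET)
identity = read_identity(token, SECRET)
print(identity["email"], identity["name"])
```

Because the claims arrive signed, the app can create or look up a local account from `email` and `name` on first login, which a proxy that only gates the login page cannot offer.
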

**Why not Authelia:** Authelia is built primarily around forward-auth at the proxy. It blocks
the app's login page until the user is authenticated, but in that mode the app itself does not
receive a signed identity token. Authentik sends a real JWT with user email and name, so apps can
create user accounts automatically on first login.

---

## Why a Shared Postgres Instead of Separate Authentik Databases?

**Problem:** After deploying two Cloudflare Tunnel connectors, users got `invalid_grant`
errors when signing in through SSO — roughly 50% of the time.

**Root cause:** OAuth2 authorization codes are short-lived rows in a database, and each host
had its own Authentik database.

```
Step 1: /authorize → creates code   → stored in monk's Authentik DB
Step 2: /token     → looks for code → hits kscloud1's Authentik DB → NOT FOUND
```

Cloudflare load-balances every HTTP request independently. Steps 1 and 2 of the OAuth2
flow can hit completely different hosts. The code exists in one database but not the other.

**Options:**
- Sync both databases continuously (complex, slow, conflict-prone)
- Use sticky sessions (Cloudflare paid feature)
- Share one database between both Authentik instances

**Decision:** Single shared Postgres + Redis hosted on kscloud1, accessible only over Tailscale

**Why:**
- Both connectors' Authentik instances read and write the same database
- Authorization codes are always found regardless of which host handles which request
- Database is bound to kscloud1's Tailscale IP — never reachable from the public internet
- Simple configuration change: one environment variable pointing to the shared host

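
The failure and the fix can be reproduced in a small simulation. This is a toy model, assuming each request is routed to a random host; the `AuthentikInstance` class and its `authorize`/`token` methods are hypothetical stand-ins rather than Authentik code, and a Python `set` plays the role of the Postgres table holding authorization codes.

```python
import random
import secrets


class AuthentikInstance:
    """Hypothetical stand-in for one Authentik host; `db` mimics its code table."""

    def __init__(self, name, db):
        self.name = name
        self.db = db

    def authorize(self):
        """Step 1: issue an authorization code and store it."""
        code = secrets.token_urlsafe(8)
        self.db.add(code)
        return code

    def token(self, code):
        """Step 2: redeem the code, which only works if this host can find it."""
        if code not in self.db:
            return "invalid_grant"
        self.db.remove(code)  # authorization codes are single-use
        return "access_token"


def run_logins(instances, logins, seed=0):
    """Route each of the two OAuth2 steps to a randomly chosen host."""
    rng = random.Random(seed)
    results = []
    for _ in range(logins):
        code = rng.choice(instances).authorize()
        results.append(rng.choice(instances).token(code))
    return results


# Separate databases: each host only knows the codes it issued itself.
split = run_logins([AuthentikInstance("monk", set()), AuthentikInstance("kscloud1", set())], 200)

# Shared database: both hosts read and write the same store.
shared_db = set()
shared = run_logins([AuthentikInstance("monk", shared_db), AuthentikInstance("kscloud1", shared_db)], 200)

print(split.count("invalid_grant"), "of 200 logins failed with separate databases")
print(shared.count("invalid_grant"), "of 200 logins failed with a shared database")
```

With separate stores, a login fails whenever the two steps land on different hosts, which is about half the time under random routing. With the shared store the failure count is zero, whichever host answers.
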

**The tradeoff:** If kscloud1 or the Tailscale link goes down, monk's Authentik can't connect
to the database and fails to start. Rollback: restore local Postgres in monk's compose file.

---

## Why Tailscale Instead of WireGuard or OpenVPN?

**Problem:** Need private networking between monk (home) and kscloud1 (Hetzner cloud).
The shared Authentik database must not be exposed to the public internet.

**Options:**
- WireGuard: manual key exchange, manual routing, hard to configure through NAT
- OpenVPN: complex, slower, more overhead
- Tailscale: managed WireGuard, automatic key exchange, works behind NAT

**Decision:** Tailscale

**Why:**
- Works in minutes: install, authenticate, done
- Handles NAT traversal automatically — monk is behind home router NAT
- Every device gets a stable `100.x.x.x` IP regardless of location
- Free for up to 100 devices
- WireGuard underneath — same encryption, much easier operation

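
Because the shared database is meant to be reachable only over Tailscale, it can help to sanity-check the address a service binds to. The sketch below assumes Tailscale's IPv4 addresses come from the `100.64.0.0/10` carrier-grade NAT range; the helper names and the sample addresses are illustrative only and do not come from this deployment.

```python
import ipaddress

# Assumption: tailnet IPv4 addresses are allocated from the CGNAT range.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_tailnet_address(address: str) -> bool:
    """True when the address falls inside 100.64.0.0 - 100.127.255.255."""
    return ipaddress.ip_address(address) in TAILSCALE_RANGE


def check_bind_address(address: str) -> str:
    """Classify a listen address for a service that must stay private."""
    if address in ("0.0.0.0", "::"):
        return "exposed on all interfaces"
    if is_tailnet_address(address):
        return "tailnet only"
    return "not a tailnet address"


for candidate in ("100.101.102.103", "100.200.1.1", "0.0.0.0"):
    print(candidate, "->", check_bind_address(candidate))
```

Note that not every `100.x.x.x` address qualifies: `100.200.1.1` is outside the `/10` range, so a check like this is stricter than matching on the first octet.
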

**The tradeoff:** You trust Tailscale's coordination servers to manage device authentication.
Traffic is encrypted end to end between devices (Tailscale never sees it), but Tailscale controls
who can join your network. Self-hosted alternative if needed: Headscale.

---

## Why Active-Active Failover Instead of Active-Passive?

**The situation:** The user travels. When away from home, monk may be unreachable.
kscloud1 must keep the site running.

**Active-Passive:** kscloud1 only starts serving if Cloudflare detects monk as down.
Requires health checks, failover rules, and a delay before traffic switches.

**Active-Active:** Both monk and kscloud1 are always in the Cloudflare Tunnel rotation.
Every request may hit either host at any time.

**Decision:** Active-Active

**Why:**
- No failover logic needed — both are always live
- Instant: if monk goes down, kscloud1 is already handling traffic
- Free: Cloudflare Tunnel active-active is included; health-check-based failover is paid

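
The "no failover logic" point can be shown with a toy router. This is a simplified model, not Cloudflare's actual connector selection: it assumes a request may go to any connector that is currently up, and the `route` helper is hypothetical.

```python
import random


def route(connectors, rng):
    """Pick any live connector; return None when every host is down."""
    live = [name for name, up in connectors.items() if up]
    return rng.choice(live) if live else None


rng = random.Random(0)
connectors = {"monk": True, "kscloud1": True}

# Both hosts up: traffic is spread across them.
both_up = {route(connectors, rng) for _ in range(100)}

# monk goes offline (for example, the laptop is travelling): no switch-over step is needed.
connectors["monk"] = False
monk_down = {route(connectors, rng) for _ in range(100)}

print(sorted(both_up), sorted(monk_down))
```

There is no promotion step in this model: the surviving host was already in rotation, so the only change is that it receives all of the traffic.
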

**The tradeoff:** Stateful apps with separate databases (Kavita, Karakeep) may show
different data depending on which host answers. This is explicitly accepted: the priority is
uptime, not data consistency across hosts. Forgejo and Authentik share databases, so
they are consistent.

---

## Why a Custom Portal Instead of a Pre-Built Dashboard?

**Options:**
- Homepage (gethomepage) — nice but limited customization
- Heimdall — similar limitations
- Custom static HTML/CSS/JS + nginx — full control, full ownership

**Decision:** Custom static site

**Why:**
- Complete visual control — the cyberpunk theme, layout, every card, every color
- Static files + nginx are extremely fast and reliable (no Node.js, no build step)
- nginx proxies the `/api/*` endpoints to the metrics API, so the browser sees a single origin and CORS never comes up
- No dependency on external frameworks that can change or break

**The tradeoff:** More work to build and maintain. But you understand every line of it,
and you can explain exactly why every piece is there.

---

## Why Python + FastAPI for the Metrics API?

**Problem:** The portal needs live system stats (CPU, RAM, network), weather, and
Forgejo git activity. Static HTML can't provide these.

**Decision:** Python FastAPI with `psutil`

**Why:**
- `psutil` reads host system metrics in one line of Python
- FastAPI auto-generates API documentation and handles async requests well
- Python is readable — easy to understand and modify
- `async/await` means the API doesn't block while waiting for weather API responses

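
The non-blocking behaviour can be seen without FastAPI at all. The sketch below uses only the standard library's `asyncio`; the two coroutines, the 0.2 second delay, and the returned values are placeholders invented for illustration and are not taken from the real metrics API.

```python
import asyncio


async def fetch_weather(order):
    """Simulate a slow upstream weather API call."""
    await asyncio.sleep(0.2)
    order.append("weather")
    return {"summary": "placeholder"}


async def read_stats(order):
    """Simulate a fast local stats read; yields control once."""
    await asyncio.sleep(0)
    order.append("stats")
    return {"cpu_percent": 0.0}


async def main():
    order = []
    # Both coroutines run concurrently on one thread.
    weather, stats = await asyncio.gather(fetch_weather(order), read_stats(order))
    return order, weather, stats


order, weather, stats = asyncio.run(main())
print(order)  # prints ['stats', 'weather']
```

The stats call finishes first even though the weather call started first, because `await` hands control back to the event loop while the slow call is pending.
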

**Special requirements:**
- `network_mode: host` — container shares host network namespace so psutil sees real
  network interfaces, not the container's virtual interface
- `pid: host` — container shares the host's PID namespace, so `/proc` lists the host's processes for accurate process stats

Without these flags, psutil would see only the container's own network interfaces and processes instead of the laptop's.

---

## Why Forgejo Instead of GitHub or GitLab?

**Problem:** Need to store all homelab code, configs, and documentation in version control.

**Options:**
- GitHub: free, reliable, but your configs and docs are on someone else's server
- GitLab: self-hostable but heavy (4GB+ RAM for full install)
- Forgejo: lightweight GitHub-like self-hosted Git, fork of Gitea

**Decision:** Forgejo

**Why:**
- Self-hosted — configs and documentation stay on your infrastructure
- Very lightweight — uses less than 100MB RAM
- GitHub-like REST API, so many tools built for GitHub can be pointed at Forgejo with little change
- Full UI with code review, issues, CI/CD (Forgejo Actions)
- Shows commit history and documentation to anyone you give access to

**The tradeoff:** You maintain it yourself. If Forgejo goes down, git operations fail.
Mitigated by kscloud1 running a second Forgejo instance against the shared Postgres.

---

## Why osTicket for the Help Desk?

**What it replaced:** OpenProject (project management tool on tasks.kitestacks.com)

**Why OpenProject was removed:**
- OpenProject CE (Community Edition) requires an Enterprise Edition license for SSO
- The SSO button simply does not appear in CE — it is a hard paywall with no workaround
- OpenProject is also resource-heavy for what it provides

**Why osTicket:**
- Lightweight and runs well on the existing stack
- Email integration is confirmed working (SMTP via a Gmail app password)
- Handles the ticket/task tracking use case without the licensing barrier

---

## Why BookStack for the Wiki?

**Problem:** Need a place for long-form documentation that's more structured than markdown files.

**Decision:** BookStack

**Why:**
- Clean, organized UI: Shelves → Books → Chapters → Pages hierarchy
- WYSIWYG editor — easy to write docs without markdown syntax
- Authentik OIDC SSO works natively
- API available — docs can be pushed programmatically from scripts or CI

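
As a rough illustration of pushing docs programmatically, the sketch below builds (but does not send) a request that would create a page. It assumes BookStack's REST API accepts `POST /api/pages` with `book_id`, `name`, and `markdown` fields and an `Authorization: Token <id>:<secret>` header; confirm this against the `/api/docs` page of your own instance. The hostname, token values, and book ID are hypothetical.

```python
import json
import urllib.request


def build_page_request(base_url, token_id, token_secret, book_id, name, markdown):
    """Assemble a create-page request for a BookStack instance."""
    body = json.dumps({"book_id": book_id, "name": name, "markdown": markdown}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/pages",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {token_id}:{token_secret}",
            "Content-Type": "application/json",
        },
    )


request = build_page_request(
    "https://wiki.example.com",  # hypothetical hostname
    "TOKEN_ID",                  # hypothetical API token ID
    "TOKEN_SECRET",              # hypothetical API token secret
    1,                           # hypothetical book ID
    "Architecture Decisions",
    "# Architecture Decisions\n\nPushed from a script.",
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

In a script or CI job, the token would come from a secret store or environment variable rather than being written into the file.
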

**Key gotcha:** Cache directory must be writable by the container user.
`chown -R abc:users /config/www/framework/cache/` is required after first install.

---

## Why the Forgejo Shared Postgres?

**Problem:** With two connectors in active-active, Forgejo on monk and kscloud1 had
separate SQLite databases. Repos created on one weren't visible on the other.

**Fix:** Migrated both Forgejo instances to a single shared PostgreSQL database on kscloud1
(same shared server as Authentik's Postgres). Both connectors now serve identical Forgejo data.

**How it was done:**
- Ran `forgejo dump --database postgres` on monk's Forgejo to export clean, Postgres-compatible SQL
- Dropped the schema left by pgloader (it had the wrong structure) and reloaded the clean SQL
- Both compose files point to `authentik-postgres:5432`, database `forgejo`, user `forgejo`
- kscloud1's Forgejo joined the `authentik_default` Docker network to reach `authentik-postgres`