# KiteStacks Architecture — Full System Overview
**Last Updated:** 2026-06-19

---
## The Big Picture
```
                  INTERNET
               ┌──────▼──────┐
               │ Cloudflare  │  DNS + TLS termination
               │   (edge)    │  Tunnel routing
               └──────┬──────┘
                      │  HTTPS only, home IP never exposed
       ┌──────────────┴──────────────┐
       │ connector 1                 │ connector 2
       │                             │
┌──────▼──────┐               ┌──────▼──────┐
│    MONK     │               │  KSCLOUD1   │
│ (ThinkPad   │               │ (Hetzner VPS│
│ T14s, home) │               │  Germany)   │
│             │               │             │
│ Development │               │ ALWAYS LIVE │
│ Pushes to → │               │ Receives ←  │
│ kscloud1    │               │ from monk   │
└──────┬──────┘               └──────┬──────┘
       │                             │
       └───────── TAILSCALE ─────────┘
              (100.x.x.x range)
           Encrypted peer-to-peer
         ┌────────────▼────────────┐
         │  SHARED DATABASE LAYER  │
         │   hosted on kscloud1    │
         │                         │
         │  PostgreSQL  :5432      │
         │  Redis       :6379      │
         │                         │
         │  Bound to Tailscale IP  │
         │  only, not public       │
         └─────────────────────────┘
```
**The key idea:** monk and kscloud1 each run a cloudflared connector that holds a persistent
outbound connection to Cloudflare. Every request to kitestacks.com arrives at Cloudflare,
which forwards it through whichever connector is available. If monk goes offline, kscloud1
handles everything. Your home IP is never exposed.
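
The failover behaviour described above can be sketched as a toy model. This is only an illustration: the `Connector` class and the "first online connector wins" rule are assumptions made for the example, and Cloudflare's real selection policy is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Connector:
    """One cloudflared connector registered with the tunnel (hypothetical model)."""
    host: str
    online: bool = True


def route(connectors: list[Connector]) -> str:
    """Return the host that serves the next request.

    Assumption: the edge uses the first connector that is online.
    Cloudflare's actual selection logic is not modelled here.
    """
    for connector in connectors:
        if connector.online:
            return connector.host
    raise RuntimeError("no connector online: the site is unreachable")


monk = Connector("monk")
kscloud1 = Connector("kscloud1")
tunnel = [monk, kscloud1]

print(route(tunnel))   # monk, while the laptop is on
monk.online = False    # laptop goes offline
print(route(tunnel))   # kscloud1 keeps serving
```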
---
## How Work Flows Between the Two Hosts
```
monk (dev) ──push──► kscloud1 (prod, always live)
```
- **monk** is where changes are made: editing config files, testing new services, writing code
- **kscloud1** receives those changes and is always serving live traffic
- If monk is off, kscloud1 continues serving the last pushed state — users see no downtime
- A third machine (Samurai desktop) is planned as a future second home connector
---
## The Eleven Public Services
| Service | Container | URL | What It Does |
|---------|-----------|-----|-------------|
| Portal | `homepage` | www.kitestacks.com | Custom homepage — links, live stats, cyberpunk theme |
| Authentik | `authentik` | auth.kitestacks.com | SSO identity provider — handles all logins |
| Forgejo | `forgejo` | gitforge.kitestacks.com | Self-hosted Git (like GitHub) |
| Open WebUI | `kite-openwebui` | ai.kitestacks.com | AI chat interface |
| Karakeep | `karakeep` | links.kitestacks.com | Bookmark and read-it-later manager |
| Kavita | `kavita` | kavita.kitestacks.com | eBook and manga reader |
| Grafana | `grafana` | grafana.kitestacks.com | Monitoring dashboards |
| Uptime Kuma | `uptime-kuma` | status.kitestacks.com | Public status page and uptime monitoring |
| BookStack | `bookstack` | wiki.kitestacks.com | Self-hosted wiki / docs platform |
| OSTicket | `osticket-app` | tasks.kitestacks.com | Help desk ticketing system |
| Portainer | `portainer` | portainer.kitestacks.com | Docker management dashboard |
## The Infrastructure Services (Internal Only)
| Container | What It Does |
|-----------|-------------|
| `cloudflared` | Cloudflare Tunnel connector — outbound connection to Cloudflare edge |
| `prometheus` | Metrics collector — scrapes node-exporter every 15 seconds |
| `node-exporter` | Exposes host CPU/RAM/disk/network metrics for Prometheus |
| `blackbox-exporter` | HTTP probe monitor — checks endpoints are returning 200 |
| `kite-litellm` | LLM proxy — routes AI requests to OpenRouter (many free models) |
| `kitestacks-metrics-api` | Python FastAPI — serves live stats and Forgejo activity to portal |
| `ntfy` | Push notification server — sends alerts to phone |
| `flux` | GitOps controller — watches Forgejo, deploys changes automatically |
| `authentik-worker` | Background job processor for Authentik |
| `authentik-ldap` | LDAP proxy layer for Authentik |
---
## How Traffic Flows — Step by Step
### Someone visits www.kitestacks.com
```
1. Browser → DNS lookup "www.kitestacks.com"
2. DNS returns Cloudflare's anycast IP (not your home IP)
3. Browser → HTTPS request to Cloudflare edge
4. Cloudflare reads Host header: "www.kitestacks.com"
5. Cloudflare routes request through an active tunnel connector
   (monk or kscloud1, whichever is available)
6. cloudflared resolves "homepage" via Docker DNS
7. Request hits nginx in the homepage container
8. nginx serves static HTML/CSS/JS from ./public/
9. Browser JavaScript calls /api/metrics and /api/activity
10. nginx proxies those to kitestacks-metrics-api (Python, host network)
11. metrics-api reads CPU/RAM via psutil (sees real host, not container)
12. metrics-api calls Forgejo API for recent commits
13. Browser renders complete page with live stats
```
### Someone clicks "Sign In with Authentik"
```
1. App (e.g. Grafana) redirects browser to:
   https://auth.kitestacks.com/application/o/authorize/
   ?client_id=grafana&redirect_uri=...&response_type=code
2. Cloudflare routes this to a cloudflared connector
3. Authentik shows login page
4. User enters username + password
5. Authentik validates against shared Postgres (on kscloud1, over Tailscale)
6. Authentik creates an authorization code (row in DB) and redirects:
   https://grafana.kitestacks.com/login/generic_oauth?code=abc123
7. Grafana backend POSTs to auth.kitestacks.com/application/o/token/
   with code=abc123 and client_secret
8. THIS REQUEST may hit a DIFFERENT connector than step 2 did
   → This is why the shared DB matters: the code must exist in one DB,
     not two separate ones that might be out of sync
9. Authentik finds code=abc123 in shared Postgres, validates it
10. Authentik returns JWT (access_token + id_token)
11. Grafana reads user's email from JWT, creates/updates local user
12. User is logged in — never re-enters password for other SSO apps
```
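
The two HTTP requests in the flow above (steps 1 and 7) can be sketched with the standard library. This is a minimal illustration, not Grafana's or Authentik's actual code: the function names are hypothetical, and the form fields in `token_request_body` follow the standard OAuth2 authorization-code grant rather than anything stated in this document.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

AUTH_BASE = "https://auth.kitestacks.com/application/o"
REDIRECT_URI = "https://grafana.kitestacks.com/login/generic_oauth"


def authorize_url(client_id: str) -> str:
    """Step 1: the URL the app redirects the browser to."""
    query = urlencode({
        "client_id": client_id,
        "redirect_uri": REDIRECT_URI,
        "response_type": "code",
    })
    return f"{AUTH_BASE}/authorize/?{query}"


def code_from_callback(callback_url: str) -> str:
    """Step 6: pull the authorization code out of the redirect back to the app."""
    return parse_qs(urlsplit(callback_url).query)["code"][0]


def token_request_body(code: str, client_id: str, client_secret: str) -> dict[str, str]:
    """Step 7: form fields the app's backend POSTs to the token endpoint."""
    return {
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": REDIRECT_URI,
        "client_id": client_id,
        "client_secret": client_secret,  # placeholder value in this sketch
    }


print(authorize_url("grafana"))
code = code_from_callback(f"{REDIRECT_URI}?code=abc123")
print(token_request_body(code, "grafana", "example-secret")["code"])   # abc123
```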
---
## The Shared Database — Why It Exists
After deploying two connectors (monk + kscloud1), users got `invalid_grant` errors when
signing in. The cause: each host had its own separate Authentik database, and the OAuth2 flow
makes two separate HTTP requests:

1. `/application/o/authorize/` → creates authorization code → stored in Database A
2. `/application/o/token/` → looks up authorization code → hits Database B → **not found**

Cloudflare can send each request through either connector, so steps 1 and 2 can hit different hosts.

**Fix:** Authentik on both hosts points to a single shared Postgres+Redis hosted on kscloud1.
It is bound only to kscloud1's Tailscale IP (`100.123.x.x`), never the public IP,
so only devices on the Tailscale network can connect.

**Forgejo** also uses this shared Postgres (separate database on the same server).
Both monk's and kscloud1's Forgejo read from the same database, so accounts, issues, and
repository metadata are consistent regardless of which connector serves the request.
Forgejo keeps the Git objects themselves on the filesystem rather than in Postgres.
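
The failure and the fix can be reproduced in a few lines. This is a toy simulation, not Authentik code: `AuthentikReplica` is a hypothetical stand-in, a plain dict plays the role of Postgres, and the user name is made up.

```python
import secrets


class AuthentikReplica:
    """Toy stand-in for one Authentik instance; `db` plays the role of Postgres."""

    def __init__(self, db: dict[str, str]):
        self.db = db

    def authorize(self, user: str) -> str:
        """Request 1: issue an authorization code and store it."""
        code = secrets.token_hex(8)
        self.db[code] = user
        return code

    def token(self, code: str) -> str:
        """Request 2: redeem the code (single use) or fail with invalid_grant."""
        user = self.db.pop(code, None)
        if user is None:
            return "invalid_grant"
        return f"token-for-{user}"


# Before the fix: each host has its own database.
monk, kscloud1 = AuthentikReplica({}), AuthentikReplica({})
code = monk.authorize("alice")
print(kscloud1.token(code))   # invalid_grant

# After the fix: both hosts use one shared database.
shared: dict[str, str] = {}
monk, kscloud1 = AuthentikReplica(shared), AuthentikReplica(shared)
code = monk.authorize("alice")
print(kscloud1.token(code))   # token-for-alice
```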
---
## The Docker Network
Every container joins the `kitestacks` external Docker bridge network:
```bash
# Create once on each host:
docker network create kitestacks
```
All service containers and the cloudflared container join this network. Docker provides
built-in DNS: when cloudflared needs to route to Grafana, it resolves the hostname `grafana`
to that container's IP address on the bridge network.
```
cloudflared → "grafana" → Docker DNS → 172.x.x.x:3000 → grafana container
```
Without this shared network, cloudflared cannot reach services by name.
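
One way to check name resolution is a small lookup helper run from inside a container attached to `kitestacks`. This is a sketch: the script name `check_dns.py` is hypothetical, and the assumption is that Docker's built-in DNS answers for container names only on the shared network, so the same lookups are expected to fail elsewhere.

```python
import socket
import sys


def resolve(name: str) -> list[str]:
    """Return the IP addresses `name` resolves to, or [] if the lookup fails."""
    try:
        infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    except (socket.gaierror, UnicodeError):
        return []
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Example: python check_dns.py grafana forgejo homepage
    for service in sys.argv[1:]:
        print(service, resolve(service) or "not resolvable from here")
```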
---
## Why No Open Ports on the Home Router
Traditional approach: open ports 80 and 443 on the router → NAT to home server → home IP in DNS.

Problems:
- Home IP is exposed publicly (a target for DDoS and port scans)
- Dynamic home IP breaks DNS when it changes
- Some ISPs block residential port 80/443
- Router misconfiguration = exposed server

**Cloudflare Tunnel approach:**
- cloudflared opens long-lived outbound connections to Cloudflare edge servers
- Cloudflare keeps those connections open
- All inbound traffic arrives over those existing outbound connections
- The home router sees only outbound traffic, so there is nothing to forward
- Home IP is never in DNS, never exposed

**Result:** A public website running on a home PC with zero router configuration and
no exposed home IP address.
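
The core trick, inbound requests travelling over a connection the home side dialled out, can be demonstrated on localhost. This is a toy sketch with a made-up one-line-per-request protocol; it is not how cloudflared talks to Cloudflare, and the function names are hypothetical.

```python
import socket
import threading


def connector(edge_addr, handle):
    """Home side: dial OUT to the edge, then answer requests sent down that socket."""
    with socket.create_connection(edge_addr) as conn, \
            conn.makefile("r") as reader, conn.makefile("w") as writer:
        for line in reader:                       # each line is one inbound request
            writer.write(handle(line.strip()) + "\n")
            writer.flush()


def demo() -> str:
    # The "edge" is the only side that listens; the home side never accepts a connection.
    edge = socket.create_server(("127.0.0.1", 0))
    worker = threading.Thread(
        target=connector,
        args=(edge.getsockname(), lambda path: f"200 OK {path}"),
        daemon=True,
    )
    worker.start()

    tunnel, _ = edge.accept()                     # the connector's outbound connection
    with tunnel, tunnel.makefile("r") as reader, tunnel.makefile("w") as writer:
        writer.write("/index.html\n")             # a visitor's request, pushed down the tunnel
        writer.flush()
        reply = reader.readline().strip()
    edge.close()
    worker.join(timeout=2)
    return reply


print(demo())   # 200 OK /index.html
```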
---
## Tailscale — The Private Backbone
Tailscale creates an encrypted overlay network across all your devices.
Every device gets a stable `100.x.x.x` IP regardless of physical location.
```
monk 100.85.x.x ←── WireGuard ───► 100.123.x.x kscloud1
samurai 100.74.x.x ←── WireGuard ───► 100.123.x.x kscloud1
phone 100.x.x.x ←── WireGuard ───► 100.123.x.x kscloud1
```
Used in this homelab for:
1. **Shared Authentik DB:** kscloud1 Postgres and Redis are bound to `100.123.x.x` only.
   Authentik on monk connects to that address. Traffic is encrypted peer-to-peer.
2. **SSH admin access:** SSH to kscloud1 from anywhere using its Tailscale IP.
   This works even from behind a hotel firewall or on mobile data, because Tailscale
   takes care of NAT traversal.
3. **Uptime monitoring:** The Conky desktop widget on monk reads Uptime Kuma status
   from kscloud1 directly via Tailscale (not through Cloudflare), so it shows the
   true kscloud1-side status.
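
A quick guard for the "bound to the Tailscale IP only" rule is to check the bind address before using it. Tailscale assigns IPv4 addresses from the carrier-grade NAT block `100.64.0.0/10`, which is narrower than all of `100.x.x.x`. The helper names and sample addresses below are hypothetical.

```python
import ipaddress

# Tailscale hands out IPv4 addresses from the carrier-grade NAT block 100.64.0.0/10.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_ip(address: str) -> bool:
    """True if `address` falls inside the Tailscale IPv4 range."""
    return ipaddress.ip_address(address) in TAILSCALE_RANGE


def safe_bind_address(address: str) -> str:
    """Refuse to bind a database to anything that is not a Tailscale address."""
    if not is_tailscale_ip(address):
        raise ValueError(f"{address} is not a Tailscale address; refusing to bind")
    return address


print(is_tailscale_ip("100.123.0.1"))   # True  (placeholder in the 100.123.x.x range)
print(is_tailscale_ip("100.20.0.1"))    # False (100.x.x.x, but outside 100.64.0.0/10)
print(is_tailscale_ip("0.0.0.0"))       # False (all interfaces, including the public one)
```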
---
## The Monitoring Stack
```
                  ┌──────────────┐
monk's            │ node-exporter│  ← exposes CPU/RAM/disk/network
node-exporter     │  port 9100   │
                  └──────┬───────┘
                         │ scrape every 15s
                  ┌──────▼───────┐
kscloud1's   ───► │  prometheus  │  (also scrapes kscloud1:9100 via public IP)
metrics           └──────┬───────┘
                  ┌──────▼───────┐
                  │   grafana    │  ← visualize both hosts, switch via instance picker
                  └──────────────┘
Uptime Kuma → HTTP checks every 60s → all 11 public service URLs
Conky widget → reads Uptime Kuma API on kscloud1 → shows live dot per service
```
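
What Prometheus collects on each scrape is plain text, one sample per line. The sketch below parses a small, made-up sample in that format to show the shape of the data. It is a simplified reader, not Prometheus's parser: timestamps and escaped label values are not handled, and the values in `SAMPLE` are invented.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse simple Prometheus text-format lines into {series: value}.

    Handles `name value` and `name{labels} value`; skips comments and blanks.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token; everything before it is the series.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Illustrative sample of what node-exporter serves on port 9100 (values made up).
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_memory_MemAvailable_bytes 8.123456e+09
node_filesystem_avail_bytes{mountpoint="/"} 5.0e+10
"""

for series, value in parse_metrics(SAMPLE).items():
    print(f"{series:50s} {value:.3g}")
```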
---
## The Portal Architecture
The portal is a custom static site — not a pre-built dashboard:
```
nginx container ("homepage")
├── /       → static HTML/CSS/JS (cyberpunk theme, service cards)
└── /api/*  → proxy_pass → kitestacks-metrics-api on host

kitestacks-metrics-api (Python FastAPI, network_mode: host, pid: host)
├── GET /api/metrics   → psutil reads HOST CPU/RAM/disk/network
├── GET /api/weather   → wttr.in API → current conditions
├── GET /api/activity  → Forgejo API → recent commits across all repos
└── GET /api/health    → {"ok": true}
```
`network_mode: host`: the container shares the host's network namespace.
Without it, psutil would report network counters for the container's virtual interface, not the host's.

`pid: host`: the container can see the host's process table via `/proc`.
Without it, process-level stats would cover only the container's own processes.
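
The shape of the API can be sketched with the standard library alone. This is not the real service: `http.server` is swapped in for FastAPI and `os`/`shutil` for psutil so the example has no dependencies, and the field names, port, and bind address are assumptions.

```python
import json
import os
import shutil
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def host_metrics() -> dict:
    """Stdlib stand-in for the psutil calls; field names are hypothetical."""
    disk = shutil.disk_usage("/")
    return {
        "cpu_count": os.cpu_count(),
        "disk_used_percent": round(100 * disk.used / disk.total, 1),
    }


# Path → function returning a JSON-serialisable payload.
ROUTES = {
    "/api/health": lambda: {"ok": True},
    "/api/metrics": host_metrics,
}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        route = ROUTES.get(self.path)
        status, payload = (200, route()) if route else (404, {"error": "not found"})
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet


def serve(port: int = 8000) -> None:
    """Blocking entry point; the port number is an assumption."""
    ThreadingHTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` and then requesting `/api/health` should return `{"ok": true}`, matching the route table above. The nginx `/api/*` proxy would sit in front of a service like this.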