docs: clean up runbook and create dedicated debugging guide
parent bdec86b16f
commit c4c6b49bf4
2 changed files with 53 additions and 73 deletions
82
RUNBOOK.md
@@ -1407,77 +1407,13 @@ Do not commit `.env` files, webhook URLs, or database files to this repo.

---

## Phase 8: Forgejo Sync + osTicket Authentik LDAP SSO (2026-06-14/15)
## Security Posture
The entire KiteStacks homelab is secured by a Zero Trust architecture:
1. **No Open Inbound Ports:** All public subdomains are routed through Cloudflare Tunnels (edge-to-container). The home router forwards no ports.
2. **Network Isolation:** Internal communication between nodes (`monk` and `kscloud1`) happens strictly over Tailscale, a WireGuard-based mesh network (`100.x.x.x`).
3. **SSO Protection:** Authentik acts as the Identity Provider (OIDC/LDAP) and as a reverse proxy (Embedded Outpost) protecting all sensitive endpoints (`/scp/` for osTicket, Portainer, Grafana, Kite AI, etc.).
4. **Standalone Bypasses:** The only apps that are fully public without SSO are the main Portal (`www`) and the read-only FluxCD GitOps Dashboard (`flux`), which was moved into a standalone Nginx container specifically to decouple it from Authentik.

### Forgejo Sync (monk → kscloud1)

`monk` is authoritative. The Forgejo instance on `kscloud1` is a read replica, synced every 6 hours.

**Sync script:** `~/kitestacks-live/docker/forgejo/sync-to-cloud.sh`
**Cron:** `0 */6 * * *` on monk, logs to `/tmp/forgejo-sync.log`

Manual sync:
```bash
~/kitestacks-live/docker/forgejo/sync-to-cloud.sh
```

To redo a full restore from scratch (e.g., after a kscloud1 rebuild):
```bash
# On monk: create dump
docker exec -u git forgejo /app/gitea/gitea dump --type zip -f /tmp/forgejo-backup.zip
docker cp forgejo:/tmp/forgejo-backup.zip /tmp/forgejo-backup.zip
# Transfer and restore on cloud host; see claude-memory for detailed steps
```

### osTicket Authentik LDAP SSO

Staff log into `tasks.kitestacks.com/scp/` using their **Authentik credentials** (not a separate osTicket password).

**Architecture:**
```
osticket-app → authentik-ldap-proxy:389 (socat) → authentik-ldap:3389 → auth.kitestacks.com
```

**Services deployed:**
- `~/kitestacks-live/docker/authentik-ldap/`: LDAP outpost + socat proxy on monk
- `/opt/kitestacks/docker/authentik-ldap/`: LDAP outpost on kscloud1

**LDAP search account:** `cn=ldap-svc,ou=users,dc=ldap,dc=goauthentik,dc=io`
The password is stored in Authentik and in osTicket's `ost_config` table (namespace `plugin.2`, key `bind_pw`, encrypted).

**auth-ldap.phar** at `/data/upload/include/plugins/auth-ldap.phar` inside the osticket-app container has been patched (original at `.phar.orig`). Do NOT replace it with the upstream version; the patch is required for PHP 7.3 + PEAR compatibility.

**If LDAP login stops working:**
```bash
# Check LDAP outpost is running and connected
# (2>&1 so that log lines written to stderr also pass through grep)
docker logs authentik-ldap --since 5m 2>&1 | grep -v debug
docker logs authentik-ldap-proxy 2>&1 | tail -5

# Test bind from osticket-app container (replace PASSWORD with the ldap-svc password)
docker exec osticket-app php -r "
\$c = @ldap_connect('authentik-ldap-proxy');
ldap_set_option(\$c, LDAP_OPT_PROTOCOL_VERSION, 3);
\$r = @ldap_bind(\$c, 'cn=ldap-svc,ou=users,dc=ldap,dc=goauthentik,dc=io', 'PASSWORD');
echo \$r ? 'OK' : ldap_error(\$c);
"
# Verify kscloud1 outpost reachable
nc -zv 100.123.254.52 3389
```

**Reset a staff member's Authentik password:**
```bash
docker exec authentik ak shell -c "
from authentik.core.models import User
u = User.objects.get(username='kenpat7177')
u.set_password('NewPassword123!')
u.save()
print('done')
"
```

**Clear osTicket login lockout:**
```bash
# Deletes all osTicket sessions; every logged-in user must sign in again.
docker run --rm --network host mariadb:10.11 mysql \
-h 100.123.254.52 -u osticket -p<DB_PASS> osticket \
-e "DELETE FROM ost_session;"
```
## Troubleshooting
For detailed diagnostics, password resets, and fixes for specific issues (such as missing osTicket emails, Uptime Kuma connection loops, or Authentik token errors), see:
[**docs/DEBUGGING.md**](docs/DEBUGGING.md)

44
docs/DEBUGGING.md
Normal file
@@ -0,0 +1,44 @@
# KiteStacks Homelab - Debugging & Troubleshooting

This document collects diagnostic steps and fixes for known issues encountered while setting up and operating the KiteStacks homelab.

---

## 1. osTicket: New User Activation Emails Not Sending
**Symptom:** When a new user registers for a Help Desk account, they do not receive the activation email.
**Root Cause:** osTicket runs in a Docker container without a local Mail Transfer Agent (MTA) such as Postfix or Sendmail. By default, PHP's built-in `mail()` function hands the message to a local MTA, so with none installed it fails silently and the email is never delivered.
**Fix:** Configure an external SMTP server in the osTicket Admin Panel.
1. Log into the osTicket Staff Control Panel (`/scp/`).
2. Go to **Emails > Emails**.
3. Select the default outbound email address (e.g., `noreply@kitestacks.com`).
4. Scroll down to **SMTP Settings** and configure it to use a real mail provider (e.g., SendGrid, Mailgun, Amazon SES, or Gmail SMTP).
5. Ensure **Authentication Required** is set to **Yes**.
6. Save and send a test email.

## 2. Cloudflare Tunnel "Hmm. We're having trouble finding that site"
**Symptom:** A subdomain is correctly configured in Cloudflare Zero Trust, but visiting the site returns a Cloudflare error.
**Root Cause 1:** The internal service is down or restarting. Check the Docker container logs.
**Root Cause 2:** The two load-balanced hosts are out of sync. Cloudflare balances requests between `monk` and `kscloud1`. If you update a container on `monk` but forget to update it on `kscloud1`, roughly half of the requests will fail or show stale data.
**Fix:** Keep the Docker containers on both hosts in sync, or explicitly configure Cloudflare to route only to the active host for that specific subdomain.

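One way to spot this kind of drift is to compare the container image list from each host. The sketch below is only illustrative: `image_drift` is a hypothetical helper name, the file paths and the `kscloud1` SSH alias in the comments are assumptions, and each input file is assumed to hold the output of `docker ps --format '{{.Names}} {{.Image}}'` from one host.

```bash
# Hypothetical helper: list "name image" lines that appear on only one host.
# Example inputs (paths and SSH alias are assumptions):
#   docker ps --format '{{.Names}} {{.Image}}' > /tmp/monk-images.txt
#   ssh kscloud1 "docker ps --format '{{.Names}} {{.Image}}'" > /tmp/kscloud1-images.txt
image_drift() {
  drift_a="$(mktemp)"
  drift_b="$(mktemp)"
  # comm requires sorted input, so sort both listings first.
  sort "$1" > "$drift_a"
  sort "$2" > "$drift_b"
  # -23 keeps lines unique to the first file, -13 lines unique to the second.
  comm -23 "$drift_a" "$drift_b" | sed 's/^/only-in-first: /'
  comm -13 "$drift_a" "$drift_b" | sed 's/^/only-in-second: /'
  rm -f "$drift_a" "$drift_b"
}
```

Calling `image_drift /tmp/monk-images.txt /tmp/kscloud1-images.txt` prints nothing when both hosts run the same containers with the same image tags; any output points at a container that needs updating on one side. Note that this compares image tags only, so two hosts on the same mutable tag (such as `latest`) can still differ.
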
## 3. Authentik "invalid_grant" or "Code does not exist"
**Symptom:** Logging into a service via Authentik SSO randomly fails with "invalid_grant".
**Root Cause:** Initially, `monk` and `kscloud1` ran separate Authentik Postgres databases. An auth code generated on one node could be redeemed on the other, where it did not exist, so validation failed.
**Fix:** The Authentik databases are now **unified** over Tailscale. `monk` points its Postgres and Redis connections to `100.123.254.52` (kscloud1). Do not run a local Postgres for Authentik on `monk`. Note that if Tailscale goes down, SSO will fail.

## 4. Kavita "Sign in with Authentik" Button Missing
**Symptom:** The Authentik OIDC login button does not appear on the Kavita login screen.
**Root Cause:** Kavita stores OIDC settings in its internal SQLite database (`kavita.db`), not in environment variables.
**Fix:** Configure the OIDC settings manually in the Kavita UI (Admin Settings > OIDC). Direct SQL edits are overwritten by the Kavita container on restart.

## 5. Portainer Password Reset
**Symptom:** The admin password is lost.
**Fix:** Stop the Portainer container. Then either use a Go container with `bbolt` to patch the underlying BoltDB directly, or temporarily pass the `--admin-password` flag to the container entrypoint to reset it.

## 6. Uptime Kuma "Reconnecting to server..." Loop
**Symptom:** The Uptime Kuma UI constantly shows "Reconnecting...".
**Fix:** Uptime Kuma requires WebSockets. Ensure that Cloudflare Tunnel is not aggressively caching the HTML/JS and that Nginx proxy timeouts are not closing the WebSocket connection prematurely.

## 7. Forgejo Authentication Failures (LDAP/OIDC)
**Symptom:** Forgejo throws a 500 or redirect error after SSO login.
**Fix:** Ensure that `ROOT_URL` in Forgejo's `app.ini` exactly matches the public URL (`https://gitforge.kitestacks.com/`), including the scheme and trailing slash, and that the Authentik Application Launch URL strictly matches the OAuth redirect URI configured in Forgejo.

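The `ROOT_URL` comparison can be scripted so that a stray scheme or a missing trailing slash is caught quickly. This is a minimal sketch, not part of the existing tooling: `check_root_url` is a hypothetical helper name, and it assumes you have a local copy of `app.ini` (for example, copied out of the `forgejo` container; its location inside the container depends on your deployment).

```bash
# Hypothetical helper: compare ROOT_URL in an app.ini file with the expected URL.
# Usage: check_root_url /path/to/app.ini "https://gitforge.kitestacks.com/"
check_root_url() {
  root_url_file="$1"
  root_url_expected="$2"
  # Take the first ROOT_URL assignment and strip the key, '=' and surrounding whitespace.
  root_url_actual="$(sed -n 's/^[[:space:]]*ROOT_URL[[:space:]]*=[[:space:]]*//p' "$root_url_file" \
    | head -n 1 | sed 's/[[:space:]]*$//')"
  if [ "$root_url_actual" = "$root_url_expected" ]; then
    echo "match"
  else
    # The comparison is an exact string match, so scheme and trailing slash both count.
    echo "mismatch: ${root_url_actual:-<unset>}"
    return 1
  fi
}
```

The function returns a non-zero status on a mismatch, so it can also be used as a guard in a deploy or health-check script.
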