claude-memory/project-kitestacks-migration.md at e6761897a81d40eb15e9aa5227c6b14bd78a07b9

kenpat fcd8def71a 2026-06-12: full KiteStacks session sync

- KiteStacks migration memory updated: OSticket live, Portainer SSO live
  on both monk+kscloud1, portainer.kitestacks.com HTTP 200, CF noTLSVerify
  fixed via API, auth code TTL bumped 1->10min, Karakeep redirect_uri fixed
- Oracle Cloud ARM migration next: user provisioning manually (Ampere A1,
  4 OCPU, 24GB RAM). OSticket x86-only issue to solve on Oracle side.
- CF API token kitestacks-dns-fix needs rolling (was exposed in chat)
- Portainer admin creds: monk=admin/n1t1MvVHCdcXWIIu, kscloud1=kenpat7177/same
- Added: feedback-forgejo-redaction, project-a-plus-core2 memories

2026-06-12 21:10:48 -05:00


name: project-kitestacks-migration

description: Migration of the live KiteStacks homelab/website from assassin (T14) to monk - COMPLETE. Plus full Hetzner cloud failover (kscloud1, 5.78.233.28) - COMPLETE. All 9 subdomains can run from any single host. Plus 2026-06-10 portal/SSO push: portal FluxCD+coming-soon changes deployed, Karakeep SSO fixed, OpenProject SSO blocked by EE license, Portainer SSO Authentik-side done (pending user manual steps).

metadata:
  node_type: memory
  type: project
  originSessionId: 33992890-3940-4d4a-a94a-22b5621e9c1a

STATUS: MIGRATION + CLOUD FAILOVER COMPLETE (2026-06-10)

monk is the live production host. assassin (T14) is OFF. kscloud1 (Hetzner VPS, 5.78.233.28) is now a THIRD active Cloudflare Tunnel connector and runs a FULL replica of all 9 services, so the site stays up even if both monk and assassin are off (verified by user testing with home wifi off, from phone + mom's phone).

All 9 public subdomains (www, ai, auth, gitforge, grafana, kavita, links, status, tasks) verified returning correct status codes via the live tunnel with kscloud1 in rotation.

Governing principle (user's explicit words)

"leave the cloud backup on at all times" / "thats the point of it. if I am travelling my site will go down otherwise." -> kscloud1 runs PERMANENTLY as a 3rd connector, NOT cold standby. Cloudflare Tunnel load-balances ACTIVE-ACTIVE across all 3 connectors (no primary/backup priority). This means stateful apps (gitforge, openproject, authentik, karakeep, kavita, openwebui, etc.) may show DIFFERENT/STALE data depending on which connector serves a given request - EXPLICITLY ACCEPTED by user as the cost of guaranteed uptime. Fresh/separate databases on kscloud1 are fine; do not try to sync data between monk and kscloud1.

kscloud1 access

SSH: ssh -i ~/.ssh/id_ed25519_kscloud1 kenpat@5.78.233.28 (passwordless, key auth). sudo needs a password ("p12217177") and has no askpass helper - avoid sudo; most things doable as kenpat or via docker. All services live under /opt/kitestacks/docker/<service>/docker-compose.yml, same one-dir-per-app pattern as monk's ~/kitestacks-live/docker/.

kscloud1 services deployed (all `docker compose up -d`, joined to local `kitestacks` network)

cloudflared (3rd tunnel connector, same TUNNEL_TOKEN, connector id 78521d9f-71c0-4e3d-992f-bd1f77da1a8f)
homepage-backup (alias homepage) + caddy + kitestacks-metrics-api-backup - PRE-EXISTING
forgejo (alias forgejo) - PRE-EXISTING, separate DB from monk's (gitforge data inconsistent across connectors, accepted)
prometheus + node-exporter (job kscloud1-node)
grafana (alias grafana, port 3150) - Prometheus datasource (uid 000000001) + "Node Exporter Full" dashboard (id 1860) provisioned via ./provisioning/. OAuth->authentik config present but authentik on kscloud1 has a FRESH db (no provider apps configured) so OAuth login won't work there; local admin login works.
uptime-kuma (alias status->uptime-kuma) - kuma.db seeded by copying monk's admin user (same login: kenpat / same password hash) + monitors: kscloud1 self-ping, Google DNS, and HTTP checks for all 9 *.kitestacks.com subdomains (external monitoring of the live site).
kavita (alias kavita) - empty library (fresh)
karakeep + karakeep-chrome + karakeep-meilisearch (alias karakeep) - fresh meilisearch/db
authentik + authentik-worker + authentik-postgres + authentik-redis (alias on auth) - FRESH DB. Bootstrap admin: akadmin@kitestacks.com / password 6KlYpfCyYxbnKQNiOewN (set via AUTHENTIK_BOOTSTRAP_PASSWORD in .env). No OAuth provider apps exist yet (would need to be manually recreated in authentik UI for grafana/openwebui/karakeep/openproject SSO to work when kscloud1 is the active backend).
kite-litellm + kite-openwebui (alias ai->openwebui) - same .env/secrets as monk. OpenWebUI has ENABLE_SIGNUP=true (changed from monk's false) so kenpat can create a local admin account on first use, since authentik OAuth won't work with kscloud1's fresh authentik.
openproject (alias on tasks, port 8090:80 host - port 80 was taken by caddy) - FRESH db, self-initialized via the all-in-one image's bootstrap (took ~3-4 min). Empty/no projects yet.

monk-side changes made for cross-host monitoring

~/kitestacks-live/docker/prometheus/prometheus.yml: added scrape job kscloud1-node -> 5.78.233.28:9100 (kscloud1's node-exporter is exposed 0.0.0.0:9100, no firewall - reachable from monk's public IP). monk's grafana (the live one, "Node Exporter Full" dashboard now provisioned via ~/kitestacks-live/docker/grafana/provisioning/) shows BOTH t14-node (monk/"this pc") and kscloud1-node ("the cloud") via the instance picker.
kscloud1's prometheus only scrapes itself (kscloud1-node) - monk is behind home NAT, not reachable from kscloud1.

Resource notes (kscloud1: 3 vCPU, 3.7GB RAM + 6GB swap, 75GB disk)

With all services running: ~2.8-2.9GB RAM used, ~2.6-2.8GB swap used (of 6GB), ~835MB-1.2GB "available", disk 29GB/75GB used. Site is functional but under memory pressure - if BOTH monk and assassin are down for an extended period with real concurrent usage, expect sluggishness (esp. openproject/authentik/openwebui). Not yet stress-tested under real failover load.

Key gotchas from THIS phase (cloud failover build-out)

kscloud1's kitestacks Docker network is LOCAL/separate from monk's (same name, no conflict). cloudflared on each host resolves container names against its own host's network.
Adding a new tunnel connector that lacks a backend for an ingress hostname -> 502 for requests routed there. If it has a DIFFERENT backend (e.g. forgejo) -> serves different data inconsistently. Both accepted/expected now that all 9 hostnames have backends on kscloud1.
port 80 on kscloud1 is owned by caddy (serves www-backup/git-backup.kitestacks.com direct A-records, pre-existing, unrelated to the tunnel) - openproject uses 8090:80 for its host port instead (internal container port 8080 is what cloudflared hits).
uptime-kuma / grafana have no simple file-based config API for monitors/datasources beyond grafana provisioning - used direct sqlite manipulation (docker exec ... sqlite3, or python3 sqlite3 module via a throwaway python:3-alpine container with the volume mounted) to seed uptime-kuma's kuma.db with users/monitors.
authentik first boot takes ~1-2 min (migrations); openproject first boot takes ~4-5 min (postgres initdb + Rails migrations + Puma boot), watch docker logs for "Listening on http://0.0.0.0:8080" before testing.

PROBLEM: Cloudflare Tunnel load-balances auth.kitestacks.com and kavita.kitestacks.com active-active across monk and kscloud1. kscloud1's authentik had only the fresh akadmin bootstrap user (not kenpat7177) and kscloud1's kavita had ZERO users -> ~50% of requests showed "wrong password" on authentik and a "create admin account" (signup) screen on kavita instead of login. This contradicts the earlier "fresh DBs are fine" assumption - for IDENTITY apps it breaks login, so it was NOT acceptable. FIX APPLIED (one-time sync, same pattern as uptime-kuma's kuma.db seed):

pg_dump'd monk's authentik-postgres authentik db (--clean --if-exists), scp'd to kscloud1, stopped authentik+authentik-worker on kscloud1, restored via docker exec -i authentik-postgres psql -U authentik -d authentik < dump, restarted. Worked cleanly because AUTHENTIK_SECRET_KEY and PG_PASS were ALREADY IDENTICAL between monk's and kscloud1's authentik/.env.
For kavita: copying the raw kavita.db file via plain cp produced "database disk image is malformed" (WAL-mode db isn't standalone-consistent as a flat file copy even when -wal/-shm look small). FIX: use python3 sqlite3 Connection.backup() (via throwaway python:3-alpine container) to produce a consistent copy, THEN on kscloud1 stop kavita, rm the OLD kavita.db-shm/kavita.db-wal too (stale WAL files against new db = same corruption error), copy in the new kavita.db (chown root:root, chmod 644 to match original ownership - kavita container runs as root), restart.
Result: kscloud1 authentik now has kenpat7177 (matches monk), kscloud1 kavita now has kenpat7177 + acurrie (matches monk). Both connectors now return the same login screen/credentials. NOTE: this is a ONE-TIME sync, not continuous - if monk's users/passwords change later, kscloud1 will drift again and the same symptoms could return; re-run this sync if so.
kscloud1 kavita's library entries point at /books paths that don't exist on kscloud1 (no actual book files there) - login works fine, but browsing the library when served by kscloud1 will show entries with missing files. Same "stale data" tradeoff as gitforge, accepted.
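
The Connection.backup() step used for kavita.db above can be sketched as follows. This is a minimal illustration, not the exact script that was run: the file names, the users table, and the helper names consistent_copy and demo are hypothetical stand-ins. It shows that the backup API reads through SQLite itself, so the copy is a standalone file even while the source is in WAL mode with a writer still connected.

```python
import sqlite3
import tempfile
from pathlib import Path


def consistent_copy(src_path: str, dst_path: str) -> None:
    """Copy a live SQLite database (WAL mode included) into one standalone file."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # backup() goes through SQLite's own reader, so committed pages that
        # still live only in the -wal file are included in the destination.
        src.backup(dst)
    finally:
        dst.close()
        src.close()


def demo() -> list:
    """Build a small WAL-mode database, copy it, and read the copy back."""
    workdir = Path(tempfile.mkdtemp())
    src_path = workdir / "source.db"   # hypothetical stand-in for kavita.db
    dst_path = workdir / "copy.db"

    writer = sqlite3.connect(src_path)
    writer.execute("PRAGMA journal_mode=WAL")
    writer.execute("CREATE TABLE users (name TEXT)")  # illustrative schema only
    writer.executemany(
        "INSERT INTO users VALUES (?)", [("kenpat7177",), ("acurrie",)]
    )
    writer.commit()  # writer stays open, so data may still sit in the -wal file

    consistent_copy(str(src_path), str(dst_path))
    writer.close()

    reader = sqlite3.connect(dst_path)
    rows = [r[0] for r in reader.execute("SELECT name FROM users ORDER BY name")]
    reader.close()
    return rows
```

The copy produced this way has no -wal/-shm companions of its own, which is why the stale ones on the destination host still have to be removed before the new file is dropped in.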

Authentik shared Postgres+Redis over Tailscale (2026-06-10) - fixes "invalid_grant" SSO

PROBLEM: Even after the one-time DB sync above, "Sign in with Authentik" on Kavita could fail with "invalid_grant" / "Code does not exist". Root cause: monk and kscloud1 each ran their OWN authentik-postgres. OAuth2 authorization codes are short-lived per-flow rows in authentik_providers_oauth2_authorizationcode; if Cloudflare Tunnel's active-active routing sends /authorize to one connector and /application/o/token/ to the other, the code only exists in one of the two DBs -> invalid_grant. A one-time data sync can't fix this because the data is created fresh on every login attempt. FIX: Converted to a single shared Postgres+Redis (HA pattern), hosted on kscloud1, reachable ONLY over Tailscale:
Installed Tailscale on both monk and kscloud1 (same tailnet). kscloud1's tailscale IP is 100.123.254.52.
kscloud1's /opt/kitestacks/docker/authentik/docker-compose.yml: authentik-postgres now binds 100.123.254.52:5432:5432 (was unbound/internal-only), authentik-redis now binds 100.123.254.52:6379:6379. Both still also reachable on the local kitestacks docker network for kscloud1's own authentik+worker. Backup of pre-change file: docker-compose.yml.backup-before-shared-db-20260610-1138.
monk's ~/kitestacks-live/docker/authentik/docker-compose.yml: REMOVED the postgresql and redis services entirely. monk's authentik/authentik-worker now point AUTHENTIK_POSTGRESQL__HOST and AUTHENTIK_REDIS__HOST at 100.123.254.52 (kscloud1 over Tailscale), using the same PG_PASS / AUTHENTIK_SECRET_KEY as before (already identical between hosts).
monk's old local authentik-postgres/authentik-redis containers were STOPPED (not removed) - data dirs preserved under ~/kitestacks-live/docker/authentik/postgres in case of rollback, but no longer in use.
Result: BOTH connectors' authentik+worker now read/write the SAME db/redis, regardless of which one handles /authorize vs /application/o/token/. Verified both authentik+authentik-worker healthy on monk and kscloud1, OIDC discovery docs identical, user list matches (kenpat7177 etc.) on both. CONFIRMED BY USER 2026-06-10: "Sign in with Authentik" on Kavita now works (when monk's connector serves the request).
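
The failure mode and the fix described above can be modelled with a toy sketch. Nothing here is Authentik code: AuthServer, its dict-backed store, and login_across_connectors are hypothetical names that only mimic "issue a code on one connector, redeem it on the other".

```python
import secrets


class AuthServer:
    """Toy stand-in for one Authentik instance; `store` plays the role of its DB."""

    def __init__(self, store: dict):
        self.store = store

    def authorize(self, user: str) -> str:
        # Issue a short-lived authorization code and record it in the store.
        code = secrets.token_hex(8)
        self.store[code] = user
        return code

    def token(self, code: str) -> str:
        # Codes are single-use: redeeming removes the row.
        user = self.store.pop(code, None)
        return f"token-for-{user}" if user else "invalid_grant"


def login_across_connectors(first: AuthServer, second: AuthServer) -> str:
    """/authorize lands on `first`, the token exchange lands on `second`."""
    code = first.authorize("kenpat7177")
    return second.token(code)


# Two independent databases: the code exists only where it was issued.
separate = login_across_connectors(AuthServer({}), AuthServer({}))

# One shared database behind both instances: either one can redeem the code.
shared_db: dict = {}
shared = login_across_connectors(AuthServer(shared_db), AuthServer(shared_db))
```

In this model `separate` ends up as "invalid_grant" and `shared` as a token, which is the same asymmetry the shared Postgres+Redis setup removes.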

After the shared-Authentik-DB fix above, the button still didn't appear when Cloudflare routed kavita.kitestacks.com to kscloud1's connector. CAUSE: Kavita's OIDC config lives in ITS OWN db (kavita.db ServerSetting table, Key=40, a JSON blob with Authority/ClientId/Secret/Enabled), separate from Authentik's db. The earlier one-time kavita.db sync (see fix above) was taken BEFORE OIDC SSO was configured in monk's Kavita, so kscloud1's copy had Key=40 with empty Authority/Secret and "Enabled":false. FIX: copied monk's Key=40 JSON value verbatim into kscloud1's kavita.db (stop kavita, docker run --rm -v .../kavita/config:/data -v fix.sql:/fix.sql alpine + apk sqlite + sqlite3 /data/kavita.db < fix.sql with UPDATE ServerSetting SET Value='...' WHERE "Key"=40, restart kavita). NOTE: AspNetUserLogins (OIDC account-linkage table) is empty on BOTH monk and kscloud1 - Kavita creates this row on first OIDC login per-instance (matches existing local user by email since ProvisionAccounts=false), so no extra action needed there. GOTCHA: ServerSetting's PK column is "Key" (INTEGER), not Id - must quote it in sqlite ("Key") since KEY is a SQL reserved word. DRIFT WARNING: any future Kavita server-setting change (OIDC config, library paths, etc.) made on monk will NOT propagate to kscloud1's kavita.db automatically - same one-time-sync caveat as the user-table sync above.

UPDATE 2026-06-10 (RESOLVED via Kavita UI, not direct DB edit): Direct SQL edits to ServerSetting Key=40 got WIPED back to disabled/empty by Kavita on every container restart (RowVersion incremented +2 each time, Authority/Secret cleared, Enabled->false) - confirmed twice, even with a full WAL-consistent kavita.db replace from monk. Direct DB writes to this table do NOT survive a restart; only saves through Kavita's own Settings UI/API persist correctly. FIX: opened an SSH local port-forward (ssh -L 5099:localhost:5000 kenpat@5.78.233.28) so the user could reach kscloud1's Kavita directly at http://localhost:5099 (bypassing the Cloudflare load-balanced domain), logged in with their normal kenpat7177 Kavita password, and re-entered the OIDC config in Settings -> OIDC:

Authority: https://auth.kitestacks.com/application/o/kavita/ (MUST include trailing slash - Kavita validates that this exactly matches the issuer claim in Authentik's .well-known/openid-configuration, which has a trailing slash. Without it: "Kavita can load the OIDC configuration, but the issuer does not match".)
Client ID: kavita, Client Secret: (96-hex-char secret from Authentik's Kavita OAuth2 provider - watch for copy/paste truncation, verify length=96)
Enabled: true, ProviderName: authentik

Saved via UI (RowVersion 8->12->14 across two saves to fix a 1-char-truncated secret), then docker compose restart kavita on kscloud1 - config SURVIVED this restart (unlike the direct-SQL attempts) and /api/settings/oidc now reports "enabled": true. SSH tunnel closed afterward (no firewall changes were made/needed). Set a temporary ApiKey on kenpat7177's kscloud1 kavita account during troubleshooting (for a Plugin/authenticate attempt that turned out to return 401 / unused) - left in place, harmless (grants API access to that user's own account only). TAKEAWAY FOR FUTURE KAVITA CONFIG CHANGES ON KSCLOUD1: always use the Kavita UI (via SSH port-forward to localhost:5000) rather than direct sqlite edits - direct edits to ServerSetting do not survive a restart. CONFIRMED BY USER 2026-06-10: "Sign in with Authentik" now works on Kavita regardless of which connector (monk/kscloud1) answers.
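
Both pitfalls hit while re-entering the OIDC config (missing trailing slash, 1-char-truncated secret) can be caught before saving with a small pre-flight check. This is a sketch under assumptions: check_oidc_settings is a hypothetical helper, the issuer string would be read from the provider's .well-known/openid-configuration, and the exact-match rule simply mirrors the behaviour described above rather than Kavita's actual code.

```python
import string


def check_oidc_settings(
    authority: str, issuer: str, client_secret: str, expected_len: int = 96
) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # The Authority must equal the issuer claim character for character,
    # including the trailing slash.
    if authority != issuer:
        problems.append("authority does not exactly match issuer")
    # Copy/paste truncation shows up as a wrong length.
    if len(client_secret) != expected_len:
        problems.append(f"secret length {len(client_secret)} != {expected_len}")
    # The secret is described as hex characters only.
    if any(ch not in string.hexdigits for ch in client_secret):
        problems.append("secret contains non-hex characters")
    return problems
```

For example, passing an Authority without the trailing slash together with a 95-character secret returns two problems, while the corrected values return an empty list.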

Kavita cover images missing on kscloud1 - FIXED 2026-06-10

After the kavita.db sync from monk, kscloud1's db referenced cover image files (e.g. v1_c1.png..v10_c10.png in ServerSetting/Series.CoverImage) that didn't exist on kscloud1's filesystem - kscloud1's config/covers/ dir was empty (monk has 9 files, ~1.3MB). Result: book/series cover thumbnails didn't load when kscloud1 served the request. FIX: tar'd monk's ~/kitestacks-live/docker/kavita/config/covers/ (owned 1000:1000), scp'd to kscloud1, extracted into /opt/kitestacks/docker/kavita/config/covers/ via a throwaway alpine container, chown -R 1000:1000. No kavita restart needed - covers are served as static files from disk. CONFIRMED BY USER: covers now load correctly. NOTE: this is another one-time sync (same drift caveat) - if new books/covers are added on monk later, they won't appear on kscloud1 unless re-synced (covers/ dir + kavita.db + actual book files under library/books, none of which exist on kscloud1 per the earlier "stale data" note).

SECURITY NOTE (shared Authentik Postgres/Redis): postgres/redis on kscloud1 are bound to the Tailscale interface IP only (100.123.254.52), not 0.0.0.0 - not exposed to the public internet.

ROLLBACK (shared Authentik Postgres/Redis): if Tailscale connectivity ever breaks, monk's authentik will fail to start (can't reach 100.123.254.52). To roll back: restore monk's docker-compose.yml from git/backup to use local postgresql/redis services again, restart monk's old authentik-postgres/authentik-redis containers (docker start authentik-postgres authentik-redis in ~/kitestacks-live/docker/authentik), docker compose up -d. Note this would mean monk's authentik db is now STALE (kscloud1's shared db has any logins/changes since 2026-06-10) - would need a fresh pg_dump sync from kscloud1 first.

kscloud1 ufw blocks docker-bridge -> host port 8000 (metrics API) - FIXED 2026-06-10

kscloud1 has ufw active with default deny incoming/routed. The kitestacks-metrics-api-backup container (network_mode: host, binds 0.0.0.0:8000) was unreachable from homepage-backup via host.docker.internal:8000 (TCP timeout, not refused -> ufw drop), causing the homepage System Status widget to show 0%/"Offline" when kscloud1 served the request. FIXED by adding: sudo ufw allow from 172.16.0.0/12 to any port 8000 proto tcp (covers all docker bridge subnets on this host: 172.17-172.29.x.x). Verified homepage-backup -> host.docker.internal:8000/api/metrics now returns real CPU/RAM/storage/network data. kscloud1 sudo password is in the "kscloud1 access" section above - needed echo PASS | sudo -S <cmd> (no askpass helper, non-interactive sudo via -S works fine).
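
The claim that 172.16.0.0/12 covers the 172.17-172.29 bridge ranges can be checked with Python's ipaddress module. A small sketch: treating each bridge as a /16 is an assumption made for illustration (the actual per-network prefix sizes on kscloud1 are not recorded here), and bridges_covered is a hypothetical helper name.

```python
import ipaddress

# The source range used in the ufw rule.
ALLOWED = ipaddress.ip_network("172.16.0.0/12")


def bridges_covered(first: int = 17, last: int = 29) -> bool:
    """True if every 172.<n>.0.0/16 for n in [first, last] lies inside ALLOWED."""
    return all(
        ipaddress.ip_network(f"172.{n}.0.0/16").subnet_of(ALLOWED)
        for n in range(first, last + 1)
    )


covered = bridges_covered()
# The /12 spans 172.16.0.0 through 172.31.255.255 and nothing beyond it.
last_address = str(ALLOWED[-1])
outside = ipaddress.ip_network("172.32.0.0/16").subnet_of(ALLOWED)
```

The rule is therefore slightly wider than the bridges currently in use (it also admits 172.16.x.x, 172.30.x.x and 172.31.x.x), but it stays within private address space.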

Portal SSO/coming-soon push + Karakeep fix + OpenProject EE blocker + Portainer SSO setup (2026-06-10)

Per user's original request: SSO for grafana/prometheus/portainer/cloudflare/uptime-kuma (authentik=source), KITEAI/LITELLM/OPENROUTER -> "coming soon", SSO for openproject/forgejo/karakeep, FluxCD card in AI&Automation. Hard constraint: user does NOT use Cloudflare Zero Trust/Access ("costs money") - any Cloudflare work must avoid those products (note: the Tunnel UI itself lives under Cloudflare's "Zero Trust" dashboard section, but configuring a Tunnel public hostname there is NOT the same as enabling Zero Trust/Access - fine to use).

Portal UI changes - DEPLOYED to all 3 copies, verified live

Edited the AI & AUTOMATION panel (cards cards-3 -> cards cards-2, now 2x2): Kite AI and OpenRouter cards changed from external links to href="#" data-coming-soon="1" (LiteLLM was already coming-soon); added a 4th card "FluxCD" / "GitOps Automation" using /images/icons/fluxcd.png, also coming-soon (automation scripts with FluxCD+Prometheus+Grafana+node-exporter are a future project). Applied identically to:

~/kitestacks-live/docker/kitestacks-portal-test/public/index.html (monk, dev, port 3008)
~/kitestacks-live/docker/kitestacks-portal/public/index.html (monk, LIVE, served by "homepage" container 3005->3000 - this is the file that backs www.kitestacks.com)
/opt/kitestacks/docker/www-backup/kitestacks-portal/public/index.html (kscloud1, served by homepage-backup port 3015)

Verified https://www.kitestacks.com returns "FluxCD" consistently (6/6 requests across both connectors).
NOTE: Portainer card on the live portal is currently data-coming-soon="1" - update this to a real href="https://portainer.kitestacks.com" link (remove data-coming-soon) once the Portainer SSO manual steps below are completed.
NOTE 2: "cloudflare should all be in the networking side" from the original request was never resolved - Cloudflare card is still in the INFRASTRUCTURE panel, not moved/renamed to a "NETWORKING" panel. Ambiguous, deprioritized, not revisited.
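
A repeated-request check like the 6/6 verification above can be sketched as below. It is illustrative only: count_hits and fetch_body are hypothetical helper names, and because Cloudflare decides which connector answers, repeated requests only make it likely (not certain) that every connector was exercised.

```python
from typing import Callable
from urllib.request import urlopen


def fetch_body(url: str) -> str:
    """Fetch a page body over HTTP(S) using only the standard library."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def count_hits(
    url: str,
    marker: str,
    attempts: int = 6,
    fetch: Callable[[str], str] = fetch_body,
) -> int:
    """Request `url` `attempts` times and count responses containing `marker`."""
    return sum(1 for _ in range(attempts) if marker in fetch(url))
```

Calling count_hits("https://www.kitestacks.com", "FluxCD") and comparing the result with the number of attempts gives the same pass/fail signal; a lower count would suggest that one copy of index.html was missed.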

Karakeep SSO redirect_uri fix - DONE, confirmed working

Karakeep uses NextAuth.js with provider id "custom" (not "authentik") - actual OAuth callback path is /api/auth/callback/custom, but Authentik's Karakeep OAuth2Provider's _redirect_uris had the wrong path -> "Redirect URI Error". FIX: direct Postgres UPDATE to authentik_providers_oauth2_oauth2provider._redirect_uris (JSON column) on the shared kscloud1 authentik-postgres (100.123.254.52), wrapped in explicit BEGIN; UPDATE ...; COMMIT; (a bare single-statement -c "UPDATE..." reported "UPDATE 1" but did NOT persist on first attempt - cause unclear, explicit transaction fixed it). After the DB write, restarted authentik+authentik-worker on BOTH monk and kscloud1 and polled docker inspect --format '{{.State.Health.Status}}' until both reported "healthy" (~50s) before retesting - first retest hit a transient 502 because kscloud1's authentik was still "starting". CONFIRMED: Authentik now serves the login page (not "Redirect URI Error") for Karakeep SSO. PG_PASS GOTCHA: ~/kitestacks-live/docker/authentik/.env PG_PASS value ends in = - extract with cut -d= -f2- (NOT -f2, which truncates the trailing = and causes "password authentication failed"). REUSABLE PATTERN for any future direct Authentik DB edit: (1) wrap writes in explicit BEGIN/COMMIT, (2) restart authentik+authentik-worker on BOTH monk and kscloud1, (3) wait for health=healthy on both before testing.
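
Two pieces of the pattern above lend themselves to a short sketch: reading a .env value that itself contains "=", and waiting for containers to report healthy before retesting. The helper names (env_value, docker_health, wait_until_healthy) and the timeout/interval defaults are assumptions for illustration; docker_health only inspects containers on the host it runs on, so it would need to be run on monk and kscloud1 separately.

```python
import subprocess
import time
from typing import Callable, Iterable


def env_value(line: str) -> str:
    """Return everything after the FIRST '=' (same effect as `cut -d= -f2-`)."""
    return line.rstrip("\n").split("=", 1)[1]


def docker_health(container: str) -> str:
    """Ask the local Docker daemon for a container's health status."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Health.Status}}", container],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def wait_until_healthy(
    containers: Iterable[str],
    get_status: Callable[[str], str] = docker_health,
    timeout: float = 120.0,
    interval: float = 5.0,
) -> bool:
    """Poll until every container reports 'healthy'; False if the timeout passes."""
    deadline = time.monotonic() + timeout
    pending = set(containers)
    while pending:
        # Keep only the containers that are not healthy yet.
        pending = {name for name in pending if get_status(name) != "healthy"}
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

Splitting on every "=" and taking the second field is the Python analogue of `cut -d= -f2`: it silently drops a trailing "=", which is the truncation that produced "password authentication failed".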

OpenProject SSO - config bug fixed, but BLOCKED by Enterprise licensing (no further action possible)

~/kitestacks-live/docker/openproject/docker-compose.yml env vars were wrong in two ways: (1) extra "PROVIDERS_" segment in var names caused seed_oidc_provider = {"providers": {"authentik": {...}}} instead of {"authentik": {...}}, producing a broken stub provider record (slug="providers", id=1, since deleted via Rails runner); (2) discovery_endpoint isn't read by ConfigurationMapper at all - replaced with explicit ISSUER/AUTHORIZATION__ENDPOINT/TOKEN__ENDPOINT/USERINFO__ENDPOINT/END__SESSION__ENDPOINT/JWKS__URI vars (current docker-compose.yml has the corrected version, see file - all derived from https://auth.kitestacks.com/application/o/openproject/.well-known/openid-configuration).

After fixing both, the seeder correctly creates provider slug="authentik", available=true, all fields correct - BUT the SSO button still does not appear on /login.

CONFIRMED ROOT CAUSE (terminal, source-code-verified): OpenProject CE 2025/v15's OmniAuth SSO strategy (OpenProject::Plugins::AuthPlugin/OpenIDConnect) AND SAML (auth_saml/lib/open_project/auth_saml/engine.rb, enterprise_feature: "sso_auth_providers") are BOTH gated behind an Enterprise Edition license - "OmniAuth SSO strategy ... is only available for Enterprise Editions". No app/config-level workaround exists. Only remaining options: buy EE license, OR put a forward-auth proxy (oauth2-proxy / Authentik embedded outpost) in front of OpenProject - DEFERRED along with Prometheus/Uptime Kuma proxy work (see below) until Oracle VPS topology is decided. OpenProject container is healthy, /login returns 200, no projects yet.

Portainer SSO - Authentik side DONE, two manual steps PENDING (not yet done by user)

Per user: "yes continue with portainer" / "yes but make sure it is still secure" (approved exposing Portainer publicly via a NEW Cloudflare Tunnel hostname, with explicit requirement to keep it secure -> access restricted to the homelab-admin Authentik group). Created via docker exec authentik ak shell (Django ORM, no Authentik API token configured) on kscloud1's shared authentik-postgres:

OAuth2Provider "Portainer": client_id=portainer, client_secret=wTim3mrMwt34ko1RYMvK1RNnjwWOMi_d4r4cS6exr7DjozCrL5zKthHl-5KjargF, provider_id=9, redirect_uri=https://portainer.kitestacks.com (strict), scopes openid/email/profile, sub_mode=user_email, signing key + flows copied from existing providers (same pattern as Karakeep/Grafana).
Application "Portainer" (slug="portainer", meta_launch_url=https://portainer.kitestacks.com).
PolicyBinding restricting the Portainer application to Authentik group homelab-admin (UUID e21b0aa5-62e7-4b3a-8302-130b0ae148a5) - this is the "make sure it is still secure" piece (only homelab-admin members can SSO in).
Verified discovery doc resolves: https://auth.kitestacks.com/application/o/portainer/.well-known/openid-configuration.

PENDING MANUAL STEPS (user must do via UI - confirmed portainer.kitestacks.com still returns 000 as of 2026-06-10):

Cloudflare dashboard -> Tunnel -> add Public Hostname portainer.kitestacks.com -> service https://portainer:9443 (HTTPS), enable "No TLS Verify". (This is in the Tunnel config UI, which Cloudflare happens to host under the "Zero Trust" nav section, but adding a Tunnel hostname is NOT enabling Zero Trust/Access - does not violate the no-Zero-Trust constraint.)
In Portainer -> Settings -> Authentication -> OAuth (Provider: Custom), on BOTH monk's and kscloud1's SEPARATE Portainer instances, configure:
- Client ID: portainer
- Client Secret: wTim3mrMwt34ko1RYMvK1RNnjwWOMi_d4r4cS6exr7DjozCrL5zKthHl-5KjargF
- Authorization URL: https://auth.kitestacks.com/application/o/authorize/
- Access Token URL: https://auth.kitestacks.com/application/o/token/
- Resource/Userinfo URL: https://auth.kitestacks.com/application/o/userinfo/
- Redirect URL: https://portainer.kitestacks.com
- Logout URL: https://auth.kitestacks.com/application/o/portainer/end-session/
- Scopes: openid email profile, User identifier claim: email

AFTER both steps done: update the live portal's Portainer card (in the 3 files above) from data-coming-soon="1" to a real href="https://portainer.kitestacks.com" target="_blank" rel="noopener" link.

App-level SSO status summary (end of 2026-06-10 session)

Grafana: working (pre-existing). Forgejo: working (pre-existing). Karakeep: fixed this session, working. OpenProject: blocked by EE license (terminal at app level). Portainer: Authentik side done, waiting on user's 2 manual steps above. Prometheus + Uptime Kuma: DEFERRED - neither has native OAuth, need a forward-auth proxy (oauth2-proxy or Authentik embedded outpost) - deferred per user's "ok lets do smaller app level" (hold new infra until Oracle VPS decided). Cloudflare itself: no SSO concept applicable (it's Cloudflare's own dashboard login) - was always about the portal's Cloudflare card placement, see "Portal UI changes" note above.

Oracle VPS migration - PLANNED, upcoming (stated 2026-06-11)

User confirmed on 2026-06-11: "we are going to switch things soon from hetzner cloud to oracle soon." -> kscloud1 (Hetzner, 5.78.233.28) is intended to be REPLACED by an Oracle Cloud VPS in the near future ("soon", no firm date yet). Originally raised 2026-06-10 as exploratory ("how easy would it be to move everything to oracle vps after?"), now an actual plan. Implication: avoid investing further one-off/manual config work that's hard to redo (e.g. more one-time DB syncs, hand-edited sqlite, etc.) on kscloud1 if avoidable - prefer changes that are easy to replicate on a new host. When the Oracle VPS is provisioned, plan to follow the same pattern as the kscloud1 cloud-failover build-out (new Cloudflare Tunnel connector + full service replicas + shared Authentik/Postgres/Redis over Tailscale + the Forgejo FORGEJO_API_BASE-over-Tailscale pattern for the portal's Recent Activity, see "Recent Activity" fix below) - then retire kscloud1 the same way assassin/T14 was retired (decommission once Oracle replica verified working).

Prior migration gotchas (monk, kept for reference - see git history/old notes if needed)

rsync --files-from recursion bug, bind-mount postgres dirs come over empty as non-root (use pg_dumpall/pg_dump --clean from running container instead), pg_dumpall --clean across template1 breaks on client/server version mismatch (use single-db pg_dump+psql instead), grafana data dir needs chown 472:472, kite-litellm needed manual docker network connect kitestacks kite-litellm.

2026-06-12: SSO fixes + Portainer deployed on kscloud1

Root cause: monk reconnect race condition

When monk goes offline (user travels) and reconnects, Cloudflare starts routing some token exchange requests to monk while codes were created on kscloud1 during the offline window. Auth codes had a 60-second TTL, which expired before monk's Authentik fully started (~5 min startup). FIX: increased access_code_validity from minutes=1 to minutes=10 for ALL 9 OAuth2 providers in the shared Postgres DB. This gives enough buffer for monk's containers to start before codes expire. Command used (via python:3-alpine container): docker run --rm --network host -v /tmp/fix_auth.py:/fix.py python:3-alpine sh -c ... connecting to shared Postgres at 100.123.254.52.
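
The timing argument above comes down to simple arithmetic, sketched here. The "minutes=N" string form is taken from the values quoted above; parse_validity is a hypothetical helper, and accepting several key=value pairs separated by ";" is an assumption rather than a statement about Authentik's parser.

```python
from datetime import timedelta


def parse_validity(spec: str) -> timedelta:
    """Turn a string such as 'minutes=10' into a timedelta."""
    parts = dict(item.split("=", 1) for item in spec.split(";") if item)
    return timedelta(**{unit: int(value) for unit, value in parts.items()})


def code_outlives_startup(validity: timedelta, startup: timedelta) -> bool:
    """True if a code issued when startup begins is still valid when it ends."""
    return startup < validity


STARTUP = timedelta(minutes=5)  # the ~5 min Authentik startup noted above

old_ok = code_outlives_startup(parse_validity("minutes=1"), STARTUP)
new_ok = code_outlives_startup(parse_validity("minutes=10"), STARTUP)
margin = parse_validity("minutes=10") - STARTUP
```

With a 1-minute validity the code expires roughly four minutes before monk's Authentik is ready, whereas 10 minutes leaves about five minutes of slack.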

Karakeep redirect_uri reverted and re-fixed

The Karakeep OAuth2Provider _redirect_uris had reverted back to the proxy pattern (/outpost.goauthentik.io/callback?...) instead of the correct NextAuth callback (https://links.kitestacks.com/api/auth/callback/custom). This caused "Redirect URI Error" from Authentik whenever SSO was attempted. Root cause unknown (possibly an Authentik blueprint or UI save that regenerated/overrode the field). FIX: same Postgres UPDATE pattern. WATCH: if this reverts again, check Authentik blueprints or if someone modified the Karakeep provider via the Authentik admin UI.

Portainer deployed on kscloud1

Created /opt/kitestacks/docker/portainer/docker-compose.yml (same image/config as monk's portainer). Container running as portainer, port 9443:9443, on kitestacks network. Volume is local (NOT shared with monk - fresh Portainer instance).

STILL PENDING (user action in Cloudflare dashboard):

Tunnel ID: 5e60ea8e-a543-49b6-bab5-325f39441e00, Account: d0bb7673333fcd794622956f1662f785
Add hostname portainer.kitestacks.com → service https://portainer:9443, No TLS Verify

STILL PENDING (user action in both Portainer UIs): configure OAuth (see prior notes in "Portainer SSO" section above for exact credentials). Portal card update (3 files) also still pending until tunnel+OAuth done.

Phase 2 Planned: Obsidian Mind Map → HTML Mind Map Sync

User wants to create an Obsidian mind map of the KiteStacks homelab that syncs/exports to a live HTML mind map embedded in the homelab portal or a standalone page. To be built after full Obsidian+samurai setup is complete.
