Operations
Operational playbooks & runbooks
This page collects the operational procedures for running the DataFlow AI Platform: the disaster-recovery failover runbook for the documented GKE topology, and a set of practical runbooks for the single-VPS production system that actually serves users today. Each procedure is a numbered, do-this-then-that sequence.
Before you start
Two facts shape every procedure here:
- Production is the single Debian VPS `etl.exai.cloud` (57.129.120.73), running Docker Compose. There is no canary, no automated rollback, and no automated tests on this path.
- The DR/failover runbook describes the GKE multi-region model, which is not what runs production. It is documented below for completeness and because it is the target topology, but a real VPS incident is handled with the Compose runbooks, not GKE failover.
Credentials live in source — handle with care
Several deploy scripts (deploy_to_vps.py, apply-nginx-fix.py, seed_keycloak_vps.py) and deploy/.env.production contain hardcoded or plaintext production secrets. This is a known security finding. When following these runbooks, prefer to source credentials from a secrets manager and treat rotation (see below) as a priority.
Runbook 1 — Disaster recovery / failover (GKE)
This is the documented multi-region DR procedure from docs/runbooks/dr-failover-runbook.md. It applies to the GKE topology, not the VPS.
Targets: RTO < 4 hours, RPO < 1 minute (Cloud SQL async replication lag).
Topology:
      europe-central2 (PRIMARY)                  europe-west3 (DR)
    ┌────────────────────────────┐         ┌────────────────────────────┐
    │ GKE dataflow-autopilot-    │         │ GKE dataflow-autopilot-dr  │
    │ production                 │         │ (1 replica, passive,       │
    │ Cloud SQL primary          │  async  │ read-only)                 │
    │                            │ ──────▶ │ Cloud SQL read replica     │
    │                            │  repl.  │ dataflow-pg-dr-replica     │
    └─────────────┬──────────────┘         └─────────────┬──────────────┘
                  │                                      │
                  └────── DNS health-check failover ─────┘
                         api.dataflow.polkomtel.pl
Failover procedure (7 steps)
- Verify primary is down. Check GKE cluster status, Cloud SQL status, the health endpoint, and the GCP status page. Failover requires on-call lead approval before proceeding.
- Promote the Cloud SQL replica. Run `gcloud sql instances promote-replica`. This is irreversible. Poll the instance until its state is `RUNNABLE`.
- Switch DR deployments to read-write. Patch the ConfigMap to set `DB_READ_ONLY=false`.
- Scale DR deployments up. Scale from 1 → 3 replicas and wait for the rollout to complete.
- Fail over DNS. Verify or force the DNS failover to the DR IP `10.2.0.100`.
- Smoke-test DR. Confirm pods are running, all 5 service health endpoints respond, and database connectivity works.
- Notify stakeholders. Send notifications via Slack, email, and PagerDuty.
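The "poll until `RUNNABLE`" part of the promotion step can be scripted. The following is a minimal sketch, assuming the instance state is read with `gcloud sql instances describe`; the function names, the timeout, and the polling interval are illustrative choices and do not come from the runbook.

```python
import subprocess
import time
from typing import Callable


def gcloud_state(instance: str) -> str:
    """Read the Cloud SQL instance state via the gcloud CLI (must be installed and authenticated)."""
    result = subprocess.run(
        ["gcloud", "sql", "instances", "describe", instance, "--format=value(state)"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def wait_until_runnable(
    get_state: Callable[[], str],
    timeout_s: float = 1800.0,   # assumed budget, not a documented value
    interval_s: float = 15.0,    # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_state() until it returns RUNNABLE; return False if the timeout expires first."""
    deadline = clock() + timeout_s
    while True:
        if get_state() == "RUNNABLE":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example (not executed here):
#   ok = wait_until_runnable(lambda: gcloud_state("dataflow-pg-dr-replica"))
```

Passing the state reader, sleep, and clock as parameters keeps the loop testable without touching GCP. A `False` result should be treated as a reason to escalate, not to retry the promotion.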
Failback procedure (F1–F7)
- Verify the primary region has recovered.
- Build a fresh replica in `europe-central2` from the current (promoted) primary.
- Sync the new replica.
- Promote it back to primary.
- Scale the primary deployments up and scale the DR deployments down.
- Fail DNS back to the primary IP `10.0.0.100`.
- Re-establish the DR replica and update Terraform state.
Testing schedule
| Cadence | Activity |
|---|---|
| Continuous | Replication-lag monitoring |
| Weekly | Health-check verification |
| Monthly | DR infrastructure validation |
| Quarterly | Tabletop exercise + full failover drill |
Runbook 2 — Restart a service (VPS)
Use this when a single service is unhealthy but the host and infrastructure tier are fine.
- Confirm the symptom. Run `docker compose ps` and identify which service is not `healthy`.
- Capture evidence first. Run `docker compose logs --tail=300 <service>` and save the output before restarting, so the failure cause is not lost.
- Restart the single service. Run `docker compose restart <service>`.
- Watch it come back. Follow `docker compose logs -f <service>` until startup completes (Flyway runs on startup for Kotlin services, so watch for migration success).
- Verify health. Run `curl -sf http://127.0.0.1:<host-port>/actuator/health | jq` (for example `8085` for the gateway, `8084` for monitor-service).
- Check dependents. If you restarted `postgres`, `keycloak`, or `kafka`, restart the application services that depend on them, respecting start order: postgres → keycloak/kafka → app services → frontend.
- Confirm end to end. Load `https://etl.exai.cloud` and exercise an affected page.
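The health verification step can also be done from Python when `curl` and `jq` are not at hand. This is a minimal sketch, assuming the services expose the standard Spring Boot Actuator payload with a top-level `status` field; the function names and the `UNKNOWN`/`UNREACHABLE` labels are illustrative.

```python
import json
import urllib.error
import urllib.request


def health_status(body: str) -> str:
    """Extract the top-level status from an actuator health payload."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "UNKNOWN"
    if not isinstance(payload, dict):
        return "UNKNOWN"
    return str(payload.get("status", "UNKNOWN"))


def check_health(port: int, timeout_s: float = 5.0) -> str:
    """Query /actuator/health on the loopback interface and return its status."""
    url = f"http://127.0.0.1:{port}/actuator/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return health_status(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as exc:
        # A non-2xx answer (for example 503 when the service is DOWN) may still carry a JSON body.
        return health_status(exc.read().decode("utf-8", errors="replace"))
    except (urllib.error.URLError, OSError):
        return "UNREACHABLE"


# Example (not executed here): ports taken from the runbook above.
#   for name, port in {"gateway": 8085, "monitor-service": 8084}.items():
#       print(name, check_health(port))
```

Anything other than `UP` means the restart has not finished or has failed; go back to the service logs before moving on to dependents.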
Respect the start order
The Compose file enforces start order via healthchecks and depends_on. If you restart infrastructure (postgres, kafka, keycloak), a plain restart of just that container can leave app services connected to a stale endpoint. When in doubt, restart the dependent app services afterward.
Runbook 3 — Investigate a failed deploy (VPS)
Use this after python deploy/deploy_to_vps.py reports a failure or finishes ambiguously.
- Read the deploy log. Check `deploy/.deploy.log` for where the run stopped. Note: a Windows `UnicodeEncodeError` (cp1252) can crash the local script while streaming build output. This is cosmetic and does not necessarily mean the server-side deploy failed.
- SSH to the VPS and check container state. Run `docker compose ps`: are all 8 services and the infra tier present and `healthy`?
- Identify the failed step. The script has 7 steps (archive → connect → upload → nginx config → build → start in waves → verify). Build failures (step 5) most often stem from VPS memory pressure during the Kotlin builds.
- Inspect build output. If an image failed to build, re-run that single build on the server with `docker compose build <service>` (each Kotlin build can take several minutes; the script allows 900s).
- Check Flyway migrations. Run `docker compose logs <kotlin-service> | grep -i flyway`. Migrations run on startup and a failed migration blocks the service. See Runbook 6 if migrations are stuck.
- Validate nginx. Run `nginx -t` on the host; if step 4 left a broken config, restore the timestamped backup.
- Re-run or finish manually. Either re-run `deploy_to_vps.py` (it does a full rebuild) or, if only a wave failed, start the remaining services with `docker compose up -d`.
- Re-seed if needed. If Keycloak was recreated, re-run `python deploy/seed_keycloak_vps.py`.
- Verify. `curl` the frontend on `127.0.0.1:3006` and confirm `https://etl.exai.cloud` loads.
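Checking container state by eye is error-prone when many services are listed. The sketch below summarises the output of `docker compose ps --format json`, assuming each entry carries `Service`, `State`, and `Health` fields. Depending on the Compose version the output is either one JSON array or one JSON object per line, so both shapes are accepted. The function names are illustrative.

```python
import json


def parse_compose_ps(output: str) -> list[dict]:
    """Parse `docker compose ps --format json` output (JSON array or one object per line)."""
    text = output.strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def unhealthy_services(output: str) -> list[str]:
    """Return the services that are not running or whose healthcheck is not 'healthy'."""
    bad = []
    for container in parse_compose_ps(output):
        state = container.get("State", "")
        health = container.get("Health", "")
        # Containers without a healthcheck report an empty Health field; do not flag those.
        if state != "running" or health not in ("", "healthy"):
            bad.append(container.get("Service") or container.get("Name", "?"))
    return sorted(bad)


# Example (not executed here), run on the VPS in the Compose project directory:
#   import subprocess
#   out = subprocess.run(["docker", "compose", "ps", "-a", "--format", "json"],
#                        capture_output=True, text=True, check=True).stdout
#   print(unhealthy_services(out))
```

A service that is missing from the output entirely will not show up in this list, so compare the result against the expected set of services as well.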
Runbook 4 — Recover a stuck pipeline
Use this when a pipeline run is hung, not progressing, or stuck in a non-terminal state.
- Find the run. Identify the `runId` and current status, via the Monitor UI or directly:

      docker compose exec postgres psql -U postgres -d dataflow_metadata \
        -c "SELECT id, status, started_at FROM monitor_pipeline_runs ORDER BY started_at DESC LIMIT 20;"

- Stream the run log. The pipeline-engine streams logs over WebSocket at `/api/v1/runs/{runId}/stream`; the UI Log viewer renders this. Confirm whether the run is genuinely hung or just slow.
- Check the engine. Run `docker compose logs --tail=300 pipeline-engine` and look for task-level errors, DAG cycle/dangling detection failures, or thread-pool saturation.
- Check self-healing. On task failure the engine's `SelfHealingService` classifies the failure and applies recovery strategies. Confirm whether recovery already ran and what it concluded.
- Check downstream compute. If the pipeline uses Flink or Spark/Dataproc, verify the external job actually started. A stuck run can be a stuck external job, not an engine problem.
- Cancel cooperatively. The engine's `ExecutionContext` supports cooperative cancellation, so cancel the run through the API/UI rather than killing the container.
- If the engine itself is wedged, follow Runbook 2 to restart `pipeline-engine`. In-flight runs will be lost; check `monitor_pipeline_runs` afterward to confirm the run is in a terminal state.
- Re-run the pipeline once the cause is understood and addressed.
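Deciding which runs count as stuck can be made explicit. The sketch below filters rows shaped like the `id, status, started_at` query in the first step. The set of terminal statuses and the two-hour threshold are assumptions for illustration; check the values actually stored in `monitor_pipeline_runs` before using them.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable

# Hypothetical terminal statuses; the real values in monitor_pipeline_runs may differ.
TERMINAL_STATUSES = frozenset({"SUCCEEDED", "FAILED", "CANCELLED"})


def stuck_runs(
    rows: Iterable[tuple[str, str, datetime]],
    now: datetime,
    max_age: timedelta = timedelta(hours=2),  # assumed threshold
) -> list[str]:
    """Return ids of runs that are non-terminal and started longer ago than max_age."""
    stuck = []
    for run_id, status, started_at in rows:
        if status.upper() in TERMINAL_STATUSES:
            continue
        if now - started_at > max_age:
            stuck.append(run_id)
    return stuck


# Example (not executed here):
#   stuck_runs(rows_from_psql, now=datetime.now(timezone.utc))
```

A run flagged here is only a candidate. The log stream and the downstream compute check in the runbook still decide whether it is hung or merely slow.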
Runbook 5 — Rotate credentials
Use this to rotate the production secrets that currently live in deploy/.env.production and the deploy scripts.
- Inventory the secrets. The VPS production secrets are: Postgres password, Keycloak admin password, Grafana admin password, MinIO access/secret keys, and the demo-user passwords. The VPS SSH password is hardcoded in `deploy_to_vps.py` and `apply-nginx-fix.py`.
- Generate new values. Use strong, randomly generated secrets.
- Update `deploy/.env.production`. Replace the relevant values. This file is the source of truth for the Compose environment.
- Rotate the datastore credentials in place. For Postgres, change the role password to match the new value before restarting dependent services, so they reconnect successfully.
- Rotate the VPS SSH password on the host, and update it in the deploy scripts (or, preferably, move it out of source into a secrets manager).
- Redeploy. Run `python deploy/deploy_to_vps.py` so the new `.env.production` is uploaded and services restart with the new values.
- Re-seed Keycloak. Run `python deploy/seed_keycloak_vps.py`. It is idempotent and resets demo-user passwords and the `dataflow-app` client config.
- Verify. Confirm login works at `https://etl.exai.cloud`, Grafana login works, and `docker compose ps` shows everything `healthy`.
Rotate everything if a script leaked
Because the SSH password and datastore passwords are committed in source, treat any exposure of the repository as a credential compromise. Rotate the full set — VPS SSH, Postgres, Keycloak admin, Grafana, MinIO — not just the one you suspect.
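Generating the new values and rewriting the env file are the steps most likely to go wrong by hand. The sketch below shows one way to do both, using Python's `secrets` module. The variable names in the example (`POSTGRES_PASSWORD` and so on) are hypothetical, since the real key names in `deploy/.env.production` are not listed here.

```python
import secrets
from typing import Callable


def new_secret(nbytes: int = 32) -> str:
    """Return a URL-safe random secret (letters, digits, '-' and '_' only)."""
    return secrets.token_urlsafe(nbytes)


def rotate_env(text: str, keys: set[str], make: Callable[[], str] = new_secret) -> str:
    """Replace the values of the given KEY=value lines; leave every other line unchanged."""
    out = []
    seen = set()
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in keys:
                out.append(f"{key}={make()}")
                seen.add(key)
                continue
        out.append(line)
    missing = keys - seen
    if missing:
        # Fail loudly instead of silently skipping a secret that was meant to be rotated.
        raise KeyError(f"keys not found in env file: {sorted(missing)}")
    return "\n".join(out) + "\n"


# Example (not executed here); key names are hypothetical:
#   path = "deploy/.env.production"
#   updated = rotate_env(open(path).read(), {"POSTGRES_PASSWORD", "GRAFANA_ADMIN_PASSWORD"})
#   open(path, "w").write(updated)
```

Rewriting the file only changes what the services will be given on the next deploy. The Postgres role password and the host SSH password still have to be changed in place, as the runbook describes.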
Runbook 6 — Repair a database migration
Use this when a Kotlin service fails to start because Flyway cannot apply or validate its migrations. This is most likely on the VPS, which started life with pre-Flyway tables.
- Identify the failing service. Run `docker compose logs <service> | grep -i flyway`. The error names the failing version and history table (`flyway_schema_history`, `flyway_schema_history_engine`, or `flyway_schema_history_monitor`).
- Confirm the hardening flags are set. The Compose environment should already set, for the affected service:

      SPRING_FLYWAY_VALIDATE_ON_MIGRATE=false
      SPRING_FLYWAY_BASELINE_ON_MIGRATE=true
      SPRING_FLYWAY_BASELINE_VERSION=20.1
      SPRING_FLYWAY_OUT_OF_ORDER=true
      SPRING_FLYWAY_REPAIR_ON_MIGRATE=true
      SPRING_FLYWAY_PLACEHOLDER_REPLACEMENT=false

  `REPAIR_ON_MIGRATE=true` lets Flyway self-correct a divergent history on startup.
- Make migrations idempotent. If a migration fails because an object already exists, run the rewriter:

      python scripts/make-migrations-idempotent.py <migration-files>

  It converts `CREATE TABLE` to `CREATE TABLE IF NOT EXISTS` (same for INDEX/SEQUENCE), `ADD COLUMN` to `ADD COLUMN IF NOT EXISTS`, and prefixes `CREATE TRIGGER` with `DROP TRIGGER IF EXISTS`.
- Inspect the history table if needed:

      docker compose exec postgres psql -U postgres -d dataflow_metadata \
        -c "SELECT version, description, success FROM flyway_schema_history ORDER BY installed_rank DESC LIMIT 15;"

  A row with `success = false` is a failed migration that must be resolved or repaired.
- Restart the service. Run `docker compose restart <service>` and watch the logs. With `REPAIR_ON_MIGRATE`, Flyway should converge on startup.
- Mind the lineage ordering dependency. `lineage-service` has Flyway disabled and reuses metadata-service's lineage tables (created by metadata V4/V48). If lineage fails, the root cause is usually that `metadata-service` has not applied those migrations yet, so fix metadata first.
- Verify. The service reaches `healthy` and `/actuator/health` returns `UP`.
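To make the idempotency rewrite concrete, here is a small regex-based sketch of the transformations described above. It is an illustration only and is not the contents of `scripts/make-migrations-idempotent.py`. It assumes PostgreSQL syntax, does not parse SQL, and will also touch matching text inside comments or string literals.

```python
import re

# CREATE TABLE / SEQUENCE / [UNIQUE] INDEX [CONCURRENTLY] not already followed by IF NOT EXISTS.
_CREATE = re.compile(
    r"\bCREATE\s+(?:TABLE|SEQUENCE|(?:UNIQUE\s+)?INDEX(?:\s+CONCURRENTLY)?)"
    r"\s+(?!\s*IF\s+NOT\s+EXISTS\b)",
    re.IGNORECASE,
)
_ADD_COLUMN = re.compile(r"\bADD\s+COLUMN\s+(?!\s*IF\s+NOT\s+EXISTS\b)", re.IGNORECASE)
# CREATE TRIGGER <name> ... ON <table>; captures name and table for the DROP statement.
_TRIGGER = re.compile(r'\bCREATE\s+TRIGGER\s+(\w+)\b[^;]*?\bON\s+([\w."]+)', re.IGNORECASE)


def _with_drop(match: re.Match) -> str:
    """Prefix a CREATE TRIGGER with DROP TRIGGER IF EXISTS unless one is already there."""
    drop = f"DROP TRIGGER IF EXISTS {match.group(1)} ON {match.group(2)};"
    preceding = match.string[: match.start()].rstrip()
    if preceding.upper().endswith(drop.upper()):
        return match.group(0)
    return f"{drop}\n{match.group(0)}"


def make_idempotent(sql: str) -> str:
    """Apply the three rewrites; running it twice gives the same result as running it once."""
    sql = _CREATE.sub(lambda m: m.group(0) + "IF NOT EXISTS ", sql)
    sql = _ADD_COLUMN.sub(lambda m: m.group(0) + "IF NOT EXISTS ", sql)
    return _TRIGGER.sub(_with_drop, sql)
```

Unnamed indexes (`CREATE INDEX ON ...`) are a known gap in this sketch, because PostgreSQL requires an index name with `IF NOT EXISTS`. Review the diff of any rewritten migration before committing it.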
Runbook 7 — Respond to an incident
A general procedure for an unplanned production issue on the VPS.
- Confirm and scope. Reproduce the symptom. Determine blast radius: one page, one service, or the whole platform? Run `docker compose ps`.
- Declare and communicate. For anything user-facing, declare an incident and notify stakeholders. (On the GKE path this is PagerDuty `#dataflow-incidents`; on the VPS, use the team's agreed channel.)
- Triage with observability. Use the observability checklist: gateway readiness, Grafana latency/5xx panels, `monitor_pipeline_runs` failure spikes, `monitor_alerts`. Capture the `X-Request-Id` of a failing request to correlate logs across services.
- Stabilize before fixing. Prefer the fastest safe action that restores service, usually restarting the affected service (Runbook 2), over a deep fix during the incident.
- Apply the right runbook. Failed deploy → Runbook 3. Stuck pipeline → Runbook 4. Migration failure → Runbook 6. nginx hang on JWT fetches → apply `python deploy/apply-nginx-fix.py` (it backs up, validates with `nginx -t`, and reloads only on success).
- Verify recovery. Load `https://etl.exai.cloud`, exercise the affected workflow, and confirm health endpoints and alert streams are clean.
- Communicate resolution. Notify stakeholders the incident is resolved.
- Write a post-incident review. Record the timeline, root cause, and follow-up actions, especially anything that should become a new or updated runbook.
The nginx upgrade-header incident
A recurring, well-understood failure mode: if /api/ nginx config hardcodes Connection "upgrade", JWT-authenticated browser fetches never receive the HTTP/2 END_STREAM frame and pages (marketplace, templates, my-pipelines, data-browser) spin forever. The fix is the conditional map $http_upgrade $etl_connection_upgrade { default upgrade; '' close; } applied by apply-nginx-fix.py. If you see hanging spinners after an nginx change, suspect this first.
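Because this failure mode tends to return after nginx edits, a quick check of the config text can catch it before a reload. The sketch below flags lines that hardcode the header, assuming it is set with nginx's `proxy_set_header Connection "upgrade";` directive. The function name is illustrative, and this is not part of `apply-nginx-fix.py`.

```python
import re

# Matches an active (uncommented) directive that hardcodes the Connection header to "upgrade".
_HARDCODED_UPGRADE = re.compile(
    r"""\s*proxy_set_header\s+Connection\s+["']?upgrade["']?\s*;""",
    re.IGNORECASE,
)


def hardcoded_upgrade_lines(conf: str) -> list[int]:
    """Return 1-based line numbers where Connection is hardcoded to 'upgrade'."""
    return [
        number
        for number, line in enumerate(conf.splitlines(), start=1)
        if _HARDCODED_UPGRADE.match(line)
    ]


# Example (not executed here); the config path is hypothetical:
#   with open("/etc/nginx/sites-enabled/etl.conf") as fh:
#       print(hardcoded_upgrade_lines(fh.read()))
```

An empty result only shows that the literal form is absent. `nginx -t` is still needed to confirm that the `map` block and the `$etl_connection_upgrade` variable are defined correctly.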
Runbook quick reference
| Situation | Runbook | Key command |
|---|---|---|
| GKE region outage | 1 — DR failover | gcloud sql instances promote-replica |
| One service unhealthy | 2 — Restart service | docker compose restart <service> |
| Deploy failed/ambiguous | 3 — Investigate deploy | check deploy/.deploy.log, docker compose ps |
| Pipeline hung | 4 — Recover pipeline | cooperative cancel, then re-run |
| Secret rotation | 5 — Rotate credentials | edit .env.production, redeploy, re-seed |
| Service won't start on Flyway | 6 — Repair migration | make-migrations-idempotent.py |
| Unplanned production issue | 7 — Incident response | observability checklist + targeted runbook |
| nginx hang on JWT fetches | 7 (callout) | python deploy/apply-nginx-fix.py |