Operational playbooks & runbooks

This page collects the operational procedures for running the DataFlow AI Platform: the disaster-recovery failover runbook for the documented GKE topology, and a set of practical runbooks for the single-VPS production system that actually serves users today. Each procedure is a numbered, do-this-then-that sequence.


Before you start

Two facts shape every procedure here:

  • Production is the single Debian VPS etl.exai.cloud (57.129.120.73), running Docker Compose. There is no canary, no automated rollback, and no automated tests on this path.
  • The DR/failover runbook describes the GKE multi-region model, which is not what runs production. It is documented below for completeness and because it is the target topology — but a real VPS incident is handled with the Compose runbooks, not GKE failover.

Credentials live in source — handle with care

Several deploy scripts (deploy_to_vps.py, apply-nginx-fix.py, seed_keycloak_vps.py) and deploy/.env.production contain hardcoded or plaintext production secrets. This is a known security finding. When following these runbooks, source credentials from a secrets manager wherever possible, and treat rotation (Runbook 5) as a priority.


Runbook 1 — Disaster recovery / failover (GKE)

This is the documented multi-region DR procedure from docs/runbooks/dr-failover-runbook.md. It applies to the GKE topology, not the VPS.

Targets: RTO < 4 hours, RPO < 1 minute (Cloud SQL async replication lag).

Topology:

   europe-central2 (PRIMARY)            europe-west3 (DR)
   ┌──────────────────────────┐         ┌──────────────────────────┐
   │ GKE dataflow-autopilot-   │         │ GKE dataflow-autopilot-dr │
   │     production            │         │  (1 replica, passive,     │
   │ Cloud SQL primary         │  async  │   read-only)              │
   │                           │ ──────▶ │ Cloud SQL read replica    │
   │                           │  repl.  │  dataflow-pg-dr-replica   │
   └────────────┬──────────────┘         └────────────┬─────────────┘
                │                                     │
                └────── DNS health-check failover ────┘
                     api.dataflow.polkomtel.pl

Failover procedure (7 steps)

  1. Verify primary is down. Check GKE cluster status, Cloud SQL status, the health endpoint, and the GCP status page. Failover requires on-call lead approval before proceeding.
  2. Promote the Cloud SQL replica. Run gcloud sql instances promote-replica — this is irreversible. Poll the instance until its state is RUNNABLE.
  3. Switch DR deployments to read-write. Patch the ConfigMap to set DB_READ_ONLY=false.
  4. Scale DR deployments up. Scale from 1 → 3 replicas and wait for the rollout to complete.
  5. Fail over DNS. Verify or force the DNS failover to the DR IP 10.2.0.100.
  6. Smoke-test DR. Confirm pods are running, all 5 service health endpoints respond, and database connectivity works.
  7. Notify stakeholders. Send notifications via Slack, email, and PagerDuty.
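
Step 2 asks you to poll the promoted instance until it reports RUNNABLE. A minimal sketch of that wait loop follows. It assumes the gcloud CLI is installed and authenticated, and it reads the state with gcloud sql instances describe; the function names, the attempt count, and the delay are illustrative, not taken from the source runbook.

```python
import subprocess
import time
from typing import Callable


def gcloud_instance_state(instance: str) -> str:
    """Return the Cloud SQL instance state as reported by the gcloud CLI."""
    result = subprocess.run(
        ["gcloud", "sql", "instances", "describe", instance,
         "--format=value(state)"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def wait_until_runnable(get_state: Callable[[], str],
                        attempts: int = 60,
                        delay_s: float = 10.0,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll get_state() until it returns RUNNABLE or attempts run out."""
    for _ in range(attempts):
        if get_state() == "RUNNABLE":
            return True
        sleep(delay_s)
    return False
```

A call such as wait_until_runnable(lambda: gcloud_instance_state("dataflow-pg-dr-replica")) would block for up to ten minutes with these defaults. Passing the state reader in as a callable keeps the loop testable without touching GCP.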

Failback procedure (F1–F7)

  F1. Verify the primary region has recovered.
  F2. Build a fresh replica in europe-central2 from the current (promoted) primary.
  F3. Sync the new replica.
  F4. Promote it back to primary.
  F5. Scale the primary deployments up and scale the DR deployments down.
  F6. Fail DNS back to the primary IP 10.0.0.100.
  F7. Re-establish the DR replica and update Terraform state.

Testing schedule

   Cadence      Activity
   Continuous   Replication-lag monitoring
   Weekly       Health-check verification
   Monthly      DR infrastructure validation
   Quarterly    Tabletop exercise + full failover drill

Runbook 2 — Restart a service (VPS)

Use this when a single service is unhealthy but the host and infrastructure tier are fine.

  1. Confirm the symptom. docker compose ps — identify which service is not healthy.
  2. Capture evidence first. docker compose logs --tail=300 <service> and save the output before restarting, so the failure cause is not lost.
  3. Restart the single service. docker compose restart <service>.
  4. Watch it come back. docker compose logs -f <service> until startup completes (Flyway runs on startup for Kotlin services — watch for migration success).
  5. Verify health. curl -sf http://127.0.0.1:<host-port>/actuator/health | jq (for example 8085 for the gateway, 8084 for monitor-service).
  6. Check dependents. If you restarted postgres, keycloak, or kafka, restart the application services that depend on them, respecting start order: postgres → keycloak/kafka → app services → frontend.
  7. Confirm end to end. Load https://etl.exai.cloud and exercise an affected page.
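
The health check in step 5 can also be scripted when several services need checking in a row. The sketch below assumes the standard Spring Boot actuator response shape (a JSON object with a top-level "status" field); the function names and the "DOWN_OR_UNREACHABLE" label are made up for illustration.

```python
import json
import urllib.request


def health_status(body: str) -> str:
    """Extract the top-level status from an actuator health JSON body."""
    try:
        return str(json.loads(body).get("status", "UNKNOWN"))
    except (json.JSONDecodeError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return "UNKNOWN"


def check_service(port: int, timeout_s: float = 5.0) -> str:
    """Query the local actuator endpoint on the given host port."""
    url = f"http://127.0.0.1:{port}/actuator/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return health_status(resp.read().decode("utf-8"))
    except OSError:
        # Connection refused, timeout, or a non-2xx response.
        return "DOWN_OR_UNREACHABLE"
```

For example, check_service(8085) would report on the gateway and check_service(8084) on monitor-service, using the host ports named in step 5.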

Respect the start order

The Compose file enforces start order via healthchecks and depends_on. If you restart infrastructure (postgres, kafka, keycloak), a plain restart of just that container can leave app services connected to a stale endpoint. When in doubt, restart the dependent app services afterward.
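
To decide which app services to restart after an infrastructure restart, it can help to derive the list from the dependency graph rather than from memory. The sketch below uses a hypothetical dependency map; the real depends_on entries live in the Compose file and may differ (in particular, keycloak depending on postgres is an assumption here).

```python
from graphlib import TopologicalSorter

# Hypothetical map: service -> services it depends on.
DEPENDS_ON = {
    "postgres": set(),
    "kafka": set(),
    "keycloak": {"postgres"},
    "gateway": {"postgres", "keycloak", "kafka"},
    "frontend": {"gateway"},
}


def restart_plan(restarted: str, depends_on: dict[str, set[str]]) -> list[str]:
    """Services to restart, in start order, after `restarted` was restarted."""
    affected: set[str] = set()
    frontier = [restarted]
    while frontier:
        current = frontier.pop()
        for service, deps in depends_on.items():
            if current in deps and service not in affected:
                affected.add(service)
                frontier.append(service)
    # static_order() yields dependencies before their dependents.
    order = TopologicalSorter(depends_on).static_order()
    return [service for service in order if service in affected]
```

With this map, restart_plan("kafka", DEPENDS_ON) lists the gateway and then the frontend, which matches the documented order of infrastructure, then app services, then frontend.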


Runbook 3 — Investigate a failed deploy (VPS)

Use this after python deploy/deploy_to_vps.py reports a failure or finishes ambiguously.

  1. Read the deploy log. Check deploy/.deploy.log for where the run stopped. Note: a Windows UnicodeEncodeError (cp1252) can crash the local script while it streams build output. That crash comes from the local console encoding and does not by itself mean the server-side deploy failed.
  2. SSH to the VPS and check container state. docker compose ps — are all 8 services and the infra tier present and healthy?
  3. Identify the failed step. The script has 7 steps (archive → connect → upload → nginx config → build → start in waves → verify). Build failures (step 5) most often stem from VPS memory pressure during the Kotlin builds.
  4. Inspect build output. If an image failed to build, re-run that single build on the server: docker compose build <service> (each Kotlin build can take several minutes; the script allows 900s).
  5. Check Flyway migrations. docker compose logs <kotlin-service> | grep -i flyway — migrations run on startup and a failed migration blocks the service. See Runbook 6 if migrations are stuck.
  6. Validate nginx. Run nginx -t on the host; if the deploy script's step 4 (nginx config) left a broken config, restore the timestamped backup.
  7. Re-run or finish manually. Either re-run deploy_to_vps.py (it does a full rebuild) or, if only a wave failed, start the remaining services: docker compose up -d.
  8. Re-seed if needed. If Keycloak was recreated, re-run python deploy/seed_keycloak_vps.py.
  9. Verify. curl the frontend on 127.0.0.1:3006 and confirm https://etl.exai.cloud loads.
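
The UnicodeEncodeError mentioned in step 1 typically appears when remote build output contains characters the Windows console encoding cannot represent. One common way to avoid that class of crash is to degrade unencodable characters before printing. This is a sketch under the assumption that the script prints streamed output with print(); it is not taken from deploy_to_vps.py, and the helper names are hypothetical.

```python
import sys


def printable(text: str, encoding: str) -> str:
    """Replace characters the target encoding cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


def safe_print(text: str) -> None:
    """Print text without raising UnicodeEncodeError on narrow consoles."""
    encoding = sys.stdout.encoding or "utf-8"
    print(printable(text, encoding))
```

An alternative with the same effect is to call sys.stdout.reconfigure(errors="replace") once at script start, which is available on Python 3.7 and later.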

Runbook 4 — Recover a stuck pipeline

Use this when a pipeline run is hung, not progressing, or stuck in a non-terminal state.

  1. Find the run. Identify the runId and current status — via the Monitor UI or directly:
    docker compose exec postgres psql -U postgres -d dataflow_metadata \
      -c "SELECT id, status, started_at FROM monitor_pipeline_runs ORDER BY started_at DESC LIMIT 20;"
    
  2. Stream the run log. The pipeline-engine streams logs over WebSocket at /api/v1/runs/{runId}/stream; the UI Log viewer renders this. Confirm whether the run is genuinely hung or just slow.
  3. Check the engine. docker compose logs --tail=300 pipeline-engine — look for task-level errors, DAG cycle/dangling detection failures, or thread-pool saturation.
  4. Check self-healing. On task failure the engine's SelfHealingService classifies the failure and applies recovery strategies. Confirm whether recovery already ran and what it concluded.
  5. Check downstream compute. If the pipeline uses Flink or Spark/Dataproc, verify the external job actually started — a stuck run can be a stuck external job, not an engine problem.
  6. Cancel cooperatively. The engine's ExecutionContext supports cooperative cancellation — cancel the run through the API/UI rather than killing the container.
  7. If the engine itself is wedged, follow Runbook 2 to restart pipeline-engine. In-flight runs will be lost; check monitor_pipeline_runs afterward to confirm the run is in a terminal state.
  8. Re-run the pipeline once the cause is understood and addressed.
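
When many runs are listed in step 1, a small filter can separate runs that are merely recent from runs that look stuck. The sketch below works on rows shaped like the query result in step 1 (id, status, started_at). The terminal status names and the two-hour threshold are assumptions; check the values actually stored in monitor_pipeline_runs before relying on it.

```python
from datetime import datetime, timedelta, timezone

# Assumed terminal status names; the real values may differ.
TERMINAL = {"SUCCEEDED", "FAILED", "CANCELLED"}


def stuck_runs(rows, now, max_age=timedelta(hours=2)):
    """Return ids of non-terminal runs older than max_age.

    rows is an iterable of (id, status, started_at) tuples.
    """
    return [
        run_id
        for run_id, status, started_at in rows
        if status not in TERMINAL and now - started_at > max_age
    ]
```

A run flagged here is only a candidate: steps 2 to 5 still decide whether it is hung or just slow.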

Runbook 5 — Rotate credentials

Use this to rotate the production secrets that currently live in deploy/.env.production and the deploy scripts.

  1. Inventory the secrets. The VPS production secrets are: Postgres password, Keycloak admin password, Grafana admin password, MinIO access/secret keys, and the demo-user passwords. The VPS SSH password is hardcoded in deploy_to_vps.py and apply-nginx-fix.py.
  2. Generate new values. Use strong, randomly generated secrets.
  3. Update deploy/.env.production. Replace the relevant values. This file is the source of truth for the Compose environment.
  4. Rotate the datastore credentials in place. For Postgres, change the role password to match the new value before restarting dependent services, so they reconnect successfully.
  5. Rotate the VPS SSH password on the host, and update it in the deploy scripts (or, preferably, move it out of source into a secrets manager).
  6. Redeploy. Run python deploy/deploy_to_vps.py so the new .env.production is uploaded and services restart with the new values.
  7. Re-seed Keycloak. Run python deploy/seed_keycloak_vps.py — it is idempotent and resets demo-user passwords and the dataflow-app client config.
  8. Verify. Confirm login works at https://etl.exai.cloud, Grafana login works, and docker compose ps shows everything healthy.
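
Steps 2 and 3 can be combined into one small helper that generates new values and rewrites the matching lines of an env file. This is a sketch, not part of the deploy tooling: the function name is hypothetical, and key names such as POSTGRES_PASSWORD are assumptions about what deploy/.env.production contains.

```python
import secrets


def rotate_env(text: str, keys: set[str],
               nbytes: int = 32) -> tuple[str, dict[str, str]]:
    """Return env-file text with new random values for the given keys.

    Lines for other keys, comments, and blank lines are kept unchanged.
    The second element maps each rotated key to its new value.
    """
    new_values: dict[str, str] = {}
    out = []
    for line in text.splitlines():
        key, sep, _ = line.partition("=")
        name = key.strip()
        if sep and name in keys:
            # token_urlsafe uses only A-Z, a-z, 0-9, '-' and '_',
            # so the value needs no quoting in an env file.
            new_values[name] = secrets.token_urlsafe(nbytes)
            out.append(f"{name}={new_values[name]}")
        else:
            out.append(line)
    return "\n".join(out) + "\n", new_values
```

The returned mapping is what step 4 needs: the new Postgres value must be applied to the database role before dependent services restart.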

Rotate everything if a script leaked

Because the SSH password and datastore passwords are committed in source, treat any exposure of the repository as a credential compromise. Rotate the full set — VPS SSH, Postgres, Keycloak admin, Grafana, MinIO — not just the one you suspect.


Runbook 6 — Repair a database migration

Use this when a Kotlin service fails to start because Flyway cannot apply or validate its migrations. This is most likely on the VPS, which started life with pre-Flyway tables.

  1. Identify the failing service. docker compose logs <service> | grep -i flyway — the error names the failing version and history table (flyway_schema_history, flyway_schema_history_engine, or flyway_schema_history_monitor).
  2. Confirm the hardening flags are set. The Compose environment should already set, for the affected service:
    SPRING_FLYWAY_VALIDATE_ON_MIGRATE=false
    SPRING_FLYWAY_BASELINE_ON_MIGRATE=true
    SPRING_FLYWAY_BASELINE_VERSION=20.1
    SPRING_FLYWAY_OUT_OF_ORDER=true
    SPRING_FLYWAY_REPAIR_ON_MIGRATE=true
    SPRING_FLYWAY_PLACEHOLDER_REPLACEMENT=false
    
    REPAIR_ON_MIGRATE=true lets Flyway self-correct a divergent history on startup.
  3. Make migrations idempotent. If a migration fails because an object already exists, run the rewriter:
    python scripts/make-migrations-idempotent.py <migration-files>
    
    It converts CREATE TABLE → CREATE TABLE IF NOT EXISTS (same for INDEX/SEQUENCE), ADD COLUMN → ADD COLUMN IF NOT EXISTS, and prefixes CREATE TRIGGER with DROP TRIGGER IF EXISTS.
  4. Inspect the history table if needed:
    docker compose exec postgres psql -U postgres -d dataflow_metadata \
      -c "SELECT version, description, success FROM flyway_schema_history ORDER BY installed_rank DESC LIMIT 15;"
    
    A row with success = false is a failed migration that must be resolved or repaired.
  5. Restart the service. docker compose restart <service> and watch the logs — with REPAIR_ON_MIGRATE Flyway should converge on startup.
  6. Mind the lineage ordering dependency. lineage-service has Flyway disabled and reuses metadata-service's lineage tables (created by metadata V4/V48). If lineage fails, the root cause is usually that metadata-service has not applied those migrations yet — fix metadata first.
  7. Verify. The service reaches healthy and /actuator/health returns UP.
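
To make the rewrite in step 3 concrete, here is a minimal sketch of the kind of transformation it describes. It is not the actual scripts/make-migrations-idempotent.py: it covers only the CREATE and ADD COLUMN cases, skips the trigger handling, and does not understand SQL comments, string literals, or CREATE INDEX CONCURRENTLY.

```python
import re

# Match CREATE TABLE/INDEX/SEQUENCE not already followed by IF NOT EXISTS.
_CREATE = re.compile(
    r"\bCREATE\s+(?:TABLE|INDEX|UNIQUE\s+INDEX|SEQUENCE)\b"
    r"(?!\s+IF\s+NOT\s+EXISTS\b)",
    re.IGNORECASE,
)
# Match ADD COLUMN not already followed by IF NOT EXISTS.
_ADD_COLUMN = re.compile(
    r"\bADD\s+COLUMN\b(?!\s+IF\s+NOT\s+EXISTS\b)",
    re.IGNORECASE,
)


def make_idempotent(sql: str) -> str:
    """Insert IF NOT EXISTS after matching CREATE and ADD COLUMN keywords."""
    sql = _CREATE.sub(lambda m: m.group(0) + " IF NOT EXISTS", sql)
    return _ADD_COLUMN.sub(lambda m: m.group(0) + " IF NOT EXISTS", sql)
```

Running the function twice on the same text gives the same result as running it once, which is the property the migrations themselves need.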

Runbook 7 — Respond to an incident

A general procedure for an unplanned production issue on the VPS.

  1. Confirm and scope. Reproduce the symptom. Determine blast radius: one page, one service, or the whole platform? Run docker compose ps.
  2. Declare and communicate. For anything user-facing, declare an incident and notify stakeholders. (On the GKE path this is PagerDuty #dataflow-incidents; on the VPS, use the team's agreed channel.)
  3. Triage with observability. Use the observability checklist: gateway readiness, Grafana latency/5xx panels, monitor_pipeline_runs failure spikes, monitor_alerts. Capture the X-Request-Id of a failing request to correlate logs across services.
  4. Stabilize before fixing. Prefer the fastest safe action that restores service — usually restarting the affected service (Runbook 2) — over a deep fix during the incident.
  5. Apply the right runbook. Failed deploy → Runbook 3. Stuck pipeline → Runbook 4. Migration failure → Runbook 6. nginx hang on JWT fetches → apply python deploy/apply-nginx-fix.py (it backs up, validates with nginx -t, and reloads only on success).
  6. Verify recovery. Load https://etl.exai.cloud, exercise the affected workflow, and confirm health endpoints and alert streams are clean.
  7. Communicate resolution. Notify stakeholders the incident is resolved.
  8. Write a post-incident review. Record the timeline, root cause, and follow-up actions — especially anything that should become a new or updated runbook.
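
The X-Request-Id correlation in step 3 can be done by hand with grep, or with a short helper like the sketch below. It assumes each service writes the request ID into its log lines, which the source implies but does not spell out; the function names are hypothetical.

```python
import subprocess


def compose_logs(service: str, tail: int = 300) -> str:
    """Fetch recent log output for one Compose service."""
    result = subprocess.run(
        ["docker", "compose", "logs", "--no-color", f"--tail={tail}", service],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def correlate(logs_by_service: dict[str, str],
              request_id: str) -> list[tuple[str, str]]:
    """Return (service, line) pairs for log lines mentioning request_id."""
    hits = []
    for service, text in logs_by_service.items():
        for line in text.splitlines():
            if request_id in line:
                hits.append((service, line))
    return hits
```

A typical use would build the dictionary with compose_logs for the gateway and the suspected backend service, then pass in the ID captured from the failing request.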

The nginx upgrade-header incident

A recurring, well-understood failure mode: if the nginx config for /api/ hardcodes Connection "upgrade", JWT-authenticated browser fetches never receive the HTTP/2 END_STREAM frame, and pages (marketplace, templates, my-pipelines, data-browser) spin forever. The fix is the conditional map $http_upgrade $etl_connection_upgrade { default upgrade; '' close; } applied by apply-nginx-fix.py. If you see hanging spinners after an nginx change, suspect this first.
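
A quick way to check for this regression after an nginx change is to scan the config text for the hardcoded header. The sketch below is a line-based check with a hypothetical function name; it does not parse nginx syntax and does not replace nginx -t.

```python
import re

# Matches a Connection header hardcoded to "upgrade".
_HARDCODED = re.compile(
    r"proxy_set_header\s+Connection\s+[\"']?upgrade[\"']?\s*;",
    re.IGNORECASE,
)


def hardcoded_upgrade_lines(config_text: str) -> list[int]:
    """Return 1-based line numbers that hardcode Connection "upgrade"."""
    return [
        number
        for number, line in enumerate(config_text.splitlines(), start=1)
        # Ignore anything after a '#' comment marker on the line.
        if _HARDCODED.search(line.split("#", 1)[0])
    ]
```

An empty result means the header is either absent or set from a variable such as $etl_connection_upgrade.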


Runbook quick reference

Situation                      Runbook                  Key command
GKE region outage              1: DR failover           gcloud sql instances promote-replica
One service unhealthy          2: Restart service       docker compose restart <service>
Deploy failed/ambiguous        3: Investigate deploy    check deploy/.deploy.log, docker compose ps
Pipeline hung                  4: Recover pipeline      cooperative cancel, then re-run
Secret rotation                5: Rotate credentials    edit .env.production, redeploy, re-seed
Service won't start on Flyway  6: Repair migration      make-migrations-idempotent.py
Unplanned production issue     7: Incident response     observability checklist + targeted runbook
nginx hang on JWT fetches      7 (callout)              python deploy/apply-nginx-fix.py