
Deployment & rollout

DataFlow AI has two distinct deployment realities: a documented CI/CD path that builds images in GitHub Actions and rolls them out to GKE Autopilot with canary stages and auto-rollback, and the actual live production system — a single Debian VPS running Docker Compose. This page documents both, honestly, so operators know which one runs in front of users.


Two deployment models

The repository contains two separate, inconsistent ways to deploy the platform. Knowing which one is real matters before you touch production.

| Model | Target | Provisioned by | Rollout | Migrations | Runs production today? |
|---|---|---|---|---|---|
| (a) Documented CI/CD | GKE Autopilot clusters (dev / staging / prod) | GitHub Actions → GCP Artifact Registry → Kustomize → kubectl | Canary 10% → 50% → 100% with auto-rollback | Go /app/migrate binary via kubectl exec | No (aspirational) |
| (b) Actual production | Single Debian VPS etl.exai.cloud (57.129.120.73) | Hand-written Python scripts (deploy/deploy_to_vps.py) over SSH | None (full rebuild + ordered restart) | Spring Boot Flyway on service startup | Yes |

Know which path is live

The GKE workflows reference dataflow.polkomtel.internal, GCP regions, and a Go migration tool. None of that runs production. Production is the VPS at etl.exai.cloud, deployed with Docker Compose. Treat the GKE path as a target architecture, not the current state.

The two paths diverge in fundamental ways:

  • GKE uses Kustomize overlays per environment; the VPS uses a single docker-compose.yml.
  • GKE runs a dedicated Go /app/migrate tool; the VPS runs Flyway-on-startup with aggressive repair flags.
  • GKE has canary stages, smoke tests, integration tests, and automated rollback; the VPS path has none of those.
  • GKE assumes multi-region Cloud SQL; the VPS has a single PostgreSQL container.

Choosing a target topology

This page covers how the platform is built and shipped. It does not cover which infrastructure topology to deploy onto — On-Premises, full GCP cloud, or the recommended Hybrid — nor the sizing and cost of each.

For topology selection, infrastructure requirements, capacity sizing, the full GCP service cost breakdown, and a decision guide, see Deployment scenarios & sizing. In short: the source documents recommend the Hybrid topology for Polkomtel (source databases stay on-prem, the platform runs on GCP), with a 3-Year TCO of roughly $606K–$741K.

Intended topology vs. current reality

The deployment-scenarios page describes the intended GKE-based GCP architecture. The build and rollout mechanics documented on this page reflect what runs production today — a single Docker Compose VPS. Read both together to understand the gap between target and current state.


Service inventory

Eight deployable services plus the infrastructure tier make up a full deployment.

Application services

| Service | Stack | Build | Container port | VPS host port |
|---|---|---|---|---|
| api-gateway | Kotlin / Spring Cloud Gateway | platform/Dockerfile BUILD_MODULE=api-gateway | 8080 | 8085 |
| metadata-service | Kotlin / Spring Boot | BUILD_MODULE=metadata-service | 8080 | 8181 |
| pipeline-engine | Kotlin / Spring Boot | BUILD_MODULE=pipeline-engine | 8080 | 8082 |
| lineage-service | Kotlin / Spring Boot | BUILD_MODULE=lineage-service | 8080 | 8083 |
| monitor-service | Kotlin / Spring Boot | BUILD_MODULE=monitor-service | 8080 | 8084 |
| copilot | Python 3.12 / FastAPI | ai-services/copilot/Dockerfile | 8000 | 8090 |
| migration-engine | Python 3.12 / FastAPI | ai-services/migration-engine/Dockerfile | 8000 | 8091 |
| frontend | React / Vite + nginx | frontend/Dockerfile | 80 | 3006 |

The Go CLI (backend/cli) ships as release binaries, not as a deployed service. The Gradle modules pushdown-sql and connector-sdk are referenced by the build and Prometheus config but are not standalone Compose services.

Infrastructure services

These run alongside the application containers via Docker Compose:

  • PostgreSQL 15: image pgvector/pgvector:pg15
  • Keycloak 24: OIDC identity provider
  • Zookeeper + Kafka: confluentinc/cp-* 7.7.1 (single broker, RF=1, not HA)
  • Prometheus: v2.54.1
  • Grafana: 11.2.2
  • MinIO: S3-compatible object storage
  • Redis 7-alpine: not deployed on the VPS; the gateway falls back to in-memory rate limiting

Build architecture

Kotlin services

All five Kotlin services build from a single multi-stage backend/platform/Dockerfile, selected by the BUILD_MODULE build-arg.

  • Stage 1: gradle:8.5-jdk21. Build files are copied first for layer caching, dependencies are pre-downloaded, then gradle :${BUILD_MODULE}:bootJar -x test --no-daemon runs the in-process Kotlin compiler with GRADLE_OPTS=-Xmx2g.
  • Stage 2: eclipse-temurin:21-jre-alpine. A non-root dataflow user, a container-aware JVM (MaxRAMPercentage=75), and a Spring Boot Actuator healthcheck on /actuator/health.

Dockerfile.local is a runtime-only variant — it expects JARs pre-built on the host (./gradlew bootJar -x test --parallel) and bypasses Docker DNS issues during builds.
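To make the BUILD_MODULE selection concrete, the sketch below assembles a docker build command for each Kotlin module. The image tag scheme (dataflow/<module>:local) and the build context path are assumptions made for illustration; they are not taken from the repository.

```python
# Sketch only: shows how one shared Dockerfile is parameterised per service.
KOTLIN_MODULES = (
    "api-gateway",
    "metadata-service",
    "pipeline-engine",
    "lineage-service",
    "monitor-service",
)


def docker_build_command(module: str) -> list[str]:
    """Return the argv for building one Kotlin service image."""
    if module not in KOTLIN_MODULES:
        raise ValueError(f"unknown Kotlin module: {module}")
    return [
        "docker", "build",
        "--file", "backend/platform/Dockerfile",
        "--build-arg", f"BUILD_MODULE={module}",
        "--tag", f"dataflow/{module}:local",  # hypothetical tag scheme
        "backend/platform",                    # assumed build context
    ]


if __name__ == "__main__":
    for module in KOTLIN_MODULES:
        print(" ".join(docker_build_command(module)))
```

Only the build-arg differs between the five commands, which is why a single Dockerfile is enough for all Kotlin services.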

Python services

copilot and migration-engine each have their own Dockerfile, built from their own context. Copilot loads LLM secrets from a gitignored .env.local (required: false), so its absence does not break builds.

Frontend

frontend/Dockerfile is multi-stage: node:22-alpine runs vite build, and the runtime is nginx:alpine serving /app/dist with a custom nginx.conf for SPA routing.


The CI pipeline

Eight GitHub Actions workflows live in .github/workflows/. The most important for deployment are described below.

ci.yml — Continuous Integration

Triggered on push to main/develop and PRs to main. It uses dorny/paths-filter to detect which monorepo parts changed (kotlin / python / go / frontend / terraform / docker / openapi) and conditionally runs language-matrixed jobs:

| Job | What it runs |
|---|---|
| kotlin-build | JDK 21 + Gradle: compile → Detekt lint → unit tests → integrationTest → build; uploads reports + JARs |
| python-test | Matrix [copilot, migration-engine], Python 3.12 + Poetry 1.8.4: Ruff lint/format, mypy (non-blocking), pytest with coverage. PR coverage gate thresholdAll 0.70 / thresholdNew 0.80 |
| go-build | Go 1.22: build / vet / test with -race, golangci-lint |
| frontend-build | Node 20: ESLint, tsc -b --noEmit, build, vitest coverage |
| terraform-validate | terraform fmt/validate per environment + module, TFLint |
| docker-compose-validate | docker compose config, hadolint |
| openapi-validate | Redocly lint of 5 OpenAPI specs + oasdiff breaking-change check |
| ci-status | Aggregate gate for branch protection |

zero-tolerance.yml — Compliance gate

On PR/push to main/develop, runs scripts/detect-fake-code.js against changed .ts/.tsx/.js/.jsx files, enforcing the "no fake code" rule, plus frontend typecheck/lint/test/build.

security-scan.yml — Security

Weekly cron (Mon 06:00 UTC) plus push/PR to main. Jobs:

  • gitleaks — secret detection
  • Trivy — per-service container scan, CRITICAL,HIGH, exit-code 1
  • Snyk — dependency scan
  • OWASP dependency-check — JVM dependencies
  • Python safety — Python dependency vulnerabilities
  • Checkov — Terraform misconfiguration scan
  • license-check — fails the build on GPL/AGPL licenses

codeql-analysis.yml runs static analysis as well.

Deploy and release workflows

  • deploy-dev.yml: on push to develop, builds and pushes all 8 images (dev-<sha>, dev-latest), runs kustomize edit set image on the dev overlay, runs kubectl apply --prune, waits for rollout, runs smoke tests, and notifies Slack.
  • deploy-staging.yml: on push to main, a v*.*.*-rc* tag, or a manual trigger, builds and pushes 8 images, creates a pre-deploy Cloud SQL backup, applies the staging overlay, runs DB migrations via kubectl exec deployment/api-gateway -- /app/migrate --direction=up --environment=staging, then runs smoke and integration tests.
  • deploy-production.yml: workflow_dispatch only; runs the canary rollout (see below).
  • release.yml: on v*.*.* tags, builds semver-tagged release images, builds Go CLI binaries for 5 OS/arch combos, generates a conventional-commit changelog, and creates a GitHub Release with SHA256SUMS.txt.
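The dev tagging scheme above (dev-<sha> plus a moving dev-latest) can be sketched as a small helper. The registry path in the usage example is a made-up placeholder, and whether the workflow uses the full or the short commit SHA is not stated in the source, so the SHA is passed through unchanged.

```python
# Sketch only: derives the two dev tags per image described for deploy-dev.yml.
SERVICES = (
    "api-gateway",
    "metadata-service",
    "pipeline-engine",
    "lineage-service",
    "monitor-service",
    "copilot",
    "migration-engine",
    "frontend",
)


def dev_image_tags(registry: str, sha: str) -> dict[str, list[str]]:
    """Map each service to its immutable (dev-<sha>) and moving (dev-latest) tag."""
    return {
        service: [
            f"{registry}/{service}:dev-{sha}",
            f"{registry}/{service}:dev-latest",
        ]
        for service in SERVICES
    }


if __name__ == "__main__":
    # Hypothetical registry path, for illustration only.
    tags = dev_image_tags("example-registry/dataflow", "abc1234")
    for service, refs in tags.items():
        print(service, refs)
```

The immutable tag is the one worth pinning in an overlay; the moving tag is a convenience for humans.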

Authentication to GCP uses Workload Identity Federation (google-github-actions/auth@v2, id-token: write) — there are no static GCP keys in the repo.


Canary rollout (GKE production path)

The documented production deploy (deploy-production.yml) is a manual workflow_dispatch with inputs: image_tag (must be a tested staging tag), canary_weight_step1 (default 10%), canary_weight_step2 (default 50%), and skip_canary.

  ┌──────────────────────────────────────────────────────────────────┐
  │ 1. pre-deploy-validation                                         │
  │    verify images in Artifact Registry · verify Trivy scan ·      │
  │    check current prod pod health                                 │
  └────────────────────────────┬─────────────────────────────────────┘
                               ▼
  ┌──────────────────────────────────────────────────────────────────┐
  │ 2. database-backup   gcloud sql backups create dataflow-prod-db  │
  └────────────────────────────┬─────────────────────────────────────┘
                               ▼
  ┌──────────────────────────────────────────────────────────────────┐
  │ 3. canary-stage-1 (10%)                                          │
  │    clone each deployment as ${svc}-canary · scale by weight ·    │
  │    5-minute health observation                                   │
  │    ABORT if >10 failures OR >2 pod restarts → rollback (7)       │
  └────────────────────────────┬─────────────────────────────────────┘
                               ▼
  ┌──────────────────────────────────────────────────────────────────┐
  │ 4. canary-stage-2 (50%)                                          │
  │    scale canary to 50% · 10-minute observation                   │
  └────────────────────────────┬─────────────────────────────────────┘
                               ▼
  ┌──────────────────────────────────────────────────────────────────┐
  │ 5. full-deploy (100%)                                            │
  │    kustomize edit set image ·                                    │
  │    DB migrations (dry-run then real) · kubectl apply --prune ·   │
  │    wait rollout · delete canary deployments                      │
  └────────────────────────────┬─────────────────────────────────────┘
                               ▼
  ┌──────────────────────────────────────────────────────────────────┐
  │ 6. post-deploy-verify                                            │
  │    all pods healthy · smoke tests vs ingress IP ·                │
  │    replica counts match                                          │
  └────────────────────────────┬─────────────────────────────────────┘
                               ▼  on failure()
  ┌──────────────────────────────────────────────────────────────────┐
  │ 7. rollback   kubectl rollout undo all deployments ·             │
  │    remove canaries · verify                                      │
  └────────────────────────────┬─────────────────────────────────────┘
                               ▼
  ┌──────────────────────────────────────────────────────────────────┐
  │ 8. notify   Slack on every outcome · PagerDuty critical if       │
  │    deploy failed AND rollback also failed                        │
  └──────────────────────────────────────────────────────────────────┘

Canary stage 1 clones each deployment as ${svc}-canary, scales it by the configured weight, and observes health for 5 minutes — aborting if more than 10 failures or more than 2 pod restarts occur. Stage 2 raises the canary to 50% and observes for 10 minutes. Only then does full-deploy cut over to 100% and delete the canary deployments.
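The abort rule and the weight-based scaling can be expressed in a few lines. This is a minimal sketch, not the workflow's actual logic: the thresholds come from the description above, while the rounding rule in canary_replicas (round up, never below one replica) is an assumption, since the source only says the canary is "scaled by weight".

```python
import math
from dataclasses import dataclass

MAX_FAILURES = 10      # abort when failures exceed this
MAX_POD_RESTARTS = 2   # abort when pod restarts exceed this


@dataclass(frozen=True)
class CanaryObservation:
    """Counters collected during one observation window."""
    failures: int
    pod_restarts: int


def should_abort(observation: CanaryObservation) -> bool:
    """True when either threshold is strictly exceeded."""
    return (
        observation.failures > MAX_FAILURES
        or observation.pod_restarts > MAX_POD_RESTARTS
    )


def canary_replicas(stable_replicas: int, weight_percent: int) -> int:
    """Assumed scaling rule: round up and keep at least one canary pod."""
    return max(1, math.ceil(stable_replicas * weight_percent / 100))


if __name__ == "__main__":
    print(should_abort(CanaryObservation(failures=11, pod_restarts=0)))
    print(canary_replicas(stable_replicas=10, weight_percent=10))
```

Note that both comparisons are strict: exactly 10 failures or exactly 2 restarts does not abort under the stated rule.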

Canary applies only to GKE

The 10% → 50% → 100% canary, the smoke/integration test gates, and the automated kubectl rollout undo exist only in the GKE workflows. The VPS production path has no canary and no automated rollback.


Production deployment to the VPS — step by step

This is the path that actually ships production. It is driven by deploy/deploy_to_vps.py, a 7-step Python/paramiko script.

Topology

        Internet  (HTTPS :443)
            │
            ▼
  ┌──────────────────────────────────────────────────────────┐
  │  Debian VPS  etl.exai.cloud  (57.129.120.73)             │
  │                                                          │
  │  System nginx  (TLS termination, Plesk cert)             │
  │     /auth/  → 127.0.0.1:4180  (Keycloak)                 │
  │     /api/   → 127.0.0.1:8085  (api-gateway)              │
  │     /       → 127.0.0.1:3006  (frontend)                 │
  │                      │                                   │
  │  ┌───────────────────▼──────────────────────────────┐    │
  │  │  docker compose · network dataflow-network       │    │
  │  │  8 app services + postgres · keycloak · kafka ·  │    │
  │  │  zookeeper · prometheus · grafana · minio        │    │
  │  └──────────────────────────────────────────────────┘    │
  │  (shared host: itsm.exai.cloud, openmeet.exai.cloud)     │
  └──────────────────────────────────────────────────────────┘

The 7 steps of deploy_to_vps.py

  1. Archive: git archive HEAD produces deploy/dataflow-source.tar.gz (~17.4 MB).
  2. Connect: SSH/SFTP into the VPS.
  3. Upload: push the tarball, rm -rf /home/debian/dataflow, untar it; upload .env.production to both backend/.env and frontend/.env.production; upload etl-dataflow.conf.
  4. nginx config: install the config to /etc/nginx/conf.d/etl-dataflow.conf and run nginx -t.
  5. Build images: build one image at a time (frontend first, then the 5 Kotlin services at a 900s timeout each to manage VPS memory, then the 2 Python services).
  6. Start in waves: postgres (sleep 15) → zookeeper kafka (sleep 20) → keycloak prometheus grafana minio (sleep 15) → docker compose up -d (all); then reload system nginx.
  7. Verify: docker compose ps and curl the frontend on 127.0.0.1:3006.
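The ordered start in step 6 boils down to a list of (services, pause) waves. The sketch below shows that shape under simplifying assumptions: it runs docker compose locally through subprocess, whereas the real script issues its commands over SSH, and the function names are hypothetical.

```python
import subprocess
import time

# (services to start, seconds to wait afterwards); an empty list means "all".
WAVES = [
    (["postgres"], 15),
    (["zookeeper", "kafka"], 20),
    (["keycloak", "prometheus", "grafana", "minio"], 15),
    ([], 0),
]


def wave_commands(waves=WAVES):
    """Translate each wave into a docker compose argv plus its pause."""
    return [
        (["docker", "compose", "up", "-d", *services], pause)
        for services, pause in waves
    ]


def start_in_waves(run=subprocess.run, sleep=time.sleep) -> None:
    """Start dependencies first so later services find them ready."""
    for command, pause in wave_commands():
        run(command, check=True)  # stop at the first failing wave
        sleep(pause)


if __name__ == "__main__":
    for command, pause in wave_commands():
        print(" ".join(command), f"(then sleep {pause}s)")
```

Fixed sleeps are the simplest possible readiness check; polling each container's health status would be more robust but is not what the description above implies.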

To run a full production deploy:

python deploy/deploy_to_vps.py

Hardcoded credentials

deploy_to_vps.py contains the VPS host, user, and password as hardcoded literals in source. This is a known security finding. Do not treat the script as a model — credentials belong in a secrets manager. The same applies to deploy/.env.production (plaintext DB / Keycloak / Grafana / MinIO passwords) and seed_keycloak_vps.py.
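One low-effort way to get the literals out of source is to read them from the environment and fail fast when they are missing. The sketch below illustrates that idea only; the variable names (DATAFLOW_VPS_HOST and so on) are hypothetical, and a proper secrets manager or SSH key authentication would be a stronger fix.

```python
import os
from dataclasses import dataclass

# Hypothetical variable names, chosen for this example.
REQUIRED_VARS = ("DATAFLOW_VPS_HOST", "DATAFLOW_VPS_USER", "DATAFLOW_VPS_PASSWORD")


@dataclass(frozen=True)
class VpsCredentials:
    host: str
    user: str
    password: str


def load_vps_credentials(env=os.environ) -> VpsCredentials:
    """Read connection details from the environment instead of source code."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return VpsCredentials(
        host=env["DATAFLOW_VPS_HOST"],
        user=env["DATAFLOW_VPS_USER"],
        password=env["DATAFLOW_VPS_PASSWORD"],
    )
```

Passing env as a parameter keeps the function easy to test without touching the real process environment.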

Build images one at a time

Step 5 deliberately builds images sequentially. The VPS is memory-constrained and shares the host with unrelated Plesk vhosts, so a parallel build of all five Kotlin services would exhaust RAM. Each Kotlin build runs with a 900-second timeout.
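A sequential build loop with a per-service timeout might look like the sketch below. It assumes local docker compose build calls rather than the script's SSH commands, the order of the Kotlin services within their group is a guess, and no timeout is applied to the non-Kotlin builds because the source does not state one.

```python
import subprocess

KOTLIN_SERVICES = [
    "api-gateway",
    "metadata-service",
    "pipeline-engine",
    "lineage-service",
    "monitor-service",
]
# Frontend first, then Kotlin, then Python, as described in step 5.
BUILD_ORDER = ["frontend", *KOTLIN_SERVICES, "copilot", "migration-engine"]
KOTLIN_TIMEOUT_SECONDS = 900


def build_sequentially(run=subprocess.run) -> list[str]:
    """Build one image at a time so peak memory stays at a single build."""
    built = []
    for service in BUILD_ORDER:
        timeout = KOTLIN_TIMEOUT_SECONDS if service in KOTLIN_SERVICES else None
        # subprocess.run raises TimeoutExpired if the build exceeds the limit.
        run(["docker", "compose", "build", service], check=True, timeout=timeout)
        built.append(service)
    return built
```

Because check=True raises on the first failure, a broken build stops the loop before later images are attempted.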

The last recorded run (.deploy.log, May 19) failed locally during step 5: a Windows UnicodeEncodeError (cp1252) crashed the Python script while it was streaming the frontend build output. The crash happened on the operator's machine, not on the server, but because the script stopped mid-run, the remaining builds and the start and verify steps may not have executed. Treat that deploy as incomplete until it has been verified manually.
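This class of crash can be avoided by never writing characters the local console cannot encode. The helper below is a minimal sketch of that idea, not the script's actual code: safe_write is a hypothetical name, and it replaces unencodable characters with "?" instead of raising.

```python
import sys


def safe_write(text: str, stream=None) -> None:
    """Write text to a stream, substituting characters its encoding lacks."""
    stream = sys.stdout if stream is None else stream
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # Round-trip through the stream's encoding; unencodable characters
    # become "?" rather than triggering UnicodeEncodeError.
    printable = text.encode(encoding, errors="replace").decode(encoding)
    stream.write(printable)


if __name__ == "__main__":
    safe_write("frontend build \u2713 done\n")
```

On Python 3.7 and later, calling sys.stdout.reconfigure(errors="replace") once at startup should achieve a similar effect for everything printed afterwards.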

Production environment file

deploy/.env.production offsets every port to coexist with an itsm stack on the same VPS:

| Component | Production port |
|---|---|
| Postgres | 5433 |
| Keycloak | 4180 |
| Prometheus | 9095 |
| Grafana | 4001 |
| MinIO | 9002 / 9003 |

Redis is not deployed (the gateway falls back to in-memory rate limiting), ANTHROPIC_API_KEY is empty (copilot AI features are disabled), and CORS_ALLOWED_ORIGINS is set to https://etl.exai.cloud.


nginx — TLS termination and routing

System nginx on the VPS (not a container) terminates TLS using a Plesk certificate at /opt/psa/var/certificates/scfht5otmidur05fohWXaj and reverse-proxies to the containers.

| Path | Proxied to | Notes |
|---|---|---|
| /auth/ | 127.0.0.1:4180 | Keycloak container |
| /api/ | 127.0.0.1:8085/api/ | api-gateway; 300s timeouts, WebSocket upgrade, SSE (proxy_buffering off) |
| / | 127.0.0.1:3006 | frontend container |
| port 80 | | 301 redirect to HTTPS |

There is no oauth2-proxy. Authentication is Keycloak OIDC directly — the frontend SPA performs the OIDC flow, and nginx simply proxies /auth/ to the Keycloak container.

The nginx upgrade-header hotfix

apply-nginx-fix.py resolves a real production incident. The original config hardcoded Connection "upgrade" on /api/, which left nginx in connection-tunnel mode and never emitted the HTTP/2 END_STREAM frame for JWT-authenticated browser fetches. The symptom: marketplace, templates, my-pipelines, and data-browser pages spun forever.

The fix introduces a conditional map:

map $http_upgrade $etl_connection_upgrade {
    default upgrade;
    ''      close;
}

The script backs up the old config with a timestamp, validates with nginx -t, and only reloads if validation passes. To apply it:

python deploy/apply-nginx-fix.py
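The two ideas in this hotfix, the conditional Connection value and the backup, validate, reload sequence, can be sketched as follows. This is an illustration under assumptions rather than the contents of apply-nginx-fix.py: the function names and the backup file naming are hypothetical, the commands run locally instead of over SSH, and restoring the backup on a failed validation is an added safety step the source does not mention.

```python
import shutil
import subprocess
from datetime import datetime
from pathlib import Path


def etl_connection_upgrade(http_upgrade: str) -> str:
    """Mirror the nginx map: empty Upgrade header gives 'close', else 'upgrade'."""
    return "close" if http_upgrade == "" else "upgrade"


def apply_nginx_config(new_conf: str, conf_path: Path, run=subprocess.run) -> bool:
    """Back up, write, validate with nginx -t, and reload only on success."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = conf_path.with_name(f"{conf_path.name}.{stamp}.bak")
    shutil.copy2(conf_path, backup)           # timestamped backup of the old config
    conf_path.write_text(new_conf)
    if run(["nginx", "-t"]).returncode != 0:  # validation failed
        shutil.copy2(backup, conf_path)       # assumption: put the old config back
        return False
    run(["nginx", "-s", "reload"], check=True)
    return True
```

With the map in place, ordinary fetch requests (no Upgrade header) get Connection: close, while WebSocket handshakes still get Connection: upgrade.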

Shared-host nginx warnings

Because etl.exai.cloud shares the VPS with itsm.exai.cloud and openmeet.exai.cloud, the server name conflicts with other Plesk vhosts and nginx emits warnings. These are expected on this host and do not indicate a broken deploy.


Database migrations — Flyway on startup

On the VPS path there is no separate migration job. Each Spring Boot service runs Flyway on startup against the shared PostgreSQL instance.

| Service | Migration location | Versions | History table |
|---|---|---|---|
| metadata-service | db/migration/metadata/ | V1–V51+ | flyway_schema_history |
| monitor-service | db/migration/monitor/ | V9–V25 | flyway_schema_history_monitor |
| pipeline-engine | db/migration/engine/ | V1, V3–V9 | flyway_schema_history_engine |

Because the VPS originally had pre-Flyway tables, the Compose environment hardens Flyway to converge regardless of starting DB state:

SPRING_FLYWAY_VALIDATE_ON_MIGRATE=false
SPRING_FLYWAY_BASELINE_ON_MIGRATE=true
SPRING_FLYWAY_BASELINE_VERSION=20.1
SPRING_FLYWAY_OUT_OF_ORDER=true
SPRING_FLYWAY_REPAIR_ON_MIGRATE=true
SPRING_FLYWAY_PLACEHOLDER_REPLACEMENT=false

In addition, scripts/make-migrations-idempotent.py rewrites migration DDL in place: CREATE TABLE becomes CREATE TABLE IF NOT EXISTS (same for INDEX/SEQUENCE), ADD COLUMN becomes ADD COLUMN IF NOT EXISTS, and each CREATE TRIGGER is prefixed with a DROP TRIGGER IF EXISTS.
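A regex-based rewrite of this kind can be sketched in a few lines. The code below is an illustration of the technique, not the contents of make-migrations-idempotent.py, and it makes simplifying assumptions: it does not parse SQL, so it can misfire inside string literals or comments, and it does not handle forms such as CREATE INDEX CONCURRENTLY or unnamed indexes.

```python
import re

_FLAGS = re.IGNORECASE

# CREATE [UNIQUE] TABLE|INDEX|SEQUENCE not already followed by IF NOT EXISTS.
_CREATE = re.compile(
    r"\b(CREATE\s+(?:UNIQUE\s+)?(?:TABLE|INDEX|SEQUENCE))\b(?!\s+IF\s+NOT\s+EXISTS\b)",
    _FLAGS,
)
_ADD_COLUMN = re.compile(r"\b(ADD\s+COLUMN)\b(?!\s+IF\s+NOT\s+EXISTS\b)", _FLAGS)
# Captures the trigger name and the table after the first ON keyword.
_TRIGGER = re.compile(
    r"\bCREATE\s+TRIGGER\s+(\w+)\b.*?\bON\s+([\w.]+)", _FLAGS | re.DOTALL
)


def _drop_then_create(match: re.Match) -> str:
    name, table = match.group(1), match.group(2)
    return f"DROP TRIGGER IF EXISTS {name} ON {table};\n{match.group(0)}"


def make_idempotent(sql: str) -> str:
    """Rewrite common DDL statements so re-running them does not fail."""
    sql = _CREATE.sub(r"\1 IF NOT EXISTS", sql)
    sql = _ADD_COLUMN.sub(r"\1 IF NOT EXISTS", sql)
    # Note: running this twice adds a second (harmless) DROP TRIGGER line.
    return _TRIGGER.sub(_drop_then_create, sql)


if __name__ == "__main__":
    print(make_idempotent("CREATE TABLE users (id INT);"))
```

Rewriting applied migrations changes their checksums, which is consistent with validation being disabled in the Flyway settings above.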

The GKE path instead runs a Go binary /app/migrate --direction=up --environment=… via kubectl exec — a separate, divergent migration tool.


Post-deploy seeding

After a VPS deploy, several idempotent scripts seed the runtime:

  • deploy/seed_keycloak_vps.py — hits the Keycloak admin REST API on :4180. It enables unmanagedAttributePolicy=ENABLED, sets a workspace_id attribute on 4 demo users, adds an oidc-usermodel-attribute-mapper so workspace_id ships in every JWT, updates the dataflow-app client redirect URIs / webOrigins / PKCE (S256), resets demo-user passwords, and verifies a JWT grant.
  • deploy/seed_monitor_data.sql — seeds monitor_alerts and monitor_pipeline_runs with realistic telecom data for the e2e Playwright sweep.
  • deploy/create_user.py, setup_keycloak.py, ssh_cmd.py — supporting one-off scripts.
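For orientation, the protocol mapper that puts workspace_id into tokens would be described to Keycloak with a payload along these lines. This is a sketch of the representation only, assuming the standard Keycloak protocol mapper fields; the mapper name and the exact set of token types enabled are guesses, and the seeding script may configure them differently.

```python
import json


def workspace_id_mapper(attribute: str = "workspace_id") -> dict:
    """Build a user-attribute protocol mapper representation for Keycloak."""
    return {
        "name": attribute,  # hypothetical mapper name
        "protocol": "openid-connect",
        "protocolMapper": "oidc-usermodel-attribute-mapper",
        "config": {
            "user.attribute": attribute,   # attribute read from the user
            "claim.name": attribute,       # claim written into the token
            "jsonType.label": "String",
            # Keycloak stores mapper config values as strings.
            "id.token.claim": "true",
            "access.token.claim": "true",
            "userinfo.token.claim": "true",
        },
    }


if __name__ == "__main__":
    print(json.dumps(workspace_id_mapper(), indent=2))
```

A payload like this is typically posted to the client's protocol-mappers endpoint of the admin REST API with an admin bearer token.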

Software lifecycle summary

The documented CI/CD lifecycle: code (conventional commits, CODEOWNERS, Dependabot) → build & test (ci.yml + zero-tolerance.yml + security-scan.yml) → image build & publish to Artifact Registry → deploy (develop→dev auto, main→staging auto, prod via manual canary) → migrate → verify with smoke/integration tests → monitor and auto-rollback → release on semver tags.

The actual VPS lifecycle: git archive HEAD → SSH upload → docker compose build on the server → ordered docker compose up -d → Flyway runs on startup → manual Keycloak/monitor seeding → nginx reload. Hotfixes are applied via targeted scripts. There is no canary, no automated rollback, and no automated tests on the VPS.
