Operations
Deployment & rollout
DataFlow AI has two distinct deployment realities: a documented CI/CD path that builds images in GitHub Actions and rolls them out to GKE Autopilot with canary stages and auto-rollback, and the actual live production system — a single Debian VPS running Docker Compose. This page documents both, honestly, so operators know which one runs in front of users.
Two deployment models
The repository contains two separate, inconsistent ways to deploy the platform. Knowing which one is real matters before you touch production.
| Model | Target | Provisioned by | Rollout | Migrations | Runs production today? |
|---|---|---|---|---|---|
| (a) Documented CI/CD | GKE Autopilot clusters (dev / staging / prod) | GitHub Actions → GCP Artifact Registry → Kustomize → kubectl | Canary 10% → 50% → 100% with auto-rollback | Go /app/migrate binary via kubectl exec | No — aspirational |
| (b) Actual production | Single Debian VPS etl.exai.cloud (57.129.120.73) | Hand-written Python scripts (deploy/deploy_to_vps.py) over SSH | None — full rebuild + ordered restart | Spring Boot Flyway on service startup | Yes |
Know which path is live
The GKE workflows reference dataflow.polkomtel.internal, GCP regions, and a Go migration tool. None of that runs production. Production is the VPS at etl.exai.cloud, deployed with Docker Compose. Treat the GKE path as a target architecture, not the current state.
The two paths diverge in fundamental ways:
- GKE uses Kustomize overlays per environment; the VPS uses a single docker-compose.yml.
- GKE runs a dedicated Go /app/migrate tool; the VPS runs Flyway-on-startup with aggressive repair flags.
- GKE has canary stages, smoke tests, integration tests, and automated rollback; the VPS path has none of those.
- GKE assumes multi-region Cloud SQL; the VPS has a single PostgreSQL container.
Choosing a target topology
This page covers how the platform is built and shipped. It does not cover which infrastructure topology to deploy onto — On-Premises, full GCP cloud, or the recommended Hybrid — nor the sizing and cost of each.
For topology selection, infrastructure requirements, capacity sizing, the full GCP service cost breakdown, and a decision guide, see Deployment scenarios & sizing. In short: the source documents recommend the Hybrid topology for Polkomtel (source databases stay on-prem, the platform runs on GCP), with a 3-Year TCO of roughly $606K–$741K.
Intended topology vs. current reality
The deployment-scenarios page describes the intended GKE-based GCP architecture. The build and rollout mechanics documented on this page reflect what runs production today — a single Docker Compose VPS. Read both together to understand the gap between target and current state.
Service inventory
Eight deployable services plus the infrastructure tier make up a full deployment.
Application services
| Service | Stack | Build | Container port | VPS host port |
|---|---|---|---|---|
| api-gateway | Kotlin / Spring Cloud Gateway | platform/Dockerfile BUILD_MODULE=api-gateway | 8080 | 8085 |
| metadata-service | Kotlin / Spring Boot | BUILD_MODULE=metadata-service | 8080 | 8181 |
| pipeline-engine | Kotlin / Spring Boot | BUILD_MODULE=pipeline-engine | 8080 | 8082 |
| lineage-service | Kotlin / Spring Boot | BUILD_MODULE=lineage-service | 8080 | 8083 |
| monitor-service | Kotlin / Spring Boot | BUILD_MODULE=monitor-service | 8080 | 8084 |
| copilot | Python 3.12 / FastAPI | ai-services/copilot/Dockerfile | 8000 | 8090 |
| migration-engine | Python 3.12 / FastAPI | ai-services/migration-engine/Dockerfile | 8000 | 8091 |
| frontend | React / Vite + nginx | frontend/Dockerfile | 80 | 3006 |
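As one way to use the host-port column, the sketch below builds local health-check URLs for the five Kotlin services. It assumes the Actuator endpoint is served on the main application port and is reachable through the published host port; KOTLIN_HOST_PORTS and actuator_health_url are illustrative names, not part of the repository.

```python
# Host ports copied from the service table above (VPS deployment).
KOTLIN_HOST_PORTS = {
    "api-gateway": 8085,
    "metadata-service": 8181,
    "pipeline-engine": 8082,
    "lineage-service": 8083,
    "monitor-service": 8084,
}


def actuator_health_url(service: str, host: str = "127.0.0.1") -> str:
    """Return the assumed Actuator health URL for one Kotlin service."""
    port = KOTLIN_HOST_PORTS[service]  # raises KeyError for unknown services
    return f"http://{host}:{port}/actuator/health"
```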
The Go CLI (backend/cli) ships as release binaries, not as a deployed service. The Gradle modules pushdown-sql and connector-sdk are referenced by the build and Prometheus config but are not standalone Compose services.
Infrastructure services
These run alongside the application containers via Docker Compose:
- PostgreSQL 15: pgvector/pgvector:pg15
- Keycloak 24: OIDC identity provider
- Zookeeper + Kafka: confluentinc/cp-* 7.7.1 (single broker, RF=1, not HA)
- Prometheus: v2.54.1
- Grafana: 11.2.2
- MinIO: S3-compatible object storage
- Redis 7-alpine: not deployed on the VPS; the gateway falls back to in-memory rate limiting
Build architecture
Kotlin services
All five Kotlin services build from a single multi-stage backend/platform/Dockerfile, selected by the BUILD_MODULE build-arg.
- Stage 1 (gradle:8.5-jdk21): build files are copied first for layer caching, dependencies are pre-downloaded, then gradle :${BUILD_MODULE}:bootJar -x test --no-daemon runs the in-process Kotlin compiler with GRADLE_OPTS=-Xmx2g.
- Stage 2 (eclipse-temurin:21-jre-alpine): a non-root dataflow user, container-aware JVM (MaxRAMPercentage=75), and a Spring Boot Actuator healthcheck on /actuator/health.
Dockerfile.local is a runtime-only variant — it expects JARs pre-built on the host (./gradlew bootJar -x test --parallel) and bypasses Docker DNS issues during builds.
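To make the BUILD_MODULE selection concrete, the sketch below assembles a docker build invocation for one Kotlin module. The Dockerfile path and the build-arg come from this page; the build context (backend/platform) and the image tag are assumptions, and kotlin_build_command is a hypothetical helper.

```python
from typing import List

KOTLIN_MODULES = [
    "api-gateway",
    "metadata-service",
    "pipeline-engine",
    "lineage-service",
    "monitor-service",
]


def kotlin_build_command(module: str, tag: str,
                         context: str = "backend/platform") -> List[str]:
    """Build the argv for one module; the context path is an assumption."""
    if module not in KOTLIN_MODULES:
        raise ValueError(f"unknown Kotlin module: {module}")
    return [
        "docker", "build",
        "-f", "backend/platform/Dockerfile",
        "--build-arg", f"BUILD_MODULE={module}",
        "-t", tag,
        context,
    ]
```

The same Dockerfile serves all five services, so only the build-arg and the tag change between invocations.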
Python services
copilot and migration-engine each have their own Dockerfile, built from their own context. Copilot loads LLM secrets from a gitignored .env.local (required: false), so its absence does not break builds.
Frontend
frontend/Dockerfile is multi-stage: node:22-alpine runs vite build, and the runtime is nginx:alpine serving /app/dist with a custom nginx.conf for SPA routing.
The CI pipeline
Eight GitHub Actions workflows live in .github/workflows/. The most important for deployment are described below.
ci.yml — Continuous Integration
Triggered on push to main/develop and PRs to main. It uses dorny/paths-filter to detect which monorepo parts changed (kotlin / python / go / frontend / terraform / docker / openapi) and conditionally runs language-matrixed jobs:
| Job | What it runs |
|---|---|
| kotlin-build | JDK 21 + Gradle: compile → Detekt lint → unit tests → integrationTest → build; uploads reports + JARs |
| python-test | Matrix [copilot, migration-engine], Python 3.12 + Poetry 1.8.4: Ruff lint/format, mypy (non-blocking), pytest with coverage. PR coverage gate thresholdAll 0.70 / thresholdNew 0.80 |
| go-build | Go 1.22: build / vet / test with -race, golangci-lint |
| frontend-build | Node 20: ESLint, tsc -b --noEmit, build, vitest coverage |
| terraform-validate | terraform fmt/validate per environment + module, TFLint |
| docker-compose-validate | docker compose config, hadolint |
| openapi-validate | Redocly lint of 5 OpenAPI specs + oasdiff breaking-change check |
| ci-status | Aggregate gate for branch protection |
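The change detection can be pictured as a mapping from path prefixes to components. The sketch below is a Python stand-in for what dorny/paths-filter does in the workflow, not the workflow's actual filter file. The prefixes are assumptions inferred from paths mentioned on this page; the terraform, docker, and openapi filters are omitted because their paths are not documented here.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical prefixes; the real filter patterns live in ci.yml.
FILTERS: Dict[str, Tuple[str, ...]] = {
    "kotlin": ("backend/platform/",),
    "python": ("ai-services/",),
    "go": ("backend/cli/",),
    "frontend": ("frontend/",),
}


def changed_components(paths: Iterable[str]) -> List[str]:
    """Return the sorted component names touched by the changed paths."""
    hits = set()
    for path in paths:
        for component, prefixes in FILTERS.items():
            if path.startswith(prefixes):  # str.startswith accepts a tuple
                hits.add(component)
    return sorted(hits)
```

A job such as frontend-build would then run only when its component appears in the result.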
zero-tolerance.yml — Compliance gate
On PR/push to main/develop, runs scripts/detect-fake-code.js against changed .ts/.tsx/.js/.jsx files, enforcing the "no fake code" rule, plus frontend typecheck/lint/test/build.
security-scan.yml — Security
Weekly cron (Mon 06:00 UTC) plus push/PR to main. Jobs:
- gitleaks: secret detection
- Trivy: per-service container scan, CRITICAL,HIGH, exit-code 1
- Snyk: dependency scan
- OWASP dependency-check: JVM dependencies
- Python safety: Python dependency vulnerabilities
- Checkov: Terraform misconfiguration scan
- license-check: fails the build on GPL/AGPL licenses
codeql-analysis.yml runs static analysis as well.
Deploy and release workflows
- deploy-dev.yml: on push to develop, builds and pushes all 8 images (dev-<sha>, dev-latest), runs kustomize edit set image on the dev overlay, kubectl apply --prune, waits for rollout, runs smoke tests, Slack-notifies.
- deploy-staging.yml: on push to main / v*.*.*-rc* tag / manual, builds + pushes 8 images, creates a pre-deploy Cloud SQL backup, applies the staging overlay, runs DB migrations via kubectl exec deployment/api-gateway -- /app/migrate --direction=up --environment=staging, then smoke and integration tests.
- deploy-production.yml: workflow_dispatch only; the canary rollout (see below).
- release.yml: on v*.*.* tags, builds semver-tagged release images, builds Go CLI binaries for 5 OS/arch combos, generates a conventional-commit changelog, and creates a GitHub Release with SHA256SUMS.txt.
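The trigger rules for these workflows can be summarised as a small routing function. This is an approximation in Python using fnmatch, which is close to but not identical to GitHub's own glob matching; deploy_workflows_for is a hypothetical name, and manual dispatch is left out.

```python
from fnmatch import fnmatchcase
from typing import List


def deploy_workflows_for(ref_type: str, ref_name: str) -> List[str]:
    """Map a pushed branch or tag to the deploy/release workflows it starts."""
    workflows: List[str] = []
    if ref_type == "branch":
        if ref_name == "develop":
            workflows.append("deploy-dev.yml")
        elif ref_name == "main":
            workflows.append("deploy-staging.yml")
    elif ref_type == "tag":
        if fnmatchcase(ref_name, "v*.*.*-rc*"):
            workflows.append("deploy-staging.yml")
        if fnmatchcase(ref_name, "v*.*.*"):
            workflows.append("release.yml")
    return workflows
```

Under plain glob matching, a release-candidate tag such as v1.2.3-rc1 also matches v*.*.*, so this sketch returns both workflows for it. Whether release.yml excludes pre-release tags is not stated on this page and is worth checking in the workflow file.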
Authentication to GCP uses Workload Identity Federation (google-github-actions/auth@v2, id-token: write) — there are no static GCP keys in the repo.
Canary rollout (GKE production path)
The documented production deploy (deploy-production.yml) is a manual workflow_dispatch with inputs: image_tag (must be a tested staging tag), canary_weight_step1 (default 10%), canary_weight_step2 (default 50%), and skip_canary.
┌──────────────────────────────────────────────────────────────────┐
│ 1. pre-deploy-validation │
│ verify images in Artifact Registry · verify Trivy scan · │
│ check current prod pod health │
└────────────────────────────┬─────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────────┐
│ 2. database-backup gcloud sql backups create dataflow-prod-db │
└────────────────────────────┬─────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────────┐
│ 3. canary-stage-1 (10%) │
│ clone each deployment as ${svc}-canary · scale by weight · │
│ 5-minute health observation │
│ ABORT if >10 failures OR >2 pod restarts ──────────┐ │
└────────────────────────────┬──────────────────────────────┼───────┘
▼ │
┌──────────────────────────────────────────────────────────┐│
│ 4. canary-stage-2 (50%) ││
│ scale canary to 50% · 10-minute observation ││
└────────────────────────────┬──────────────────────────────┼───────┘
▼ │
┌──────────────────────────────────────────────────────────┐│
│ 5. full-deploy (100%) ││
│ kustomize edit set image · DB migrations (dry-run then ││
│ real) · kubectl apply --prune · wait rollout · delete ││
│ canary deployments ││
└────────────────────────────┬──────────────────────────────┼───────┘
▼ │
┌──────────────────────────────────────────────────────────┐│
│ 6. post-deploy-verify ││
│ all pods healthy · smoke tests vs ingress IP · ││
│ replica counts match ││
└────────────────────────────┬──────────────────────────────┘│
▼ on failure() ▼
┌──────────────────────────────────────────────────────────────────┐
│ 7. rollback kubectl rollout undo all deployments · remove │
│ canaries · verify │
└────────────────────────────┬─────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────────┐
│ 8. notify Slack on every outcome · PagerDuty critical if │
│ deploy failed AND rollback also failed │
└──────────────────────────────────────────────────────────────────┘
Canary stage 1 clones each deployment as ${svc}-canary, scales it by the configured weight, and observes health for 5 minutes — aborting if more than 10 failures or more than 2 pod restarts occur. Stage 2 raises the canary to 50% and observes for 10 minutes. Only then does full-deploy cut over to 100% and delete the canary deployments.
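The abort rule and the weight-based scaling can be sketched as two small functions. The thresholds (more than 10 failures, more than 2 pod restarts) come from this page; how the workflow rounds the canary replica count is not documented, so rounding up with a minimum of one replica is an assumption, as are the names Observation, should_abort, and canary_replicas.

```python
import math
from dataclasses import dataclass


@dataclass
class Observation:
    failures: int       # failures counted during the observation window
    pod_restarts: int   # canary pod restarts during the same window


def should_abort(obs: Observation, max_failures: int = 10,
                 max_restarts: int = 2) -> bool:
    """Abort when either documented threshold is exceeded (strictly greater)."""
    return obs.failures > max_failures or obs.pod_restarts > max_restarts


def canary_replicas(stable_replicas: int, weight_percent: int) -> int:
    """Assumed scaling rule: round up, and never go below one canary pod."""
    return max(1, math.ceil(stable_replicas * weight_percent / 100))
```

Note that exactly 10 failures or exactly 2 restarts does not abort under this reading of "more than".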
Canary applies only to GKE
The 10% → 50% → 100% canary, the smoke/integration test gates, and the automated kubectl rollout undo exist only in the GKE workflows. The VPS production path has no canary and no automated rollback.
Production deployment to the VPS — step by step
This is the path that actually ships production. It is driven by deploy/deploy_to_vps.py, a 7-step Python/paramiko script.
Topology
Internet (HTTPS :443)
│
▼
┌───────────────────────────────────────────────────────┐
│ Debian VPS etl.exai.cloud (57.129.120.73) │
│ │
│ System nginx ── TLS termination (Plesk cert) ──┐ │
│ /auth/ → 127.0.0.1:4180 (Keycloak) │ │
│ /api/ → 127.0.0.1:8085 (api-gateway) │ │
│ / → 127.0.0.1:3006 (frontend) │ │
│ │ │ │
│ ┌───────────────────▼────────────────────────┐ │ │
│ │ docker compose — network dataflow-network │ │ │
│ │ 8 app services + postgres · keycloak · │ │ │
│ │ kafka · zookeeper · prometheus · grafana · │ │ │
│ │ minio │ │ │
│ └─────────────────────────────────────────────┘ │ │
│ (shared host: itsm.exai.cloud, openmeet.exai.cloud)│ │
└───────────────────────────────────────────────────────┘
The 7 steps of deploy_to_vps.py
1. Archive: git archive HEAD produces deploy/dataflow-source.tar.gz (~17.4 MB).
2. Connect: SSH/SFTP into the VPS.
3. Upload: push the tarball, rm -rf /home/debian/dataflow, untar it; upload .env.production to both backend/.env and frontend/.env.production; upload etl-dataflow.conf.
4. nginx config: install the config to /etc/nginx/conf.d/etl-dataflow.conf and run nginx -t.
5. Build images: build one image at a time (frontend first, then the 5 Kotlin services at a 900s timeout each to manage VPS memory, then the 2 Python services).
6. Start in waves: postgres (sleep 15) → zookeeper kafka (sleep 20) → keycloak prometheus grafana minio (sleep 15) → docker compose up -d (all); then reload system nginx.
7. Verify: docker compose ps and curl the frontend on 127.0.0.1:3006.
To run a full production deploy:
python deploy/deploy_to_vps.py
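The wave-based start can be sketched as a generator of remote shell commands. The service groups and sleep durations are taken from the steps above; the exact nginx reload command is not given on this page, so systemctl reload nginx is an assumption, and WAVES and wave_commands are illustrative names rather than the script's own.

```python
from typing import List, Tuple

# (services to start, seconds to wait afterwards); an empty list means "all".
WAVES: List[Tuple[List[str], int]] = [
    (["postgres"], 15),
    (["zookeeper", "kafka"], 20),
    (["keycloak", "prometheus", "grafana", "minio"], 15),
    ([], 0),
]


def wave_commands() -> List[str]:
    """Return the ordered shell commands for a wave-based start."""
    commands: List[str] = []
    for services, pause in WAVES:
        commands.append(" ".join(["docker", "compose", "up", "-d", *services]))
        if pause:
            commands.append(f"sleep {pause}")
    commands.append("systemctl reload nginx")  # assumed reload command
    return commands
```

Starting the database and broker tiers first gives the Spring Boot services something to connect to when the final wave brings them up.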
Hardcoded credentials
deploy_to_vps.py contains the VPS host, user, and password as hardcoded literals in source. This is a known security finding. Do not treat the script as a model — credentials belong in a secrets manager. The same applies to deploy/.env.production (plaintext DB / Keycloak / Grafana / MinIO passwords) and seed_keycloak_vps.py.
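As a first step away from literals in source, a deploy script could read its credentials from the process environment, populated by a secrets manager or CI secret store. The sketch below shows that pattern; the variable names (DATAFLOW_VPS_HOST and so on) and the VpsCredentials type are hypothetical and do not exist in the repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical variable names, chosen for this example only.
REQUIRED = ("DATAFLOW_VPS_HOST", "DATAFLOW_VPS_USER", "DATAFLOW_VPS_PASSWORD")


@dataclass(frozen=True)
class VpsCredentials:
    host: str
    user: str
    password: str


def load_credentials(env: Optional[Mapping[str, str]] = None) -> VpsCredentials:
    """Read credentials from the environment and fail loudly if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing deploy credentials: " + ", ".join(missing))
    return VpsCredentials(*(env[name] for name in REQUIRED))
```

Key-based SSH authentication would remove the password entirely; that choice is outside what this page documents.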
Build images one at a time
Step 5 deliberately builds images sequentially. The VPS is memory-constrained and shares the host with unrelated Plesk vhosts, so a parallel build of all five Kotlin services would exhaust RAM. Each Kotlin build runs with a 900-second timeout.
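A sequential build with per-service timeouts might look like the following. The order and the 900-second Kotlin timeout come from this page; timeouts for the frontend and Python builds are not documented, so they are left unset here. The real script runs these commands over SSH, so the local subprocess runner is only a stand-in, and build_plan and run_plan are hypothetical names.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

KOTLIN = ["api-gateway", "metadata-service", "pipeline-engine",
          "lineage-service", "monitor-service"]
PYTHON = ["copilot", "migration-engine"]
KOTLIN_TIMEOUT_S = 900

Plan = List[Tuple[str, Optional[int]]]


def build_plan() -> Plan:
    """Frontend first, then Kotlin services (900s each), then Python services."""
    plan: Plan = [("frontend", None)]
    plan += [(svc, KOTLIN_TIMEOUT_S) for svc in KOTLIN]
    plan += [(svc, None) for svc in PYTHON]
    return plan


def run_plan(plan: Plan, run: Callable = subprocess.run) -> None:
    """Build strictly one image at a time, stopping at the first failure."""
    for service, timeout in plan:
        # check=True raises on a non-zero exit; timeout raises TimeoutExpired.
        run(["docker", "compose", "build", service], check=True, timeout=timeout)
```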
The last recorded run (.deploy.log, May 19) failed during step 5: a Windows UnicodeEncodeError (cp1252) crashed the local Python script while it streamed the frontend build output. The crash happened on the operator's machine, not on the server, but it interrupted the run partway through, so whether that deploy completed is uncertain and should be verified manually.
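One way to avoid this class of crash is to make streamed output safe for whatever encoding the local console uses. The helper below is a sketch of that idea and is not taken from deploy_to_vps.py; printable is a hypothetical name.

```python
import sys
from typing import Optional


def printable(text: str, encoding: Optional[str] = None) -> str:
    """Replace characters the target encoding cannot represent with '?'."""
    target = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(target, errors="replace").decode(target)
```

For example, printable("✓ built", "cp1252") yields "? built" instead of raising. On Python 3.7 and later, calling sys.stdout.reconfigure(errors="replace") once at startup has a similar effect for everything written to stdout.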
Production environment file
deploy/.env.production offsets every port to coexist with an itsm stack on the same VPS:
| Component | Production port |
|---|---|
| Postgres | 5433 |
| Keycloak | 4180 |
| Prometheus | 9095 |
| Grafana | 4001 |
| MinIO | 9002 / 9003 |
Redis is not deployed (the gateway falls back to in-memory rate limiting), ANTHROPIC_API_KEY is empty (Copilot AI features are disabled), and CORS_ALLOWED_ORIGINS is https://etl.exai.cloud.
nginx — TLS termination and routing
System nginx on the VPS (not a container) terminates TLS using a Plesk certificate at /opt/psa/var/certificates/scfht5otmidur05fohWXaj and reverse-proxies to the containers.
| Path | Proxied to | Notes |
|---|---|---|
| /auth/ | 127.0.0.1:4180 | Keycloak container |
| /api/ | 127.0.0.1:8085/api/ | api-gateway; 300s timeouts, WebSocket upgrade, SSE (proxy_buffering off) |
| / | 127.0.0.1:3006 | frontend container |
| port 80 | — | 301 redirect to HTTPS |
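The routing table follows nginx's longest-prefix rule for plain prefix locations: the most specific matching path wins, and / catches everything else. The sketch below models that rule in Python for the three HTTPS routes; it assumes the config uses plain prefix locations with no regex locations, and upstream_for is a hypothetical name.

```python
# Prefix → upstream, as listed in the routing table above.
ROUTES = {
    "/auth/": "127.0.0.1:4180",
    "/api/": "127.0.0.1:8085",
    "/": "127.0.0.1:3006",
}


def upstream_for(path: str) -> str:
    """Pick the upstream whose location prefix is the longest match."""
    if not path.startswith("/"):
        raise ValueError("request path must start with '/'")
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    return ROUTES[max(matches, key=len)]
```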
There is no oauth2-proxy. Authentication is Keycloak OIDC directly — the frontend SPA performs the OIDC flow, and nginx simply proxies /auth/ to the Keycloak container.
The nginx upgrade-header hotfix
apply-nginx-fix.py resolves a real production incident. The original config hardcoded Connection "upgrade" on /api/, which left nginx in connection-tunnel mode and never emitted the HTTP/2 END_STREAM frame for JWT-authenticated browser fetches. The symptom: marketplace, templates, my-pipelines, and data-browser pages spun forever.
The fix introduces a conditional map:
map $http_upgrade $etl_connection_upgrade {
    default upgrade;
    ''      close;
}
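Read as a function, the map derives the Connection header from the request's Upgrade header: an empty or absent Upgrade header yields close, and anything else yields upgrade. A minimal Python rendering of that rule, for illustration only (connection_header is a hypothetical name):

```python
def connection_header(http_upgrade: str) -> str:
    """Mirror the nginx map: '' maps to close, any other value to upgrade."""
    return "close" if http_upgrade == "" else "upgrade"
```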
The script backs up the old config with a timestamp, validates with nginx -t, and only reloads if validation passes. To apply it:
python deploy/apply-nginx-fix.py
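The backup, validate, reload sequence can be sketched as follows. This is not the contents of apply-nginx-fix.py: the backup suffix format, the restore-on-failure step, and the names apply_fix and backup_path are assumptions. The run callable stands in for whatever executes a shell command on the VPS and returns its exit code.

```python
from datetime import datetime
from typing import Callable, Optional

CONF = "/etc/nginx/conf.d/etl-dataflow.conf"


def backup_path(conf: str, now: datetime) -> str:
    """Timestamped backup name; the suffix format is an assumption."""
    return f"{conf}.bak-{now:%Y%m%d-%H%M%S}"


def apply_fix(run: Callable[[str], int],
              install_new_config: Callable[[str], None],
              conf: str = CONF,
              now: Optional[datetime] = None) -> str:
    """Back up, install, validate, and reload only if validation passes."""
    backup = backup_path(conf, now or datetime.now())
    if run(f"cp -p {conf} {backup}") != 0:
        return "backup-failed"
    install_new_config(conf)
    if run("nginx -t") != 0:
        run(f"cp -p {backup} {conf}")  # restoring the backup is an assumption
        return "validation-failed"
    if run("nginx -s reload") != 0:
        return "reload-failed"
    return "reloaded"
```

Validating before reloading matters on this host in particular, because a broken config would also affect the other vhosts served by the same nginx.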
Shared-host nginx warnings
Because etl.exai.cloud shares the VPS with itsm.exai.cloud and openmeet.exai.cloud, the server name conflicts with other Plesk vhosts and nginx emits warnings. These are expected on this host and do not indicate a broken deploy.
Database migrations — Flyway on startup
On the VPS path there is no separate migration job. Each Spring Boot service runs Flyway on startup against the shared PostgreSQL instance.
| Service | Migration location | Versions | History table |
|---|---|---|---|
| metadata-service | db/migration/metadata/ | V1–V51+ | flyway_schema_history |
| monitor-service | db/migration/monitor/ | V9–V25 | flyway_schema_history_monitor |
| pipeline-engine | db/migration/engine/ | V1, V3–V9 | flyway_schema_history_engine |
Because the VPS originally had pre-Flyway tables, the Compose environment hardens Flyway to converge regardless of starting DB state:
SPRING_FLYWAY_VALIDATE_ON_MIGRATE=false
SPRING_FLYWAY_BASELINE_ON_MIGRATE=true
SPRING_FLYWAY_BASELINE_VERSION=20.1
SPRING_FLYWAY_OUT_OF_ORDER=true
SPRING_FLYWAY_REPAIR_ON_MIGRATE=true
SPRING_FLYWAY_PLACEHOLDER_REPLACEMENT=false
In addition, scripts/make-migrations-idempotent.py rewrites migration DDL in place — CREATE TABLE → CREATE TABLE IF NOT EXISTS (same for INDEX/SEQUENCE), ADD COLUMN IF NOT EXISTS, and CREATE TRIGGER prefixed with DROP TRIGGER IF EXISTS.
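The kind of rewrite described here can be approximated with a few regular expressions. The sketch below is an illustration, not the contents of make-migrations-idempotent.py: it handles only simple statements (no CONCURRENTLY, no CREATE OR REPLACE TRIGGER, no quoted identifiers), and make_idempotent is a hypothetical name.

```python
import re

_CREATE = re.compile(
    r"\bCREATE\s+(UNIQUE\s+)?(TABLE|INDEX|SEQUENCE)\b"
    r"(?!\s+IF\s+NOT\s+EXISTS\b)\s+",
    re.IGNORECASE,
)
_ADD_COLUMN = re.compile(
    r"\bADD\s+COLUMN\b(?!\s+IF\s+NOT\s+EXISTS\b)\s+", re.IGNORECASE
)
_TRIGGER = re.compile(
    r"\bCREATE\s+TRIGGER\s+(\w+)\b.*?\bON\s+([\w.]+)",
    re.IGNORECASE | re.DOTALL,
)


def make_idempotent(sql: str) -> str:
    """Rewrite simple DDL so that re-running it does not fail."""
    sql = _CREATE.sub(
        lambda m: f"CREATE {'UNIQUE ' if m.group(1) else ''}"
                  f"{m.group(2).upper()} IF NOT EXISTS ",
        sql,
    )
    sql = _ADD_COLUMN.sub("ADD COLUMN IF NOT EXISTS ", sql)

    def add_drop(m: "re.Match[str]") -> str:
        drop = f"DROP TRIGGER IF EXISTS {m.group(1)} ON {m.group(2)};"
        # Skip triggers that already have a matching DROP in the file.
        return m.group(0) if drop in sql else f"{drop}\n{m.group(0)}"

    return _TRIGGER.sub(add_drop, sql)
```

Running the function on its own output leaves the text unchanged, which is the property that makes it safe to apply to a migration directory more than once.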
The GKE path instead runs a Go binary /app/migrate --direction=up --environment=… via kubectl exec — a separate, divergent migration tool.
Post-deploy seeding
After a VPS deploy, several idempotent scripts seed the runtime:
- deploy/seed_keycloak_vps.py: hits the Keycloak admin REST API on :4180. It enables unmanagedAttributePolicy=ENABLED, sets a workspace_id attribute on 4 demo users, adds an oidc-usermodel-attribute-mapper so workspace_id ships in every JWT, updates the dataflow-app client redirect URIs / webOrigins / PKCE (S256), resets demo-user passwords, and verifies a JWT grant.
- deploy/seed_monitor_data.sql: seeds monitor_alerts and monitor_pipeline_runs with realistic telecom data for the e2e Playwright sweep.
- deploy/create_user.py, setup_keycloak.py, ssh_cmd.py: supporting one-off scripts.
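For orientation, a protocol mapper of the oidc-usermodel-attribute-mapper type is usually described to the Keycloak admin API as a small JSON document. The sketch below builds such a payload for workspace_id. Which token types the seed script enables is not stated on this page, so enabling the claim on ID, access, and userinfo tokens is an assumption, and workspace_mapper is a hypothetical name.

```python
import json
from typing import Any, Dict


def workspace_mapper(attribute: str = "workspace_id") -> Dict[str, Any]:
    """Build a mapper payload that copies a user attribute into token claims."""
    return {
        "name": attribute,
        "protocol": "openid-connect",
        "protocolMapper": "oidc-usermodel-attribute-mapper",
        "config": {
            # Keycloak stores mapper config values as strings.
            "user.attribute": attribute,
            "claim.name": attribute,
            "jsonType.label": "String",
            "id.token.claim": "true",       # assumption
            "access.token.claim": "true",   # assumption
            "userinfo.token.claim": "true", # assumption
        },
    }


payload = json.dumps(workspace_mapper(), indent=2)
```

Because the seed scripts are described as idempotent, a real implementation would check for an existing mapper with the same name before creating one.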
Software lifecycle summary
The documented CI/CD lifecycle: code (conventional commits, CODEOWNERS, Dependabot) → build & test (ci.yml + zero-tolerance.yml + security-scan.yml) → image build & publish to Artifact Registry → deploy (develop→dev auto, main→staging auto, prod via manual canary) → migrate → verify with smoke/integration tests → monitor and auto-rollback → release on semver tags.
The actual VPS lifecycle: git archive HEAD → SSH upload → docker compose build on the server → ordered docker compose up -d → Flyway runs on startup → manual Keycloak/monitor seeding → nginx reload. Hotfixes are applied via targeted scripts. There is no canary, no automated rollback, and no automated tests on the VPS.