Installation & local setup
This guide walks through running the full DataFlow AI platform on a development machine — installing the prerequisites, understanding the monorepo layout, bringing the whole stack up with Docker Compose, and building each component individually from source.
Prerequisites
DataFlow AI is a polyglot monorepo, so building everything from source requires four language toolchains. To simply run the platform, Docker and Docker Compose are the only hard requirements: the Compose build handles every toolchain inside containers.
| Tool | Version | Needed for |
|---|---|---|
| Docker + Docker Compose | recent | Running the full stack (required) |
| JDK (Temurin) | 21 | Building the Kotlin / Spring Boot platform services |
| Node.js | 20 | Building the React frontend |
| Python | 3.12 | Building the FastAPI AI services |
| Go | 1.22 | Building the dataflow CLI |
| Gradle | 8.5 (wrapper included) | JVM multi-module build |
| Poetry | 1.8.4 | Python dependency management |
You only need Docker to run it
If your goal is just to run DataFlow AI locally, install Docker and skip ahead to Running the stack with Docker Compose. The JDK, Node, Python, and Go toolchains are only needed when you want to build or iterate on a component outside its container.
Verifying your toolchains
Before building from source, confirm each toolchain resolves the expected major version:
docker --version
docker compose version
java -version # should report 21
node --version # should report v20.x
python --version # should report 3.12.x
go version # should report go1.22
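The manual checks above can also be scripted. The following is a minimal sketch, not something shipped in the repository: `version_matches` and `report` are hypothetical helper names, and the expected versions are taken from the prerequisites table. One detail worth knowing is that `java -version` writes to stderr, so its output has to be redirected before it can be captured.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the DataFlow AI repo).

# version_matches EXPECTED OUTPUT
# Succeeds when OUTPUT contains EXPECTED as the leading part of a version
# number: EXPECTED=1.22 matches "go1.22.3" but not "go1.220" or "11.22".
version_matches() {
  # Escape dots so "1.22" is matched literally, not as a regex wildcard.
  pattern=$(printf '%s' "$1" | sed 's/\./\\./g')
  printf '%s\n' "$2" | grep -Eq "(^|[^0-9.])${pattern}([^0-9]|\$)"
}

# report NAME EXPECTED OUTPUT: print one status line for a tool.
report() {
  if version_matches "$2" "$3"; then
    echo "ok: $1 matches $2"
  else
    echo "MISMATCH: $1 (expected $2)"
  fi
}

# Example usage (commented out so the helpers can be sourced safely):
# report java   21   "$(java -version 2>&1)"    # java prints to stderr
# report node   20   "$(node --version)"
# report python 3.12 "$(python --version 2>&1)"
# report go     1.22 "$(go version)"
```

The match is anchored on both sides of the expected prefix, so a JDK reporting `17.0.21` is not mistaken for version 21.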
Repository layout
The monorepo is organized by language and runtime. The top-level directories you will work with are:
polcomtel/
├── backend/
│ ├── platform/ Kotlin / Spring Boot 3.3.5 multi-module Gradle build
│ │ ├── common/ shared security, models, exception handling (library)
│ │ ├── connector-sdk/ connector framework + 21 connector impls (library)
│ │ ├── pushdown-sql/ JSqlParser-based SQL dialect transpiler
│ │ ├── api-gateway/ reactive Spring Cloud Gateway — single ingress
│ │ ├── metadata-service/ catalog, connections, governance, GDPR, MCP server
│ │ ├── pipeline-engine/ pipeline DAG compilation, execution, scheduling
│ │ ├── lineage-service/ dataset/column lineage, OpenLineage ingestion
│ │ ├── monitor-service/ alerts, metrics, cost, SLA, notifications, SSE
│ │ └── integration-tests/ cross-service integration test module
│ ├── ai-services/ Python 3.12 / FastAPI services
│ │ ├── copilot/ NL-to-pipeline, NL-to-SQL, RAG, chat, RCA
│ │ └── migration-engine/ Informatica / Alteryx → DataFlow YAML conversion
│ ├── cli/ Go 1.22 Cobra command-line tool (`dataflow`)
│ ├── infrastructure/ Kubernetes manifests + Helm charts + Terraform
│ └── docker-compose.yml orchestrates all 12 services for local / VPS
├── frontend/ React 19 + Vite 7 + TypeScript 5.9 SPA
├── browser-extension/ companion browser extension
└── deploy/ VPS deployment scripts
The Gradle build root is backend/platform. Of its nine modules, five are deployable services (api-gateway, metadata-service, pipeline-engine, lineage-service, monitor-service) and the rest are libraries or test modules built into the services.
Running the stack with Docker Compose
backend/docker-compose.yml orchestrates 12 containers — eight infrastructure components plus the six application services and the frontend — on a single Docker bridge network named dataflow-network. Every container resolves its peers by container name.
Configure environment variables
Secrets are read from a gitignored .env (and .env.local) file in backend/. Create one before starting the stack:
cd backend
cp .env.example .env
At minimum, set the values used by the platform:
| Variable | Purpose |
|---|---|
| POSTGRES_PASSWORD | Password for the shared PostgreSQL instance |
| ANTHROPIC_API_KEY | Claude API key for the copilot (in .env.local) |
| LLM_PROVIDER | LLM backend: anthropic, openrouter, or local |
| KAFKA_BOOTSTRAP_SERVERS | Internal Kafka listener: kafka:29092 |
| CORS_ALLOWED_ORIGINS | Origins permitted by the gateway CORS filter |
| DATAFLOW_GATEWAY_DEV_PERMIT_READS | Dev-only flag permitting unauthenticated reads (defaults to true in Compose) |
Disable dev-permit-reads outside development
DATAFLOW_GATEWAY_DEV_PERMIT_READS defaults to true in the Compose file, which permits unauthenticated GET requests and copilot/search POSTs. This is convenient for local development but must be set to false in any environment that is exposed beyond your machine.
The copilot loads LLM provider keys from .env.local with required: false, so the stack still starts (with the copilot's RAG features degraded) when keys are absent.
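Before starting the stack it can be useful to confirm that the variables from the table are actually set. The sketch below is one possible way to do that; `env_value` and `missing_vars` are hypothetical helper names, and the parsing assumes plain `NAME=value` lines without surrounding quotes or `export` prefixes.

```shell
#!/bin/sh
# Hypothetical helpers for sanity-checking backend/.env (not part of the repo).

# env_value FILE NAME: print the value assigned to NAME in FILE.
# If NAME is assigned more than once, the last assignment wins.
env_value() {
  sed -n "s/^$2=//p" "$1" | tail -n 1
}

# missing_vars FILE NAME...: print each NAME that is absent or empty in FILE.
missing_vars() {
  file=$1
  shift
  for name in "$@"; do
    [ -n "$(env_value "$file" "$name")" ] || echo "$name"
  done
}

# Example usage, run from backend/:
# missing_vars .env POSTGRES_PASSWORD LLM_PROVIDER KAFKA_BOOTSTRAP_SERVERS
# [ "$(env_value .env DATAFLOW_GATEWAY_DEV_PERMIT_READS)" = "false" ] \
#   || echo "warning: dev-permit-reads is not explicitly disabled"
```

Because the Compose file defaults DATAFLOW_GATEWAY_DEV_PERMIT_READS to true, the second check treats a missing entry the same as an enabled one.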
Start the stack
From backend/, bring everything up:
docker compose up -d
Startup order is enforced by container healthchecks and depends_on: PostgreSQL comes up first, then Keycloak, Kafka, and Redis, then the six application services, and finally the frontend. Two persistent volumes — pgdata and miniodata — survive restarts.
Flyway migrations run automatically on each JVM service's startup. Watch the logs until the application services report healthy:
docker compose logs -f api-gateway metadata-service pipeline-engine
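Instead of watching the logs by hand, a script can poll until the gateway answers. The following is a minimal sketch of a generic retry loop; `retry` is a hypothetical helper, and the attempt count and delay in the usage example are arbitrary values to adjust for your machine.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Run COMMAND until it succeeds, at most ATTEMPTS times, sleeping
# DELAY_SECONDS between tries. Returns 0 on the first success and 1 if
# every attempt failed. COMMAND's own output is discarded.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    n=$((n + 1))
    # Only sleep when another attempt is still coming.
    [ "$n" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Example usage: wait up to roughly a minute for the gateway.
# curl -f makes HTTP error statuses count as failures.
# retry 30 2 curl -fsS http://localhost:8085/actuator/health \
#   && echo "gateway is up" || echo "gateway did not come up"
```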
Stopping and resetting
docker compose down # stop containers, keep volumes
docker compose down -v # stop and DELETE pgdata + miniodata volumes
Service and infrastructure ports
Inside the Docker network every JVM service listens on 8080 and the AI services on 8000. The host-side published ports below are for direct developer access only — normal traffic always flows through the gateway.
Application services
| Service | Tech | Host port |
|---|---|---|
| api-gateway | Kotlin / Spring Cloud Gateway | 8085 |
| metadata-service | Kotlin / Spring MVC + JPA | 8181 |
| pipeline-engine | Kotlin / Spring MVC + JPA | 8082 |
| lineage-service | Kotlin / Spring MVC + JPA | 8083 |
| monitor-service | Kotlin / Spring MVC + JPA | 8084 |
| copilot | Python 3.12 / FastAPI | 8090 |
| migration-engine | Python 3.12 / FastAPI | 8091 |
| frontend | React 19 / Vite, served by nginx | 3006 |
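When scripting against individual services, the host ports from the table can be captured in a small lookup. This is an illustrative sketch only: `host_port` is a hypothetical helper, and the commented loop assumes the other JVM services expose `/actuator/health` the same way the gateway does.

```shell
#!/bin/sh
# host_port SERVICE: print the host-side port published for SERVICE,
# using the values from the application services table.
# Returns 1 for an unknown service name.
host_port() {
  case "$1" in
    api-gateway)      echo 8085 ;;
    metadata-service) echo 8181 ;;
    pipeline-engine)  echo 8082 ;;
    lineage-service)  echo 8083 ;;
    monitor-service)  echo 8084 ;;
    copilot)          echo 8090 ;;
    migration-engine) echo 8091 ;;
    frontend)         echo 3006 ;;
    *) echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Example usage: probe each JVM service directly, bypassing the gateway.
# for svc in api-gateway metadata-service pipeline-engine lineage-service monitor-service; do
#   if curl -fsS "http://localhost:$(host_port "$svc")/actuator/health" >/dev/null; then
#     echo "$svc up"
#   else
#     echo "$svc DOWN"
#   fi
# done
```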
Infrastructure components
| Component | Image | Host port(s) |
|---|---|---|
| PostgreSQL (pgvector) | pgvector/pgvector:pg15 | 5432 |
| Keycloak | quay.io/keycloak/keycloak:24.0 | 8180 |
| Kafka | confluentinc/cp-kafka:7.7.1 | 9092 / 29092 |
| Zookeeper | confluentinc/cp-zookeeper:7.7.1 | 2181 |
| Redis | redis:7-alpine | 6379 |
| MinIO | minio/minio | 9000 (API) / 9001 (console) |
| Prometheus | prom/prometheus:v2.54.1 | 9090 |
| Grafana | grafana/grafana:11.2.2 | 3001 |
Accessing the running platform
Once the stack is healthy, open these URLs in a browser:
| URL | What it serves |
|---|---|
| http://localhost:3006 | The DataFlow AI web application |
| http://localhost:8085 | The API gateway, serving the /api/v1 surface |
| http://localhost:8180 | Keycloak admin console (realm dataflow) |
| http://localhost:9001 | MinIO console |
| http://localhost:3001 | Grafana dashboards |
| http://localhost:9090 | Prometheus |
Logging in goes through Keycloak: the SPA performs an OIDC redirect to Keycloak, the dataflow realm authenticates the user, and keycloak-js exchanges the authorization code for a JWT. The gateway validates that JWT and injects X-User-* identity headers on every downstream call.
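When debugging a login, it can help to look at the claims inside the JWT that the SPA sends to the gateway (for example, a token copied from the browser's network tab). The sketch below decodes the payload segment only and does not verify the signature; `jwt_payload` is a hypothetical helper, and it assumes a `base64` command that accepts `-d`, as GNU coreutils does.

```shell
#!/bin/sh
# jwt_payload TOKEN: print the decoded JSON payload (second segment) of a JWT.
# This only decodes the token; it does NOT verify the signature.
jwt_payload() {
  # JWT segments are base64url without padding: map the URL-safe alphabet
  # back to standard base64, then restore the '=' padding before decoding.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Example usage (TOKEN is whatever access token you captured):
# jwt_payload "$TOKEN"
```

The output is raw JSON, so it can be piped into any JSON viewer you already use.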
A quick gateway health check:
curl -s http://localhost:8085/actuator/health
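Spring Boot's actuator health endpoint normally answers with a JSON object whose top-level `status` is `UP` when the service is healthy. As a rough sketch, a script can extract that field without a JSON parser; `health_status` is a hypothetical helper, and it assumes `status` is the first key in the response, which is the usual serialization order.

```shell
#!/bin/sh
# health_status JSON: print the top-level "status" value of an actuator
# health payload. Assumes "status" is the first key in the object, so a
# nested component status is never picked up by mistake.
health_status() {
  printf '%s\n' "$1" \
    | sed -n 's/^[^"]*"status"[[:space:]]*:[[:space:]]*"\([A-Z_]*\)".*/\1/p' \
    | head -n 1
}

# Example usage:
# body=$(curl -s http://localhost:8085/actuator/health)
# [ "$(health_status "$body")" = "UP" ] && echo "gateway healthy"
```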
Building components from source
You can build and run any component outside its container for faster iteration. The Compose stack can continue to provide the infrastructure (PostgreSQL, Keycloak, Kafka, Redis) while you run a single service from source.
JVM platform services (Gradle)
The Kotlin services share a single multi-module Gradle build rooted at backend/platform. Use the included wrapper:
cd backend/platform
./gradlew build # compile + Detekt + tests for all modules
./gradlew :metadata-service:bootRun # run one service from source
./gradlew :api-gateway:test # run a single module's tests
The shared platform/Dockerfile is multi-stage — a gradle:8.5-jdk21 builder produces the jar, which runs on an eclipse-temurin:21-jre-alpine runtime as a non-root user. A BUILD_MODULE build-arg selects which Gradle module to package.
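To illustrate how the BUILD_MODULE build-arg might be used when building an image by hand, the sketch below prints a `docker build` command for one of the five deployable modules. Several details are assumptions rather than facts from the repository: the command is meant to be run from backend/, the build context is taken to be platform/, and the `dataflow/<module>:dev` tag is a made-up naming scheme. `image_build_cmd` is a hypothetical helper.

```shell
#!/bin/sh
# image_build_cmd MODULE: print a docker build command that packages MODULE
# via the shared Dockerfile. Only the five deployable services are accepted;
# library and test modules are rejected with exit status 1.
# Assumptions: run from backend/, build context platform/, hypothetical tag.
image_build_cmd() {
  case "$1" in
    api-gateway|metadata-service|pipeline-engine|lineage-service|monitor-service)
      printf 'docker build --build-arg BUILD_MODULE=%s -t dataflow/%s:dev -f platform/Dockerfile platform\n' "$1" "$1"
      ;;
    *)
      echo "not a deployable module: $1" >&2
      return 1
      ;;
  esac
}

# Example usage: print the command, review it, then run it.
# image_build_cmd metadata-service
# eval "$(image_build_cmd metadata-service)"
```

Printing the command rather than running it directly keeps the helper safe to source and makes it easy to adjust the context path if your checkout differs.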
AI services (Poetry)
Each Python service is an independent Poetry project. From backend/ai-services/copilot (or migration-engine):
poetry install # install dependencies
poetry run uvicorn app.main:app --reload # run the FastAPI service
poetry run pytest # run the test suite
poetry run ruff check . # lint
The Python services use a multi-stage python:3.12-slim image and run under uvicorn.
Frontend (Vite)
The React SPA builds with Vite. From frontend/:
npm ci # install exactly what's in package-lock.json
npm run dev # Vite dev server with hot reload
npm run build # production build — runs `tsc -b && vite build`
npm test # Vitest unit tests
The dev server proxies API calls; the SPA's axios client uses a base URL of /api/v1, so it expects the gateway to be reachable.
Go CLI (Go modules)
The dataflow CLI builds with the standard Go toolchain. From backend/cli:
go build ./... # build the CLI
go vet ./... # static checks
go test -race ./... # tests with the race detector
Troubleshooting
| Symptom | Likely cause and fix |
|---|---|
| Services fail Flyway validation on startup | The shared DB has diverged migration histories. Compose deliberately sets permissive Flyway flags (OUT_OF_ORDER, REPAIR_ON_MIGRATE, VALIDATE_ON_MIGRATE: false) to converge them — run docker compose down -v and start fresh if it persists. |
| Copilot reports rag_mode: "unavailable" | pgvector is unreachable or the embeddings schema failed to initialize. Confirm the postgres container is healthy. |
| Login redirect loops | The dataflow Keycloak realm was not imported. Confirm realm-export.json is mounted and Keycloak finished its import on startup. |
| Frontend shows a blank page or stuck spinner | The gateway is not reachable on 8085, or CORS_ALLOWED_ORIGINS does not include the frontend origin. |
| lineage-service starts before its tables exist | lineage-service has Flyway disabled and reuses metadata-service's lineage tables (created by metadata migrations V4/V48). Ensure metadata-service started and migrated first. |
Production deployment
This page covers local and development setup. For the production topology — the single-VPS Docker Compose deployment and the documented GKE/Kustomize path — see the Deployment guide.