Installation & local setup

This guide walks through running the full DataFlow AI platform on a development machine — installing the prerequisites, understanding the monorepo layout, bringing the whole stack up with Docker Compose, and building each component individually from source.


Prerequisites

DataFlow AI is a polyglot monorepo, so building everything from source requires four language toolchains. To simply run the platform, Docker and Docker Compose are the only hard requirements, because the Compose build handles every toolchain inside containers.

| Tool | Version | Needed for |
| --- | --- | --- |
| Docker + Docker Compose | recent | Running the full stack (required) |
| JDK (Temurin) | 21 | Building the Kotlin / Spring Boot platform services |
| Node.js | 20 | Building the React frontend |
| Python | 3.12 | Building the FastAPI AI services |
| Go | 1.22 | Building the dataflow CLI |
| Gradle | 8.5 (wrapper included) | JVM multi-module build |
| Poetry | 1.8.4 | Python dependency management |

You only need Docker to run it

If your goal is just to run DataFlow AI locally, install Docker and skip ahead to Running the stack with Docker Compose. The JDK, Node, Python, and Go toolchains are only needed when you want to build or iterate on a component outside its container.

Verifying your toolchains

Before building from source, confirm each toolchain resolves the expected major version:

docker --version
docker compose version
java -version        # should report 21
node --version       # should report v20.x
python --version     # should report 3.12.x
go version           # should report go1.22
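
The manual check above can also be scripted. The following is a minimal Python sketch, not part of the repository: it runs the four language-toolchain commands and compares the first version number in each output with the versions from the prerequisites table. The helper names (`parse_version`, `matches`, `check_all`) are illustrative assumptions.

```python
import re
import subprocess

# Commands and expected version prefixes, taken from the prerequisites table.
EXPECTED = {
    "java": (["java", "-version"], (21,)),
    "node": (["node", "--version"], (20,)),
    "python": (["python", "--version"], (3, 12)),
    "go": (["go", "version"], (1, 22)),
}


def parse_version(output: str) -> tuple[int, ...]:
    """Return the first dotted version number in `output`, or () if none is found."""
    match = re.search(r"\d+(?:\.\d+)*", output)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def matches(output: str, expected: tuple[int, ...]) -> bool:
    """True when the parsed version starts with the expected components."""
    return parse_version(output)[: len(expected)] == expected


def check_all() -> dict[str, bool]:
    """Run each toolchain command and report whether its version matches."""
    results = {}
    for name, (command, expected) in EXPECTED.items():
        try:
            proc = subprocess.run(command, capture_output=True, text=True, check=False)
        except OSError:
            results[name] = False  # tool is not on PATH
            continue
        # `java -version` prints to stderr, so inspect both streams.
        results[name] = matches(proc.stdout + proc.stderr, expected)
    return results
```

Calling `check_all()` returns a mapping such as `{"java": True, "node": False, ...}`; a `False` entry means the tool is missing or reports a different version than the table expects.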

Repository layout

The monorepo is organized by language and runtime. The top-level directories you will work with are:

polcomtel/
├── backend/
│   ├── platform/           Kotlin / Spring Boot 3.3.5 multi-module Gradle build
│   │   ├── common/             shared security, models, exception handling (library)
│   │   ├── connector-sdk/      connector framework + 21 connector impls (library)
│   │   ├── pushdown-sql/       JSqlParser-based SQL dialect transpiler
│   │   ├── api-gateway/        reactive Spring Cloud Gateway — single ingress
│   │   ├── metadata-service/   catalog, connections, governance, GDPR, MCP server
│   │   ├── pipeline-engine/    pipeline DAG compilation, execution, scheduling
│   │   ├── lineage-service/    dataset/column lineage, OpenLineage ingestion
│   │   ├── monitor-service/    alerts, metrics, cost, SLA, notifications, SSE
│   │   └── integration-tests/  cross-service integration test module
│   ├── ai-services/        Python 3.12 / FastAPI services
│   │   ├── copilot/            NL-to-pipeline, NL-to-SQL, RAG, chat, RCA
│   │   └── migration-engine/   Informatica / Alteryx → DataFlow YAML conversion
│   ├── cli/                Go 1.22 Cobra command-line tool (`dataflow`)
│   ├── infrastructure/     Kubernetes manifests + Helm charts + Terraform
│   └── docker-compose.yml  orchestrates all 12 services for local / VPS
├── frontend/               React 19 + Vite 7 + TypeScript 5.9 SPA
├── browser-extension/      companion browser extension
└── deploy/                 VPS deployment scripts

The Gradle build root is backend/platform. Of its nine modules, five are deployable services (api-gateway, metadata-service, pipeline-engine, lineage-service, monitor-service) and the rest are libraries or test modules built into the services.


Running the stack with Docker Compose

backend/docker-compose.yml orchestrates the whole stack (the infrastructure components, the application services, and the frontend, all listed under Service and infrastructure ports) on a single Docker bridge network named dataflow-network. Every container resolves its peers by container name.

Configure environment variables

Secrets are read from a gitignored .env (and .env.local) file in backend/. Create one before starting the stack:

cd backend
cp .env.example .env

At minimum, set the values used by the platform:

| Variable | Purpose |
| --- | --- |
| POSTGRES_PASSWORD | Password for the shared PostgreSQL instance |
| ANTHROPIC_API_KEY | Claude API key for the copilot (in .env.local) |
| LLM_PROVIDER | LLM backend: anthropic, openrouter, or local |
| KAFKA_BOOTSTRAP_SERVERS | Internal Kafka listener (kafka:29092) |
| CORS_ALLOWED_ORIGINS | Origins permitted by the gateway CORS filter |
| DATAFLOW_GATEWAY_DEV_PERMIT_READS | Dev-only flag permitting unauthenticated reads (defaults to true in Compose) |

Disable dev-permit-reads outside development

DATAFLOW_GATEWAY_DEV_PERMIT_READS defaults to true in the Compose file, which permits unauthenticated GET requests and copilot/search POSTs. This is convenient for local development but must be set to false in any environment that is exposed beyond your machine.

The copilot loads LLM provider keys from .env.local with required: false, so the stack still starts (with the copilot's RAG features degraded) when keys are absent.
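
Before starting the stack it can help to sanity-check the .env file. The sketch below is a hypothetical helper, not something shipped in the repository. It assumes the four variables in `REQUIRED_KEYS` are the ones you must set (the real `.env.example` is authoritative), and it parses only plain `KEY=VALUE` lines, which is simpler than the rules Docker Compose itself applies.

```python
# Assumed minimum set, based on the variable table; adjust to match .env.example.
REQUIRED_KEYS = [
    "POSTGRES_PASSWORD",
    "LLM_PROVIDER",
    "KAFKA_BOOTSTRAP_SERVERS",
    "CORS_ALLOWED_ORIGINS",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def review_env(text: str) -> list[str]:
    """Return human-readable problems found in the contents of a .env file."""
    env = parse_env(text)
    problems = [f"{key} is missing or empty" for key in REQUIRED_KEYS if not env.get(key)]
    # The Compose file defaults this flag to true, so an absent value counts as true.
    if env.get("DATAFLOW_GATEWAY_DEV_PERMIT_READS", "true").lower() != "false":
        problems.append("DATAFLOW_GATEWAY_DEV_PERMIT_READS is not false (acceptable only locally)")
    return problems
```

For example, `review_env(open(".env").read())` run from backend/ returns an empty list when every assumed key is set and dev-permit-reads is disabled.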

Start the stack

From backend/, bring everything up:

docker compose up -d

Startup order is enforced by container healthchecks and depends_on: PostgreSQL comes up first, then Keycloak, Kafka, and Redis, then the application services, and finally the frontend. Two persistent volumes, pgdata and miniodata, survive restarts.
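
To make the ordering concrete, the sketch below models the described startup sequence as a dependency graph and groups containers into "waves" that can start together. The `DEPENDS_ON` map is a simplified, hypothetical reading of the paragraph above (only three of the application services are shown); the actual depends_on entries live in backend/docker-compose.yml and may differ.

```python
from graphlib import TopologicalSorter

# Hypothetical map: container -> containers it waits for.
APP_DEPS = {"postgres", "keycloak", "kafka", "redis"}
DEPENDS_ON = {
    "postgres": set(),
    "keycloak": {"postgres"},
    "kafka": {"postgres"},
    "redis": {"postgres"},
    "api-gateway": APP_DEPS,
    "metadata-service": APP_DEPS,
    "pipeline-engine": APP_DEPS,
    "frontend": {"api-gateway"},
}


def startup_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group containers into waves; each wave needs only the earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(startup_waves(DEPENDS_ON), start=1):
    print(number, ", ".join(wave))
```

With this map the waves are PostgreSQL, then Kafka, Keycloak, and Redis, then the three application services, then the frontend. Compose additionally waits for each dependency's healthcheck to pass, which this ordering-only model does not capture.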

Flyway migrations run automatically on each JVM service's startup. Watch the logs until the application services report healthy:

docker compose logs -f api-gateway metadata-service pipeline-engine

Stopping and resetting

docker compose down            # stop containers, keep volumes
docker compose down -v         # stop and DELETE pgdata + miniodata volumes

Service and infrastructure ports

Inside the Docker network every JVM service listens on 8080 and the AI services on 8000. The host-side published ports below are for direct developer access only — normal traffic always flows through the gateway.

Application services

| Service | Tech | Host port |
| --- | --- | --- |
| api-gateway | Kotlin / Spring Cloud Gateway | 8085 |
| metadata-service | Kotlin / Spring MVC + JPA | 8181 |
| pipeline-engine | Kotlin / Spring MVC + JPA | 8082 |
| lineage-service | Kotlin / Spring MVC + JPA | 8083 |
| monitor-service | Kotlin / Spring MVC + JPA | 8084 |
| copilot | Python 3.12 / FastAPI | 8090 |
| migration-engine | Python 3.12 / FastAPI | 8091 |
| frontend | React 19 / Vite, served by nginx | 3006 |

Infrastructure components

| Component | Image | Host port(s) |
| --- | --- | --- |
| PostgreSQL (pgvector) | pgvector/pgvector:pg15 | 5432 |
| Keycloak | quay.io/keycloak/keycloak:24.0 | 8180 |
| Kafka | confluentinc/cp-kafka:7.7.1 | 9092 / 29092 |
| Zookeeper | confluentinc/cp-zookeeper:7.7.1 | 2181 |
| Redis | redis:7-alpine | 6379 |
| MinIO | minio/minio | 9000 (API) / 9001 (console) |
| Prometheus | prom/prometheus:v2.54.1 | 9090 |
| Grafana | grafana/grafana:11.2.2 | 3001 |
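
Because the stack publishes this many host ports, a port already held by another local process is a common reason a container fails to start. The following sketch is an illustrative helper (not part of the repository) that probes the host ports from the two tables above; run it before `docker compose up -d` to spot conflicts.

```python
import socket

# Host ports copied from the service and infrastructure tables.
HOST_PORTS = {
    8085: "api-gateway", 8181: "metadata-service", 8082: "pipeline-engine",
    8083: "lineage-service", 8084: "monitor-service", 8090: "copilot",
    8091: "migration-engine", 3006: "frontend", 5432: "postgres",
    8180: "keycloak", 9092: "kafka", 29092: "kafka", 2181: "zookeeper",
    6379: "redis", 9000: "minio", 9001: "minio console",
    9090: "prometheus", 3001: "grafana",
}


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """True when something already accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports: dict[int, str]) -> list[tuple[int, str]]:
    """Return (port, component) pairs that are already taken, sorted by port."""
    return sorted((port, name) for port, name in ports.items() if port_in_use(port))
```

`busy_ports(HOST_PORTS)` should return an empty list on a clean machine. Once the stack is running, the same call lists every published port, which doubles as a quick check that the mappings are live.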

Accessing the running platform

Once the stack is healthy, open these URLs in a browser:

| URL | What it serves |
| --- | --- |
| http://localhost:3006 | The DataFlow AI web application |
| http://localhost:8085 | The API gateway (the /api/v1 surface) |
| http://localhost:8180 | Keycloak admin console (realm dataflow) |
| http://localhost:9001 | MinIO console |
| http://localhost:3001 | Grafana dashboards |
| http://localhost:9090 | Prometheus |

Logging in goes through Keycloak: the SPA performs an OIDC redirect to Keycloak, the dataflow realm authenticates the user, and keycloak-js exchanges the authorization code for a JWT. The gateway validates that JWT and injects X-User-* identity headers on every downstream call.
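
The header-injection step can be illustrated with a short sketch. It decodes a JWT payload and maps two claims to identity headers. Two caveats apply: the header names `X-User-Id` and `X-User-Name` are hypothetical (this page only states that the headers follow the X-User-* pattern), and the sketch deliberately skips signature verification, which the real gateway performs before trusting any claim.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("a JWT has three dot-separated segments")
    payload = parts[1]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def identity_headers(claims: dict) -> dict[str, str]:
    """Map token claims to downstream headers (header names are assumed)."""
    return {
        "X-User-Id": str(claims.get("sub", "")),
        "X-User-Name": str(claims.get("preferred_username", "")),
    }
```

`sub` and `preferred_username` are standard OIDC claims that Keycloak includes in its tokens. Pasting a token from the browser's developer tools into `decode_jwt_payload` is a quick way to inspect which claims your realm actually issues.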

A quick gateway health check:

curl -s http://localhost:8085/actuator/health
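
When scripting against a freshly started stack, a single curl call may run before the gateway is ready. The sketch below polls the same endpoint until it reports `UP`, relying on the standard Spring Boot Actuator response shape (`{"status": "UP"}`). The function names, retry count, and delay are illustrative assumptions.

```python
import json
import time
import urllib.request

HEALTH_URL = "http://localhost:8085/actuator/health"


def fetch_status(url: str = HEALTH_URL) -> str:
    """Return the `status` field of the health response, or UNREACHABLE."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return json.load(response).get("status", "UNKNOWN")
    except (OSError, ValueError):
        # Covers refused connections, HTTP errors (e.g. 503), and invalid JSON.
        return "UNREACHABLE"


def wait_until_up(fetch=fetch_status, attempts: int = 30, delay: float = 2.0) -> bool:
    """Poll `fetch` until it returns UP; give up after `attempts` tries."""
    for attempt in range(attempts):
        if fetch() == "UP":
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

`wait_until_up()` returns `True` as soon as the gateway is healthy and `False` after roughly a minute with the default settings. Passing a different `fetch` callable lets the same loop wait on another service's health endpoint.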

Building components from source

You can build and run any component outside its container for faster iteration. The Compose stack can continue to provide the infrastructure (PostgreSQL, Keycloak, Kafka, Redis) while you run a single service from source.

JVM platform services (Gradle)

The Kotlin services share a single multi-module Gradle build rooted at backend/platform. Use the included wrapper:

cd backend/platform
./gradlew build                          # compile + Detekt + tests for all modules
./gradlew :metadata-service:bootRun       # run one service from source
./gradlew :api-gateway:test               # run a single module's tests

The shared platform/Dockerfile is multi-stage — a gradle:8.5-jdk21 builder produces the jar, which runs on an eclipse-temurin:21-jre-alpine runtime as a non-root user. A BUILD_MODULE build-arg selects which Gradle module to package.

AI services (Poetry)

Each Python service is an independent Poetry project. From backend/ai-services/copilot (or migration-engine):

poetry install                            # install dependencies
poetry run uvicorn app.main:app --reload  # run the FastAPI service
poetry run pytest                         # run the test suite
poetry run ruff check .                   # lint

The Python services use a multi-stage python:3.12-slim image and run under uvicorn.

Frontend (Vite)

The React SPA builds with Vite. From frontend/:

npm ci          # install exactly what's in package-lock.json
npm run dev     # Vite dev server with hot reload
npm run build   # production build — runs `tsc -b && vite build`
npm test        # Vitest unit tests

The dev server proxies API calls; the SPA's axios client uses a base URL of /api/v1, so it expects the gateway to be reachable.

Go CLI (Go modules)

The dataflow CLI builds with the standard Go toolchain. From backend/cli:

go build ./...                # build the CLI
go vet ./...                  # static checks
go test -race ./...           # tests with the race detector

Troubleshooting

| Symptom | Likely cause and fix |
| --- | --- |
| Services fail Flyway validation on startup | The shared DB has diverged migration histories. Compose deliberately sets permissive Flyway flags (OUT_OF_ORDER, REPAIR_ON_MIGRATE, VALIDATE_ON_MIGRATE: false) to converge them. If the problem persists, run docker compose down -v and start fresh. |
| Copilot reports rag_mode: "unavailable" | pgvector is unreachable or the embeddings schema failed to initialize. Confirm the postgres container is healthy. |
| Login redirect loops | The dataflow Keycloak realm was not imported. Confirm realm-export.json is mounted and Keycloak finished its import on startup. |
| Frontend shows a blank page or stuck spinner | The gateway is not reachable on 8085, or CORS_ALLOWED_ORIGINS does not include the frontend origin. |
| lineage-service starts before its tables exist | lineage-service has Flyway disabled and reuses metadata-service's lineage tables (created by metadata migrations V4/V48). Ensure metadata-service started and migrated first. |

Production deployment

This page covers local and development setup. For the production topology — the single-VPS Docker Compose deployment and the documented GKE/Kustomize path — see the Deployment guide.
