Technology stack
DataFlow AI is a polyglot ETL/data-integration platform — Kotlin/Spring microservices, Python/FastAPI AI services, a Go CLI, and a React/Vite single-page application. This page is the authoritative inventory of every technology in the monorepo, verified against the project's build manifests.
Source of truth
Every version on this page is verified against build.gradle.kts, libs.versions.toml, pyproject.toml, go.mod, package.json, docker-compose.yml, and the .github/workflows/ definitions. If a version on this page ever disagrees with these manifests, the manifests are canonical.
Languages & runtimes
The platform spans five languages, each chosen for a specific subsystem, plus Node.js as the frontend build runtime.
| Language | Version | Used for | Source |
|---|---|---|---|
| Kotlin | 2.0.21 | Platform microservices (Spring Boot) | backend/platform/gradle/libs.versions.toml |
| Java (JVM) | 21 (Temurin) | JVM runtime / build target for Kotlin services | Dockerfile, CI setup-java |
| Python | 3.12 (^3.12) | AI services (Copilot, Migration Engine) | pyproject.toml |
| Go | 1.22 | The dataflow CLI | backend/cli/go.mod |
| TypeScript | ~5.9.3 | Frontend SPA | frontend/package.json |
| Node.js | 20 | Frontend build / CI runtime | CI setup-node |
Backend frameworks (Kotlin / JVM)
The five deployable JVM services and two library modules are built on Spring Boot. The api-gateway is reactive (WebFlux / Spring Cloud Gateway); the other services use Spring MVC.
| Component | Version | Purpose |
|---|---|---|
| Spring Boot | 3.3.5 | Microservice framework — web, JPA, security, actuator, validation, websocket |
| Spring Cloud | 2023.0.3 | Cloud platform BOM |
| Spring Cloud Gateway | via BOM | Reactive API gateway routing |
| Spring Security OAuth2 Resource Server + JOSE | via BOM | JWT validation |
| Spring Dependency Management plugin | 1.1.6 | BOM-based dependency resolution |
| Kotlin Coroutines | 1.7.3 | Async support in the common module |
| Jackson | 2.17.2 (core 2.16.1 in common) | JSON / YAML / CSV serialization (jackson-module-kotlin, jsr310) |
AI / Python frameworks
The copilot and migration-engine services are Python 3.12 FastAPI applications managed with Poetry.
| Component | Version | Purpose |
|---|---|---|
| FastAPI | ^0.115.0 | HTTP API for Copilot & Migration Engine |
| Uvicorn (standard) | ^0.32.0 | ASGI server |
| Anthropic SDK | ^0.39.0 | Claude LLM client (default model claude-sonnet-4-6-20251001) |
| Pydantic Settings | ^2.6.0 | Typed configuration |
| httpx | ^0.27.0 | Async HTTP client |
| sentence-transformers | ^3.3.0 (optional ml extra) | Embeddings for RAG |
| numpy | ^1.26.0 | Vector math (Copilot) |
| lxml | ^5.3.0 | Informatica / Alteryx XML workflow parsing (Migration Engine) |
| python-multipart | ^0.0.12 | File uploads (Migration Engine) |
| pyyaml | ^6.0 | Pipeline YAML handling |
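The `^` prefix in the versions above is Poetry's caret requirement: updates are allowed as long as they do not change the leftmost non-zero component. The sketch below illustrates that rule for plain numeric versions only. It is not Poetry's own implementation, it ignores pre-release tags, and the helper names are hypothetical.

```python
def caret_bounds(constraint: str) -> tuple[tuple[int, ...], tuple[int, ...]]:
    """Return (inclusive lower bound, exclusive upper bound) for a caret spec."""
    parts = [int(p) for p in constraint.lstrip("^").split(".")]
    # The pivot is the leftmost non-zero component; if every given
    # component is zero, the last given component is the pivot.
    pivot = next((i for i, p in enumerate(parts) if p != 0), len(parts) - 1)
    padded = parts + [0] * (3 - len(parts))
    upper = padded[:pivot] + [padded[pivot] + 1] + [0] * (len(padded) - pivot - 1)
    return tuple(padded), tuple(upper)


def satisfies(version: str, constraint: str) -> bool:
    """Check whether a plain numeric version falls inside the caret range."""
    lower, upper = caret_bounds(constraint)
    numbers = [int(p) for p in version.split(".")]
    candidate = tuple(numbers + [0] * (3 - len(numbers)))
    return lower <= candidate < upper


if __name__ == "__main__":
    for spec in ("^0.115.0", "^3.12", "^0.0.12", "^6.0"):
        print(spec, caret_bounds(spec))
```

Under this rule, `^0.115.0` admits only 0.115.x releases, while `^3.12` admits any 3.x release from 3.12 upward.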
Pluggable LLM providers
The copilot abstracts the LLM behind an LLMProvider interface selected at runtime by the LLM_PROVIDER environment variable:
| Provider | Detail |
|---|---|
| anthropic | Claude (default model claude-sonnet-4-6-20251001) |
| openrouter | OpenRouter (openai/gpt-oss-120b:free) |
| local | Local LLM via Ollama (llama3.2) |
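The runtime selection described above can be sketched as a small registry keyed by the value of LLM_PROVIDER. This is an illustration under stated assumptions, not the copilot's actual code: the class names, the `describe` method, and the fallback to `anthropic` when the variable is unset are all hypothetical. Only the provider keys and model names come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Protocol


class LLMProvider(Protocol):
    """Hypothetical shape of the provider interface."""

    model: str

    def describe(self) -> str: ...


@dataclass
class AnthropicProvider:
    model: str = "claude-sonnet-4-6-20251001"

    def describe(self) -> str:
        return f"anthropic:{self.model}"


@dataclass
class OpenRouterProvider:
    model: str = "openai/gpt-oss-120b:free"

    def describe(self) -> str:
        return f"openrouter:{self.model}"


@dataclass
class LocalProvider:
    model: str = "llama3.2"

    def describe(self) -> str:
        return f"local:{self.model}"


# Registry keyed by the accepted LLM_PROVIDER values.
PROVIDERS = {
    "anthropic": AnthropicProvider,
    "openrouter": OpenRouterProvider,
    "local": LocalProvider,
}


def select_provider(
    env: Mapping[str, str] = os.environ,
    default: str = "anthropic",  # assumption: the real default is not documented here
) -> LLMProvider:
    """Instantiate the provider named by LLM_PROVIDER, failing loudly on typos."""
    name = env.get("LLM_PROVIDER", default).strip().lower()
    if name not in PROVIDERS:
        known = ", ".join(sorted(PROVIDERS))
        raise ValueError(f"Unknown LLM_PROVIDER {name!r}; expected one of: {known}")
    return PROVIDERS[name]()


if __name__ == "__main__":
    print(select_provider({"LLM_PROVIDER": "local"}).describe())
```

Rejecting unknown values at startup, rather than silently falling back, keeps a misconfigured environment from routing traffic to an unintended provider.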
Frontend stack
The SPA is a React 19 application bundled by Vite, with TanStack libraries for data and grids, Zustand for client state, and Tailwind CSS 4 for styling.
| Component | Version | Purpose |
|---|---|---|
| React | 19.2.0 | UI library |
| React DOM | 19.2.0 | DOM renderer |
| React Router DOM | 7.13.1 | Routing |
| Vite | 7.3.1 | Build tool / dev server |
| TanStack React Query | 5.90.21 | Server-state / data fetching |
| TanStack React Table | 8.21.3 | Data grids |
| TanStack React Virtual | 3.13.21 | List virtualization |
| Zustand | 5.0.11 | Client state (persisted persona store) |
| React Hook Form | 7.71.2 | Forms |
| @hookform/resolvers | 5.2.2 | Form schema-resolver bridge |
| Zod | 4.3.6 | Schema validation |
| @xyflow/react | 12.10.1 | Pipeline DAG canvas |
| @monaco-editor/react | 4.7.0 | Code / SQL editor |
| Recharts | 3.7.0 | Charts |
| Tailwind CSS | 4.2.1 (@tailwindcss/vite) | Styling |
| i18next | 25.8.14 | Internationalization |
| react-i18next | 16.5.5 | React i18n bindings |
| keycloak-js | 26.2.3 | OIDC login in the browser |
| axios | 1.13.6 | HTTP client |
| lucide-react / cmdk / clsx / date-fns | — | Icons, command palette, utilities |
Go CLI libraries
The dataflow CLI is built with Cobra and Viper.
| Library | Version | Purpose |
|---|---|---|
| spf13/cobra | 1.8.1 | CLI command framework |
| spf13/viper | 1.19.0 | Configuration |
| zalando/go-keyring | 0.2.6 | Credential storage |
| olekukonko/tablewriter | 0.0.5 | Terminal table output |
| gopkg.in/yaml.v3 | 3.0.1 | YAML parsing |
Connectors, data formats & streaming libraries
The connector SDK and pipeline engine depend on a broad set of data-format, migration, and streaming libraries.
| Library | Version | Purpose |
|---|---|---|
| Flyway (core + postgresql) | 10.17.3 | DB migrations |
| HikariCP | 6.2.1 | JDBC connection pooling |
| jSqlParser | 5.0 | SQL pushdown parsing |
| OpenLineage | 1.24.2 | Lineage events |
| Apache Flink clients | 1.18.1 | Job submission to the K8s Flink runtime |
| Apache Parquet (avro / hadoop) | 1.13.1 | Columnar file format |
| Apache Hadoop (common / client) | 3.3.6 | Parquet / HDFS dependency |
| Apache Avro | 1.11.3 | Serialization |
| Apache POI (poi / ooxml) | 5.2.5 | Excel I/O |
| Debezium (api / embedded) | 2.5.4.Final | Change Data Capture |
| Apache PDFBox | 3.0.1 | Pipeline assessment report rendering |
| Eclipse JGit | 6.8.0 | Git repo management in the pipeline engine |
| ICU4J | 74.2 | Encoding detection |
| Woodstox | 6.5.1 | Streaming XML |
JDBC drivers (connector-sdk)
| Driver | Version |
|---|---|
| Teradata | 20.00.00.43 |
| Snowflake | 4.0.2 |
| SAP HANA (ngdbc) | 2.28.6 |
| Oracle (ojdbc11) | 23.7.0 |
| Microsoft SQL Server | 12.8.1 |
| PostgreSQL | 42.7.6 |
| MySQL | 9.2.0 |
| jTDS | 1.3.1 |
| Databricks | 3.3.1 |
NoSQL & streaming clients
| Client | Version |
|---|---|
| MongoDB driver | 5.6.0 |
| Kafka clients | 3.9.0 |
| Confluent kafka-avro-serializer | 7.8.0 |
| Azure Event Hubs | 5.19.2 |
| Google Cloud Pub/Sub | 1.133.1 |
Cloud SDKs
| SDK | Version | Components used |
|---|---|---|
| Google Cloud BOM | 26.78.0 | Storage, BigQuery 2.44.0, Billing 2.53.0 |
| AWS SDK v2 BOM | 2.31.1 | S3 + transfer manager |
| Azure SDK BOM | 1.2.33 | Blob Storage, Identity |
| Kubernetes Java client | 21.0.2 | K8s API access (Flink job submission) |
Databases & storage
All JVM and AI services point at the same PostgreSQL instance and dataflow_metadata database; logical isolation is by per-service Flyway histories and table prefixes.
| Component | Version | Role |
|---|---|---|
| PostgreSQL (pgvector) | pgvector/pgvector:pg15 | Primary RDBMS (dataflow_metadata), shared by metadata / lineage / Keycloak; pgvector extension backs RAG embeddings |
| Redis | 7-alpine | Rate limiting, caching, session store (reactive driver in the gateway) |
| MinIO | latest | S3 / GCS-compatible object storage |
| asyncpg (Python) | 0.30.0 | Copilot async DB access |
| pgvector (Python) | 0.3.5 | Copilot vector access |
Message broker & coordination
Kafka is the event-streaming backbone for connector and CDC workloads — not the inter-service command bus, which is synchronous HTTP.
| Component | Version | Role |
|---|---|---|
| Apache Kafka (Confluent) | cp-kafka:7.7.1 | Event streaming backbone |
| ZooKeeper (Confluent) | cp-zookeeper:7.7.1 | Kafka coordination |
Observability
| Component | Version | Role |
|---|---|---|
| Prometheus | v2.54.1 | Metrics collection |
| Grafana | 11.2.2 | Dashboards |
| Spring Boot Actuator | via BOM | Health / metrics endpoints (/actuator/health) |
| OpenTelemetry API | 1.42.1 | Tracing instrumentation |
| OpenLineage | 1.24.2 | Data lineage events |
| SLF4J | 2.0.11 | JVM logging facade |
| Logback | 1.4.14 | JVM logging implementation |
| kotlin-logging-jvm | 3.0.5 | Kotlin logging facade |
Authentication & security
Identity is anchored on Keycloak and validated twice — once at the gateway, again at each resource server (defense-in-depth).
| Component | Version | Role |
|---|---|---|
| Keycloak (server) | 24.0 | OIDC / SSO identity provider, realm dataflow |
| Keycloak admin-client | 24.0.4 | Programmatic realm administration |
| Spring Security OAuth2 Resource Server + JOSE | via Spring Boot 3.3.5 | JWT bearer validation on every service |
| keycloak-js | 26.2.3 | Browser-side OIDC (frontend) |
The gateway also applies a PII masking filter, CORS, and Redis-backed rate limiting. A development bypass exists via the DATAFLOW_GATEWAY_DEV_PERMIT_READS flag.
Build tools
| Tool | Version | Scope |
|---|---|---|
| Gradle (Kotlin DSL) | 8.5 | Kotlin multi-module build (dataflow-platform) |
| Poetry | 1.8.4 | Python dependency / build (poetry-core backend) |
| Go modules | go 1.22 | CLI build |
| Vite | 7.3.1 | Frontend bundling (tsc -b && vite build) |
| npm | Node 20 | Frontend package management (package-lock.json) |
The dataflow-platform Gradle build comprises nine modules: common, api-gateway, metadata-service, pipeline-engine, connector-sdk, pushdown-sql, lineage-service, monitor-service, and integration-tests.
Testing & quality tooling
Each language stack has its own test runner, linter, and static-analysis gate enforced in CI.
| Tool | Version | Scope |
|---|---|---|
| JUnit Jupiter | 5.10.1 | Kotlin / JVM unit tests |
| MockK | 1.13.12 (1.13.9 in connector-sdk) | Kotlin mocking |
| Testcontainers (junit / postgresql) | 1.20.2 | JVM integration tests |
| AssertJ | 3.25.1 | Fluent assertions |
| Reactor Test | via BOM | Reactive gateway tests |
| Detekt | Gradle plugin | Kotlin static analysis (CI gate) |
| pytest + pytest-asyncio | 8.3 / 0.24 | Python tests (asyncio_mode=auto) |
| Ruff | 0.7.0 | Python lint + format |
| mypy | CI | Python type check (non-blocking) |
| Vitest + @vitest/coverage-v8 | 3.2.x | Frontend unit tests |
| Testing Library (react / jest-dom / user-event) | 16.3 / 6.6 / 14.6 | Component testing |
| Playwright | 1.58.2 | Frontend E2E (port 4173) |
| MSW | 2.12.10 | API mocking |
| ESLint + typescript-eslint | 9.39.1 / 8.48.0 | Frontend lint |
| golangci-lint | v1.57 | Go lint |
| Redocly CLI / oasdiff | CI | OpenAPI spec lint + breaking-change detection |
| hadolint | v3.1.0 | Dockerfile lint |
| TFLint | v0.50.3 | Terraform lint |
Containers & orchestration
| Component | Detail |
|---|---|
| Docker / Docker Compose | backend/docker-compose.yml orchestrates 12 services — postgres, keycloak, zookeeper, kafka, prometheus, grafana, minio, redis, plus 6 app services |
| JVM service image | Multi-stage: gradle:8.5-jdk21 builder → eclipse-temurin:21-jre-alpine runtime; non-root user, container-aware JVM opts, actuator healthcheck |
| Python service image | Multi-stage python:3.12-slim with Poetry; runs uvicorn |
| Kubernetes | Java client 21.0.2; backend/infrastructure/k8s + Helm charts present |
| Terraform | v1.7.5 (validated in CI per-environment) |
| Flink runtime | Runs on Kubernetes (not embedded); jobs submitted via the Flink client |
CI / CD
Continuous integration runs on GitHub Actions with path-filtered monorepo jobs, so each language stack only rebuilds when its files change.
| Aspect | Detail |
|---|---|
| Platform | GitHub Actions — 8 workflows (ci, codeql-analysis, security-scan, zero-tolerance, deploy-dev/staging/production, release) |
| CI strategy | Path-filtered monorepo jobs via dorny/paths-filter@v3 — kotlin, python, go, frontend, terraform, docker, openapi |
| Kotlin job | JDK 21, Gradle setup-gradle@v4; compile → Detekt → unit tests → integrationTest → build; uploads jars + JUnit reports |
| Python job | Matrix (copilot, migration-engine); Poetry install → Ruff lint/format → mypy → pytest with coverage (thresholds: all 70%, new 80%) |
| Go job | Go 1.22; build / vet / race tests with coverage; golangci-lint |
| Frontend job | Node 20; npm ci → ESLint → tsc -b → build → Vitest coverage |
| Validation jobs | Terraform validate/fmt/TFLint; docker compose config + hadolint; OpenAPI lint + breaking-change check |
| Gate | A ci-status aggregate job for branch protection (skipped == OK, failure == block) |
| Deployment | Separate dev / staging / production workflows; canary-style production deploys |
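The path-filter and gate rows above combine into a simple rule: a stack's job runs only when files under its directory change, and the aggregate check passes when every job either succeeded or was skipped. The sketch below models that logic in Python for illustration. The real behaviour lives in the workflow YAML. The directory prefixes are an assumption based on the repository layout, and treating a cancelled job as blocking is also an assumption.

```python
# Assumed mapping from CI job name to the directory that triggers it.
STACK_PREFIXES = {
    "kotlin": "backend/platform/",
    "python": "backend/ai-services/",
    "go": "backend/cli/",
    "frontend": "frontend/",
}

# Job results that do not block a merge (skipped == OK).
PASSING_RESULTS = {"success", "skipped"}


def jobs_to_run(changed_files: list[str]) -> set[str]:
    """Return the jobs whose directory contains at least one changed file."""
    return {
        job
        for job, prefix in STACK_PREFIXES.items()
        if any(path.startswith(prefix) for path in changed_files)
    }


def ci_status(results: dict[str, str]) -> tuple[bool, list[str]]:
    """Aggregate per-job results into (passed, blocking_jobs)."""
    blocking = sorted(
        job for job, result in results.items() if result not in PASSING_RESULTS
    )
    return (not blocking, blocking)


if __name__ == "__main__":
    changed = ["frontend/src/App.tsx", "README.md"]
    print(jobs_to_run(changed))
    print(ci_status({"kotlin": "skipped", "frontend": "success"}))
```

Counting skipped jobs as passing is what lets a single required status check serve as the branch-protection gate even though most jobs do not run on a given change.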
Two deployment realities
The repository contains two deployment paths that are not consistent. The GitHub Actions workflows describe a GKE Autopilot / Kustomize path, but the production system actually running is a single Debian VPS provisioned by hand-written Python scripts that run Docker Compose. See the Deployment guide for the full picture.
Architecture summary
Pulling the stack together, DataFlow AI is a polyglot microservices monorepo:
- backend/platform: 7 Kotlin / Spring Boot 3.3.5 modules on JVM 21, namely api-gateway (reactive Spring Cloud Gateway), metadata-service, pipeline-engine, lineage-service, monitor-service, the shared common library, and the connector-sdk library (JDBC / streaming / cloud connectors), plus pushdown-sql.
- backend/ai-services: 2 Python 3.12 FastAPI services, copilot (NL-to-pipeline, NL-to-SQL, RAG, chat) and migration-engine (Informatica / Alteryx migration).
- backend/cli: a Go 1.22 Cobra CLI.
- frontend: a React 19 + Vite 7 + TypeScript 5.9 SPA with Keycloak SSO.
- Infrastructure: PostgreSQL 15 + pgvector, Redis 7, Kafka (Confluent Platform 7.7.1), MinIO, Keycloak 24, Prometheus + Grafana; Docker Compose for local development, and Kubernetes + Helm + Terraform for the cloud path.
For how these pieces talk to each other — request flows, routing, identity propagation, and event streaming — continue to the Architecture guide.