Technology stack

DataFlow AI is a polyglot ETL/data-integration platform — Kotlin/Spring microservices, Python/FastAPI AI services, a Go CLI, and a React/Vite single-page application. This page is the authoritative inventory of every technology in the monorepo, verified against the project's build manifests.

Source of truth

Every version on this page is verified against build.gradle.kts, libs.versions.toml, pyproject.toml, go.mod, package.json, docker-compose.yml, and the .github/workflows/ definitions. Treat these manifests as canonical if a version ever drifts from this page.


Languages & runtimes

The platform spans five languages, each chosen for a specific subsystem, plus the Node.js runtime used for frontend builds and CI.

Language | Version | Used for | Source
Kotlin | 2.0.21 | Platform microservices (Spring Boot) | backend/platform/gradle/libs.versions.toml
Java (JVM) | 21 (Temurin) | JVM runtime / build target for Kotlin services | Dockerfile, CI setup-java
Python | 3.12 (^3.12) | AI services (Copilot, Migration Engine) | pyproject.toml
Go | 1.22 | The dataflow CLI | backend/cli/go.mod
TypeScript | ~5.9.3 | Frontend SPA | frontend/package.json
Node.js | 20 | Frontend build / CI runtime | CI setup-node

Backend frameworks (Kotlin / JVM)

The five deployable JVM services and two library modules are built on Spring Boot. The api-gateway is reactive (WebFlux / Spring Cloud Gateway); the other services use Spring MVC.

Component | Version | Purpose
Spring Boot | 3.3.5 | Microservice framework: web, JPA, security, actuator, validation, websocket
Spring Cloud | 2023.0.3 | Cloud platform BOM
Spring Cloud Gateway | via BOM | Reactive API gateway routing
Spring Security OAuth2 Resource Server + JOSE | via BOM | JWT validation
Spring Dependency Management plugin | 1.1.6 | BOM-based dependency resolution
Kotlin Coroutines | 1.7.3 | Async support in the common module
Jackson | 2.17.2 (core 2.16.1 in common) | JSON / YAML / CSV serialization (jackson-module-kotlin, jsr310)

AI / Python frameworks

The copilot and migration-engine services are Python 3.12 FastAPI applications managed with Poetry.

Component | Version | Purpose
FastAPI | ^0.115.0 | HTTP API for Copilot & Migration Engine
Uvicorn (standard) | ^0.32.0 | ASGI server
Anthropic SDK | ^0.39.0 | Claude LLM client (default model claude-sonnet-4-6-20251001)
Pydantic Settings | ^2.6.0 | Typed configuration
httpx | ^0.27.0 | Async HTTP client
sentence-transformers | ^3.3.0 (optional ml extra) | Embeddings for RAG
numpy | ^1.26.0 | Vector math (Copilot)
lxml | ^5.3.0 | Informatica / Alteryx XML workflow parsing (Migration Engine)
python-multipart | ^0.0.12 | File uploads (Migration Engine)
pyyaml | ^6.0 | Pipeline YAML handling

Pluggable LLM providers

The copilot abstracts the LLM behind an LLMProvider interface selected at runtime by the LLM_PROVIDER environment variable:

Provider | Detail
anthropic | Claude; default model claude-sonnet-4-6-20251001
openrouter | OpenRouter; openai/gpt-oss-120b:free
local | Local LLM via Ollama; llama3.2

Frontend stack

The SPA is a React 19 application bundled by Vite, with TanStack libraries for data and grids, Zustand for client state, and Tailwind CSS 4 for styling.

Component | Version | Purpose
React | 19.2.0 | UI library
React DOM | 19.2.0 | DOM renderer
React Router DOM | 7.13.1 | Routing
Vite | 7.3.1 | Build tool / dev server
TanStack React Query | 5.90.21 | Server-state / data fetching
TanStack React Table | 8.21.3 | Data grids
TanStack React Virtual | 3.13.21 | List virtualization
Zustand | 5.0.11 | Client state (persisted persona store)
React Hook Form | 7.71.2 | Forms
@hookform/resolvers | 5.2.2 | Form schema-resolver bridge
Zod | 4.3.6 | Schema validation
@xyflow/react | 12.10.1 | Pipeline DAG canvas
@monaco-editor/react | 4.7.0 | Code / SQL editor
Recharts | 3.7.0 | Charts
Tailwind CSS | 4.2.1 (@tailwindcss/vite) | Styling
i18next | 25.8.14 | Internationalization
react-i18next | 16.5.5 | React i18n bindings
keycloak-js | 26.2.3 | OIDC login in the browser
axios | 1.13.6 | HTTP client
lucide-react / cmdk / clsx / date-fns | (not listed) | Icons, command palette, utilities

Go CLI libraries

The dataflow CLI is built with Cobra and Viper.

Library | Version | Purpose
spf13/cobra | 1.8.1 | CLI command framework
spf13/viper | 1.19.0 | Configuration
zalando/go-keyring | 0.2.6 | Credential storage
olekukonko/tablewriter | 0.0.5 | Terminal table output
gopkg.in/yaml.v3 | 3.0.1 | YAML parsing

Connectors, data formats & streaming libraries

The connector SDK and pipeline engine depend on a broad set of data-format, migration, and streaming libraries.

Library | Version | Purpose
Flyway (core + postgresql) | 10.17.3 | DB migrations
HikariCP | 6.2.1 | JDBC connection pooling
jSqlParser | 5.0 | SQL pushdown parsing
OpenLineage | 1.24.2 | Lineage events
Apache Flink clients | 1.18.1 | Job submission to the K8s Flink runtime
Apache Parquet (avro / hadoop) | 1.13.1 | Columnar file format
Apache Hadoop (common / client) | 3.3.6 | Parquet / HDFS dependency
Apache Avro | 1.11.3 | Serialization
Apache POI (poi / ooxml) | 5.2.5 | Excel I/O
Debezium (api / embedded) | 2.5.4.Final | Change Data Capture
Apache PDFBox | 3.0.1 | Pipeline assessment report rendering
Eclipse JGit | 6.8.0 | Git repo management in the pipeline engine
ICU4J | 74.2 | Encoding detection
Woodstox | 6.5.1 | Streaming XML

JDBC drivers (connector-sdk)

Driver | Version
Teradata | 20.00.00.43
Snowflake | 4.0.2
SAP HANA (ngdbc) | 2.28.6
Oracle (ojdbc11) | 23.7.0
Microsoft SQL Server | 12.8.1
PostgreSQL | 42.7.6
MySQL | 9.2.0
jTDS | 1.3.1
Databricks | 3.3.1

NoSQL & streaming clients

Client | Version
MongoDB driver | 5.6.0
Kafka clients | 3.9.0
Confluent kafka-avro-serializer | 7.8.0
Azure Event Hubs | 5.19.2
Google Cloud Pub/Sub | 1.133.1

Cloud SDKs

SDK | Version | Components used
Google Cloud BOM | 26.78.0 | Storage, BigQuery 2.44.0, Billing 2.53.0
AWS SDK v2 BOM | 2.31.1 | S3 + transfer manager
Azure SDK BOM | 1.2.33 | Blob Storage, Identity
Kubernetes Java client | 21.0.2 | K8s API access (Flink job submission)

Databases & storage

All JVM and AI services point at the same PostgreSQL instance and dataflow_metadata database; logical isolation is by per-service Flyway histories and table prefixes.

Component | Version | Role
PostgreSQL (pgvector) | pgvector/pgvector:pg15 | Primary RDBMS (dataflow_metadata), shared by metadata / lineage / Keycloak; pgvector extension backs RAG embeddings
Redis | 7-alpine | Rate limiting, caching, session store (reactive driver in the gateway)
MinIO | latest | S3 / GCS-compatible object storage
asyncpg (Python) | 0.30.0 | Copilot async DB access
pgvector (Python) | 0.3.5 | Copilot vector access
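To make the role of pgvector in RAG concrete, the sketch below shows in plain Python what a nearest-neighbour lookup by cosine distance computes. pgvector exposes cosine distance through its `<=>` operator; whether the Copilot actually ranks by cosine distance (rather than L2 or inner product) is an assumption here, and the document IDs and two-dimensional vectors are made up for readability.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0.0 means same direction, 1.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def top_k(query: list[float], embeddings: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the IDs of the k stored embeddings closest to the query."""
    ranked = sorted(embeddings, key=lambda doc_id: cosine_distance(query, embeddings[doc_id]))
    return ranked[:k]


# Toy embeddings; real sentence-transformers vectors have hundreds of dimensions.
stored = {
    "orders": [1.0, 0.0],
    "customers": [0.0, 1.0],
    "invoices": [0.9, 0.1],
}

print(top_k([1.0, 0.0], stored))  # ['orders', 'invoices']
```

In the database the same ranking would be expressed as an `ORDER BY` on the distance operator with a `LIMIT`, so only the closest rows leave PostgreSQL.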

Message broker & coordination

Kafka is the event-streaming backbone for connector and CDC workloads — not the inter-service command bus, which is synchronous HTTP.

Component | Version | Role
Apache Kafka (Confluent) | cp-kafka:7.7.1 | Event streaming backbone
ZooKeeper (Confluent) | cp-zookeeper:7.7.1 | Kafka coordination

Observability

Component | Version | Role
Prometheus | v2.54.1 | Metrics collection
Grafana | 11.2.2 | Dashboards
Spring Boot Actuator | via BOM | Health / metrics endpoints (/actuator/health)
OpenTelemetry API | 1.42.1 | Tracing instrumentation
OpenLineage | 1.24.2 | Data lineage events
SLF4J | 2.0.11 | JVM logging facade
Logback | 1.4.14 | JVM logging implementation
kotlin-logging-jvm | 3.0.5 | Kotlin logging facade

Authentication & security

Identity is anchored on Keycloak and validated twice — once at the gateway, again at each resource server (defense-in-depth).

Component | Version | Role
Keycloak (server) | 24.0 | OIDC / SSO identity provider, realm dataflow
Keycloak admin-client | 24.0.4 | Programmatic realm administration
Spring Security OAuth2 Resource Server + JOSE | via Spring Boot 3.3.5 | JWT bearer validation on every service
keycloak-js | 26.2.3 | Browser-side OIDC (frontend)

The gateway also applies a PII masking filter, CORS, and Redis-backed rate limiting. A development bypass exists via the DATAFLOW_GATEWAY_DEV_PERMIT_READS flag.
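The gateway-side decision can be pictured with a small sketch. This page does not document what the flag does in detail, so the behaviour below is inferred from its name only: when enabled, read-only requests pass the gateway without a valid token. The accepted flag values, the set of read methods, and both function names are assumptions, and the second validation at each resource server is not modelled.

```python
from collections.abc import Mapping

# Assumption: "reads" means safe, side-effect-free HTTP methods.
READ_METHODS = frozenset({"GET", "HEAD"})


def dev_permit_reads(env: Mapping[str, str]) -> bool:
    """True if the bypass flag is switched on (accepted spellings are assumed)."""
    value = env.get("DATAFLOW_GATEWAY_DEV_PERMIT_READS", "")
    return value.strip().lower() in {"1", "true", "yes"}


def gateway_permits(method: str, token_valid: bool, env: Mapping[str, str]) -> bool:
    """Gateway-side check only; token_valid stands in for real JWT validation."""
    if token_valid:
        return True
    # Without a valid token, only reads pass, and only when the dev flag is set.
    return dev_permit_reads(env) and method.upper() in READ_METHODS


dev_env = {"DATAFLOW_GATEWAY_DEV_PERMIT_READS": "true"}
print(gateway_permits("GET", token_valid=False, env=dev_env))   # True
print(gateway_permits("POST", token_valid=False, env=dev_env))  # False
print(gateway_permits("GET", token_valid=False, env={}))        # False
```

Keeping the bypass limited to reads and defaulting it to off means a misconfigured environment fails closed rather than open.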


Build tools

Tool | Version | Scope
Gradle (Kotlin DSL) | 8.5 | Kotlin multi-module build (dataflow-platform)
Poetry | 1.8.4 | Python dependency / build (poetry-core backend)
Go modules | go 1.22 | CLI build
Vite | 7.3.1 | Frontend bundling (tsc -b && vite build)
npm | Node 20 | Frontend package management (package-lock.json)

The dataflow-platform Gradle build comprises nine modules: common, api-gateway, metadata-service, pipeline-engine, connector-sdk, pushdown-sql, lineage-service, monitor-service, and integration-tests.


Testing & quality tooling

Each language stack has its own test runner, linter, and static-analysis gate enforced in CI.

Tool | Version | Scope
JUnit Jupiter | 5.10.1 | Kotlin / JVM unit tests
MockK | 1.13.12 (1.13.9 in connector-sdk) | Kotlin mocking
Testcontainers (junit / postgresql) | 1.20.2 | JVM integration tests
AssertJ | 3.25.1 | Fluent assertions
Reactor Test | via BOM | Reactive gateway tests
Detekt | Gradle plugin | Kotlin static analysis (CI gate)
pytest + pytest-asyncio | 8.3 / 0.24 | Python tests (asyncio_mode=auto)
Ruff | 0.7.0 | Python lint + format
mypy | CI | Python type check (non-blocking)
Vitest + @vitest/coverage-v8 | 3.2.x | Frontend unit tests
Testing Library (react / jest-dom / user-event) | 16.3 / 6.6 / 14.6 | Component testing
Playwright | 1.58.2 | Frontend E2E (port 4173)
MSW | 2.12.10 | API mocking
ESLint + typescript-eslint | 9.39.1 / 8.48.0 | Frontend lint
golangci-lint | v1.57 | Go lint
Redocly CLI / oasdiff | CI | OpenAPI spec lint + breaking-change detection
hadolint | v3.1.0 | Dockerfile lint
TFLint | v0.50.3 | Terraform lint

Containers & orchestration

Component | Detail
Docker / Docker Compose | backend/docker-compose.yml orchestrates 12 services: postgres, keycloak, zookeeper, kafka, prometheus, grafana, minio, redis, plus 6 app services
JVM service image | Multi-stage: gradle:8.5-jdk21 builder → eclipse-temurin:21-jre-alpine runtime; non-root user, container-aware JVM opts, actuator healthcheck
Python service image | Multi-stage python:3.12-slim with Poetry; runs uvicorn
Kubernetes | Java client 21.0.2; backend/infrastructure/k8s + Helm charts present
Terraform | v1.7.5 (validated in CI per-environment)
Flink runtime | Runs on Kubernetes (not embedded); jobs submitted via the Flink client

CI / CD

Continuous integration runs on GitHub Actions with path-filtered monorepo jobs, so each language stack only rebuilds when its files change.

Aspect | Detail
Platform | GitHub Actions: 8 workflows (ci, codeql-analysis, security-scan, zero-tolerance, deploy-dev/staging/production, release)
CI strategy | Path-filtered monorepo jobs via dorny/paths-filter@v3: kotlin, python, go, frontend, terraform, docker, openapi
Kotlin job | JDK 21, Gradle setup-gradle@v4; compile → Detekt → unit tests → integrationTest → build; uploads jars + JUnit reports
Python job | Matrix (copilot, migration-engine); Poetry install → Ruff lint/format → mypy → pytest with coverage (thresholds: all 70%, new 80%)
Go job | Go 1.22; build / vet / race tests with coverage; golangci-lint
Frontend job | Node 20; npm ci → ESLint → tsc -b → build → Vitest coverage
Validation jobs | Terraform validate/fmt/TFLint; docker compose config + hadolint; OpenAPI lint + breaking-change check
Gate | A ci-status aggregate job for branch protection (skipped == OK, failure == block)
Deployment | Separate dev / staging / production workflows; canary-style production deploys
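The path filtering and the aggregate gate amount to two small decisions, sketched here in Python for illustration. The real logic lives in the GitHub Actions workflows and in dorny/paths-filter, whose glob syntax differs from the simplified patterns used here. The pattern map covers only four of the seven filters, and treating any result other than success or skipped as blocking is an assumption; this page states only that skipped passes and failure blocks.

```python
from fnmatch import fnmatchcase

# Simplified, hypothetical patterns based on the repository layout named on
# this page; the actual filter definitions may differ.
FILTERS = {
    "kotlin": ["backend/platform/*"],
    "python": ["backend/ai-services/*"],
    "go": ["backend/cli/*"],
    "frontend": ["frontend/*"],
}


def stacks_to_build(changed_files: list[str]) -> set[str]:
    """Return the stacks whose patterns match at least one changed file."""
    return {
        stack
        for stack, patterns in FILTERS.items()
        if any(fnmatchcase(path, pattern) for path in changed_files for pattern in patterns)
    }


def ci_status(job_results: dict[str, str]) -> str:
    """Aggregate gate: skipped counts as OK, failure blocks the merge."""
    passing = {"success", "skipped"}
    # Assumption: states not mentioned on this page (e.g. cancelled) also block.
    return "success" if all(r in passing for r in job_results.values()) else "failure"


changed = ["frontend/src/App.tsx", "backend/cli/cmd/root.go"]
print(sorted(stacks_to_build(changed)))  # ['frontend', 'go']
print(ci_status({"kotlin": "skipped", "go": "success", "frontend": "success"}))  # success
print(ci_status({"kotlin": "skipped", "go": "failure", "frontend": "success"}))  # failure
```

Counting skipped jobs as passing is what lets branch protection require a single check even though most jobs do not run on a given pull request.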

Two deployment realities

The repository contains two deployment paths that are not consistent. The GitHub Actions workflows describe a GKE Autopilot / Kustomize path, but the production system actually running is a single Debian VPS provisioned by hand-written Python scripts that run Docker Compose. See the Deployment guide for the full picture.


Architecture summary

Pulling the stack together, DataFlow AI is a polyglot microservices monorepo:

  • backend/platform: Kotlin / Spring Boot 3.3.5 modules on JVM 21. These are five services (api-gateway, a reactive Spring Cloud Gateway; metadata-service; pipeline-engine; lineage-service; monitor-service), two libraries (the shared common library and connector-sdk, which holds the JDBC / streaming / cloud connectors), and the pushdown-sql module.
  • backend/ai-services: 2 Python 3.12 FastAPI services, copilot (NL-to-pipeline, NL-to-SQL, RAG, chat) and migration-engine (Informatica / Alteryx migration).
  • backend/cli: a Go 1.22 Cobra CLI.
  • frontend: a React 19 + Vite 7 + TypeScript 5.9 SPA with Keycloak SSO.
  • Infrastructure: PostgreSQL 15 + pgvector, Redis 7, Kafka (Confluent cp-kafka 7.7.1), MinIO, Keycloak 24, Prometheus + Grafana; Docker Compose for local development, and Kubernetes + Helm + Terraform for the cloud path.

For how these pieces talk to each other — request flows, routing, identity propagation, and event streaming — continue to the Architecture guide.
