ETL on DataFlow
Data & pipeline lifecycle
A pipeline on the DataFlow AI Platform travels a long path: it is authored in the Design Studio, validated, scheduled, compiled into a DAG, executed task by task, monitored, traced for lineage, certified by governance and eventually retired. This page follows that path end to end, and also covers the database migration and software release lifecycles that support it.
Overview
Three lifecycles run in parallel on the platform, and they intersect at predictable points.
| Lifecycle | What moves through it | Owned by |
|---|---|---|
| Pipeline lifecycle | A pipeline definition — draft to retirement | metadata-service + pipeline-engine |
| Run lifecycle | A single execution of a pipeline | pipeline-engine, observed by monitor-service |
| Data lifecycle | The datasets a pipeline produces — cataloged, traced, governed | metadata-service + lineage-service |
A supporting Flyway migration lifecycle evolves the database schema, and a software release lifecycle ships the platform itself. The remainder of this page treats each in turn.
AUTHOR ──► VALIDATE ──► SCHEDULE ──► EXECUTE ──► MONITOR ──► LINEAGE ──► GOVERN ──► RETIRE
(draft) (DSL + (cron / (DAG run) (alerts, (dataset/ (certify, (archive)
client backfill) cost,SLA) column) contracts)
checks)
Stage 1 — Authoring (draft)
A pipeline begins life as a definition. There are three entry points, all of which produce the same artifact: a DataFlow YAML pipeline definition.
| Entry point | Tool | Notes |
|---|---|---|
| Visual design | DesignStudio page, React Flow canvas | Drag-and-drop nodes; the canvas serialises to YAML |
| Guided creation | CreatePipelineWizard (/pipelines/new) | Step-by-step wizard for common shapes |
| Migration import | migration-engine | Converts a legacy Informatica / Alteryx / SSIS / DataStage job to DataFlow YAML |
| Template instantiation | PipelineTemplates gallery | Starts from a ready-made template, e.g. subscriber-360.yaml |
The YAML DSL is parsed by dsl/PipelineYamlParser.kt against dsl/PipelineYamlSchema.kt in pipeline-engine. Catalog-aware authoring is supported because metadata-service already holds discovered schemas — connectors call discoverSchema() and the resulting tables and columns are persisted in the catalog, so the designer can reference real datasets.
A draft pipeline and its versions are stored as pipelines and pipeline_versions entities in metadata-service (created by Flyway metadata migrations V1–V2). Each save in the Design Studio creates a new version; pipeline-engine's git/PipelineVersionControl.kt and GitRepositoryManager.kt back this with a Git history, and YamlDiffEngine.kt produces the diffs shown in the GitDiffViewer component.
Versioning is Git-backed
Pipeline definitions are version-controlled in a real Git repository managed by pipeline-engine. The VersionHistory and GitDiffViewer components in the Design Studio surface that history; the /api/v1/git endpoints expose repos, history, diff, rollback, branches and merge.
Stage 2 — Validation
Before a pipeline can run, it is validated — twice.
- Client-side. The SPA service
services/pipelineValidator.tschecks the definition in the browser as the user edits, giving immediate feedback in the Design Studio. - Server-side. When a run is requested, pipeline-engine's
validation/PipelineValidator.ktvalidates the parsed definition, andvalidation/ParameterResolver.ktresolves built-in variables, environment variables and runtime parameters.
Structural validation continues into DAG construction: dag/DagBuilder.kt builds an ExecutionDAG and rejects definitions with cycles or dangling dependencies.
Data-shape validation is a separate concern handled by governance: data contracts (/api/v1/contracts, Flyway metadata V16) declare the expected schema and semantics of a dataset, and the migration workflow exposes a dedicated ValidationSuitePage for converted pipelines.
Stage 3 — Scheduling
A validated pipeline can be run on demand or placed on a schedule. Scheduling is owned by pipeline-engine's scheduler/ package.
| Component | Role |
|---|---|
scheduler/CronScheduler.kt | Cron-based recurring schedules |
scheduler/BackfillManager.kt | Backfill jobs over historical partitions |
scheduler/DependencyChain.kt | Cross-pipeline dependency ordering |
scheduler/RetryPolicy.kt | Retry behaviour for failed runs |
Schedules, backfill jobs and dependencies are persisted as engine_schedules, engine_backfill_jobs / engine_backfill_partitions and dependency-chain entities (Flyway engine V1 and V3). The SchedulerController at /api/v1/scheduler exposes schedule CRUD plus pause and resume, a /next-runs preview, and backfill CRUD with pause, resume and cancel.
Europe/Warsaw by default
Pipeline schedules default to the Europe/Warsaw timezone — the platform is built for Polkomtel, a Polish telecom operator, and GDPR / Polish-compliance modules assume the same.
Pipelines may also be triggered through external orchestrators. The orchestrator/ package provides AirflowClient, AirflowDagGenerator, an AutomateNowAdapter and a WebhookCallbackService, exposed via /api/v1/orchestrator.
Stage 4 — Execution (DAG run)
A run begins when POST /api/v1/pipelines/{id}/run reaches pipeline-engine's ExecutionController. The execution/PipelineRunner.kt orchestrates the run:
POST /pipelines/{id}/run
│
▼
PipelineRunner
1. PipelineYamlParser — parse YAML DSL
2. ParameterResolver — resolve params / env / runtime vars
3. PipelineValidator — validate the definition
4. DagBuilder — build ExecutionDAG (cycle / dangling detection)
5. execute tasks in topological order
├─ parallel within each DAG level (fixed thread pool)
└─ ExecutionContext supports cooperative cancellation
6. aggregate results
│
├─ ExecutionEventPublisher / PipelineRunLogPublisher → live events
├─ OpenLineageEmitter → lineage-service /openlineage/events
└─ on task failure → SelfHealingService
Each run is recorded as an engine_execution_runs entity (with a run_id and a task_states JSONB column, Flyway engine V4). For streaming or large-compute work the engine submits jobs to external engines — flink/FlinkJobSubmitter or spark/DataprocJobSubmitter.
Self-healing
When a task fails, healing/SelfHealingService.kt runs FailureClassifier to categorise the failure and applies a matching strategy from RecoveryStrategies. Healing attempts are persisted (engine_healing_attempts, Flyway engine V8) and surfaced through the HealingController and the SelfHealingDashboard page.
Run states
A run progresses through a defined set of states.
┌──────────┐
│ QUEUED │ run requested, awaiting a slot
└────┬─────┘
▼
┌──────────┐
│ RUNNING │ tasks executing in topological order
└────┬─────┘
│
┌─────────┼──────────┬──────────────┐
▼ ▼ ▼ ▼
┌────────┐┌────────┐┌──────────┐ ┌──────────┐
│SUCCESS ││ FAILED ││CANCELLED │ │ HEALING │
└────────┘└───┬────┘└──────────┘ └────┬─────┘
│ cancellation via │ recovery strategy
│ ExecutionContext │ applied; may
│ ▼ re-enter RUNNING
└──────────────────────────┘
| Run state | Meaning |
|---|---|
QUEUED | The run has been requested and is waiting for an execution slot. |
RUNNING | The DAG is executing; tasks run in topological order, parallel within a level. |
SUCCESS | All tasks completed successfully. |
FAILED | A task failed and could not be recovered. |
CANCELLED | The run was cancelled cooperatively via ExecutionContext. |
HEALING | SelfHealingService is applying a recovery strategy; the run may re-enter RUNNING. |
Cancellation is exposed at POST /api/v1/runs/{runId}/cancel; run status at GET /api/v1/runs/{runId}.
Stage 5 — Monitoring
Once a run completes, monitor-service ingests its metrics and evaluates them. Monitoring is continuous, not a one-off step.
| Concern | monitor-service component | Surfaced at |
|---|---|---|
| Run status & logs | PipelineRunLogService, PipelineRunLogController | /monitor/runs, LogViewerPage |
| Alerts | AlertService, AlertDefinitionService, SseAlertController | /monitor/alerts |
| Cost | PipelineCostService, CostAnomalyDetector | /monitor/costs |
| SLA / SLO | SlaBurnRateService | /monitor/sla |
| Freshness | FreshnessTracker | /monitor/freshness |
| Performance | PerformanceMetricsService | /monitor/performance |
| Anomaly baselines | AnomalyBaselineService | /monitor/anomalies |
Run events stream live to the SPA two ways: a WebSocket route (/api/v1/runs/{runId}/stream, served by PipelineStatusWebSocket) carries task-state and log updates to the log viewer, while Server-Sent Events (SseAlertController at /api/v1/monitor/sse) carry alerts and metrics. When a run produces bad data, service/QuarantineService.kt and the data_quarantine records (Flyway engine V7) hold it back from downstream consumption until a steward approves, rejects or edits it from the DataQuarantine page.
Stage 6 — Lineage capture
As a run executes, pipeline-engine's lineage/OpenLineageEmitter.kt emits OpenLineage RunEvents. lineage-service ingests them through its OpenLineageEventController at /api/v1/lineage/openlineage and builds dataset- and column-level lineage.
pipeline-engine lineage-service
─────────────── ───────────────
OpenLineageEmitter ──RunEvent──► OpenLineageEventController
│
LineageGraphBuilder
│
┌───────────┼────────────┐
▼ ▼ ▼
DatasetLineage ColumnLineage ImpactAnalyzer
│ │ │
└───── PropagationWorker ┘
(tags / PII propagation)
lineage-service has Flyway disabled — it reuses the lineage tables that metadata-service creates (Flyway metadata V4 and V48). It records dataset and column lineage, supports findLineageAsOf time-travel queries, runs ImpactAnalyzer for downstream impact analysis, and uses PropagationWorker to propagate tags and PII classifications across the graph. The lineage graph is presented in the LineageExplorer page.
Stage 7 — Governance & certification
Datasets and pipelines are governed by metadata-service. Governance is where data is certified fit for use.
| Activity | Mechanism | UI |
|---|---|---|
| Review & approval | PolicyEngine, ApprovalWorkflow, governance reviews | ReviewQueue |
| Data contracts | /api/v1/contracts, Flyway metadata V16 | DataContracts |
| Quality monitoring | QualityController, quality rules / results / scores | QualityMonitoring |
| Endorsements | /api/v1/governance/endorsements | governance hub |
| Business glossary | glossary entities, Flyway metadata V10 | BusinessGlossary |
| Schema evolution | /api/v1/governance/schema-changes | SchemaEvolution |
| Audit trail | hash-chained immutable audit log, AuditChainVerifierService | AuditTrail |
| Compliance | GdprService, PolishComplianceService, DSAR, masking | ComplianceDashboard, DsarRequestsPage |
The audit log is a hash-chained immutable record (Flyway metadata V27 and V29) verified by AuditChainVerifierService. PII is protected throughout: the gateway's PiiMaskingFilter masks responses, metadata-service runs DynamicMaskingService and masking policies (Flyway metadata V17), and a PII classifier feeds the lineage-service PII propagation.
GDPR data-subject access requests follow their own sub-lifecycle: a DSAR (gdpr_dsar_requests, Flyway metadata V1) is raised, and erasure is orchestrated by pipeline-engine's erasure/ErasureOrchestrator.kt, which records erasure_runs and erasure_steps (Flyway engine V9).
Stage 8 — Retirement
A pipeline that is no longer needed is retired. Its definition and version history remain in metadata-service for audit, while the schedule is paused or removed through the SchedulerController and the pipeline is no longer eligible for execution.
Data retirement is governed separately. RetentionPolicyEnforcer in metadata-service applies retention policies, GDPR erasure removes personal data on request, and the data marketplace (Flyway metadata V20) can withdraw a data product so consumers no longer discover it. Historical run records and lineage are retained for impact analysis and compliance reporting even after a pipeline is retired.
ACTIVE ──► DEPRECATED ──► RETIRED ──► (definition + history retained for audit)
│ │
schedule paused removed from execution;
consumers warned data product withdrawn;
retention policy applies
Pipeline state summary
Bringing the stages together, a pipeline definition moves through these states.
DRAFT ──► VALIDATED ──► SCHEDULED ──► ACTIVE ──► DEPRECATED ──► RETIRED
│ │ │ │
│ │ │ └─ each trigger spawns a RUN
│ │ │ (QUEUED → RUNNING → SUCCESS/FAILED)
│ │ └─ cron / backfill / dependency / orchestrator
│ └─ DSL + DAG + contract validation
└─ authored in Design Studio / wizard / migration / template
| Pipeline state | Meaning |
|---|---|
DRAFT | Authored, not yet validated; saved as a pipeline_version. |
VALIDATED | Passed DSL, DAG and contract validation; runnable on demand. |
SCHEDULED | Bound to a cron schedule, backfill or dependency chain. |
ACTIVE | In regular operation; each trigger produces a run. |
DEPRECATED | Marked for removal; schedule paused, consumers warned. |
RETIRED | No longer executed; definition and history retained for audit. |
Flyway database migration lifecycle
The platform's database schema evolves through Flyway migrations. The shared PostgreSQL database is logically partitioned by per-service migration histories.
| Service | Flyway location | History table | Range |
|---|---|---|---|
| metadata-service | db/migration/metadata | flyway_schema_history | V1–V54 |
| pipeline-engine | db/migration/engine | flyway_schema_history_engine | V1–V9 |
| monitor-service | db/migration/monitor | flyway_schema_history_monitor | V9–V25 |
| lineage-service | — Flyway disabled — | — | reuses metadata tables |
A migration follows this lifecycle from authoring to live schema:
WRITE ──► COMMIT ──► STARTUP ──► APPLY ──► RECORD ──► VALIDATE
V__NN in db/ service run insert ddl-auto
numbered migration/ boots pending row into checks the
SQL file {service}/ migrations history live schema
table matches
- Write — a developer adds a numbered
V__NNSQL file under the service'sdb/migrationfolder. - Commit — the file ships with the service build.
- Startup — on service boot, Flyway compares the migration files with the service's history table.
- Apply — pending migrations run in order.
- Record — each applied migration is recorded as a row in the per-service history table.
- Validate — metadata-service runs JPA
ddl-auto: validate; pipeline-engine and lineage-service useddl-auto: none.
Permissive Flyway settings
The compose deployment runs Flyway with BASELINE_ON_MIGRATE, OUT_OF_ORDER, REPAIR_ON_MIGRATE, VALIDATE_ON_MIGRATE: false and BASELINE_VERSION: 20.1. These let a shared database with overlapping, historically diverged migration histories converge on startup — but they are a fragility risk, and migration ordering matters because lineage-service depends on metadata-service having already applied its lineage migrations (V4 and V48).
Software release lifecycle
The platform itself ships through a build → test → deploy → monitor cycle.
BUILD ──────► TEST ──────► DEPLOY ──────► MONITOR
Gradle JVM vitest / docker compose Prometheus
modules + playwright up on the VPS scrape →
Vite frontend / msw + (git archive, Grafana;
+ Python backend Flyway on SSE alerts
services tests startup)
| Phase | What happens |
|---|---|
| Build | The JVM services share one platform/Dockerfile with a BUILD_MODULE build-arg selecting the Gradle module. The Vite frontend and the two Python services have their own build contexts. |
| Test | The frontend runs vitest unit tests, playwright end-to-end tests (with a VITE_AUTH_MODE=dev bypass) and msw-mocked API tests; backend services run their own test suites. |
| Deploy | The live system is a single Debian VPS provisioned by deploy/deploy_to_vps.py, which SSHes in, untars a git archive and runs docker compose up. Flyway migrations run on service startup; nginx terminates TLS. |
| Monitor | Prometheus scrapes /actuator/prometheus and /metrics, feeding Grafana dashboards; monitor-service emits alerts over SSE. |
Two deployment paths
The repository documents a GitHub Actions → GCP Artifact Registry → GKE Autopilot path with canary rollout, but the system actually running in production is the single-VPS Docker Compose topology. The release lifecycle described here reflects the live VPS path.
Putting it together
The three lifecycles interlock: a pipeline is authored as YAML, validated in the browser and on the server, scheduled through cron or backfill, executed as a topologically-ordered DAG run, monitored continuously by monitor-service, traced by lineage-service as OpenLineage events, governed and certified by metadata-service, and finally retired with its history retained for audit. Underneath, Flyway migrations evolve the schema on every service startup, and the platform itself moves through build, test, deploy and monitor on each release.