ETL on DataFlow

Data & pipeline lifecycle

A pipeline on the DataFlow AI Platform travels a long path: it is authored in the Design Studio, validated, scheduled, compiled into a DAG, executed task by task, monitored, traced for lineage, certified by governance and eventually retired. This page follows that path end to end, and also covers the database migration and software release lifecycles that support it.


Overview

Three lifecycles run in parallel on the platform, and they intersect at predictable points.

LifecycleWhat moves through itOwned by
Pipeline lifecycleA pipeline definition — draft to retirementmetadata-service + pipeline-engine
Run lifecycleA single execution of a pipelinepipeline-engine, observed by monitor-service
Data lifecycleThe datasets a pipeline produces — cataloged, traced, governedmetadata-service + lineage-service

A supporting Flyway migration lifecycle evolves the database schema, and a software release lifecycle ships the platform itself. The remainder of this page treats each in turn.

  AUTHOR ──► VALIDATE ──► SCHEDULE ──► EXECUTE ──► MONITOR ──► LINEAGE ──► GOVERN ──► RETIRE
   (draft)    (DSL +       (cron /      (DAG run)   (alerts,    (dataset/   (certify,   (archive)
              client       backfill)                cost,SLA)   column)     contracts)
              checks)

Stage 1 — Authoring (draft)

A pipeline begins life as a definition. There are three entry points, all of which produce the same artifact: a DataFlow YAML pipeline definition.

Entry pointToolNotes
Visual designDesignStudio page, React Flow canvasDrag-and-drop nodes; the canvas serialises to YAML
Guided creationCreatePipelineWizard (/pipelines/new)Step-by-step wizard for common shapes
Migration importmigration-engineConverts a legacy Informatica / Alteryx / SSIS / DataStage job to DataFlow YAML
Template instantiationPipelineTemplates galleryStarts from a ready-made template, e.g. subscriber-360.yaml

The YAML DSL is parsed by dsl/PipelineYamlParser.kt against dsl/PipelineYamlSchema.kt in pipeline-engine. Catalog-aware authoring is supported because metadata-service already holds discovered schemas — connectors call discoverSchema() and the resulting tables and columns are persisted in the catalog, so the designer can reference real datasets.

A draft pipeline and its versions are stored as pipelines and pipeline_versions entities in metadata-service (created by Flyway metadata migrations V1–V2). Each save in the Design Studio creates a new version; pipeline-engine's git/PipelineVersionControl.kt and GitRepositoryManager.kt back this with a Git history, and YamlDiffEngine.kt produces the diffs shown in the GitDiffViewer component.

Versioning is Git-backed

Pipeline definitions are version-controlled in a real Git repository managed by pipeline-engine. The VersionHistory and GitDiffViewer components in the Design Studio surface that history; the /api/v1/git endpoints expose repos, history, diff, rollback, branches and merge.


Stage 2 — Validation

Before a pipeline can run, it is validated — twice.

  1. Client-side. The SPA service services/pipelineValidator.ts checks the definition in the browser as the user edits, giving immediate feedback in the Design Studio.
  2. Server-side. When a run is requested, pipeline-engine's validation/PipelineValidator.kt validates the parsed definition, and validation/ParameterResolver.kt resolves built-in variables, environment variables and runtime parameters.

Structural validation continues into DAG construction: dag/DagBuilder.kt builds an ExecutionDAG and rejects definitions with cycles or dangling dependencies.

Data-shape validation is a separate concern handled by governance: data contracts (/api/v1/contracts, Flyway metadata V16) declare the expected schema and semantics of a dataset, and the migration workflow exposes a dedicated ValidationSuitePage for converted pipelines.


Stage 3 — Scheduling

A validated pipeline can be run on demand or placed on a schedule. Scheduling is owned by pipeline-engine's scheduler/ package.

ComponentRole
scheduler/CronScheduler.ktCron-based recurring schedules
scheduler/BackfillManager.ktBackfill jobs over historical partitions
scheduler/DependencyChain.ktCross-pipeline dependency ordering
scheduler/RetryPolicy.ktRetry behaviour for failed runs

Schedules, backfill jobs and dependencies are persisted as engine_schedules, engine_backfill_jobs / engine_backfill_partitions and dependency-chain entities (Flyway engine V1 and V3). The SchedulerController at /api/v1/scheduler exposes schedule CRUD plus pause and resume, a /next-runs preview, and backfill CRUD with pause, resume and cancel.

Europe/Warsaw by default

Pipeline schedules default to the Europe/Warsaw timezone — the platform is built for Polkomtel, a Polish telecom operator, and GDPR / Polish-compliance modules assume the same.

Pipelines may also be triggered through external orchestrators. The orchestrator/ package provides AirflowClient, AirflowDagGenerator, an AutomateNowAdapter and a WebhookCallbackService, exposed via /api/v1/orchestrator.


Stage 4 — Execution (DAG run)

A run begins when POST /api/v1/pipelines/{id}/run reaches pipeline-engine's ExecutionController. The execution/PipelineRunner.kt orchestrates the run:

  POST /pipelines/{id}/run


  PipelineRunner
    1. PipelineYamlParser   — parse YAML DSL
    2. ParameterResolver    — resolve params / env / runtime vars
    3. PipelineValidator    — validate the definition
    4. DagBuilder           — build ExecutionDAG (cycle / dangling detection)
    5. execute tasks in topological order
         ├─ parallel within each DAG level (fixed thread pool)
         └─ ExecutionContext supports cooperative cancellation
    6. aggregate results

        ├─ ExecutionEventPublisher / PipelineRunLogPublisher  → live events
        ├─ OpenLineageEmitter  → lineage-service /openlineage/events
        └─ on task failure → SelfHealingService

Each run is recorded as an engine_execution_runs entity (with a run_id and a task_states JSONB column, Flyway engine V4). For streaming or large-compute work the engine submits jobs to external engines — flink/FlinkJobSubmitter or spark/DataprocJobSubmitter.

Self-healing

When a task fails, healing/SelfHealingService.kt runs FailureClassifier to categorise the failure and applies a matching strategy from RecoveryStrategies. Healing attempts are persisted (engine_healing_attempts, Flyway engine V8) and surfaced through the HealingController and the SelfHealingDashboard page.

Run states

A run progresses through a defined set of states.

        ┌──────────┐
        │  QUEUED  │  run requested, awaiting a slot
        └────┬─────┘

        ┌──────────┐
        │ RUNNING  │  tasks executing in topological order
        └────┬─────┘

   ┌─────────┼──────────┬──────────────┐
   ▼         ▼          ▼              ▼
┌────────┐┌────────┐┌──────────┐ ┌──────────┐
│SUCCESS ││ FAILED ││CANCELLED │ │ HEALING  │
└────────┘└───┬────┘└──────────┘ └────┬─────┘
              │  cancellation via      │ recovery strategy
              │  ExecutionContext      │ applied; may
              │                        ▼ re-enter RUNNING
              └──────────────────────────┘
Run stateMeaning
QUEUEDThe run has been requested and is waiting for an execution slot.
RUNNINGThe DAG is executing; tasks run in topological order, parallel within a level.
SUCCESSAll tasks completed successfully.
FAILEDA task failed and could not be recovered.
CANCELLEDThe run was cancelled cooperatively via ExecutionContext.
HEALINGSelfHealingService is applying a recovery strategy; the run may re-enter RUNNING.

Cancellation is exposed at POST /api/v1/runs/{runId}/cancel; run status at GET /api/v1/runs/{runId}.


Stage 5 — Monitoring

Once a run completes, monitor-service ingests its metrics and evaluates them. Monitoring is continuous, not a one-off step.

Concernmonitor-service componentSurfaced at
Run status & logsPipelineRunLogService, PipelineRunLogController/monitor/runs, LogViewerPage
AlertsAlertService, AlertDefinitionService, SseAlertController/monitor/alerts
CostPipelineCostService, CostAnomalyDetector/monitor/costs
SLA / SLOSlaBurnRateService/monitor/sla
FreshnessFreshnessTracker/monitor/freshness
PerformancePerformanceMetricsService/monitor/performance
Anomaly baselinesAnomalyBaselineService/monitor/anomalies

Run events stream live to the SPA two ways: a WebSocket route (/api/v1/runs/{runId}/stream, served by PipelineStatusWebSocket) carries task-state and log updates to the log viewer, while Server-Sent Events (SseAlertController at /api/v1/monitor/sse) carry alerts and metrics. When a run produces bad data, service/QuarantineService.kt and the data_quarantine records (Flyway engine V7) hold it back from downstream consumption until a steward approves, rejects or edits it from the DataQuarantine page.


Stage 6 — Lineage capture

As a run executes, pipeline-engine's lineage/OpenLineageEmitter.kt emits OpenLineage RunEvents. lineage-service ingests them through its OpenLineageEventController at /api/v1/lineage/openlineage and builds dataset- and column-level lineage.

  pipeline-engine                       lineage-service
  ───────────────                       ───────────────
  OpenLineageEmitter ──RunEvent──►  OpenLineageEventController

                                    LineageGraphBuilder

                              ┌───────────┼────────────┐
                              ▼           ▼            ▼
                       DatasetLineage  ColumnLineage  ImpactAnalyzer
                              │           │            │
                              └─────  PropagationWorker ┘
                                  (tags / PII propagation)

lineage-service has Flyway disabled — it reuses the lineage tables that metadata-service creates (Flyway metadata V4 and V48). It records dataset and column lineage, supports findLineageAsOf time-travel queries, runs ImpactAnalyzer for downstream impact analysis, and uses PropagationWorker to propagate tags and PII classifications across the graph. The lineage graph is presented in the LineageExplorer page.


Stage 7 — Governance & certification

Datasets and pipelines are governed by metadata-service. Governance is where data is certified fit for use.

ActivityMechanismUI
Review & approvalPolicyEngine, ApprovalWorkflow, governance reviewsReviewQueue
Data contracts/api/v1/contracts, Flyway metadata V16DataContracts
Quality monitoringQualityController, quality rules / results / scoresQualityMonitoring
Endorsements/api/v1/governance/endorsementsgovernance hub
Business glossaryglossary entities, Flyway metadata V10BusinessGlossary
Schema evolution/api/v1/governance/schema-changesSchemaEvolution
Audit trailhash-chained immutable audit log, AuditChainVerifierServiceAuditTrail
ComplianceGdprService, PolishComplianceService, DSAR, maskingComplianceDashboard, DsarRequestsPage

The audit log is a hash-chained immutable record (Flyway metadata V27 and V29) verified by AuditChainVerifierService. PII is protected throughout: the gateway's PiiMaskingFilter masks responses, metadata-service runs DynamicMaskingService and masking policies (Flyway metadata V17), and a PII classifier feeds the lineage-service PII propagation.

GDPR data-subject access requests follow their own sub-lifecycle: a DSAR (gdpr_dsar_requests, Flyway metadata V1) is raised, and erasure is orchestrated by pipeline-engine's erasure/ErasureOrchestrator.kt, which records erasure_runs and erasure_steps (Flyway engine V9).


Stage 8 — Retirement

A pipeline that is no longer needed is retired. Its definition and version history remain in metadata-service for audit, while the schedule is paused or removed through the SchedulerController and the pipeline is no longer eligible for execution.

Data retirement is governed separately. RetentionPolicyEnforcer in metadata-service applies retention policies, GDPR erasure removes personal data on request, and the data marketplace (Flyway metadata V20) can withdraw a data product so consumers no longer discover it. Historical run records and lineage are retained for impact analysis and compliance reporting even after a pipeline is retired.

  ACTIVE ──► DEPRECATED ──► RETIRED ──► (definition + history retained for audit)
              │                │
       schedule paused   removed from execution;
       consumers warned  data product withdrawn;
                         retention policy applies

Pipeline state summary

Bringing the stages together, a pipeline definition moves through these states.

  DRAFT ──► VALIDATED ──► SCHEDULED ──► ACTIVE ──► DEPRECATED ──► RETIRED
   │            │            │           │
   │            │            │           └─ each trigger spawns a RUN
   │            │            │              (QUEUED → RUNNING → SUCCESS/FAILED)
   │            │            └─ cron / backfill / dependency / orchestrator
   │            └─ DSL + DAG + contract validation
   └─ authored in Design Studio / wizard / migration / template
Pipeline stateMeaning
DRAFTAuthored, not yet validated; saved as a pipeline_version.
VALIDATEDPassed DSL, DAG and contract validation; runnable on demand.
SCHEDULEDBound to a cron schedule, backfill or dependency chain.
ACTIVEIn regular operation; each trigger produces a run.
DEPRECATEDMarked for removal; schedule paused, consumers warned.
RETIREDNo longer executed; definition and history retained for audit.

Flyway database migration lifecycle

The platform's database schema evolves through Flyway migrations. The shared PostgreSQL database is logically partitioned by per-service migration histories.

ServiceFlyway locationHistory tableRange
metadata-servicedb/migration/metadataflyway_schema_historyV1–V54
pipeline-enginedb/migration/engineflyway_schema_history_engineV1–V9
monitor-servicedb/migration/monitorflyway_schema_history_monitorV9–V25
lineage-service— Flyway disabled —reuses metadata tables

A migration follows this lifecycle from authoring to live schema:

  WRITE  ──►  COMMIT  ──►  STARTUP  ──►  APPLY  ──►  RECORD  ──►  VALIDATE
  V__NN     in db/         service       run         insert      ddl-auto
  numbered  migration/     boots         pending     row into    checks the
  SQL file  {service}/                   migrations  history     live schema
                                                     table       matches
  1. Write — a developer adds a numbered V__NN SQL file under the service's db/migration folder.
  2. Commit — the file ships with the service build.
  3. Startup — on service boot, Flyway compares the migration files with the service's history table.
  4. Apply — pending migrations run in order.
  5. Record — each applied migration is recorded as a row in the per-service history table.
  6. Validate — metadata-service runs JPA ddl-auto: validate; pipeline-engine and lineage-service use ddl-auto: none.

Permissive Flyway settings

The compose deployment runs Flyway with BASELINE_ON_MIGRATE, OUT_OF_ORDER, REPAIR_ON_MIGRATE, VALIDATE_ON_MIGRATE: false and BASELINE_VERSION: 20.1. These let a shared database with overlapping, historically diverged migration histories converge on startup — but they are a fragility risk, and migration ordering matters because lineage-service depends on metadata-service having already applied its lineage migrations (V4 and V48).


Software release lifecycle

The platform itself ships through a build → test → deploy → monitor cycle.

  BUILD ──────► TEST ──────► DEPLOY ──────► MONITOR
  Gradle JVM    vitest /     docker compose   Prometheus
  modules +     playwright   up on the VPS    scrape →
  Vite frontend / msw +      (git archive,    Grafana;
  + Python      backend      Flyway on        SSE alerts
  services      tests        startup)
PhaseWhat happens
BuildThe JVM services share one platform/Dockerfile with a BUILD_MODULE build-arg selecting the Gradle module. The Vite frontend and the two Python services have their own build contexts.
TestThe frontend runs vitest unit tests, playwright end-to-end tests (with a VITE_AUTH_MODE=dev bypass) and msw-mocked API tests; backend services run their own test suites.
DeployThe live system is a single Debian VPS provisioned by deploy/deploy_to_vps.py, which SSHes in, untars a git archive and runs docker compose up. Flyway migrations run on service startup; nginx terminates TLS.
MonitorPrometheus scrapes /actuator/prometheus and /metrics, feeding Grafana dashboards; monitor-service emits alerts over SSE.

Two deployment paths

The repository documents a GitHub Actions → GCP Artifact Registry → GKE Autopilot path with canary rollout, but the system actually running in production is the single-VPS Docker Compose topology. The release lifecycle described here reflects the live VPS path.


Putting it together

The three lifecycles interlock: a pipeline is authored as YAML, validated in the browser and on the server, scheduled through cron or backfill, executed as a topologically-ordered DAG run, monitored continuously by monitor-service, traced by lineage-service as OpenLineage events, governed and certified by metadata-service, and finally retired with its history retained for audit. Underneath, Flyway migrations evolve the schema on every service startup, and the platform itself moves through build, test, deploy and monitor on each release.

Previous
Connector catalog