DataFlow AI - The AI-native data integration platform.

A pipeline on the DataFlow AI Platform travels a long path: it is authored in the Design Studio, validated, scheduled, compiled into a DAG, executed task by task, monitored, traced for lineage, certified by governance and eventually retired. This page follows that path end to end, and also covers the database migration and software release lifecycles that support it.

Overview

Three lifecycles run in parallel on the platform, and they intersect at predictable points.

Lifecycle	What moves through it	Owned by
Pipeline lifecycle	A pipeline definition — draft to retirement	metadata-service + pipeline-engine
Run lifecycle	A single execution of a pipeline	pipeline-engine, observed by monitor-service
Data lifecycle	The datasets a pipeline produces — cataloged, traced, governed	metadata-service + lineage-service

A supporting Flyway migration lifecycle evolves the database schema, and a software release lifecycle ships the platform itself. The remainder of this page treats each in turn.

  AUTHOR ──► VALIDATE ──► SCHEDULE ──► EXECUTE ──► MONITOR ──► LINEAGE ──► GOVERN ──► RETIRE
   (draft)    (DSL +       (cron /      (DAG run)   (alerts,    (dataset/   (certify,   (archive)
              client       backfill)                cost,SLA)   column)     contracts)
              checks)

Stage 1 — Authoring (draft)

A pipeline begins life as a definition. There are three entry points, all of which produce the same artifact: a DataFlow YAML pipeline definition.

Entry point	Tool	Notes
Visual design	`DesignStudio` page, React Flow canvas	Drag-and-drop nodes; the canvas serialises to YAML
Guided creation	`CreatePipelineWizard` (`/pipelines/new`)	Step-by-step wizard for common shapes
Migration import	migration-engine	Converts a legacy Informatica / Alteryx / SSIS / DataStage job to DataFlow YAML
Template instantiation	`PipelineTemplates` gallery	Starts from a ready-made template, e.g. `subscriber-360.yaml`

The YAML DSL is parsed by dsl/PipelineYamlParser.kt against dsl/PipelineYamlSchema.kt in pipeline-engine. Catalog-aware authoring is supported because metadata-service already holds discovered schemas — connectors call discoverSchema() and the resulting tables and columns are persisted in the catalog, so the designer can reference real datasets.

A draft pipeline and its versions are stored as pipelines and pipeline_versions entities in metadata-service (created by Flyway metadata migrations V1–V2). Each save in the Design Studio creates a new version; pipeline-engine's git/PipelineVersionControl.kt and GitRepositoryManager.kt back this with a Git history, and YamlDiffEngine.kt produces the diffs shown in the GitDiffViewer component.

Versioning is Git-backed

Pipeline definitions are version-controlled in a real Git repository managed by pipeline-engine. The VersionHistory and GitDiffViewer components in the Design Studio surface that history; the /api/v1/git endpoints expose repos, history, diff, rollback, branches and merge.

Stage 2 — Validation

Before a pipeline can run, it is validated — twice.

Client-side. The SPA service services/pipelineValidator.ts checks the definition in the browser as the user edits, giving immediate feedback in the Design Studio.
Server-side. When a run is requested, pipeline-engine's validation/PipelineValidator.kt validates the parsed definition, and validation/ParameterResolver.kt resolves built-in variables, environment variables and runtime parameters.

Structural validation continues into DAG construction: dag/DagBuilder.kt builds an ExecutionDAG and rejects definitions with cycles or dangling dependencies.

Data-shape validation is a separate concern handled by governance: data contracts (/api/v1/contracts, Flyway metadata V16) declare the expected schema and semantics of a dataset, and the migration workflow exposes a dedicated ValidationSuitePage for converted pipelines.

Stage 3 — Scheduling

A validated pipeline can be run on demand or placed on a schedule. Scheduling is owned by pipeline-engine's scheduler/ package.

Component	Role
`scheduler/CronScheduler.kt`	Cron-based recurring schedules
`scheduler/BackfillManager.kt`	Backfill jobs over historical partitions
`scheduler/DependencyChain.kt`	Cross-pipeline dependency ordering
`scheduler/RetryPolicy.kt`	Retry behaviour for failed runs

Schedules, backfill jobs and dependencies are persisted as engine_schedules, engine_backfill_jobs / engine_backfill_partitions and dependency-chain entities (Flyway engine V1 and V3). The SchedulerController at /api/v1/scheduler exposes schedule CRUD plus pause and resume, a /next-runs preview, and backfill CRUD with pause, resume and cancel.

Europe/Warsaw by default

Pipeline schedules default to the Europe/Warsaw timezone — the platform is built for Polkomtel, a Polish telecom operator, and GDPR / Polish-compliance modules assume the same.

Pipelines may also be triggered through external orchestrators. The orchestrator/ package provides AirflowClient, AirflowDagGenerator, an AutomateNowAdapter and a WebhookCallbackService, exposed via /api/v1/orchestrator.

Stage 4 — Execution (DAG run)

A run begins when POST /api/v1/pipelines/{id}/run reaches pipeline-engine's ExecutionController. The execution/PipelineRunner.kt orchestrates the run:

  POST /pipelines/{id}/run
        │
        ▼
  PipelineRunner
    1. PipelineYamlParser   — parse YAML DSL
    2. ParameterResolver    — resolve params / env / runtime vars
    3. PipelineValidator    — validate the definition
    4. DagBuilder           — build ExecutionDAG (cycle / dangling detection)
    5. execute tasks in topological order
         ├─ parallel within each DAG level (fixed thread pool)
         └─ ExecutionContext supports cooperative cancellation
    6. aggregate results
        │
        ├─ ExecutionEventPublisher / PipelineRunLogPublisher  → live events
        ├─ OpenLineageEmitter  → lineage-service /openlineage/events
        └─ on task failure → SelfHealingService

Each run is recorded as an engine_execution_runs entity (with a run_id and a task_states JSONB column, Flyway engine V4). For streaming or large-compute work the engine submits jobs to external engines — flink/FlinkJobSubmitter or spark/DataprocJobSubmitter.

Self-healing

When a task fails, healing/SelfHealingService.kt runs FailureClassifier to categorise the failure and applies a matching strategy from RecoveryStrategies. Healing attempts are persisted (engine_healing_attempts, Flyway engine V8) and surfaced through the HealingController and the SelfHealingDashboard page.

Run states

A run progresses through a defined set of states.

        ┌──────────┐
        │  QUEUED  │  run requested, awaiting a slot
        └────┬─────┘
             ▼
        ┌──────────┐
        │ RUNNING  │  tasks executing in topological order
        └────┬─────┘
             │
   ┌─────────┼──────────┬──────────────┐
   ▼         ▼          ▼              ▼
┌────────┐┌────────┐┌──────────┐ ┌──────────┐
│SUCCESS ││ FAILED ││CANCELLED │ │ HEALING  │
└────────┘└───┬────┘└──────────┘ └────┬─────┘
              │  cancellation via      │ recovery strategy
              │  ExecutionContext      │ applied; may
              │                        ▼ re-enter RUNNING
              └──────────────────────────┘

Run state	Meaning
`QUEUED`	The run has been requested and is waiting for an execution slot.
`RUNNING`	The DAG is executing; tasks run in topological order, parallel within a level.
`SUCCESS`	All tasks completed successfully.
`FAILED`	A task failed and could not be recovered.
`CANCELLED`	The run was cancelled cooperatively via `ExecutionContext`.
`HEALING`	`SelfHealingService` is applying a recovery strategy; the run may re-enter `RUNNING`.

Cancellation is exposed at POST /api/v1/runs/{runId}/cancel; run status at GET /api/v1/runs/{runId}.

Stage 5 — Monitoring

Once a run completes, monitor-service ingests its metrics and evaluates them. Monitoring is continuous, not a one-off step.

Concern	monitor-service component	Surfaced at
Run status & logs	`PipelineRunLogService`, `PipelineRunLogController`	`/monitor/runs`, `LogViewerPage`
Alerts	`AlertService`, `AlertDefinitionService`, `SseAlertController`	`/monitor/alerts`
Cost	`PipelineCostService`, `CostAnomalyDetector`	`/monitor/costs`
SLA / SLO	`SlaBurnRateService`	`/monitor/sla`
Freshness	`FreshnessTracker`	`/monitor/freshness`
Performance	`PerformanceMetricsService`	`/monitor/performance`
Anomaly baselines	`AnomalyBaselineService`	`/monitor/anomalies`

Run events stream live to the SPA two ways: a WebSocket route (/api/v1/runs/{runId}/stream, served by PipelineStatusWebSocket) carries task-state and log updates to the log viewer, while Server-Sent Events (SseAlertController at /api/v1/monitor/sse) carry alerts and metrics. When a run produces bad data, service/QuarantineService.kt and the data_quarantine records (Flyway engine V7) hold it back from downstream consumption until a steward approves, rejects or edits it from the DataQuarantine page.

Stage 6 — Lineage capture

As a run executes, pipeline-engine's lineage/OpenLineageEmitter.kt emits OpenLineage RunEvents. lineage-service ingests them through its OpenLineageEventController at /api/v1/lineage/openlineage and builds dataset- and column-level lineage.

  pipeline-engine                       lineage-service
  ───────────────                       ───────────────
  OpenLineageEmitter ──RunEvent──►  OpenLineageEventController
                                          │
                                    LineageGraphBuilder
                                          │
                              ┌───────────┼────────────┐
                              ▼           ▼            ▼
                       DatasetLineage  ColumnLineage  ImpactAnalyzer
                              │           │            │
                              └─────  PropagationWorker ┘
                                  (tags / PII propagation)

lineage-service has Flyway disabled — it reuses the lineage tables that metadata-service creates (Flyway metadata V4 and V48). It records dataset and column lineage, supports findLineageAsOf time-travel queries, runs ImpactAnalyzer for downstream impact analysis, and uses PropagationWorker to propagate tags and PII classifications across the graph. The lineage graph is presented in the LineageExplorer page.

Stage 7 — Governance & certification

Datasets and pipelines are governed by metadata-service. Governance is where data is certified fit for use.

Activity	Mechanism	UI
Review & approval	`PolicyEngine`, `ApprovalWorkflow`, governance reviews	`ReviewQueue`
Data contracts	`/api/v1/contracts`, Flyway metadata V16	`DataContracts`
Quality monitoring	`QualityController`, quality rules / results / scores	`QualityMonitoring`
Endorsements	`/api/v1/governance/endorsements`	governance hub
Business glossary	glossary entities, Flyway metadata V10	`BusinessGlossary`
Schema evolution	`/api/v1/governance/schema-changes`	`SchemaEvolution`
Audit trail	hash-chained immutable audit log, `AuditChainVerifierService`	`AuditTrail`
Compliance	`GdprService`, `PolishComplianceService`, DSAR, masking	`ComplianceDashboard`, `DsarRequestsPage`

The audit log is a hash-chained immutable record (Flyway metadata V27 and V29) verified by AuditChainVerifierService. PII is protected throughout: the gateway's PiiMaskingFilter masks responses, metadata-service runs DynamicMaskingService and masking policies (Flyway metadata V17), and a PII classifier feeds the lineage-service PII propagation.

GDPR data-subject access requests follow their own sub-lifecycle: a DSAR (gdpr_dsar_requests, Flyway metadata V1) is raised, and erasure is orchestrated by pipeline-engine's erasure/ErasureOrchestrator.kt, which records erasure_runs and erasure_steps (Flyway engine V9).

Stage 8 — Retirement

A pipeline that is no longer needed is retired. Its definition and version history remain in metadata-service for audit, while the schedule is paused or removed through the SchedulerController and the pipeline is no longer eligible for execution.

Data retirement is governed separately. RetentionPolicyEnforcer in metadata-service applies retention policies, GDPR erasure removes personal data on request, and the data marketplace (Flyway metadata V20) can withdraw a data product so consumers no longer discover it. Historical run records and lineage are retained for impact analysis and compliance reporting even after a pipeline is retired.

  ACTIVE ──► DEPRECATED ──► RETIRED ──► (definition + history retained for audit)
              │                │
       schedule paused   removed from execution;
       consumers warned  data product withdrawn;
                         retention policy applies

Pipeline state summary

Bringing the stages together, a pipeline definition moves through these states.

  DRAFT ──► VALIDATED ──► SCHEDULED ──► ACTIVE ──► DEPRECATED ──► RETIRED
   │            │            │           │
   │            │            │           └─ each trigger spawns a RUN
   │            │            │              (QUEUED → RUNNING → SUCCESS/FAILED)
   │            │            └─ cron / backfill / dependency / orchestrator
   │            └─ DSL + DAG + contract validation
   └─ authored in Design Studio / wizard / migration / template

Pipeline state	Meaning
`DRAFT`	Authored, not yet validated; saved as a `pipeline_version`.
`VALIDATED`	Passed DSL, DAG and contract validation; runnable on demand.
`SCHEDULED`	Bound to a cron schedule, backfill or dependency chain.
`ACTIVE`	In regular operation; each trigger produces a run.
`DEPRECATED`	Marked for removal; schedule paused, consumers warned.
`RETIRED`	No longer executed; definition and history retained for audit.

Flyway database migration lifecycle

The platform's database schema evolves through Flyway migrations. The shared PostgreSQL database is logically partitioned by per-service migration histories.

Service	Flyway location	History table	Range
metadata-service	`db/migration/metadata`	`flyway_schema_history`	V1–V54
pipeline-engine	`db/migration/engine`	`flyway_schema_history_engine`	V1–V9
monitor-service	`db/migration/monitor`	`flyway_schema_history_monitor`	V9–V25
lineage-service	— Flyway disabled —	—	reuses metadata tables

A migration follows this lifecycle from authoring to live schema:

  WRITE  ──►  COMMIT  ──►  STARTUP  ──►  APPLY  ──►  RECORD  ──►  VALIDATE
  V__NN     in db/         service       run         insert      ddl-auto
  numbered  migration/     boots         pending     row into    checks the
  SQL file  {service}/                   migrations  history     live schema
                                                     table       matches

Write — a developer adds a numbered V__NN SQL file under the service's db/migration folder.
Commit — the file ships with the service build.
Startup — on service boot, Flyway compares the migration files with the service's history table.
Apply — pending migrations run in order.
Record — each applied migration is recorded as a row in the per-service history table.
Validate — metadata-service runs JPA ddl-auto: validate; pipeline-engine and lineage-service use ddl-auto: none.

Permissive Flyway settings

The compose deployment runs Flyway with BASELINE_ON_MIGRATE, OUT_OF_ORDER, REPAIR_ON_MIGRATE, VALIDATE_ON_MIGRATE: false and BASELINE_VERSION: 20.1. These let a shared database with overlapping, historically diverged migration histories converge on startup — but they are a fragility risk, and migration ordering matters because lineage-service depends on metadata-service having already applied its lineage migrations (V4 and V48).

Software release lifecycle

The platform itself ships through a build → test → deploy → monitor cycle.

  BUILD ──────► TEST ──────► DEPLOY ──────► MONITOR
  Gradle JVM    vitest /     docker compose   Prometheus
  modules +     playwright   up on the VPS    scrape →
  Vite frontend / msw +      (git archive,    Grafana;
  + Python      backend      Flyway on        SSE alerts
  services      tests        startup)

Phase	What happens
Build	The JVM services share one `platform/Dockerfile` with a `BUILD_MODULE` build-arg selecting the Gradle module. The Vite frontend and the two Python services have their own build contexts.
Test	The frontend runs `vitest` unit tests, `playwright` end-to-end tests (with a `VITE_AUTH_MODE=dev` bypass) and `msw`-mocked API tests; backend services run their own test suites.
Deploy	The live system is a single Debian VPS provisioned by `deploy/deploy_to_vps.py`, which SSHes in, untars a `git archive` and runs `docker compose up`. Flyway migrations run on service startup; nginx terminates TLS.
Monitor	Prometheus scrapes `/actuator/prometheus` and `/metrics`, feeding Grafana dashboards; monitor-service emits alerts over SSE.

Two deployment paths

The repository documents a GitHub Actions → GCP Artifact Registry → GKE Autopilot path with canary rollout, but the system actually running in production is the single-VPS Docker Compose topology. The release lifecycle described here reflects the live VPS path.

Putting it together

The three lifecycles interlock: a pipeline is authored as YAML, validated in the browser and on the server, scheduled through cron or backfill, executed as a topologically-ordered DAG run, monitored continuously by monitor-service, traced by lineage-service as OpenLineage events, governed and certified by metadata-service, and finally retired with its history retained for audit. Underneath, Flyway migrations evolve the schema on every service startup, and the platform itself moves through build, test, deploy and monitor on each release.