ETL on DataFlow
How DataFlow AI runs ETL
This page opens the lid on the DataFlow AI platform and shows, step by step, how it actually carries out the work of ETL. It begins with a plain-language overview that anyone can follow, then goes deeper into each part for readers who want the technical detail.
If you have not yet read What is ETL?, it is worth doing so first — this page assumes you already know what Extract, Transform, Load, a pipeline, a connector, and a DAG are.
The big picture — what happens when a pipeline runs
Before any detail, here is the whole story in one breath.
A pipeline begins life as a recipe — a written description of every step. When the time comes to run it (either because the clock reached a scheduled time, or because a person pressed "Run"), the platform reads the recipe, checks it carefully for mistakes, works out the correct order of steps, and then carries them out. As it works, it shows live progress on screen. If a step fails, it tries to recover automatically. When it finishes, it records what happened — how many rows were processed, how long it took, and whether the data passed its quality checks.
Here is that same story as a numbered sequence:
1. The recipe is written → a YAML file describing every step
2. The recipe is checked → validated against rules; mistakes caught early
3. A trigger fires → the schedule reaches its time, or a person clicks Run
4. The steps are ordered → the DAG engine works out what runs, and in what order
5. The steps are executed → each node does its job; independent steps run together
6. Progress is shown live → rows-in, rows-out, and status stream to the screen
7. Problems are handled → failed steps are retried; recovery is attempted
8. Quality is checked → the data is scored against its rules
9. The result is recorded → row counts, duration, and outcome are saved
The rest of this page walks through each of these stages in turn.
In plain terms
A pipeline is a written recipe. Running it means: read the recipe, check it for mistakes, figure out the order of steps, do them, watch for trouble, and write down what happened. Everything below is just detail on how each of those things is done.
Stage 1 — How a pipeline is defined: the YAML recipe
Every DataFlow AI pipeline is stored as a recipe written in a file format called YAML.
YAML is a plain-text format designed to be readable by ordinary people, not just computers. It uses simple indentation (spacing) to show how things are grouped, rather than the brackets and symbols of most programming languages. You can open a YAML file and broadly understand it even if you have never seen one before.
DataFlow AI's particular style of YAML — its set of rules for what a pipeline recipe may contain — is called the DSL, which stands for Domain-Specific Language. A DSL is simply a small, focused language built for one job; here, the one job is "describe an ETL pipeline." The style is deliberately modelled on the way the Kubernetes system (a popular tool for running software) describes its configuration, which is why every pipeline file begins with the lines apiVersion and kind.
The shape of a pipeline file
A pipeline recipe has two main parts: metadata (information about the pipeline) and spec (the actual specification of what it does).
apiVersion: dataflow.polkomtel.com/v1
kind: Pipeline
metadata:
name: daily-sales-load
namespace: sales-analytics
annotations:
description: "Loads yesterday's sales from Oracle into Snowflake"
owner: "data-team"
spec:
schedule: "0 2 * * 1-5" # run at 02:00, Monday to Friday
timezone: "Europe/Warsaw"
enabled: true
parameters:
- name: load_date
type: date
default: "{{yesterday}}"
timeout: 3600 # give up after 3600 seconds (one hour)
retries: 3 # if a step fails, try it up to 3 more times
nodes:
- id: read-oracle
type: connector_source
label: "Read sales from Oracle"
config:
connector: oracle-prod
table: SALES
- id: drop-tests
type: filter
label: "Remove test rows"
config:
condition: "status != 'TEST'"
- id: write-snowflake
type: connector_sink
label: "Write to Snowflake"
config:
connector: snowflake-dw
table: FACT_SALES
writeMode: UPSERT
upsertKeys: [sale_id]
edges:
- from: read-oracle
to: drop-tests
- from: drop-tests
to: write-snowflake
Reading that file from top to bottom: it is a pipeline called daily-sales-load; it runs at 2 AM on weekdays in Warsaw time; it has three steps (nodes) — read from Oracle, drop test rows, write to Snowflake — and the edges connect them in order. Even with no technical training, the shape of the recipe is followable.
The 11 node types
Each step in a pipeline is a node, and every node has a type that decides what job it does. DataFlow AI's DSL defines 11 node types. Eight are for batch pipelines and two are for streaming; one (connector_source) is shared.
| Node type | What it does |
|---|---|
connector_source | Reads data out of a registered connector — the Extract step. |
connector_sink | Writes data into a connector — the Load step. Carries a writeMode such as INSERT, UPSERT, or OVERWRITE. |
expression | Transforms columns one by one — calculating new values or reformatting existing ones. |
filter | Drops rows that fail a condition; rejected rows can optionally be sent to a separate destination for inspection. |
aggregator | Summarises rows into groups — the "total sales per region" job. |
joiner | Stitches two upstream sets of data together by a shared key (an INNER, LEFT, RIGHT, or FULL join). |
router | Fans data out down several paths, each chosen by its own condition. |
sorter | Re-orders rows by one or more columns. |
union | Stacks rows from several upstream nodes into one combined set, optionally removing duplicates. |
cdc_source | Consumes a live stream of database changes captured by Debezium — used in streaming pipelines. |
streaming_sink | Writes a continuous stream out to a destination such as Kafka. |
Parameters and template functions
A good recipe should not have to be re-written every day just because the date changed. DataFlow AI handles this with parameters and template functions.
A parameter is a named slot in the recipe whose value is filled in when the pipeline runs — for example a load_date. A template function is a small built-in helper, written inside double curly braces, that fills such a slot automatically. Useful examples include {{today}}, {{yesterday}}, {{now}}, {{date_add(today,-7)}} (the date a week ago), and {{month_start(today)}} (the first day of this month). There is also {{vault:secret/path#key}}, which safely fetches a password from a secure store without ever writing it into the file.
Every recipe is version-controlled
Each time a pipeline is saved, DataFlow AI does two things automatically. First, it validates the recipe (covered in the next stage). Second, it stores the new version in a Git repository — Git being the standard tool for keeping a full history of changes to files. Every saved version gets a number and a unique fingerprint (a git_commit_sha), and the previous version is kept. This means you can always look back at what a pipeline used to do, compare two versions, and undo a change if needed.
Stage 2 — How a recipe is checked: schema validation
Before a pipeline is ever allowed to run — in fact, before it is even saved — its YAML recipe is validated.
Validation means the recipe is checked, automatically, against a strict set of rules called a JSON Schema. (A JSON Schema is a formal description of what a valid file must look like — which fields are required, what type each field must be, and so on.) Validation catches mistakes such as: a node missing a required setting, a node type spelled wrong, two nodes with the same identifier, an arrow pointing to a node that does not exist, or a node left disconnected from the rest of the diagram.
Catching these mistakes early — at design time and again before execution — is far better than discovering them halfway through a run at 3 AM. A node that fails validation is marked with a red marker on the canvas. You are allowed to save a pipeline that still has errors, but you cannot run it until every error is resolved.
Stage 3 — How a pipeline is triggered
A pipeline run begins because something triggers it. There are three kinds of trigger.
- A schedule. The most common trigger — the clock reaches a time the pipeline was set to run at.
- A manual run. A person opens the pipeline and presses "Run." This is used for testing, for re-running after a fix, and for one-off jobs.
- An event. Something happens that the pipeline was told to wait for — a new file arrives in a folder, a message lands on a Kafka topic, or another pipeline finishes successfully.
Batch scheduling and the Polish holiday calendar
Scheduled pipelines are governed by a part of the platform called the Scheduler. Schedules are written in cron, a long-established and very compact way of describing repeating times. A cron expression is five fields — minute, hour, day-of-month, month, day-of-week. For example, 0 2 * * 1-5 means "at minute 0 of hour 2, every day-of-month, every month, on days-of-week 1 to 5" — in other words, 2 AM on weekdays. There are also friendly shortcuts: @daily, @hourly, and @monthly.
DataFlow AI adds something specific to its Polish customer: a Polish public holiday calendar. Many business reports should not run on public holidays — there is no fresh business activity to process, and running anyway can produce misleading "zero" figures. DataFlow AI's scheduler understands the official Polish public holidays defined by the law known as the Ustawa o dniach wolnych od pracy: New Year, Three Kings (6 January), Easter Monday, May Day, Constitution Day (3 May), Corpus Christi, the Assumption (15 August), All Saints (1 November), Independence Day (11 November), and Christmas (25–26 December).
To use it, you add the special marker !PL_HOLIDAY to the end of a cron expression. So 0 2 * * 1-5 !PL_HOLIDAY means "2 AM on weekdays — but skip Polish public holidays." Schedules default to the Europe/Warsaw timezone.
0 2 * * 1-5 !PL_HOLIDAY
│ │ │ │ │ └── skip official Polish public holidays
│ │ │ │ └────── day of week: 1–5 (Monday to Friday)
│ │ │ └──────── month: every
│ │ └────────── day of month: every
│ └──────────── hour: 02
└────────────── minute: 00
Pipelines that depend on other pipelines
Pipelines can also be chained: pipeline B can be told to wait until pipeline A has finished successfully before it starts. This is described in its own kind of recipe (kind: PipelineDependency). Each dependency specifies which pipeline it waits for, what condition counts as ready (success, or success-or-skipped), and a maxWaitMinutes — how long to keep waiting before giving up. To avoid the problem of many pipelines all waking at exactly the same instant and overwhelming the systems (a problem nicknamed the thundering herd), the scheduler can add a small random delay called jitter, spreading the starts out.
Stage 4 — How the steps are ordered: the DAG execution engine
When a pipeline is triggered, the part of the platform called the Pipeline Engine takes over. Its first job is to turn the flat list of nodes and edges in the recipe into an executable plan — and that plan is a DAG.
As explained in What is ETL?, a DAG (Directed Acyclic Graph) is a one-way, loop-free flowchart. The component that builds it, the DAG Builder, does three important things:
- It works out the dependencies. By reading the edges, it learns that, say, the Aggregator cannot start until the Filter has finished, because the Filter feeds it.
- It checks for loops. Using a well-known method called a topological sort (sometimes Kahn's algorithm), it confirms there are no cycles. If a recipe accidentally contained a loop, the run would be rejected here rather than running forever.
- It finds steps that can run side by side. If two branches of the pipeline do not depend on each other, they can run in parallel — at the same time — which is faster than doing everything one after another.
Once the DAG is built, the engine executes the nodes in topological order: a node runs only after every node feeding into it has finished. Independent nodes run together.
How much runs at once
Running things in parallel is powerful, but it must be bounded so the platform does not overload itself or the source systems. DataFlow AI uses three configurable limits:
| Limit | Default | What it controls |
|---|---|---|
PIPELINE_MAX_PARALLEL_NODES | 8 | How many nodes of a single pipeline run may execute at the same time. |
PIPELINE_MAX_CONCURRENT_EXECUTIONS | 20 | How many separate pipeline runs one engine instance handles at once. |
PIPELINE_EXECUTION_TIMEOUT_SECONDS | 3600 | The overall time limit; a run exceeding one hour is cancelled. |
Live progress tracking
While the DAG executes, the engine continuously reports progress. For each node it records a status (WAITING, RUNNING, or COMPLETED), how many rows went in (rowsIn), how many came out (rowsOut), and how long it took (durationMs). For the whole pipeline it reports an overall progress percentage, which node is currently active, running totals of rows read, written, and rejected, and an estimated completion time. This information is what you see updating live on the Monitor screen as a run unfolds.
Stage 5 — How the steps do their work: transformations and push-down SQL
Each transformation node does its job in one of two very different ways, and the platform chooses cleverly between them.
The transformation types
The transformation nodes — expression, filter, aggregator, joiner, router, sorter, union, and the surrogate-key sequence_generator — each perform one well-defined reshaping job, exactly as described in What is ETL?. The expression and filter nodes evaluate their logic using either SQL or a small expression language called SpEL (the Spring Expression Language — a compact way of writing conditions like rowCount > 0).
Push-down SQL — sending the work to the data
Here is one of DataFlow AI's most important ideas. Moving large volumes of data across a network is slow and expensive. So, wherever possible, the platform does not drag the data out to transform it. Instead it uses its Push-Down SQL Engine.
The Push-Down SQL Engine rewrites your transformation steps as native SQL — the database's own language — and sends that short instruction to the database, telling the database to do the work itself, right where the data already lives. The data never travels.
To do this it must translate into each database's own dialect of SQL, because no two databases speak SQL identically (one writes COALESCE, another writes NVL, for the same idea). The engine uses a technique called an AST visitor pattern — an Abstract Syntax Tree being a structured, tree-shaped representation of the transformation that can be re-written cleanly for each target. It supports 8 dialects: Teradata, Snowflake, PostgreSQL, SQL Server, Oracle, SAP HANA, Databricks (Spark SQL), and Sybase ASE.
The engine decides to use push-down when conditions favour it. The clearest case is when the source and the destination are the same database — then the entire transformation can happen inside that one system and the data never leaves it at all. Push-down is also chosen when the source database supports the SQL features needed (filters, grouping, joins, window functions), and when the estimated volume of data is large — above a configurable threshold, by default 100,000 rows. It is not used for the streaming-specific node types, which are handled by a different engine. The platform reports that it can push down more than 90% of the transformation work on typical analytical pipelines.
WITHOUT push-down WITH push-down
───────────────── ──────────────
Database ──millions of rows──▶ Database ◀──short SQL instruction──
DataFlow transforms Database transforms itself
──millions of rows──▶ ──only the small result──▶
Destination Destination
(slow: data crosses the network) (fast: only the instruction travels)
The batch compute engine — Spark on Dataproc
When work cannot be pushed into a database — for example, when joining a database table with a folder of files, or when the transformation is too complex — DataFlow AI processes the data itself using Apache Spark, an engine designed to crunch very large datasets by spreading the work across many computers at once.
DataFlow AI runs Spark on Dataproc Serverless, a Google Cloud service that supplies Spark computing power on demand — anywhere from 4 to 500 worker machines — with no clusters for anyone to manage. It includes optimisations such as Adaptive Query Execution, which lets Spark adjust its plan partway through a run based on what it actually sees in the data.
Stage 6 — Streaming pipelines: Flink and Debezium
Everything above describes a batch run — a scheduled job that processes a chunk of data and stops. Streaming pipelines work differently: they run continuously and process each record within seconds.
The streaming engine — Apache Flink
Streaming pipelines are executed by Apache Flink (version 1.18/1.19), an engine purpose-built for never-ending flows of data. DataFlow AI runs Flink on Google Cloud's Kubernetes service. Flink brings several capabilities that batch processing does not need:
- Windows. Because a stream never ends, you cannot "sum it all up" — there is no end. Instead you group events into windows of time. A tumbling window is a fixed, non-overlapping slice (e.g. every 5 minutes); a sliding window overlaps; a session window groups bursts of activity separated by gaps.
- Event-time processing with watermarks. Records sometimes arrive slightly out of order. Flink tracks the event time (when something actually happened, not when it arrived) and uses watermarks — markers that say "we believe we have now seen everything up to this moment" — to decide when a window is complete.
- Checkpoints. Periodically (by default every 60 seconds) Flink saves a complete snapshot of its in-progress state to cloud storage. This is called a checkpoint. If the job crashes, it restarts from the last checkpoint and loses nothing — a guarantee known as exactly-once processing, meaning every record is processed once and only once, never dropped and never double-counted.
The change feed — Debezium CDC
A streaming pipeline that keeps a copy of a database in sync needs a live feed of every change. As explained in What is ETL?, that feed is produced by Change Data Capture (CDC), and DataFlow AI provides it through Debezium.
Debezium reads each source database's internal transaction log and turns every insert, update, and delete into a message, published onto a Kafka topic. Each message carries a "before" and "after" image of the row, so downstream pipelines can see exactly what changed. The five supported databases, with their typical end-to-end delays, are:
| Database | How CDC reads the changes | Typical delay |
|---|---|---|
| Oracle | Reads the redo logs via LogMiner | under 5 seconds |
| PostgreSQL | Reads the write-ahead log (logical replication) | under 2 seconds |
| SQL Server | Reads SQL Server's native CDC tables | under 5 seconds |
| MySQL | Reads the binary log (binlog) | under 2 seconds |
| MongoDB | Reads MongoDB's change streams | under 3 seconds |
A telecom-specific feature: decoding Call Detail Records
DataFlow AI includes one streaming feature built specifically for Polkomtel: a decoder for Call Detail Records (CDRs). Telecom network equipment writes CDRs in a compact binary format called ASN.1 (using encoding rules known as BER/DER). This is not human-readable and not understood by ordinary tools.
DataFlow AI ships a purpose-built decoder, written in the Kotlin programming language, that reads this binary format and translates it into normal, readable rows — pulling out fields such as the calling and called phone numbers, the call duration, the data volume, and the time. It understands the record formats defined by the international 3GPP TS 32.298 standard, covering ordinary calls, SMS, data sessions, and modern voice-over-LTE. The decoded records are written out as Parquet files (an efficient columnar file format), at a throughput of around one million CDRs per minute.
In plain terms
Streaming = a pipeline that never stops. Flink is the engine that runs it; windows are how it groups an endless flow into countable pieces; checkpoints are its safety snapshots; Debezium is the live feed of database changes that feeds it. The CDR decoder is a special translator that turns the telecom network's cryptic binary call records into ordinary readable data.
Stage 7 — When things go wrong: error handling and self-healing
In the real world, things fail. A network connection drops; a database is briefly overloaded; a file is momentarily locked. A good ETL platform expects this and recovers gracefully rather than simply collapsing. DataFlow AI has three layers of protection.
Retry with exponential backoff
When a step fails for a transient reason (a temporary glitch rather than a real flaw), the engine simply tries again. By default it retries up to 3 times. Crucially, it does not retry instantly — it waits, and the wait grows each time. This is called exponential backoff: the delay roughly doubles on each attempt. The idea is that if a database is briefly overloaded, hammering it again immediately would only make things worse; waiting a little — and then a little longer — gives it time to recover.
| Setting | Default | Meaning |
|---|---|---|
retries | 3 | Maximum retry attempts after a transient failure. |
retryDelay | 300 seconds | The starting delay before the first retry. |
retryBackoffMultiplier | 2.0 | The delay is multiplied by this each time (doubling). |
retryMaxDelay | 1800 seconds | A cap so the delay never grows beyond 30 minutes. |
The circuit breaker
If a particular system keeps failing no matter how many times it is retried, continuing to hammer it is pointless and harmful. DataFlow AI borrows an idea from household electrics: the circuit breaker.
Just as an electrical circuit breaker "trips" and cuts the power when it detects a fault, DataFlow AI's circuit breaker trips after a set number of consecutive failures (by default 5) and stops sending requests to that system entirely. After a cool-down period (by default 120 seconds) it enters a cautious "half-open" state, lets a single test request through, and — if that succeeds — closes again and resumes normal traffic. This protects both the failing system and the rest of the platform from a cascade of failures.
Error routing and the Dead Letter Queue
Sometimes the problem is not a system but the data itself — a handful of rows are malformed and cannot be processed. Rather than abandoning the whole run for the sake of a few bad rows, DataFlow AI can route those rows aside. Every transform node has an error output, and a filter node can send rejected rows to a separate destination. Bad records can be collected into a Dead Letter Queue — a holding area (the term comes from postal services, for letters that cannot be delivered) — where they wait, with metadata describing what went wrong, to be inspected and corrected later. The good rows continue on unhindered.
Self-healing and AI diagnosis
When a run does fail outright, DataFlow AI's AI Copilot automatically steps in to diagnose it. It examines the error trace, the configuration of the step that failed, the shape of the input data, and the history of similar past failures. It then presents, in plain language, the most likely root cause, the evidence for that conclusion, and a step-by-step suggested fix — sometimes including a corrected snippet of YAML or SQL.
It is worth being precise here: in the current version, "self-healing" means the retry and circuit-breaker mechanisms above, plus this AI diagnosis. A future capability called AI Auto-Heal is planned, in which the Copilot would detect recurring failures and apply a fix automatically after a human approves it — but in the current release a person still applies the fix.
Stage 8 — Checking the result: the data quality engine
Producing data is not enough; the data must also be trustworthy. After (and during) a run, DataFlow AI's data quality engine checks the data against rules you have defined.
A set of rules is attached to a dataset using its own kind of recipe (kind: DataQualityRuleSet), and the rules are evaluated automatically every time a pipeline runs. There are 10 built-in rule types:
| Rule type | What it checks |
|---|---|
NOT_NULL | A column is never empty (or empty no more than an allowed percentage). |
UNIQUE | A column — or a combination of columns — has no duplicate values. |
RANGE | A numeric or date value stays between a minimum and a maximum. |
REGEX | A text value matches a required pattern — used, for example, to check that a PESEL or phone number is correctly formed. |
REFERENTIAL | Every value in one table also exists in another — for example, every order's customer ID is a real customer. |
CUSTOM_SQL | A custom check you write yourself; it must return zero rows to pass. |
COMPLETENESS | At least a minimum proportion of a column is filled in. |
CONSISTENCY | A cross-column rule holds — for example, an end date is never before a start date. |
FRESHNESS | The data is recent — its newest timestamp is no older than an allowed age. |
ANOMALY | A statistical measure has not drifted strangely away from its historical normal. |
Each rule has a severity (critical, high, or medium) and can be marked with blockPipeline, meaning a failure of that rule stops the pipeline rather than just warning. The engine also keeps a statistical profile of every dataset — counts of nulls and distinct values, minimum and maximum, mean, standard deviation, a histogram of the value distribution, and the most common values — and compares each new run against that history to spot drift. Finally, it rolls everything into a single quality score from 0 to 100, so anyone can judge a dataset's reliability at a glance.
The ANOMALY rule and the drift detection use simple, well-understood statistics: a value is flagged if it lies several standard deviations away from its historical average (a standard deviation being a standard measure of how spread out a set of numbers is). This is honest statistical anomaly detection comparing today's figures against a real historical baseline — not guesswork.
Putting it all together
Here, finally, is the complete journey of a single pipeline run, with the responsible part of the platform named at each step:
WRITTEN The pipeline DSL (YAML) describes nodes and edges.
│
CHECKED The recipe is validated against the JSON Schema.
│
TRIGGERED The Scheduler fires it — cron time (Polish-holiday-aware),
│ a manual Run, or an event.
│
PLANNED The Pipeline Engine's DAG Builder orders the nodes,
│ rejects loops, and finds steps that can run in parallel.
│
EXECUTED Each node runs. The Push-Down SQL Engine sends work into
│ the database where it can; Spark on Dataproc handles the
│ rest; Flink + Debezium handle streaming and CDC.
│
WATCHED Per-node row counts, status, and progress stream live
│ to the Monitor screen.
│
PROTECTED Failures are retried with exponential backoff; the
│ circuit breaker guards failing systems; bad rows are
│ routed to a Dead Letter Queue; the AI Copilot diagnoses
│ outright failures in plain language.
│
SCORED The data quality engine evaluates its 10 rule types and
│ produces a 0–100 quality score.
│
RECORDED Row counts, duration, outcome, and a new version in Git
are all saved for history and audit.
Every part of this model exists to serve the two promises the platform makes: that ETL should be reliable — surviving the everyday failures of real systems — and approachable — explainable, in plain language, to the people who use it, not only to the engineers who built it.
To see these ideas in action through the visual designer, continue to the Design Studio guide; to learn how to build streaming pipelines specifically, see the Streaming guide.