Introduction
What is ETL?
This page explains, from the very beginning and in everyday words, what "ETL" means, why a large company needs it, and where the DataFlow AI platform fits in. You do not need any technical background to read it — every special word is explained the first time it appears.
Start with a simple problem
Imagine a very large company — for example, Polkomtel, the company behind the "Plus" mobile phone network in Poland. Like every big business, Polkomtel does not keep all of its information in one place. Instead, different parts of the company each keep their own records, in their own filing cabinets, in their own way.
Here are four of those "filing cabinets":
- The customer relationship system (often shortened to CRM, which simply means the software that keeps track of customers) holds people's names, their contact details, and the contracts they have signed.
- The billing system holds invoices and records of payments.
- The network switches — the powerful computers inside the telephone network — write down a record of every single phone call and every internet session. Each of these records is called a Call Detail Record, or CDR for short. Think of a CDR as a tiny receipt that says "this phone number called that phone number, at this time, for this many seconds."
- The sales reporting tools hold figures about revenue — how much money the company is earning.
Now imagine a person whose job is to study the business — a business analyst — wants to answer one simple-sounding question:
"Which of our customers are most likely to leave us and switch to a competitor next month?"
That answer cannot come from any single filing cabinet. To work it out, the analyst needs to combine the customer's contract details (from the CRM), their payment history (from the billing system), how much they have been using their phone lately (from the CDRs), and more. The information is scattered, and each system "speaks a different language" — meaning each one stores its data using a different technology and a different layout.
This is the problem that ETL solves. ETL is the automated plumbing that gathers scattered information together so that questions like the one above can actually be answered.
In plain terms
ETL is the behind-the-scenes process that copies information out of many separate computer systems, tidies it up, and gathers it into one place where it can be studied. Without it, every interesting business question would require a person to manually open a dozen systems and copy figures into a spreadsheet by hand — every single day.
What the three letters mean
ETL stands for three steps that happen in order: Extract, Transform, Load.
A good way to picture the whole process is to think of cooking a meal:
- Extract is fetching the ingredients from various cupboards, the fridge, and the pantry.
- Transform is preparing the ingredients — washing, peeling, chopping, mixing, and cooking them so they become a finished dish.
- Load is serving the finished dish onto a plate so people can eat it.
Let us walk through each step properly, using the telecom company as our example.
Extract — "Copy the raw data out"
The first step is Extract. This means connecting to one of the source systems — say, the billing database — and reading a copy of the rows of data that are needed.
The word "copy" matters. Extract does not change anything in the original system and does not move the data away from it. It simply reads a duplicate. This is deliberate: the billing system is busy doing its real job (sending invoices to customers), and we must not disturb it.
A useful telecom example: every night, a process extracts the call records that the network switches wrote during the previous day — taking a copy of every "receipt" of every call and data session.
In the DataFlow AI platform, this step is performed by a building block called a Source Connector (the word "connector" is explained in detail further down). You tell the Source Connector which system to read from, which table of data you want, and — optionally — you can add a filter such as "only give me yesterday's records" so you do not needlessly copy data you already have.
Transform — "Clean it up and reshape it"
Raw data, freshly extracted, is almost never ready to use. It is messy, in the same way that vegetables straight from the garden are covered in soil. Common problems include:
- Missing values — for example, a customer record with no email address filled in.
- Inconsistent formats — the same phone number written five different ways across five systems (with a country code, without one, with spaces, with dashes, and so on).
- Duplicates — the very same customer accidentally entered into the system twice.
- Data spread across many tables that needs to be joined — stitched together — before it makes sense. The customer's name might be in one table and their payments in another; to see them side by side, the two tables must be matched up by a shared identifier such as a customer number.
The Transform step fixes all of this. It is where the data is cleaned, corrected, combined, and reshaped into one consistent, tidy form.
In DataFlow AI, the Transform step is done by dragging small building blocks called transformation nodes onto a design canvas. Each node does one specific job. The main ones are:
| Transformation node | What it does, in plain words |
|---|---|
| Filter | Throws away rows you do not want — for example, "drop every test record." |
| Expression | Calculates a new value or reformats an existing one — for example, turning every phone number into one standard format. |
| Joiner | Stitches two tables together by a shared key, so related information sits side by side. |
| Aggregator | Summarises many rows into fewer — for example, "total sales per region" turns thousands of individual sales into one figure per region. |
| Router | Sends rows down different paths depending on their contents. |
| Sorter | Puts the rows in a chosen order. |
| Union | Stacks rows from several sources into one combined set. |
| Sequence Generator | Creates a fresh, running number for each row, useful as a unique identifier. |
A particularly important transform job in a telecom company is masking — deliberately hiding sensitive personal details. Poland uses a national identity number called a PESEL, and phone numbers themselves are personal data. During the Transform step, these values can be masked (for example, replaced with ********) so that sensitive details never reach the final reports that ordinary staff can see.
Load — "Put the finished data in the warehouse"
The final step is Load. The cleaned, combined, tidy data is now written into its destination — a central system designed for analysis. That destination is usually called a data warehouse.
A data warehouse is a special kind of database built not for the day-to-day running of the business, but for asking questions about the business. Where the billing system is optimised for sending one invoice quickly, a data warehouse is optimised for scanning millions of records to produce a report. Common data warehouse products you will hear named in this documentation are Snowflake, Teradata, and Databricks.
Once the data is loaded into the warehouse, the analyst's reporting tools — often called Business Intelligence tools, or BI for short — can finally query it and produce the dashboards and answers the business needs.
In DataFlow AI, this step is performed by a building block called a Sink Connector. ("Sink" is the standard ETL word for a destination — the place data flows into, the opposite of a source.) When you configure a Sink Connector you must choose a write mode, which tells the platform how to write the new data:
| Write mode | What it does |
|---|---|
| Append | Adds the new rows alongside the existing ones, keeping everything. |
| Overwrite | Throws away everything that was there and replaces it completely. |
| Upsert (also called Merge) | Updates rows that already exist and inserts rows that are new. This is the most common choice for a daily refresh. |
| Delete-Insert | Removes a defined slice of data (for example, "all of yesterday's records") and then re-inserts that slice fresh. |
The word upsert is a blend of "update" and "insert" — it does both at once, deciding row by row which is needed.
In plain terms
Extract = copy the raw data out. Transform = clean it and combine it. Load = write the finished result into the warehouse where it can be studied. ETL is just those three steps, done automatically, over and over, on a schedule.
ELT — the same letters in a different order
You may also see the term ELT — the same three words, but with Load and Transform swapped: Extract, Load, Transform.
The difference is where the cleaning happens. In classic ETL, the data is transformed on the way to the warehouse, by the ETL platform itself. In ELT, the raw data is loaded into a powerful warehouse first, and the transformation is then done inside that warehouse, using the warehouse's own engine.
Why would anyone do it that way? Because modern data warehouses are extremely powerful. If the warehouse can do the heavy cleaning faster than the ETL platform can, it is wasteful to drag all the data out, clean it elsewhere, and push it back.
DataFlow AI does this automatically with a feature called the Push-Down SQL Engine. Instead of pulling millions of rows out of a database to transform them, the platform rewrites your transformation as an instruction in SQL — the standard language databases understand — and sends only that short instruction to the database, telling it to do the work itself, where the data already lives. Large volumes of data therefore never have to travel across the network. The platform reports that it can "push down" more than 90% of standard analytical work in this way, across 8 different database dialects (a dialect being one database's particular flavour of SQL).
In plain terms
ELT is ETL with the cleaning done inside the warehouse instead of on the way there. DataFlow AI chooses the faster option automatically by sending the work to wherever the data already sits, rather than dragging the data to the work.
What is a "pipeline"?
So far we have talked about steps. The complete, end-to-end sequence of steps — Extract this, Transform it like so, Load it there — is called a pipeline.
A pipeline is best thought of as a recipe. Once you have written the recipe down, you no longer need to think about it: it can be run automatically, every night, without anyone watching. A single pipeline says, in effect:
"Take the call records and customer table from these systems, drop the test rows, mask the PESEL numbers, join the two together, summarise by region, and write the result into the Snowflake warehouse."
In DataFlow AI a pipeline is drawn as a diagram of connected boxes. Each box is a node (one step), and the arrows between boxes are edges (they show the order in which steps run, and which step feeds which). This box-and-arrow diagram has a formal name: a DAG.
DAG stands for Directed Acyclic Graph. That sounds intimidating, so here it is broken down:
- Graph — a set of boxes joined by arrows.
- Directed — the arrows point one way; data flows in a definite direction.
- Acyclic — there are no loops. You can never follow the arrows and end up back where you started. This guarantees the pipeline always finishes and never chases its own tail forever.
So "the pipeline is a DAG" simply means "the pipeline is a one-way, loop-free flowchart of steps."
┌──────────┐ ┌──────────┐ ┌────────────┐ ┌──────────┐
│ Source │ ──▶ │ Filter │ ──▶ │ Aggregator│ ──▶ │ Sink │
│ (Extract)│ │ (drop │ │ (summarise │ │ (Load) │
│ │ │ test │ │ by region)│ │ │
└──────────┘ │ rows) │ └────────────┘ └──────────┘
└──────────┘
a pipeline drawn as a DAG — boxes are nodes, arrows are edges
Batch versus streaming — the two speeds of ETL
A pipeline can run at one of two very different speeds. DataFlow AI supports both.
Batch — "do it all at once, on a schedule"
A batch pipeline runs on a timetable — for example, every night at 2 AM. When its time comes, it wakes up, processes one big "batch" of data all at once (say, all of yesterday's call records), writes the result, and then goes back to sleep until tomorrow.
Batch is the right choice for nightly warehouse loads, daily or weekly summary reports, and importing files. It is the most common kind of pipeline.
Streaming — "do it continuously, the instant data arrives"
A streaming pipeline never sleeps. It runs continuously, 24 hours a day, and processes each new record within seconds of that record appearing. Where batch is like emptying the post box once a day, streaming is like a conveyor belt that carries each letter forward the moment it lands.
Streaming is the right choice when freshness matters: keeping a backup copy of a database perfectly in step with the original in real time, reacting to network events as the switches produce them, or powering live dashboards.
DataFlow AI's streaming pipelines are powered by two well-known open-source technologies: Apache Flink (an engine specialised in processing continuous flows of data) and Debezium (a tool that watches databases for changes — explained next).
| Batch | Streaming | |
|---|---|---|
| How often it runs | On a schedule (e.g. nightly) | Continuously, never stops |
| How fresh the data is | As fresh as the last scheduled run | Seconds old |
| Typical use | Nightly warehouse loads, reports, file imports | Live replication, real-time event processing |
| Powered by | Scheduled jobs on the Spark engine | Apache Flink + Debezium |
What is "Change Data Capture" (CDC)?
Streaming raises a practical question: how does a pipeline know, second by second, that something in a source database has changed? The clumsy way would be to re-scan the entire table over and over, looking for differences. For a table with millions of rows, that is hopelessly slow and wasteful.
The clever way is called Change Data Capture, almost always shortened to CDC.
Every serious database keeps a private internal diary — called a transaction log — in which it writes down every change it makes: every row inserted, every row updated, every row deleted. The database keeps this diary for its own safety and recovery purposes.
CDC works by reading that diary. Instead of re-scanning the whole table, a CDC tool simply watches the transaction log and picks up each change the instant the database records it. It is the difference between re-counting every book in a library every hour, versus simply standing at the desk and noting each book as it is borrowed or returned.
DataFlow AI provides CDC through the Debezium tool, and supports it for five database types: Oracle, PostgreSQL, Microsoft SQL Server, MySQL, and MongoDB. The captured changes typically reach their destination in under 5 seconds.
In plain terms
CDC means "watch the database's own change-diary and pick up each edit the moment it happens" — instead of repeatedly re-reading the whole table to spot what is new. It is what makes real-time streaming fast and affordable.
What is a "connector"?
We have used the word connector several times. Here is the full explanation.
Every source system and every destination system is different. An Oracle database, a folder of CSV files, an Apache Kafka stream, and a Snowflake warehouse each "speak a different language" and need to be talked to in their own special way — with their own login method, their own commands, their own quirks.
A connector is a ready-made adapter — a piece of software that already knows how to talk to one particular kind of system. Think of the travel adapters you pack for a holiday: one plug shape for the UK, another for Europe, another for the US. A connector is exactly that, but for data systems. It lets DataFlow AI plug into a database, a file store, or a streaming system without you needing to know the technical details of that system.
DataFlow AI ships with a large library of connectors, including:
- Databases: Oracle, PostgreSQL, Microsoft SQL Server, MySQL, Teradata, SAP HANA, Sybase ASE, MongoDB.
- Cloud data warehouses: Snowflake, Databricks, Google BigQuery.
- Files: CSV (and other delimited text), Excel, JSON, Parquet, XML.
- Streaming: Apache Kafka, and a Debezium-based Change Data Capture connector.
- Business applications: Salesforce, SAP ERP, ServiceNow, and a generic connector for any REST API.
- A telecom-specific connector for reading Call Detail Records in their special binary format — a feature built specifically for Polkomtel.
Once an administrator has set up a connector (entering the address of the system and the login details, which are then stored securely and never shown again), that connector becomes a reusable building block that appears in the pipeline designer, ready to be dragged onto a canvas.
A few more words you will meet
These terms appear throughout DataFlow AI. Here is a plain-language definition of each.
Schema
A schema is the shape of a set of data — the list of its columns and what kind of value each column holds. For a table of customers, the schema might be: a column called name holding text, a column called date_of_birth holding a date, and a column called balance holding a number. The schema is the blueprint; the rows are the actual data poured into that blueprint. When the blueprint changes — for example, a new column is added — that is called schema evolution, and a good ETL platform can cope with it gracefully.
Data lineage
Data lineage is the family tree of a piece of data. It answers the question: "Where did this number come from, and what was done to it along the way?"
If a report shows "total revenue for the Warsaw region," lineage lets you trace that single figure backwards: it was summed from these rows, which were joined from these two tables, which were extracted from these two source systems, with these transformations applied at each step. DataFlow AI can trace lineage right down to the individual column level — for example showing that a warehouse column CUST_ID was produced by taking a source column id, trimming the spaces, and converting it to upper case.
Lineage is invaluable for two reasons. First, trust: when someone asks "can we believe this number?", lineage shows exactly how it was built. Second, impact analysis: before changing a source system, you can ask "what reports downstream depend on this?" and see everything that would be affected.
Data quality
Data quality is a measure of how trustworthy and fit-for-use a set of data is. Poor-quality data — full of gaps, duplicates, and impossible values — leads to poor business decisions.
DataFlow AI includes a data quality engine that checks data against rules you define. A rule might say "the email column must never be empty," or "the age column must be between 0 and 120," or "every customer ID in the orders table must also exist in the customers table." Each set of data is then given a quality score — a single number, typically out of 100 — so anyone can see at a glance how reliable it is.
Data governance
Data governance is the set of rules, approvals, and controls that make sure data is handled correctly and legally — especially sensitive personal data. Because Polkomtel handles the personal details of millions of subscribers, it must obey strict laws, most notably the European GDPR (in Poland known as RODO) data-protection regulation. Governance features make sure, for example, that personal data is masked where it should be, that it is deleted when a customer requests it, and that someone with authority approves a pipeline before it runs in production.
Why companies need ETL
Pulling all of this together, here is why a company like Polkomtel cannot function without ETL:
- The data is scattered. No single system holds the whole picture. Only by combining systems can the important questions be answered.
- The systems must not be disturbed. The billing and network systems have a critical day job. ETL takes copies and works on those copies, leaving the originals untouched.
- Raw data is messy. Real-world data is full of gaps, duplicates, and inconsistencies. It must be cleaned before it can be trusted.
- It must happen automatically and repeatedly. Reports are needed every single day. A manual process would be slow, error-prone, and impossible to scale. A pipeline, once written, runs itself.
- It must be traceable and legal. Regulators and auditors demand to know where data came from and how personal information is protected. ETL platforms record lineage and enforce governance rules.
Where DataFlow AI fits
DataFlow AI is an ETL and ELT platform built by the software company auraliscode sp. z o.o specifically for Polkomtel ("Plus" telecom in Poland), as part of a project to modernise the company's central data warehouse.
What sets it apart is its goal of making ETL usable not only by specialist engineers but also by non-technical people. It does this in several ways:
- A visual drag-and-drop designer. You build a pipeline by dragging boxes onto a canvas and connecting them with arrows — no programming required.
- An AI Copilot. You can describe what you want in plain English (or plain Polish) — for example, "load yesterday's sales from Oracle into Snowflake and remove the test records" — and the AI assistant drafts the pipeline for you.
- Plain-language error messages. When something goes wrong, the platform explains the problem in everyday words and suggests a fix, rather than printing a cryptic technical error.
- A role-adaptive interface. The screen adjusts itself depending on who you are — an engineer, an analyst, a data steward, or an administrator — showing only the controls that are relevant to your job.
In short: ETL is the universal problem of moving and tidying data so a business can use it. DataFlow AI is a platform that solves that problem, and tries hard to make the solution approachable for everyone, not just specialists.
The next page, How DataFlow AI runs ETL, opens the lid and shows step by step how the platform actually carries out everything described here.