DataFlow AI - The AI-native data integration platform.

This page follows one named person, step by step, as she builds a real pipeline in DataFlow AI — from an empty canvas to a finished daily report that business analysts open in their dashboard tool every morning. It is written for readers who have never built an ETL pipeline before, so every term is explained in plain words as it appears.

SAP HANA pipeline — A SAP HANA source pipeline open in Design Studio.

Meet the people in this story

Before we touch a single screen, it helps to know who is involved. Real ETL work is rarely done by one person alone — it is a small relay race.

Person	Role	What they need
Anna Kowalska	Data Engineer	Build the pipeline that moves and reshapes the data
Marek Nowicki	Business Analyst	Open the finished report every morning and read it
Tomasz Wiśniewski	Data Steward	Make sure no personal data leaks and the numbers are trustworthy
Katarzyna Zielińska	Platform Admin	Set up the connections to the source and target systems

The story is told mostly from Anna's point of view, because she does the building. But you will see exactly where the other three step in.

The business need, in one sentence

Every morning at 08:00, Marek needs a fresh report showing yesterday's sales figures by region and product category — and that data lives inside SAP HANA, a system Marek cannot query himself.

Let us unpack that sentence, because it contains the whole problem.

SAP HANA is a large, fast database that the Sales Office (BIURO_SPRZEDAZY) at Polkomtel uses to record every sale. It is powerful, but it is not a friendly place for an analyst to poke around — it speaks a technical "language", it is protected, and running heavy queries on it could slow down the live sales system.
A BI tool (Business Intelligence tool) is the dashboard software Marek already uses — think of it as a screen full of charts and tables he refreshes with a click. BI tools are happiest reading from a data warehouse: a database built specifically for analysis and reporting, kept separate from the live operational systems.
So the gap is: the data is born in SAP HANA, but it needs to live in a warehouse for Marek to use it. Something has to copy it across every night, clean it up, and summarise it.

That "something" is the pipeline Anna is about to build.

In plain terms

A pipeline is a recipe. It says: take these ingredients from this cupboard, prepare them this way, and put the finished dish on that shelf — and do it again automatically every night. Anna writes the recipe once; DataFlow AI cooks it every day without anyone watching.

The shape of the pipeline before we start

It is worth picturing the finished pipeline before building it, the way you would glance at a recipe photo before cooking. Anna's pipeline will have seven stages, and they map exactly onto the classic Extract, Transform, Load idea:

   SAP HANA            DataFlow AI Design Studio                  Warehouse
 +-----------+   E   +------------------------------------+   L   +-----------+
 | SALES     |------>| Filter -> Join -> Aggregate ->     |------>| DAILY_    |
 | tables    |       | Quality checks                     |       | SALES_RPT |
 +-----------+       +------------------------------------+       +-----------+
       Extract                  Transform                            Load

Extract — copy yesterday's rows out of SAP HANA.
Transform — drop test rows, join the sales table to the region/product reference tables, and summarise into per-region, per-category totals.
Load — write the finished summary into a warehouse table the BI tool reads.

Around those three steps sit three more concerns: data quality (are the numbers trustworthy?), scheduling (run it every night automatically), and publishing (make the result visible to the BI tool). We will cover all of them.

Step 1 — Make sure the connections exist

A connection in DataFlow AI is a saved, named, secured doorway to an outside system. It stores the address, the login, and the security settings for one database — so that pipeline builders never have to type passwords or remember server names.

Anna does not create connections herself. For security reasons, connections are set up once by a Platform Admin — Katarzyna — and the passwords are stored encrypted in Google Secret Manager, never shown on screen again.

So before Anna starts, she checks with Katarzyna that two connections are ready in the workspace:

A SAP HANA connection — the source. The DataFlow AI connector for SAP HANA is called sap-hana. It reaches SAP HANA over a driver called ODBC and can read SAP's special analytical views from the _SYS_BIC schema (those are pre-built calculations SAP HANA exposes). It has been tested against Polkomtel's real BIURO_SPRZEDAZY views.
A **warehouse connection** — the target. Polkomtel uses three warehouses depending on the team: **Teradata** (the long-standing DWH-MONA warehouse), **Snowflake**, and **BigQuery**. For this report Anna will load into the warehouse her analytics team uses. The walkthrough below uses **Snowflake** as the example, and a
later shows what changes for Teradata or BigQuery.

How a connection gets created (Katarzyna's side)

So you can see what Anna is relying on, here is the click-path Katarzyna followed earlier:

In the left sidebar, Admin → Connections → "+ New Connection".
Choose the connector type from the grid — for the source, the SAP HANA tile.
Fill the form: hostname, port (SAP HANA uses port 30015 for a tenant database), database name, username, and password. Authentication is set to PASSWORD.
Click "Test Connection". DataFlow AI makes a live connection and shows the round-trip latency plus a small preview of the schema. A green result means the doorway works.
Click "Save". The connection now appears in the Design Studio toolbox for any engineer in the workspace.

Watch out

Creating or editing a connection requires at least the Data Engineer role, and the underlying credentials are never displayed again after saving. If a "Test Connection" fails, it is almost always a network or firewall issue between the platform and SAP HANA — not something Anna can fix from the pipeline screen. She raises it with Katarzyna.

Step 2 — Create the pipeline and open Design Studio

Design Studio is the visual workshop where pipelines are built. It is a drag-and-drop canvas: you place "nodes" (boxes that each do one job) and connect them with lines that carry data from box to box.

Anna creates the pipeline:

Left sidebar → Pipelines → "+ New Pipeline" (the keyboard shortcut Alt+N does the same thing from anywhere).
She types a name: daily_sap_sales_report. Names matter — they show up in monitoring screens and Git history, so she keeps it descriptive.
She adds a short description: "Daily SAP HANA sales summary by region and product category for BI."
For mode, she chooses Batch. Batch means the pipeline runs on a schedule, processes one big load of data, and stops — which is exactly right for a once-a-night report. (The other mode, Streaming, runs continuously 24/7 and is used for real-time work; it is not what we want here.)
She clicks "Create". The canvas opens.

In plain terms

Batch vs. Streaming is just two speeds of ETL. Batch is the milkman: one delivery, same time every day. Streaming is the tap: water flows the instant you turn it on and never stops. A morning sales report is a milkman job.

The Design Studio screen

The canvas has four main areas. Anna will use all of them:

Area	Where	What it is for
Component Palette	Left	The toolbox of node types, grouped into Sources, Transforms, Targets, Quality, and AI. Searchable with `Ctrl+K`.
Canvas	Centre	The work surface where nodes are placed and wired.
Properties Inspector	Right	The settings panel for whichever node is currently selected.
Bottom Panel	Bottom	Data Preview, Console (logs), and Test Results. Toggle it with `Ctrl+J`.

The Studio auto-saves every 30 seconds, so Anna never loses work.

Building the transformation visually in Design Studio.

Step 3 — Extract: read the sales data out of SAP HANA

The first node every pipeline needs is a Source — the box that copies data out of a system.

Anna drags the SAP HANA tile from the Sources group of the palette onto the canvas. A blue source node appears. She clicks it, and the Properties Inspector on the right fills with settings.

She configures three things in the General tab:

Connection — she picks the sap-hana connection Katarzyna prepared.
Table or Query — she could pick a single table from the schema browser, but this report needs rows from one main fact table, so she writes a small query. A query is a typed instruction telling the database exactly which rows to return.
Incremental column — this is the clever part. Anna does not want to copy all sales history every single night; that would be slow and pointless. She wants only yesterday's rows. She sets the incremental column to the sale date and uses a date filter so each run pulls one day.

Her source query looks like this:

SELECT
  sale_id,
  sale_date,
  region_id,
  product_id,
  channel_code,
  net_amount,
  quantity
FROM "_SYS_BIC"."biuro.sprzedazy/SALES_FACT"
WHERE sale_date = '{{yesterday}}'

The {{yesterday}} part is a template function — a placeholder DataFlow AI fills in automatically when the pipeline runs. On the morning of 21 May it becomes 2026-05-20; the next day it becomes 2026-05-21. Anna writes the recipe once and it always asks for "yesterday", whatever today happens to be.

In plain terms

An incremental load is the difference between re-reading an entire book every night and reading only the page that was added today. The book (all of sales history) might be huge; yesterday's page is tiny. Incremental loads keep the pipeline fast and keep the load on SAP HANA gentle.

Before moving on, Anna clicks Preview on the source node. The bottom panel shows up to 100 sample rows fetched live from SAP HANA. She can see real sale records — proof that the doorway works and the columns are named as she expects.

Step 4 — Transform: filter, join, and aggregate

Raw data is almost never report-ready. Anna now adds three transform nodes in a row. Each does one job.

4a. Filter — drop the test rows

The Sales Office sometimes records dummy sales for system testing, flagged with a channel code of TEST. Those must never reach a business report.

Anna drags a Filter node onto the canvas and wires the source node's output into it. (She wires nodes by hovering over the source node until a small circle — an output port — appears on its edge, then dragging a line from that circle to the next node's input port.)

In the Filter node she sets a condition:

channel_code != 'TEST'

Any row where the channel code is TEST is dropped; everything else passes through.

4b. Join — bring in human-readable names

The sales table stores a region_id and a product_id — numbers, not names. Marek's report needs to say "Mazovia" and "Mobile Plans", not "7" and "412". The friendly names live in two small reference tables.

A Join node combines two datasets by matching a shared column — like laying two lists side by side and lining up the rows that share an ID.

Anna adds two more SAP HANA source nodes (one reading the region reference table, one the product reference table), then drags two Join nodes:

The first Join matches the filtered sales rows to the region table on region_id, adding a region_name column.
The second Join matches that result to the product table on product_id, adding product_category.

She uses a LEFT join for both. A LEFT join keeps every sales row even if a matching reference row is somehow missing — so a sale is never silently lost just because a region code was not yet in the reference table.

4c. Aggregate — summarise into the report shape

Marek does not want one row per individual sale — he wants totals. An Aggregator node does this: it groups rows together and calculates a summary number for each group, exactly like a spreadsheet pivot table.

Anna drags an Aggregator node and configures it:

Group by: region_name, product_category
Aggregations:
- total_revenue = SUM of net_amount
- units_sold = SUM of quantity
- transaction_count = COUNT of sale_id

After this node, a day with 90,000 individual sales might collapse to just 60 summary rows — one per region-and-category combination. That is the report.

In plain terms

Think of the three transforms as three kitchen actions. Filter throws out the spoiled ingredients. Join brings the labelled jars next to the unlabelled ones so you know what is what. Aggregate is the final plating — turning a pile of ingredients into a small number of finished portions.

Doing it in SQL instead — the same pipeline, two views

Everything above was built by dragging boxes. But Design Studio has a SQL mode: a button at the top switches the canvas to a code editor showing the same pipeline as SQL. Some engineers prefer to express joins and aggregations as a single SQL statement. Anna glances at the SQL view to double-check the logic:

SELECT
  r.region_name,
  p.product_category,
  SUM(s.net_amount)  AS total_revenue,
  SUM(s.quantity)    AS units_sold,
  COUNT(s.sale_id)   AS transaction_count
FROM sales_fact s
LEFT JOIN dim_region  r ON s.region_id  = r.region_id
LEFT JOIN dim_product p ON s.product_id = p.product_id
WHERE s.channel_code != 'TEST'
GROUP BY r.region_name, p.product_category

The visual canvas and the SQL editor are two windows onto one pipeline definition — change either and the other updates. Anna can work whichever way suits the task.

In plain terms

DataFlow AI has a Push-Down SQL Engine. When the source and target can speak SQL, the platform is smart enough to send the instruction to the database rather than dragging all the data across the network to process it here. SAP HANA has a fast in-memory engine; letting it do the filtering and grouping in place is quicker and cheaper. The platform claims it can push down around 90% of standard analytical work this way — and Anna does not have to do anything special to get it.

Step 5 — Data quality: make the numbers trustworthy

A beautiful report built on bad data is worse than no report. Before the data is allowed to land in the warehouse, Anna adds Quality nodes — automatic checks that inspect the data and can stop the pipeline if something looks wrong.

DataFlow AI offers ten kinds of quality rule. Anna picks four that matter for this report:

Quality check	Rule type	What it catches
Region name is always present	`NOT_NULL`	A join that failed to find a region
Revenue is never negative	`RANGE` (min 0)	A data-entry error or a refund recorded wrongly
The summary has a sensible number of rows	`ROW_COUNT`	An empty or doubled load
Yesterday's data is actually fresh	`FRESHNESS`	The source did not update overnight

She drags a Quality node onto the canvas after the Aggregator and configures these rules in its settings. For each rule she sets a severity — Critical, Warning, or Info — and whether a failure should block the pipeline.

Anna sets the NOT_NULL and RANGE checks to Critical / block: if region names are missing or revenue is negative, the pipeline stops and nothing reaches the warehouse — better no report than a wrong one. She sets the ROW_COUNT check to Warning: it raises an alert but lets the run continue.

Watch out

Quality rules run on every single execution, not just once. This is deliberate. Source systems change without warning — a column gets renamed, a feed arrives late, an upstream bug doubles every row. The quality node is the seatbelt that catches those problems before they become a wrong number on an executive's screen. Tomasz the Data Steward also monitors these scores from the Governance Hub.

Where personal data gets masked

This particular sales report deals with regions and product categories, not customers, so it contains no personal data. But if a pipeline did pull a customer's PESEL (the Polish national ID number) or phone number, Anna would add an Expression node to mask those columns — scrambling or hiding the sensitive part — so the personal detail never reaches the final report. DataFlow AI's PII scanner would also flag such columns automatically for Tomasz to review. Masking happens during the Transform step, on purpose, so sensitive data is dealt with early.

Step 6 — Load: write the result into the warehouse

The final node is a Sink (also called a Target) — the box that writes data into a destination system.

Anna drags the Snowflake tile from the Targets group and wires the Quality node's output into it. In the Properties Inspector she sets:

Connection — the Snowflake warehouse connection.
Table — ANALYTICS.REPORTING.DAILY_SALES_RPT, the warehouse table the BI tool will read.
Write mode — this decides how new data meets old data. The options are:

Write mode	What it does	Good for
Append	Adds new rows next to the existing ones	Logs, history that only grows
Overwrite	Deletes everything, writes fresh	Small lookup tables
Upsert / Merge	Updates rows that already exist, inserts rows that are new	Most daily loads
Delete-Insert	Removes one slice of data, then re-inserts it	Re-loading a single day

Anna picks Delete-Insert, keyed on the report date. Why? Because if yesterday's run is ever re-run (to fix a problem), she wants it to cleanly replace yesterday's slice — not pile a second copy on top. Delete-Insert removes the old day and writes the corrected day.

She clicks Validate in the toolbar. Validation checks that every node is connected, every required field is filled, there are no loops in the DAG, and the schemas line up. A green result means the pipeline is runnable.

Then she clicks Run → Run Now for a first manual test. The nodes light up one by one as data flows through; the Console tab streams logs; a few seconds later the run finishes green. Anna opens the warehouse and sees DAILY_SALES_RPT populated with yesterday's 60 summary rows. The pipeline works.

Step 7 — The pipeline as YAML

Everything Anna built by dragging boxes is stored, behind the scenes, as a single human-readable text file in a format called YAML. Every save is automatically committed to Git, so the pipeline has full version history. Anna can view and edit this YAML directly with the YAML tab — the visual canvas and the YAML stay in sync.

Here is the finished pipeline:

apiVersion: dataflow.polkomtel.com/v1
kind: Pipeline
metadata:
  name: daily-sap-sales-report
  namespace: revenue-assurance
  labels:
    domain: sales
    report: bi-daily
  annotations:
    description: Daily SAP HANA sales summary by region and product category
    owner: anna.kowalska@plk.pl
    sla: "08:00 Europe/Warsaw"
spec:
  schedule: "0 5 * * 1-5 !PL_HOLIDAY"
  timezone: Europe/Warsaw
  enabled: true
  timeout: 1800
  retries: 3
  retryDelay: 300
  parameters:
    - name: report_date
      type: date
      default: "{{yesterday}}"
      description: The sales day to summarise

  nodes:
    - id: src_sales
      type: connector_source
      label: SAP HANA - sales fact
      config:
        connector: sap-hana
        query: >
          SELECT sale_id, sale_date, region_id, product_id,
                 channel_code, net_amount, quantity
          FROM "_SYS_BIC"."biuro.sprzedazy/SALES_FACT"
          WHERE sale_date = '{{parameters.report_date}}'
        incrementalColumn: sale_date
        fetchSize: 5000

    - id: src_region
      type: connector_source
      label: SAP HANA - region reference
      config:
        connector: sap-hana
        table: '"_SYS_BIC"."biuro.sprzedazy/DIM_REGION"'

    - id: src_product
      type: connector_source
      label: SAP HANA - product reference
      config:
        connector: sap-hana
        table: '"_SYS_BIC"."biuro.sprzedazy/DIM_PRODUCT"'

    - id: drop_test_rows
      type: filter
      label: Remove test transactions
      config:
        condition: "channel_code != 'TEST'"

    - id: join_region
      type: joiner
      label: Add region name
      config:
        leftInput: drop_test_rows
        rightInput: src_region
        joinType: LEFT
        on: "region_id"

    - id: join_product
      type: joiner
      label: Add product category
      config:
        leftInput: join_region
        rightInput: src_product
        joinType: LEFT
        on: "product_id"

    - id: summarise
      type: aggregator
      label: Totals by region and category
      config:
        groupBy: [region_name, product_category]
        aggregations:
          - { column: net_amount, function: SUM,   alias: total_revenue }
          - { column: quantity,   function: SUM,   alias: units_sold }
          - { column: sale_id,    function: COUNT, alias: transaction_count }

    - id: quality_gate
      type: quality
      label: Trust checks
      config:
        rules:
          - { type: NOT_NULL, column: region_name, severity: CRITICAL, blockPipeline: true }
          - { type: RANGE,    column: total_revenue, min: 0, severity: CRITICAL, blockPipeline: true }
          - { type: ROW_COUNT, min: 1, severity: WARNING, blockPipeline: false }

    - id: load_warehouse
      type: connector_sink
      label: Snowflake - DAILY_SALES_RPT
      config:
        connector: snowflake
        table: ANALYTICS.REPORTING.DAILY_SALES_RPT
        writeMode: DELETE_INSERT
        upsertKeys: [report_date, region_name, product_category]
        batchSize: 10000

  edges:
    - { from: src_sales,      to: drop_test_rows }
    - { from: drop_test_rows, to: join_region }
    - { from: src_region,     to: join_region }
    - { from: join_region,    to: join_product }
    - { from: src_product,    to: join_product }
    - { from: join_product,   to: summarise }
    - { from: summarise,      to: quality_gate }
    - { from: quality_gate,   to: load_warehouse }

  notifications:
    onSuccess:
      - channel: email
        message: "Daily SAP sales report loaded for {{parameters.report_date}}"
    onFailure:
      - channel: slack
        message: "FAILED: daily SAP sales report for {{parameters.report_date}}"

You do not need to memorise this — Design Studio writes it for you. But it is worth knowing it exists, because it makes the pipeline reviewable, diff-able, and recoverable like any other piece of code.

Step 8 — Schedule the daily run

A pipeline that has to be started by hand every morning is not much of an automation. Anna sets a schedule so DataFlow AI runs it on its own.

In Pipeline Settings (the gear icon) she sets the schedule. You can build it with a visual picker or type a cron expression — a compact code that means "run at these times". Anna uses:

0 5 * * 1-5 !PL_HOLIDAY

Reading that left to right: minute 0, hour 5, every day-of-month, every month, days 1-5 (Monday to Friday). So: 05:00 every weekday. That gives the pipeline three hours of headroom before Marek's 08:00 deadline.

The !PL_HOLIDAY part is a DataFlow AI extension. It means: skip Polish public holidays. The platform has the Polish holiday calendar built in (New Year, Three Kings, Easter Monday, Constitution Day, and the rest), so the report does not run pointlessly on a day the Sales Office is closed.

In plain terms

A cron expression is just a five-slot time pattern: minute, hour, day-of-month, month, day-of-week. It looks cryptic but it is only ever answering one question — "when should this run?". 0 5 * * 1-5 means five o'clock in the morning, Monday to Friday.

If anything goes wrong overnight, the pipeline retries up to three times with increasing delays, and the onFailure notification sends a Slack message — so Anna learns about a problem before Marek does.

Step 9 — Expose the result to the BI tool

The pipeline now drops a clean, trustworthy summary into ANALYTICS.REPORTING.DAILY_SALES_RPT in the warehouse every weekday morning. The last mile is connecting Marek's dashboard to it.

This part happens in the BI tool, not in DataFlow AI — but it is simple, because the hard work is done:

In the BI tool, Marek (or a BI administrator) creates a data source pointing at the Snowflake warehouse.
He selects the DAILY_SALES_RPT table — a small, clean, pre-summarised table. Because the pipeline already filtered, joined, and aggregated the data, the BI tool only has to display it. No heavy lifting happens at dashboard-open time.
He builds the charts once: revenue by region, units by product category, a transaction-count trend.
He sets the dashboard to refresh each morning after 05:30.

From then on, Marek opens his dashboard at 08:00 and sees yesterday's figures — fresh, correct, and fast — without ever touching SAP HANA.

In plain terms

The reason the BI tool feels instant is that the pipeline pushed all the slow work to the night before. Marek's dashboard reads a tiny finished table, not millions of raw sales rows. Good ETL design is mostly about doing the expensive work once, in advance, where nobody is waiting.

Switching the target warehouse

The walkthrough used Snowflake, but Polkomtel runs three warehouses. Swapping the target is a small change — only the Sink node differs.

Warehouse	Connector	Notable detail
Snowflake	`snowflake`	Bulk load via `COPY INTO`; supports `MERGE` for upserts
Teradata (`DWH-MONA`)	`teradata`	Uses the high-speed Teradata Parallel Transporter; pick `multiload` mode for upserts
BigQuery	`bigquery`	Loads via batch load jobs; data stays in the customer's own GCP project in Warsaw

To switch, Anna deletes the Snowflake sink, drags the new warehouse's tile in its place, points it at the right connection and table, and re-validates. The seven-stage shape of the pipeline — extract, filter, join, aggregate, quality, load — does not change at all.

What Anna built, summarised

Stage	Node type	Purpose
1	`connector_source` (sap-hana)	Extract yesterday's sales from SAP HANA
2	`filter`	Drop test transactions
3	`joiner` ×2	Add region names and product categories
4	`aggregator`	Summarise into per-region, per-category totals
5	`quality`	Block the run if the numbers look wrong
6	`connector_sink` (warehouse)	Load the summary into the warehouse
7	schedule + notifications	Run every weekday at 05:00, alert on failure

One recipe, written once by Anna, now feeds Marek's morning dashboard every working day — and Tomasz can see, at any time, that the data is masked where it needs to be and the quality checks are passing. That is a complete ETL use case, end to end.