End-to-end use cases

Use case: SAP HANA to BI reporting

This page follows one named person, step by step, as she builds a real pipeline in DataFlow AI — from an empty canvas to a finished daily report that business analysts open in their dashboard tool every morning. It is written for readers who have never built an ETL pipeline before, so every term is explained in plain words as it appears.

SAP HANA pipeline
A SAP HANA source pipeline open in Design Studio.

Meet the people in this story

Before we touch a single screen, it helps to know who is involved. Real ETL work is rarely done by one person alone — it is a small relay race.

PersonRoleWhat they need
Anna KowalskaData EngineerBuild the pipeline that moves and reshapes the data
Marek NowickiBusiness AnalystOpen the finished report every morning and read it
Tomasz WiśniewskiData StewardMake sure no personal data leaks and the numbers are trustworthy
Katarzyna ZielińskaPlatform AdminSet up the connections to the source and target systems

The story is told mostly from Anna's point of view, because she does the building. But you will see exactly where the other three step in.

The business need, in one sentence

Every morning at 08:00, Marek needs a fresh report showing yesterday's sales figures by region and product category — and that data lives inside SAP HANA, a system Marek cannot query himself.

Let us unpack that sentence, because it contains the whole problem.

  • SAP HANA is a large, fast database that the Sales Office (BIURO_SPRZEDAZY) at Polkomtel uses to record every sale. It is powerful, but it is not a friendly place for an analyst to poke around — it speaks a technical "language", it is protected, and running heavy queries on it could slow down the live sales system.
  • A BI tool (Business Intelligence tool) is the dashboard software Marek already uses — think of it as a screen full of charts and tables he refreshes with a click. BI tools are happiest reading from a data warehouse: a database built specifically for analysis and reporting, kept separate from the live operational systems.
  • So the gap is: the data is born in SAP HANA, but it needs to live in a warehouse for Marek to use it. Something has to copy it across every night, clean it up, and summarise it.

That "something" is the pipeline Anna is about to build.

In plain terms

A pipeline is a recipe. It says: take these ingredients from this cupboard, prepare them this way, and put the finished dish on that shelf — and do it again automatically every night. Anna writes the recipe once; DataFlow AI cooks it every day without anyone watching.


The shape of the pipeline before we start

It is worth picturing the finished pipeline before building it, the way you would glance at a recipe photo before cooking. Anna's pipeline will have seven stages, and they map exactly onto the classic Extract, Transform, Load idea:

   SAP HANA            DataFlow AI Design Studio                  Warehouse
 +-----------+   E   +------------------------------------+   L   +-----------+
 | SALES     |------>| Filter -> Join -> Aggregate ->     |------>| DAILY_    |
 | tables    |       | Quality checks                     |       | SALES_RPT |
 +-----------+       +------------------------------------+       +-----------+
       Extract                  Transform                            Load
  • Extract — copy yesterday's rows out of SAP HANA.
  • Transform — drop test rows, join the sales table to the region/product reference tables, and summarise into per-region, per-category totals.
  • Load — write the finished summary into a warehouse table the BI tool reads.

Around those three steps sit three more concerns: data quality (are the numbers trustworthy?), scheduling (run it every night automatically), and publishing (make the result visible to the BI tool). We will cover all of them.


Step 1 — Make sure the connections exist

A connection in DataFlow AI is a saved, named, secured doorway to an outside system. It stores the address, the login, and the security settings for one database — so that pipeline builders never have to type passwords or remember server names.

Anna does not create connections herself. For security reasons, connections are set up once by a Platform Admin — Katarzyna — and the passwords are stored encrypted in Google Secret Manager, never shown on screen again.

So before Anna starts, she checks with Katarzyna that two connections are ready in the workspace:

  1. A SAP HANA connection — the source. The DataFlow AI connector for SAP HANA is called sap-hana. It reaches SAP HANA over a driver called ODBC and can read SAP's special analytical views from the _SYS_BIC schema (those are pre-built calculations SAP HANA exposes). It has been tested against Polkomtel's real BIURO_SPRZEDAZY views.
  2. A **warehouse connection** — the target. Polkomtel uses three warehouses depending on the team: **Teradata** (the long-standing DWH-MONA warehouse), **Snowflake**, and **BigQuery**. For this report Anna will load into the warehouse her analytics team uses. The walkthrough below uses **Snowflake** as the example, and a

    later shows what changes for Teradata or BigQuery.
    1. How a connection gets created (Katarzyna's side)

      So you can see what Anna is relying on, here is the click-path Katarzyna followed earlier:

      1. In the left sidebar, Admin → Connections → "+ New Connection".
      2. Choose the connector type from the grid — for the source, the SAP HANA tile.
      3. Fill the form: hostname, port (SAP HANA uses port 30015 for a tenant database), database name, username, and password. Authentication is set to PASSWORD.
      4. Click "Test Connection". DataFlow AI makes a live connection and shows the round-trip latency plus a small preview of the schema. A green result means the doorway works.
      5. Click "Save". The connection now appears in the Design Studio toolbox for any engineer in the workspace.

      Watch out

      Creating or editing a connection requires at least the Data Engineer role, and the underlying credentials are never displayed again after saving. If a "Test Connection" fails, it is almost always a network or firewall issue between the platform and SAP HANA — not something Anna can fix from the pipeline screen. She raises it with Katarzyna.


      Step 2 — Create the pipeline and open Design Studio

      Design Studio is the visual workshop where pipelines are built. It is a drag-and-drop canvas: you place "nodes" (boxes that each do one job) and connect them with lines that carry data from box to box.

      Anna creates the pipeline:

      1. Left sidebar → Pipelines"+ New Pipeline" (the keyboard shortcut Alt+N does the same thing from anywhere).
      2. She types a name: daily_sap_sales_report. Names matter — they show up in monitoring screens and Git history, so she keeps it descriptive.
      3. She adds a short description: "Daily SAP HANA sales summary by region and product category for BI."
      4. For mode, she chooses Batch. Batch means the pipeline runs on a schedule, processes one big load of data, and stops — which is exactly right for a once-a-night report. (The other mode, Streaming, runs continuously 24/7 and is used for real-time work; it is not what we want here.)
      5. She clicks "Create". The canvas opens.

      In plain terms

      Batch vs. Streaming is just two speeds of ETL. Batch is the milkman: one delivery, same time every day. Streaming is the tap: water flows the instant you turn it on and never stops. A morning sales report is a milkman job.

      The Design Studio screen

      The canvas has four main areas. Anna will use all of them:

      AreaWhereWhat it is for
      Component PaletteLeftThe toolbox of node types, grouped into Sources, Transforms, Targets, Quality, and AI. Searchable with Ctrl+K.
      CanvasCentreThe work surface where nodes are placed and wired.
      Properties InspectorRightThe settings panel for whichever node is currently selected.
      Bottom PanelBottomData Preview, Console (logs), and Test Results. Toggle it with Ctrl+J.

      The Studio auto-saves every 30 seconds, so Anna never loses work.

      Design Studio
      Building the transformation visually in Design Studio.

      Step 3 — Extract: read the sales data out of SAP HANA

      The first node every pipeline needs is a Source — the box that copies data out of a system.

      Anna drags the SAP HANA tile from the Sources group of the palette onto the canvas. A blue source node appears. She clicks it, and the Properties Inspector on the right fills with settings.

      She configures three things in the General tab:

      1. Connection — she picks the sap-hana connection Katarzyna prepared.
      2. Table or Query — she could pick a single table from the schema browser, but this report needs rows from one main fact table, so she writes a small query. A query is a typed instruction telling the database exactly which rows to return.
      3. Incremental column — this is the clever part. Anna does not want to copy all sales history every single night; that would be slow and pointless. She wants only yesterday's rows. She sets the incremental column to the sale date and uses a date filter so each run pulls one day.

      Her source query looks like this:

      SELECT
        sale_id,
        sale_date,
        region_id,
        product_id,
        channel_code,
        net_amount,
        quantity
      FROM "_SYS_BIC"."biuro.sprzedazy/SALES_FACT"
      WHERE sale_date = '{{yesterday}}'
      

      The {{yesterday}} part is a template function — a placeholder DataFlow AI fills in automatically when the pipeline runs. On the morning of 21 May it becomes 2026-05-20; the next day it becomes 2026-05-21. Anna writes the recipe once and it always asks for "yesterday", whatever today happens to be.

      In plain terms

      An incremental load is the difference between re-reading an entire book every night and reading only the page that was added today. The book (all of sales history) might be huge; yesterday's page is tiny. Incremental loads keep the pipeline fast and keep the load on SAP HANA gentle.

      Before moving on, Anna clicks Preview on the source node. The bottom panel shows up to 100 sample rows fetched live from SAP HANA. She can see real sale records — proof that the doorway works and the columns are named as she expects.


      Step 4 — Transform: filter, join, and aggregate

      Raw data is almost never report-ready. Anna now adds three transform nodes in a row. Each does one job.

      4a. Filter — drop the test rows

      The Sales Office sometimes records dummy sales for system testing, flagged with a channel code of TEST. Those must never reach a business report.

      Anna drags a Filter node onto the canvas and wires the source node's output into it. (She wires nodes by hovering over the source node until a small circle — an output port — appears on its edge, then dragging a line from that circle to the next node's input port.)

      In the Filter node she sets a condition:

      channel_code != 'TEST'
      

      Any row where the channel code is TEST is dropped; everything else passes through.

      4b. Join — bring in human-readable names

      The sales table stores a region_id and a product_id — numbers, not names. Marek's report needs to say "Mazovia" and "Mobile Plans", not "7" and "412". The friendly names live in two small reference tables.

      A Join node combines two datasets by matching a shared column — like laying two lists side by side and lining up the rows that share an ID.

      Anna adds two more SAP HANA source nodes (one reading the region reference table, one the product reference table), then drags two Join nodes:

      • The first Join matches the filtered sales rows to the region table on region_id, adding a region_name column.
      • The second Join matches that result to the product table on product_id, adding product_category.

      She uses a LEFT join for both. A LEFT join keeps every sales row even if a matching reference row is somehow missing — so a sale is never silently lost just because a region code was not yet in the reference table.

      4c. Aggregate — summarise into the report shape

      Marek does not want one row per individual sale — he wants totals. An Aggregator node does this: it groups rows together and calculates a summary number for each group, exactly like a spreadsheet pivot table.

      Anna drags an Aggregator node and configures it:

      • Group by: region_name, product_category
      • Aggregations:
        • total_revenue = SUM of net_amount
        • units_sold = SUM of quantity
        • transaction_count = COUNT of sale_id

      After this node, a day with 90,000 individual sales might collapse to just 60 summary rows — one per region-and-category combination. That is the report.

      In plain terms

      Think of the three transforms as three kitchen actions. Filter throws out the spoiled ingredients. Join brings the labelled jars next to the unlabelled ones so you know what is what. Aggregate is the final plating — turning a pile of ingredients into a small number of finished portions.

      Doing it in SQL instead — the same pipeline, two views

      Everything above was built by dragging boxes. But Design Studio has a SQL mode: a button at the top switches the canvas to a code editor showing the same pipeline as SQL. Some engineers prefer to express joins and aggregations as a single SQL statement. Anna glances at the SQL view to double-check the logic:

      SELECT
        r.region_name,
        p.product_category,
        SUM(s.net_amount)  AS total_revenue,
        SUM(s.quantity)    AS units_sold,
        COUNT(s.sale_id)   AS transaction_count
      FROM sales_fact s
      LEFT JOIN dim_region  r ON s.region_id  = r.region_id
      LEFT JOIN dim_product p ON s.product_id = p.product_id
      WHERE s.channel_code != 'TEST'
      GROUP BY r.region_name, p.product_category
      

      The visual canvas and the SQL editor are two windows onto one pipeline definition — change either and the other updates. Anna can work whichever way suits the task.

      In plain terms

      DataFlow AI has a Push-Down SQL Engine. When the source and target can speak SQL, the platform is smart enough to send the instruction to the database rather than dragging all the data across the network to process it here. SAP HANA has a fast in-memory engine; letting it do the filtering and grouping in place is quicker and cheaper. The platform claims it can push down around 90% of standard analytical work this way — and Anna does not have to do anything special to get it.


      Step 5 — Data quality: make the numbers trustworthy

      A beautiful report built on bad data is worse than no report. Before the data is allowed to land in the warehouse, Anna adds Quality nodes — automatic checks that inspect the data and can stop the pipeline if something looks wrong.

      DataFlow AI offers ten kinds of quality rule. Anna picks four that matter for this report:

      Quality checkRule typeWhat it catches
      Region name is always presentNOT_NULLA join that failed to find a region
      Revenue is never negativeRANGE (min 0)A data-entry error or a refund recorded wrongly
      The summary has a sensible number of rowsROW_COUNTAn empty or doubled load
      Yesterday's data is actually freshFRESHNESSThe source did not update overnight

      She drags a Quality node onto the canvas after the Aggregator and configures these rules in its settings. For each rule she sets a severityCritical, Warning, or Info — and whether a failure should block the pipeline.

      Anna sets the NOT_NULL and RANGE checks to Critical / block: if region names are missing or revenue is negative, the pipeline stops and nothing reaches the warehouse — better no report than a wrong one. She sets the ROW_COUNT check to Warning: it raises an alert but lets the run continue.

      Watch out

      Quality rules run on every single execution, not just once. This is deliberate. Source systems change without warning — a column gets renamed, a feed arrives late, an upstream bug doubles every row. The quality node is the seatbelt that catches those problems before they become a wrong number on an executive's screen. Tomasz the Data Steward also monitors these scores from the Governance Hub.

      Where personal data gets masked

      This particular sales report deals with regions and product categories, not customers, so it contains no personal data. But if a pipeline did pull a customer's PESEL (the Polish national ID number) or phone number, Anna would add an Expression node to mask those columns — scrambling or hiding the sensitive part — so the personal detail never reaches the final report. DataFlow AI's PII scanner would also flag such columns automatically for Tomasz to review. Masking happens during the Transform step, on purpose, so sensitive data is dealt with early.


      Step 6 — Load: write the result into the warehouse

      The final node is a Sink (also called a Target) — the box that writes data into a destination system.

      Anna drags the Snowflake tile from the Targets group and wires the Quality node's output into it. In the Properties Inspector she sets:

      1. Connection — the Snowflake warehouse connection.
      2. TableANALYTICS.REPORTING.DAILY_SALES_RPT, the warehouse table the BI tool will read.
      3. Write mode — this decides how new data meets old data. The options are:
      Write modeWhat it doesGood for
      AppendAdds new rows next to the existing onesLogs, history that only grows
      OverwriteDeletes everything, writes freshSmall lookup tables
      Upsert / MergeUpdates rows that already exist, inserts rows that are newMost daily loads
      Delete-InsertRemoves one slice of data, then re-inserts itRe-loading a single day

      Anna picks Delete-Insert, keyed on the report date. Why? Because if yesterday's run is ever re-run (to fix a problem), she wants it to cleanly replace yesterday's slice — not pile a second copy on top. Delete-Insert removes the old day and writes the corrected day.

      She clicks Validate in the toolbar. Validation checks that every node is connected, every required field is filled, there are no loops in the DAG, and the schemas line up. A green result means the pipeline is runnable.

      Then she clicks Run → Run Now for a first manual test. The nodes light up one by one as data flows through; the Console tab streams logs; a few seconds later the run finishes green. Anna opens the warehouse and sees DAILY_SALES_RPT populated with yesterday's 60 summary rows. The pipeline works.


      Step 7 — The pipeline as YAML

      Everything Anna built by dragging boxes is stored, behind the scenes, as a single human-readable text file in a format called YAML. Every save is automatically committed to Git, so the pipeline has full version history. Anna can view and edit this YAML directly with the YAML tab — the visual canvas and the YAML stay in sync.

      Here is the finished pipeline:

      apiVersion: dataflow.polkomtel.com/v1
      kind: Pipeline
      metadata:
        name: daily-sap-sales-report
        namespace: revenue-assurance
        labels:
          domain: sales
          report: bi-daily
        annotations:
          description: Daily SAP HANA sales summary by region and product category
          owner: anna.kowalska@plk.pl
          sla: "08:00 Europe/Warsaw"
      spec:
        schedule: "0 5 * * 1-5 !PL_HOLIDAY"
        timezone: Europe/Warsaw
        enabled: true
        timeout: 1800
        retries: 3
        retryDelay: 300
        parameters:
          - name: report_date
            type: date
            default: "{{yesterday}}"
            description: The sales day to summarise
      
        nodes:
          - id: src_sales
            type: connector_source
            label: SAP HANA - sales fact
            config:
              connector: sap-hana
              query: >
                SELECT sale_id, sale_date, region_id, product_id,
                       channel_code, net_amount, quantity
                FROM "_SYS_BIC"."biuro.sprzedazy/SALES_FACT"
                WHERE sale_date = '{{parameters.report_date}}'
              incrementalColumn: sale_date
              fetchSize: 5000
      
          - id: src_region
            type: connector_source
            label: SAP HANA - region reference
            config:
              connector: sap-hana
              table: '"_SYS_BIC"."biuro.sprzedazy/DIM_REGION"'
      
          - id: src_product
            type: connector_source
            label: SAP HANA - product reference
            config:
              connector: sap-hana
              table: '"_SYS_BIC"."biuro.sprzedazy/DIM_PRODUCT"'
      
          - id: drop_test_rows
            type: filter
            label: Remove test transactions
            config:
              condition: "channel_code != 'TEST'"
      
          - id: join_region
            type: joiner
            label: Add region name
            config:
              leftInput: drop_test_rows
              rightInput: src_region
              joinType: LEFT
              on: "region_id"
      
          - id: join_product
            type: joiner
            label: Add product category
            config:
              leftInput: join_region
              rightInput: src_product
              joinType: LEFT
              on: "product_id"
      
          - id: summarise
            type: aggregator
            label: Totals by region and category
            config:
              groupBy: [region_name, product_category]
              aggregations:
                - { column: net_amount, function: SUM,   alias: total_revenue }
                - { column: quantity,   function: SUM,   alias: units_sold }
                - { column: sale_id,    function: COUNT, alias: transaction_count }
      
          - id: quality_gate
            type: quality
            label: Trust checks
            config:
              rules:
                - { type: NOT_NULL, column: region_name, severity: CRITICAL, blockPipeline: true }
                - { type: RANGE,    column: total_revenue, min: 0, severity: CRITICAL, blockPipeline: true }
                - { type: ROW_COUNT, min: 1, severity: WARNING, blockPipeline: false }
      
          - id: load_warehouse
            type: connector_sink
            label: Snowflake - DAILY_SALES_RPT
            config:
              connector: snowflake
              table: ANALYTICS.REPORTING.DAILY_SALES_RPT
              writeMode: DELETE_INSERT
              upsertKeys: [report_date, region_name, product_category]
              batchSize: 10000
      
        edges:
          - { from: src_sales,      to: drop_test_rows }
          - { from: drop_test_rows, to: join_region }
          - { from: src_region,     to: join_region }
          - { from: join_region,    to: join_product }
          - { from: src_product,    to: join_product }
          - { from: join_product,   to: summarise }
          - { from: summarise,      to: quality_gate }
          - { from: quality_gate,   to: load_warehouse }
      
        notifications:
          onSuccess:
            - channel: email
              message: "Daily SAP sales report loaded for {{parameters.report_date}}"
          onFailure:
            - channel: slack
              message: "FAILED: daily SAP sales report for {{parameters.report_date}}"
      

      You do not need to memorise this — Design Studio writes it for you. But it is worth knowing it exists, because it makes the pipeline reviewable, diff-able, and recoverable like any other piece of code.


      Step 8 — Schedule the daily run

      A pipeline that has to be started by hand every morning is not much of an automation. Anna sets a schedule so DataFlow AI runs it on its own.

      In Pipeline Settings (the gear icon) she sets the schedule. You can build it with a visual picker or type a cron expression — a compact code that means "run at these times". Anna uses:

      0 5 * * 1-5 !PL_HOLIDAY
      

      Reading that left to right: minute 0, hour 5, every day-of-month, every month, days 1-5 (Monday to Friday). So: 05:00 every weekday. That gives the pipeline three hours of headroom before Marek's 08:00 deadline.

      The !PL_HOLIDAY part is a DataFlow AI extension. It means: skip Polish public holidays. The platform has the Polish holiday calendar built in (New Year, Three Kings, Easter Monday, Constitution Day, and the rest), so the report does not run pointlessly on a day the Sales Office is closed.

      In plain terms

      A cron expression is just a five-slot time pattern: minute, hour, day-of-month, month, day-of-week. It looks cryptic but it is only ever answering one question — "when should this run?". 0 5 * * 1-5 means five o'clock in the morning, Monday to Friday.

      If anything goes wrong overnight, the pipeline retries up to three times with increasing delays, and the onFailure notification sends a Slack message — so Anna learns about a problem before Marek does.


      Step 9 — Expose the result to the BI tool

      The pipeline now drops a clean, trustworthy summary into ANALYTICS.REPORTING.DAILY_SALES_RPT in the warehouse every weekday morning. The last mile is connecting Marek's dashboard to it.

      This part happens in the BI tool, not in DataFlow AI — but it is simple, because the hard work is done:

      1. In the BI tool, Marek (or a BI administrator) creates a data source pointing at the Snowflake warehouse.
      2. He selects the DAILY_SALES_RPT table — a small, clean, pre-summarised table. Because the pipeline already filtered, joined, and aggregated the data, the BI tool only has to display it. No heavy lifting happens at dashboard-open time.
      3. He builds the charts once: revenue by region, units by product category, a transaction-count trend.
      4. He sets the dashboard to refresh each morning after 05:30.

      From then on, Marek opens his dashboard at 08:00 and sees yesterday's figures — fresh, correct, and fast — without ever touching SAP HANA.

      In plain terms

      The reason the BI tool feels instant is that the pipeline pushed all the slow work to the night before. Marek's dashboard reads a tiny finished table, not millions of raw sales rows. Good ETL design is mostly about doing the expensive work once, in advance, where nobody is waiting.


      Switching the target warehouse

      The walkthrough used Snowflake, but Polkomtel runs three warehouses. Swapping the target is a small change — only the Sink node differs.

      WarehouseConnectorNotable detail
      SnowflakesnowflakeBulk load via COPY INTO; supports MERGE for upserts
      Teradata (DWH-MONA)teradataUses the high-speed Teradata Parallel Transporter; pick multiload mode for upserts
      BigQuerybigqueryLoads via batch load jobs; data stays in the customer's own GCP project in Warsaw

      To switch, Anna deletes the Snowflake sink, drags the new warehouse's tile in its place, points it at the right connection and table, and re-validates. The seven-stage shape of the pipeline — extract, filter, join, aggregate, quality, load — does not change at all.


      What Anna built, summarised

      StageNode typePurpose
      1connector_source (sap-hana)Extract yesterday's sales from SAP HANA
      2filterDrop test transactions
      3joiner ×2Add region names and product categories
      4aggregatorSummarise into per-region, per-category totals
      5qualityBlock the run if the numbers look wrong
      6connector_sink (warehouse)Load the summary into the warehouse
      7schedule + notificationsRun every weekday at 05:00, alert on failure

      One recipe, written once by Anna, now feeds Marek's morning dashboard every working day — and Tomasz can see, at any time, that the data is masked where it needs to be and the quality checks are passing. That is a complete ETL use case, end to end.

    Previous
    Migrating Alteryx