Data Engineer guide

This guide follows Anna Kowalska, the Data Engineer persona, through every journey she takes on DataFlow AI — from her first day on the platform to the steady rhythm of building, scheduling, and debugging the 500+ pipelines the DWH team runs. Every step is mapped to a concrete screen and route so you can follow along in the product.


Before you start

The Data Engineer is the platform's power user. Anna works across all three editing modes (Visual, SQL, and Python) and has the broadest set of permissions short of an administrator. Her backend role is ENGINEER (level 75), her Keycloak realm role is developer, and her UX persona is engineer.

Her allowed route prefixes include /, /design-studio, /monitor, /governance, /migration, /connections, /marketplace, /pipelines, /data-browser, and /templates. She can create, edit, run, deploy, debug, and delete pipelines, and manage connections.
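
A prefix list like this is usually enforced with a simple path check. The sketch below is illustrative only and is not the platform's router guard: `is_route_allowed` is a hypothetical helper, and treating "/" as the root page only (rather than as a prefix of every route) is an assumption. The guide says the prefixes "include" these values, so the real list may be longer.

```python
from urllib.parse import urlsplit

# Route prefixes listed in this guide for the ENGINEER role.
ENGINEER_ROUTE_PREFIXES = (
    "/", "/design-studio", "/monitor", "/governance", "/migration",
    "/connections", "/marketplace", "/pipelines", "/data-browser", "/templates",
)


def is_route_allowed(url: str, prefixes=ENGINEER_ROUTE_PREFIXES) -> bool:
    """Return True if the URL's path falls under one of the allowed prefixes."""
    path = urlsplit(url).path or "/"  # drop any query string or fragment
    for prefix in prefixes:
        if prefix == "/":
            # Assumption: "/" stands for the root page only, not every route.
            if path == "/":
                return True
        elif path == prefix or path.startswith(prefix + "/"):
            return True
    return False
```

Matching on `prefix + "/"` rather than a bare `startswith(prefix)` keeps a path such as /monitoring from slipping through under /monitor.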


Journey 1 — Five-day onboarding

The goal of onboarding is to take Anna from her first login to her first production pipeline within five days.

Day 1 — Account setup

  1. Anna opens the platform URL. Because Active Directory federation is configured, her AD credentials auto-provision a DataFlow account at the Keycloak login screen.
  2. After authentication she lands on the workspace selector in the top bar and chooses her DWH workspace.
  3. An onboarding tour overlay walks her through the Design Studio — the canvas, the palette, and the editing modes.
  4. She clones a set of sample pipelines and runs a "Hello World" pipeline. It completes within 15 minutes.

Screens: Keycloak login, TopBar workspace selector, onboarding tour overlay, Design Studio.

Figure: The Day 1 onboarding tour, an overlay that introduces the canvas, palette, and the three editing modes before Anna runs her first Hello World pipeline.

Day 2–3 — Tools setup

  1. Anna connects her Git repository (GitLab or Stash) so every pipeline definition is versioned as YAML.
  2. She sets up her local IDE — VS Code with the Python SDK and the dataflow CLI — and authenticates with dataflow login.
  3. She registers connections to the databases she works with: Teradata, Snowflake, and Databricks.
  4. She runs a test pipeline and confirms the push-down SQL is correct.

Screens: Git integration, CLI, connection registration, Design Studio test run.

Day 4–5 — First real pipeline

  1. Anna opens the Migration Center and migrates a pipeline from PowerCenter.
  2. She customizes the converted YAML in the Design Studio YAML editor.
  3. She submits the pipeline for code review through a Git pull request.
  4. Once approved, she deploys to Dev.

Screens: Migration Center, Design Studio YAML editor, Git PR / code review, environment promotion.

| Onboarding metric | Target |
| --- | --- |
| Time to first pipeline run | < 2 hours |
| Time to first production pipeline | < 5 days |
| Onboarding satisfaction | > 4.5 / 5 |

Journey 2 — Daily workflow

A typical day for Anna moves from a morning health check, through fixing failures, into development, and ends with code review and deployment.

09:00 — Morning check

  1. Anna opens her role-adaptive Home Dashboard (/dashboard, engineer variant).
  2. She reads the PipelineStatusCard — overnight, 487 of 500 pipelines are healthy, 2 failed, 11 are in warning.
  3. She clicks a failed run in the RecentFailuresCard, which navigates her to /monitor/runs/{runId}.
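
The numbers on the status card are a roll-up of per-pipeline run states. As a rough sketch of that arithmetic, using the overnight figures from step 2: the status names ("healthy", "warning", "failed") and the `summarize_runs` helper are assumptions made for illustration, not the card's actual data model.

```python
from collections import Counter


def summarize_runs(statuses):
    """Roll a list of per-pipeline statuses up into card-style totals."""
    counts = Counter(statuses)
    total = sum(counts.values())
    healthy_pct = round(100 * counts["healthy"] / total, 1) if total else 0.0
    return {
        "total": total,
        "healthy": counts["healthy"],
        "warning": counts["warning"],
        "failed": counts["failed"],
        "healthy_pct": healthy_pct,
    }


# The overnight picture from the guide: 487 healthy, 11 in warning, 2 failed.
overnight = ["healthy"] * 487 + ["warning"] * 11 + ["failed"] * 2
summary = summarize_runs(overnight)
print(summary["total"], summary["healthy_pct"])  # 500 97.4
```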

09:30 — Fix failures

  1. On the Run Detail page she reads the structured, color-coded error log with its full stack trace.
  2. The AI Diagnosis Panel diagnoses the issue — for example, "Teradata lock contention" — and suggests a fix such as "add retry with exponential backoff".
  3. Anna applies the fix and uses one-click re-run from checkpoint so the pipeline resumes rather than restarting from scratch.
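
For readers unfamiliar with the suggested fix, retry with exponential backoff can be sketched as follows. This is a generic Python illustration, not how DataFlow AI configures retries: `TransientError`, `retry_with_backoff`, and all default values are hypothetical.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as lock contention."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep, jitter=random.random):
    """Call operation(); on TransientError wait 1s, 2s, 4s, ... and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 10% random jitter so parallel retries do not align.
            sleep(delay + jitter() * delay * 0.1)
```

The doubling delay gives a contended lock time to clear, and the cap (`max_delay`) keeps a long outage from producing unbounded waits. The `sleep` and `jitter` parameters exist only so the behaviour can be tested without real waiting.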

10:00 — Pipeline development

  1. Anna opens the Design Studio and creates a new pipeline.
  2. She uses tri-mode editing — Visual, SQL, and Python all stay in sync, backed by one canonical YAML definition.
  3. At each node she views the inline data preview and profiling (column types, null percentage, value distribution) in the right-panel Preview tab.
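
The profile shown in the Preview tab (column types, null percentage, value distribution) can be approximated in a few lines. This is a minimal sketch in plain Python, not the platform's profiler; `profile_column` and the sample rows are made up for illustration.

```python
from collections import Counter


def profile_column(rows, column, top_n=3):
    """Compute observed types, null percentage, and most common values."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    null_pct = round(100 * (len(values) - len(non_null)) / len(values), 1) if values else 0.0
    return {
        "column": column,
        "types": sorted({type(v).__name__ for v in non_null}),
        "null_pct": null_pct,
        "top_values": Counter(non_null).most_common(top_n),
    }


# Invented sample rows standing in for a node's preview output.
sample_rows = [
    {"msisdn": "48600100200", "plan": "prepaid"},
    {"msisdn": "48600100201", "plan": "postpaid"},
    {"msisdn": "48600100202", "plan": None},
    {"msisdn": "48600100203", "plan": "prepaid"},
]
plan_profile = profile_column(sample_rows, "plan")
```

Here `plan_profile["null_pct"]` is 25.0, because one of the four sample rows has no plan.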

14:00 — Code review and deploy

  1. Anna commits her work to Git, either from the UI CommitModal or the CLI.
  2. She opens a pull request that includes a visual pipeline diff so reviewers see the DAG change, not just YAML text.
  3. CI validates the change; once it is green and approved, she merges and promotes to staging.
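
A visual pipeline diff boils down to comparing two DAGs: which nodes and edges were added or removed. The sketch below assumes a simplified definition shape (a dict with "nodes" and "edges") that is not the real YAML schema, and `diff_pipelines` is a hypothetical name.

```python
def diff_pipelines(old, new):
    """Compare two pipeline DAGs and report structural changes."""
    old_nodes, new_nodes = set(old["nodes"]), set(new["nodes"])
    old_edges = {tuple(edge) for edge in old["edges"]}
    new_edges = {tuple(edge) for edge in new["edges"]}
    return {
        "added_nodes": sorted(new_nodes - old_nodes),
        "removed_nodes": sorted(old_nodes - new_nodes),
        "added_edges": sorted(new_edges - old_edges),
        "removed_edges": sorted(old_edges - new_edges),
    }


# Example: a Deduplicate step is inserted between the filter and the target.
before = {"nodes": ["src", "filter", "tgt"],
          "edges": [("src", "filter"), ("filter", "tgt")]}
after = {"nodes": ["src", "filter", "dedupe", "tgt"],
         "edges": [("src", "filter"), ("filter", "dedupe"), ("dedupe", "tgt")]}
changes = diff_pipelines(before, after)
```

A reviewer reading `changes` sees one new node and a rewired edge, which is easier to reason about than the equivalent line-level YAML diff.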

Screens: HomeDashboard (engineer), PipelineStatusCard, RecentFailuresCard, Monitor Center Run Detail, AiDiagnosisPanel, Design Studio canvas and editors, NodePreviewTab, CommitModal, DeployModal.


Journey 3 — Build and schedule a pipeline

This is the core creative journey. Anna builds an ETL pipeline from scratch in the Design Studio and puts it on a schedule.

  1. New pipeline. In the Design Studio, Anna creates a new pipeline and gives it a name and a type — Batch or Streaming (a streaming pipeline runs on Kafka + Flink).
  2. Drag components. From the left-panel Component Palette she drags nodes onto the canvas. The palette has five categories — Sources (14 connectors), Transforms (12), Targets (14), Quality (5), and AI (3). She connects each node's output port to the next node's input port.
  3. Configure source and target. Selecting a node opens the right-side Properties Inspector. For a source she sets the connection, table or query, partition column, fetch size, and incremental column. For a target she additionally sets the write mode (Append / Overwrite / Upsert / Merge / Delete-Insert), upsert keys, and pre/post SQL.
  4. Add transformations. She drops in transforms — SQL, Filter, Aggregate, Join, Deduplicate, and so on — and wires them between source and target.
  5. Preview data. At any node she opens Data Preview to see up to 100 real rows pulled from the source with a LIMIT.
  6. Switch to YAML. Optionally she opens the YAML editor, which stays in bidirectional sync with the canvas and offers autocomplete and validation.
  7. Run or schedule. She clicks Run to execute manually, or Schedule to set a cron expression with a timezone, a business-day filter, and an overlap strategy.
  8. Monitor live. The run streams per-step status over WebSocket; she watches it live and reviews history on the Runs tab.
  9. Backfill. For historical dates she opens Backfill and sets a date range, granularity, concurrency, and an optional dry-run toggle.

Figure: The Design Studio, Anna's main workspace. The component palette sits on the left, the DAG canvas in the centre, and the Properties Inspector on the right; the YAML editor stays in bidirectional sync with the canvas.
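
The backfill settings from step 9 (date range, granularity, concurrency, dry-run) map naturally onto a list of date windows processed in batches. The following is a sketch under assumptions: granularity is expressed in whole days, and `backfill_windows` and `plan_backfill` are hypothetical helpers rather than platform APIs.

```python
from datetime import date, timedelta


def backfill_windows(start, end, granularity_days=1):
    """Split the inclusive range [start, end] into consecutive date windows."""
    if granularity_days < 1:
        raise ValueError("granularity_days must be at least 1")
    if end < start:
        raise ValueError("end must not be before start")
    step = timedelta(days=granularity_days)
    windows, cursor = [], start
    while cursor <= end:
        # The last window is clipped so it never runs past the end date.
        windows.append((cursor, min(cursor + step - timedelta(days=1), end)))
        cursor += step
    return windows


def plan_backfill(start, end, granularity_days=1, concurrency=2, dry_run=True):
    """Group windows into batches of at most `concurrency` parallel runs."""
    windows = backfill_windows(start, end, granularity_days)
    batches = [windows[i:i + concurrency] for i in range(0, len(windows), concurrency)]
    return {"dry_run": dry_run, "windows": len(windows), "batches": batches}
```

With a dry run, a plan like this can be inspected before any window is actually executed, which is the usual reason to offer the toggle.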

Telecom templates

Anna does not always start from a blank canvas. Pipeline Templates (/templates) offers pre-built, telecom-specific templates such as "CDR Ingestion". Choosing Use Template opens the Design Studio at /design-studio?template={id} pre-populated.
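
If you need to build that deep link yourself, for example from a script or an internal wiki, URL-encoding the template id is enough. The template id used here ("cdr-ingestion") is invented; real ids come from the Templates page.

```python
from urllib.parse import urlencode


def template_url(template_id: str) -> str:
    """Build the Design Studio link that opens pre-populated from a template."""
    return "/design-studio?" + urlencode({"template": template_id})


print(template_url("cdr-ingestion"))  # /design-studio?template=cdr-ingestion
```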


Journey 4 — Debugging a failure

This walkthrough follows a real incident: pipeline wf_E112 reported FAILED at 03:47.

Step 1 — Triage

  1. Anna receives a notification (PagerDuty, email, or Slack).
  2. She opens the pipeline's Run Detail page. The failed node is highlighted red on the visual DAG, with a red glow and an error-tooltip bubble.

Step 2 — Investigate

  1. She reads the structured error log and stack trace in the left panel.
  2. The AI Diagnosis Panel (right, 40% of the screen) gives a violet-gradient card with a Summary, a Root Cause — for example, "Sybase source timed out due to lock contention" — and numbered Suggested Actions such as "retry with read-uncommitted", each with a confidence percentage.
  3. She follows the lineage links to see which downstream systems are affected.
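
The diagnosis card described in step 2 can be thought of as a small structured record. The sketch below models it with dataclasses; the field names, the 0.0 to 1.0 confidence scale, the confidence values, and the idea of ranking actions by confidence are all assumptions for illustration, not the panel's actual payload.

```python
from dataclasses import dataclass, field


@dataclass
class SuggestedAction:
    description: str
    confidence: float  # assumed scale: 0.0 to 1.0


@dataclass
class Diagnosis:
    summary: str
    root_cause: str
    actions: list = field(default_factory=list)

    def ranked_actions(self):
        """Return suggested actions, most confident first."""
        return sorted(self.actions, key=lambda a: a.confidence, reverse=True)


diagnosis = Diagnosis(
    summary="wf_E112 failed while reading from the source",
    root_cause="Sybase source timed out due to lock contention",
    actions=[
        SuggestedAction("add retry with exponential backoff", 0.6),  # invented value
        SuggestedAction("retry with read-uncommitted", 0.8),         # invented value
    ],
)
for number, action in enumerate(diagnosis.ranked_actions(), start=1):
    print(f"{number}. {action.description} ({action.confidence:.0%})")
```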

Step 3 — Fix and verify

  1. Anna applies the suggested fix in the pipeline YAML.
  2. She clicks Re-run from Checkpoint so the pipeline resumes from the failed step rather than starting over.
  3. She verifies data completeness in the target table.

Screens: notification panel, Monitor PipelineRunDetail + RunDagViewer + AiDiagnosisPanel, Lineage Explorer (impact), RunDetailActions.

Re-run from checkpoint vs. re-run

Re-run from Checkpoint is only available when a checkpoint exists for the run. It resumes from the failed node and is far cheaper than a full Re-run, which restarts the pipeline from the beginning. Always prefer the checkpoint option when it is offered.
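
The difference between the two options can be sketched as a choice of which steps to execute. This simplification assumes a linear sequence of steps (real pipelines are DAGs) and uses a hypothetical `steps_to_run` helper with invented step names.

```python
def steps_to_run(steps, failed_step=None, has_checkpoint=False):
    """Return the steps a re-run would execute.

    With a checkpoint, execution resumes at the failed step; without one,
    the whole pipeline runs again from the first step.
    """
    if failed_step is not None and has_checkpoint:
        return steps[steps.index(failed_step):]
    return list(steps)


pipeline_steps = ["extract", "filter", "join", "load"]
print(steps_to_run(pipeline_steps, failed_step="join", has_checkpoint=True))
# ['join', 'load']
print(steps_to_run(pipeline_steps, failed_step="join", has_checkpoint=False))
# ['extract', 'filter', 'join', 'load']
```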


Journey 5 — Building a custom connector

Engineers can extend the platform itself with new connectors and SQL dialects. This is a developer-facing journey.

Adding a connector

  1. Implement either JDBCConnectorBase or NativeConnectorBase, providing the required methods: connect, disconnect, testConnection, doDiscoverSchema, doExtractData, and doLoadData.
  2. Register the new connector in the ConnectorRegistry.
  3. Add a corresponding value to the ConnectionType enum.
  4. Write tests for the connector.
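
To show the shape of the four steps above, here is a Python sketch. It is not the platform's API: the method names in this guide are camelCase and the related bridge file is SparkSqlBridge.kt, so the real base classes are presumably Kotlin. The snake_case method names mirror the required methods, while the signatures, return types, `IN_MEMORY` enum value, and dict-based registry are assumptions.

```python
from abc import ABC, abstractmethod
from enum import Enum


class ConnectionType(Enum):
    IN_MEMORY = "in_memory"  # hypothetical value added for the new connector


class ConnectorBase(ABC):
    """Stand-in for JDBCConnectorBase / NativeConnectorBase."""

    @abstractmethod
    def connect(self): ...
    @abstractmethod
    def disconnect(self): ...
    @abstractmethod
    def test_connection(self): ...
    @abstractmethod
    def do_discover_schema(self): ...
    @abstractmethod
    def do_extract_data(self, table): ...
    @abstractmethod
    def do_load_data(self, table, rows): ...


class InMemoryConnector(ConnectorBase):
    """Toy connector that keeps tables in a dict, useful for tests."""

    def __init__(self):
        self.tables = {}
        self.connected = False

    def connect(self):
        self.connected = True

    def disconnect(self):
        self.connected = False

    def test_connection(self):
        return self.connected

    def do_discover_schema(self):
        # Column names are taken from the first row of each table.
        return {name: sorted(rows[0]) if rows else [] for name, rows in self.tables.items()}

    def do_extract_data(self, table):
        return list(self.tables.get(table, []))

    def do_load_data(self, table, rows):
        rows = list(rows)
        self.tables.setdefault(table, []).extend(rows)
        return len(rows)  # number of rows written


# Step 2: register the connector under its connection type.
CONNECTOR_REGISTRY = {ConnectionType.IN_MEMORY: InMemoryConnector}
```

A test for step 4 would then instantiate the connector through the registry, load a few rows, and check that schema discovery and extraction return what was written.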

Adding a push-down SQL dialect

  1. Create a SqlDialect class for the new database.
  2. Register it in SparkSqlBridge.kt.
  3. Write dialect tests.
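
A dialect mostly encodes how one database spells common operations. The sketch below, written in Python for illustration, covers a single difference: Snowflake limits rows with LIMIT n, while Teradata uses SELECT TOP n. The class layout, method names, and `DIALECTS` registry are assumptions; the real SqlDialect interface is not described in this guide.

```python
class SqlDialect:
    """Hypothetical base dialect using ANSI-style identifier quoting."""

    name = "generic"

    def quote(self, identifier):
        # Double any embedded quote characters, then wrap in double quotes.
        return '"' + identifier.replace('"', '""') + '"'

    def limit_query(self, table, limit):
        return f"SELECT * FROM {self.quote(table)} LIMIT {int(limit)}"


class SnowflakeDialect(SqlDialect):
    name = "snowflake"  # the LIMIT form from the base class applies


class TeradataDialect(SqlDialect):
    name = "teradata"

    def limit_query(self, table, limit):
        # Teradata restricts rows with TOP n instead of a trailing LIMIT.
        return f"SELECT TOP {int(limit)} * FROM {self.quote(table)}"


# Registration step, sketched as a simple name-to-dialect mapping.
DIALECTS = {d.name: d for d in (SnowflakeDialect(), TeradataDialect())}

print(DIALECTS["teradata"].limit_query("cdr_events", 100))
# SELECT TOP 100 * FROM "cdr_events"
```

Dialect tests in this style are plain string comparisons: render a query for each dialect and assert on the generated SQL.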

Contribution workflow

| Step | Action |
| --- | --- |
| 1 | Create a feature branch |
| 2 | Ensure all tests pass |
| 3 | Run ktlint (Kotlin) and ruff (Python) |
| 4 | Open a pull request with at least one code-owner approval and green CI |
| 5 | Squash-merge to main |

Once merged, the new connector appears in the Connector Marketplace (/marketplace) and as a node in the Design Studio Component Palette.


Journey → screen cross-reference

| Journey | Entry route | Key screens / components |
| --- | --- | --- |
| Onboarding | /, /dashboard | Keycloak login, workspace selector, onboarding tour, Design Studio, Git connect, CLI |
| Daily check & debug | /dashboard, /monitor | PipelineStatusCard, RecentFailuresCard, PipelineRunDetail, RunDagViewer, AiDiagnosisPanel |
| Build & schedule | /design-studio | Canvas, palette, SQL/Python/YAML editors, NodePreviewTab, schedule dialog, backfill |
| Code review & deploy | /design-studio | CommitModal, DeployModal, visual pipeline diff, environment promotion |
| Debugging | /monitor/runs/{runId} | PipelineRunDetail, RunDagViewer, AiDiagnosisPanel, RunDetailActions |
| Custom connector | (developer flow) | ConnectorRegistry, ConnectionType, SparkSqlBridge.kt |
