DataFlow AI - The AI-native data integration platform.

DataFlow AI can be deployed in three different topologies, each with a different balance of cost, control, and time-to-launch. This page describes all three — On-Premises, Full GCP cloud, and the recommended Hybrid — with their infrastructure requirements, capacity sizing, cost tables, security posture, and a decision guide for choosing between them.

The three topologies at a glance

A "topology" here means where the platform runs and where the data lives. The deployment-scenarios source document defines three, and recommends Hybrid for Polkomtel Plus.

Topology	Where it runs	3-Year TCO (USD)	Annual cost	Time to first production pipeline	Full production
1. On-Premises	Polkomtel's own Warsaw data center	$1,836,000 (≈$2,286,000, or 9,144,000 PLN, incl. Year 0 CapEx)	$582K OpEx + $540K CapEx in Year 0	14–24 weeks	6–9 months
2. GCP Full Cloud	GCP `europe-central2` (Warsaw), DR in `europe-west3` (Frankfurt)	$495,000 (realistic)	$165,000 (realistic)	3–4 weeks	2–3 months
3. Hybrid ⭐ Recommended	Source databases stay on-prem; the platform runs on GCP	$606,000–$741,000	$197,000–$242,000	10–14 weeks	4–6 months

A few terms used throughout this page:

CapEx (capital expenditure) — a large upfront purchase, e.g. buying servers.
OpEx (operating expenditure) — a recurring cost, e.g. a monthly cloud bill.
TCO (total cost of ownership) — the all-in cost over a defined period, here three years.
CDC (change data capture) — streaming each database change as it happens.
DR (disaster recovery) — a standby copy of the system in a second location.

Pricing basis

All cost figures use GCP list prices for Q1 2026, an exchange rate of 1 USD = 4.00 PLN, and Dell Q1 2026 Polish enterprise channel pricing. Polish VAT (23%) is not included in any cost table. GCP always bills in USD, which introduces foreign-exchange risk for a PLN-denominated budget.
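The pricing basis can be captured in a few helper functions. This is a minimal sketch: the function names are illustrative, and the 4.40 rate in the example is a hypothetical value chosen only to show the foreign-exchange exposure, not a forecast.

```python
from decimal import Decimal

USD_TO_PLN = Decimal("4.00")  # pricing-basis rate: 1 USD = 4.00 PLN
POLISH_VAT = Decimal("0.23")  # 23% VAT, excluded from every cost table


def usd_to_pln(usd, rate=USD_TO_PLN):
    """Convert a USD amount to PLN at the given rate."""
    return Decimal(usd) * Decimal(rate)


def pln_to_usd(pln, rate=USD_TO_PLN):
    """Convert a PLN amount to USD at the given rate."""
    return Decimal(pln) / Decimal(rate)


def with_vat(net_amount):
    """Add Polish VAT to a net amount (the cost tables show net figures)."""
    return Decimal(net_amount) * (1 + POLISH_VAT)


# The realistic GCP bill of $165,000/yr in PLN at the planning rate,
# and at a hypothetical weaker zloty (4.40 is an illustrative value only).
print(usd_to_pln(165_000))                   # 660000.00
print(usd_to_pln(165_000, Decimal("4.40")))  # 726000.00
```

Because GCP bills in USD, the same dollar invoice costs more in PLN whenever the zloty weakens, which is the risk the paragraph above describes.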

Scenario 1 — On-Premises

Every DataFlow AI component runs on hardware Polkomtel owns and operates in its own Warsaw data center, on bare-metal Dell PowerEdge or VMware-virtualized servers, with Kubernetes provided by Red Hat OpenShift. No data ever leaves the corporate network — this is the maximum-data-sovereignty option.

Headline numbers: 2,160,000 PLN CapEx · $582K annual OpEx · $1.836M 3-Year TCO · up to 24 weeks to launch · 13 Kubernetes nodes (10 workers + 3 control plane) · 100% data sovereignty.

Infrastructure — hardware to buy (CapEx)

The On-Premises topology requires a one-time hardware purchase totalling 2,160,000 PLN (≈$540,000).

Component	Specification	Count	Unit (PLN)	Total (PLN)
Kubernetes worker nodes	Dell PowerEdge R750 · 2× Xeon Silver 4316 · 256 GB ECC · 2× 1.92 TB NVMe · 25 GbE	10	65,000	650,000
Kubernetes control plane	Dell PowerEdge R650 · 2× Xeon Silver 4310 · 128 GB · 2× 960 GB NVMe	3	35,000	105,000
PostgreSQL HA servers	Dell PowerEdge R750xs · 2× Xeon Gold 5318Y · 512 GB · 8× 3.84 TB NVMe RAID10	2	95,000	190,000
Redis cluster nodes	Dell PowerEdge R650 · Xeon Silver 4310 · 128 GB · 4× 960 GB NVMe	3	30,000	90,000
Kafka broker nodes	Dell PowerEdge R750 · Xeon Silver 4316 · 128 GB · 6× 3.84 TB NVMe · KRaft	5	55,000	275,000
Storage array (NAS)	NetApp AFF A400 · 200 TB raw (50 TB usable after RAID + replication) · 100 GbE	1	280,000	280,000
Load balancers	F5 BIG-IP i2800 · HA pair · WAF module · SSL offload	2	60,000	120,000
Top-of-rack switches	Arista 7050X3 · 32× 100 GbE	4	45,000	180,000
Spine switches	Arista 7280R3 · 36× 400 GbE	2	85,000	170,000
UPS (N+1)	APC Symmetra LX 40 kVA · 15-min battery at full load	2	50,000	100,000
Total CapEx				2,160,000 PLN (≈$540,000)
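The CapEx total can be re-derived from the table rows. The sketch below copies the counts and unit prices from the table; the variable names are illustrative.

```python
# (component, count, unit price in PLN), copied from the CapEx table above
HARDWARE = [
    ("Kubernetes worker nodes", 10, 65_000),
    ("Kubernetes control plane", 3, 35_000),
    ("PostgreSQL HA servers", 2, 95_000),
    ("Redis cluster nodes", 3, 30_000),
    ("Kafka broker nodes", 5, 55_000),
    ("Storage array (NAS)", 1, 280_000),
    ("Load balancers", 2, 60_000),
    ("Top-of-rack switches", 4, 45_000),
    ("Spine switches", 2, 85_000),
    ("UPS (N+1)", 2, 50_000),
]

USD_TO_PLN = 4.00  # exchange rate from the pricing basis

capex_pln = sum(count * unit for _, count, unit in HARDWARE)
capex_usd = capex_pln / USD_TO_PLN

print(capex_pln)  # 2160000
print(capex_usd)  # 540000.0
```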

Infrastructure — annual running cost (OpEx)

Running the on-prem estate costs 2,328,000 PLN/yr (≈$582,000/yr).

Category	Annual (PLN)	Notes
Colocation (Warsaw DC)	336,000	6 racks, 30 kW, precision cooling, dual 10 Gbps uplinks
Power & cooling	102,000	30 kW average; cooling at PUE 1.4
Internet bandwidth	108,000	10 Gbps redundant fiber, 2 ISPs, BGP failover
Red Hat OpenShift Enterprise	216,000	13 nodes, Red Hat support, ACM + ACS security
Confluent Platform (Kafka)	336,000	5 broker enterprise license, Schema Registry, ksqlDB
HashiCorp Vault Enterprise	72,000	3-node HA, HSM seal, audit logging
Hardware maintenance (15% of CapEx/yr)	324,000	Dell ProSupport+
Backup & DR software	60,000	Veeam, immutable backups
Security (EDR/IDS/IPS)	42,000	CrowdStrike, Fortinet, Nessus
Infrastructure engineers (2 FTE)	300,000	2 dedicated senior engineers
Hardware amortization (5-year)	432,000	2.16M PLN ÷ 5 years
Total annual OpEx	2,328,000 PLN/yr	≈$582,000/yr

3-Year TCO and scaling

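Year 0 CapEx is 2,160,000 PLN and OpEx in each of Years 1–3 is 2,328,000 PLN, for a 3-Year total of 2,160,000 + 3 × 2,328,000 = 9,144,000 PLN (≈$2,286,000). The lower $1.836M headline figure quoted elsewhere on this page is not derived in these tables. There is no elastic scaling: the hardware is provisioned for peak load from day one. A hardware refresh is required at Year 5 (a further +2,160,000 PLN).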
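The annual OpEx and the 3-Year total can be reproduced from the OpEx table. In this sketch the dictionary keys are shortened labels, and the maintenance and amortization rows are computed from CapEx using the 15% and 5-year rules stated in the table.

```python
CAPEX_PLN = 2_160_000
USD_TO_PLN = 4  # 1 USD = 4.00 PLN, from the pricing basis

OPEX_PLN = {
    "Colocation": 336_000,
    "Power & cooling": 102_000,
    "Internet bandwidth": 108_000,
    "Red Hat OpenShift": 216_000,
    "Confluent Platform": 336_000,
    "HashiCorp Vault": 72_000,
    "Hardware maintenance": CAPEX_PLN * 15 // 100,  # 15% of CapEx per year
    "Backup & DR software": 60_000,
    "Security": 42_000,
    "Infrastructure engineers": 300_000,
    "Hardware amortization": CAPEX_PLN // 5,        # straight-line over 5 years
}

annual_opex_pln = sum(OPEX_PLN.values())
tco_3yr_pln = CAPEX_PLN + 3 * annual_opex_pln
tco_3yr_usd = tco_3yr_pln // USD_TO_PLN

print(annual_opex_pln)  # 2328000
print(tco_3yr_pln)      # 9144000
print(tco_3yr_usd)      # 2286000
```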

On-prem scaling is manual and slow. Worker nodes are added when average CPU exceeds 70% sustained for two weeks (order three R750 nodes — 8-week lead time). A sixth Kafka broker is added when a broker's partition-leader count exceeds 200 (55,000 PLN + 1 week to install). Power has headroom to 45 kW before the colocation footprint must expand.
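The three scaling triggers above can be written as a simple rule check. This is a sketch under stated assumptions: `ClusterSnapshot` and its fields are hypothetical names, and "sustained for two weeks" is interpreted here as every one of the last 14 daily CPU averages exceeding 70%.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSnapshot:
    """Hypothetical monitoring summary; field names are illustrative."""
    daily_avg_cpu_pct: List[float]  # one value per day, most recent last
    max_partition_leaders: int      # highest leader count on any broker
    power_draw_kw: float            # current colocation power draw


def scaling_actions(snapshot: ClusterSnapshot) -> List[str]:
    """Return the manual scaling actions the on-prem rules call for."""
    actions = []
    last_two_weeks = snapshot.daily_avg_cpu_pct[-14:]
    # Assumption: "sustained" means all 14 daily averages are above 70%.
    if len(last_two_weeks) == 14 and all(cpu > 70 for cpu in last_two_weeks):
        actions.append("order 3 R750 worker nodes (8-week lead time)")
    if snapshot.max_partition_leaders > 200:
        actions.append("add a sixth Kafka broker (55,000 PLN, 1 week)")
    if snapshot.power_draw_kw > 45:
        actions.append("expand the colocation footprint")
    return actions


busy = ClusterSnapshot([75.0] * 14, max_partition_leaders=210, power_draw_kw=32.0)
print(scaling_actions(busy))
```

The 8-week hardware lead time means the CPU rule has to fire well before capacity runs out, which is why the threshold sits at 70% rather than closer to saturation.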

When to choose On-Premises

Best for: maximum data sovereignty; air-gapped or high-security environments; organizations that already have a data center and a large infrastructure staff; regulatory environments that prohibit cloud entirely (military, government).
Not recommended for: fast time-to-launch (under 3 months); elastic or unpredictable workloads; teams that want managed services; heavy AI/ML work that needs GCP's Vertex AI or BigQuery ML.

Scenario 2 — GCP Full Cloud

The entire platform runs on Google Cloud Platform in the europe-central2 (Warsaw, Poland) region, with disaster recovery in europe-west3 (Frankfurt, Germany). GKE Autopilot manages all containerized workloads, and the data services — Cloud SQL, Memorystore, Dataproc Serverless, Cloud Composer — are fully managed by Google.

Headline numbers: $0 CapEx · $13,750/month (realistic) · $495K 3-Year TCO (realistic) · 3–4 weeks to launch · 100% managed services.

Three sub-scenarios

GCP Full Cloud has three cost profiles depending on load. The middle one — Realistic — is the recommended baseline.

Sub-scenario	Monthly	Annual	Profile
Minimum	$8,150	$97,800	Dev/test or lean production, ~50 active pipelines, no streaming CDC, no DR, single region
Realistic ⭐	$13,750	$165,000	Full production, 200 active pipelines, moderate CDC streaming, full HA, warm DR standby
Pessimistic	$23,200	$278,400	Peak load, heavy CDC, active-active DR in both regions, 40+ TB/day

The Minimum profile is explicitly not recommended for production Polkomtel workloads — it has no high availability and no disaster recovery.

GCP service cost breakdown (Realistic profile)

The Realistic profile's $13,750/month breaks down across twelve service categories. The figures below are for a deployment of 500+ pipelines, 15–25 TB/day, and 30–50 concurrent executions.

Category	Min	Realistic	Pessimistic
1. Compute — GKE Autopilot (platform services, pipeline engine, Flink, connectors, cluster fee)	$1,873	$3,406	$6,639
2. Dataproc — Spark batch jobs + history server	$729	$2,652	$6,924
3. Database — Cloud SQL PG15 HA (instance, SSD, backups, read replica)	$320	$754	$1,741
4. Storage — Cloud Storage (Standard + Nearline + operations)	$65	$193	$639
5. Messaging — Confluent Kafka + Pub/Sub	$2,396	$4,836	$9,822
6. Caching — Memorystore Redis (primary HA + read replica)	$143	$716	$1,432
7. Orchestration — Cloud Composer (Airflow)	$282	$565	$1,130
8. Networking — VPN, egress, NAT, load balancer, DNS	$302	$590	$1,959
9. Security — Cloud Armor, Cloud KMS, Secret Manager	$32	$121	$445
10. Operations — Logging, Monitoring, Artifact Registry, Cloud Build	$29	$89	$300
11. Disaster recovery (`europe-west3`)	$20	$872	$4,270
12. Miscellaneous buffer (5–7%)	$79	$56	$464
Total monthly	$8,150	$13,750	$23,200
Total annual	$97,800	$165,000	$278,400

Two cost levers dominate:

Confluent Cloud Kafka is the single largest line item ($4,752/month in the Realistic profile) and the most negotiable — an annual commitment typically yields a 30–50% discount.
Dataproc Serverless cost is the most variable. "Pushdown SQL" — running the transformation inside the source database (Teradata, Snowflake) instead of moving data to Spark — is the primary cost lever.

3-Year TCO (GCP Full Cloud)

Profile	Year 1	Year 2	Year 3	3-Year total
Minimum	$97,800	$97,800	$97,800	$293,400
Realistic	$165,000	$165,000	$165,000	$495,000
Realistic + 3-year CUD	$155,000	$134,400	$134,400	$423,800
Pessimistic	$278,400	$278,400	$278,400	$835,200

A CUD (Committed Use Discount) is a price reduction Google gives in exchange for committing to a 1- or 3-year usage level. Applying a full 3-year CUD plus Dataproc Spot instances can cut the realistic annual GCP cost from $165,000 down to roughly $104,076/yr — a 3-year saving of about $182,772. The catch: CUDs require an upfront commitment, and Google bills the committed amount monthly regardless of actual usage.
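The savings arithmetic in this section can be checked directly. The sketch below uses only figures quoted above; the variable names are illustrative.

```python
REALISTIC_ANNUAL = 165_000        # list-price realistic profile, USD/yr
CUD_PLUS_SPOT_ANNUAL = 104_076    # full 3-year CUD plus Dataproc Spot, USD/yr
CUD_ONLY_YEARS = [155_000, 134_400, 134_400]  # "Realistic + 3-year CUD" row

# Full CUD plus Spot instances, compared with list price over three years
annual_saving = REALISTIC_ANNUAL - CUD_PLUS_SPOT_ANNUAL
three_year_saving = 3 * annual_saving
saving_pct = annual_saving / REALISTIC_ANNUAL

# The CUD-only table row, compared with the same baseline
cud_only_total = sum(CUD_ONLY_YEARS)
cud_only_saving = 3 * REALISTIC_ANNUAL - cud_only_total

print(annual_saving)         # 60924
print(three_year_saving)     # 182772
print(round(saving_pct, 3))  # 0.369
print(cud_only_total)        # 423800
print(cud_only_saving)       # 71200
```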

When to choose GCP Full Cloud

Pros: zero CapEx; elastic auto-scaling; fully managed data services (no database administration); Warsaw region keeps data GDPR-compliant; built-in DR in Frankfurt with under-60-second failover; access to Claude API, Vertex AI, and BigQuery ML; provisioning in 24–48 hours; automatic patching; fastest time-to-launch at 3–4 weeks.
Cons: ongoing spend with no "paid off" point; data-egress costs when results are sent back on-prem; internet dependency; billing spikes if workloads exceed estimates; moving 100 TB+ of Teradata data into GCP is expensive and risky; GCP vendor lock-in; foreign-exchange risk because Google bills in USD.

Scenario 3 — Hybrid (recommended)

In the Hybrid topology, the source databases — Teradata, Oracle, SAP HANA, MSSQL — stay on-premises (they are already there, and moving them is expensive and risky), while the DataFlow AI platform itself runs on GCP europe-central2. The two halves are joined by a dedicated, private 10 Gbps Google Cloud Interconnect link with under-5-millisecond latency.

Headline numbers: $0 new CapEx · $12,740/month GCP spend · $242K total annual · $741K 3-Year TCO · 10–14 weeks to launch · under-5 ms Interconnect latency.

What stays on-prem and why

Component	Why it stays on-prem	New cost
Teradata Data Warehouse	Existing investment, 100 TB+, migration risk; pushdown SQL runs near it	$0 (existing)
Oracle ERP database	Business-critical; on-prem regulatory policy; license tied to hardware	$0 (existing)
SAP HANA	SAP licensing tied to on-prem servers	$0 (existing)
Active Directory	Corporate identity; federated to Keycloak on GCP via LDAP	$0 (existing)
Debezium CDC agents (4 VMs)	Co-located with the source databases for low-latency change capture	$0 (existing VMware capacity)
PII masking agents (2 VMs)	Mask PESEL and other personal data before it crosses to GCP	$0 (existing VMware capacity)
On-prem Kafka buffer (3 brokers)	Absorbs CDC bursts; retains events if the Interconnect link drops	180,000 PLN/yr (or $0 if existing servers are reused)
Cloud Interconnect on-prem termination	Cisco ASR edge router, Warsaw cross-connect	7,200 PLN/month

A key compliance feature: personal data is masked on-prem at the Debezium layer before it ever crosses to GCP. PESEL, NIP, REGON, phone numbers, and email addresses are pseudonymized with a deterministic SHA-256 hash. The AI Copilot only ever receives schema context — never raw billing or customer data.
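Deterministic pseudonymization of this kind might look like the sketch below. It is not the platform's masking agent: the field names in `PII_FIELDS` and the optional `salt` parameter are assumptions added for illustration. A secret salt is worth considering because short numeric identifiers such as PESEL can be recovered from an unsalted hash by enumeration.

```python
import hashlib

# Hypothetical column names for the identifier types listed above.
PII_FIELDS = frozenset({"pesel", "nip", "regon", "phone", "email"})


def pseudonymize(value: str, salt: bytes = b"") -> str:
    """Return a deterministic SHA-256 hex digest of the value.

    The same input always yields the same token, so joins on the
    masked column still work downstream.
    """
    normalized = value.strip()
    return hashlib.sha256(salt + normalized.encode("utf-8")).hexdigest()


def mask_record(record: dict, pii_fields=PII_FIELDS, salt: bytes = b"") -> dict:
    """Replace PII fields with their pseudonyms; leave other fields as-is."""
    return {
        key: pseudonymize(str(value), salt)
        if key in pii_fields and value is not None
        else value
        for key, value in record.items()
    }


row = {"customer_id": 42, "email": "jan@example.com", "plan": "5G"}
masked = mask_record(row, salt=b"illustrative-secret")
print(masked["customer_id"], masked["plan"], len(masked["email"]))  # 42 5G 64
```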

Hybrid cost breakdown

The GCP-side platform cost in the Hybrid topology is about 7% lower than full-cloud ($12,740 versus $13,750 per month), because the CDC agents and their Kafka buffering run on-prem, reducing GKE, Kafka, and storage spend on GCP.

Cost component	Annual
On-prem incremental (Interconnect, Kafka buffer, 0.5 FTE network engineer)	~$89,100/yr (or ~$44K with existing Kafka)
GCP platform ($12,740/month)	$152,880/yr
Combined Hybrid annual	$241,980/yr (≈$242,000)
Combined — reusing existing on-prem Kafka	≈$197,000/yr

3-Year TCO (Hybrid)

Variant	Year 1	Year 2	Year 3	3-Year total
Hybrid (new Kafka hardware)	$257,000 (incl. +$15K Interconnect setup)	$242,000	$242,000	$741,000
Hybrid (existing on-prem Kafka)	$212,000	$197,000	$197,000	$606,000

There is a one-time ~$15,000 physical cross-connect installation at the Warsaw Interconnect point of presence, with a 6–8 week lead time for the physical circuit — this is the critical path for a Hybrid rollout.
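The Hybrid annual and 3-Year figures follow from the components above. In this sketch the existing-Kafka variant is derived by subtracting the 180,000 PLN/yr Kafka buffer at the 4.00 PLN/USD planning rate; the tables round each annual figure to the nearest thousand, so the totals are rounded the same way here.

```python
GCP_MONTHLY_USD = 12_740
ONPREM_INCREMENTAL_USD = 89_100   # Interconnect + Kafka buffer + 0.5 FTE
KAFKA_BUFFER_PLN = 180_000        # avoidable if existing servers are reused
INTERCONNECT_SETUP_USD = 15_000   # one-time, Year 1 only
USD_TO_PLN = 4                    # 1 USD = 4.00 PLN


def three_year_tco(annual_usd: int) -> int:
    """Three years of running cost plus the one-time cross-connect fee."""
    return 3 * annual_usd + INTERCONNECT_SETUP_USD


gcp_annual = 12 * GCP_MONTHLY_USD
hybrid_annual = gcp_annual + ONPREM_INCREMENTAL_USD
hybrid_annual_reuse = hybrid_annual - KAFKA_BUFFER_PLN // USD_TO_PLN

print(gcp_annual)                                     # 152880
print(hybrid_annual)                                  # 241980
print(hybrid_annual_reuse)                            # 196980
print(round(three_year_tco(hybrid_annual), -3))       # 741000
print(round(three_year_tco(hybrid_annual_reuse), -3)) # 606000
```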

When to choose Hybrid

Pros: keeps sensitive source databases on-prem; reuses existing infrastructure; GCP handles elastic compute, AI, and analytics; the 10 Gbps Interconnect is private, not over the internet; PII is masked on-prem before reaching GCP (RODO-compliant); a progressive migration path; lower GCP costs than full-cloud; a 3-year TCO of ~$606K versus $1.836M for On-Premises — a saving of $1.23M.
Cons: the most complex to set up (two environments); a 6–8 week Interconnect lead time; needs network engineers for BGP/VPN/VLAN configuration; data residency is split (metadata in GCP, raw data on-prem); partial dependency on Interconnect uptime — though the on-prem Kafka buffer holds events locally for 48 hours.

Why Hybrid is recommended for Polkomtel

Polkomtel already owns Teradata, Oracle, and SAP HANA on-prem, a sunk cost best kept in place to avoid migration risk. Hybrid costs $197K–$242K/yr versus $165K/yr for GCP-only, a $32K–$77K/yr premium that buys data sovereignty. Over three years, Hybrid ($741K) versus On-Premises ($2,286K) saves $1.55M, which is about 68% of the On-Premises cost and more than twice the Hybrid spend (a return on investment above 200%); with the existing on-prem Kafka reused ($606K), the saving rises to about 73%. The first production pipeline launches in 10–14 weeks rather than 14–24, and full production follows in 4–6 months rather than 6–9, which matters for hitting the Informatica decommission deadline.
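The comparison can be reproduced from the 3-Year totals. In this sketch "return on investment" is taken to mean the saving divided by the Hybrid spend, an assumption that matches the "above 200%" figure; the 73% saving corresponds to the existing-Kafka variant ($606K).

```python
ONPREM_TCO = 2_286_000
HYBRID_TCO_NEW_KAFKA = 741_000
HYBRID_TCO_EXISTING_KAFKA = 606_000
GCP_ONLY_ANNUAL = 165_000
HYBRID_ANNUAL_RANGE = (197_000, 242_000)


def saving_vs_onprem(hybrid_tco: int):
    """Return (absolute saving, saving as a fraction of On-Premises TCO)."""
    saving = ONPREM_TCO - hybrid_tco
    return saving, saving / ONPREM_TCO


saving_new, pct_new = saving_vs_onprem(HYBRID_TCO_NEW_KAFKA)
saving_reuse, pct_reuse = saving_vs_onprem(HYBRID_TCO_EXISTING_KAFKA)
roi = saving_new / HYBRID_TCO_NEW_KAFKA  # assumed ROI definition
premium = tuple(a - GCP_ONLY_ANNUAL for a in HYBRID_ANNUAL_RANGE)

print(saving_new, round(pct_new, 3))      # 1545000 0.676
print(saving_reuse, round(pct_reuse, 3))  # 1680000 0.735
print(round(roi, 2))                      # 2.09
print(premium)                            # (32000, 77000)
```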

Capacity planning and growth

The three-year capacity plan assumes a Year 1 launch, with active pipelines growing +30% in Year 2 and +50% by Year 3 (both relative to Year 1); the other metrics in the table grow faster.

Metric	Year 1	Year 2	Year 3	On-Prem impact	GCP / Hybrid impact
Active pipelines	200	260	300	Possible K8s node scale-out in Year 3	GKE autoscales; +$600/mo in Year 3
Daily data volume processed	500 GB	800 GB	1,200 GB	NetApp headroom; monitor Kafka storage	Cloud Storage is unbounded; +$150/mo Dataproc in Year 3
CDC event throughput (peak)	5,000 eps	8,000 eps	12,000 eps	May need a 6th Kafka broker in Year 3	Confluent autoscales partitions
AI Copilot queries/day	500	1,500	3,000	Needs Anthropic API regardless	Claude API +$100/mo Year 2, +$250/mo Year 3
Concurrent users	50	80	120	K8s horizontal pod autoscaler handles it	GKE autoscales
Lineage graph nodes	50,000	150,000	500,000	pgvector tuning needed at 500K	Cloud SQL auto-grows storage

The Year 3 cost increase is roughly +$130K CapEx and +$50K/yr OpEx for On-Premises, versus +$7,000/month (~$84K/yr) for GCP or Hybrid. On GCP and Hybrid, scaling is automatic — no node provisioning, storage auto-grows, and budget guardrails alert at 80%, 100%, and 130% of the configured budget.
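The budget guardrail thresholds can be expressed as a small check. This is an illustrative sketch only: `crossed_thresholds` is a hypothetical helper, not a GCP API, and it uses integer percentages so the comparison is exact for whole-dollar amounts.

```python
ALERT_THRESHOLDS_PCT = (80, 100, 130)  # from the guardrail policy above


def crossed_thresholds(spend_usd: int, budget_usd: int):
    """Return the alert thresholds (in percent) that current spend has reached."""
    if budget_usd <= 0:
        raise ValueError("budget must be positive")
    # Integer arithmetic avoids floating-point edge cases at exact thresholds.
    return [
        pct for pct in ALERT_THRESHOLDS_PCT
        if spend_usd * 100 >= pct * budget_usd
    ]


MONTHLY_BUDGET = 13_750  # realistic GCP Full Cloud profile

print(crossed_thresholds(10_000, MONTHLY_BUDGET))  # []
print(crossed_thresholds(11_000, MONTHLY_BUDGET))  # [80]
print(crossed_thresholds(13_750, MONTHLY_BUDGET))  # [80, 100]
print(crossed_thresholds(18_000, MONTHLY_BUDGET))  # [80, 100, 130]
```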

Time-to-launch comparison

Phase	On-Premises	GCP Full Cloud	Hybrid
Infrastructure provisioning	6–12 wks (hardware procurement)	2–5 days (Terraform apply)	6–8 wks (Cloud Interconnect physical circuit)
Platform deployment	4–8 wks	1–2 wks	2–3 wks
Connectivity & security	2–4 wks	1 wk	3–4 wks
First pipeline in production	14–24 wks	3–4 wks	10–14 wks
Full production (all pipelines)	6–9 months	2–3 months	4–6 months
Risk level	High (procurement, hardware failure)	Low (managed, auto-recovery)	Medium (network complexity)
Team requirement	2+ FTE infra engineers dedicated	0.5 FTE GCP admin	0.5 FTE network + 0.25 FTE GCP admin

Security and RODO compliance per scenario

RODO (Rozporządzenie o Ochronie Danych Osobowych) is the Polish name for the GDPR. The table below shows how each topology meets the key compliance requirements (✓ Full / ⚠ Partial).

Requirement	On-Prem	GCP Full	Hybrid
RODO (Polish GDPR)	✓ all data on-prem	✓ GCP `europe-central2` Warsaw	✓ PII masked before GCP
UKE telecom regulations	✓	⚠ partial — verify with legal	✓ CDR / raw telco data stays on-prem
SOC 2 Type II	⚠ manual implementation + audit	✓ inherits GCP certification	✓ GCP covered, on-prem manual
ISO 27001	⚠ manual audit	✓ GCP certified	⚠ partial
Encryption at rest	✓ Vault HSM seal	✓ Cloud KMS CMEK	✓ both
Encryption in transit	✓ internal mTLS	✓ Google TLS 1.3	✓ MACsec Interconnect + mTLS
Network segmentation (zero-trust)	⚠ manual VLAN + firewall	✓ VPC Service Controls	✓ GCP VPC + on-prem VLAN
Audit logging (immutable)	⚠ ELK + Wazuh manual SIEM	✓ Cloud Audit Logs (400-day retention)	✓ both
DDoS protection	⚠ FortiGate IPS	✓ Cloud Armor Enterprise	✓ Cloud Armor
Data Loss Prevention	⚠ manual policies	✓ Cloud DLP + DataFlow governance	✓ Debezium masks + Cloud DLP

All three topologies share the same DataFlow AI security features: Keycloak 24 for OIDC/SAML with Active Directory federation and MFA; HashiCorp Vault for dynamic 30-minute database credentials; API Gateway RBAC with 5 roles and 26 permissions; default-deny Kubernetes network policies; and 30-day rotation of all database passwords, API keys, and Kafka credentials.
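The two credential lifetimes above (30-minute dynamic database credentials and 30-day rotation) can be illustrated with a simple age check. The helpers below are hypothetical and do not call Vault or any other service; they only show the time arithmetic.

```python
from datetime import datetime, timedelta, timezone

DYNAMIC_DB_CREDENTIAL_TTL = timedelta(minutes=30)  # Vault dynamic credentials
ROTATION_INTERVAL = timedelta(days=30)             # passwords, API keys, Kafka


def lease_expired(issued_at: datetime, now: datetime) -> bool:
    """True once a dynamic database credential has outlived its 30-minute TTL."""
    return now - issued_at >= DYNAMIC_DB_CREDENTIAL_TTL


def rotation_due(last_rotated: datetime, now: datetime) -> bool:
    """True once a long-lived secret has reached the 30-day rotation interval."""
    return now - last_rotated >= ROTATION_INTERVAL


now = datetime(2026, 3, 31, 12, 0, tzinfo=timezone.utc)
print(lease_expired(now - timedelta(minutes=29), now))  # False
print(lease_expired(now - timedelta(minutes=30), now))  # True
print(rotation_due(now - timedelta(days=31), now))      # True
```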

For disaster recovery, the two source documents quote different targets — note the discrepancy:

The deployment-scenarios document quotes a warm GCP DR target of RTO < 60 seconds, RPO < 30 seconds.
The GCP cost-analysis document quotes RTO < 4 hours, RPO < 15 minutes for the same DR setup.

Decision guide — choosing a topology

The source document provides a decision flowchart. Walk through these questions in order:

Q1. Do regulations require ALL data to stay on-premises?
      YES  ->  On-Premises  (military / government air-gap)
      NO   ->  go to Q2

Q2. Are there large existing on-prem databases (Teradata / Oracle / SAP HANA)?
      YES  ->  go to Q3
      NO   ->  go to Q4

Q3. Is data transfer to the cloud acceptable, given PII masking on-prem first?
      YES  ->  HYBRID  (recommended)
      NO   ->  On-Premises

Q4. Is the budget under $200K/year?
      YES  ->  GCP Full Cloud (Minimum or Realistic)
      NO   ->  GCP Full Cloud (Realistic or Pessimistic)

The Polkomtel Plus path through this flowchart: Q1 = No (regulations permit cloud), Q2 = Yes (large Teradata, Oracle, and SAP HANA estates already on-prem), Q3 = Yes (transfer is acceptable with on-prem PII masking) → Hybrid.
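The flowchart can be encoded as a short function. This is a sketch of the four questions as written above; the function and parameter names are illustrative.

```python
def choose_topology(
    all_data_must_stay_onprem: bool,    # Q1
    has_large_onprem_databases: bool,   # Q2
    masked_transfer_acceptable: bool,   # Q3
    budget_under_200k_per_year: bool,   # Q4
) -> str:
    """Walk Q1 to Q4 in order and return the recommended topology."""
    if all_data_must_stay_onprem:
        return "On-Premises"
    if has_large_onprem_databases:
        return "Hybrid" if masked_transfer_acceptable else "On-Premises"
    if budget_under_200k_per_year:
        return "GCP Full Cloud (Minimum or Realistic)"
    return "GCP Full Cloud (Realistic or Pessimistic)"


# Polkomtel Plus: Q1 = No, Q2 = Yes, Q3 = Yes (Q4 is never reached)
print(choose_topology(False, True, True, budget_under_200k_per_year=False))  # Hybrid
```

Note that Q4 is only evaluated when Q2 is answered No, so the budget question never influences the outcome for an organization with large existing on-prem databases.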

Side-by-side summary

Factor	On-Premises	GCP Full Cloud	Hybrid ⭐
Upfront CapEx	2.16M PLN (~$540K)	$0	$0 new
3-Year TCO	~$2,286,000	~$495,000 (realistic)	~$606,000–$741,000
Time to first pipeline	14–24 weeks	3–4 weeks	10–14 weeks
Elastic scaling	No — fixed hardware	Yes — fully automatic	Yes — GCP platform scales
Data sovereignty	Maximum	GCP Warsaw region	Source data on-prem, metadata in GCP
Dedicated staff	2+ FTE infra engineers	0.5 FTE	0.75 FTE
Best fit	Air-gapped / regulatory ban on cloud	Greenfield, budget-led, fast launch	Large existing on-prem databases

Two deployment realities

The topologies above describe the intended GKE-based GCP architecture. The platform's current live production deployment is a single Debian VPS running Docker Compose — not GKE, and not multi-region. For the build and rollout mechanics of what actually runs today, see Deployment & rollout.

Where to go next

For the business case, ROI, and migration economics behind choosing a topology, see Business value & ROI.
For the actual build, release, and rollout process — including the live VPS deployment — see Deployment & rollout.
For administrative tasks once a topology is live, see the Admin guide.