Observability & monitoring
Observability on the DataFlow AI Platform spans two layers: an infrastructure layer (Prometheus scraping Actuator endpoints, visualized in Grafana) and an application layer (monitor-service, a first-class product feature that tracks alerts, SLAs, cost, freshness, and pipeline run history). This page explains both and how to use them to keep pipelines healthy.
The observability stack
┌──────────────┐  ┌───────────────┐  ┌──────────────┐  ┌─────────────┐
│ api-gateway  │  │ metadata-svc  │  │ pipeline-eng │  │ monitor-svc │
│ /actuator/   │  │ /actuator/    │  │ /actuator/   │  │ /actuator/  │
│ prometheus   │  │ prometheus    │  │ prometheus   │  │ prometheus  │
└──────┬───────┘  └──────┬────────┘  └──────┬───────┘  └──────┬──────┘
       │                 │                  │                 │
       └─────────────────┴────────┬─────────┴─────────────────┘
                                  │ scrape every 15s
                        ┌─────────▼──────────┐
                        │     Prometheus     │
                        │      v2.54.1       │
                        └─────────┬──────────┘
                                  │ datasource (default)
                        ┌─────────▼──────────┐
                        │   Grafana 11.2.2   │
                        │  dashboards (UI)   │
                        └────────────────────┘
monitor-service ── OpenTelemetry tracing + Prometheus metrics + SSE streams
copilot ── exposes /metrics (Python)
The stack has three cooperating parts:
- Prometheus scrapes metrics from each service's Spring Boot Actuator endpoint.
- Grafana queries Prometheus as a datasource and renders dashboards.
- monitor-service is the product's own monitoring service: it adds OpenTelemetry tracing, business-level alerting, SLA burn-rate tracking, and SSE live streams the frontend consumes.
Prometheus
Prometheus (prom/prometheus:v2.54.1) is configured by docker/prometheus/prometheus.yml with a 15-second scrape interval. It scrapes the /actuator/prometheus endpoint from:
- api-gateway
- metadata-service
- pipeline-engine
- pushdown-sql
- lineage-service
- monitor-service
Known scrape-port mismatch
The scrape config targets host-port numbers (for example metadata-service:8081), but inside the Docker network every JVM service listens on container port 8080. This is a potential mismatch — when adding or debugging scrape targets, verify the port the container actually exposes on the dataflow-network bridge, not the host-published port.
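A quick way to audit a target list for this mismatch is sketched below. It is a minimal sketch, not part of the repo: the function name and the sample target list are hypothetical, and the expected port of 8080 applies only to the JVM services described above (the copilot Python service may listen elsewhere).

```python
from typing import Iterable


def find_port_mismatches(targets: Iterable[str], expected_port: int = 8080) -> list[str]:
    """Return scrape targets whose port differs from the expected container port."""
    mismatched = []
    for target in targets:
        _host, sep, port = target.rpartition(":")
        # A target with no explicit or numeric port cannot be verified, so flag it too.
        if not sep or not port.isdigit() or int(port) != expected_port:
            mismatched.append(target)
    return mismatched


# Hypothetical targets copied by hand from a scrape config.
targets = ["api-gateway:8080", "metadata-service:8081", "monitor-service:8080"]
print(find_port_mismatches(targets))
```

Any target the function returns is worth checking against the port the container actually listens on inside dataflow-network.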
What metrics are exposed
Spring Boot Actuator with Micrometer exposes the standard JVM and HTTP metric families on /actuator/prometheus, including:
| Metric family | What it tells you |
|---|---|
| http_server_requests_* | Request count, latency histograms, status codes per route |
| jvm_memory_*, jvm_gc_* | Heap/non-heap usage and garbage-collection behavior |
| jvm_threads_* | JVM thread counts and states (a hint of thread-pool saturation) |
| process_cpu_usage, system_cpu_usage | CPU usage of the JVM process and of the whole host |
| hikaricp_connections_* | Database connection-pool usage and waits |
The copilot Python service exposes its own /metrics endpoint for scraping.
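To sanity-check a scrape body without Grafana, the text exposition format can be parsed directly. The sketch below computes a 5xx ratio from http_server_requests_seconds_count samples (the Prometheus name Micrometer gives the http.server.requests timer). The function name and sample lines are illustrative, the label parser is deliberately simple (it does not handle escaped quotes), and the result is a ratio over lifetime counters, not a rate over a time window.

```python
import re

PREFIX = "http_server_requests_seconds_count{"
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def five_xx_ratio(exposition_text: str) -> float:
    """Share of counted requests whose status label starts with 5."""
    total = 0.0
    errors = 0.0
    for line in exposition_text.splitlines():
        if not line.startswith(PREFIX):
            continue
        # Split on the last '}' because uri labels may themselves contain braces.
        label_part, _, value_part = line[len(PREFIX):].rpartition("}")
        labels = dict(LABEL_RE.findall(label_part))
        count = float(value_part.split()[0])
        total += count
        if labels.get("status", "").startswith("5"):
            errors += count
    return errors / total if total else 0.0


# Hypothetical scrape excerpt.
sample = """\
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{method="GET",status="200",uri="/api/v1/runs/{runId}/stream"} 90.0
http_server_requests_seconds_count{method="GET",status="503",uri="/api/v1/pipelines"} 10.0
"""
print(five_xx_ratio(sample))
```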
Grafana
Grafana (grafana/grafana:11.2.2) is provisioned by docker/grafana/datasources.yml with a single Prometheus datasource (http://prometheus:9090), set as default.
No dashboards ship in the repo
The repository provisions the Prometheus datasource but does not provision any dashboards. Dashboards are created manually in the Grafana UI. When standing up a fresh environment, plan to import or build dashboards yourself — there is no dashboard-as-code in the repo.
On the VPS, Grafana is published on port 4001 (offset to coexist with the itsm stack). On a default Compose deployment it is on 3001.
Recommended dashboard panels
Because dashboards are built by hand, a useful starting set covers:
- Request health: http_server_requests rate, p50/p95/p99 latency, and 5xx ratio per service.
- JVM health: heap usage trend, GC pause time, thread count.
- Database: HikariCP active/idle/pending connections.
- Pipeline throughput: pipeline run counts and failure rate (sourced from monitor-service data).
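As a starting point for the request-health panels, the helper below assembles candidate PromQL expressions. It is a sketch under assumptions: grouping by the job label and the 5m window are illustrative choices, the helper name is hypothetical, and the histogram_quantile queries only return data if the services publish http_server_requests_seconds_bucket series (percentile histograms are not enabled by default in Micrometer).

```python
def request_health_queries(window: str = "5m") -> dict[str, str]:
    """Build PromQL strings for request rate, 5xx ratio, and latency quantiles."""
    count = "http_server_requests_seconds_count"
    bucket = "http_server_requests_seconds_bucket"
    queries = {
        "request_rate": f"sum by (job) (rate({count}[{window}]))",
        "error_ratio": (
            f'sum by (job) (rate({count}{{status=~"5.."}}[{window}]))'
            f" / sum by (job) (rate({count}[{window}]))"
        ),
    }
    for name, quantile in (("p50", "0.5"), ("p95", "0.95"), ("p99", "0.99")):
        queries[f"latency_{name}"] = (
            f"histogram_quantile({quantile}, "
            f"sum by (le, job) (rate({bucket}[{window}])))"
        )
    return queries


for panel, expr in request_health_queries().items():
    print(f"{panel}: {expr}")
```

Each expression can be pasted into a Grafana panel against the default Prometheus datasource and adjusted to the labels the services actually emit.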
OpenTelemetry tracing
monitor-service carries the platform's distributed tracing via its OpenTelemetryConfig and a filter/TracingFilter. Request correlation is propagated end to end using an X-Request-Id header, so a single request can be followed across the gateway and downstream services.
The gateway's RequestLoggingFilter produces structured access logs, and the request ID generated or accepted at the edge flows through every hop. When investigating a slow or failed request, capture the X-Request-Id from the response and use it to correlate logs and traces.
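The "generated or accepted at the edge" behavior can be pictured with a small sketch. This is not the gateway's code: the function name is hypothetical, and using a UUID for newly generated IDs is an assumption.

```python
import uuid

HEADER = "X-Request-Id"


def ensure_request_id(headers: dict[str, str]) -> str:
    """Reuse the caller's request ID if present, otherwise mint a new one."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == HEADER.lower() and value.strip():
            return value.strip()
    # Assumption: new IDs are random UUIDs; the real format may differ.
    return str(uuid.uuid4())


print(ensure_request_id({"x-request-id": "req-123"}))
print(ensure_request_id({}))
```

Accepting an inbound ID lets an upstream caller's identifier survive the hop, while generating one guarantees every request is traceable.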
Health and readiness probes
The api-gateway exposes a HealthController that serves three probe endpoints under the Actuator health path:
| Probe | Endpoint | Meaning |
|---|---|---|
| Liveness | /actuator/health/liveness | The process is alive; restart it if this fails |
| Readiness | /actuator/health/readiness | The service is ready to receive traffic |
| Startup | /actuator/health/startup | The service has finished initializing |
The gateway's readiness probe is dependency-aware — it probes Keycloak's .well-known OIDC discovery endpoint and the metadata-service health endpoint. This means the gateway will not report ready until its critical upstreams are reachable.
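The idea behind a dependency-aware readiness probe can be sketched as follows. This is an illustration only, not the HealthController implementation: the check names and the stand-in lambdas are hypothetical, and a real check would issue an HTTP request to the upstream.

```python
from typing import Callable


def is_ready(checks: dict[str, Callable[[], bool]]) -> tuple[bool, dict[str, bool]]:
    """Report ready only when every upstream check passes."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # An unreachable upstream counts as not ready rather than crashing the probe.
            results[name] = False
    return all(results.values()), results


# Stand-in checks; real ones would call the OIDC discovery and health endpoints.
checks = {
    "keycloak-oidc-discovery": lambda: True,
    "metadata-service-health": lambda: False,
}
print(is_ready(checks))
```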
Every Kotlin service container also defines a Docker healthcheck on /actuator/health. Compose depends_on … condition: service_healthy uses these to enforce start order: postgres → keycloak / kafka / redis → application services → frontend.
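The start order that depends_on enforces is a topological ordering of the dependency graph. The sketch below models it with the standard-library graphlib module (Python 3.9+). The edges are an assumption that mirrors the tier order stated above, with api-gateway standing in for the application tier; the authoritative depends_on entries are in the Compose file.

```python
from graphlib import TopologicalSorter

# Each key starts only after every service in its set is healthy.
# Assumed edges, mirroring: postgres -> keycloak/kafka/redis -> apps -> frontend.
depends_on = {
    "keycloak": {"postgres"},
    "kafka": {"postgres"},
    "redis": {"postgres"},
    "api-gateway": {"keycloak", "kafka", "redis"},
    "frontend": {"api-gateway"},
}


def start_order(graph: dict[str, set[str]]) -> list[str]:
    """Return one valid start order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


order = start_order(depends_on)
print(order)
```

Services within the same tier may appear in any relative order, which matches how Compose starts them in parallel once their dependencies are healthy.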
To check health manually on the VPS:
# gateway aggregate health
curl -sf http://127.0.0.1:8085/actuator/health | jq
# readiness specifically
curl -sf http://127.0.0.1:8085/actuator/health/readiness | jq
# container health at a glance
docker compose ps
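When the aggregate health response includes a components object (Spring Boot only returns it when the endpoint is configured to show components or details), a short script can list what is failing. The function name and the component names in the sample are hypothetical.

```python
import json


def summarize_health(body: str) -> tuple[str, list[str]]:
    """Return the overall status and the names of components that are not UP."""
    health = json.loads(body)
    overall = health.get("status", "UNKNOWN")
    components = health.get("components", {})
    failing = sorted(
        name for name, detail in components.items()
        if detail.get("status") != "UP"
    )
    return overall, failing


# Hypothetical response body from /actuator/health.
sample = '{"status": "DOWN", "components": {"db": {"status": "UP"}, "keycloak": {"status": "DOWN"}}}'
print(summarize_health(sample))
```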
Application-level monitoring — monitor-service
monitor-service is not just infrastructure plumbing — it is a first-class product feature. It owns alerts, SLA tracking, pipeline run history, cost tracking, data freshness, and schema-change monitoring.
Its Flyway schema is built by migrations V9–V25, covering:
- Alert persistence and notification routing
- Cluster metrics history
- SLA burn-rate tracking
- Data quarantine
- Schema-change monitoring (the monitor_* tables, including monitor_alerts and monitor_pipeline_runs)
monitor-service also pulls Keycloak user metrics when MONITOR_KEYCLOAK_ENABLED=true.
How to observe pipeline health
A pipeline run flows through the platform and surfaces in monitoring at several points:
- pipeline-engine executes the run and publishes run events via ExecutionEventPublisher / PipelineRunLogPublisher.
- On task failure, SelfHealingService classifies the failure and applies recovery strategies; these recovery events are observable.
- On completion, monitor-service ingests the run metrics and evaluates cost, SLA, and freshness, raising alerts where thresholds are crossed.
- Run history is persisted in monitor_pipeline_runs; alerts in monitor_alerts.
To check pipeline run state directly in the database:
docker compose exec postgres psql -U postgres -d dataflow_metadata \
-c "SELECT status, COUNT(*) FROM monitor_pipeline_runs GROUP BY status;"
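The grouped counts from that query translate into a failure rate with a few lines of code. In this sketch only the FAILED status comes from this page; the other status names and the counts are made up for illustration.

```python
def failure_rate(status_counts: dict[str, int], failed_status: str = "FAILED") -> float:
    """Fraction of runs in the failed status; 0.0 when there are no runs."""
    total = sum(status_counts.values())
    return status_counts.get(failed_status, 0) / total if total else 0.0


# Hypothetical result of the GROUP BY query above.
counts = {"SUCCEEDED": 45, "FAILED": 5}
print(f"{failure_rate(counts):.1%}")
```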
SSE live streams
The platform pushes real-time observability data to the UI using Server-Sent Events rather than polling.
| Stream | Source | Carries |
|---|---|---|
| Alerts | monitor-service /api/v1/monitor/sse/** | Live alert firing/clearing |
| Metrics | monitor-service SSE | Live metric updates |
| Notifications | monitor-service notification inbox | User-facing notification inbox events |
| Copilot chat | copilot | Token-streamed AI responses |
In addition, WebSocket carries pipeline run log streaming on /api/v1/runs/{runId}/stream — the gateway rewrites http→ws to pipeline-engine, and the frontend LogStream component renders live task-state and log updates.
SSE requires correct nginx config
SSE only works through the VPS nginx because /api/ is configured with proxy_buffering off and the conditional WebSocket-upgrade map. If alerts or live metrics stop updating after an nginx change, re-check both settings: a hardcoded Connection "upgrade" header breaks SSE and WebSocket streams. See the deployment guide's nginx hotfix section.
The frontend's SSEManager (EventSource) drives alerts, metrics, and notifications; the request interceptor in frontend/src/api/client.ts ensures no stream opens before the Keycloak token is ready.
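For debugging a stream outside the browser, it helps to know what arrives on the wire. The sketch below is a simplified parser for the text/event-stream format: it handles event and data fields and comment lines, and ignores id and retry. The event name and payloads in the sample are hypothetical; the platform's actual event names are not documented here.

```python
def parse_sse(stream_text: str) -> list[dict[str, str]]:
    """Parse a text/event-stream body into a list of {event, data} dicts."""
    events = []
    event_type = "message"  # default type when no event: field is sent
    data_lines: list[str] = []
    for line in stream_text.splitlines():
        if line == "":
            # A blank line dispatches the event accumulated so far.
            if data_lines:
                events.append({"event": event_type, "data": "\n".join(data_lines)})
            event_type, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is stripped
        if field == "event":
            event_type = value
        elif field == "data":
            data_lines.append(value)
    return events


sample = ': keep-alive\nevent: alert\ndata: {"id": 1}\n\ndata: line one\ndata: line two\n\n'
print(parse_sse(sample))
```

An event that is not followed by a blank line is never dispatched, which is why a buffering proxy that holds back the trailing newline makes the UI appear frozen.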
Logging
Each service produces structured logs. The gateway's RequestLoggingFilter emits structured access logs at the edge, and the X-Request-Id correlation header threads through downstream services so a request can be reconstructed across hops.
On the VPS, container logs are accessible through Docker Compose:
# tail logs for one service
docker compose logs -f --tail=200 api-gateway
# logs since a timestamp
docker compose logs --since 2026-05-20T04:00:00 monitor-service
When diagnosing an issue, capture the X-Request-Id from the failing response and grep for it across the relevant service logs to follow the full request path.
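Once logs from several services are collected, following a request is a matter of filtering by its ID. The sketch below assumes logs have already been gathered into per-service lists of lines; the log line format and the request ID shown are hypothetical.

```python
def lines_for_request(logs_by_service: dict[str, list[str]], request_id: str) -> list[tuple[str, str]]:
    """Return (service, line) pairs for every log line mentioning the request ID."""
    return [
        (service, line)
        for service, lines in logs_by_service.items()
        for line in lines
        if request_id in line
    ]


# Hypothetical log excerpts, e.g. captured from `docker compose logs`.
logs = {
    "api-gateway": [
        "GET /api/v1/pipelines 502 requestId=abc-123",
        "GET /api/v1/health 200 requestId=zzz-999",
    ],
    "metadata-service": ["requestId=abc-123 upstream timeout"],
}
for service, line in lines_for_request(logs, "abc-123"):
    print(f"{service}: {line}")
```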
Alerting
Alerting exists at two levels.
Application alerting (monitor-service)
monitor-service evaluates SLA burn-rate, cost, freshness, and schema-change conditions and persists alerts to monitor_alerts, with configurable notification routing. These alerts are surfaced live to the UI over the SSE alert stream and through the notification inbox.
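For readers unfamiliar with burn rate, the sketch below uses the common SRE definition: the observed error ratio divided by the error budget implied by the SLO target. How monitor-service computes its own burn rate is not shown on this page, so treat the formula and the sample numbers as an illustration rather than the service's implementation.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


# Illustrative numbers: 0.5% of requests failing against a 99.9% SLO.
print(round(burn_rate(error_ratio=0.005, slo_target=0.999), 2))
```

A burn rate of 1 means the error budget is consumed exactly over the SLO period; values well above 1 mean the budget will run out early and usually justify an alert.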
Infrastructure alerting (GKE path)
The DR runbook and the production GitHub Actions workflow reference PagerDuty (P1/P2 routing keys) and Slack (#dataflow-incidents, SLACK_WEBHOOK_DEPLOYMENTS). Cloud Monitoring alert policies are described in docs/deployment-scenarios.html.
Infrastructure alerting is GKE-only
PagerDuty, Slack deployment notifications, and Cloud Monitoring alert policies belong to the documented GKE path. The single-VPS production deployment relies on monitor-service's in-product alerting plus manual health checks — there is no external paging configured for the VPS.
A quick observability checklist
When you need to know "is the platform healthy right now?", run through this list:
- Run docker compose ps. Are all 8 services and the infra tier healthy?
- Run curl -sf http://127.0.0.1:8085/actuator/health/readiness. Is the gateway ready?
- Open Grafana. Are request latency p99 and 5xx ratio within normal bounds?
- Check monitor_pipeline_runs for a spike in FAILED status.
- Check monitor_alerts (and the UI alert stream) for unacknowledged alerts.
- Run docker compose logs --since 1h on any service showing errors, and correlate by X-Request-Id.