
Observability & monitoring

Observability on the DataFlow AI Platform spans two layers: an infrastructure layer (Prometheus scraping Actuator endpoints, visualized in Grafana) and an application layer (monitor-service, a first-class product feature that tracks alerts, SLAs, cost, freshness, and pipeline run history). This page explains both and how to use them to keep pipelines healthy.


The observability stack

  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
  │ api-gateway   │  │ metadata-svc  │  │ pipeline-eng  │  │ monitor-svc   │
  │ /actuator/    │  │ /actuator/    │  │ /actuator/    │  │ /actuator/    │
  │  prometheus   │  │  prometheus   │  │  prometheus   │  │  prometheus   │
  └───────┬───────┘  └───────┬───────┘  └───────┬───────┘  └───────┬───────┘
          │                  │                  │                  │
          └──────────────────┴────────┬─────────┴──────────────────┘
                                      │  scrape every 15s
                            ┌─────────▼──────────┐
                            │   Prometheus       │
                            │   v2.54.1          │
                            └─────────┬──────────┘
                                      │  datasource (default)
                            ┌─────────▼──────────┐
                            │   Grafana 11.2.2   │
                            │   dashboards (UI)  │
                            └────────────────────┘

  monitor-service ── OpenTelemetry tracing + Prometheus metrics + SSE streams
  copilot         ── exposes /metrics (Python)

The stack has three cooperating parts:

  • Prometheus scrapes metrics from each service's Spring Boot Actuator endpoint.
  • Grafana queries Prometheus as a datasource and renders dashboards.
  • monitor-service is the product's own monitoring service — it adds OpenTelemetry tracing, business-level alerting, SLA burn-rate tracking, and SSE live streams the frontend consumes.

Prometheus

Prometheus (prom/prometheus:v2.54.1) is configured by docker/prometheus/prometheus.yml with a 15-second scrape interval. It scrapes the /actuator/prometheus endpoint from:

  • api-gateway
  • metadata-service
  • pipeline-engine
  • pushdown-sql
  • lineage-service
  • monitor-service

Known scrape-port mismatch

The scrape config targets host-port numbers (for example metadata-service:8081), but inside the Docker network every JVM service listens on container port 8080. A target that points at the host-published port cannot be scraped from inside the network. When adding or debugging scrape targets, verify the port the container actually exposes on the dataflow-network bridge, not the host-published port.
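
One way to catch this class of error early is to compare each scrape target's port with the port the JVM containers listen on. The sketch below is illustrative only: find_port_mismatches is a hypothetical helper that does not exist in the repo, and the target list is written by hand here (in practice it would be read from docker/prometheus/prometheus.yml).

```python
def find_port_mismatches(targets, container_port=8080):
    """Return the scrape targets whose port differs from container_port.

    Each target is a "host:port" string, as used in a Prometheus
    static_configs entry. Targets without an explicit numeric port are
    reported too, because they cannot be confirmed to point at 8080.
    """
    mismatched = []
    for target in targets:
        host, sep, port = target.rpartition(":")
        if not sep or not port.isdigit() or int(port) != container_port:
            mismatched.append(target)
    return mismatched


# Hypothetical target list; only metadata-service:8081 is taken from the text.
targets = ["api-gateway:8080", "metadata-service:8081"]
print(find_port_mismatches(targets))  # prints ['metadata-service:8081']
```

The check applies to the JVM services only; the copilot Python service may listen on a different port.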

What metrics are exposed

Spring Boot Actuator with Micrometer exposes the standard JVM and HTTP metric families on /actuator/prometheus, including:

Metric family                         What it tells you
http_server_requests_*                Request count, latency histograms, status codes per route
jvm_memory_*, jvm_gc_*                Heap/non-heap usage and garbage-collection behavior
jvm_threads_*                         Thread pool saturation
process_cpu_usage, system_cpu_usage   CPU pressure on the host and process
hikaricp_connections_*                Database connection-pool usage and waits

The copilot Python service exposes its own /metrics endpoint for scraping.
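
To make the http_server_requests family concrete, the following sketch reads the text that /actuator/prometheus returns and computes the share of requests that ended in a 5xx status. It assumes Micrometer's usual series name http_server_requests_seconds_count and a status label; five_xx_ratio is a hypothetical helper and the sample values are invented.

```python
import re

# Matches Micrometer count samples such as:
#   http_server_requests_seconds_count{method="GET",status="200",uri="/x"} 42.0
# The greedy label group tolerates templated URIs that contain braces,
# for example uri="/api/v1/runs/{runId}/stream".
SAMPLE_RE = re.compile(
    r'^http_server_requests_seconds_count\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
STATUS_RE = re.compile(r'status="(\d{3})"')


def five_xx_ratio(exposition_text):
    """Return the fraction of counted requests whose status label is 5xx."""
    total = 0.0
    errors = 0.0
    for line in exposition_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # comment lines and other metric families
        status = STATUS_RE.search(match.group("labels"))
        if not status:
            continue
        count = float(match.group("value"))
        total += count
        if status.group(1).startswith("5"):
            errors += count
    return errors / total if total else 0.0


# Invented sample; the URIs are ones mentioned elsewhere on this page.
sample = """\
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{method="GET",status="200",uri="/api/v1/runs/{runId}/stream"} 6.0
http_server_requests_seconds_count{method="GET",status="200",uri="/actuator/health"} 90.0
http_server_requests_seconds_count{method="GET",status="503",uri="/actuator/health/readiness"} 4.0
http_server_requests_seconds_sum{method="GET",status="200",uri="/actuator/health"} 1.2
"""
print(five_xx_ratio(sample))
```

The counters are cumulative since process start, so this gives a lifetime ratio. A dashboard would normally compute the same ratio over a time window in Prometheus instead.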


Grafana

Grafana (grafana/grafana:11.2.2) is provisioned by docker/grafana/datasources.yml with a single Prometheus datasource (http://prometheus:9090), set as default.

No dashboards ship in the repo

The repository provisions the Prometheus datasource but does not provision any dashboards. Dashboards are created manually in the Grafana UI. When standing up a fresh environment, plan to import or build dashboards yourself — there is no dashboard-as-code in the repo.

On the VPS, Grafana is published on port 4001 (offset to coexist with the itsm stack). On a default Compose deployment it is on 3001.

Because dashboards are built by hand, a useful starting set covers:

  • Request health: http_server_requests rate, p50/p95/p99 latency, and 5xx ratio per service.
  • JVM health: heap usage trend, GC pause time, thread count.
  • Database: HikariCP active/idle/pending connections.
  • Pipeline throughput: pipeline run counts and failure rate (sourced from monitor-service data).
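
The latency percentiles in the list above are estimates derived from histogram buckets, not exact values. The sketch below shows the linear-interpolation idea behind that estimate, which is the same idea PromQL's histogram_quantile() uses. It assumes the services publish bucket series (http_server_requests_seconds_bucket), which Micrometer only does when histogram publishing is enabled; the bucket values are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs, one per
    `le` label value, and must include the +Inf bucket (math.inf).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # Rank falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            span = count - prev_count
            if span == 0:
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / span
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented buckets: (upper bound in seconds, cumulative request count).
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, latency_buckets))
```

Because the result is interpolated, its accuracy depends on how finely the buckets are spaced around the percentile of interest.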

OpenTelemetry tracing

monitor-service carries the platform's distributed tracing via its OpenTelemetryConfig and a filter/TracingFilter. Request correlation is propagated end to end using an X-Request-Id header, so a single request can be followed across the gateway and downstream services.

The gateway's RequestLoggingFilter produces structured access logs, and the request ID generated or accepted at the edge flows through every hop. When investigating a slow or failed request, capture the X-Request-Id from the response and use it to correlate logs and traces.
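
The "generated or accepted at the edge" behavior can be sketched as follows. This is not the gateway's actual filter code: resolve_request_id and outbound_headers are hypothetical names, and the UUID format for generated IDs is an assumption, since the source does not say how the gateway mints them.

```python
import uuid

REQUEST_ID_HEADER = "X-Request-Id"


def resolve_request_id(headers):
    """Return the inbound request ID, or mint a new one if none was sent.

    HTTP header names are case-insensitive, so the lookup ignores case.
    """
    for name, value in headers.items():
        if name.lower() == REQUEST_ID_HEADER.lower() and value.strip():
            return value.strip()
    # Assumption: a random UUID is an acceptable ID format.
    return str(uuid.uuid4())


def outbound_headers(inbound_headers):
    """Build headers for a downstream call that carry the same request ID."""
    return {REQUEST_ID_HEADER: resolve_request_id(inbound_headers)}


print(outbound_headers({"x-request-id": "abc-123"}))  # prints {'X-Request-Id': 'abc-123'}
```

Reusing a caller-supplied ID instead of always generating a new one is what lets a client quote the same ID when reporting a problem.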


Health and readiness probes

The api-gateway exposes a HealthController that serves the liveness, readiness, and startup probe trio under /actuator/health:

Probe      Endpoint                     Meaning
Liveness   /actuator/health/liveness    The process is alive; restart it if this fails
Readiness  /actuator/health/readiness   The service is ready to receive traffic
Startup    /actuator/health/startup     The service has finished initializing

The gateway's readiness probe is dependency-aware — it probes Keycloak's .well-known OIDC discovery endpoint and the metadata-service health endpoint. This means the gateway will not report ready until its critical upstreams are reachable.
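
The dependency-aware readiness logic amounts to "ready only if every critical upstream check passes". A minimal sketch of that rule, assuming each upstream check can be expressed as a function returning True or False; gateway_ready is a hypothetical name, and the real checks (HTTP calls to Keycloak and metadata-service) are replaced by stand-ins here:

```python
def gateway_ready(checks):
    """Evaluate named upstream checks and report readiness.

    checks maps an upstream name to a zero-argument callable that returns
    True when that upstream is reachable. A check that raises is treated
    as a failure, so one broken probe cannot crash the evaluation.
    Returns (ready, failed_names).
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return (not failed, failed)


# Stand-in checks; real ones would call the upstream health endpoints.
ready, failed = gateway_ready({
    "keycloak-oidc-discovery": lambda: True,
    "metadata-service-health": lambda: False,
})
print(ready, failed)  # prints False ['metadata-service-health']
```

Returning the failed names alongside the boolean makes it easier to see which upstream is holding the gateway out of rotation.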

Every Kotlin service container also defines a Docker healthcheck on /actuator/health. Compose depends_on … condition: service_healthy uses these to enforce start order: postgres → keycloak / kafka / redis → application services → frontend.

To check health manually on the VPS:

# gateway aggregate health
curl -sf http://127.0.0.1:8085/actuator/health | jq

# readiness specifically
curl -sf http://127.0.0.1:8085/actuator/health/readiness | jq

# container health at a glance
docker compose ps

Application-level monitoring — monitor-service

monitor-service is not just infrastructure plumbing — it is a first-class product feature. It owns alerts, SLA tracking, pipeline run history, cost tracking, data freshness, and schema-change monitoring.

Its Flyway schema spans migrations V9–V25, covering:

  • Alert persistence and notification routing
  • Cluster metrics history
  • SLA burn-rate tracking
  • Data quarantine
  • Schema-change monitoring (the monitor_* tables, including monitor_alerts and monitor_pipeline_runs)

monitor-service also pulls Keycloak user metrics when MONITOR_KEYCLOAK_ENABLED=true.

How to observe pipeline health

A pipeline run flows through the platform and surfaces in monitoring at several points:

  1. pipeline-engine executes the run and publishes run events via ExecutionEventPublisher / PipelineRunLogPublisher.
  2. On task failure, SelfHealingService classifies the failure and applies recovery strategies — these recovery events are observable.
  3. On completion, monitor-service ingests the run metrics and evaluates cost, SLA, and freshness, raising alerts where thresholds are crossed.
  4. Run history is persisted in monitor_pipeline_runs; alerts in monitor_alerts.

To check pipeline run state directly in the database:

docker compose exec postgres psql -U postgres -d dataflow_metadata \
  -c "SELECT status, COUNT(*) FROM monitor_pipeline_runs GROUP BY status;"
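
The rows returned by that query can be turned into a single failure-rate figure. The sketch below assumes the rows have already been fetched as (status, count) pairs. FAILED is the status named in the checklist on this page; the other status name in the sample is an assumption, and failure_rate is a hypothetical helper.

```python
def failure_rate(status_counts, failed_status="FAILED"):
    """Return the share of runs in failed_status, given (status, count) rows."""
    total = sum(count for _, count in status_counts)
    failed = sum(
        count for status, count in status_counts if status == failed_status
    )
    return failed / total if total else 0.0


# Invented rows in the shape produced by the GROUP BY query.
rows = [("SUCCEEDED", 45), ("FAILED", 5)]
print(failure_rate(rows))
```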

SSE live streams

The platform pushes real-time observability data to the UI using Server-Sent Events rather than polling.

Stream         Source                                   Carries
Alerts         monitor-service /api/v1/monitor/sse/**   Live alert firing/clearing
Metrics        monitor-service SSE                      Live metric updates
Notifications  monitor-service notification inbox       User-facing notification inbox events
Copilot chat   copilot                                  Token-streamed AI responses

In addition, pipeline run logs stream over WebSocket on /api/v1/runs/{runId}/stream. The gateway rewrites the scheme from http to ws when routing to pipeline-engine, and the frontend LogStream component renders live task-state and log updates.

SSE requires correct nginx config

SSE works through the VPS nginx only because the /api/ location sets proxy_buffering off and uses the conditional WebSocket-upgrade map. If alerts or live metrics stop updating after an nginx change, re-check both settings: a hardcoded Connection "upgrade" header breaks SSE and WebSocket streams. See the deployment guide's nginx hotfix section.

The frontend's SSEManager (EventSource) drives alerts, metrics, and notifications; the request interceptor in frontend/src/api/client.ts ensures no stream opens before the Keycloak token is ready.
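
For debugging a stream outside the browser, it helps to know what the wire format looks like. The sketch below parses the core of the text/event-stream format: fields accumulate until a blank line, multiple data lines are joined, and lines starting with a colon are comments. It is a simplified illustration, not the frontend's SSEManager; the event name and payload are invented, and the id and retry fields are ignored.

```python
def parse_sse(lines):
    """Yield (event_type, data) tuples from an iterable of SSE text lines.

    An event is dispatched when a blank line arrives. A trailing event
    that is never terminated by a blank line is discarded.
    """
    event_type, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:
                yield event_type, "\n".join(data)
            event_type, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # one leading space after the colon is dropped
        if field == "event":
            event_type = value
        elif field == "data":
            data.append(value)


# Invented stream content for illustration.
stream = [
    ": keep-alive\n",
    "event: alert\n",
    'data: {"state": "firing"}\n',
    "\n",
    "data: line one\n",
    "data: line two\n",
    "\n",
]
for event in parse_sse(stream):
    print(event)
```

Events only surface once the terminating blank line arrives, so a proxy that holds data back delays events even though the connection stays open.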


Logging

Each service produces structured logs. The gateway's RequestLoggingFilter emits structured access logs at the edge, and the X-Request-Id correlation header threads through downstream services so a request can be reconstructed across hops.

On the VPS, container logs are accessible through Docker Compose:

# tail logs for one service
docker compose logs -f --tail=200 api-gateway

# logs since a timestamp
docker compose logs --since 2026-05-20T04:00:00 monitor-service

When diagnosing an issue, capture the X-Request-Id from the failing response and grep for it across the relevant service logs to follow the full request path.
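
The grep-and-follow step can be sketched as a small function that merges matching lines from several services into one timeline. The log line layout is an assumption: each line is taken to start with a timestamp in one uniform ISO-8601 format and precision, so that plain string sorting gives chronological order. correlate is a hypothetical helper and the sample lines are invented.

```python
def correlate(logs_by_service, request_id):
    """Collect every log line mentioning request_id, ordered by timestamp.

    logs_by_service maps a service name to a list of log lines. Returns a
    list of (timestamp, service, line) tuples sorted by timestamp.
    """
    hits = []
    for service, lines in logs_by_service.items():
        for line in lines:
            if request_id in line:
                timestamp = line.split(" ", 1)[0]
                hits.append((timestamp, service, line))
    return sorted(hits)


# Invented log lines; the real field layout may differ.
logs = {
    "api-gateway": [
        "2026-05-20T04:00:01.120Z GET /api/v1/runs requestId=abc-123",
        "2026-05-20T04:00:02.000Z GET /actuator/health requestId=zzz-999",
    ],
    "pipeline-engine": [
        "2026-05-20T04:00:01.480Z run lookup requestId=abc-123",
    ],
}
for timestamp, service, line in correlate(logs, "abc-123"):
    print(service, line)
```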


Alerting

Alerting exists at two levels.

Application alerting (monitor-service)

monitor-service evaluates SLA burn-rate, cost, freshness, and schema-change conditions and persists alerts to monitor_alerts, with configurable notification routing. These alerts are surfaced live to the UI over the SSE alert stream and through the notification inbox.
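
For readers new to burn rate, the sketch below uses the conventional definition: the observed error rate divided by the error budget the objective allows. Whether monitor-service computes it exactly this way is not confirmed by the source; burn_rate, its parameters, and the numbers are illustrative assumptions.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Return how fast the error budget is being consumed.

    slo_target is the objective as a fraction, for example 0.99.
    A result of 1.0 means the budget is being spent at exactly the
    sustainable pace; higher values exhaust it early.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_events == 0:
        return 0.0
    error_budget = 1 - slo_target
    return (bad_events / total_events) / error_budget


# Invented numbers: 5 failed runs out of 100 against a 99% objective
# gives a burn rate of roughly 5.
print(round(burn_rate(5, 100, 0.99), 2))
```

A threshold on this value, evaluated over a time window, is a common way to decide when a burn-rate alert should fire.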

Infrastructure alerting (GKE path)

The DR runbook and the production GitHub Actions workflow reference PagerDuty (P1/P2 routing keys) and Slack (#dataflow-incidents, SLACK_WEBHOOK_DEPLOYMENTS). Cloud Monitoring alert policies are described in docs/deployment-scenarios.html.

Infrastructure alerting is GKE-only

PagerDuty, Slack deployment notifications, and Cloud Monitoring alert policies belong to the documented GKE path. The single-VPS production deployment relies on monitor-service's in-product alerting plus manual health checks — there is no external paging configured for the VPS.


A quick observability checklist

When you need to know "is the platform healthy right now?", run through this list:

  1. docker compose ps — are all 8 services and the infra tier healthy?
  2. curl -sf http://127.0.0.1:8085/actuator/health/readiness — is the gateway ready?
  3. Open Grafana — are request latency p99 and 5xx ratio within normal bounds?
  4. Check monitor_pipeline_runs for a spike in FAILED status.
  5. Check monitor_alerts (and the UI alert stream) for unacknowledged alerts.
  6. docker compose logs --since 1h on any service showing errors — correlate by X-Request-Id.