Observability

This content is for the 1.0 version. Switch to the latest version for up-to-date documentation.

streamhub-core ships first-class Prometheus instrumentation out of the box, and a ready-to-deploy Prometheus + Grafana + node_exporter stack lives in deploy/observability/.

  • Endpoint: GET /metrics at the root path (deliberately not under /api/v1).
  • Format: Prometheus text exposition.
  • Auth: public by default (it carries no secrets). Set METRICS_TOKEN to require Authorization: Bearer <token> (or ?token=). Toggle the Node/process collectors with METRICS_DEFAULT_METRICS=off.
curl -s http://127.0.0.1:3020/metrics | grep streamhub_
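
When METRICS_TOKEN is set, the same scrape needs the bearer header. The following is a minimal Python sketch of that request, not part of streamhub-core itself; the helper names and the mtk_example token are placeholders invented for illustration, and the address is the one from the curl example.

```python
import urllib.request
from typing import List, Optional


def build_metrics_request(base_url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for <base_url>/metrics, adding a bearer token if one is given."""
    req = urllib.request.Request(base_url.rstrip("/") + "/metrics", method="GET")
    if token:
        # Only needed when METRICS_TOKEN is set on the core.
        req.add_header("Authorization", f"Bearer {token}")
    return req


def streamhub_lines(exposition_text: str) -> List[str]:
    """Keep sample lines for streamhub_* metrics (unlike grep, this drops # HELP / # TYPE lines)."""
    return [line for line in exposition_text.splitlines() if line.startswith("streamhub_")]


def fetch_streamhub_metrics(base_url: str, token: Optional[str] = None) -> List[str]:
    """Perform the scrape over HTTP and return only the streamhub_* sample lines."""
    with urllib.request.urlopen(build_metrics_request(base_url, token), timeout=5) as resp:
        return streamhub_lines(resp.read().decode("utf-8"))
```

Calling fetch_streamhub_metrics("http://127.0.0.1:3020", token="mtk_example") would mirror the curl command with authentication added; leave token unset when the endpoint is public.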

DB-derived gauges are recomputed from SQLite on every scrape, so they always match the source of truth.

What’s exposed (all prefixed streamhub_)

| Area | Key metrics |
| --- | --- |
| HTTP | streamhub_http_requests_total{method,route,status}, streamhub_http_request_duration_seconds (histogram), streamhub_http_requests_in_flight. route is the matched pattern (bounded cardinality; unmatched → unmatched). |
| Streams | streamhub_active_streams{app}, streamhub_stream_viewers{app,room}, streamhub_stream_events_total{app,event}. |
| Recording / VODs | streamhub_recordings_started_total{app}, streamhub_vods_generated_total{app}, streamhub_recording_failures_total{app,reason}, streamhub_upload_queue_depth{app}, streamhub_vods{app,status}. |
| S3 / egress upload | streamhub_s3_uploads_total{provider,result}, streamhub_s3_upload_bytes_total{provider}, streamhub_s3_errors_total{op}. |
| Callbacks | streamhub_callbacks_total{app,event,result} (delivered |
| Tenancy / quotas | streamhub_apps{tenant}, streamhub_tenant_quota{tenant,metric}, streamhub_tenant_usage{tenant,metric}. |
| Transcoding / GPU | streamhub_media_transcode_total{kind,accel,type}, streamhub_gpu_available{type}. |
| Errors | streamhub_errors_total{source,code} (today only source=http, 5xx). |
| Process | default process_* / nodejs_* (CPU, RSS/heap, event-loop lag, GC, handles) unless disabled. This only covers the core's own Node process, not the host. |
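
To make the label structure concrete, here is a small Python sketch that parses exposition text and sums one metric by one label. It is a simplified parser written for this example (it ignores timestamps and does not unescape label values), and the sample values, app names and room names are made up.

```python
import re
from collections import defaultdict
from typing import Dict, Iterator, Tuple

# metric name, optional {labels}, then the sample value
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> Iterator[Tuple[str, Dict[str, str], float]]:
    """Yield (name, labels, value) for every sample line, skipping comments and blanks."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        yield name, dict(_LABEL.findall(raw_labels or "")), float(value)


def sum_by(text: str, metric: str, label: str) -> Dict[str, float]:
    """Sum a metric's samples grouped by one label, like PromQL's sum by (label)."""
    totals: Dict[str, float] = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


# Made-up scrape output for illustration only.
SAMPLE = '''# TYPE streamhub_stream_viewers gauge
streamhub_stream_viewers{app="demo",room="lobby"} 12
streamhub_stream_viewers{app="demo",room="stage"} 30
streamhub_stream_viewers{app="other",room="main"} 5
streamhub_active_streams{app="demo"} 2
'''

print(sum_by(SAMPLE, "streamhub_stream_viewers", "app"))
```

For anything beyond a quick check, a real exposition-format parser or a PromQL query against Prometheus is the safer route.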

LiveKit exports its own Prometheus metrics (rooms, participants, tracks, packet loss, egress/ingress, CPU) — do not proxy them through core. Enable in livekit.yaml (or the matching *_CONFIG_BODY in Compose):

port: 7880
prometheus_port: 6789 # → GET http://<livekit>:6789/metrics

Ingress and egress run as separate services and expose their own metrics the same way — add a prometheus_port and a scrape job for each (6790 / 6791 by convention). Keep every one of these ports bound to 127.0.0.1 — never open them in the firewall.

deploy/observability/ is a separate Docker Compose stack from the media stack (docker-compose.yml at the repo root) — independent lifecycle, no shared services. It targets a single 8 GB node shared with the media stack, so everything binds to 127.0.0.1 and stays under a hard RAM budget.

| Component | mem_limit | Real-world usage |
| --- | --- | --- |
| Prometheus | 512m (reservation 256m) | 150–300 MB |
| Grafana | 256m (reservation 128m) | 80–150 MB |
| node_exporter | 64m | ~20 MB |
| Total | ~832 MB hard cap | ~250–470 MB |

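
The totals in the RAM budget are plain sums of the per-component figures. The arithmetic can be checked with a few lines of Python, using only the numbers stated above and the 8 GB node size.

```python
# mem_limit values from the budget above, in MB.
MEM_LIMIT_MB = {"prometheus": 512, "grafana": 256, "node_exporter": 64}
# Real-world usage ranges (low, high) in MB, also from the budget above.
TYPICAL_MB = {"prometheus": (150, 300), "grafana": (80, 150), "node_exporter": (20, 20)}

hard_cap = sum(MEM_LIMIT_MB.values())
typical_low = sum(low for low, _ in TYPICAL_MB.values())
typical_high = sum(high for _, high in TYPICAL_MB.values())

NODE_RAM_MB = 8 * 1024  # the shared 8 GB node
print(f"hard cap: {hard_cap} MB ({hard_cap / NODE_RAM_MB:.1%} of the node)")
print(f"typical:  {typical_low}-{typical_high} MB")
```
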
  1. Configure secrets — both are gitignored and have no defaults.

    cd deploy/observability
    # Grafana admin password (required, no default)
    cp .env.example .env
    $EDITOR .env # set GRAFANA_ADMIN_PASSWORD
    # streamhub-core scrape token, only needed if METRICS_TOKEN is set on the core
    cp secrets/metrics_token.example secrets/metrics_token
    $EDITOR secrets/metrics_token # paste the real METRICS_TOKEN (mtk_…)
  2. Bring the stack up (from the repo root).

    docker compose -f deploy/observability/docker-compose.observability.yml \
    --env-file deploy/observability/.env up -d
  3. Verify targets are up.

    curl -s http://127.0.0.1:9090/api/v1/targets \
    | jq '.data.activeTargets[] | {job: .labels.job, health}'
    curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:3001/login

    The livekit / livekit-ingress / livekit-egress jobs show down until those services have prometheus_port enabled — that’s expected and doesn’t break anything (see the LiveKitDown / LiveKitEgressDown / LiveKitIngressDown alerts).

  4. Reach Grafana — it binds to 127.0.0.1:3001 only; by design there’s no port published to the world. Either tunnel over SSH:

    ssh -N -L 3001:127.0.0.1:3001 user@stream01
    # open http://127.0.0.1:3001 locally

    or front it with an nginx location{} block requiring Basic Auth / IP allowlisting plus its own certbot TLS, on a dedicated subdomain (e.g. obs.streamhub.example.com) — the Grafana login is a second layer on top of that, not a substitute for it.

Tear it down with docker compose -f deploy/observability/docker-compose.observability.yml down (data persists in the prom_data / grafana_data named volumes; add -v to also delete them).
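
The target check from step 3 can also be scripted. The sketch below assumes the standard response shape of Prometheus's /api/v1/targets endpoint (data.activeTargets[] with labels.job and health); the function name and the sample payload are invented for illustration.

```python
from typing import List

# Jobs that are expected to be down until prometheus_port is enabled on those services.
EXPECTED_DOWN = frozenset({"livekit", "livekit-ingress", "livekit-egress"})


def unhealthy_jobs(targets_response: dict, ignore: frozenset = frozenset()) -> List[str]:
    """Return the sorted job names that have at least one target whose health is not "up"."""
    down = set()
    for target in targets_response.get("data", {}).get("activeTargets", []):
        job = target.get("labels", {}).get("job", "unknown")
        if target.get("health") != "up" and job not in ignore:
            down.add(job)
    return sorted(down)


# Made-up response for illustration; a real one comes from
# http://127.0.0.1:9090/api/v1/targets
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "streamhub-core"}, "health": "up"},
            {"labels": {"job": "node"}, "health": "up"},
            {"labels": {"job": "prometheus"}, "health": "up"},
            {"labels": {"job": "livekit"}, "health": "down"},
            {"labels": {"job": "livekit-egress"}, "health": "down"},
        ]
    },
}

print(unhealthy_jobs(SAMPLE_RESPONSE))                        # every job that is down
print(unhealthy_jobs(SAMPLE_RESPONSE, ignore=EXPECTED_DOWN))  # only unexpected failures
```

Piping the curl output into a script that calls unhealthy_jobs(json.load(sys.stdin), ignore=EXPECTED_DOWN) would give a result that is empty exactly when only the not-yet-enabled LiveKit jobs are down.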

| File | Role |
| --- | --- |
| docker-compose.observability.yml | prometheus + grafana + node-exporter, network_mode: host, all bound to 127.0.0.1 |
| prometheus.yml | scrape configs: streamhub-core (bearer token via credentials_file), livekit, livekit-ingress, livekit-egress, node, prometheus |
| alerts.yml | host RAM/swap/disk, core/LiveKit/egress/ingress down, recording failures, VOD backlog, failed callbacks, log-error spikes, TLS cert expiry |
| .env.example, secrets/metrics_token.example | secret templates: copy and fill in, never commit the real values |
| scripts/cert-expiry-textfile.sh | optional certbot renewal hook that feeds the TlsCertExpiringSoon alert via node_exporter's textfile collector |
| grafana/provisioning/ | Prometheus datasource + dashboard provider, auto-loaded at Grafana boot |
| grafana/dashboards/server-global.json | host + service health |
| grafana/dashboards/per-app.json | $app-scoped viewers/streams/recordings/VODs/callbacks/quota |
| grafana/dashboards/media-latency.json | LiveKit native metrics: RTT, forward latency, packet loss, NACK/PLI, rooms |

Prometheus retention on the shared node defaults to --storage.tsdb.retention.time=15d (PROMETHEUS_RETENTION_TIME in .env) with a hard disk cap of --storage.tsdb.retention.size=4GB (PROMETHEUS_RETENTION_SIZE). 30-day retention is the target once observability moves to its own dedicated node, not while it’s sharing RAM with egress.
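
Whether the time limit or the size cap is reached first depends on the ingestion rate. A rough estimate can be made with the sizing rule of thumb from the Prometheus storage documentation (disk ≈ retention seconds × samples per second × bytes per sample, at roughly 1 to 2 bytes per sample). In the sketch below the 2 bytes per sample and the 1000 samples/s rate are assumptions for illustration, not measurements from this stack.

```python
RETENTION_DAYS = 15            # --storage.tsdb.retention.time=15d
SIZE_CAP_BYTES = 4 * 1024**3   # --storage.tsdb.retention.size=4GB (assuming power-of-2 units)
BYTES_PER_SAMPLE = 2           # assumption: upper end of the 1-2 bytes/sample rule of thumb
SECONDS_PER_DAY = 86_400


def estimated_disk_bytes(samples_per_second: float, retention_days: int = RETENTION_DAYS) -> float:
    """Approximate TSDB size once the full retention window is filled."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * BYTES_PER_SAMPLE


def max_rate_within_cap(retention_days: int = RETENTION_DAYS) -> float:
    """Highest ingestion rate (samples/s) that still fits the size cap for the whole window."""
    return SIZE_CAP_BYTES / (retention_days * SECONDS_PER_DAY * BYTES_PER_SAMPLE)


example_rate = 1000.0  # hypothetical ingestion rate in samples/s
print(f"15d at {example_rate:.0f} samples/s: {estimated_disk_bytes(example_rate) / 1e9:.2f} GB")
print(f"rate at which 4GB caps a 15d window: {max_rate_within_cap():.0f} samples/s")
print(f"rate at which 4GB caps a 30d window: {max_rate_within_cap(30):.0f} samples/s")
```

The actual ingestion rate can be read from Prometheus itself, for example with rate(prometheus_tsdb_head_samples_appended_total[5m]).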

Not included here (deferred or out of scope for this directory):

  • Loki + Alloy for searchable logs in Grafana (30-day retention, chunks in S3): needs another 350–600 MB of RAM, so it is better suited to a dedicated observability node than to the shared 8 GB box.
  • Alertmanager — this directory ships the alert rules only (alerts.yml); routing to Slack/email/PagerDuty isn’t wired up. In the meantime check http://127.0.0.1:9090/alerts or Grafana’s own Alerting UI.
  • Enabling prometheus_port on the media stack’s docker-compose.yml (livekit/ingress/egress) is a change to that compose file, outside this directory’s scope.
  • Phase-2/3 core metrics (streamhub_bytes_ingest_total, streamhub_bytes_egress_total, streamhub_ingest_latency_seconds, streamhub_s3_bucket_bytes, streamhub_recording_duration_seconds) — the “ingest by protocol” panel in per-app.json is already wired up for when they exist, but is empty today.
  • A 30-day purge job for the server_logs table (see Logs, below) — a core code change, not a deploy-stack change.
  • Cluster-wide observability (central Prometheus, Alertmanager routing, aggregated dashboards) is a later phase.

Logs

Structured pino logs go to three places:

  1. stdout → docker compose logs -f core / journalctl -u streamhub-core.
  2. A rotating file under <DATA_DIR>/logs/ (LOG_MAX_BYTES, default 10 MB; rotates by size and by day, keeps LOG_MAX_FILES archives, default 10). Because the number of archives is capped, this does not guarantee 30 days of history.
  3. The server_logs table (global DB): ts, level, source, app_id, message, meta_json, queryable via GET /api/v1/logs (filter by app/level/date range, paginated). There’s no automatic time-based pruning yet — only a manual per-app purge — so the table grows without bound until a retention job is added.
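
For the stdout and file destinations, error lines can be picked out with a short filter. This Python sketch assumes the core emits pino's default JSON lines with pino's default numeric levels (30 = info, 40 = warn, 50 = error, 60 = fatal); the function name and the sample lines are invented for illustration.

```python
import json
from typing import Iterable, Iterator

PINO_ERROR = 50  # pino's default numeric level for "error"


def errors_only(lines: Iterable[str], min_level: int = PINO_ERROR) -> Iterator[dict]:
    """Yield parsed log records whose numeric level is at least min_level."""
    for line in lines:
        start = line.find("{")  # skip any prefix added by docker compose or journald
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a JSON log line
        level = record.get("level") if isinstance(record, dict) else None
        if isinstance(level, int) and level >= min_level:
            yield record


# Made-up log lines for illustration.
SAMPLE_LINES = [
    'core-1  | {"level":30,"time":1700000000000,"msg":"recording started"}',
    '{"level":50,"time":1700000001000,"msg":"upload failed"}',
    "plain text line without JSON",
]

for record in errors_only(SAMPLE_LINES):
    print(record["msg"])
```
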

source values seen today: recording, livekit, livekit-webhook, transcoding, broadcast, hls, callbacks, system, logs, plugins.

Useful PromQL queries:

  • Active streams: sum(streamhub_active_streams)
  • Request rate: sum by (route,status) (rate(streamhub_http_requests_total[5m]))
  • p95 latency: histogram_quantile(0.95, sum by (le,route) (rate(streamhub_http_request_duration_seconds_bucket[5m])))
  • Upload success rate: sum(rate(streamhub_s3_uploads_total{result="ok"}[5m])) / sum(rate(streamhub_s3_uploads_total[5m]))
  • VOD upload backlog: sum(streamhub_upload_queue_depth)
  • Callback failures: sum by (event) (rate(streamhub_callbacks_total{result="failed"}[5m]))
  • Tenant usage vs quota: streamhub_tenant_usage / streamhub_tenant_quota
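
Any of these expressions can be run outside Grafana through Prometheus's HTTP API. The sketch below uses the standard /api/v1/query instant-query endpoint and its JSON response shape; the helper names and the sample response are invented for illustration, and the address assumes the stack's default 127.0.0.1:9090 binding.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List, Tuple


def query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL with the PromQL expression URL-encoded."""
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def vector_values(response: dict) -> List[Tuple[Dict[str, str], float]]:
    """Extract (labels, value) pairs from an instant-vector response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    # Each value is [timestamp, "<number as string>"].
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def run_query(base_url: str, promql: str) -> List[Tuple[Dict[str, str], float]]:
    """Execute the query over HTTP (requires a reachable Prometheus)."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=5) as resp:
        return vector_values(json.load(resp))


# Made-up response for illustration.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {"resultType": "vector", "result": [{"metric": {}, "value": [1700000000, "3"]}]},
}

print(query_url("http://127.0.0.1:9090", "sum(streamhub_active_streams)"))
print(vector_values(SAMPLE_RESPONSE))
```
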

Recommended alerts (already in deploy/observability/alerts.yml): core/LiveKit target down; sustained streamhub_upload_queue_depth; rate(streamhub_recording_failures_total[15m]) > 0; elevated rate(streamhub_callbacks_total{result="failed"}[15m]); host RAM/swap/disk pressure; TLS cert expiring soon.