Observability
streamhub-core ships first-class Prometheus instrumentation out of the box, and a
ready-to-deploy Prometheus + Grafana + node_exporter stack lives in
deploy/observability/.
streamhub-core /metrics
- Endpoint: GET /metrics at the root path (deliberately not under /api/v1).
- Format: Prometheus text exposition.
- Auth: public by default (it carries no secrets). Set METRICS_TOKEN to require Authorization: Bearer <token> (or ?token=). Toggle the Node/process collectors with METRICS_DEFAULT_METRICS=off.

    curl -s http://127.0.0.1:3020/metrics | grep streamhub_

DB-derived gauges are recomputed from SQLite on every scrape, so they always match the source of truth.
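To make the auth rule concrete, here is a minimal Python sketch that builds a scrape request, sending the bearer token only when one is configured, and keeps just the streamhub_ series from a response body. The helper names (build_metrics_request, streamhub_lines) are invented for this example; only the /metrics path, the Authorization: Bearer header and port 3020 come from the text above.

```python
import urllib.request
from typing import Optional


def build_metrics_request(base_url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for <base_url>/metrics, adding the bearer token if one is given."""
    req = urllib.request.Request(base_url.rstrip("/") + "/metrics", method="GET")
    if token:
        # Only needed when METRICS_TOKEN is set on the core.
        req.add_header("Authorization", f"Bearer {token}")
    return req


def streamhub_lines(exposition: str) -> list[str]:
    """Keep only sample lines for streamhub_* series (drops # HELP / # TYPE comments)."""
    return [line for line in exposition.splitlines() if line.startswith("streamhub_")]
```

Against a running core, the request could be passed to urllib.request.urlopen and the decoded body handed to streamhub_lines, which is roughly what the curl and grep pipeline does.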
What’s exposed (all prefixed streamhub_)
| Area | Key metrics |
|---|---|
| HTTP | streamhub_http_requests_total{method,route,status}, streamhub_http_request_duration_seconds (histogram), streamhub_http_requests_in_flight. route is the matched route pattern, which keeps cardinality bounded; requests that match no route are labeled unmatched. |
| Streams | streamhub_active_streams{app}, streamhub_stream_viewers{app,room}, streamhub_stream_events_total{app,event}. |
| Recording / VODs | streamhub_recordings_started_total{app}, streamhub_vods_generated_total{app}, streamhub_recording_failures_total{app,reason}, streamhub_upload_queue_depth{app}, streamhub_vods{app,status}. |
| S3 / egress upload | streamhub_s3_uploads_total{provider,result}, streamhub_s3_upload_bytes_total{provider}, streamhub_s3_errors_total{op}. |
| Callbacks | streamhub_callbacks_total{app,event,result} (result is one of delivered, failed, dropped). |
| Tenancy / quotas | streamhub_apps{tenant}, streamhub_tenant_quota{tenant,metric}, streamhub_tenant_usage{tenant,metric}. |
| Transcoding / GPU | streamhub_media_transcode_total{kind,accel,type}, streamhub_gpu_available{type}. |
| Errors | streamhub_errors_total{source,code} (today only source=http, 5xx). |
| Process | default process_* / nodejs_* (CPU, RSS/heap, event-loop lag, GC, handles) unless disabled — this only covers the core’s own Node process, not the host. |
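As a small illustration of how these labeled series can be consumed outside Prometheus, the Python sketch below parses text-exposition sample lines and sums one metric by a chosen label. It is a simplified parser written for this example (it does not unescape label values or read timestamps), and the sample values are made up.

```python
import re
from collections import defaultdict

# Matches `name{label="value",...} value`; the label block is optional.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(exposition: str):
    """Yield (name, labels, value) for each sample line, skipping comments and blanks."""
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            yield name, dict(_LABEL.findall(raw_labels or "")), float(value)


def sum_by_label(exposition: str, metric: str, label: str) -> dict:
    """Sum the values of one metric, grouped by the value of one label."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(exposition):
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


# Made-up scrape output, for illustration only.
SAMPLE_SCRAPE = """\
# TYPE streamhub_stream_viewers gauge
streamhub_stream_viewers{app="demo",room="lobby"} 12
streamhub_stream_viewers{app="demo",room="stage"} 30
streamhub_stream_viewers{app="other",room="main"} 5
"""
```

Calling sum_by_label(SAMPLE_SCRAPE, "streamhub_stream_viewers", "app") groups viewers per app, much like sum by (app) (streamhub_stream_viewers) in PromQL.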
LiveKit native metrics
LiveKit exports its own Prometheus metrics (rooms, participants, tracks, packet loss,
egress/ingress, CPU). Do not proxy them through core. Enable in livekit.yaml (or the
matching *_CONFIG_BODY in Compose):

    port: 7880
    prometheus_port: 6789   # → GET http://<livekit>:6789/metrics

Ingress and egress run as separate services and expose their own metrics the same way: add a
prometheus_port and a scrape job for each (6790 / 6791 by convention). Keep every one of
these ports bound to 127.0.0.1 and never open them in the firewall.
Deploy the Prometheus + Grafana stack
deploy/observability/ is a separate Docker Compose stack from the media stack
(docker-compose.yml at the repo root) — independent lifecycle, no shared services. It
targets a single 8 GB node shared with the media stack, so everything binds to 127.0.0.1
and stays under a hard RAM budget.
| Component | mem_limit | Real-world usage |
|---|---|---|
| Prometheus | 512m (reservation 256m) | 150–300 MB |
| Grafana | 256m (reservation 128m) | 80–150 MB |
| node_exporter | 64m | ~20 MB |
| Total | ~832 MB hard cap | ~250–470 MB |
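The hard cap in the Total row is simply the sum of the three mem_limit values. The Python sketch below redoes that arithmetic; the helper names are invented for this example, and it only understands single-letter k/m/g suffixes.

```python
# mem_limit values copied from the table above.
MEM_LIMITS = {"prometheus": "512m", "grafana": "256m", "node_exporter": "64m"}

# Multipliers that convert each supported suffix to MiB.
_UNITS_TO_MIB = {"k": 1 / 1024, "m": 1, "g": 1024}


def to_mib(limit: str) -> float:
    """Convert a size such as '512m' or '1g' to MiB."""
    number, unit = limit[:-1], limit[-1].lower()
    return float(number) * _UNITS_TO_MIB[unit]


def total_mib(limits: dict) -> float:
    """Sum all limits, in MiB."""
    return sum(to_mib(value) for value in limits.values())


def share_of_node(limits: dict, node_gib: float = 8) -> float:
    """Fraction of the node's RAM that the hard caps add up to."""
    return total_mib(limits) / (node_gib * 1024)
```

With the values from the table, total_mib(MEM_LIMITS) gives 832 MiB, which is roughly a tenth of the 8 GB node.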
1. Configure secrets. Both are gitignored and have no defaults.

       cd deploy/observability
       # Grafana admin password (required, no default)
       cp .env.example .env
       $EDITOR .env   # set GRAFANA_ADMIN_PASSWORD
       # streamhub-core scrape token, only needed if METRICS_TOKEN is set on the core
       cp secrets/metrics_token.example secrets/metrics_token
       $EDITOR secrets/metrics_token   # paste the real METRICS_TOKEN (mtk_…)

2. Bring the stack up (from the repo root).

       docker compose -f deploy/observability/docker-compose.observability.yml \
         --env-file deploy/observability/.env up -d

3. Verify targets are up.

       curl -s http://127.0.0.1:9090/api/v1/targets \
         | jq '.data.activeTargets[] | {job: .labels.job, health}'
       curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:3001/login

   The livekit / livekit-ingress / livekit-egress jobs show down until those services have prometheus_port enabled; that's expected and doesn't break anything (see the LiveKitDown / LiveKitEgressDown / LiveKitIngressDown alerts).

4. Reach Grafana. It binds to 127.0.0.1:3001 only; by design there's no port published to the world. Either tunnel over SSH:

       ssh -N -L 3001:127.0.0.1:3001 user@stream01
       # open http://127.0.0.1:3001 locally

   or front it with an nginx location{} block requiring Basic Auth / IP allowlisting plus its own certbot TLS, on a dedicated subdomain (e.g. obs.streamhub.example.com). The Grafana login is a second layer on top of that, not a substitute for it.
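For scripted checks, the jq filter from the verification step can be mirrored in a few lines of Python. This is only a sketch: the function names and the sample payload are made up, and it assumes the usual shape of a Prometheus /api/v1/targets response (data.activeTargets[] entries with labels.job and health).

```python
import json


def summarize_targets(payload: dict) -> list:
    """Mirror the jq filter: one {job, health} dict per active target."""
    return [
        {"job": target["labels"]["job"], "health": target["health"]}
        for target in payload["data"]["activeTargets"]
    ]


def unhealthy_jobs(payload: dict, expected_down=()) -> list:
    """Jobs whose health is not 'up', ignoring jobs that are expected to be down."""
    return sorted({
        target["job"]
        for target in summarize_targets(payload)
        if target["health"] != "up" and target["job"] not in expected_down
    })


# Made-up response body, for illustration only.
SAMPLE_TARGETS = json.loads("""
{"status": "success", "data": {"activeTargets": [
  {"labels": {"job": "streamhub-core", "instance": "127.0.0.1:3020"}, "health": "up"},
  {"labels": {"job": "node", "instance": "127.0.0.1:9100"}, "health": "up"},
  {"labels": {"job": "livekit", "instance": "127.0.0.1:6789"}, "health": "down"}
]}}
""")
```

Passing the three LiveKit job names as expected_down keeps the check quiet until prometheus_port is enabled on those services.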
Tear it down with docker compose -f deploy/observability/docker-compose.observability.yml down (data persists in the prom_data / grafana_data named volumes; add -v to also
delete them).
What’s in deploy/observability/
| File | Role |
|---|---|
| docker-compose.observability.yml | prometheus + grafana + node-exporter, network_mode: host, all bound to 127.0.0.1 |
| prometheus.yml | scrape configs: streamhub-core (bearer token via credentials_file), livekit, livekit-ingress, livekit-egress, node, prometheus |
| alerts.yml | host RAM/swap/disk, core/LiveKit/egress/ingress down, recording failures, VOD backlog, failed callbacks, log-error spikes, TLS cert expiry |
| .env.example, secrets/metrics_token.example | secret templates: copy and fill in, never commit the real values |
| scripts/cert-expiry-textfile.sh | optional certbot renewal hook that feeds the TlsCertExpiringSoon alert via node_exporter’s textfile collector |
| grafana/provisioning/ | Prometheus datasource + dashboard provider, auto-loaded at Grafana boot |
| grafana/dashboards/server-global.json | host + service health |
| grafana/dashboards/per-app.json | $app-scoped viewers/streams/recordings/VODs/callbacks/quota |
| grafana/dashboards/media-latency.json | LiveKit native metrics: RTT, forward latency, packet loss, NACK/PLI, rooms |
Prometheus retention on the shared node defaults to --storage.tsdb.retention.time=15d
(PROMETHEUS_RETENTION_TIME in .env) with a hard disk cap of --storage.tsdb.retention.size=4GB
(PROMETHEUS_RETENTION_SIZE). 30-day retention is the target once observability moves to its
own dedicated node, not while it’s sharing RAM with egress.
Not included yet (later phases)
- Loki + Alloy for searchable logs in Grafana (30-day retention, chunks in S3): needs another 350–600 MB of RAM, better suited to a dedicated observability node than to the shared 8 GB box.
- Alertmanager: this directory ships the alert rules only (alerts.yml); routing to Slack/email/PagerDuty isn’t wired up. In the meantime check http://127.0.0.1:9090/alerts or Grafana’s own Alerting UI.
- Enabling prometheus_port on the media stack’s docker-compose.yml (livekit/ingress/egress) is a change to that compose file, outside this directory’s scope.
- Phase-2/3 core metrics (streamhub_bytes_ingest_total, streamhub_bytes_egress_total, streamhub_ingest_latency_seconds, streamhub_s3_bucket_bytes, streamhub_recording_duration_seconds): the “ingest by protocol” panel in per-app.json is already wired up for when they exist, but is empty today.
- A 30-day purge job for the server_logs table (see Logs, below): a core code change, not a deploy-stack change.
- Cluster-wide observability (central Prometheus, Alertmanager routing, aggregated dashboards) is a later phase.
Logs
Structured pino logs go to three places:
- stdout: docker compose logs -f core / journalctl -u streamhub-core.
- A rotating file under <DATA_DIR>/logs/ (LOG_MAX_BYTES, default 10 MB; rotates by size and by day, keeps LOG_MAX_FILES archives, default 10). This is not the same as 30-day retention.
- The server_logs table (global DB): ts, level, source, app_id, message, meta_json, queryable via GET /api/v1/logs (filter by app/level/date range, paginated). There’s no automatic time-based pruning yet, only a manual per-app purge, so the table grows without bound until a retention job is added.
source values seen today: recording, livekit, livekit-webhook, transcoding,
broadcast, hls, callbacks, system, logs, plugins.
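Since server_logs has no time-based pruning yet, the following Python sketch shows what a 30-day purge could look like against a SQLite table with the columns listed above. It is purely hypothetical and not the core's implementation: the function name is invented, and it assumes ts is stored as epoch milliseconds, which the real schema may not do.

```python
import sqlite3
import time

RETENTION_DAYS = 30
MS_PER_DAY = 24 * 60 * 60 * 1000


def purge_old_logs(conn: sqlite3.Connection, now_ms=None, retention_days: int = RETENTION_DAYS) -> int:
    """Delete rows older than the retention window and return how many were removed.

    Assumption: `ts` holds epoch milliseconds.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    cutoff = now_ms - retention_days * MS_PER_DAY
    with conn:  # commits on success, rolls back if the DELETE raises
        cursor = conn.execute("DELETE FROM server_logs WHERE ts < ?", (cutoff,))
    return cursor.rowcount
```

A job like this would run on a schedule (for example once a day); an index on ts would keep the delete cheap as the table grows.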
Useful Grafana / PromQL queries
- Active streams: sum(streamhub_active_streams)
- Request rate: sum by (route,status) (rate(streamhub_http_requests_total[5m]))
- p95 latency: histogram_quantile(0.95, sum by (le,route) (rate(streamhub_http_request_duration_seconds_bucket[5m])))
- Upload success rate: sum(rate(streamhub_s3_uploads_total{result="ok"}[5m])) / sum(rate(streamhub_s3_uploads_total[5m]))
- VOD upload backlog: sum(streamhub_upload_queue_depth)
- Callback failures: sum by (event) (rate(streamhub_callbacks_total{result="failed"}[5m]))
- Tenant usage vs quota: streamhub_tenant_usage / streamhub_tenant_quota
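To show what the p95 query computes, here is a simplified Python sketch of the estimate histogram_quantile makes from cumulative buckets: find the bucket the target rank falls into, then interpolate linearly inside it. It is an illustration rather than Prometheus's exact code. It assumes positive bucket bounds and 0 < q <= 1, and the bucket counts in the usage note are made up.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The bucket with the largest bound must be the +Inf bucket, as in a
    Prometheus histogram. Assumes 0 < q <= 1 and positive finite bounds.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank; the +Inf bucket always does.
    for index, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile lies in the open-ended bucket: report the highest finite bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower, below = (0.0, 0) if index == 0 else buckets[index - 1]
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)
```

For example, with buckets [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)] the 95th observation out of 100 falls in the 0.5 to 1.0 second bucket, so the estimate lands between those two bounds (about 0.81 s). The result is only as precise as the bucket layout allows.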
Recommended alerts (already in deploy/observability/alerts.yml): core/LiveKit target
down; sustained streamhub_upload_queue_depth; rate(streamhub_recording_failures_total[15m]) > 0;
elevated rate(streamhub_callbacks_total{result="failed"}[15m]); host RAM/swap/disk pressure;
TLS cert expiring soon.