Observability

streamhub-core ships Prometheus instrumentation out of the box, and a ready-to-deploy Prometheus + Grafana + node_exporter stack lives in deploy/observability/.

streamhub-core `/metrics`

Endpoint: GET /metrics at the root path (deliberately not under /api/v1).
Format: Prometheus text exposition.
Auth: public by default (it carries no secrets). Set METRICS_TOKEN to require Authorization: Bearer <token> (or ?token=<token> in the query string). Set METRICS_DEFAULT_METRICS=off to disable the default Node/process collectors.

```
curl -s http://127.0.0.1:3020/metrics | grep streamhub_
```

DB-derived gauges are recomputed from SQLite on every scrape, so they always match the source of truth.
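
For scripted checks, the same scrape can be done from Python. The following is a minimal sketch using only the standard library; the helper names `build_scrape_request` and `streamhub_samples` are illustrative and not part of streamhub-core, and the parser handles only plain sample lines of the text exposition format.

```python
import urllib.request


def build_scrape_request(base_url, token=None):
    """Build a GET request for /metrics; add the bearer header only when a token is given."""
    req = urllib.request.Request(base_url.rstrip("/") + "/metrics", method="GET")
    if token:
        # Only needed when METRICS_TOKEN is set on the core.
        req.add_header("Authorization", f"Bearer {token}")
    return req


def streamhub_samples(exposition_text):
    """Return {series: value} for every streamhub_ sample in Prometheus text format."""
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        close = line.rfind("}")
        if close != -1:
            # "<name>{labels} <value> [timestamp]"
            series, rest = line[: close + 1], line[close + 1 :].split()
        else:
            # "<name> <value> [timestamp]"
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if series.startswith("streamhub_") and rest:
            samples[series] = float(rest[0])
    return samples


# Usage (performs a real HTTP call, so it is left commented out):
# req = build_scrape_request("http://127.0.0.1:3020", token=None)
# text = urllib.request.urlopen(req, timeout=5).read().decode()
# print(streamhub_samples(text))
```

Because the DB-derived gauges are recomputed on each scrape, two consecutive calls can legitimately return different values.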

What’s exposed (all prefixed `streamhub_`)

| Area | Key metrics |
| --- | --- |
| HTTP | `streamhub_http_requests_total{method,route,status}`, `streamhub_http_request_duration_seconds` (histogram), `streamhub_http_requests_in_flight`. `route` is the matched pattern (bounded cardinality; unmatched → `unmatched`). |
| Streams | `streamhub_active_streams{app}`, `streamhub_stream_viewers{app,room}`, `streamhub_stream_events_total{app,event}`. |
| Recording / VODs | `streamhub_recordings_started_total{app}`, `streamhub_vods_generated_total{app}`, `streamhub_recording_failures_total{app,reason}`, `streamhub_upload_queue_depth{app}`, `streamhub_vods{app,status}`. |
| S3 / egress upload | `streamhub_s3_uploads_total{provider,result}`, `streamhub_s3_upload_bytes_total{provider}`, `streamhub_s3_errors_total{op}`. |
| Callbacks | `streamhub_callbacks_total{app,event,result}` (`delivered`\|`failed`\|`dropped`). |
| Tenancy / quotas | `streamhub_apps{tenant}`, `streamhub_tenant_quota{tenant,metric}`, `streamhub_tenant_usage{tenant,metric}`. |
| Transcoding / GPU | `streamhub_media_transcode_total{kind,accel,type}`, `streamhub_gpu_available{type}`. |
| Errors | `streamhub_errors_total{source,code}` (today only `source=http`, 5xx). |
| Process | default `process_` / `nodejs_` (CPU, RSS/heap, event-loop lag, GC, handles) unless disabled; this covers only the core’s own Node process, not the host. |
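
The `route` label in the HTTP row stays bounded because it records the matched route pattern rather than the raw request path. The sketch below illustrates the idea only: it is not the core's implementation, and `ROUTE_PATTERNS` is a hypothetical route table (only `/metrics` and `/api/v1/logs` appear in this document).

```python
import re

# Hypothetical route table for illustration; "/api/v1/apps/:id" is an assumption.
ROUTE_PATTERNS = ["/metrics", "/api/v1/logs", "/api/v1/apps/:id"]


def _compile(pattern):
    """Turn '/api/v1/apps/:id' into a regex where each ':param' matches one path segment."""
    parts = [
        "[^/]+" if segment.startswith(":") else re.escape(segment)
        for segment in pattern.split("/")
    ]
    return re.compile("^" + "/".join(parts) + "$")


_COMPILED = [(pattern, _compile(pattern)) for pattern in ROUTE_PATTERNS]


def route_label(path):
    """Return the matched pattern, or 'unmatched' so arbitrary paths share one label value."""
    for pattern, regex in _COMPILED:
        if regex.match(path):
            return pattern
    return "unmatched"


print(route_label("/api/v1/apps/42"))  # /api/v1/apps/:id
print(route_label("/wp-login.php"))    # unmatched
```

With this scheme the number of distinct `route` values is at most the number of registered patterns plus one, regardless of how many different URLs clients request.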

LiveKit native metrics

LiveKit exports its own Prometheus metrics (rooms, participants, tracks, packet loss, egress/ingress, CPU) — do not proxy them through core. Enable in livekit.yaml (or the matching *_CONFIG_BODY in Compose):

```
port: 7880
prometheus_port: 6789   # → GET http://<livekit>:6789/metrics
```

Ingress and egress run as separate services and expose their own metrics the same way — add a prometheus_port and a scrape job for each (6790 / 6791 by convention). Keep every one of these ports bound to 127.0.0.1 — never open them in the firewall.

Deploy the Prometheus + Grafana stack

deploy/observability/ is a separate Docker Compose stack from the media stack (docker-compose.yml at the repo root) — independent lifecycle, no shared services. It targets a single 8 GB node shared with the media stack, so everything binds to 127.0.0.1 and stays under a hard RAM budget.

| Component | `mem_limit` | Real-world usage |
| --- | --- | --- |
| Prometheus | 512m (reservation 256m) | 150–300 MB |
| Grafana | 256m (reservation 128m) | 80–150 MB |
| node_exporter | 64m | ~20 MB |
| Total | ~832 MB hard cap | ~250–470 MB |
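
The totals in the last row are plain sums of the per-component figures. A quick way to re-check them, for example after changing a `mem_limit`, is sketched below; the dictionaries simply restate the table, and `parse_mem` is an illustrative helper that understands only the `m` and `g` suffixes.

```python
def parse_mem(limit):
    """Convert a Compose-style size such as '512m' or '1g' to megabytes."""
    units = {"m": 1, "g": 1024}
    return int(limit[:-1]) * units[limit[-1].lower()]


# Values copied from the table above.
MEM_LIMITS = {"prometheus": "512m", "grafana": "256m", "node_exporter": "64m"}
TYPICAL_MB = {"prometheus": (150, 300), "grafana": (80, 150), "node_exporter": (20, 20)}

hard_cap = sum(parse_mem(value) for value in MEM_LIMITS.values())
typical_low = sum(low for low, _ in TYPICAL_MB.values())
typical_high = sum(high for _, high in TYPICAL_MB.values())

print(hard_cap, typical_low, typical_high)  # 832 250 470
print(f"{hard_cap / (8 * 1024):.1%} of an 8 GB node")  # 10.2% of an 8 GB node
```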

Configure secrets — both are gitignored and have no defaults.

```
cd deploy/observability

# Grafana admin password (required, no default)
cp .env.example .env
$EDITOR .env    # set GRAFANA_ADMIN_PASSWORD

# streamhub-core scrape token, only needed if METRICS_TOKEN is set on the core
cp secrets/metrics_token.example secrets/metrics_token
$EDITOR secrets/metrics_token   # paste the real METRICS_TOKEN (mtk_…)
```

Bring the stack up (from the repo root).

```
docker compose -f deploy/observability/docker-compose.observability.yml \
  --env-file deploy/observability/.env up -d
```

Verify targets are up.
```
curl -s http://127.0.0.1:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health}'

curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:3001/login
```
The livekit / livekit-ingress / livekit-egress jobs show down until those services have prometheus_port enabled — that’s expected and doesn’t break anything (see the LiveKitDown / LiveKitEgressDown / LiveKitIngressDown alerts).
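
To separate the expected LiveKit failures from real ones, the targets response can be filtered in a script. This is a sketch under the assumption that the payload is the parsed JSON from Prometheus's `/api/v1/targets` endpoint; the function name `unexpected_down` is illustrative.

```python
# Jobs that may legitimately be down until prometheus_port is enabled on those services.
EXPECTED_DOWN = {"livekit", "livekit-ingress", "livekit-egress"}


def unexpected_down(targets_payload):
    """Return sorted job names whose health is not 'up', ignoring the expected LiveKit jobs."""
    active = targets_payload["data"]["activeTargets"]
    down = {target["labels"]["job"] for target in active if target.get("health") != "up"}
    return sorted(down - EXPECTED_DOWN)


# Trimmed example payload in the shape returned by GET /api/v1/targets.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "streamhub-core"}, "health": "up"},
            {"labels": {"job": "livekit"}, "health": "down"},
            {"labels": {"job": "node"}, "health": "down"},
        ]
    },
}
print(unexpected_down(sample))  # ['node']
```

Once `prometheus_port` is enabled on the LiveKit services, `EXPECTED_DOWN` should be emptied so that those jobs are checked as well.
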
Reach Grafana — it binds to 127.0.0.1:3001 only; by design there’s no port published to the world. Either tunnel over SSH:
```
ssh -N -L 3001:127.0.0.1:3001 user@stream01
# open http://127.0.0.1:3001 locally
```
or front it with an nginx location{} block requiring Basic Auth / IP allowlisting plus its own certbot TLS, on a dedicated subdomain (e.g. obs.streamhub.example.com) — the Grafana login is a second layer on top of that, not a substitute for it.

Tear it down with docker compose -f deploy/observability/docker-compose.observability.yml down (data persists in the prom_data / grafana_data named volumes; add -v to also delete them).

What’s in `deploy/observability/`

| File | Role |
| --- | --- |
| `docker-compose.observability.yml` | prometheus + grafana + node-exporter, `network_mode: host`, all bound to `127.0.0.1` |
| `prometheus.yml` | scrape configs: `streamhub-core` (bearer token via `credentials_file`), `livekit`, `livekit-ingress`, `livekit-egress`, `node`, `prometheus` |
| `alerts.yml` | host RAM/swap/disk, core/LiveKit/egress/ingress down, recording failures, VOD backlog, failed callbacks, log-error spikes, TLS cert expiry |
| `.env.example`, `secrets/metrics_token.example` | secret templates: copy and fill in, never commit the real values |
| `scripts/cert-expiry-textfile.sh` | optional certbot renewal hook that feeds the `TlsCertExpiringSoon` alert via node_exporter’s textfile collector |
| `grafana/provisioning/` | Prometheus datasource + dashboard provider, auto-loaded at Grafana boot |
| `grafana/dashboards/server-global.json` | host + service health |
| `grafana/dashboards/per-app.json` | `$app`-scoped viewers/streams/recordings/VODs/callbacks/quota |
| `grafana/dashboards/media-latency.json` | LiveKit native metrics: RTT, forward latency, packet loss, NACK/PLI, rooms |

Prometheus retention on the shared node defaults to --storage.tsdb.retention.time=15d (PROMETHEUS_RETENTION_TIME in .env) with a hard disk cap of --storage.tsdb.retention.size=4GB (PROMETHEUS_RETENTION_SIZE). 30-day retention is the target once observability moves to its own dedicated node, not while it’s sharing RAM with egress.

Not included yet (later phases)

Loki + Alloy for searchable logs in Grafana (30-day retention, chunks in S3) — needs another 350–600 MB of RAM, better suited to a dedicated observability node than to the shared 8 GB box.
Alertmanager — this directory ships the alert rules only (alerts.yml); routing to Slack/email/PagerDuty isn’t wired up. In the meantime check http://127.0.0.1:9090/alerts or Grafana’s own Alerting UI.
Enabling prometheus_port on the media stack’s docker-compose.yml (livekit/ingress/egress) is a change to that compose file, outside this directory’s scope.
Phase-2/3 core metrics (streamhub_bytes_ingest_total, streamhub_bytes_egress_total, streamhub_ingest_latency_seconds, streamhub_s3_bucket_bytes, streamhub_recording_duration_seconds) — the “ingest by protocol” panel in per-app.json is already wired up for when they exist, but is empty today.
A 30-day purge job for the server_logs table (see Logs, below) — a core code change, not a deploy-stack change.
Cluster-wide observability (central Prometheus, Alertmanager routing, aggregated dashboards) is a later phase.

Logs

Structured pino logs go to three places:

stdout — docker compose logs -f core / journalctl -u streamhub-core.
A rotating file under <DATA_DIR>/logs/ (rotates by size, LOG_MAX_BYTES, default 10 MB, and by day; keeps LOG_MAX_FILES archives, default 10). Archives are capped by count, not by age, so this is not the same as 30-day retention.
The server_logs table (global DB): ts, level, source, app_id, message, meta_json, queryable via GET /api/v1/logs (filter by app/level/date range, paginated). There’s no automatic time-based pruning yet — only a manual per-app purge — so the table grows without bound until a retention job is added.
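
To see why the rotating file does not amount to 30-day retention, it helps to work out what the defaults imply. The sketch below rests on assumptions that are not confirmed by this document: archives are counted separately from the active file, each file holds at most LOG_MAX_BYTES, a rotation happens at least once a day, and log volume is steady.

```python
import math


def max_log_disk_mb(max_bytes_mb=10, max_files=10):
    """Upper bound on disk use: max_files archives plus the active file, each at most max_bytes_mb."""
    return (max_files + 1) * max_bytes_mb


def files_per_day(mb_per_day, max_bytes_mb=10):
    """At least one file per day (daily rotation), more when the size cap is hit."""
    return max(1, math.ceil(mb_per_day / max_bytes_mb))


def days_covered(mb_per_day, max_files=10, max_bytes_mb=10):
    """Approximate number of days the kept archives span at a steady log volume."""
    return max_files / files_per_day(mb_per_day, max_bytes_mb)


print(max_log_disk_mb())  # 110
print(days_covered(5))    # 10.0
print(days_covered(50))   # 2.0
```

Under these assumptions the default settings cover at most about 10 days of history, and considerably less on a busy node.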

source values seen today: recording, livekit, livekit-webhook, transcoding, broadcast, hls, callbacks, system, logs, plugins.

Useful Grafana / PromQL queries

Active streams: sum(streamhub_active_streams)
Request rate: sum by (route,status) (rate(streamhub_http_requests_total[5m]))
p95 latency: histogram_quantile(0.95, sum by (le,route) (rate(streamhub_http_request_duration_seconds_bucket[5m])))
Upload success rate: sum(rate(streamhub_s3_uploads_total{result="ok"}[5m])) / sum(rate(streamhub_s3_uploads_total[5m]))
VOD upload backlog: sum(streamhub_upload_queue_depth)
Callback failures: sum by (event) (rate(streamhub_callbacks_total{result="failed"}[5m]))
Tenant usage vs quota: streamhub_tenant_usage / streamhub_tenant_quota
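
Any of the expressions above can also be evaluated outside Grafana through Prometheus's HTTP API. The following is a minimal sketch that assumes Prometheus is reachable at 127.0.0.1:9090, as in the deploy steps; the helper names are illustrative.

```python
from urllib.parse import urlencode

PROM = "http://127.0.0.1:9090"  # local Prometheus address used in the deploy steps


def instant_query_url(expr, base=PROM):
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def vector_values(payload):
    """Map each series' sorted label pairs to its float value from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        tuple(sorted(result["metric"].items())): float(result["value"][1])
        for result in payload["data"]["result"]
    }


print(instant_query_url("sum(streamhub_active_streams)"))
# http://127.0.0.1:9090/api/v1/query?query=sum%28streamhub_active_streams%29
```

Fetching that URL and passing the decoded JSON to `vector_values` yields one entry per returned series; an aggregate such as `sum(...)` without a `by` clause returns a single entry with an empty label tuple.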

Recommended alerts (already in deploy/observability/alerts.yml): core/LiveKit target down; sustained streamhub_upload_queue_depth; rate(streamhub_recording_failures_total[15m]) > 0; elevated rate(streamhub_callbacks_total{result="failed"}[15m]); host RAM/swap/disk pressure; TLS cert expiring soon.