Cluster

This content is for the 1.0 version. Switch to the latest version for up-to-date documentation.

              Control plane: streamhub-core (master)
              ┌────────────────────────────────────┐
Publishers ──▶│ node registry (join by token + IP) │
              │ router (room→node affinity         │
              │         + viewer/transcode         │
              │           balancing)               │
              └─────────────────┬──────────────────┘
        ┌───────────────────────┼───────────────────────┐
        ▼                       ▼                       ▼
Origin / master node    Edge / slave nodes             CDN
livekit + core          livekit edge (fan-out,    LL-HLS,
─▶ shared redis         relay tracks)             mass fan-out
─▶ ingress/egress       ─▶ egress HLS/LL-HLS            ▲
   workers                 producer ────────────────────┘
Interactive     │                                 Mass viewers
viewers ────────┘                                 (100k+)
(moderate scale)

Master/edge is a modeling convenience, not a LiveKit concept. LiveKit itself is a set of peer nodes coordinating through a shared redis, not a strict master/slave hierarchy. StreamHub models an origin (where publishers land), edges (fan-out / relay), and a control plane (the streamhub-core master) that routes.

What LiveKit already gives you vs. what StreamHub builds

| Concern | LiveKit (free) | StreamHub builds |
|---|---|---|
| WebRTC session affinity | Pins each room to a single node via shared redis. Correct by construction, since a WebRTC session lives on one server. | Nothing extra; the router just needs to know which node owns a room. |
| Cluster membership | Peer nodes + shared redis + same API key/secret + node reachable by IP. | A node registry (nodes table) plus a join flow: an edge presents a cluster token and publishes its reachable endpoint/IP. |
| Transcode scaling | Ingress/egress are workers coordinated via redis. | A pool of ingress/egress workers spread across nodes, scaled horizontally; metrics show where work landed. |
| Viewer scaling | SFU fan-out per subscription. | Routing interactive viewers to an edge; HLS/CDN for the mass tail (see Distribution). |

Control plane — proven against real servers

On 2026-07-02 the cluster module was exercised end-to-end between two real production-class servers on the same OVH subnet (node01, 4c/8GB, docker-compose; stream01-prod, 8c/8GB, systemd, serving live cameras), without touching prod’s live media:

| Step | Result |
|---|---|
| POST /cluster/join with X-Cluster-Token from stream01 → node01 | 200, returns nodeId, the cluster-wide LiveKit keys, and publicWsUrl |
| POST /cluster/heartbeat with real stats (cpu/ram/cores) | {ok:true}, stats persisted |
| Auto-registration | Registry shows 2 nodes |
| GET /cluster/nodes (global sk_ bearer) | Both nodes active, stale:false, with stats |
| PATCH /cluster/nodes/:id {status:'draining'} | stream01 → draining (stops receiving new rooms), verified, restored to active |
| Staleness detection | stale=true automatically once last_seen_at is older than 90s (node considered down) |
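
The staleness rule in the last row can be sketched as a single comparison. This is an illustration only, not streamhub-core's code: the function name `is_stale` and the constant `STALE_AFTER` are hypothetical, and the only value taken from the text is the 90-second threshold on `last_seen_at`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold from the table above: a node is considered down once
# last_seen_at is older than 90 seconds.
STALE_AFTER = timedelta(seconds=90)


def is_stale(last_seen_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True when the last heartbeat is older than STALE_AFTER.

    Hypothetical helper; the real registry may compute this in SQL or elsewhere.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_seen_at > STALE_AFTER


# Fixed reference time so the example is deterministic.
now = datetime(2026, 7, 2, 12, 0, 0, tzinfo=timezone.utc)
print(is_stale(now - timedelta(seconds=30), now))  # heartbeat 30s ago
print(is_stale(now - timedelta(seconds=91), now))  # heartbeat 91s ago
```

A heartbeat exactly 90 seconds old is still treated as fresh here because the text says "older than 90s"; whether the real check is strict is an assumption.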

Conclusion: the cluster’s control plane — join by token, the node registry, heartbeats with metrics, dead-node detection, and drain/activate/remove administration — works against two real nodes. That is the full extent of what’s live today; the media plane (a room actually served cross-node) was intentionally not exercised against production (see below).
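
For readers who want to reproduce the join and heartbeat steps, the following is a minimal sketch of how an edge might build those two requests. The paths `/cluster/join` and `/cluster/heartbeat`, the `X-Cluster-Token` header, and the cpu/ram/cores stats come from the test above. Everything else is an assumption: the JSON field names (`ip`, `nodeId`, `stats`), the helper names, and the host `origin.example` are hypothetical, and the source does not say how the heartbeat is authenticated, so no auth header is shown for it. The requests are only constructed, not sent.

```python
import json
import urllib.request


def build_join_request(master_url: str, cluster_token: str, endpoint_ip: str) -> urllib.request.Request:
    """Build the POST /cluster/join request an edge would send to the master."""
    body = json.dumps({"ip": endpoint_ip}).encode("utf-8")  # field name is assumed
    return urllib.request.Request(
        url=master_url.rstrip("/") + "/cluster/join",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Cluster-Token": cluster_token,  # header name from the test table
        },
    )


def build_heartbeat_request(master_url: str, node_id: str, cpu: float, ram: float, cores: int) -> urllib.request.Request:
    """Build the POST /cluster/heartbeat request carrying node stats."""
    payload = {"nodeId": node_id, "stats": {"cpu": cpu, "ram": ram, "cores": cores}}  # shape is assumed
    return urllib.request.Request(
        url=master_url.rstrip("/") + "/cluster/heartbeat",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


join = build_join_request("https://origin.example/", "clt_example", "192.0.2.10")
beat = build_heartbeat_request("https://origin.example", "node-1", 12.5, 40.0, 8)
print(join.get_method(), join.full_url)
print(beat.get_method(), beat.full_url)
```

Passing either object to `urllib.request.urlopen` would send it; per the table, the join response is expected to carry `nodeId`, the cluster-wide LiveKit keys, and `publicWsUrl`.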

  1. Room placement is LiveKit’s, not StreamHub’s. There is no StreamHub-built load balancer for WebRTC. Each room is served by exactly one LiveKit node, chosen by LiveKit itself via the shared redis — correct by construction, since a WebRTC session has to live on a single server. A client that signals against any node in the cluster is routed to whichever node owns that room; the media itself (7882/udp, RTMP 1935, WHIP 8080) goes directly to that node’s public IP (use_external_ip).
  2. StreamHub’s placement lever is node status in the registry: active receives new rooms, draining stops receiving new rooms but keeps existing ones alive, disabled is excluded entirely. This is the mechanism the 2-node test demonstrated. A capacity/region-aware scheduler does not exist yet — today, fine-grained placement is 100% LiveKit’s internal logic, gated only by the coarse active/draining/disabled switch.
  3. Viewer ceiling per node is the NIC. Roughly 300–400 WebRTC viewers at 2.5 Mbps saturate a 1 Gbps link. Scaling past that is not “add more WebRTC nodes” — see Distribution for the HLS/CDN/P2P path.
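
Points 2 and 3 above can be made concrete with a short sketch. The three status values and the 2.5 Mbps / 1 Gbps figures come from the text; the `Node` class, the function names, and the `usable_fraction` parameter (how much of the link is realistically available for media) are assumptions introduced for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    node_id: str
    status: str  # "active", "draining", or "disabled"


def accepts_new_rooms(node: Node) -> bool:
    """Only active nodes receive new rooms."""
    return node.status == "active"


def keeps_existing_rooms(node: Node) -> bool:
    """Draining nodes keep serving rooms they already own; disabled nodes are excluded."""
    return node.status in ("active", "draining")


def max_webrtc_viewers(link_mbps: float, per_viewer_mbps: float, usable_fraction: float = 1.0) -> int:
    """Upper bound on viewers before the NIC saturates (usable_fraction is an assumption)."""
    return math.floor(link_mbps * usable_fraction / per_viewer_mbps)


nodes: List[Node] = [Node("node01", "active"), Node("stream01", "draining"), Node("old", "disabled")]
print([n.node_id for n in nodes if accepts_new_rooms(n)])     # ['node01']
print([n.node_id for n in nodes if keeps_existing_rooms(n)])  # ['node01', 'stream01']
print(max_webrtc_viewers(1000, 2.5))        # 400 at full line rate
print(max_webrtc_viewers(1000, 2.5, 0.75))  # 300 if only 75% of the link is usable
```

The 300–400 range quoted above corresponds to somewhere between roughly 75% and 100% of a 1 Gbps link being usable for viewer traffic. The status filter is only a gate; which eligible node actually receives a room remains LiveKit's decision.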

Standing up cross-node media (a room served on node01 while stream01 also participates) requires three things, none of which are true in the current production deployment:

  1. A shared, reachable redis between nodes. Production currently binds redis to 127.0.0.1 with no password — not reachable from another host.
  2. Identical LiveKit keys across all nodes. The join flow already distributes these.
  3. Every livekit-server pointed at the same redis. Production LiveKit currently points at address: localhost:6379.

Meeting these against a live production node means restarting its redis and LiveKit — a few seconds of downtime for every camera currently connected — which is why the 2-node test stopped at the control plane.
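
The three preconditions can be expressed as a pre-flight check that is safe to run before touching a live node. This is a sketch under assumptions: the `LiveKitNode` fields, the function name, and the example addresses are hypothetical and do not reflect the actual LiveKit or streamhub-core config schema. Treating a missing redis password as a failure is also an assumption, based on the text pairing "no password" with "not reachable".

```python
from dataclasses import dataclass
from typing import List

LOOPBACK_HOSTS = ("127.0.0.1", "localhost", "::1")


@dataclass
class LiveKitNode:
    name: str
    api_key: str        # LiveKit API key configured on this node
    redis_address: str  # "host:port" that livekit-server points at


def unmet_preconditions(redis_bind: str, redis_has_password: bool, nodes: List[LiveKitNode]) -> List[str]:
    """Return a human-readable list of the preconditions that are not met."""
    problems = []
    # 1. Shared, reachable redis: a loopback bind cannot be reached from another host.
    if redis_bind in LOOPBACK_HOSTS or not redis_has_password:
        problems.append("redis is not reachable from (or not secured for) other nodes")
    # 2. Identical LiveKit keys across all nodes.
    if len({n.api_key for n in nodes}) > 1:
        problems.append("LiveKit keys differ between nodes")
    # 3. Every livekit-server pointed at the same redis. "localhost" on two
    #    hosts means two different redis instances, so loopback also fails.
    addresses = {n.redis_address for n in nodes}
    hosts = {a.rsplit(":", 1)[0] for a in addresses}
    if len(addresses) > 1 or hosts & set(LOOPBACK_HOSTS):
        problems.append("livekit-server instances do not share one redis")
    return problems


# Shape of the current production setup as described above.
today = [LiveKitNode("node01", "key", "localhost:6379"), LiveKitNode("stream01", "key", "localhost:6379")]
print(unmet_preconditions("127.0.0.1", False, today))

# After provisioning a shared redis (192.0.2.10 is a documentation-only address).
ready = [LiveKitNode("node01", "key", "192.0.2.10:6379"), LiveKitNode("stream01", "key", "192.0.2.10:6379")]
print(unmet_preconditions("192.0.2.10", True, ready))  # []
```

A check like this only reads configuration, so it can run against production without the restart that the actual change requires.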

Order of work to make “add an edge = run the join” true without live multi-node risk:

  1. Per-app SQLite — done. App state travels with the app; the global DB stays a small shared control-plane store. See Data model.
  2. Node registry + join — done and proven above.
  3. LiveKit multi-node config — shared redis reachable across nodes, shared API key/secret, correct external IP per node. Not done in production.
  4. Router — room→node affinity (LiveKit already does this) plus real viewer/transcode balancing and the WebRTC-edge-vs-HLS/CDN routing decision. Not built.

Bringing up a real media mesh is meant to require provisioning plus a join, not a data or schema migration:

# Origin (a node that can tolerate a restart)
install.sh --cluster-redis-bind <origin-IP> # binds redis + password, opens firewall to edge only
# Edge (fresh node)
install.sh --join --master-token <clt_...> --master-ip <origin-IP> --master-url https://<origin>
# brings up livekit/ingress/egress pointed at the origin's redis, using the origin's keys

Recommended validation path: two fresh nodes of the same shape, or a maintenance window on an existing node with explicit acceptance of the camera cutover and the cross-node recording gap above.

Each node exposes streamhub-core’s /metrics and LiveKit’s native Prometheus metrics; a central Prometheus scrapes all nodes, Grafana aggregates. Cluster-relevant signals: per-node CPU and bandwidth, egress queue depth, relay track counts, S3 upload failures, and per-tenant usage vs. quota.