Cluster

This content is for the 1.0 version. Switch to the latest version for up-to-date documentation.

Topology (target)

                     Control plane — streamhub-core (master)
                     ┌───────────────────────────────────┐
       Publishers ──▶│ node registry (join by token + IP) │
                      │ router (room→node affinity        │
                      │         + viewer/transcode         │
                      │         balancing)                 │
                     └──────────────┬──────────────────────┘
                                    │
              ┌─────────────────────┼─────────────────────┐
              ▼                     ▼                      ▼
     Origin / master node    Edge / slave nodes        CDN
     livekit + core          livekit edge (fan-out,     LL-HLS,
     ─▶ shared redis          relay tracks)              mass fan-out
     ─▶ ingress/egress        ─▶ egress HLS/LL-HLS       ▲
        workers                  producer ───────────────┘
                                                          ▲
                                          Interactive     │  Mass viewers
                                          viewers ────────┘  (100k+)
                                          (moderate scale)

Master/edge is a modeling convenience, not a LiveKit concept. LiveKit itself runs as peer nodes coordinated through a shared redis, with no strict master/slave roles. StreamHub models an origin (where publishers land), edges (fan-out / relay), and a control plane (the streamhub-core master) that routes.

What LiveKit already gives you vs. what StreamHub builds

Concern	LiveKit (free)	StreamHub builds
WebRTC session affinity	Pins each room to a single node via shared redis. This is correct by construction, since a WebRTC session lives on one server.	Nothing extra; the router just needs to know which node owns a room.
Cluster membership	Peer nodes + shared redis + same API key/secret + node reachable by IP.	A node registry (`nodes` table) plus a join flow: an edge presents a cluster token and publishes its reachable endpoint/IP.
Transcode scaling	Ingress/egress are workers coordinated via redis.	A pool of ingress/egress workers spread across nodes, scaled horizontally; metrics show where work landed.
Viewer scaling	SFU fan-out per subscription.	Routing interactive viewers to an edge; HLS/CDN for the mass tail (see Distribution).

Control plane — proven against real servers

On 2026-07-02 the cluster module was exercised end-to-end between two real production-class servers on the same OVH subnet (node01, 4c/8GB, docker-compose; stream01-prod, 8c/8GB, systemd, serving live cameras), without touching prod’s live media:

Step	Result
`POST /cluster/join` with `X-Cluster-Token` from stream01 → node01	`200`, returns `nodeId`, the cluster-wide LiveKit keys, and `publicWsUrl`
`POST /cluster/heartbeat` with real `stats` (cpu/ram/cores)	`{ok:true}`, stats persisted
Auto-registration	Registry shows 2 nodes
`GET /cluster/nodes` (global `sk_` bearer)	Both nodes `active`, `stale:false`, with stats
`PATCH /cluster/nodes/:id {status:'draining'}`	stream01 → `draining` (stops receiving new rooms), verified, restored to `active`
Staleness detection	`stale=true` automatically once `last_seen_at` is older than 90s (node considered down)
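
The draining and staleness rows can be read as a small predicate over a registry entry. The sketch below is illustrative only: the 90-second threshold and the status names come from this page, while the function names and the choice to combine status and staleness into a single "accepts new rooms" check are assumptions, not StreamHub's actual code.

```python
from datetime import datetime, timedelta, timezone

# Threshold stated in the table: a node is stale once last_seen_at is older than 90s.
STALE_AFTER = timedelta(seconds=90)


def is_stale(last_seen_at: datetime, now: datetime) -> bool:
    """Return True when the last heartbeat is strictly older than the threshold."""
    return now - last_seen_at > STALE_AFTER


def accepts_new_rooms(status: str, last_seen_at: datetime, now: datetime) -> bool:
    """Assumed gating: only an active node with a fresh heartbeat takes new rooms.

    'draining' keeps existing rooms but takes no new ones; a stale node is
    treated as down regardless of its recorded status.
    """
    return status == "active" and not is_stale(last_seen_at, now)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(accepts_new_rooms("active", now - timedelta(seconds=30), now))
    print(accepts_new_rooms("draining", now - timedelta(seconds=30), now))
    print(accepts_new_rooms("active", now - timedelta(seconds=120), now))
```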

Conclusion: the cluster’s control plane — join by token, the node registry, heartbeats with metrics, dead-node detection, and drain/activate/remove administration — works against two real nodes. That is the full extent of what’s live today; the media plane (a room actually served cross-node) was intentionally not exercised against production (see below).
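
For orientation, the join and heartbeat calls from the table could be built on the edge side roughly as follows. This is a minimal sketch under stated assumptions: the paths, the `X-Cluster-Token` header, and the `nodeId` and `stats` names come from this page, but the request body field `ip`, the shape of `stats`, and the use of the cluster token on the heartbeat are hypothetical. The requests are only constructed here, not sent.

```python
import json
import urllib.request


def build_join_request(master_url: str, cluster_token: str, reachable_ip: str) -> urllib.request.Request:
    """Build POST /cluster/join. The 'ip' body field is a hypothetical name."""
    body = json.dumps({"ip": reachable_ip}).encode("utf-8")
    return urllib.request.Request(
        url=f"{master_url.rstrip('/')}/cluster/join",
        data=body,
        method="POST",
        headers={"X-Cluster-Token": cluster_token, "Content-Type": "application/json"},
    )


def build_heartbeat_request(master_url: str, cluster_token: str, node_id: str, stats: dict) -> urllib.request.Request:
    """Build POST /cluster/heartbeat.

    Assumption: the heartbeat identifies the node by the nodeId returned from
    the join and authenticates with the same cluster token.
    """
    body = json.dumps({"nodeId": node_id, "stats": stats}).encode("utf-8")
    return urllib.request.Request(
        url=f"{master_url.rstrip('/')}/cluster/heartbeat",
        data=body,
        method="POST",
        headers={"X-Cluster-Token": cluster_token, "Content-Type": "application/json"},
    )


if __name__ == "__main__":
    join = build_join_request("https://origin.example", "clt_example", "203.0.113.10")
    print(join.get_method(), join.full_url)
    # Sending would be: urllib.request.urlopen(join, timeout=5)
```

A real edge would send the join once, store the returned `nodeId`, and then post heartbeats on a timer well inside the 90-second staleness window.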

How balancing works

Room placement is LiveKit’s, not StreamHub’s. There is no StreamHub-built load balancer for WebRTC. Each room is served by exactly one LiveKit node, chosen by LiveKit itself via the shared redis — correct by construction, since a WebRTC session has to live on a single server. A client that signals against any node in the cluster is routed to whichever node owns that room; the media itself (7882/udp, RTMP 1935, WHIP 8080) goes directly to that node’s public IP (use_external_ip).
StreamHub’s placement lever is node status in the registry: active receives new rooms, draining stops receiving new rooms but keeps existing ones alive, disabled is excluded entirely. This is the mechanism the 2-node test demonstrated. A capacity/region-aware scheduler does not exist yet — today, fine-grained placement is 100% LiveKit’s internal logic, gated only by the coarse active/draining/disabled switch.
Viewer ceiling per node is the NIC. Roughly 300–400 WebRTC viewers at 2.5 Mbps saturate a 1 Gbps link. Scaling past that is not “add more WebRTC nodes” — see Distribution for the HLS/CDN/P2P path.
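
The viewer ceiling is plain division of link capacity by per-viewer bitrate. The sketch below reproduces the 300 to 400 range: 400 is the theoretical maximum on a 1 Gbps link at 2.5 Mbps per viewer, and the lower end corresponds to reserving headroom. The 75% usable fraction is an assumption chosen for illustration, not a measured figure from this page.

```python
def viewer_ceiling(link_mbps: float, per_viewer_mbps: float, usable_fraction: float = 1.0) -> int:
    """Estimate how many viewers fit on one NIC.

    usable_fraction models headroom for protocol overhead and other traffic;
    its value is an assumption to be tuned per deployment.
    """
    if per_viewer_mbps <= 0:
        raise ValueError("per_viewer_mbps must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return int((link_mbps * usable_fraction) // per_viewer_mbps)


if __name__ == "__main__":
    print(viewer_ceiling(1000, 2.5))        # theoretical maximum
    print(viewer_ceiling(1000, 2.5, 0.75))  # with 25% headroom (assumed)
```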

What a real media mesh still needs

Standing up cross-node media (a room served on node01 while stream01 also participates) requires three things. The current production deployment does not meet the two redis requirements (the first and third); the join flow already covers the second:

A shared, reachable redis between nodes. Production currently binds redis to 127.0.0.1 with no password — not reachable from another host.
Identical LiveKit keys across all nodes. The join flow already distributes these.
Every livekit-server pointed at the same redis. Production LiveKit currently points at address: localhost:6379.
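
The three requirements above lend themselves to a preflight check over each node's LiveKit settings. The following is a hedged sketch, not part of StreamHub: `NodeConfig` and `mesh_problems` are hypothetical names, and the check only inspects the redis address each node is configured with. It cannot confirm that redis is actually reachable or password-protected.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeConfig:
    """Hypothetical view of one node's LiveKit settings."""
    name: str
    redis_address: str  # "host:port", as in LiveKit's redis address setting
    api_key: str
    api_secret: str


def _is_loopback(host: str) -> bool:
    """True for 'localhost' or a loopback IP; other hostnames are assumed routable."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host.strip("[]")).is_loopback
    except ValueError:
        return False


def mesh_problems(nodes: list[NodeConfig]) -> list[str]:
    """Return human-readable reasons why these nodes cannot form a media mesh."""
    problems = []
    for node in nodes:
        host = node.redis_address.rsplit(":", 1)[0]
        if _is_loopback(host):
            problems.append(f"{node.name}: redis address {node.redis_address} is loopback-only")
    if len({n.redis_address for n in nodes}) > 1:
        problems.append("nodes point at different redis addresses")
    if len({(n.api_key, n.api_secret) for n in nodes}) > 1:
        problems.append("LiveKit API key/secret differ across nodes")
    return problems


if __name__ == "__main__":
    current = [
        NodeConfig("node01", "localhost:6379", "key", "secret"),
        NodeConfig("stream01", "localhost:6379", "key", "secret"),
    ]
    for line in mesh_problems(current):
        print(line)
```

Run against the production settings described above (every LiveKit pointed at `localhost:6379`), the check reports each node as loopback-only, which matches why the 2-node test stopped at the control plane.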

Meeting these against a live production node means restarting its redis and LiveKit — a few seconds of downtime for every camera currently connected — which is why the 2-node test stopped at the control plane.

Order of work to make “add an edge = run the join” true without live multi-node risk:

Per-app SQLite — done. App state travels with the app; the global DB stays a small shared control-plane store. See Data model.
Node registry + join — done and proven above.
LiveKit multi-node config — shared redis reachable across nodes, shared API key/secret, correct external IP per node. Not done in production.
Router — room→node affinity (LiveKit already does this) plus real viewer/transcode balancing and the WebRTC-edge-vs-HLS/CDN routing decision. Not built.

Runbook shape (once ready)

Bringing up a real media mesh is meant to require provisioning plus a join, not a data or schema migration:

# Origin (a node that can tolerate a restart)
install.sh --cluster-redis-bind <origin-IP>     # binds redis + password, opens firewall to edge only

# Edge (fresh node)
install.sh --join --master-token <clt_...> --master-ip <origin-IP> --master-url https://<origin>
# brings up livekit/ingress/egress pointed at the origin's redis, using the origin's keys

Recommended validation path: two fresh nodes of the same shape, or a maintenance window on an existing node with explicit acceptance of the camera cutover and the cross-node recording gap.

Observability across nodes

Each node exposes streamhub-core’s /metrics and LiveKit’s native Prometheus metrics; a central Prometheus scrapes all nodes, Grafana aggregates. Cluster-relevant signals: per-node CPU and bandwidth, egress queue depth, relay track counts, S3 upload failures, and per-tenant usage vs. quota.