Cluster
This content is for the 1.0 version.
Topology (target)
```text
               Control plane: streamhub-core (master)
               ┌────────────────────────────────────┐
 Publishers ──▶│ node registry (join by token + IP) │
               │ router (room→node affinity         │
               │         + viewer/transcode         │
               │         balancing)                 │
               └─────────────────┬──────────────────┘
                                 │
          ┌──────────────────────┼──────────────────────┐
          ▼                      ▼                      ▼
 Origin / master node    Edge / slave nodes            CDN
 livekit + core          livekit edge (fan-out,      LL-HLS,
 ─▶ shared redis         relay tracks)               mass fan-out
 ─▶ ingress/egress       ─▶ egress HLS/LL-HLS           ▲
    workers                 producer ───────────────────┘
                                 ▲
 Interactive                     │                Mass viewers
 viewers ────────────────────────┘                (100k+)
 (moderate scale)
```

Master/edge is a modeling convenience, not a LiveKit concept. LiveKit itself runs as peer nodes coordinated through a shared redis, not as strict master/slave. StreamHub models an origin (where publishers land), plus edges (fan-out / relay), plus a control plane (streamhub-core master) that routes.
What LiveKit already gives you vs. what StreamHub builds
| Concern | LiveKit (free) | StreamHub builds |
|---|---|---|
| WebRTC session affinity | Pins each room to a single node via shared redis. This is correct by construction, since a WebRTC session lives on one server. | Nothing extra; the router just needs to know which node owns a room. |
| Cluster membership | Peer nodes + shared redis + same API key/secret + node reachable by IP. | A node registry (nodes table) plus a join flow: an edge presents a cluster token and publishes its reachable endpoint/IP. |
| Transcode scaling | Ingress/egress are workers coordinated via redis. | A pool of ingress/egress workers spread across nodes, scaled horizontally; metrics show where work landed. |
| Viewer scaling | SFU fan-out per subscription. | Routing interactive viewers to an edge; HLS/CDN for the mass tail (see Distribution). |
Control plane — proven against real servers
On 2026-07-02 the cluster module was exercised end-to-end between two real production-class
servers on the same OVH subnet (node01, 4c/8GB, docker-compose; stream01-prod, 8c/8GB,
systemd, serving live cameras), without touching prod’s live media:
| Step | Result |
|---|---|
| `POST /cluster/join` with `X-Cluster-Token` from stream01 → node01 | 200, returns `nodeId`, the cluster-wide LiveKit keys, and `publicWsUrl` |
| `POST /cluster/heartbeat` with real stats (cpu/ram/cores) | `{ok:true}`, stats persisted |
| Auto-registration | Registry shows 2 nodes |
| `GET /cluster/nodes` (global `sk_` bearer) | Both nodes `active`, `stale:false`, with stats |
| `PATCH /cluster/nodes/:id` `{status:'draining'}` | stream01 → `draining` (stops receiving new rooms), verified, restored to `active` |
| Staleness detection | `stale=true` automatically once `last_seen_at` is older than 90s (node considered down) |
Conclusion: the cluster’s control plane — join by token, the node registry, heartbeats with metrics, dead-node detection, and drain/activate/remove administration — works against two real nodes. That is the full extent of what’s live today; the media plane (a room actually served cross-node) was intentionally not exercised against production (see below).
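The staleness rule from the table can be sketched in a few lines. This is a minimal illustration, not StreamHub's implementation: the function name `is_stale` and the constant `STALE_AFTER` are hypothetical, and only the 90-second threshold and the `last_seen_at` field come from the test above.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the test table: a node is stale once its
# last heartbeat is older than 90 seconds.
STALE_AFTER = timedelta(seconds=90)


def is_stale(last_seen_at: datetime, now: datetime) -> bool:
    """Return True when the last heartbeat is older than the threshold."""
    return now - last_seen_at > STALE_AFTER


# Fixed reference time so the example is deterministic.
now = datetime(2026, 7, 2, 12, 0, 0, tzinfo=timezone.utc)
print(is_stale(now - timedelta(seconds=30), now))  # False
print(is_stale(now - timedelta(seconds=91), now))  # True
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the check easy to test.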
How balancing works
- Room placement is LiveKit’s, not StreamHub’s. There is no StreamHub-built load balancer for WebRTC. Each room is served by exactly one LiveKit node, chosen by LiveKit itself via the shared redis. This is correct by construction, since a WebRTC session has to live on a single server. A client that signals against any node in the cluster is routed to whichever node owns that room; the media itself (`7882/udp`, RTMP `1935`, WHIP `8080`) goes directly to that node’s public IP (`use_external_ip`).
- StreamHub’s placement lever is node `status` in the registry: `active` receives new rooms, `draining` stops receiving new rooms but keeps existing ones alive, `disabled` is excluded entirely. This is the mechanism the 2-node test demonstrated. A capacity/region-aware scheduler does not exist yet; today, fine-grained placement is 100% LiveKit’s internal logic, gated only by the coarse `active`/`draining`/`disabled` switch.
- Viewer ceiling per node is the NIC. Roughly 300–400 WebRTC viewers at 2.5 Mbps saturate a 1 Gbps link. Scaling past that is not “add more WebRTC nodes”; see Distribution for the HLS/CDN/P2P path.
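The status switch and the NIC ceiling from the list above can both be expressed as small functions. This is a sketch under stated assumptions: the names `accepts_new_rooms` and `viewer_ceiling` are hypothetical, and the `usable_fraction` parameter (with its 0.75 example value) is an assumption added to show where the lower end of the 300–400 range could come from. The document itself does not say how that range was derived.

```python
# Status semantics as described for the node registry.
RECEIVES_NEW_ROOMS = {"active"}
KEEPS_EXISTING_ROOMS = {"active", "draining"}  # "disabled" is excluded entirely


def accepts_new_rooms(status: str) -> bool:
    """Only active nodes are eligible for new rooms."""
    return status in RECEIVES_NEW_ROOMS


def viewer_ceiling(link_mbps: float, per_viewer_mbps: float,
                   usable_fraction: float = 1.0) -> int:
    """Viewers that fit on one link; usable_fraction is an assumed headroom knob."""
    return int((link_mbps * usable_fraction) // per_viewer_mbps)


print(accepts_new_rooms("draining"))    # False
print(viewer_ceiling(1000, 2.5))        # 400
print(viewer_ceiling(1000, 2.5, 0.75))  # 300
```

At 2.5 Mbps per viewer, a 1 Gbps link tops out at 400 viewers even with zero overhead, which is why the ceiling is a property of the link and not of CPU.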
What a real media mesh still needs
Standing up cross-node media (a room served on node01 while stream01 also participates) requires three things, none of which are true in the current production deployment:

- A shared, reachable redis between nodes. Production currently binds redis to `127.0.0.1` with no password, so it is not reachable from another host.
- Identical LiveKit keys across all nodes. The join flow already distributes these.
- Every `livekit-server` pointed at the same redis. Production LiveKit currently points at `address: localhost:6379`.
Meeting these against a live production node means restarting its redis and LiveKit — a few seconds of downtime for every camera currently connected — which is why the 2-node test stopped at the control plane.
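A pre-flight check for the three prerequisites might look like the sketch below. Everything here is hypothetical illustration: `NodeConfig`, its fields, and `mesh_blockers` are made-up names, and the check compares plain `host:port` strings; it does not read real LiveKit configuration.

```python
from dataclasses import dataclass


@dataclass
class NodeConfig:
    name: str
    redis_address: str  # the host:port that livekit-server points at
    api_key: str        # LiveKit API key configured on the node


LOOPBACK_HOSTS = {"127.0.0.1", "localhost"}


def mesh_blockers(nodes: list[NodeConfig]) -> list[str]:
    """Return the prerequisites that are not met; an empty list means ready."""
    blockers = []
    hosts = {n.redis_address.rsplit(":", 1)[0] for n in nodes}
    if hosts & LOOPBACK_HOSTS:
        blockers.append("redis is bound to loopback on at least one node")
    if len({n.redis_address for n in nodes}) > 1:
        blockers.append("nodes point at different redis addresses")
    if len({n.api_key for n in nodes}) > 1:
        blockers.append("LiveKit keys differ across nodes")
    return blockers


# Mirrors the production state described above: redis on localhost.
current = [
    NodeConfig("node01", "localhost:6379", "shared-key"),
    NodeConfig("stream01", "localhost:6379", "shared-key"),
]
print(mesh_blockers(current))
```

Two nodes that both say `localhost:6379` look identical as strings but are separate redis instances, which is why the loopback check runs independently of the address comparison.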
Order of work to make “add an edge = run the join” true without live multi-node risk:
- Per-app SQLite — done. App state travels with the app; the global DB stays a small shared control-plane store. See Data model.
- Node registry + join — done and proven above.
- LiveKit multi-node config — shared redis reachable across nodes, shared API key/secret, correct external IP per node. Not done in production.
- Router — room→node affinity (LiveKit already does this) plus real viewer/transcode balancing and the WebRTC-edge-vs-HLS/CDN routing decision. Not built.
Runbook shape (once ready)
Bringing up a real media mesh is meant to require provisioning plus a join, not a data or schema migration:
```sh
# Origin (a node that can tolerate a restart)
install.sh --cluster-redis-bind <origin-IP>   # binds redis + password, opens firewall to edge only

# Edge (fresh node)
install.sh --join --master-token <clt_...> --master-ip <origin-IP> --master-url https://<origin>
# brings up livekit/ingress/egress pointed at the origin's redis, using the origin's keys
```

Recommended validation path: two fresh nodes of the same shape, or a maintenance window on an existing node with explicit acceptance of the camera cutover and the cross-node recording gap above.
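For reference, the join call exercised in the two-node test (`POST /cluster/join` with an `X-Cluster-Token` header) could be built as follows. This is a sketch only: whether `install.sh --join` issues exactly this request is an assumption, the JSON body field `ip` is a hypothetical name, and the URL, token, and IP are placeholder values. The request is constructed but never sent.

```python
import json
import urllib.request


def build_join_request(master_url: str, cluster_token: str,
                       reachable_ip: str) -> urllib.request.Request:
    """Build (but do not send) a join request for an edge node."""
    # The body field name "ip" is hypothetical; the real schema is not documented here.
    body = json.dumps({"ip": reachable_ip}).encode("utf-8")
    return urllib.request.Request(
        url=master_url.rstrip("/") + "/cluster/join",
        data=body,
        headers={
            "X-Cluster-Token": cluster_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values for illustration only.
req = build_join_request("https://origin.example", "clt_example", "203.0.113.10")
print(req.get_method(), req.full_url)  # POST https://origin.example/cluster/join
```

Per the test table, a successful response carries `nodeId`, the cluster-wide LiveKit keys, and `publicWsUrl`, which the edge would then use to configure its own LiveKit services.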
Observability across nodes
Each node exposes streamhub-core’s `/metrics` and LiveKit’s native Prometheus metrics; a central
Prometheus scrapes all nodes, Grafana aggregates. Cluster-relevant signals: per-node CPU and
bandwidth, egress queue depth, relay track counts, S3 upload failures, and per-tenant usage vs.
quota.