Capacity planning & latency tuning

Numbers on this page come from a real production node (8 vCPU / 8 GB RAM, no GPU; OVH, plain-server deploy), and each key figure is marked [measured] or [extrapolated]. Method: bounded measurements against production without saturating it (start one stream + one egress, measure, stop), then extrapolate the ceiling.

The bottleneck: egress RAM

On an 8 vCPU / 8 GB node the bottleneck is RAM consumed by room-composite egress (headless Chrome) — each recording costs ~1.27 GB [measured]. That means an 8 GB node supports roughly 4–5 concurrent room-composite recordings.

Component	RAM	CPU	Note
Host idle (no traffic)	936 MB used / 6.8 GB free	load ~0.04	8 GB total + 8 GB swap (0 used)
egress idle (warm Chrome pool)	819 MB	0.7%	container, no active recording
ingress idle	270 MB	0.05%
livekit-server (native)	~77 MB	~0.1%	systemd binary, not a container
core (NestJS)	~111 MB	~0.2%
1 room-composite recording (Chrome)	1.274 GiB [measured]	~0.9 core (87%)	13 Chrome processes, 177 PIDs; +455 MB over the idle container
1 publisher + N WebRTC subscribers	tens of MB in LiveKit	low	the SFU is lightweight; the viewer ceiling is the NIC (see below)
1 ws-mjpeg camera (ESP32)	a few MB [extrapolated]	~0	one buffered frame + a ~50 KB WS connection; the real cost is bandwidth

Recovery is fast: stopping a recording returns egress to ~820 MB within seconds.

Concurrent recording ceiling

Budget: 8 GB − ~1.5 GB (system + core + LiveKit + redis + idle ingress) ≈ 6.5 GB available for egress.

Egress mode	RAM per recording	Concurrency on 8 GB	When to use it
Room-composite (Chrome) — current default	~1.27 GB [measured]	~4–5 [extrapolated]	multi-participant composition / layout / overlays
Track-composite (ffmpeg)	~250–400 MB [extrapolated]	~15–20	one audio + one video track → MP4 (the common case)
Track egress (ffmpeg)	~150–250 MB [extrapolated]	~25+	a single raw track
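The ceilings in the table follow from dividing the egress budget by the per-recording cost. A minimal sketch of that arithmetic, using the figures above (the function name `max_concurrent` is illustrative, and the ffmpeg ranges are the page's extrapolations, not measurements):

```python
import math

def max_concurrent(total_gb: float, reserved_gb: float, per_recording_gb: float) -> int:
    """Whole recordings that fit in the RAM left after the fixed reservation."""
    available = total_gb - reserved_gb
    return max(0, math.floor(available / per_recording_gb))

# (low, high) RAM per recording in GB, taken from the table above.
modes = {
    "room-composite (Chrome)": (1.27, 1.27),
    "track-composite (ffmpeg)": (0.25, 0.40),
    "track (ffmpeg)": (0.15, 0.25),
}

for name, (low, high) in modes.items():
    worst = max_concurrent(8.0, 1.5, high)  # pessimistic: highest RAM per recording
    best = max_concurrent(8.0, 1.5, low)    # optimistic: lowest RAM per recording
    print(f"{name}: {worst} to {best} concurrent recordings")
```

The raw division is a RAM-only upper bound. The table's figures are somewhat more conservative, which leaves headroom for spikes and for CPU limits that this sketch does not model.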

Recommendation: prefer track-composite egress (ffmpeg) for single-publisher recordings

LiveKit offers several egress modes: Room Composite (headless Chrome composing an HTML layout, the ~1.27 GB measured above) and Track / Track-Composite (pure ffmpeg, no browser). For StreamHub’s dominant case (one camera/publisher → one MP4) there is nothing to compose: track-composite ffmpeg produces the same MP4 at roughly 1/5 to 1/3 the RAM [extrapolated] and without the 13 Chrome processes.

Simple recording (1 publisher) → track-composite ffmpeg. Roughly 3–5× more recordings per node (~15–20 vs ~4–5).
Composition (multiple participants, picture-in-picture, overlays, branding) → room-composite Chrome — the extra cost is justified there.
This is a change to recording.service (choose the egress type based on the number of publishers in the room, or a per-app flag). High impact, medium effort — it directly addresses the resource ceiling.
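The selection rule described here could look roughly like the sketch below. It is written in Python only to show the decision logic; the real recording.service belongs to the NestJS core, and `EgressMode`, `choose_egress_mode`, `needs_layout`, and `app_override` are hypothetical names, not existing StreamHub or LiveKit identifiers.

```python
from enum import Enum
from typing import Optional

class EgressMode(Enum):
    ROOM_COMPOSITE = "room_composite"    # headless Chrome, ~1.27 GB per recording
    TRACK_COMPOSITE = "track_composite"  # ffmpeg, one audio + one video track

def choose_egress_mode(
    publisher_count: int,
    needs_layout: bool = False,
    app_override: Optional[EgressMode] = None,
) -> EgressMode:
    """Pick the cheapest egress mode that can produce the requested recording."""
    # A per-app flag, when set, wins over the automatic choice.
    if app_override is not None:
        return app_override
    # One publisher and no overlays or branding: nothing to compose.
    if publisher_count <= 1 and not needs_layout:
        return EgressMode.TRACK_COMPOSITE
    # Multiple participants or a layout: the Chrome cost is justified.
    return EgressMode.ROOM_COMPOSITE
```

Keeping the override as an explicit argument means an app that always needs branding can pin room-composite without the service having to inspect the room.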

Viewer and camera ceilings

WebRTC (interactive): the SFU forwards packets, so the ceiling is the NIC (~1 Gbps → roughly 300–400 viewers per node at ~2.5 Mbps). For mass audiences, use HLS + a CDN/P2P layer instead of scaling WebRTC directly.
HLS served by the node: cacheable — with a CDN pull zone, an 8 GB origin can feed events in the 10k–100k viewer range.
ws-mjpeg cameras (CCTV): RAM cost is negligible, so the limit is network bandwidth. QVGA at 8 fps (~0.5 Mbps/camera) → hundreds of cameras on a 1 Gbps node (615 QVGA cameras ≈ 315 Mbps, well within budget). VGA at 15 fps (~2–3 Mbps/camera) saturates the NIC much sooner — for large fleets, use QVGA and/or shard across nodes.
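The network-bound ceilings above are simple bandwidth divisions. A small sketch, assuming a 1000 Mbps NIC; the `usable_fraction` parameter and the 0.512 Mbps per-camera figure are assumptions chosen to reproduce the 300–400 viewer range and the ≈315 Mbps fleet total quoted above:

```python
def nic_bound_streams(nic_mbps: float, per_stream_mbps: float,
                      usable_fraction: float = 1.0) -> int:
    """Streams that fit in the usable share of the NIC."""
    return int(nic_mbps * usable_fraction // per_stream_mbps)

def fleet_mbps(cameras: int, per_camera_mbps: float) -> float:
    """Aggregate ingest bandwidth for a camera fleet."""
    return cameras * per_camera_mbps

# WebRTC viewers at ~2.5 Mbps each.
print(nic_bound_streams(1000, 2.5))        # theoretical ceiling
print(nic_bound_streams(1000, 2.5, 0.75))  # leaving 25% headroom (assumed)

# QVGA fleet from the text: 615 cameras at an assumed 0.512 Mbps each.
print(round(fleet_mbps(615, 0.512), 2))

# VGA at 2-3 Mbps per camera saturates far sooner.
print(nic_bound_streams(1000, 3.0), nic_bound_streams(1000, 2.0))
```

Viewer traffic is egress and camera traffic is ingress, so on a full-duplex NIC the two budgets are largely independent; the sketch treats each direction separately.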

Offloading to a GPU node

Joining a GPU node to the cluster (install.sh --join, sharing the origin’s redis) lets you move the expensive work off the CPU-only origin:

Workload	On 8c/8GB (no GPU)	On a GPU node (e.g. RTX 3090)	Gain
Transcode ladder (RTMP ingest, adaptive VOD)	CPU x264, expensive	NVENC in parallel (dozens of encodes)	frees the origin’s CPU; the ladder becomes effectively free
Egress / recording	Chrome at 1.27 GB each	GPU acceleration + much more RAM headroom	dozens of concurrent recordings
YOLO (plugin)	CPU, slow	CUDA	real-time multi-camera inference

With transcode and egress offloaded to the GPU node, the origin is left to do what it’s efficient at: control plane, SFU, and ws-mjpeg ingest.

Sizing rules of thumb (per 8 vCPU / 8 GB node, no GPU)

Profile	Realistic capacity
CCTV (ws-mjpeg)	hundreds of QVGA cameras — network-bound, not RAM-bound; the cheap case
Recording (room-composite Chrome, current default)	~4–5 concurrent recordings
Recording (track-composite ffmpeg, recommended)	~15–20 concurrent recordings
Live interactive (WebRTC)	~300–400 viewers (or a few large rooms)
Mass event	1 origin + CDN/P2P → 10k–100k viewers (not served directly by the node)

Suggested quota defaults per app on a shared 8 GB node: max_concurrent_streams ~10, max_concurrent_recordings ~3 with Chrome egress (~10 with ffmpeg track-egress).
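As an illustration of how these defaults might be enforced, the sketch below models them as a per-app quota with a simple admission check. The field names `max_concurrent_streams` and `max_concurrent_recordings` come from the text; the `AppQuota` class and the helper functions are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AppQuota:
    max_concurrent_streams: int = 10
    # ~3 is the suggested default with Chrome egress; ~10 with ffmpeg egress.
    max_concurrent_recordings: int = 3

def can_start_stream(active_streams: int, quota: AppQuota) -> bool:
    """True if one more stream stays within the app's quota."""
    return active_streams < quota.max_concurrent_streams

def can_start_recording(active_recordings: int, quota: AppQuota) -> bool:
    """True if one more recording stays within the app's quota."""
    return active_recordings < quota.max_concurrent_recordings

chrome_quota = AppQuota()
ffmpeg_quota = AppQuota(max_concurrent_recordings=10)
print(can_start_recording(3, chrome_quota))  # at the Chrome limit
print(can_start_recording(3, ffmpeg_quota))  # still room with ffmpeg
```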

Optimization roadmap (priority order)

[High / medium effort] Track-composite ffmpeg for simple recordings: roughly 3–5× the concurrency, directly addresses the resource ceiling.
[High / small effort] Raise EGRESS_MEM_LIMIT on recording-heavy nodes (or leave it uncapped and monitor).
[High / large effort] Offload transcode + egress to a GPU node (NVENC) via cluster join.
[Medium / medium effort] Egress autoscaling — don’t keep the Chrome pool warm (819 MB idle) when the node isn’t recording, or switch to ffmpeg, which has no such base cost.
[Medium] Observability — deploy the Prometheus/Grafana stack to watch the ceiling live and alert before saturation (RAM > 85%).
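Until the Prometheus/Grafana stack is deployed, the RAM > 85% condition can be approximated with a small check over `/proc/meminfo`. This is a sketch, not the planned alerting setup: it assumes "RAM used" means `MemTotal` minus `MemAvailable`, and the sample text is made up for illustration.

```python
def ram_used_fraction(meminfo_text: str) -> float:
    """Fraction of RAM in use, computed from /proc/meminfo contents."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    return 1.0 - fields["MemAvailable"] / fields["MemTotal"]

def should_alert(meminfo_text: str, threshold: float = 0.85) -> bool:
    """True when usage exceeds the threshold (85% by default)."""
    return ram_used_fraction(meminfo_text) > threshold

# Illustrative sample: 8,000,000 kB total, 1,000,000 kB available.
SAMPLE = """MemTotal:        8000000 kB
MemFree:          400000 kB
MemAvailable:    1000000 kB
"""
print(ram_used_fraction(SAMPLE), should_alert(SAMPLE))
```

On a live Linux node the same functions can be fed the contents of `/proc/meminfo`, for example from a cron job that posts to a webhook when `should_alert` returns true.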

Latency tuning (WebRTC glass-to-glass)

Bottom line: StreamHub already meets a ≤0.5s WebRTC latency target with ~2.5× margin, measured empirically against production. The WebRTC pipeline itself adds only ~35ms on top of physical network RTT.

Method: a clock burned into pixels — a reproducible harness (bench/latency/) encodes Date.now() plus a checksum into visual blocks, decoded frame-by-frame on the subscriber. Publisher and subscriber run on the same machine (same clock), over a real network path.
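To make the idea concrete, here is a minimal sketch of encoding a millisecond timestamp plus a checksum into black/white blocks and decoding it again. It is not the code in bench/latency/: the 48-bit timestamp width, the XOR checksum, and all function names are assumptions, and the rendering of blocks into pixels is left out.

```python
from typing import List, Optional

TS_BITS = 48     # wide enough for millisecond epoch timestamps
CHECK_BITS = 8   # one checksum byte

def checksum(ts_ms: int) -> int:
    """XOR of the timestamp's six bytes (an assumed, deliberately simple checksum)."""
    c = 0
    for shift in range(0, TS_BITS, 8):
        c ^= (ts_ms >> shift) & 0xFF
    return c

def encode_blocks(ts_ms: int) -> List[int]:
    """Return 56 block values (1 = white, 0 = black), most significant bit first."""
    total = TS_BITS + CHECK_BITS
    word = (ts_ms << CHECK_BITS) | checksum(ts_ms)
    return [(word >> (total - 1 - i)) & 1 for i in range(total)]

def decode_blocks(blocks: List[int]) -> Optional[int]:
    """Rebuild the timestamp; None means the checksum failed (corrupt frame)."""
    word = 0
    for bit in blocks:
        word = (word << 1) | (bit & 1)
    ts_ms, check = word >> CHECK_BITS, word & 0xFF
    return ts_ms if checksum(ts_ms) == check else None

def frame_latency_ms(blocks: List[int], now_ms: int) -> Optional[int]:
    """Latency of one decoded frame, or None if the frame should be discarded."""
    sent = decode_blocks(blocks)
    return None if sent is None else now_ms - sent
```

The checksum matters because video compression can smear block edges: a frame whose blocks decode to an inconsistent value is dropped rather than contributing a bogus latency sample.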

Measurements (client RTT 158 ms, 30s runs each)

#	Path / configuration	p50	p95
1	WebRTC — defaults (simulcast on, VP8)	193 ms	204 ms
2	WebRTC — repeat run (stability check)	193 ms	203 ms
3	WebRTC — simulcast off	294 ms	332 ms
4	WebRTC — H.264	213 ms	223 ms
5	WebRTC — under load (HLS Chrome egress active + a `/play` viewer)	212 ms	232 ms
6	WebRTC — after L1 tuning (16 MB UDP buffers)	193 ms	241 ms*
7	HLS-live (room-composite segmented output)	15.2 s	15.2 s
8	RTMP ingress (OBS/ffmpeg → transcode)	~2 s	—

* the p95 bump is ICE re-warm noise right after a LiveKit restart, not a regression — the p50 didn’t move. UDP buffers protect the tail under load; they don’t change the idle-path latency.
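For readers reproducing the p50/p95 columns from raw per-frame samples, one common definition is the nearest-rank percentile sketched below. How the harness itself computes percentiles is not stated on this page, so treat the method as an assumption; the sample values are invented for illustration.

```python
import math
from typing import Sequence

def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Illustrative latency samples in milliseconds (not real measurements).
latencies_ms = [190, 193, 193, 194, 204]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))
```

With a 30 s run at typical frame rates there are several hundred samples, so p95 reflects a few dozen of the slowest frames and is much more sensitive to transient noise than p50.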

Key takeaways:

The pipeline itself (capture + encode + SFU + adaptive jitter buffer + decode + render) costs ~35 ms: 193 ms measured − 158 ms RTT. The full RTT is subtracted because publisher and subscriber share a machine, so the publisher→server and server→subscriber legs cross the same path and together add up to one round trip. In a real topology: glass-to-glass ≈ 35 ms + one-way(publisher→server) + one-way(server→viewer). With clients in-region (~20 ms one-way) that’s roughly 75 ms glass-to-glass.
Simulcast ON is better for latency, not worse (+100ms with it off) — the subscriber starts on the fast layer. Don’t disable it to “optimize.”
VP8 (the default) beats H.264 by ~20ms in this browser pipeline.
An active Chrome egress costs only +19ms p50 with idle CPU headroom — compositing coexists fine with the SFU as long as there’s CPU available.
The #1 lever for latency below ~190ms is moving the server closer to the client (edge nodes / cluster via install.sh --join), not tuning LiveKit further — the default adaptive jitter buffer and congestion control (TWCC/NACK/PLI) already operate near-optimally on a good network.
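The arithmetic behind these takeaways, as a short sketch. The function names are illustrative; the inputs are the measured p50 values and the client RTT from the table above, and the 20 ms one-way figure is the in-region example from the text.

```python
def pipeline_cost_ms(measured_p50_ms: float, client_rtt_ms: float) -> float:
    """Pipeline overhead when publisher and subscriber share one client machine.

    Both media legs cross the same client-server path, so together
    they account for one full round trip.
    """
    return measured_p50_ms - client_rtt_ms

def glass_to_glass_ms(pipeline_ms: float, publisher_one_way_ms: float,
                      viewer_one_way_ms: float) -> float:
    """Estimated end-to-end latency in a real topology."""
    return pipeline_ms + publisher_one_way_ms + viewer_one_way_ms

pipeline = pipeline_cost_ms(193, 158)
print(pipeline)                             # pipeline overhead
print(glass_to_glass_ms(pipeline, 20, 20))  # in-region clients

# Deltas against the default configuration (p50, run 1).
print(294 - 193)  # simulcast off
print(213 - 193)  # H.264 instead of VP8
print(212 - 193)  # active Chrome egress
```

Because the pipeline term is fixed at about 35 ms, the two one-way terms dominate for any client that is not in-region, which is why server placement is the main lever.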

Server-side tuning already applied / recommended

What’s already correct and shouldn’t be touched: a single UDP media mux (7882) with use_external_ip: true (media goes directly to the public IP, not through nginx — nginx only proxies the /rtc signaling path), default congestion control, and a shared redis for node affinity.

#	Lever	Status	Effect
L1	Kernel UDP buffers 4MB→16MB (`net.core.rmem_max`/`wmem_max=16777216`)	applied on `stream01` (`/etc/sysctl.d/99-streamhub-webrtc.conf`) and in `install.sh` for new nodes	avoids drops/NACKs under concurrent load; protects p95/p99
L2	Isolate egress (Chrome) — today it runs with no CPU/RAM limit on the same host	pending	an unbounded Chrome instance can consume most of the host’s RAM or steal cycles from the SFU; add `mem_limit`+`cpus`, or move egress to the GPU node
L3	Low playout delay — not a server config; set per track/room from the SDK	pending (a “low-latency” preset)	forces the jitter buffer to its minimum; helps on jittery networks, risks frozen frames
L4	Codecs: keep VP8 as default; don’t enable AV1 for low-latency use cases with weak client hardware	documented	AV1/VP9 raise encode/decode cost on weak hardware
L5	TURN disabled — clients with UDP blocked fall back to ICE-TCP (7881), which is much worse or fails to connect	pending decision	enabling TURN/UDP improves coverage for ~5–15% of restrictive networks without affecting the rest’s p50
L6	Double signaling hop `wss → nginx → 7880`	leave as-is	only affects the initial handshake, not media
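For lever L1, a quick way to confirm that a node's sysctl drop-in carries the 16 MB values is to parse it and compare against the target. The sketch below works on sysctl-style `key = value` text; the sample content is a guess at what `/etc/sysctl.d/99-streamhub-webrtc.conf` contains, based only on the two settings named in the table, and the function names are illustrative.

```python
from typing import Dict, List

REQUIRED = {
    "net.core.rmem_max": 16777216,  # 16 MiB receive buffer ceiling
    "net.core.wmem_max": 16777216,  # 16 MiB send buffer ceiling
}

def parse_sysctl(text: str) -> Dict[str, int]:
    """Parse integer settings from sysctl.conf-style text, skipping comments."""
    values: Dict[str, int] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        try:
            values[key.strip()] = int(value.strip())
        except ValueError:
            continue  # non-numeric settings are not relevant here
    return values

def undersized(text: str) -> List[str]:
    """Names of required settings that are missing or below the target."""
    values = parse_sysctl(text)
    return [key for key, need in REQUIRED.items() if values.get(key, 0) < need]

# Hypothetical file content, for illustration only.
SAMPLE_CONF = """# StreamHub WebRTC UDP buffers
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
"""
print(undersized(SAMPLE_CONF))
```

A drop-in file only states the intended values; the values actually in effect can be read from `/proc/sys/net/core/rmem_max` and `/proc/sys/net/core/wmem_max` on the node.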

HLS and RTMP — what’s realistically achievable

HLS today: ~15s (room-composite → .ts segments). Shorter (2s) segments plus a short liveSyncDuration in the player should bring it down to ~6–8s; the real jump is LL-HLS/CMAF (~2–3s), which is a larger, separate effort.
RTMP: ~2s — inherent to RTMP ingest plus ingress transcode. For low-latency publishing, use WHIP (WebRTC ingest, port 8080) instead, which follows the ~35ms path, or publish directly from the browser.
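A common rule of thumb for segmented HLS (not a StreamHub measurement) is that latency is roughly the segment duration times the number of segments the player holds back from the live edge, plus encode and upload overhead. The sketch below applies it to the 2 s segment proposal; the three held-back segments and the 0–2 s overhead are assumptions chosen for illustration.

```python
def hls_latency_s(segment_s: float, segments_behind_live: int = 3,
                  overhead_s: float = 0.0) -> float:
    """Rough live latency for segmented (non-LL) HLS."""
    return segment_s * segments_behind_live + overhead_s

# 2 s segments, player three segments behind the live edge (assumed).
print(hls_latency_s(2.0))                  # no overhead
print(hls_latency_s(2.0, overhead_s=2.0))  # with 2 s encode/upload overhead
```

Under these assumptions the estimate lands in the ~6–8 s range quoted above. Going below that requires partial segments, which is what LL-HLS/CMAF provides.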

Reproducing the measurements

See bench/latency/README.md in the repo. Rule of thumb: measure before and after every change — the harness takes about 40 seconds per run, so there’s no excuse not to.