Capacity planning & latency tuning
Numbers on this page are measured against a real production node — 8 vCPU / 8 GB RAM, no GPU (OVH, plain-server deploy) — and marked [measured] vs [extrapolated]. Method: bounded measurements against production without saturating it (start one stream + one egress, measure, stop), then extrapolate the ceiling.
The bottleneck: egress RAM
On an 8 vCPU / 8 GB node the bottleneck is RAM consumed by room-composite egress (headless Chrome): each recording costs ~1.27 GB [measured]. That means an 8 GB node supports roughly 4–5 concurrent room-composite recordings.
| Component | RAM | CPU | Note |
|---|---|---|---|
| Host idle (no traffic) | 936 MB used / 6.8 GB free | load ~0.04 | 8 GB total + 8 GB swap (0 used) |
| egress idle (warm Chrome pool) | 819 MB | 0.7% | container, no active recording |
| ingress idle | 270 MB | 0.05% | |
| livekit-server (native) | ~77 MB | ~0.1% | systemd binary, not a container |
| core (NestJS) | ~111 MB | ~0.2% | |
| 1 room-composite recording (Chrome) | 1.274 GB [measured] | ~0.9 core (87%) | 13 Chrome processes, 177 PIDs; +455 MB over the idle container (819 MB + 455 MB) |
| 1 publisher + N WebRTC subscribers | tens of MB in LiveKit | low | the SFU is lightweight; the viewer ceiling is the NIC (see below) |
| 1 ws-mjpeg camera (ESP32) | a few MB [extrapolated] | ~0 | one buffered frame + a ~50 KB WS connection; the real cost is bandwidth |
Recovery is fast: stopping a recording returns egress to ~820 MB within seconds.
Concurrent recording ceiling
Budget: 8 GB − ~1.5 GB (system + core + LiveKit + redis + idle ingress) ≈ 6.5 GB available for egress.
| Egress mode | RAM per recording | Concurrency on 8 GB | When to use it |
|---|---|---|---|
| Room-composite (Chrome) — current default | ~1.27 GB [measured] | ~4–5 [extrapolated] | multi-participant composition / layout / overlays |
| Track-composite (ffmpeg) | ~250–400 MB [extrapolated] | ~15–20 | one audio + one video track → MP4 (the common case) |
| Track egress (ffmpeg) | ~150–250 MB [extrapolated] | ~25+ | a single raw track |
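The concurrency column follows from dividing the egress budget by the per-recording cost. A minimal sketch of that arithmetic, using the figures from the tables above; the function name is illustrative, and the per-recording values for the ffmpeg modes are the upper ends of the extrapolated ranges:

```python
import math

def max_concurrent_recordings(total_gb: float, reserved_gb: float, per_recording_gb: float) -> int:
    """Recordings that fit in the RAM left over after the fixed services."""
    available_gb = total_gb - reserved_gb
    if available_gb <= 0 or per_recording_gb <= 0:
        return 0
    return math.floor(available_gb / per_recording_gb)

TOTAL_GB = 8.0
RESERVED_GB = 1.5  # system + core + LiveKit + redis + idle ingress

room_composite = max_concurrent_recordings(TOTAL_GB, RESERVED_GB, 1.27)   # [measured] cost
track_composite = max_concurrent_recordings(TOTAL_GB, RESERVED_GB, 0.40)  # upper end, [extrapolated]
track_egress = max_concurrent_recordings(TOTAL_GB, RESERVED_GB, 0.25)     # upper end, [extrapolated]
```

With these inputs the sketch gives 5, 16, and 26 recordings, in line with the ~4–5, ~15–20, and ~25+ ranges in the table. Treat the results as ceilings: they leave no margin for spikes.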
Recommendation: prefer track-egress for single-publisher recordings
LiveKit offers several egress modes: Room Composite (headless Chrome, composes an HTML layout; the ~1.27 GB measured above) and Track / Track-Composite (pure ffmpeg, no browser). For StreamHub’s dominant case (one camera/publisher → one MP4) there’s nothing to compose: track-composite ffmpeg produces the same MP4 at roughly a fifth to a third of the RAM (~250–400 MB [extrapolated] vs ~1.27 GB) and without the 13 Chrome processes.
- Simple recording (1 publisher) → track-composite ffmpeg. Up to ~5× more recordings per node (~15–20 vs ~4–5).
- Composition (multiple participants, picture-in-picture, overlays, branding) → room-composite Chrome — the extra cost is justified there.
- This is a change to recording.service (choose the egress type based on the number of publishers in the room, or a per-app flag). High impact, medium effort: it directly addresses the resource ceiling.
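The selection rule described above can be sketched as follows. This is an illustration only, not the actual recording.service code: the enum, the function, and the force_composite flag (standing in for the per-app flag) are all hypothetical names.

```python
from enum import Enum

class EgressMode(Enum):
    ROOM_COMPOSITE = "room_composite"    # headless Chrome, layout/overlays
    TRACK_COMPOSITE = "track_composite"  # no browser, one audio + one video track

def choose_egress_mode(publisher_count: int, force_composite: bool = False) -> EgressMode:
    """Pick the cheapest egress mode that can still produce the requested recording."""
    if publisher_count < 0:
        raise ValueError("publisher_count must not be negative")
    # Assumption: anything other than exactly one publisher needs composition
    # (zero publishers means there is no single track to bind to yet).
    if force_composite or publisher_count != 1:
        return EgressMode.ROOM_COMPOSITE
    return EgressMode.TRACK_COMPOSITE
```

Keeping the decision in one pure function makes it easy to unit-test and to override per app without touching the egress call itself.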
Viewer and camera ceilings
- WebRTC (interactive): the SFU forwards packets, so the ceiling is the NIC (~1 Gbps → roughly 300–400 viewers per node at ~2.5 Mbps). For mass audiences, use HLS + a CDN/P2P layer instead of scaling WebRTC directly.
- HLS served by the node: cacheable — with a CDN pull zone, an 8 GB origin can feed events in the 10k–100k viewer range.
- ws-mjpeg cameras (CCTV): RAM cost is negligible, so the limit is network bandwidth. QVGA at 8 fps (~0.5 Mbps/camera) → hundreds of cameras on a 1 Gbps node (615 QVGA cameras ≈ 315 Mbps, well within budget). VGA at 15 fps (~2–3 Mbps/camera) saturates the NIC much sooner — for large fleets, use QVGA and/or shard across nodes.
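These ceilings are plain bandwidth division. A small sketch of the arithmetic, assuming a 1 Gbps NIC; the usable_fraction parameter is an assumption added here to model headroom and is not a figure from the measurements:

```python
def max_streams(nic_mbps: float, per_stream_mbps: float, usable_fraction: float = 1.0) -> int:
    """How many streams of a given bitrate fit in the usable share of the NIC."""
    return int((nic_mbps * usable_fraction) // per_stream_mbps)

def fleet_mbps(stream_count: int, per_stream_mbps: float) -> float:
    """Aggregate bandwidth of a fleet of identical streams."""
    return stream_count * per_stream_mbps

viewers_ceiling = max_streams(1000, 2.5)            # full line rate
viewers_with_margin = max_streams(1000, 2.5, 0.75)  # assumed 25% headroom
qvga_fleet = fleet_mbps(615, 0.5)                   # nominal 0.5 Mbps per QVGA camera
vga_cameras = max_streams(1000, 3.0)                # VGA at the upper ~3 Mbps estimate
```

Under these assumptions the viewer ceiling comes out at 400 at line rate and 300 with headroom, which brackets the 300–400 range above. The 615-camera QVGA fleet needs about 307.5 Mbps at the nominal 0.5 Mbps, close to the ~315 Mbps quoted.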
Offloading to a GPU node
Joining a GPU node to the cluster (install.sh --join, sharing the origin’s redis) lets you move the expensive work off the CPU-only origin:
| Workload | On 8c/8GB (no GPU) | On a GPU node (e.g. RTX 3090) | Gain |
|---|---|---|---|
| Transcode ladder (RTMP ingest, adaptive VOD) | CPU x264, expensive | NVENC in parallel (dozens of encodes) | frees the origin’s CPU; the ladder becomes effectively free |
| Egress / recording | Chrome at 1.27 GB each | GPU acceleration + much more RAM headroom | dozens of concurrent recordings |
| YOLO (plugin) | CPU, slow | CUDA | real-time multi-camera inference |
With transcode and egress offloaded to the GPU node, the origin is left to do what it’s efficient at: control plane, SFU, and ws-mjpeg ingest.
Sizing rules of thumb (per 8 vCPU / 8 GB node, no GPU)
| Profile | Realistic capacity |
|---|---|
| CCTV (ws-mjpeg) | hundreds of QVGA cameras — network-bound, not RAM-bound; the cheap case |
| Recording (room-composite Chrome, current default) | ~4–5 concurrent recordings |
| Recording (track-egress ffmpeg, recommended) | ~15–20 concurrent recordings |
| Live interactive (WebRTC) | ~300–400 viewers (or a few large rooms) |
| Mass event | 1 origin + CDN/P2P → 10k–100k viewers (not served directly by the node) |
Suggested quota defaults per app on a shared 8 GB node: max_concurrent_streams ~10,
max_concurrent_recordings ~3 with Chrome egress (~10 with ffmpeg track-egress).
Optimization roadmap (priority order)
- [High / medium effort] Track-egress ffmpeg for simple recordings: up to ~5× the concurrency, directly addresses the resource ceiling.
- [High / small effort] Raise EGRESS_MEM_LIMIT on recording-heavy nodes (or leave it uncapped and monitor).
- [High / large effort] Offload transcode + egress to a GPU node (NVENC) via cluster join.
- [Medium / medium effort] Egress autoscaling — don’t keep the Chrome pool warm (819 MB idle) when the node isn’t recording, or switch to ffmpeg, which has no such base cost.
- [Medium] Observability — deploy the Prometheus/Grafana stack to watch the ceiling live and alert before saturation (RAM > 85%).
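Until the Prometheus/Grafana stack is deployed, the 85% RAM threshold can be checked with a few lines of script. The sketch below is a stand-in, not the planned alerting setup: it parses the text of Linux's /proc/meminfo (the MemTotal and MemAvailable fields) and the function names are illustrative.

```python
def ram_used_percent(meminfo_text: str) -> float:
    """Percentage of RAM in use, computed from /proc/meminfo contents."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    total_kb = fields["MemTotal"]
    available_kb = fields["MemAvailable"]
    return 100.0 * (total_kb - available_kb) / total_kb

def should_alert(meminfo_text: str, threshold_percent: float = 85.0) -> bool:
    """True when RAM usage is above the alert threshold."""
    return ram_used_percent(meminfo_text) > threshold_percent
```

On the node itself this would be called as should_alert(open("/proc/meminfo").read()), for example from a cron job or a health endpoint.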
Latency tuning (WebRTC glass-to-glass)
Bottom line: StreamHub already meets a ≤0.5s WebRTC latency target with ~2.5× margin (193 ms p50 against a 500 ms budget), measured empirically against production. The WebRTC pipeline itself adds only ~35ms on top of physical network RTT.
Method: a clock burned into pixels — a reproducible harness (bench/latency/) encodes
Date.now() plus a checksum into visual blocks, decoded frame-by-frame on the subscriber.
Publisher and subscriber run on the same machine (same clock), over a real network path.
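To make the burned-in clock idea concrete, here is a minimal Python sketch of the technique. It is not the code in bench/latency/: the 48-bit timestamp width, the XOR checksum, and the one-luma-value-per-block frame model are assumptions chosen for illustration.

```python
import time
from typing import List, Optional

TIMESTAMP_BITS = 48  # enough for millisecond epoch timestamps
CHECKSUM_BITS = 8

def checksum(ts_ms: int) -> int:
    """XOR of the six timestamp bytes (a hypothetical, deliberately simple checksum)."""
    result = 0
    for byte in ts_ms.to_bytes(TIMESTAMP_BITS // 8, "big"):
        result ^= byte
    return result

def encode_blocks(ts_ms: int) -> List[int]:
    """One luma value per block: 255 (white) for a 1 bit, 0 (black) for a 0 bit."""
    payload = (ts_ms << CHECKSUM_BITS) | checksum(ts_ms)
    total_bits = TIMESTAMP_BITS + CHECKSUM_BITS
    return [255 if (payload >> (total_bits - 1 - i)) & 1 else 0 for i in range(total_bits)]

def decode_blocks(luma: List[int]) -> Optional[int]:
    """Threshold each block at mid-gray; return the timestamp, or None on checksum failure."""
    payload = 0
    for value in luma:
        payload = (payload << 1) | (1 if value >= 128 else 0)
    ts_ms = payload >> CHECKSUM_BITS
    if checksum(ts_ms) != payload & ((1 << CHECKSUM_BITS) - 1):
        return None
    return ts_ms

def latency_ms(luma: List[int], now_ms: int) -> Optional[int]:
    """Glass-to-glass latency for one decoded frame, or None if the frame is unreadable."""
    sent_ms = decode_blocks(luma)
    return None if sent_ms is None else now_ms - sent_ms

# Publisher side: stamp the current time (Python stand-in for Date.now()).
frame_blocks = encode_blocks(int(time.time() * 1000))
```

Large black/white blocks survive lossy video encoding because the decoder only needs to tell dark from light. The checksum lets the subscriber drop frames that were damaged in transit rather than report a bogus latency.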
Measurements (client RTT 158 ms, 30s runs each)
| # | Path / configuration | p50 | p95 |
|---|---|---|---|
| 1 | WebRTC — defaults (simulcast on, VP8) | 193 ms | 204 ms |
| 2 | WebRTC — repeat run (stability check) | 193 ms | 203 ms |
| 3 | WebRTC — simulcast off | 294 ms | 332 ms |
| 4 | WebRTC — H.264 | 213 ms | 223 ms |
| 5 | WebRTC — under load (HLS Chrome egress active + a /play viewer) | 212 ms | 232 ms |
| 6 | WebRTC — after L1 tuning (16 MB UDP buffers) | 193 ms | 241 ms* |
| 7 | HLS-live (room-composite segmented output) | 15.2 s | 15.2 s |
| 8 | RTMP ingress (OBS/ffmpeg → transcode) | ~2 s | — |
* the p95 bump is ICE re-warm noise right after a LiveKit restart, not a regression — the p50 didn’t move. UDP buffers protect the tail under load; they don’t change the idle-path latency.
Key takeaways:
- The pipeline itself (capture + encode + SFU + adaptive jitter buffer + decode + render) costs ~35 ms (193 ms measured − 158 ms RTT; with publisher and subscriber on the same machine, the media crosses the client-server link once in each direction, which adds up to one full round trip). In a real topology: glass-to-glass ≈ 35ms + one-way(publisher→server) + one-way(server→viewer). With clients in-region (~20ms one-way) that’s roughly 75ms glass-to-glass.
- Simulcast ON is better for latency, not worse (+100ms with it off): the subscriber starts on the fast layer. Don’t disable it to “optimize.”
- VP8 (the default) beats H.264 by ~20ms in this browser pipeline.
- An active Chrome egress costs only +19ms p50 with idle CPU headroom: compositing coexists fine with the SFU as long as there’s CPU available.
- The #1 lever for latency below ~190ms is moving the server closer to the client (edge nodes / cluster via install.sh --join), not tuning LiveKit further: the default adaptive jitter buffer and congestion control (TWCC/NACK/PLI) already operate near-optimally on a good network.
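The latency arithmetic from the first takeaway, written out as a small sketch. The function names are illustrative; the numbers are the ones measured above.

```python
def pipeline_overhead_ms(measured_ms: float, client_rtt_ms: float) -> float:
    """Pipeline cost when publisher and subscriber share a machine:
    the media travels client -> server -> client, i.e. one full RTT."""
    return measured_ms - client_rtt_ms

def glass_to_glass_ms(publisher_one_way_ms: float, viewer_one_way_ms: float,
                      pipeline_ms: float = 35.0) -> float:
    """Estimated end-to-end latency for a real topology."""
    return pipeline_ms + publisher_one_way_ms + viewer_one_way_ms

overhead = pipeline_overhead_ms(193, 158)  # run 1 p50 vs client RTT
in_region = glass_to_glass_ms(20, 20)      # ~20 ms one-way on each leg
```

With the measured inputs this gives 35 ms of pipeline overhead and an in-region estimate of 75 ms. Plugging in the one-way delays of a planned edge location shows how much a closer server would buy before any LiveKit tuning.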
Server-side tuning already applied / recommended
What’s already correct and shouldn’t be touched: a single UDP media mux (7882) with use_external_ip: true (media goes directly to the public IP, not through nginx; nginx only proxies the /rtc signaling path), default congestion control, and a shared redis for node affinity.
| # | Lever | Status | Effect |
|---|---|---|---|
| L1 | Kernel UDP buffers 4MB→16MB (net.core.rmem_max/wmem_max=16777216) | applied on stream01 (/etc/sysctl.d/99-streamhub-webrtc.conf) and in install.sh for new nodes | avoids drops/NACKs under concurrent load; protects p95/p99 |
| L2 | Isolate egress (Chrome) — today it runs with no CPU/RAM limit on the same host | pending | an unbounded Chrome instance can consume most of the host’s RAM or steal cycles from the SFU; add mem_limit+cpus, or move egress to the GPU node |
| L3 | Low playout delay — not a server config; set per track/room from the SDK | pending (a “low-latency” preset) | forces the jitter buffer to its minimum; helps on jittery networks, risks frozen frames |
| L4 | Codecs: keep VP8 as default; don’t enable AV1 for low-latency use cases with weak client hardware | documented | AV1/VP9 raise encode/decode cost on weak hardware |
| L5 | TURN disabled — clients with UDP blocked fall back to ICE-TCP (7881), which is much worse or fails to connect | pending decision | enabling TURN/UDP improves coverage for ~5–15% of restrictive networks without affecting the rest’s p50 |
| L6 | Double signaling hop wss → nginx → 7880 | leave as-is | only affects the initial handshake, not media |
HLS and RTMP — what’s realistically achievable
- HLS today: ~15s (room-composite → .ts segments). To bring it down: shorter (2s) segments plus a short liveSyncDuration in the player (~6–8s); the real jump is LL-HLS/CMAF (~2–3s), which is a larger, separate effort.
- RTMP: ~2s, inherent to RTMP ingest plus ingress transcode. For low-latency publishing, use WHIP (WebRTC ingest, port 8080) instead, which follows the ~35ms path, or publish directly from the browser.
Reproducing the measurements
See bench/latency/README.md in the repo. Rule of thumb: measure before and after every change. The harness takes about 40 seconds per run, so there’s no excuse not to.