Capacity planning & latency tuning

Numbers on this page are measured against a real production node — 8 vCPU / 8 GB RAM, no GPU (OVH, plain-server deploy) — and marked [measured] vs [extrapolated]. Method: bounded measurements against production without saturating it (start one stream + one egress, measure, stop), then extrapolate the ceiling.

On an 8 vCPU / 8 GB node the bottleneck is RAM consumed by room-composite egress (headless Chrome) — each recording costs ~1.27 GB [measured]. That means an 8 GB node supports roughly 4–5 concurrent room-composite recordings.

| Component | RAM | CPU | Note |
| --- | --- | --- | --- |
| Host idle (no traffic) | 936 MB used / 6.8 GB free | load ~0.04 | 8 GB total + 8 GB swap (0 used) |
| egress idle (warm Chrome pool) | 819 MB | 0.7% | container, no active recording |
| ingress idle | 270 MB | 0.05% | |
| livekit-server (native) | ~77 MB | ~0.1% | systemd binary, not a container |
| core (NestJS) | ~111 MB | ~0.2% | |
| 1 room-composite recording (Chrome) | 1.274 GiB [measured] | ~0.9 core (87%) | 13 Chrome processes, 177 PIDs; +455 MB over the idle container |
| 1 publisher + N WebRTC subscribers | tens of MB in LiveKit | low | the SFU is lightweight; the viewer ceiling is the NIC (see below) |
| 1 ws-mjpeg camera (ESP32) | a few MB [extrapolated] | ~0 | one buffered frame + a ~50 KB WS connection; the real cost is bandwidth |

Recovery is fast: stopping a recording returns egress to ~820 MB within seconds.

Budget: 8 GB − ~1.5 GB (system + core + LiveKit + redis + idle ingress) ≈ 6.5 GB available for egress.

| Egress mode | RAM per recording | Concurrency on 8 GB | When to use it |
| --- | --- | --- | --- |
| Room-composite (Chrome), current default | ~1.27 GB [measured] | ~4–5 [extrapolated] | multi-participant composition / layout / overlays |
| Track-composite (ffmpeg) | ~250–400 MB [extrapolated] | ~15–20 | one audio + one video track → MP4 (the common case) |
| Track egress (ffmpeg) | ~150–250 MB [extrapolated] | ~25+ | a single raw track |
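
The concurrency column follows from dividing the egress budget by the per-recording RAM. A minimal sketch of that arithmetic, using only the figures on this page; the function and dictionary names are illustrative and not part of StreamHub:

```python
import math

NODE_RAM_GB = 8.0   # node size used throughout this page
BASELINE_GB = 1.5   # system + core + LiveKit + redis + idle ingress (budget line above)

# Per-recording RAM in GB as (low, high). Only the room-composite figure is
# measured; the ffmpeg ranges are the extrapolated values from the table.
RAM_PER_RECORDING_GB = {
    "room_composite": (1.27, 1.27),
    "track_composite": (0.25, 0.40),
    "track": (0.15, 0.25),
}


def egress_budget_gb(node_ram_gb=NODE_RAM_GB, baseline_gb=BASELINE_GB):
    """RAM left for egress once the baseline services are accounted for."""
    return node_ram_gb - baseline_gb


def max_recordings(mode, budget_gb=None):
    """Return (pessimistic, optimistic) whole recordings that fit in the budget."""
    if budget_gb is None:
        budget_gb = egress_budget_gb()
    low, high = RAM_PER_RECORDING_GB[mode]
    # The high RAM estimate gives the pessimistic count, the low one the optimistic count.
    return math.floor(budget_gb / high), math.floor(budget_gb / low)


for mode in RAM_PER_RECORDING_GB:
    print(mode, max_recordings(mode))
```

These are hard upper bounds from plain division. The table's figures (~4–5, ~15–20) sit at or below them, which leaves headroom for spikes.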

Recommendation: prefer track-egress for single-publisher recordings

LiveKit offers two families of egress: Room Composite (headless Chrome rendering an HTML layout, the ~1.27 GB measured above) and Track / Track-Composite (pure ffmpeg, no browser). StreamHub’s dominant case is one camera/publisher → one MP4, where there is nothing to compose: track-composite ffmpeg produces the same MP4 at roughly 1/5 the RAM and without the 13 Chrome processes.
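
The per-room choice this implies can be sketched as a small decision function. This is an illustration only: recording.service lives in the NestJS core, so the real change would be written in TypeScript, and `choose_egress_mode`, `EgressMode`, and the `needs_layout` flag are hypothetical names.

```python
from enum import Enum


class EgressMode(Enum):
    ROOM_COMPOSITE = "room_composite"    # headless Chrome, composes a layout
    TRACK_COMPOSITE = "track_composite"  # ffmpeg, one audio + one video track


def choose_egress_mode(publisher_count: int, needs_layout: bool = False) -> EgressMode:
    """Pick the cheapest egress mode that can still produce the recording.

    needs_layout stands in for a per-app flag (overlays, branding,
    picture-in-picture); it is an assumption, not an existing setting.
    """
    if publisher_count < 1:
        raise ValueError("nothing to record: the room has no publishers")
    if needs_layout or publisher_count > 1:
        return EgressMode.ROOM_COMPOSITE
    return EgressMode.TRACK_COMPOSITE


print(choose_egress_mode(1))                     # single camera, no layout
print(choose_egress_mode(3))                     # several participants
print(choose_egress_mode(1, needs_layout=True))  # single camera with overlays
```

Defaulting to the ffmpeg path and opting in to Chrome keeps the expensive mode for rooms that actually need composition.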

  • Simple recording (1 publisher) → track-composite ffmpeg. 5× more recordings per node.
  • Composition (multiple participants, picture-in-picture, overlays, branding) → room-composite Chrome — the extra cost is justified there.
  • This is a change to recording.service (choose the egress type based on the number of publishers in the room, or a per-app flag). High impact, medium effort — it directly addresses the resource ceiling.
  • WebRTC (interactive): the SFU forwards packets, so the ceiling is the NIC (~1 Gbps → roughly 300–400 viewers per node at ~2.5 Mbps). For mass audiences, use HLS + a CDN/P2P layer instead of scaling WebRTC directly.
  • HLS served by the node: cacheable — with a CDN pull zone, an 8 GB origin can feed events in the 10k–100k viewer range.
  • ws-mjpeg cameras (CCTV): RAM cost is negligible, so the limit is network bandwidth. QVGA at 8 fps (~0.5 Mbps/camera) → hundreds of cameras on a 1 Gbps node (615 QVGA cameras ≈ 315 Mbps, well within budget). VGA at 15 fps (~2–3 Mbps/camera) saturates the NIC much sooner — for large fleets, use QVGA and/or shard across nodes.
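
The network-bound ceilings in the bullets above are plain division of NIC capacity by per-stream bitrate. A minimal sketch with the bitrates quoted on this page; the `utilization` parameter is an assumption added here to show how a 300–400 range can arise from not driving the link to 100%:

```python
def max_streams(nic_mbps: float, per_stream_mbps: float, utilization: float = 1.0) -> int:
    """Whole streams that fit in the usable share of the NIC."""
    if per_stream_mbps <= 0:
        raise ValueError("per_stream_mbps must be positive")
    return int(nic_mbps * utilization // per_stream_mbps)


NIC_MBPS = 1000  # ~1 Gbps

print(max_streams(NIC_MBPS, 2.5))        # WebRTC viewers at ~2.5 Mbps, full link
print(max_streams(NIC_MBPS, 2.5, 0.75))  # same, keeping 25% of the link free
print(max_streams(NIC_MBPS, 0.5))        # QVGA cameras at ~0.5 Mbps
print(max_streams(NIC_MBPS, 3))          # VGA cameras at the high end (~3 Mbps)
```

The results are theoretical link limits. Real fleets should stay well below them, since the same NIC also carries egress uploads, HLS, and signaling.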

Joining a GPU node to the cluster (install.sh --join, sharing the origin’s redis) lets you move the expensive work off the CPU-only origin:

| Workload | On 8c/8GB (no GPU) | On a GPU node (e.g. RTX 3090) | Gain |
| --- | --- | --- | --- |
| Transcode ladder (RTMP ingest, adaptive VOD) | CPU x264, expensive | NVENC in parallel (dozens of encodes) | frees the origin’s CPU; the ladder becomes effectively free |
| Egress / recording | Chrome at 1.27 GB each | GPU acceleration + much more RAM headroom | dozens of concurrent recordings |
| YOLO (plugin) | CPU, slow | CUDA | real-time multi-camera inference |

With transcode and egress offloaded to the GPU node, the origin is left to do what it’s efficient at: control plane, SFU, and ws-mjpeg ingest.

Sizing rules of thumb (per 8 vCPU / 8 GB node, no GPU)

| Profile | Realistic capacity |
| --- | --- |
| CCTV (ws-mjpeg) | hundreds of QVGA cameras; network-bound, not RAM-bound (the cheap case) |
| Recording (room-composite Chrome, current default) | ~4–5 concurrent recordings |
| Recording (track-egress ffmpeg, recommended) | ~15–20 concurrent recordings |
| Live interactive (WebRTC) | ~300–400 viewers (or a few large rooms) |
| Mass event | 1 origin + CDN/P2P → 10k–100k viewers (not served directly by the node) |

Suggested quota defaults per app on a shared 8 GB node: max_concurrent_streams ~10, max_concurrent_recordings ~3 with Chrome egress (~10 with ffmpeg track-egress).

  1. [High / medium effort] Track-egress ffmpeg for simple recordings — 5× the concurrency, directly addresses the resource ceiling.
  2. [High / small effort] Raise EGRESS_MEM_LIMIT on recording-heavy nodes (or leave it uncapped and monitor).
  3. [High / large effort] Offload transcode + egress to a GPU node (NVENC) via cluster join.
  4. [Medium / medium effort] Egress autoscaling — don’t keep the Chrome pool warm (819 MB idle) when the node isn’t recording, or switch to ffmpeg, which has no such base cost.
  5. [Medium] Observability — deploy the Prometheus/Grafana stack to watch the ceiling live and alert before saturation (RAM > 85%).

Bottom line: StreamHub already meets a ≤0.5s WebRTC latency target with ~2.5× margin, measured empirically against production. The WebRTC pipeline itself adds only ~35ms on top of physical network RTT.

Method: a clock burned into pixels — a reproducible harness (bench/latency/) encodes Date.now() plus a checksum into visual blocks, decoded frame-by-frame on the subscriber. Publisher and subscriber run on the same machine (same clock), over a real network path.
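
To make the idea concrete, here is a minimal sketch of a timestamp-in-pixels scheme. It is not the harness in bench/latency/: the 48-bit width, the one-block-per-bit layout, and the XOR checksum are assumptions chosen for brevity, and a real implementation would draw the blocks on a canvas and sample their luminance from decoded video frames.

```python
from typing import List, Optional

TS_BITS = 48     # wide enough for millisecond Unix timestamps
CHECK_BITS = 8   # one checksum byte appended after the timestamp


def checksum(ts_ms: int) -> int:
    """XOR of the six timestamp bytes (an assumed, deliberately simple checksum)."""
    c = 0
    for shift in range(0, TS_BITS, 8):
        c ^= (ts_ms >> shift) & 0xFF
    return c


def encode_blocks(ts_ms: int) -> List[int]:
    """Return one luminance per block, MSB first: 255 for a 1 bit, 0 for a 0 bit."""
    if not 0 <= ts_ms < (1 << TS_BITS):
        raise ValueError("timestamp does not fit in 48 bits")
    word = (ts_ms << CHECK_BITS) | checksum(ts_ms)
    total = TS_BITS + CHECK_BITS
    return [255 if (word >> (total - 1 - i)) & 1 else 0 for i in range(total)]


def decode_blocks(blocks: List[int]) -> Optional[int]:
    """Threshold each block at mid-gray; return the timestamp, or None on a bad frame."""
    if len(blocks) != TS_BITS + CHECK_BITS:
        return None
    word = 0
    for lum in blocks:
        word = (word << 1) | (1 if lum >= 128 else 0)
    ts_ms, check = word >> CHECK_BITS, word & 0xFF
    return ts_ms if checksum(ts_ms) == check else None


def latency_ms(blocks: List[int], now_ms: int) -> Optional[int]:
    """Glass-to-glass latency for one frame, assuming sender and receiver share a clock."""
    sent = decode_blocks(blocks)
    return None if sent is None else now_ms - sent
```

Thresholding at mid-gray tolerates the luminance drift introduced by lossy encoding, and the checksum lets the subscriber discard frames where a block was misread rather than record a bogus latency sample.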

Measurements (client RTT 158 ms, 30s runs each)

| # | Path / configuration | p50 | p95 |
| --- | --- | --- | --- |
| 1 | WebRTC: defaults (simulcast on, VP8) | 193 ms | 204 ms |
| 2 | WebRTC: repeat run (stability check) | 193 ms | 203 ms |
| 3 | WebRTC: simulcast off | 294 ms | 332 ms |
| 4 | WebRTC: H.264 | 213 ms | 223 ms |
| 5 | WebRTC: under load (HLS Chrome egress active + a /play viewer) | 212 ms | 232 ms |
| 6 | WebRTC: after L1 tuning (16 MB UDP buffers) | 193 ms | 241 ms* |
| 7 | HLS-live (room-composite segmented output) | 15.2 s | 15.2 s |
| 8 | RTMP ingress (OBS/ffmpeg → transcode) | ~2 s | |

* the p95 bump is ICE re-warm noise right after a LiveKit restart, not a regression — the p50 didn’t move. UDP buffers protect the tail under load; they don’t change the idle-path latency.

Key takeaways:

  • The pipeline itself (capture + encode + SFU + adaptive jitter buffer + decode + render) costs ~35 ms: 193 ms measured − 158 ms RTT. The full RTT is subtracted because publisher and subscriber share a machine, so the media crosses the client↔server path once in each direction, which adds up to one round trip. In a real topology: glass-to-glass ≈ 35 ms + one-way(publisher→server) + one-way(server→viewer). With clients in-region (~20 ms one-way), that’s roughly 75 ms glass-to-glass.
  • Simulcast ON is better for latency, not worse (+100ms with it off) — the subscriber starts on the fast layer. Don’t disable it to “optimize.”
  • VP8 (the default) beats H.264 by ~20ms in this browser pipeline.
  • An active Chrome egress costs only +19ms p50 with idle CPU headroom — compositing coexists fine with the SFU as long as there’s CPU available.
  • The #1 lever for latency below ~190ms is moving the server closer to the client (edge nodes / cluster via install.sh --join), not tuning LiveKit further — the default adaptive jitter buffer and congestion control (TWCC/NACK/PLI) already operate near-optimally on a good network.
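
The glass-to-glass estimate in the takeaways can be written down directly. A minimal sketch using the measured figures from the table; the function names are illustrative:

```python
MEASURED_P50_MS = 193   # row 1: WebRTC defaults
CLIENT_RTT_MS = 158     # client RTT during the runs
PIPELINE_MS = MEASURED_P50_MS - CLIENT_RTT_MS  # pipeline cost excluding the network


def glass_to_glass_ms(pub_one_way_ms: float, sub_one_way_ms: float,
                      pipeline_ms: float = PIPELINE_MS) -> float:
    """Pipeline cost plus the two one-way network legs."""
    return pipeline_ms + pub_one_way_ms + sub_one_way_ms


def margin(target_ms: float, measured_ms: float) -> float:
    """How many times the measured latency fits inside the target."""
    return target_ms / measured_ms


print(PIPELINE_MS)                          # pipeline-only cost
print(glass_to_glass_ms(20, 20))            # in-region clients, ~20 ms one-way each
print(round(margin(500, MEASURED_P50_MS), 2))  # headroom against the 0.5 s target
```

Because the network legs dominate the total, plugging in the one-way latency of a candidate edge region gives a quick estimate of what relocating the server would buy before any deployment.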

Server-side tuning already applied / recommended

What’s already correct and shouldn’t be touched: a single UDP media mux (7882) with use_external_ip: true (media goes directly to the public IP, not through nginx — nginx only proxies the /rtc signaling path), default congestion control, and a shared redis for node affinity.

| # | Lever | Status | Effect |
| --- | --- | --- | --- |
| L1 | Kernel UDP buffers 4MB→16MB (net.core.rmem_max/wmem_max=16777216) | applied on stream01 (/etc/sysctl.d/99-streamhub-webrtc.conf) and in install.sh for new nodes | avoids drops/NACKs under concurrent load; protects p95/p99 |
| L2 | Isolate egress (Chrome): today it runs with no CPU/RAM limit on the same host | pending | an unbounded Chrome instance can consume most of the host’s RAM or steal cycles from the SFU; add mem_limit+cpus, or move egress to the GPU node |
| L3 | Low playout delay: not a server config; set per track/room from the SDK | pending (a “low-latency” preset) | forces the jitter buffer to its minimum; helps on jittery networks, risks frozen frames |
| L4 | Codecs: keep VP8 as default; don’t enable AV1 for low-latency use cases with weak client hardware | documented | AV1/VP9 raise encode/decode cost on weak hardware |
| L5 | TURN disabled: clients with UDP blocked fall back to ICE-TCP (7881), which is much worse or fails to connect | pending decision | enabling TURN/UDP improves coverage for ~5–15% of restrictive networks without affecting the rest’s p50 |
| L6 | Double signaling hop wss → nginx → 7880 | leave as-is | only affects the initial handshake, not media |
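
Lever L1 is easy to verify on a node. The sketch below reads the two kernel limits from /proc/sys (the standard Linux location for these sysctls) and reports any that are below 16 MB. The function names are illustrative, and the injectable `read` argument exists only so the check can be exercised off-node:

```python
from pathlib import Path

REQUIRED_BYTES = 16 * 1024 * 1024  # 16777216, the value set by lever L1

SYSCTL_FILES = {
    "net.core.rmem_max": "/proc/sys/net/core/rmem_max",
    "net.core.wmem_max": "/proc/sys/net/core/wmem_max",
}


def read_sysctl(path: str) -> int:
    """Read one integer sysctl value from its /proc/sys file."""
    return int(Path(path).read_text().strip())


def undersized_buffers(read=read_sysctl, required: int = REQUIRED_BYTES) -> dict:
    """Return {sysctl_name: current_value} for every limit below `required`."""
    result = {}
    for name, path in SYSCTL_FILES.items():
        value = read(path)
        if value < required:
            result[name] = value
    return result
```

On a Linux node, `print(undersized_buffers())` prints an empty dict when both limits are at least 16 MB; any entry it returns names a sysctl that still needs raising.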

HLS and RTMP — what’s realistically achievable

  • HLS today: ~15s (room-composite → .ts segments). To bring it down: shorter (2s) segments plus a short liveSyncDuration in the player (~6–8s); the real jump is LL-HLS/CMAF (~2–3s), which is a larger, separate effort.
  • RTMP: ~2s — inherent to RTMP ingest plus ingress transcode. For low-latency publishing, use WHIP (WebRTC ingest, port 8080) instead, which follows the ~35ms path, or publish directly from the browser.

See bench/latency/README.md in the repo. Rule of thumb: measure before and after every change — the harness takes about 40 seconds per run, so there’s no excuse not to.