ESP32-CAM direct WebSocket ingest
The ESP32-CAM (AI-Thinker, OV2640 sensor) has no H.264 encoder and no practical WebRTC
path, so it can’t push RTMP or WebRTC directly. StreamHub supports a direct WebSocket
ingest designed specifically for this class of device: the camera opens one wss://
connection and streams raw JPEG frames — no relay process, no transcoding, no per-camera
ffmpeg. Viewers see the feed over MJPEG with sub-second latency.
```
ESP32-CAM ──wss:// (1 JPEG frame per binary message)──► core frame hub ──► MJPEG HTTP / WS viewers
```

This replaces the older relay path (ESP32 → ffmpeg transcode → RTMP → HLS, 3–10s latency), which is still valid if you specifically need HLS or LiveKit-side recording today.
Why WebSocket + JPEG, not WebRTC or RTMP
- The OV2640 sensor compresses JPEG in hardware: the ESP32 itself does zero encoding, it just moves the buffer. MJPEG (a sequence of JPEGs) is essentially free on this CPU.
- The classic AI-Thinker ESP32 has no H.264 encoder. Everything that requires H.264 — standard RTMP/FLV, WebRTC video toward a browser, HLS — needs a transcode somewhere. WebRTC stacks exist for newer chips (ESP32-S3/P4 with hardware H.264), but not for the classic AI-Thinker boards most CCTV fleets use today.
- The firmware side of WS ingest is just `esp_camera_fb_get()` → `webSocket.sendBIN(fb->buf, fb->len)`: one TLS socket, no framing, no encode. It adds only ~40–60 KB of RAM for TLS buffers on top of a normal camera sketch.
- `wss://` passes through the same domain and port 443 the deploy already proxies: no new ports to open, no NAT issues (the device only ever makes outbound connections).
Provisioning a camera key
1. Mint a `wsk_` stream key for the app/room the camera will publish to (same `ingress:create` permission as RTMP ingress).

   ```sh
   curl -s -X POST https://streamhub.example.com/api/v1/apps/live/ws-ingest \
     -H "Authorization: Bearer $STREAMHUB_TOKEN" -H 'Content-Type: application/json' \
     -d '{"room":"cam1"}'
   ```

   ```json
   {
     "data": {
       "id": "wsi_ab12",
       "streamKey": "wsk_…",
       "room": "live-cam1",
       "wsUrl": "wss://streamhub.example.com/ingest/ws",
       "mjpegUrl": "https://streamhub.example.com/live/live/cam1/mjpeg",
       "playerUrl": "https://streamhub.example.com/play/live/cam1"
     }
   }
   ```

2. Flash the firmware with that key (see Firmware below).
3. Watch it connect: the camera appears in `GET /apps/:app/streams` like any other stream, and the dashboard’s Ingress tab shows a live camera card with a `frame.jpg` thumbnail.
4. View it at the returned `mjpegUrl`, `frame.jpg`, or the normal `/play/<app>/<room>` page. The player detects the `ws-mjpeg` stream type automatically and switches to MJPEG rendering instead of the LiveKit player.

Other provisioning endpoints:
```
GET    /api/v1/apps/:app/ws-ingest        # list keys for the app
DELETE /api/v1/apps/:app/ws-ingest/:id    # revoke (closes the active connection, if any)
```

The wire protocol

Endpoint: `wss://<domain>/ingest/ws`

Auth, in the HTTP handshake before upgrade, takes two equivalent forms:

- Header (preferred, since it doesn’t end up in access logs): `Authorization: Bearer wsk_<key>` + query `?app=<app>&room=<room>[&identity=<id>]`.
- Query fallback (for browsers/tests that can’t set WS headers): `?app=<app>&room=<room>&key=wsk_<key>[&identity=<id>]`.

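The two handshake forms above can be sketched as small string builders. This is a minimal illustration in standard C++, not code from the StreamHub sketch: the function names (`urlEncode`, `ingestPath`, `authHeader`, `ingestPathWithKey`) are hypothetical, and only the path, parameter names, and header format come from the protocol description.

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Percent-encode a query value (RFC 3986 unreserved characters pass through).
std::string urlEncode(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else {
            char buf[4];  // "%XX" plus the terminating NUL
            std::snprintf(buf, sizeof buf, "%%%02X", c);
            out += buf;
        }
    }
    return out;
}

// Path + query for the preferred header-auth form; identity is optional.
std::string ingestPath(const std::string& app, const std::string& room,
                       const std::string& identity = "") {
    std::string path = "/ingest/ws?app=" + urlEncode(app) + "&room=" + urlEncode(room);
    if (!identity.empty()) path += "&identity=" + urlEncode(identity);
    return path;
}

// Header line sent with the handshake in the preferred form.
std::string authHeader(const std::string& key) {
    return "Authorization: Bearer " + key;
}

// Query-fallback form: the key travels in the URL, so it can show up in access logs.
std::string ingestPathWithKey(const std::string& app, const std::string& room,
                              const std::string& key) {
    return ingestPath(app, room) + "&key=" + urlEncode(key);
}
```

With app `live` and room `cam1`, `ingestPath` yields `/ingest/ws?app=live&room=cam1`, which is the path the firmware passes to `beginSSL`, while `authHeader` produces the line passed to `setExtraHeaders`.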
If the same key already has an active connection, the new one wins — the old socket is
closed with code 4409, so a flaky camera that reconnects never gets stuck behind its own
zombie socket.
| Direction | WS frame type | Payload |
|---|---|---|
| server → device | text | {"type":"ready","room":"live-cam1","streamId":"live-cam1/wscam-abc1","maxFps":15,"maxFrameBytes":262144,"idleTimeoutSec":30} |
| device → server | binary | one message = one complete JPEG frame (raw JFIF bytes). No custom header — the server timestamps on receipt. |
| device → server | text (optional) | {"type":"stats","fps":12,"rssi":-61,"heapFree":41232} roughly every 30s — shown in the dashboard. |
| server → device | text | {"type":"error","code":"...","message":"..."} before an abnormal close. |
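Because each binary message must be one complete JPEG, a sender may want a cheap sanity check before calling `sendBIN`. The sketch below is an assumption about what such a check could look like, not part of the published firmware: `looksLikeCompleteJpeg` is a hypothetical helper, and it assumes the buffer has no padding after the JPEG end marker.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if buf looks like one complete JPEG: it starts with the SOI
// marker (FF D8), ends with the EOI marker (FF D9), and fits within
// maxFrameBytes (the limit announced in the server's "ready" message).
bool looksLikeCompleteJpeg(const std::uint8_t* buf, std::size_t len,
                           std::size_t maxFrameBytes) {
    if (buf == nullptr || len < 4 || len > maxFrameBytes) return false;
    const bool startsWithSoi = buf[0] == 0xFF && buf[1] == 0xD8;
    const bool endsWithEoi = buf[len - 2] == 0xFF && buf[len - 1] == 0xD9;
    return startsWithSoi && endsWithEoi;
}
```

Skipping a frame that fails this check costs one frame interval, whereas sending an oversized frame costs a 4413 close and a reconnect.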
Keepalive: the server pings every 15s (2 missed pongs = dead connection). The firmware
uses arduinoWebSockets’ enableHeartbeat(15000, 3000, 2). With no frames for
idleTimeoutSec (default 30s), the server closes with 4408 and ends the stream.
Limits: maxFrameBytes (default 256 KB): an oversized frame closes the connection with 4413.
maxFps (default 15) is enforced with a server-side token bucket; frames over the limit are
silently dropped (and counted), but the connection is never closed for exceeding it. The
server keeps only the last frame per camera, so nothing is queued on the ingest side and
memory is bounded by design.
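To make the drop-and-count behaviour concrete, here is a minimal token bucket in standard C++. It illustrates the general technique only; the server's actual implementation, burst size, and counters are not documented here, so the class name `FrameTokenBucket` and the `burst` parameter are assumptions.

```cpp
#include <algorithm>
#include <cstdint>

// Token bucket: refills at ratePerSec tokens per second up to burst; each
// accepted frame spends one token. Time is passed in so the logic is testable.
class FrameTokenBucket {
public:
    FrameTokenBucket(double ratePerSec, double burst)
        : rate_(ratePerSec), burst_(burst), tokens_(burst) {}

    // Returns true if the frame arriving at nowMs should be kept.
    bool allow(std::uint64_t nowMs) {
        if (nowMs > lastMs_) {
            const double refill = rate_ * static_cast<double>(nowMs - lastMs_) / 1000.0;
            tokens_ = std::min(burst_, tokens_ + refill);
            lastMs_ = nowMs;
        }
        if (tokens_ >= 1.0) {
            tokens_ -= 1.0;
            return true;
        }
        ++dropped_;  // over the limit: drop and count, the connection stays open
        return false;
    }

    std::uint64_t dropped() const { return dropped_; }

private:
    double rate_;
    double burst_;
    double tokens_;
    std::uint64_t lastMs_ = 0;
    std::uint64_t dropped_ = 0;
};
```

With a rate of 15 and a burst of 1, a frame at t = 0 ms is kept, a second frame at t = 10 ms is dropped, and a frame at t = 100 ms is kept again because the bucket has refilled.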
Close codes
| Code | Meaning |
|---|---|
| 1000 | normal close (device powers off) |
| 4401 | invalid key / unknown app / room mismatch |
| 4403 | app doesn’t have WS ingest enabled, or quota exceeded |
| 4408 | idle timeout (no frames received) |
| 4409 | replaced by a new connection using the same key |
| 4413 | frame exceeds maxFrameBytes |
| 4429 | handshake rate limit (per IP) |
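A client can use these codes to decide whether reconnecting makes sense. The mapping below is one possible policy, offered as an assumption rather than documented StreamHub behaviour: the `CloseAction` enum and `actionForCloseCode` are hypothetical names, and the choice of action per code is a judgment call.

```cpp
#include <cstdint>

enum class CloseAction { Stop, ReconnectSoon, ReconnectWithBackoff, FixConfig };

// Suggested reaction to each close code from the table above.
CloseAction actionForCloseCode(std::uint16_t code) {
    switch (code) {
        case 1000: return CloseAction::Stop;                  // normal close
        case 4409: return CloseAction::Stop;                  // a newer connection owns the key
        case 4401:                                            // invalid key / unknown app / room mismatch
        case 4403:                                            // WS ingest disabled or quota exceeded
        case 4413: return CloseAction::FixConfig;             // frame too large: lower resolution or quality
        case 4408: return CloseAction::ReconnectSoon;         // idle timeout: camera stalled
        case 4429: return CloseAction::ReconnectWithBackoff;  // handshake rate limit
        default:   return CloseAction::ReconnectWithBackoff;  // unknown code or network error
    }
}
```

Treating 4409 as a stop matters most: if the replaced socket reconnects immediately, it evicts the newer connection and the two can keep replacing each other.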
Playback
MJPEG multipart HTTP (the CCTV mode):

```
GET https://<domain>/live/<app>/<room>/mjpeg   (+ ?token=<playToken> if publicPlayback is off)
Content-Type: multipart/x-mixed-replace; boundary=frame
```

Works in a bare `<img src="…/mjpeg">`, in VLC, or any NVR-style viewer, with no JS dependency.
The viewer gets the last known frame immediately on connect. There’s also
GET /live/<app>/<room>/frame.jpg for a single current-frame snapshot — useful for dashboard
thumbnails without spinning up ffmpeg.
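For readers writing their own viewer, it may help to see what one part of a `multipart/x-mixed-replace` body conventionally looks like. The sketch below follows the common MJPEG-over-HTTP layout with the `frame` boundary named above; the exact per-part headers StreamHub emits are not documented here, so treat `Content-Length` and the helper name `mjpegPart` as assumptions.

```cpp
#include <string>

// Wraps one JPEG in a multipart/x-mixed-replace part using boundary "frame".
// std::string is used as a byte container; embedded NUL bytes are preserved.
std::string mjpegPart(const std::string& jpegBytes) {
    std::string part = "--frame\r\n";
    part += "Content-Type: image/jpeg\r\n";
    part += "Content-Length: " + std::to_string(jpegBytes.size()) + "\r\n\r\n";
    part += jpegBytes;
    part += "\r\n";
    return part;
}
```

A viewer reads the stream by locating each `--frame` boundary, parsing the part headers, and replacing the displayed image with the bytes that follow.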
WebSocket viewer feed (used by the web player):
```
wss://<domain>/live/ws?app=<app>&room=<room>[&token=<playToken>]
```

Auth follows the same rule as `/play` today: public by room when `features.publicPlayback` is
on (the default); otherwise it requires the existing play token
(`GET /api/v1/apps/:app/play-token/:room`) passed as `?token=`.
Latency
| Stage | Time |
|---|---|
| Capture + JPEG encode (in sensor) | 30–70 ms |
| WiFi + TCP/TLS device→server | 5–20 ms (LAN) / +RTT (WAN, typically 20–80 ms) |
| Frame hub → viewer | < 5 ms |
| Browser render | 16–33 ms |
| Total | ~100–250 ms LAN, ~150–400 ms WAN |
Compare that to the relay path (ESP32 → ffmpeg → RTMP → LiveKit → HLS): 3–10 seconds. This is the latency payoff of the direct-ingest path.
Firmware
A complete, commented Arduino sketch is available at esp32cam_ws_ingest.ino
(AI-Thinker board, Arduino ESP32 core, using Links2004/arduinoWebSockets). It:

- Initializes `esp32-camera` (VGA, `PIXFORMAT_JPEG`, quality 12, `fb_count = 2`, `CAMERA_GRAB_LATEST` so it always sends the freshest frame, never a backlog).
- Connects with `webSocket.beginSSL(host, 443, "/ingest/ws?app=live&room=cam1")` + `setExtraHeaders("Authorization: Bearer wsk_…")` + `enableHeartbeat(15000, 3000, 2)` + `setReconnectInterval(3000)` for lifelong auto-reconnect.
- Waits for the `{"type":"ready",…}` message before entering the capture loop, which throttles to a target FPS and then runs `esp_camera_fb_get()` → `sendBIN(fb->buf, fb->len)` → `esp_camera_fb_return(fb)`.
- Sends a `{"type":"stats",…}` message every 30s (heap/RSSI/fps), visible in the dashboard.
- For production, replace the unverified TLS with `beginSslWithCA(...)` and the domain’s CA root (commented in the sketch).
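The "throttles to a target FPS" step in the capture loop can be sketched as a small pacing helper. This is an illustration in plain C++, not an excerpt from `esp32cam_ws_ingest.ino`: the class name `FramePacer` is hypothetical, and the time source is passed in (on Arduino it would be `millis()`).

```cpp
#include <cstdint>

// Decides when the capture loop should grab and send the next frame.
class FramePacer {
public:
    explicit FramePacer(std::uint32_t targetFps)
        : intervalMs_(targetFps > 0 ? 1000u / targetFps : 1000u) {}

    // Returns true when at least one frame interval has passed since the last send.
    bool due(std::uint32_t nowMs) {
        // Unsigned subtraction stays correct when the millisecond counter wraps.
        if (started_ && nowMs - lastSendMs_ < intervalMs_) return false;
        started_ = true;
        lastSendMs_ = nowMs;
        return true;
    }

private:
    std::uint32_t intervalMs_;
    std::uint32_t lastSendMs_ = 0;
    bool started_ = false;
};
```

In a sketch, the loop would call `due(millis())` and only then fetch, send, and return a frame buffer. Pacing at or below the `maxFps` value from the `ready` message avoids spending WiFi airtime on frames the server would drop anyway.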
Recommended camera profiles
| Profile | Resolution / fps | Bandwidth | Use case |
|---|---|---|---|
| CCTV (large fleets) | QVGA @ 5–8 fps | ~0.4–0.8 Mbps/camera | hundreds of cameras on one node |
| Balanced | VGA @ 10–15 fps, quality 12 | ~2–3 Mbps/camera | smaller fleets, better image quality |
At 615 cameras, QVGA @ 8 fps totals ~315 Mbps, comfortably under a 1 Gbps NIC. The same fleet at VGA @ 15 fps would need ~1.85 Gbps, which exceeds a single 1 GbE link, so shard across nodes or drop resolution/fps for large deployments.
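The fleet totals above follow from frame size × 8 bits × fps × camera count. The sketch below reproduces that arithmetic; the average frame sizes (about 8 KB for QVGA and 25 KB for VGA) are back-calculated from the quoted totals rather than measured, and the 80% link utilization target in the usage note is an assumption.

```cpp
// Per-camera bitrate in Mbps from average JPEG size and frame rate.
double cameraMbps(double avgFrameBytes, double fps) {
    return avgFrameBytes * 8.0 * fps / 1e6;
}

// Aggregate ingest bandwidth for a fleet, in Mbps.
double fleetMbps(int cameras, double avgFrameBytes, double fps) {
    return cameras * cameraMbps(avgFrameBytes, fps);
}

// How many cameras fit on a link while using at most `utilization` of it.
int maxCameras(double linkMbps, double utilization, double avgFrameBytes, double fps) {
    return static_cast<int>(linkMbps * utilization / cameraMbps(avgFrameBytes, fps));
}
```

Under those assumptions, `fleetMbps(615, 8000, 8)` comes to about 315 Mbps and `fleetMbps(615, 25000, 15)` to about 1845 Mbps, matching the figures above. `maxCameras(1000, 0.8, 25000, 15)` suggests roughly 266 VGA cameras per 1 GbE node.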
Simulate a camera without hardware
```sh
# 1) Mint a key (URL quoted so the shell does not treat < and > as redirections)
curl -s -X POST "https://<domain>/api/v1/apps/live/ws-ingest" \
  -H "Authorization: Bearer $STREAMHUB_TOKEN" -H 'Content-Type: application/json' \
  -d '{"room":"cam1"}'
# → { data: { streamKey: "wsk_…", wsUrl: "wss://<domain>/ingest/ws?app=live&room=cam1", … } }

# 2) Send one JPEG per second as binary frames (websocat)
while true; do cat frame.jpg; sleep 1; done | \
  websocat --binary -H "Authorization: Bearer wsk_XXXX" \
  "wss://<domain>/ingest/ws?app=live&room=cam1"

# 3) Watch it
# GET https://<domain>/live/live/cam1/frame.jpg      (last frame)
# <img src="https://<domain>/live/live/cam1/mjpeg">  (MJPEG stream)
# https://<domain>/play/live/live-cam1               (public player, MJPEG mode)
```