ESP32-CAM direct WebSocket ingest

This content is for the 1.0 version. Switch to the latest version for up-to-date documentation.

The ESP32-CAM (AI-Thinker, OV2640 sensor) has no H.264 encoder and no practical WebRTC path, so it can’t push RTMP or WebRTC directly. StreamHub supports a direct WebSocket ingest designed specifically for this class of device: the camera opens one wss:// connection and streams raw JPEG frames, with no relay process, no transcoding, and no per-camera ffmpeg. Viewers receive the feed as MJPEG with sub-second latency.

ESP32-CAM ──wss:// (1 JPEG frame per binary message)──► core frame hub ──► MJPEG HTTP / WS viewers

This replaces the older relay path (ESP32 → ffmpeg transcode → RTMP → HLS, 3–10s latency), which is still valid if you specifically need HLS or LiveKit-side recording today.

  • The OV2640 sensor compresses JPEG in hardware — the ESP32 itself does zero encoding, it just moves the buffer. MJPEG (a sequence of JPEGs) is essentially free on this CPU.
  • The classic AI-Thinker ESP32 has no H.264 encoder. Everything that requires H.264 — standard RTMP/FLV, WebRTC video toward a browser, HLS — needs a transcode somewhere. WebRTC stacks exist for newer chips (ESP32-S3/P4 with hardware H.264), but not for the classic AI-Thinker boards most CCTV fleets use today.
  • The firmware side of WS ingest is just esp_camera_fb_get() → webSocket.sendBIN(fb->buf, fb->len): one TLS socket, no custom framing, no encode. It adds only ~40–60 KB of RAM for TLS buffers on top of a normal camera sketch.
  • wss:// passes through the same domain and port 443 the deploy already proxies, so there are no new ports to open and no NAT issues (the device only ever makes outbound connections).

  1. Mint a wsk_ stream key for the app/room the camera will publish to (same ingress:create permission as RTMP ingress).

    curl -s -X POST https://streamhub.example.com/api/v1/apps/live/ws-ingest \
      -H "Authorization: Bearer $STREAMHUB_TOKEN" -H 'Content-Type: application/json' \
      -d '{"room":"cam1"}'

    Response:

    {
      "data": {
        "id": "wsi_ab12",
        "streamKey": "wsk_…",
        "room": "live-cam1",
        "wsUrl": "wss://streamhub.example.com/ingest/ws",
        "mjpegUrl": "https://streamhub.example.com/live/live/cam1/mjpeg",
        "playerUrl": "https://streamhub.example.com/play/live/cam1"
      }
    }
  2. Flash the firmware with that key — see Firmware below.

  3. Watch it connect — the camera appears in GET /apps/:app/streams like any other stream, and the dashboard’s Ingress tab shows a live camera card with a frame.jpg thumbnail.

  4. View it at the returned mjpegUrl, frame.jpg, or the normal /play/<app>/<room> page — the player detects the ws-mjpeg stream type automatically and switches to MJPEG rendering instead of the LiveKit player.

Other provisioning endpoints:

GET /api/v1/apps/:app/ws-ingest # list keys for the app
DELETE /api/v1/apps/:app/ws-ingest/:id # revoke (closes the active connection, if any)

Endpoint: wss://<domain>/ingest/ws

Auth, in the HTTP handshake before upgrade — two equivalent forms:

  • Header (preferred — doesn’t end up in access logs): Authorization: Bearer wsk_<key> + query ?app=<app>&room=<room>[&identity=<id>].
  • Query fallback (for browsers/tests that can’t set WS headers): ?app=<app>&room=<room>&key=wsk_<key>[&identity=<id>].
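The two handshake forms above can be sketched as a small helper that returns the URL and headers to hand to whatever WebSocket client is in use. This is only an illustration: the function name `build_ingest_request` and the `wsk_test` key are hypothetical, and the parameter order simply follows the examples above.

```python
from urllib.parse import urlencode


def build_ingest_request(domain, app, room, key, identity=None, use_header=True):
    """Return (url, headers) for the WS ingest handshake.

    use_header=True  -> key travels in the Authorization header (preferred form)
    use_header=False -> key travels in the query string (fallback form)
    """
    params = {"app": app, "room": room}
    headers = {}
    if use_header:
        headers["Authorization"] = f"Bearer {key}"
    else:
        params["key"] = key
    if identity is not None:
        params["identity"] = identity
    url = f"wss://{domain}/ingest/ws?{urlencode(params)}"
    return url, headers


# Hypothetical key, for illustration only.
url, headers = build_ingest_request("streamhub.example.com", "live", "cam1", "wsk_test")
```

With the header form the key stays out of the URL, which is the reason the list above prefers it.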

If the same key already has an active connection, the new one wins — the old socket is closed with code 4409, so a flaky camera that reconnects never gets stuck behind its own zombie socket.
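To make the "new one wins" rule concrete, here is a minimal sketch of how such a registry could behave. It is not StreamHub's actual implementation; `IngestRegistry` and `FakeSocket` are hypothetical names, and only the 4409 close code comes from this page.

```python
class FakeSocket:
    """Stand-in for a WebSocket connection; records the close code it receives."""

    def __init__(self, name):
        self.name = name
        self.closed_with = None

    def close(self, code):
        self.closed_with = code


class IngestRegistry:
    """Tracks at most one active connection per stream key."""

    CLOSE_REPLACED = 4409

    def __init__(self):
        self._active = {}

    def register(self, key, sock):
        # The newest connection always wins; the previous one is closed.
        old = self._active.get(key)
        if old is not None and old is not sock:
            old.close(self.CLOSE_REPLACED)
        self._active[key] = sock

    def unregister(self, key, sock):
        # Only remove the entry if it still points at this socket, so late
        # cleanup of a replaced socket cannot evict its successor.
        if self._active.get(key) is sock:
            del self._active[key]

    def active(self, key):
        return self._active.get(key)


registry = IngestRegistry()
first, second = FakeSocket("first"), FakeSocket("second")
registry.register("wsk_demo", first)
registry.register("wsk_demo", second)  # first is closed with 4409
```

The identity check in `unregister` is the detail that keeps a zombie socket's delayed teardown from disconnecting the fresh connection.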

| Direction | WS frame type | Payload |
| --- | --- | --- |
| server → device | text | {"type":"ready","room":"live-cam1","streamId":"live-cam1/wscam-abc1","maxFps":15,"maxFrameBytes":262144,"idleTimeoutSec":30} |
| device → server | binary | one message = one complete JPEG frame (raw JFIF bytes). No custom header; the server timestamps on receipt. |
| device → server | text (optional) | {"type":"stats","fps":12,"rssi":-61,"heapFree":41232} roughly every 30s, shown in the dashboard. |
| server → device | text | {"type":"error","code":"...","message":"..."} before an abnormal close. |
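The text messages above are plain JSON, so a client can handle them with a few lines. The sketch below parses the `ready` message and builds a `stats` message using the field names shown above; the function names and the snake_case keys of the returned dict are illustrative choices, not part of the protocol.

```python
import json


def parse_ready(text):
    """Parse the server's 'ready' message and return the negotiated limits."""
    msg = json.loads(text)
    if msg.get("type") != "ready":
        raise ValueError(f"expected a ready message, got {msg.get('type')!r}")
    return {
        "room": msg["room"],
        "stream_id": msg["streamId"],
        "max_fps": msg["maxFps"],
        "max_frame_bytes": msg["maxFrameBytes"],
        "idle_timeout_sec": msg["idleTimeoutSec"],
    }


def build_stats(fps, rssi, heap_free):
    """Build the optional device stats message as compact JSON."""
    payload = {"type": "stats", "fps": fps, "rssi": rssi, "heapFree": heap_free}
    return json.dumps(payload, separators=(",", ":"))


limits = parse_ready(
    '{"type":"ready","room":"live-cam1","streamId":"live-cam1/wscam-abc1",'
    '"maxFps":15,"maxFrameBytes":262144,"idleTimeoutSec":30}'
)
stats_text = build_stats(12, -61, 41232)
```

A device would typically use `max_fps` and `max_frame_bytes` from the parsed result to set its own capture throttle and JPEG quality.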

Keepalive: the server pings every 15s (2 missed pongs = dead connection). The firmware mirrors this with arduinoWebSockets’ enableHeartbeat(15000, 3000, 2): ping every 15 s, wait up to 3 s for each pong, and disconnect after 2 missed pongs. With no frames for idleTimeoutSec (default 30s), the server closes with 4408 and ends the stream.

Limits: maxFrameBytes (default 256 KB) — an oversized frame closes with 4413. maxFps (default 15) is enforced with a server-side token bucket; frames over the limit are silently dropped (and counted), never disconnected. The server keeps only the last frame per camera — nothing is queued on the ingest side, so memory is bounded by design.
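The limits above can be illustrated with a small token-bucket gate. This is a sketch of the described behavior under stated assumptions, not the server's real code: the class name `FrameGate`, the `burst` parameter (defaulting to one second's worth of frames), and the string return values are all hypothetical.

```python
class FrameGate:
    """Token bucket for frames that keeps only the most recent accepted frame."""

    def __init__(self, max_fps=15, max_frame_bytes=262144, burst=None):
        self.rate = float(max_fps)  # tokens refilled per second
        # Assumption: bucket capacity defaults to one second of frames.
        self.capacity = float(burst if burst is not None else max_fps)
        self.max_frame_bytes = max_frame_bytes
        self.tokens = self.capacity
        self.last = None
        self.dropped = 0    # frames silently dropped by the rate limit
        self.latest = None  # only the last accepted frame is retained

    def offer(self, frame, now):
        """Return 'accepted', 'dropped', or 'close_4413' for one incoming frame."""
        if len(frame) > self.max_frame_bytes:
            return "close_4413"
        if self.last is not None:
            refill = (now - self.last) * self.rate
            self.tokens = min(self.capacity, self.tokens + refill)
        self.last = now
        if self.tokens < 1.0:
            self.dropped += 1
            return "dropped"
        self.tokens -= 1.0
        self.latest = frame  # overwrite, never queue
        return "accepted"


gate = FrameGate(max_fps=2, max_frame_bytes=10)
results = [gate.offer(b"a", 0.0), gate.offer(b"b", 0.0), gate.offer(b"c", 0.0)]
```

Because `latest` is overwritten rather than appended to a queue, memory per camera stays at one frame regardless of how fast the device sends.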

| Code | Meaning |
| --- | --- |
| 1000 | normal close (device powers off) |
| 4401 | invalid key / unknown app / room mismatch |
| 4403 | app doesn’t have WS ingest enabled, or quota exceeded |
| 4408 | idle timeout (no frames received) |
| 4409 | replaced by a new connection using the same key |
| 4413 | frame exceeds maxFrameBytes |
| 4429 | handshake rate limit (per IP) |
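A client can use these codes to decide whether reconnecting is worthwhile. The policy below is an assumption for illustration, not documented StreamHub behavior: it treats 4401, 4403, and 4409 as "do not retry automatically" (the key, app configuration, or a competing device needs attention first) and backs off exponentially for everything else, starting from the 3-second reconnect interval the firmware uses.

```python
CLOSE_CODES = {
    1000: "normal close (device powers off)",
    4401: "invalid key / unknown app / room mismatch",
    4403: "app does not have WS ingest enabled, or quota exceeded",
    4408: "idle timeout (no frames received)",
    4409: "replaced by a new connection using the same key",
    4413: "frame exceeds maxFrameBytes",
    4429: "handshake rate limit (per IP)",
}

# Hypothetical policy: retrying these without operator action is unlikely to help.
NO_RETRY = {4401, 4403, 4409}


def describe(code):
    """Human-readable meaning of a close code."""
    return CLOSE_CODES.get(code, "unknown close code")


def reconnect_delay(code, attempt, base=3.0, cap=60.0):
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    Returns None when the hypothetical policy says not to retry.
    """
    if code in NO_RETRY:
        return None
    return min(cap, base * (2 ** attempt))


delay = reconnect_delay(4408, 0)  # first retry after an idle timeout
```

A device that receives 4413 would also want to lower its JPEG quality or resolution before reconnecting, otherwise the next oversized frame closes the socket again.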

MJPEG multipart HTTP — the CCTV mode:

GET https://<domain>/live/<app>/<room>/mjpeg (+ ?token=<playToken> if publicPlayback is off)
Content-Type: multipart/x-mixed-replace; boundary=frame

Works in a bare <img src="…/mjpeg">, in VLC, or any NVR-style viewer — no JS dependency. The viewer gets the last known frame immediately on connect. There’s also GET /live/<app>/<room>/frame.jpg for a single current-frame snapshot — useful for dashboard thumbnails without spinning up ffmpeg.
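For consumers that are not a browser or VLC, frames can be pulled out of the multipart stream by scanning for JPEG start and end markers. The sketch below is deliberately naive: it assumes nothing about the per-part headers the server sends (they are not specified here) and it would mis-split JPEGs that embed a thumbnail containing its own markers. `MjpegSplitter` is a hypothetical name.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


class MjpegSplitter:
    """Incrementally extracts complete JPEG frames from a byte stream."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk):
        """Add a chunk of stream bytes; return any frames completed by it."""
        self._buf.extend(chunk)
        frames = []
        while True:
            start = self._buf.find(SOI)
            if start < 0:
                # Keep the last byte in case a marker is split across chunks.
                del self._buf[:-1]
                break
            end = self._buf.find(EOI, start + 2)
            if end < 0:
                # Frame is incomplete: discard leading headers, wait for more.
                del self._buf[:start]
                break
            frames.append(bytes(self._buf[start:end + 2]))
            del self._buf[:end + 2]
        return frames


splitter = MjpegSplitter()
frames = splitter.feed(b"--frame\r\n\r\n\xff\xd8demo\xff\xd9\r\n")
```

In practice the chunks would come from an HTTP response body read in a loop; a stricter parser would split on the `frame` boundary and honor a Content-Length header if the server provides one.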

WebSocket viewer feed (used by the web player):

wss://<domain>/live/ws?app=<app>&room=<room>[&token=<playToken>]

Auth follows the same rule as /play today: public by room when features.publicPlayback is on (the default); otherwise it requires the existing play-token (GET /api/v1/apps/:app/play-token/:room) as ?token=.
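As a sketch of the rule above, a viewer-side helper can append the play token only when one is supplied. The function name `viewer_urls` is hypothetical, and obtaining the token (via the play-token endpoint) is left to the caller.

```python
from urllib.parse import quote, urlencode


def viewer_urls(domain, app, room, play_token=None):
    """Return the MJPEG and WebSocket viewer URLs for a room.

    play_token is only needed when features.publicPlayback is off.
    """
    base = f"https://{domain}/live/{quote(app, safe='')}/{quote(room, safe='')}"
    suffix = f"?{urlencode({'token': play_token})}" if play_token else ""
    ws_params = {"app": app, "room": room}
    if play_token:
        ws_params["token"] = play_token
    return {
        "mjpeg": f"{base}/mjpeg{suffix}",
        "ws": f"wss://{domain}/live/ws?{urlencode(ws_params)}",
    }


urls = viewer_urls("streamhub.example.com", "live", "cam1")
```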

| Stage | Time |
| --- | --- |
| Capture + JPEG encode (in sensor) | 30–70 ms |
| WiFi + TCP/TLS device→server | 5–20 ms (LAN) / +RTT (WAN, typically 20–80 ms) |
| Frame hub → viewer | < 5 ms |
| Browser render | 16–33 ms |
| Total | ~100–250 ms LAN, ~150–400 ms WAN |

Compare that to the relay path (ESP32 → ffmpeg → RTMP → LiveKit → HLS): 3–10 seconds. This is the latency payoff of the direct-ingest path.

A complete, commented Arduino sketch is available at esp32cam_ws_ingest.ino (AI-Thinker board, Arduino ESP32 core, using Links2004/arduinoWebSockets). It:

  1. Initializes esp32-camera (VGA, PIXFORMAT_JPEG, quality 12, fb_count 2, CAMERA_GRAB_LATEST so it always sends the freshest frame, never a backlog).
  2. Connects with webSocket.beginSSL(host, 443, "/ingest/ws?app=live&room=cam1") + setExtraHeaders("Authorization: Bearer wsk_…") + enableHeartbeat(15000, 3000, 2) + setReconnectInterval(3000) for lifelong auto-reconnect.
  3. Waits for the {"type":"ready",…} message before entering the capture loop: throttles to a target FPS, then esp_camera_fb_get() → sendBIN(fb->buf, fb->len) → esp_camera_fb_return(fb).
  4. Sends a {"type":"stats",…} message every 30s (heap/RSSI/fps), visible in the dashboard.
  5. For production, replace the unverified TLS with beginSslWithCA(...) and the domain’s CA root (commented in the sketch).

| Profile | Resolution / fps | Bandwidth | Use case |
| --- | --- | --- | --- |
| CCTV (large fleets) | QVGA @ 5–8 fps | ~0.4–0.8 Mbps/camera | hundreds of cameras on one node |
| Balanced | VGA @ 10–15 fps, quality 12 | ~2–3 Mbps/camera | smaller fleets, better image quality |

At 615 cameras, QVGA @ 8 fps totals ~315 Mbps, comfortably under a 1 Gbps NIC. The same fleet at VGA @ 15 fps would need ~1.85 Gbps, nearly twice what a single 1 GbE link can carry, so shard across nodes or drop resolution/fps for large deployments.
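The fleet numbers can be reproduced with simple arithmetic. In the sketch below, the average frame sizes (8,000 bytes for QVGA, 25,000 bytes for VGA at quality 12) are assumptions back-derived from the per-camera figures above, not measurements; real JPEG sizes vary with scene content. The 70% usable-link fraction in `nodes_needed` is likewise a hypothetical planning margin.

```python
import math


def fleet_mbps(cameras, fps, frame_bytes):
    """Aggregate ingest bandwidth in Mbps for a fleet of identical cameras."""
    return cameras * fps * frame_bytes * 8 / 1_000_000


def nodes_needed(total_mbps, link_mbps=1000.0, usable=0.7):
    """Nodes required if each node's link may only be filled to `usable`."""
    return math.ceil(total_mbps / (link_mbps * usable))


# Assumed average frame sizes (see note above).
qvga = fleet_mbps(615, 8, 8_000)    # about 315 Mbps
vga = fleet_mbps(615, 15, 25_000)   # about 1845 Mbps, i.e. ~1.85 Gbps

print(f"QVGA @ 8 fps: {qvga:.0f} Mbps, nodes: {nodes_needed(qvga)}")
print(f"VGA @ 15 fps: {vga:.0f} Mbps, nodes: {nodes_needed(vga)}")
```

Plugging in measured frame sizes from a real camera and scene gives a more trustworthy estimate than these defaults.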

# 1) Mint a key
curl -s -X POST https://<domain>/api/v1/apps/live/ws-ingest \
-H "Authorization: Bearer $STREAMHUB_TOKEN" -H 'Content-Type: application/json' \
-d '{"room":"cam1"}'
# → { data: { streamKey: "wsk_…", wsUrl: "wss://<domain>/ingest/ws?app=live&room=cam1", … } }
# 2) Send one JPEG per second as binary frames (websocat)
while true; do cat frame.jpg; sleep 1; done | \
websocat --binary -H "Authorization: Bearer wsk_XXXX" \
"wss://<domain>/ingest/ws?app=live&room=cam1"
# 3) Watch it
# GET https://<domain>/live/live/cam1/frame.jpg (last frame)
# <img src="https://<domain>/live/live/cam1/mjpeg"> (MJPEG stream)
# https://<domain>/play/live/cam1 (public player, MJPEG mode)