
Live Audio Streaming — Design Spec

Date: 2026-05-22
Status: Draft for user review
Scope: server (xylolabs-server), SDK (xylolabs-sdk), HAL (xylolabs-hal-esp), transcoder (xylolabs-transcode), admin frontend (frontend/), operator app (frontend-app/)
Coexists with: existing session-based chunked upload path (/api/v1/ingest/sessions/*), which is unchanged

1. Motivation

The current ingestion pipeline is session-based: a device opens an ingest_session, batches audio + metadata for ~500 ms at a time, posts via HTTP, and the server flushes to S3 every ~10 s. There is no live audio egress; the SSE live tail is metadata-only and capped at 30 minutes.

We want to add continuous live audio streaming: a device pushes a never-ending stream of LC3 frames, multiple listeners (browser, mobile app, server consumers) tune in at <3 s latency, and every byte is simultaneously archived to S3 so the timeline view can offer scrub-back.

Hardware target for the first cut: ESP32-S3 + 4-channel PDM mic array, LC3 codec (XAP). CPU/RAM headroom is comfortable (~17% CPU @ 96 kHz × 4ch, ~32 KB RAM per CODEC-ANALYSIS.md).

2. Decisions (from brainstorming)

| Question | Decision |
|---|---|
| Live-listen latency target | 1–3 s glass-to-ear (PDM capture → encoder → server → fan-out → listener decode + playback) |
| Listener clients | Operator browser dashboard, frontend-app SPA, mobile app (native LC3), server-side consumers; needs LC3 / PCM / AAC / Opus outputs |
| Concurrent listeners | ~10 concurrent listening sessions per server node; server-side transcoding budget acceptable (ARM Neoverse, 4 cores) |
| Archive policy | Dual path: LC3 zstd (lossless) + AAC HLS segments (browser-playable), both to S3 |
| Relationship to batch sessions | Coexist. New live_streams resource; existing ingest_sessions untouched |
| Browser playback transport | LL-HLS (AAC fMP4 segments, hls.js) |
| Stream identity | Stable per-device logical channel: stream_key = "{facility_slug}/{device_uid}/{port_index}". Reboots / WiFi flaps keep the same stream_id. The manifest uses EXT-X-DISCONTINUITY to mark gaps |
| Integration with existing device timeline | The new live_streams and per-stream archive segments must surface in the existing GET /api/v1/devices/{id}/timeseries response so that clicking a device shows the live hero + scrub-back track |
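A minimal sketch of building and parsing the stream_key format from the table, assuming facility slugs and device UIDs never contain '/'. StreamKey and parse are hypothetical names; the point is that the key depends only on facility, device, and port, so a rebooted device reconstructs exactly the same key.

```rust
use std::fmt;

/// Hypothetical typed form of "{facility_slug}/{device_uid}/{port_index}".
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StreamKey {
    pub facility_slug: String,
    pub device_uid: String,
    pub port_index: u16,
}

impl fmt::Display for StreamKey {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}/{}/{}", self.facility_slug, self.device_uid, self.port_index)
    }
}

impl StreamKey {
    /// Parses exactly three '/'-separated segments; assumes slug and UID
    /// themselves contain no '/'. Returns None for any other shape.
    pub fn parse(s: &str) -> Option<StreamKey> {
        let mut parts = s.split('/');
        let facility_slug = parts.next().filter(|p| !p.is_empty())?.to_string();
        let device_uid = parts.next().filter(|p| !p.is_empty())?.to_string();
        let port_index = parts.next()?.parse().ok()?;
        if parts.next().is_some() {
            return None; // extra segments: not a valid key
        }
        Some(StreamKey { facility_slug, device_uid, port_index })
    }
}
```

Because parse and Display round-trip, the server can look up the live_streams row by the string form while the device only needs to know its own identifiers.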

3. Architecture (high-level)

DEVICE (ESP32-S3)
  4ch PDM mic → I2S DMA → ring → LC3 enc (4ch interleaved) → WS push
                                                               │
                                       50–100 ms batch, XMBP framing
                                                               │
                                                               ▼
SERVER
  ingest_ws → LiveAudioManager
                ├─ broadcast<LiveAudioFrame> per stream
                │
                ├─ archive_lc3   → S3 (lossless, zstd chunks)
                ├─ archive_hls   → FFmpeg(LC3→AAC fMP4)
                │                  → memory ring (60 s) + S3 PUT (.m4s)
                │
                └─ on-demand encoders, lifecycle = subscriber count:
                     ├─ listen.ws?format=lc3   (passthrough)
                     ├─ listen.ws?format=pcm   (LC3 → s16le)
                     ├─ listen.ws?format=aac   (FFmpeg LC3 → AAC ADTS)
                     ├─ listen.ws?format=opus  (FFmpeg LC3 → Ogg/Opus)
                     └─ listen.m3u8 + segments (LL-HLS, served from ring)

Single fan-out channel: LiveAudioManager demuxes each device's WS once; all output paths (LC3/PCM/AAC/Opus subscribers plus the two archive tasks) attach as broadcast subscribers. PCM decode is lazy: it runs only when at least one downstream consumer needs it.

HLS segment dual-write: newly encoded fMP4 segments go both into a 60-second in-memory ring (for low-latency live playback) and, asynchronously, to S3 (for DVR and replay). Live listeners never wait on S3 propagation.
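To illustrate the single fan-out idea without tokio, the sketch below keeps one std::sync::mpsc channel per subscriber and shares each frame through an Arc, so a frame is demuxed once and never copied per listener. It is a stand-in for the tokio broadcast channel named in the spec, not the actual manager: these channels are unbounded, so unlike broadcast there is no lag signal for slow listeners. FanOut is a hypothetical name and LiveAudioFrame keeps only a subset of the spec's fields.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::sync::Arc;

/// Trimmed stand-in for the spec's LiveAudioFrame.
#[derive(Debug)]
pub struct LiveAudioFrame {
    pub seq: u64,
    pub pts_us: u64,
    pub lc3_payload: Vec<u8>,
}

/// One sender per subscriber; frames are shared via Arc, never cloned.
#[derive(Default)]
pub struct FanOut {
    subscribers: Vec<Sender<Arc<LiveAudioFrame>>>,
}

impl FanOut {
    pub fn subscribe(&mut self) -> Receiver<Arc<LiveAudioFrame>> {
        let (tx, rx) = mpsc::channel();
        self.subscribers.push(tx);
        rx
    }

    /// Sends the frame to every subscriber and drops those whose receiver
    /// has gone away. Returns how many subscribers remain attached.
    pub fn publish(&mut self, frame: LiveAudioFrame) -> usize {
        let frame = Arc::new(frame);
        self.subscribers.retain(|tx| tx.send(Arc::clone(&frame)).is_ok());
        self.subscribers.len()
    }
}
```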

4. Data model

Three new tables, none of which touch existing ingest_sessions.

live_streams — logical channel (one row per device port)

| column | type | notes |
|---|---|---|
| id | uuid | PK |
| facility_id | uuid → facilities | RBAC scope |
| device_id | uuid → devices | |
| port_index | smallint | for multi-mic boards (default 0) |
| stream_key | text UNIQUE | {facility_slug}/{device_uid}/{port_index} |
| display_name | text | operator label |
| channels | smallint CHECK (channels BETWEEN 1 AND 8) | 1–4 for now |
| channel_names | text[] | e.g. {front, rear, L, R} |
| sample_rate_hz | int | 16/24/48/96 kHz |
| codec | text DEFAULT 'lc3' | |
| bitrate_per_channel_bps | int DEFAULT 64000 | |
| frame_duration_us | int DEFAULT 10000 | |
| transcode_profile | text DEFAULT 'default' | per-stream AAC/Opus bitrate override |
| state | text | idle / live / paused |
| last_connected_at | timestamptz | |
| last_disconnected_at | timestamptz | |
| total_seconds_live | bigint | cumulative uptime |
| retention_days | int | overrides the facility default |
| metadata | jsonb | |
| created_at / updated_at / deleted_at | timestamptz | deleted_at enables soft delete |
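CREATE UNIQUE INDEX ON live_streams (facility_id, device_id, port_index) WHERE deleted_at IS NULL;  -- partial uniqueness needs a unique index; a UNIQUE constraint cannot take WHERE
CREATE INDEX ON live_streams (facility_id, state) WHERE deleted_at IS NULL;
-- stream_key lookups use the index that backs its UNIQUE constraint, so no separate INDEX (stream_key) is needed.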
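The codec defaults above fix the LC3 payload size: 64,000 bit/s × 10 ms = 640 bits = 80 bytes per channel per frame, so a 4-channel, 50 ms batch carries 5 frames × 4 channels × 80 bytes = 1,600 bytes before XMBP framing. The helpers below just encode that arithmetic; their names are illustrative, and framing overhead is not modeled.

```rust
/// LC3 payload bytes for one frame of one channel:
/// bitrate_per_channel_bps × frame_duration_us / 1_000_000 gives bits; / 8 gives bytes.
pub fn lc3_frame_bytes(bitrate_per_channel_bps: u32, frame_duration_us: u32) -> u32 {
    let bits = u64::from(bitrate_per_channel_bps) * u64::from(frame_duration_us) / 1_000_000;
    (bits / 8) as u32
}

/// LC3 payload bytes in one WS batch across all channels (XMBP overhead excluded).
pub fn batch_payload_bytes(
    channels: u32,
    bitrate_per_channel_bps: u32,
    frame_duration_us: u32,
    batch_ms: u32,
) -> u32 {
    let frames_per_batch = batch_ms * 1_000 / frame_duration_us;
    channels * frames_per_batch * lc3_frame_bytes(bitrate_per_channel_bps, frame_duration_us)
}
```

At the defaults that is 4 × 64 kbps = 256 kbps, i.e. 32,000 bytes/s of LC3 payload per 4-channel device, independent of batch_ms.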

live_stream_connections — per-WS audit log

| column | type | notes |
|---|---|---|
| id | uuid | PK |
| stream_id | uuid → live_streams | |
| api_key_id | uuid → api_keys | |
| started_at | timestamptz NOT NULL DEFAULT now() | |
| ended_at | timestamptz | NULL = in progress |
| client_addr | inet | |
| disconnect_reason | text | client_close / idle_timeout / server_close / error / replaced |
| samples_received | bigint | |
| bytes_received | bigint | |
| last_batch_seq | int | XMBP batch seq tracking |

live_archive_segments — unified LC3 + HLS segment index

| column | type | notes |
|---|---|---|
| id | uuid | PK |
| stream_id | uuid → live_streams | |
| kind | text | lc3_zstd / hls_init / hls_m4s |
| s3_key | text | object key |
| sequence_num | bigint | HLS media seq (NULL for lc3_zstd) |
| start_us | bigint | unix microseconds |
| duration_us | bigint | |
| byte_size | bigint | |
| discontinuity | bool DEFAULT false | HLS discontinuity marker |
| created_at | timestamptz NOT NULL DEFAULT now() | |
CREATE INDEX ON live_archive_segments (stream_id, start_us);
CREATE INDEX ON live_archive_segments (stream_id, kind, sequence_num);
CREATE INDEX ON live_archive_segments (created_at);  -- retention prune

api_keys scope additions (text[], no schema change)

  • live:ingest — device push permission
  • live:listen — listener pull permission

Both follow the existing media:read convention (cycle 5 2026-05-22, P672 follow-up).

Migration files

20260522120000_create_live_streams.sql
20260522120100_create_live_stream_connections.sql
20260522120200_create_live_archive_segments.sql

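Strictly monotonic per the CLAUDE.md migration rule. Verify before commit: ls crates/xylolabs-db/migrations | awk -F_ '{print $1}' | sort -cu. The -u flag makes sort -c check for strict ordering, so two migrations sharing a timestamp fail; since ls already lists names in sorted order, plain sort -c would not catch that.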
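The same property can be checked from Rust, for example in a hypothetical build-time test. migration_prefixes_unique is an illustrative name; it takes the text before the first '_' (what awk -F_ '{print $1}' prints), requires it to be numeric, and rejects any repeated timestamp.

```rust
/// True when every file name has a numeric prefix before its first '_' and
/// no two prefixes are equal, i.e. the sorted prefixes are strictly increasing.
pub fn migration_prefixes_unique(files: &[&str]) -> bool {
    let mut prefixes = Vec::with_capacity(files.len());
    for name in files {
        let prefix = name.split('_').next().unwrap_or("");
        match prefix.parse::<u64>() {
            Ok(ts) => prefixes.push(ts),
            Err(_) => return false, // not a timestamp-prefixed migration
        }
    }
    prefixes.sort_unstable();
    prefixes.windows(2).all(|w| w[0] < w[1])
}
```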

5. Server components

New module: crates/xylolabs-server/src/live/

src/live/
├── manager.rs        # LiveAudioManager — orchestrates WS + fan-out + archive
├── frame.rs          # LiveAudioFrame { stream_id, seq, pts_us, lc3_payload, channels }
├── transcode.rs      # PerFormatEncoder trait + FFmpeg subprocess (AAC, Opus)
├── pcm.rs            # LC3 → PCM s16le (in-process, liblc3 Rust)
├── archive_lc3.rs    # LC3 zstd chunked archiver (re-uses IngestManager logic)
├── archive_hls.rs    # HLS fMP4 segmenter + S3 PUT + memory ring
├── hls_playlist.rs   # LL-HLS .m3u8 builder (live + DVR window)
└── connection.rs     # per-device WS handler (BatchSequence tracking)
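As a rough illustration of the kind of text hls_playlist.rs produces, the sketch below renders a plain live media playlist for fMP4 segments. It is a simplified stand-in, not the planned builder: LL-HLS tags (EXT-X-PART, EXT-X-SERVER-CONTROL, EXT-X-PRELOAD-HINT) are omitted, the relative URIs init.mp4 and segments/{seq}.m4s assume the playlist is served under the same stream prefix as the /init.mp4 and /segments/{seq}.m4s routes, and the init segment is assumed not to change across discontinuities. PlaylistSegment and build_media_playlist are hypothetical names.

```rust
use std::fmt::Write;

/// Hypothetical per-segment input to the builder.
pub struct PlaylistSegment {
    pub seq: u64,
    pub duration_us: u64,
    pub discontinuity: bool,
}

/// Renders a non-LL live media playlist for fMP4 segments.
pub fn build_media_playlist(segments: &[PlaylistSegment], discontinuity_seq: u64) -> String {
    // EXT-X-TARGETDURATION must be >= every EXTINF rounded to the nearest integer.
    let target = segments
        .iter()
        .map(|s| (s.duration_us + 500_000) / 1_000_000)
        .max()
        .unwrap_or(1)
        .max(1);
    let first_seq = segments.first().map_or(0, |s| s.seq);

    let mut out = String::new();
    // Writing into a String cannot fail, so the fmt::Results are discarded.
    let _ = writeln!(out, "#EXTM3U");
    let _ = writeln!(out, "#EXT-X-VERSION:7");
    let _ = writeln!(out, "#EXT-X-TARGETDURATION:{target}");
    let _ = writeln!(out, "#EXT-X-MEDIA-SEQUENCE:{first_seq}");
    let _ = writeln!(out, "#EXT-X-DISCONTINUITY-SEQUENCE:{discontinuity_seq}");
    let _ = writeln!(out, "#EXT-X-MAP:URI=\"init.mp4\"");
    for s in segments {
        if s.discontinuity {
            let _ = writeln!(out, "#EXT-X-DISCONTINUITY");
        }
        let _ = writeln!(out, "#EXTINF:{:.3},", s.duration_us as f64 / 1_000_000.0);
        let _ = writeln!(out, "segments/{}.m4s", s.seq);
    }
    out
}
```

A unit test of the kind listed under §11 can then compare the output against a canonical string for a given seq/discontinuity combination.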

LiveAudioManager

pub struct LiveAudioManager {
    streams: DashMap<Uuid, Arc<LiveStreamRuntime>>,
    s3: S3Client,
    db: PgPool,
    config: LiveConfig,
}

struct LiveStreamRuntime {
    stream_id: Uuid,
    facility_id: Uuid,
    audio_tx: broadcast::Sender<LiveAudioFrame>,
    pcm_tx: broadcast::Sender<Bytes>,
    encoders: RwLock<HashMap<EncoderKey, Arc<EncoderTask>>>,
    hls_ring: Arc<HlsMemoryRing>,
    state: RwLock<LiveStreamState>,
}

Single-connection-per-stream invariant

When a second WS arrives for the same stream_key, the existing connection is closed with code 4001 "replaced" and the UI shows a warning. This prevents two devices from pushing into the same logical channel.
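A minimal sketch of the "last connection wins" rule, assuming connection IDs stand in for real WS handles; IngestRegistry is a hypothetical name. One detail worth getting right is that the replaced socket's eventual disconnect must not evict its successor, so unregister only removes the entry if the caller still owns it.

```rust
use std::collections::HashMap;

/// Close code for a replaced ingest connection (from the WS close-code table).
pub const CLOSE_REPLACED: u16 = 4001;

#[derive(Default)]
pub struct IngestRegistry {
    active: HashMap<String, u64>, // stream_key -> owning connection id
}

impl IngestRegistry {
    /// Makes `conn_id` the owner of `stream_key`. If another connection owned
    /// it, returns that connection's id and the close code to send it.
    pub fn register(&mut self, stream_key: &str, conn_id: u64) -> Option<(u64, u16)> {
        self.active
            .insert(stream_key.to_string(), conn_id)
            .map(|old| (old, CLOSE_REPLACED))
    }

    /// Removes the entry only if `conn_id` still owns the key, so a late
    /// disconnect from a replaced connection cannot evict the new one.
    pub fn unregister(&mut self, stream_key: &str, conn_id: u64) {
        if self.active.get(stream_key) == Some(&conn_id) {
            self.active.remove(stream_key);
        }
    }

    pub fn owner(&self, stream_key: &str) -> Option<u64> {
        self.active.get(stream_key).copied()
    }
}
```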

Lazy encoder lifecycle

  • PCM decode task spawns when any of {PCM listener, AAC encoder, Opus encoder} attaches.
  • AAC/Opus encoder spawns on first listener subscription, dies 30 s after last unsubscribe.
  • HLS archive encoder runs continuously while state=live (segments needed for both archive + future late joiners).
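The first two rules can be expressed as subscriber bookkeeping. This is a sketch with hypothetical names (Format, EncoderLifecycle) that spawns nothing; it assumes the shared PCM decode stays up while an AAC/Opus encoder is in its 30 s grace period, since that encoder still needs PCM input until it is reaped.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Format {
    Lc3,
    Pcm,
    Aac,
    Opus,
}

/// Grace period before an idle AAC/Opus encoder is reaped.
pub const IDLE_REAP: Duration = Duration::from_secs(30);

#[derive(Default)]
pub struct EncoderLifecycle {
    subscribers: HashMap<Format, usize>,
    idle_since: HashMap<Format, Instant>,
}

impl EncoderLifecycle {
    fn has_encoder(f: Format) -> bool {
        matches!(f, Format::Aac | Format::Opus)
    }

    fn alive(&self, f: Format) -> bool {
        self.subscribers.contains_key(&f) || self.idle_since.contains_key(&f)
    }

    pub fn subscribe(&mut self, f: Format) {
        *self.subscribers.entry(f).or_insert(0) += 1;
        self.idle_since.remove(&f); // a returning listener cancels a pending reap
    }

    pub fn unsubscribe(&mut self, f: Format, now: Instant) {
        if let Some(n) = self.subscribers.get_mut(&f) {
            *n = n.saturating_sub(1);
            if *n == 0 {
                self.subscribers.remove(&f);
                if Self::has_encoder(f) {
                    self.idle_since.insert(f, now); // keep the encoder warm for IDLE_REAP
                }
            }
        }
    }

    /// PCM decode runs while a PCM listener or an AAC/Opus encoder exists;
    /// LC3 passthrough never needs it.
    pub fn needs_pcm(&self) -> bool {
        [Format::Pcm, Format::Aac, Format::Opus].into_iter().any(|f| self.alive(f))
    }

    /// Forgets and returns encoders idle for at least IDLE_REAP.
    pub fn reap(&mut self, now: Instant) -> Vec<Format> {
        let expired: Vec<Format> = self
            .idle_since
            .iter()
            .filter(|&(_, &since)| now.duration_since(since) >= IDLE_REAP)
            .map(|(&f, _)| f)
            .collect();
        for f in &expired {
            self.idle_since.remove(f);
        }
        expired
    }
}
```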

HlsMemoryRing

struct HlsMemoryRing {
    init_segment: ArcSwap<Bytes>,
    segments: parking_lot::Mutex<VecDeque<HlsSegment>>,  // last ~60 s
    parts: parking_lot::Mutex<VecDeque<HlsPart>>,         // LL-HLS partials (200 ms)
    media_sequence: AtomicU64,
}

Live listeners get segments from memory; the same bytes are PUT to S3 in the background for DVR.
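A single-threaded sketch of the ring's eviction and lookup rule: a segment is trimmed once it ended more than the window before the newest segment's end, and a lookup miss tells the caller to fall back to S3. HlsRing, SegmentSource, and the trimmed HlsSegment fields are hypothetical; the spec's HlsMemoryRing additionally guards its queues with mutexes and tracks LL-HLS parts.

```rust
use std::collections::VecDeque;

pub struct HlsSegment {
    pub seq: u64,
    pub start_us: u64,
    pub duration_us: u64,
    pub bytes: Vec<u8>,
}

pub enum SegmentSource<'a> {
    Memory(&'a [u8]),
    /// Not (or no longer) in the ring: serve from S3 instead.
    S3,
}

pub struct HlsRing {
    window_us: u64,
    segments: VecDeque<HlsSegment>,
}

impl HlsRing {
    pub fn new(window_us: u64) -> Self {
        HlsRing { window_us, segments: VecDeque::new() }
    }

    /// Appends a finished segment, then evicts segments whose end lies more
    /// than `window_us` before the newest segment's end.
    pub fn push(&mut self, seg: HlsSegment) {
        let newest_end = seg.start_us + seg.duration_us;
        self.segments.push_back(seg);
        while let Some(front) = self.segments.front() {
            let front_end = front.start_us + front.duration_us;
            if newest_end.saturating_sub(front_end) > self.window_us {
                self.segments.pop_front();
            } else {
                break;
            }
        }
    }

    pub fn lookup(&self, seq: u64) -> SegmentSource<'_> {
        match self.segments.iter().find(|s| s.seq == seq) {
            Some(s) => SegmentSource::Memory(&s.bytes),
            None => SegmentSource::S3,
        }
    }

    pub fn segment_count(&self) -> usize {
        self.segments.len()
    }
}
```

With 1 s segments and a 60 s window, pushing seq 0..=61 leaves seq 1..=61 in memory, and a request for seq 0 goes to S3.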

Existing module touches

  • state.rs — add live: Arc<LiveAudioManager> to AppState
  • router.rs — register new routes (see §6)
  • routes/device_timeline.rs — extend response with live_streams + live_segments
  • middleware/rbac.rs — add live:read / live:listen / live:manage perms

Background workers

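  • LiveRetentionWorker: every 30 min, deletes live_archive_segments rows older than the effective retention (live_streams.retention_days when set, otherwise the facility default) and batch-deletes the matching S3 objects.
  • LiveHealthMonitor: every 10 s, transitions state=live → idle when no ingest connection is attached and last_disconnected_at < now - 30 s. When the device reconnects, the first segment after the gap is marked with EXT-X-DISCONTINUITY.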
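The two worker decisions can be sketched as pure functions, which keeps them unit-testable apart from the database and S3 calls. Names are hypothetical; next_state assumes the monitor knows whether an ingest WS is currently attached, and retention_cutoff_us assumes rows whose created_at falls before the returned unix-microsecond value are pruned.

```rust
use std::time::{Duration, SystemTime};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum StreamState {
    Idle,
    Live,
    Paused,
}

pub const IDLE_AFTER: Duration = Duration::from_secs(30);

/// LiveHealthMonitor decision for one stream.
pub fn next_state(
    state: StreamState,
    connected: bool,
    last_disconnected_at: Option<SystemTime>,
    now: SystemTime,
) -> StreamState {
    match (state, connected, last_disconnected_at) {
        (StreamState::Live, false, Some(t))
            if now.duration_since(t).unwrap_or(Duration::ZERO) > IDLE_AFTER =>
        {
            StreamState::Idle
        }
        _ => state,
    }
}

/// LiveRetentionWorker cutoff: the per-stream override wins over the facility default.
pub fn retention_cutoff_us(now_us: u64, stream_days: Option<u32>, facility_days: u32) -> u64 {
    let days = u64::from(stream_days.unwrap_or(facility_days));
    now_us.saturating_sub(days * 86_400 * 1_000_000)
}
```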

6. API surface

Device ingestion

| method | path | auth | notes |
|---|---|---|---|
| POST | /api/v1/live/streams | JWT | operator creates a logical stream (optional; auto-provision is also supported) |
| GET (WS) | /api/v1/live/streams/{key}/ingest | API key, live:ingest | one connection per stream_key; a duplicate closes the existing one with 4001 "replaced" |

Listener egress

All require JWT OR API key with live:listen (same dual-auth pattern as media:read).

| method | path | format | notes |
|---|---|---|---|
| GET (WS) | /listen.ws?format={lc3\|pcm\|aac\|opus}&bitrate={kbps}&channels=... | per format | first frame is a JSON hello, then binary frames: u64 big-endian pts_us + payload |
| GET | /listen.m3u8?codec=aac&bitrate=128 | LL-HLS | accepts ?token=jwt for the <video> element |
| GET | /init.mp4 | fMP4 | immutable per stream, 1 y cache |
| GET | /segments/{seq}.m4s | fMP4 | segment served from the memory ring (<60 s old) or S3 |
| GET | /segments/{seq}.{part}.m4s | LL-HLS partial | 200 ms |
| GET | /segments/{seq}.m4s?vod={start_us}-{end_us} | fMP4 | DVR within retention_days |
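The binary listener frame in the listen.ws row (8-byte big-endian pts_us, then the format-specific payload) can be sketched as an encode/decode pair. The function names are hypothetical, and the JSON hello's fields are not specified here, so only the binary frames are modeled.

```rust
/// Builds one binary listener frame: pts_us as u64 big-endian, then payload.
pub fn encode_listen_frame(pts_us: u64, payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(8 + payload.len());
    buf.extend_from_slice(&pts_us.to_be_bytes());
    buf.extend_from_slice(payload);
    buf
}

/// Splits a binary frame into (pts_us, payload); None if shorter than the header.
pub fn decode_listen_frame(frame: &[u8]) -> Option<(u64, &[u8])> {
    if frame.len() < 8 {
        return None;
    }
    let (header, payload) = frame.split_at(8);
    let pts_us = u64::from_be_bytes(header.try_into().ok()?);
    Some((pts_us, payload))
}
```

Big-endian matches the usual network byte order, so a browser client can read the header with DataView.getBigUint64(0) (big-endian by default).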

CRUD

| method | path | auth |
|---|---|---|
| GET | /api/v1/live/streams?facility_id=…&device_id=…&state=live | JWT, live:read |
| GET | /api/v1/live/streams/{id} | JWT |
| PATCH | /api/v1/live/streams/{id} | JWT, live:manage |
| DELETE | /api/v1/live/streams/{id} | JWT, live:manage (soft delete) |
| POST | /api/v1/auth/live-token | JWT |

Scope wildcards

The existing API-key scope check (crates/xylolabs-server/src/middleware/api_key_auth.rs::api_key_has_scope) accepts "*" as a wildcard. Keys minted with ["*"] automatically satisfy both live:ingest and live:listen; no migration of existing super-keys is required.
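A sketch of the documented scope semantics (exact match, or the bare "*" wildcard). It is not the code in api_key_auth.rs; has_scope is a hypothetical name, and it deliberately treats a pattern such as "live:*" as a literal scope, because only "*" is described as a wildcard.

```rust
/// True if the key's scopes contain `required` or the bare "*" wildcard.
pub fn has_scope(key_scopes: &[&str], required: &str) -> bool {
    key_scopes.iter().any(|s| *s == "*" || *s == required)
}
```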

Existing endpoint extension

GET /api/v1/devices/{id}/timeseries response gains:

{
  // existing fields…
  "live_streams":  [{ "stream_id", "stream_key", "display_name", "channels", "state", "last_connected_at" }],
  "live_segments": [{ "stream_id", "start_us", "end_us", "kind", "discontinuity" }]
}

Single fetch → metadata charts + recording events + live hero + scrub track.

Error surface (new i18n keys)

  • errors.live.streamNotLive — "스트림이 현재 송출 중이 아닙니다"
  • errors.live.formatUnsupported — "지원하지 않는 포맷입니다: {format}"
  • errors.live.replaced — "다른 위치에서 같은 디바이스가 연결되어 이전 세션이 종료됐습니다"
  • errors.live.subscriberLimit — "스트림당 동시 청취 한도를 초과했습니다"

WS close codes

| code | meaning |
|---|---|
| 4001 | replaced (ingest) |
| 4002 | auth_revoked |
| 4003 | facility_mismatch |
| 4004 | stream_not_live (listen) |
| 4005 | format_unsupported (listen) |
| 1011 | internal error |
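The close codes map naturally onto a fieldless enum. The 4000–4999 range is the one RFC 6455 reserves for private (application) use, and 1011 is the registered internal-error code. LiveCloseCode and its methods are illustrative names; the reason strings are taken from the table's meaning column.

```rust
#[repr(u16)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LiveCloseCode {
    Replaced = 4001,
    AuthRevoked = 4002,
    FacilityMismatch = 4003,
    StreamNotLive = 4004,
    FormatUnsupported = 4005,
    InternalError = 1011,
}

impl LiveCloseCode {
    pub fn from_u16(code: u16) -> Option<Self> {
        use LiveCloseCode::*;
        [Replaced, AuthRevoked, FacilityMismatch, StreamNotLive, FormatUnsupported, InternalError]
            .into_iter()
            .find(|c| *c as u16 == code)
    }

    /// Close-frame reason text.
    pub fn reason(self) -> &'static str {
        match self {
            LiveCloseCode::Replaced => "replaced",
            LiveCloseCode::AuthRevoked => "auth_revoked",
            LiveCloseCode::FacilityMismatch => "facility_mismatch",
            LiveCloseCode::StreamNotLive => "stream_not_live",
            LiveCloseCode::FormatUnsupported => "format_unsupported",
            LiveCloseCode::InternalError => "internal error",
        }
    }
}
```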

7. SDK changes (Rust no_std)

New module: xylolabs-sdk/src/live/

src/live/
├── client.rs        # LiveClient — long-lived WS
├── transport_ws.rs  # embedded-websocket over embedded-tls + embassy-net
├── encoder_4ch.rs   # N-channel interleaved LC3
└── reconnect.rs     # exponential backoff, stable stream_key

LiveClient

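pub struct LiveClient<P: Platform, T: WsTransport, const N_CH: usize, const RING_LEN: usize> {
    transport: T,
    encoder: Lc3MultiChannel<N_CH>,
    stream_key: heapless::String<128>,
    // Stable Rust cannot use `N_CH * STANDARD_MAX_FRAME_SAMPLES` as a type argument
    // (generic_const_exprs is nightly-only), so callers pass
    // RING_LEN = N_CH * STANDARD_MAX_FRAME_SAMPLES explicitly.
    ring: RingBuffer<RING_LEN>,
    batch_ms: u16,
    api_key: heapless::String<64>,
    // P is otherwise unused; PhantomData keeps the type parameter legal (E0392).
    _platform: core::marker::PhantomData<P>,
}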
  • Configurable batch_ms (50–100 ms for live; default 50).
  • Reconnect: 200 ms → 2× → 30 s cap. Same stream_key across reboots → same stream_id → listener URL stable.
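The reconnect schedule can be written with core only, so it fits the no_std SDK. reconnect_delay is a hypothetical name, and resetting the attempt counter to 0 after a successful connection is an assumption, not something the spec states.

```rust
use core::time::Duration;

/// Delay before reconnect attempt `attempt` (0 = first retry):
/// 200 ms, doubling each time, capped at 30 s.
pub fn reconnect_delay(attempt: u32) -> Duration {
    const BASE_MS: u64 = 200;
    const CAP_MS: u64 = 30_000;
    // Clamp the shift: 200 << 8 already exceeds the cap, and larger shifts could overflow.
    let ms = BASE_MS << attempt.min(8);
    Duration::from_millis(ms.min(CAP_MS))
}
```

The cap is reached on the ninth retry (attempt 8), roughly 51 s of cumulative waiting after the first failure.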

ESP32-S3 4ch PDM example

New: sdk/examples/esp32-s3-4ch-live/. Adds to xylolabs-hal-esp:

pub fn new_pdm_4ch(i2s: I2S0, clk_pin: u8, data_pins: [u8; 4]) -> Pdm4ChDriver;

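DMA double buffer: 2 × [i16; 4 * 480], where 480 samples per channel is one 10 ms LC3 frame at 48 kHz. An Embassy task passes each completed half to client.push_pcm() while DMA fills the other.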
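One way to feed per-channel encoding from a completed DMA half is to deinterleave it into channel planes. This sketch assumes the driver delivers samples interleaved in mic order (ch0, ch1, ch2, ch3, ch0, ...), which the spec does not state; an encoder that accepts a sample stride, as liblc3's lc3_encode does, could instead read each channel straight from the interleaved buffer.

```rust
pub const CHANNELS: usize = 4;
/// 480 samples per channel = one 10 ms LC3 frame at 48 kHz.
pub const FRAME_SAMPLES: usize = 480;

/// Copies an interleaved half-buffer into one plane per channel.
pub fn deinterleave(
    interleaved: &[i16; CHANNELS * FRAME_SAMPLES],
    planes: &mut [[i16; FRAME_SAMPLES]; CHANNELS],
) {
    for (i, frame) in interleaved.chunks_exact(CHANNELS).enumerate() {
        for (ch, &sample) in frame.iter().enumerate() {
            planes[ch][i] = sample;
        }
    }
}
```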

Existing SDK unchanged

StreamingClient (session/batch), HttpTransport, single-channel LC3 — untouched. Opt-in live path.

8. Frontend changes

Admin frontend (frontend/)

New components: LiveAudioPlayer, LiveStreamBadge, LiveSegmentTrack, ChannelMixer.

LiveAudioPlayer uses hls.js@^1.5 with lowLatencyMode: true. Auth: short-lived (5 min) JWT issued by POST /api/v1/auth/live-token, embedded in M3U8 URL as ?token=... (because <video> cannot carry custom headers). Media Session API for OS lockscreen / Bluetooth controls.

DeviceTimelinePage.tsx integration

  • Top hero: 🔴 LIVE badge + display_name + channel count + Listen button if device has state=live stream.
  • Live segment track: filled bar overlay on the recording_events track. Gaps render as EXT-X-DISCONTINUITY markers. Click → opens player in VOD mode at that timestamp.

frontend-app/

Symmetric integration with reduced UI (operator-facing). Same LiveAudioPlayer lifted to a shared package if duplication grows.

9. Transcoding pipeline

FFmpeg subprocess strategy

FFmpeg ≥ 6 required (for LL-HLS EXT-X-PART + independent_segments HLS flag combination). The production Docker image already ships FFmpeg ≥ 7 per the existing xylolabs-transcode pipeline; the live path reuses the same binary.

Long-running encoder per (stream_id, format) (not segment-by-segment spawn). stdin = PCM s16le, stdout = AAC fMP4 or Ogg/Opus. Idle 30 s after last subscriber → reaped. HLS archive encoder runs continuously while state=live.

AAC HLS command (template)

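ffmpeg -fflags +nobuffer \
  -f s16le -ar 48000 -ac 4 -i pipe:0 \
  -c:a aac -b:a 128k -ac 2 -af "pan=stereo|c0<c0+c1|c1<c2+c3" \
  -flush_packets 1 -muxdelay 0 \
  -f hls \
    -hls_segment_type fmp4 -hls_time 1.0 \
    -hls_flags omit_endlist+independent_segments+delete_segments+temp_file+program_date_time \
    -hls_segment_filename "segments/%d.m4s" \
    -hls_list_size 60 \
    segments/ffmpeg.m3u8

4ch → stereo downmix for the first cut; multi-channel AAC (e.g. 4-channel AAC-LC, since HE-AAC v2's parametric stereo tops out at two channels) is deferred. -flush_packets and -muxdelay are output options, so they come after -i. -hls_playlist_type event is omitted because FFmpeg forces hls_list_size to 0 for event playlists, which would disable both the 60-entry window and delete_segments. The playlist FFmpeg writes is scratch output (hls_playlist.rs builds what listeners fetch); it goes to a real file rather than /dev/null because temp_file also writes the playlist through a temporary file and a rename.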
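A sketch of how the server might build this long-running encoder with std::process::Command, without spawning it. aac_hls_command and out_dir (a per-stream scratch directory) are hypothetical; the argument list places -flush_packets and -muxdelay after -i and leaves out -hls_playlist_type. Passing arguments directly, with no shell involved, means the pan filter's | and < characters need no quoting.

```rust
use std::process::{Command, Stdio};

/// Builds (but does not spawn) the AAC HLS encoder for one stream.
pub fn aac_hls_command(out_dir: &str, bitrate_kbps: u32) -> Command {
    let mut cmd = Command::new("ffmpeg");
    cmd.args(["-fflags", "+nobuffer"])
        .args(["-f", "s16le", "-ar", "48000", "-ac", "4", "-i", "pipe:0"])
        .args(["-c:a", "aac"])
        .arg("-b:a")
        .arg(format!("{bitrate_kbps}k"))
        .args(["-ac", "2", "-af", "pan=stereo|c0<c0+c1|c1<c2+c3"])
        // Output-side options go after -i.
        .args(["-flush_packets", "1", "-muxdelay", "0"])
        .args(["-f", "hls", "-hls_segment_type", "fmp4", "-hls_time", "1.0"])
        .args([
            "-hls_flags",
            "omit_endlist+independent_segments+delete_segments+temp_file+program_date_time",
        ])
        .arg("-hls_segment_filename")
        .arg(format!("{out_dir}/%d.m4s"))
        .args(["-hls_list_size", "60"])
        .arg(format!("{out_dir}/ffmpeg.m3u8"))
        // The PCM decode task writes s16le into stdin.
        .stdin(Stdio::piped())
        .stdout(Stdio::null());
    cmd
}
```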

Resource budget

  • ~15 MB FFmpeg RSS per encoder × ~1.5 formats avg × 10 streams ≈ 225 MB. t4g.medium 4 GiB safe.
  • AAC encode at 48 kHz stereo ≈ 5% of one Graviton2 core per stream; × 10 streams ≈ 50% of one core.

10. Operations

CSP (crates/xylolabs-server/src/router.rs)

connect-src 'self' wss://api.xylolabs.com blob:;
media-src   'self' blob:;

(LL-HLS is same-origin so no extra origin needed.)

nginx (deploy/nginx/)

# "~" makes these regex locations; without it nginx matches ".*" as literal prefix text.
location ~ ^/api/v1/live/streams/.+/ingest$ {
    proxy_pass http://app:3000;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 7d; proxy_send_timeout 7d;
    proxy_buffering off;
}
location ~ ^/api/v1/live/streams/.+/listen\.ws$ { ... same ... }

location ~ ^/api/v1/live/streams/.+/listen\.m3u8$ {
    proxy_pass http://app:3000;
    proxy_cache off;
    add_header Cache-Control "no-cache" always;
}
location ~ ^/api/v1/live/streams/.+/segments/ {
    proxy_pass http://app:3000;
    # proxy_cache_valid only takes effect when a proxy_cache zone is also configured.
    proxy_cache_valid 200 60s;
}

Migration / deploy order

  1. SQL migrations (3 files, strictly monotonic)
  2. Server with new routes + manager
  3. nginx reload (new locations)
  4. CSP header update (in server router code)
  5. Frontend (LiveAudioPlayer + DeviceTimelinePage integration)
  6. SDK firmware OTA (opt-in flag)

Each step is independently revertible.

11. Testing

Unit (server lib)

  • HlsPlaylist builder: various seq/discontinuity combos → canonical M3U8 text
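  • pcm.rs LC3 → s16le round trip: decoded sample count equals the input sample count, and the decoded signal stays above a minimum SNR against the input (LC3 is lossy, so sample-exact or float-epsilon equality cannot hold)
  • LiveAudioManager pub/sub fan-out: the broadcast sender's receiver count equals the number of attached subscribers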
  • Encoder lifecycle: spawn on subscribe, reap 30 s after last unsubscribe
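For the pcm.rs round-trip test, a lossy codec is better judged by SNR than by equality. The helper below is a sketch: snr_db is a hypothetical name, it assumes the decoded buffer is already time-aligned with the input (codec delay removed), and the pass threshold is left to the test, since no figure is given here.

```rust
/// SNR in dB of `decoded` against `reference`; INFINITY when identical.
pub fn snr_db(reference: &[i16], decoded: &[i16]) -> f64 {
    assert_eq!(reference.len(), decoded.len(), "sample counts must match");
    let (mut signal, mut noise) = (0.0f64, 0.0f64);
    for (&r, &d) in reference.iter().zip(decoded) {
        let r = f64::from(r);
        let e = r - f64::from(d);
        signal += r * r;
        noise += e * e;
    }
    if noise == 0.0 {
        f64::INFINITY
    } else {
        10.0 * (signal / noise).log10()
    }
}
```

For example, a ±1000 square wave decoded with a constant error of 1 LSB gives a signal-to-noise power ratio of 10⁶, i.e. 60 dB.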

Integration (server tests/)

  • api_live_ingest.rs: WS connect → push 100 frames → a listen.ws subscriber receives all 100. Docker postgres + minio.
  • api_live_listen_ws.rs — 4 formats each: hello JSON + binary frames
  • api_live_hls.rs — m3u8 fetch → init.mp4 → segment stream; assert EXT-X-PROGRAM-DATE-TIME accuracy ≤ 1 s
  • api_live_archive.rs — stream close → both LC3 zstd + HLS segments PUT to S3 + DB rows indexed
  • api_live_replaced.rs — second WS to same stream_key closes first with 4001

E2E

tests/e2e-live-stream/ — ESP32 simulator → live push → headless browser → assert end-to-end glass-to-ear latency 1–3 s.

Load

  • tests/burnin-live/ — 50 concurrent streams × 4ch × 1h → memory/CPU snapshot
  • tests/listener-fanout/ — 100 concurrent listeners × 1 stream → broadcast channel backpressure

12. Rollout (phased, ~8–9 weeks)

Phase Scope Duration
0 DB migrations, scope additions, transcode skeleton, LiveStreamActor 1 wk
1 LiveAudioManager, WS ingest, LC3 zstd archive, integration tests 1 wk
2 broadcast fan-out, per-format encoders, WS egress (4 formats), token endpoint 1 wk
3 HLS segmenter, memory ring, LL-HLS playlist, DVR window 1 wk
4 Frontend LiveAudioPlayer, DeviceTimelinePage hero + segment track 1.5 wk
5 SDK LiveClient, ESP32-S3 4ch PDM HAL, 4ch LC3 encoder, OTA flag 2 wk
6 E2E + burn-in + docs (docs/LIVE-STREAMING.{en,ko}.md) + API.md update 1 wk

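Phases 0–3 are server-only and can ship without affecting users. Phases 4 and 5 can run in parallel (web vs firmware teams): back to back the phases total 8.5 weeks, and overlapping phase 4 (1.5 wk) with phase 5 (2 wk) brings the critical path to about 7 weeks.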

Per-phase gate

  • All new + existing tests pass
  • cargo clippy -- -D warnings clean
  • Browser 3 viewports (mobile/tablet/desktop) zero pageerror
  • Deploy + health check 200 + endpoint auth matrix verified

13. Out of scope (deferred to subsequent specs)

  • WebRTC sub-second listener path (would require SFU)
  • Adaptive bitrate (multi-quality HLS rendition)
  • Multi-channel AAC (e.g. 4-channel AAC-LC; HE-AAC v2 is limited to stereo)
  • Cross-facility public share links (per-stream token grants)
  • "Recording window" UI (explicit record start/stop on top of always-archive)
  • Listener-side encryption beyond TLS (audio-payload-level)