Live Audio Streaming — Design Spec

Status note (2026-07-15): this is the original design draft, kept for historical context. The shipped implementation supports LC3 and PCM listener formats only — AAC/Opus outputs and browser LL-HLS playback below were never built (parse_listen_format in routes/live.rs accepts only lc3/pcm; LiveAudioManager::hls_playlist/hls_segment in live/manager.rs are Phase 0 skeletons that return NotImplemented). The LiveHealthMonitor and LiveRetentionWorker classes described below were never built as such; the shipped retention/cleanup/orphan-reaper workers are plain async functions (spawn_retention_task, spawn_cleanup_task, spawn_orphan_reaper) in live/manager.rs. docs/API.en.md §25 (and docs/API.ko.md §25) is the authoritative description of current behavior — treat everything below as the pre-implementation proposal, not a spec of what runs today.

Date: 2026-05-22
Status: Draft for user review
Scope: server (xylolabs-server), SDK (xylolabs-sdk), HAL (xylolabs-hal-esp), transcoder (xylolabs-transcode), admin frontend (frontend/), operator app (frontend-app/)
Coexists with: existing session-based chunked upload path (/api/v1/ingest/sessions/*), which is unchanged

1. Motivation

The current ingestion pipeline is session-based: a device opens an ingest_session, batches audio + metadata for ~500 ms at a time, posts via HTTP, and the server flushes to S3 every ~10 s. There is no live audio egress; the SSE live tail is metadata-only and capped at 30 minutes.

We want to add continuous live audio streaming: a device pushes a never-ending stream of LC3 frames, multiple listeners (browser, mobile app, server consumers) tune in at <3 s latency, and every byte is simultaneously archived to S3 so the timeline view can offer scrub-back.

Hardware target for the first cut: ESP32-S3 + 4-channel PDM mic array, LC3 codec (XAP). CPU/RAM headroom is comfortable (~17% CPU @ 96 kHz × 4ch, ~32 KB RAM per CODEC-ANALYSIS.md).

2. Decisions (from brainstorming)

Question	Decision
Live-listen latency target	1–3 s glass-to-ear (PDM capture → encoder → server → fan-out → listener decode + playback)
Listener clients	Operator browser dashboard, `frontend-app` SPA, mobile app (native LC3), server-side consumers; needs LC3 / PCM / AAC / Opus outputs
Concurrent listeners	~10 concurrent listening sessions per server node; server-side transcoding budget acceptable (ARM Neoverse, 4 cores)
Archive policy	Dual path: LC3 zstd (ingested frames stored without re-encoding) + AAC HLS segments (browser-playable), both to S3
Relationship to batch sessions	Coexist. New `live_streams` resource; existing `ingest_sessions` untouched
Browser playback transport	LL-HLS (AAC fMP4 segments, hls.js)
Stream identity	Stable per-device logical channel: `stream_key = "{facility_slug}/{device_uid}/{port_index}"`. Reboots / WiFi flaps keep the same `stream_id`. Manifest uses `EXT-X-DISCONTINUITY` to mark gaps
Integration with existing device timeline	The new `live_streams` and per-stream archive segments must surface in the existing `GET /api/v1/devices/{id}/timeseries` response so clicking a device shows live hero + scrub-back track
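
The stream identity format in the table above can be sketched as a pair of helpers. This is a minimal illustration, not the shipped code: the function names, the `u16` port type, and the rule that components must be non-empty and free of `/` are assumptions added here so that the key parses back unambiguously.

```rust
/// Builds the stable logical-channel key "{facility_slug}/{device_uid}/{port_index}".
/// Returns None if a component is empty or contains '/', which would make the
/// key ambiguous (this validation rule is an assumption, not part of the draft).
pub fn stream_key(facility_slug: &str, device_uid: &str, port_index: u16) -> Option<String> {
    let valid = |s: &str| !s.is_empty() && !s.contains('/');
    if valid(facility_slug) && valid(device_uid) {
        Some(format!("{}/{}/{}", facility_slug, device_uid, port_index))
    } else {
        None
    }
}

/// Splits a key back into (facility_slug, device_uid, port_index).
pub fn parse_stream_key(key: &str) -> Option<(&str, &str, u16)> {
    let mut parts = key.split('/');
    let facility = parts.next()?;
    let device = parts.next()?;
    let port: u16 = parts.next()?.parse().ok()?;
    // Reject trailing components and empty slugs.
    if parts.next().is_some() || facility.is_empty() || device.is_empty() {
        return None;
    }
    Some((facility, device, port))
}
```

Because the key depends only on facility, device, and port, a device that reboots rebuilds the same string and lands on the same `stream_id`.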

3. Architecture (high-level)

DEVICE (ESP32-S3)
  4ch PDM mic → I2S DMA → ring → LC3 enc (4ch interleaved) → WS push
                                                               │
                                       50–100 ms batch, XMBP framing
                                                               │
                                                               ▼
SERVER
  ingest_ws → LiveAudioManager
                ├─ broadcast<LiveAudioFrame> per stream
                │
                ├─ archive_lc3   → S3 (lossless, zstd chunks)
                ├─ archive_hls   → FFmpeg(LC3→AAC fMP4)
                │                  → memory ring (60 s) + S3 PUT (.m4s)
                │
                └─ on-demand encoders, lifecycle = subscriber count:
                     ├─ listen.ws?format=lc3   (passthrough)
                     ├─ listen.ws?format=pcm   (LC3 → s16le)
                     ├─ listen.ws?format=aac   (FFmpeg LC3 → AAC ADTS)
                     ├─ listen.ws?format=opus  (FFmpeg LC3 → Ogg/Opus)
                     └─ listen.m3u8 + segments (LL-HLS, served from ring)

Single fan-out channel — LiveAudioManager demuxes each device's WS once; all output paths (LC3/PCM/AAC/Opus subscribers + the two archive tasks) attach as broadcast subscribers. PCM decode is lazy: only runs when at least one downstream needs it.
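
A rough sketch of the single fan-out idea follows. It uses `std::sync::mpsc` as a stand-in for the broadcast channel named in the draft so that it runs without dependencies. `FanOut` and its methods are hypothetical names, and lazy PCM decode is not modeled.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Stand-in for the per-stream broadcast channel: one publisher, N subscribers.
pub struct FanOut<T: Clone> {
    subscribers: Vec<Sender<T>>,
}

impl<T: Clone> FanOut<T> {
    pub fn new() -> Self {
        Self { subscribers: Vec::new() }
    }

    /// Attaches a new output path (a listener or an archive task).
    pub fn subscribe(&mut self) -> Receiver<T> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Sends one frame to every subscriber and drops those whose receiver
    /// has gone away. Returns the number of subscribers still attached.
    pub fn publish(&mut self, frame: T) -> usize {
        self.subscribers.retain(|tx| tx.send(frame.clone()).is_ok());
        self.subscribers.len()
    }
}
```

The device's WebSocket is read once, and each output path only ever sees its own receiver, which is the property the paragraph above relies on.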

HLS segment dual-write — newly encoded fMP4 segments go into both a 60-second in-memory ring (for low-latency live playback) and, asynchronously, to S3 (for DVR + replay). Live listeners never wait on S3 propagation.

4. Data model

Three new tables, none of which touch existing ingest_sessions.

`live_streams` — logical channel (one row per device port)

column	type	notes
`id`	uuid PK
`facility_id`	uuid → facilities	RBAC scope
`device_id`	uuid → devices
`port_index`	smallint	for multi-mic boards (default 0)
`stream_key`	text UNIQUE	`{facility_slug}/{device_uid}/{port_index}`
`display_name`	text	operator label
`channels`	smallint CHECK 1..=8	1–4 for now
`channel_names`	text[]	e.g. `{front, rear, L, R}`
`sample_rate_hz`	int	16/24/48/96 kHz
`codec`	text DEFAULT 'lc3'
`bitrate_per_channel_bps`	int DEFAULT 64000
`frame_duration_us`	int DEFAULT 10000
`transcode_profile`	text DEFAULT 'default'	per-stream AAC/Opus bitrate override
`state`	text	`idle` / `live` / `paused`
`last_connected_at`	timestamptz
`last_disconnected_at`	timestamptz
`total_seconds_live`	bigint	cumulative uptime
`retention_days`	int	facility default override
`metadata`	jsonb
`created_at` / `updated_at` / `deleted_at`	timestamptz	soft delete

UNIQUE (facility_id, device_id, port_index) WHERE deleted_at IS NULL;
INDEX (facility_id, state) WHERE deleted_at IS NULL;
INDEX (stream_key);
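
As a worked example of what the column defaults imply, the sketch below derives LC3 payload sizes from `bitrate_per_channel_bps` and `frame_duration_us`. The function names are illustrative, and XMBP framing overhead is deliberately left out because the draft does not specify it.

```rust
/// Bytes of LC3 payload for one frame of one channel.
pub fn frame_bytes_per_channel(bitrate_per_channel_bps: u32, frame_duration_us: u32) -> u32 {
    // bits = bps * seconds; divide by 8 for bytes. u64 avoids overflow.
    (bitrate_per_channel_bps as u64 * frame_duration_us as u64 / 8 / 1_000_000) as u32
}

/// Payload bytes in one device batch (framing overhead not included).
pub fn batch_payload_bytes(channels: u32, bitrate_bps: u32, frame_us: u32, batch_ms: u32) -> u32 {
    let frames_per_batch = batch_ms * 1_000 / frame_us;
    frames_per_batch * channels * frame_bytes_per_channel(bitrate_bps, frame_us)
}
```

With the defaults (64000 bps, 10000 µs) one channel-frame is 80 bytes, so a 4-channel device sends 1600 payload bytes per 50 ms batch and 3200 per 100 ms batch.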

`live_stream_connections` — per-WS audit log

column	type	notes
`id`	uuid PK
`stream_id`	uuid → live_streams
`api_key_id`	uuid → api_keys
`started_at`	timestamptz NOT NULL DEFAULT now()
`ended_at`	timestamptz	NULL = in progress
`client_addr`	inet
`disconnect_reason`	text	`client_close`/`idle_timeout`/`server_close`/`error`/`replaced`
`samples_received`	bigint
`bytes_received`	bigint
`last_batch_seq`	int	XMBP batch seq tracking

`live_archive_segments` — unified LC3 + HLS segment index

column	type	notes
`id`	uuid PK
`stream_id`	uuid → live_streams
`kind`	text	`lc3_zstd` / `hls_init` / `hls_m4s`
`s3_key`	text	object key
`sequence_num`	bigint	HLS media seq (NULL for lc3_zstd)
`start_us`	bigint	unix microseconds
`duration_us`	bigint
`byte_size`	bigint
`discontinuity`	bool DEFAULT false	HLS discontinuity marker
`created_at`	timestamptz NOT NULL DEFAULT now()

INDEX (stream_id, start_us);
INDEX (stream_id, kind, sequence_num);
INDEX (created_at);  -- retention prune

`api_keys` scope additions (text[], no schema change)

live:ingest — device push permission
live:listen — listener pull permission

Both follow the existing media:read convention (cycle 5 2026-05-22, P672 followup).

Migration files

20260522120000_create_live_streams.sql
20260522120100_create_live_stream_connections.sql
20260522120200_create_live_archive_segments.sql

Strictly monotonic per CLAUDE.md migration rule. Verify before commit (the -u flag makes sort -c reject duplicate timestamps as well as out-of-order ones): ls crates/xylolabs-db/migrations | awk -F_ '{print $1}' | sort -cu

5. Server components

New module: `crates/xylolabs-server/src/live/`

src/live/
├── manager.rs        # LiveAudioManager — orchestrates WS + fan-out + archive
├── frame.rs          # LiveAudioFrame { stream_id, seq, pts_us, lc3_payload, channels }
├── transcode.rs      # PerFormatEncoder trait + FFmpeg subprocess (AAC, Opus)
├── pcm.rs            # LC3 → PCM s16le (in-process, liblc3 Rust)
├── archive_lc3.rs    # LC3 zstd chunked archiver (re-uses IngestManager logic)
├── archive_hls.rs    # HLS fMP4 segmenter + S3 PUT + memory ring
├── hls_playlist.rs   # LL-HLS .m3u8 builder (live + DVR window)
└── connection.rs     # per-device WS handler (BatchSequence tracking)

`LiveAudioManager`

pub struct LiveAudioManager {
    streams: DashMap<Uuid, Arc<LiveStreamRuntime>>,
    s3: S3Client,
    db: PgPool,
    config: LiveConfig,
}

struct LiveStreamRuntime {
    stream_id: Uuid,
    facility_id: Uuid,
    audio_tx: broadcast::Sender<LiveAudioFrame>,
    pcm_tx: broadcast::Sender<Bytes>,
    encoders: RwLock<HashMap<EncoderKey, Arc<EncoderTask>>>,
    hls_ring: Arc<HlsMemoryRing>,
    state: RwLock<LiveStreamState>,
}

Single-connection-per-stream invariant

When a second WS arrives for the same stream_key, the existing connection is closed with code 4001 "replaced" and the UI shows a warning. This prevents two devices from pushing into the same logical channel.
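
The invariant can be sketched as a small registry. This is an illustration under assumptions: `IngestRegistry`, the `u64` connection ids, and the `detach` guard are hypothetical, and a real handler would also have to close the displaced socket.

```rust
use std::collections::HashMap;

/// Close code the draft assigns to a displaced ingest connection.
pub const CLOSE_REPLACED: u16 = 4001;

/// Tracks the one active ingest connection per stream_key.
#[derive(Default)]
pub struct IngestRegistry {
    active: HashMap<String, u64>, // stream_key -> connection id (hypothetical)
}

impl IngestRegistry {
    /// Registers `conn_id` as the active connection. If another connection
    /// held the key, returns its id plus the close code to send to it.
    pub fn attach(&mut self, stream_key: &str, conn_id: u64) -> Option<(u64, u16)> {
        self.active
            .insert(stream_key.to_owned(), conn_id)
            .map(|old| (old, CLOSE_REPLACED))
    }

    /// Removes the entry only if `conn_id` is still the active one, so a
    /// replaced connection cannot evict its successor when it finally closes.
    pub fn detach(&mut self, stream_key: &str, conn_id: u64) -> bool {
        if self.active.get(stream_key) == Some(&conn_id) {
            self.active.remove(stream_key);
            true
        } else {
            false
        }
    }
}
```

The guard in `detach` matters because the old connection's teardown usually runs after the new one has already registered.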

Lazy encoder lifecycle

PCM decode task spawns when any of {PCM listener, AAC encoder, Opus encoder} attaches.
AAC/Opus encoder spawns on first listener subscription, dies 30 s after last unsubscribe.
HLS archive encoder runs continuously while state=live (segments needed for both archive + future late joiners).
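
The spawn-on-subscribe, reap-after-30-s rule for the on-demand encoders might be tracked with bookkeeping like the following. `EncoderLease` is a hypothetical name, and time is passed in explicitly so the logic can be tested without sleeping.

```rust
use std::time::{Duration, Instant};

/// Idle grace period before an on-demand encoder is reaped (from the draft).
pub const IDLE_GRACE: Duration = Duration::from_secs(30);

/// Subscriber-count bookkeeping for one on-demand encoder.
pub struct EncoderLease {
    subscribers: usize,
    idle_since: Option<Instant>, // set when the last subscriber leaves
}

impl EncoderLease {
    /// Created on the first subscription, so it starts with one subscriber.
    pub fn new() -> Self {
        Self { subscribers: 1, idle_since: None }
    }

    pub fn subscribe(&mut self) {
        self.subscribers += 1;
        self.idle_since = None; // a late joiner cancels the pending reap
    }

    pub fn unsubscribe(&mut self, now: Instant) {
        if self.subscribers > 0 {
            self.subscribers -= 1;
            if self.subscribers == 0 {
                self.idle_since = Some(now);
            }
        }
    }

    /// True once the encoder has had no subscribers for the full grace period.
    pub fn should_reap(&self, now: Instant) -> bool {
        match self.idle_since {
            Some(t) => now.duration_since(t) >= IDLE_GRACE,
            None => false,
        }
    }
}
```

The grace period keeps a listener who reconnects within 30 s from paying the encoder start-up cost again.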

`HlsMemoryRing`

struct HlsMemoryRing {
    init_segment: ArcSwap<Bytes>,
    segments: parking_lot::Mutex<VecDeque<HlsSegment>>,  // last ~60 s
    parts: parking_lot::Mutex<VecDeque<HlsPart>>,         // LL-HLS partials (200 ms)
    media_sequence: AtomicU64,
}

Live listeners get segments from memory; the same bytes are PUT to S3 in the background for DVR.
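
One way the "last ~60 s" window could behave is sketched below with a single-threaded ring. It leaves out the mutexes, the init segment, and LL-HLS parts from the struct above. `SegmentRing` and its eviction rule (drop the oldest segment while the remainder still covers the window) are assumptions.

```rust
use std::collections::VecDeque;

/// One finished segment as held in memory (fields are illustrative).
pub struct Segment {
    pub seq: u64,
    pub duration_us: u64,
    pub bytes: Vec<u8>,
}

/// Keeps roughly the last `window_us` of audio, oldest first.
pub struct SegmentRing {
    segments: VecDeque<Segment>,
    total_us: u64,
    window_us: u64,
}

impl SegmentRing {
    pub fn new(window_us: u64) -> Self {
        Self { segments: VecDeque::new(), total_us: 0, window_us }
    }

    /// Appends a segment, then evicts from the front while the ring would
    /// still cover the full window without its oldest entry.
    pub fn push(&mut self, seg: Segment) {
        self.total_us += seg.duration_us;
        self.segments.push_back(seg);
        while let Some(d) = self.segments.front().map(|s| s.duration_us) {
            if self.total_us - d < self.window_us {
                break;
            }
            self.total_us -= d;
            self.segments.pop_front();
        }
    }

    /// Looks up a segment by media sequence number. None means the caller
    /// should fall back to S3.
    pub fn get(&self, seq: u64) -> Option<&Segment> {
        let first = self.segments.front()?.seq;
        let idx = seq.checked_sub(first)?;
        self.segments.get(idx as usize).filter(|s| s.seq == seq)
    }

    pub fn len(&self) -> usize {
        self.segments.len()
    }
}
```

A miss in `get` corresponds to the "served from memory ring (<60 s old) or S3" fallback in the API table.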

Existing module touches

state.rs — add live: Arc<LiveAudioManager> to AppState
router.rs — register new routes (see §6)
routes/device_timeline.rs — extend response with live_streams + live_segments
middleware/rbac.rs — add live:read / live:listen / live:manage perms

Background workers

LiveRetentionWorker — every 30 min, prune live_archive_segments older than facility.retention_days + S3 batch delete.
LiveHealthMonitor — every 10 s, transition state=live → idle when last_connected_at < now - 30 s. Attaches EXT-X-DISCONTINUITY on the next segment after gap recovery.

6. API surface

Device ingestion

method	path	auth	notes
`POST`	`/api/v1/live/streams`	JWT	operator creates a logical stream (optional; auto-provision also supported)
`GET` (WS)	`/api/v1/live/streams/{key}/ingest`	API key, `live:ingest`	one connection per stream_key; 4001 "replaced" on dup

Listener egress

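All require JWT OR API key with live:listen (same dual-auth pattern as media:read). Paths in this table are relative to the per-stream prefix under /api/v1/live/streams/ (compare the nginx locations in §10).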

method	path	format	notes
`GET` (WS)	`/listen.ws?format={lc3|pcm|aac|opus}&bitrate={kbps}&channels=...`	per `format`	first frame is JSON hello, then binary `u64 BE pts_us + payload`
`GET`	`/listen.m3u8?codec=aac&bitrate=128`	LL-HLS	accepts `?token=jwt` for `<video>` element
`GET`	`/init.mp4`	fMP4	immutable per stream, 1y cache
`GET`	`/segments/{seq}.m4s`	fMP4 segment	served from memory ring (<60 s old) or S3
`GET`	`/segments/{seq}.{part}.m4s`	LL-HLS partial	200 ms
`GET`	`/segments/{seq}.m4s?vod={start_us}-{end_us}`	fMP4	DVR within `retention_days`
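
The binary message layout named in the listen.ws row (`u64 BE pts_us + payload`) can be illustrated with an encode/decode pair. The function names are hypothetical, and the JSON hello frame is not covered.

```rust
/// Encodes one binary listener message: 8-byte big-endian pts_us, then payload.
pub fn encode_frame(pts_us: u64, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&pts_us.to_be_bytes());
    out.extend_from_slice(payload);
    out
}

/// Splits a binary message back into (pts_us, payload).
/// Returns None if the message is shorter than the 8-byte header.
pub fn decode_frame(msg: &[u8]) -> Option<(u64, &[u8])> {
    if msg.len() < 8 {
        return None;
    }
    let mut head = [0u8; 8];
    head.copy_from_slice(&msg[..8]);
    Some((u64::from_be_bytes(head), &msg[8..]))
}
```

For example, a frame at `pts_us = 1_000_000` starts with the bytes `00 00 00 00 00 0F 42 40`.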

CRUD

method	path	auth
`GET`	`/api/v1/live/streams?facility_id=…&device_id=…&state=live`	JWT, `live:read`
`GET`	`/api/v1/live/streams/{id}`	JWT
`PATCH`	`/api/v1/live/streams/{id}`	JWT, `live:manage`
`DELETE`	`/api/v1/live/streams/{id}`	JWT, `live:manage` (soft delete)
`POST`	`/api/v1/auth/live-token`	JWT

Scope wildcards

The existing API-key scope check (crates/xylolabs-server/src/middleware/api_key_auth.rs::api_key_has_scope) accepts "*" as a wildcard. Keys minted with ["*"] automatically satisfy both live:ingest and live:listen; no migration of existing super-keys is required.
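
The wildcard rule can be restated in a few lines. This is an illustrative re-statement only: the real check lives in `api_key_has_scope` and may differ in detail, and `has_scope` and its signature are assumptions.

```rust
/// Returns true if the key's scopes contain the required scope or the
/// "*" wildcard. Illustrative only; not the shipped implementation.
pub fn has_scope(granted: &[&str], required: &str) -> bool {
    granted.iter().any(|s| *s == "*" || *s == required)
}
```

Under this rule a key minted with `["*"]` passes both the `live:ingest` and `live:listen` checks without being re-issued.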

Existing endpoint extension

GET /api/v1/devices/{id}/timeseries response gains:

{
  // existing fields…
  "live_streams":  [{ "stream_id", "stream_key", "display_name", "channels", "state", "last_connected_at" }],
  "live_segments": [{ "stream_id", "start_us", "end_us", "kind", "discontinuity" }]
}

Single fetch → metadata charts + recording events + live hero + scrub track.

Error surface (new i18n keys)

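errors.live.streamNotLive, "스트림이 현재 송출 중이 아닙니다" (English: "The stream is not currently live")
errors.live.formatUnsupported, "지원하지 않는 포맷입니다: {format}" (English: "Unsupported format: {format}")
errors.live.replaced, "다른 위치에서 같은 디바이스가 연결되어 이전 세션이 종료됐습니다" (English: "The same device connected from another location, so the previous session was closed")
errors.live.subscriberLimit, "스트림당 동시 청취 한도를 초과했습니다" (English: "The per-stream concurrent listener limit was exceeded")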

WS close codes

code	meaning
4001	`replaced` (ingest)
4002	`auth_revoked`
4003	`facility_mismatch`
4004	`stream_not_live` (listen)
4005	`format_unsupported` (listen)
1011	internal error
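
The close-code table maps naturally onto an enum. The sketch below assumes the type name `LiveClose` and its method names, while the codes and reason strings are taken from the table.

```rust
/// WebSocket close codes used by the live ingest and listen paths.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LiveClose {
    Replaced,
    AuthRevoked,
    FacilityMismatch,
    StreamNotLive,
    FormatUnsupported,
    Internal,
}

impl LiveClose {
    const ALL: [LiveClose; 6] = [
        LiveClose::Replaced,
        LiveClose::AuthRevoked,
        LiveClose::FacilityMismatch,
        LiveClose::StreamNotLive,
        LiveClose::FormatUnsupported,
        LiveClose::Internal,
    ];

    /// Numeric close code sent on the wire.
    pub fn code(self) -> u16 {
        match self {
            LiveClose::Replaced => 4001,
            LiveClose::AuthRevoked => 4002,
            LiveClose::FacilityMismatch => 4003,
            LiveClose::StreamNotLive => 4004,
            LiveClose::FormatUnsupported => 4005,
            LiveClose::Internal => 1011,
        }
    }

    /// Short reason string from the table.
    pub fn reason(self) -> &'static str {
        match self {
            LiveClose::Replaced => "replaced",
            LiveClose::AuthRevoked => "auth_revoked",
            LiveClose::FacilityMismatch => "facility_mismatch",
            LiveClose::StreamNotLive => "stream_not_live",
            LiveClose::FormatUnsupported => "format_unsupported",
            LiveClose::Internal => "internal error",
        }
    }

    /// Maps a received close code back to a variant, if it is one of ours.
    pub fn from_code(code: u16) -> Option<Self> {
        Self::ALL.iter().copied().find(|c| c.code() == code)
    }
}
```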

7. SDK changes (Rust no_std)

New module: `xylolabs-sdk/src/live/`

src/live/
├── client.rs        # LiveClient — long-lived WS
├── transport_ws.rs  # embedded-websocket over embedded-tls + embassy-net
├── encoder_4ch.rs   # N-channel interleaved LC3
└── reconnect.rs     # exponential backoff, stable stream_key

`LiveClient`

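pub struct LiveClient<P: Platform, T: WsTransport, const N_CH: usize, const RING_CAP: usize> {
    transport: T,
    encoder: Lc3MultiChannel<N_CH>,
    stream_key: heapless::String<128>,
    // RING_CAP should equal N_CH * STANDARD_MAX_FRAME_SAMPLES; stable Rust cannot
    // compute that from N_CH inside a type, so it is a separate parameter.
    ring: RingBuffer<RING_CAP>,
    batch_ms: u16,
    api_key: heapless::String<64>,
    // Ties the otherwise unused platform parameter to the struct.
    _platform: core::marker::PhantomData<P>,
}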

Configurable batch_ms (50–100 ms for live; default 50).
Reconnect: 200 ms → 2× → 30 s cap. Same stream_key across reboots → same stream_id → listener URL stable.
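
The reconnect schedule (200 ms, doubling, 30 s cap) works out as below. This sketch uses `std::time::Duration` so it runs on a host. A no_std build would presumably use `core::time::Duration`, and jitter is omitted because the draft does not mention it. `backoff_delay` is a hypothetical name.

```rust
use std::time::Duration;

pub const BACKOFF_START: Duration = Duration::from_millis(200);
pub const BACKOFF_CAP: Duration = Duration::from_secs(30);

/// Delay before reconnect attempt `attempt` (0-based): 200 ms doubled per
/// failed attempt, capped at 30 s.
pub fn backoff_delay(attempt: u32) -> Duration {
    // 200 ms * 2^8 = 51.2 s already exceeds the cap, so clamping the
    // exponent keeps the multiplication far from overflow.
    let factor = 1u32 << attempt.min(8);
    (BACKOFF_START * factor).min(BACKOFF_CAP)
}
```

The sequence is 200 ms, 400 ms, 800 ms, and so on up to 25.6 s at attempt 7. Every attempt after that waits the full 30 s.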

ESP32-S3 4ch PDM example

New: sdk/examples/esp32-s3-4ch-live/. Adds to xylolabs-hal-esp:

pub fn new_pdm_4ch(i2s: I2S0, clk_pin: u8, data_pins: [u8; 4]) -> Pdm4ChDriver;

DMA double-buffer [i16; 4 * 480] × 2. Embassy task: PDM DMA → client.push_pcm().

Existing SDK unchanged

StreamingClient (session/batch), HttpTransport, single-channel LC3 — untouched. Opt-in live path.

8. Frontend changes

Admin frontend (`frontend/`)

New components: LiveAudioPlayer, LiveStreamBadge, LiveSegmentTrack, ChannelMixer.

LiveAudioPlayer uses hls.js@^1.5 with lowLatencyMode: true. Auth: short-lived (5 min) JWT issued by POST /api/v1/auth/live-token, embedded in M3U8 URL as ?token=... (because <video> cannot carry custom headers). Media Session API for OS lockscreen / Bluetooth controls.

`DeviceTimelinePage.tsx` integration

Top hero: 🔴 LIVE badge + display_name + channel count + Listen button if device has state=live stream.
Live segment track: filled bar overlay on the recording_events track. Gaps render as EXT-X-DISCONTINUITY markers. Click → opens player in VOD mode at that timestamp.

`frontend-app/`

Symmetric integration with a reduced, operator-facing UI. The same LiveAudioPlayer can be lifted into a shared package if duplication grows.

9. Transcoding pipeline

FFmpeg subprocess strategy

FFmpeg ≥ 6 required (for LL-HLS EXT-X-PART + independent_segments HLS flag combination). The production Docker image already ships FFmpeg ≥ 7 per the existing xylolabs-transcode pipeline; the live path reuses the same binary.

Long-running encoder per (stream_id, format) (not segment-by-segment spawn). stdin = PCM s16le, stdout = AAC fMP4 or Ogg/Opus. Idle 30 s after last subscriber → reaped. HLS archive encoder runs continuously while state=live.

AAC HLS command (template)

ffmpeg -fflags +nobuffer \
  -f s16le -ar 48000 -ac 4 -i pipe:0 \
  -c:a aac -b:a 128k -ac 2 -af "pan=stereo|c0<c0+c1|c1<c2+c3" \
  -flush_packets 1 -muxdelay 0 \
  -f hls \
    -hls_segment_type fmp4 -hls_time 1.0 \
    -hls_flags omit_endlist+independent_segments+delete_segments+temp_file+program_date_time \
    -hls_segment_filename "segments/%d.m4s" \
    -hls_list_size 60 \
    /dev/null

4ch → stereo downmix for first cut; multi-channel AAC (HE-AAC v2) deferred.
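
The pan expression in the command template pairs channels 0+1 into left and 2+3 into right. Assuming the `<` form renormalizes gains so each output's coefficients sum to 1 (as the FFmpeg pan filter documentation describes), each input contributes at 0.5. An in-process equivalent might look like this. The function name is hypothetical, and integer division truncates toward zero, which can differ from FFmpeg's rounding by one LSB.

```rust
/// Downmixes interleaved 4-channel s16 PCM to interleaved stereo, pairing
/// channels (0, 1) into left and (2, 3) into right with equal 0.5 gains.
/// A trailing partial frame (fewer than 4 samples) is ignored.
pub fn downmix_4ch_to_stereo(input: &[i16]) -> Vec<i16> {
    let mut out = Vec::with_capacity(input.len() / 2);
    for frame in input.chunks_exact(4) {
        // Widen to i32 so the sum cannot overflow before halving.
        let left = (frame[0] as i32 + frame[1] as i32) / 2;
        let right = (frame[2] as i32 + frame[3] as i32) / 2;
        out.push(left as i16);
        out.push(right as i16);
    }
    out
}
```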

Resource budget

~15 MB FFmpeg RSS per encoder × ~1.5 formats avg × 10 streams ≈ 225 MB. t4g.medium 4 GiB safe.
AAC encode 48 kHz stereo ≈ 5% × 1 Graviton2 core × 10 streams = 50% × 1 core.
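
The two budget lines above are plain multiplications. The sketch below makes the inputs explicit so they can be re-run with other assumptions. The function names are illustrative and the figures are the draft's estimates, not measurements.

```rust
/// Estimated total FFmpeg resident memory in MB.
pub fn ffmpeg_rss_mb(rss_per_encoder_mb: f64, avg_formats_per_stream: f64, streams: u32) -> f64 {
    rss_per_encoder_mb * avg_formats_per_stream * streams as f64
}

/// Estimated AAC encode load as a fraction of one CPU core.
pub fn aac_core_fraction(per_stream_fraction: f64, streams: u32) -> f64 {
    per_stream_fraction * streams as f64
}
```

With the draft's numbers, `ffmpeg_rss_mb(15.0, 1.5, 10)` gives 225 MB and `aac_core_fraction(0.05, 10)` gives about half of one core.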

10. Operations

CSP (`crates/xylolabs-server/src/router.rs`)

connect-src 'self' wss://api.xylolabs.com blob:;
media-src   'self' blob:;

(LL-HLS is same-origin so no extra origin needed.)

nginx (`deploy/nginx/`)

# Regex locations (~) are required here: a plain prefix location would
# treat ".*" as literal characters and never match.
location ~ ^/api/v1/live/streams/.*/ingest$ {
    proxy_pass http://app:3000;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 7d; proxy_send_timeout 7d;
    proxy_buffering off;
}
location ~ ^/api/v1/live/streams/.*/listen\.ws$ { ... same ... }

location ~ ^/api/v1/live/streams/.*/listen\.m3u8$ {
    proxy_pass http://app:3000;
    proxy_cache off;
    add_header Cache-Control "no-cache" always;
}
location ~ ^/api/v1/live/streams/.*/segments/ {
    proxy_pass http://app:3000;
    proxy_cache_valid 200 60s;
}

Migration / deploy order

SQL migrations (3 files, strictly monotonic)
Server with new routes + manager
nginx reload (new locations)
CSP header update (in server router code)
Frontend (LiveAudioPlayer + DeviceTimelinePage integration)
SDK firmware OTA (opt-in flag)

Each step is independently revertable.

11. Testing

Unit (server lib)

HlsPlaylist builder: various seq/discontinuity combos → canonical M3U8 text
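pcm.rs LC3 → s16le decode accuracy (sample count in == sample count out; decoded waveform within a codec-appropriate error bound of the source, since LC3 is lossy and a bit-exact round trip is not expected)
LiveAudioManager pubsub fan-out: every attached subscriber receives each frame, and the sender's receiver count equals the number of attached subscribers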
Encoder lifecycle: spawn on subscribe, reap 30 s after last unsubscribe
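
For the HlsPlaylist builder item, a test could target a function shaped roughly like the one below. It is a simplified sketch under assumptions: `PlaylistEntry` and `build_playlist` are hypothetical names, and it leaves out LL-HLS parts, EXT-X-PROGRAM-DATE-TIME, and EXT-X-DISCONTINUITY-SEQUENCE, all of which a real live playlist would need.

```rust
use std::fmt::Write;

/// One segment row as the builder sees it (illustrative fields).
pub struct PlaylistEntry {
    pub seq: u64,
    pub duration_us: u64,
    pub discontinuity: bool,
}

/// Renders a live media playlist for consecutive fMP4 segments.
pub fn build_playlist(entries: &[PlaylistEntry]) -> String {
    let mut out = String::new();
    // Target duration: the longest segment, rounded up to whole seconds.
    let target_s = entries
        .iter()
        .map(|e| (e.duration_us + 999_999) / 1_000_000)
        .max()
        .unwrap_or(1);
    let first_seq = entries.first().map_or(0, |e| e.seq);
    out.push_str("#EXTM3U\n#EXT-X-VERSION:7\n");
    let _ = writeln!(out, "#EXT-X-TARGETDURATION:{}", target_s);
    let _ = writeln!(out, "#EXT-X-MEDIA-SEQUENCE:{}", first_seq);
    out.push_str("#EXT-X-MAP:URI=\"init.mp4\"\n");
    for e in entries {
        if e.discontinuity {
            // Marks a gap (reboot, WiFi flap) before this segment.
            out.push_str("#EXT-X-DISCONTINUITY\n");
        }
        let _ = writeln!(out, "#EXTINF:{:.3},", e.duration_us as f64 / 1e6);
        let _ = writeln!(out, "segments/{}.m4s", e.seq);
    }
    out
}
```

Comparing the returned string against a golden M3U8 text, as the test item suggests, keeps tag ordering regressions visible.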

Integration (server `tests/`)

api_live_ingest.rs — WS connect → push 100 frames → listener WS receives. Docker postgres + minio.
api_live_listen_ws.rs — 4 formats each: hello JSON + binary frames
api_live_hls.rs — m3u8 fetch → init.mp4 → segment stream; assert EXT-X-PROGRAM-DATE-TIME accuracy ≤ 1 s
api_live_archive.rs — stream close → both LC3 zstd + HLS segments PUT to S3 + DB rows indexed
api_live_replaced.rs — second WS to same stream_key closes first with 4001

E2E

tests/e2e-live-stream/ — ESP32 simulator → live push → headless browser → assert end-to-end glass-to-ear latency 1–3 s.

Load

tests/burnin-live/ — 50 concurrent streams × 4ch × 1h → memory/CPU snapshot
tests/listener-fanout/ — 100 concurrent listeners × 1 stream → broadcast channel backpressure

12. Rollout (phased, ~8–9 weeks)

Phase	Scope	Duration
0	DB migrations, scope additions, transcode skeleton, `LiveStreamRuntime`	1 wk
1	`LiveAudioManager`, WS ingest, LC3 zstd archive, integration tests	1 wk
2	broadcast fan-out, per-format encoders, WS egress (4 formats), token endpoint	1 wk
3	HLS segmenter, memory ring, LL-HLS playlist, DVR window	1 wk
4	Frontend `LiveAudioPlayer`, DeviceTimelinePage hero + segment track	1.5 wk
5	SDK `LiveClient`, ESP32-S3 4ch PDM HAL, 4ch LC3 encoder, OTA flag	2 wk
6	E2E + burn-in + docs (`docs/LIVE-STREAMING.{en,ko}.md`) + API.md update	1 wk

Phases 0–3 are server-only and can ship without affecting users. 4–5 can run in parallel (web vs firmware teams).

Per-phase gate

All new + existing tests pass
cargo clippy -- -D warnings clean
Browser 3 viewports (mobile/tablet/desktop) zero pageerror
Deploy + health check 200 + endpoint auth matrix verified

13. Out of scope (deferred to subsequent specs)

WebRTC sub-second listener path (would require SFU)
Adaptive bitrate (multi-quality HLS rendition)
Multi-channel AAC (HE-AAC v2)
Cross-facility public share links (per-stream token grants)
"Recording window" UI (explicit record start/stop on top of always-archive)
Listener-side encryption beyond TLS (audio-payload-level)