Xylolabs API — Platform Evaluation: MCU Codec Feasibility and Server Concurrency

Revision: 2026-03-29


1. Overview

This document covers two distinct performance domains:

  1. MCU codec feasibility — whether target MCUs can run XAP or ADPCM encoding within a 10ms frame budget, and whether SRAM is sufficient.
  2. Server concurrency limits — how many simultaneous ingest streams the Axum server handles before contention becomes a problem, and where the failure modes are.

Benchmark data in sections 2–4 was measured on an Apple M-series host in --release mode. MCU figures in sections 5–6 are extrapolated from host measurements using published MIPS estimates and DSP acceleration factors. Server analysis in section 7 is based on static code review of lock hierarchy and resource allocation.


2. XAP Codec Performance

2.1 Encode Time Per Frame — Mono, 10ms

Measured on host. MCU encode time scales with the host-to-MCU clock ratio: halving the clock roughly doubles the encode time (DSP acceleration factors are applied separately; see section 4).

Rate (Hz)   Samples   Host avg (us)   Budget% (host)
8,000            80             0.6           0.006%
16,000          160             2.3           0.023%
24,000          240             5.0           0.050%
32,000          320             9.4           0.094%
48,000          480           163.0           1.630%
96,000          960           630.9           6.309%
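
The Budget% column is simply encode time as a fraction of the 10 ms frame window. A minimal sketch of that arithmetic (illustrative helper, not SDK code):

```rust
// Budget% = encode time over the 10 ms (10,000 us) frame window.
fn budget_pct(encode_us: f64) -> f64 {
    encode_us / 10_000.0 * 100.0
}

fn main() {
    // 48 kHz mono: 163.0 us of a 10 ms frame.
    println!("{:.3}%", budget_pct(163.0)); // 1.630%
}
```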

2.2 Cosine Table Threshold

The precomputed cosine table covers N <= 320 samples. This means:

  • 8–32 kHz at 10ms (80–320 samples): table lookup path, zero trig calls, fast.
  • 48 kHz and above at 10ms (480+ samples): falls back to runtime cosf() per MDCT coefficient.

The discontinuity at 48 kHz is not a proportional increase: samples per frame grow only 1.5x (320 to 480), but encode time jumps roughly 17x, from 9.4 us (32 kHz) to 163.0 us (48 kHz), reflecting the table boundary exactly. On MCU targets with CMSIS-DSP or hardware FFT, this path is replaced entirely and the discontinuity disappears.

2.3 Channel Scaling — 16 kHz, 10ms

Channels are processed independently with shared header/setup overhead, producing sub-linear scaling.

Channels   Host avg (us)   Per-channel (us)   Scaling vs 1ch
1                    2.4               2.40            1.00x
2                    4.7               2.35            1.96x
3                    7.1               2.37            2.96x
4                    9.2               2.30            3.83x

4-channel scaling is 3.83x rather than 4.0x. The sub-linearity comes from amortized de-interleave and header write cost shared across all channels.
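
The shared-overhead model can be made concrete with a rough linear fit to the measurements above. The constants below are fitted by hand for illustration, not measured:

```rust
// total_us ≈ fixed overhead (de-interleave + header write, shared across
// channels) + per-channel encode cost. Constants are an illustrative fit.
const FIXED_US: f64 = 0.13;
const PER_CH_US: f64 = 2.27;

fn predicted_us(channels: u32) -> f64 {
    FIXED_US + PER_CH_US * channels as f64
}

fn main() {
    for ch in 1..=4 {
        println!(
            "{ch}ch: ~{:.2} us (scaling {:.2}x)",
            predicted_us(ch),
            predicted_us(ch) / predicted_us(1)
        );
    }
}
```

The fixed term is why 4-channel scaling lands at ~3.8x instead of 4.0x: the shared cost is paid once regardless of channel count.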

2.4 Four-Channel Stress — Key Sample Rates, 10ms

Rate (Hz)   Channels   Host avg (us)   Host Budget%
16,000             4             9.1         0.091%
48,000             4           684.2         6.842%
96,000             4         2,520.6        25.206%

At 4ch@96 kHz on the host, 25.2% of the 10ms window is spent encoding. On MCU targets, DSP acceleration brings this to a manageable range for M4F and PIE-capable parts (see section 5).


3. ADPCM Codec Performance

3.1 ADPCM vs XAP at 4ch@96 kHz, 10ms

Codec   Host avg (us)   Host Budget%   vs ADPCM
ADPCM           22.72         0.227%   1.0x (baseline)
XAP           2,520.6        25.206%   110x slower

ADPCM is 110x cheaper than XAP at the same configuration. For MCUs without DSP extensions, ADPCM is the only viable codec at 96 kHz. The tradeoff is audio quality: ADPCM is 4:1 compression with audible artifacts under some conditions; XAP achieves 10:1 with perceptually transparent output.
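
The bandwidth side of that tradeoff can be worked through from the stated compression ratios, assuming 16-bit PCM input:

```rust
// Raw PCM throughput, 2 bytes per 16-bit sample.
fn raw_bytes_per_sec(rate_hz: u32, channels: u32) -> u32 {
    rate_hz * 2 * channels
}

fn main() {
    let raw = raw_bytes_per_sec(96_000, 4);
    println!("raw:   {} B/s", raw);          // 4ch @ 96 kHz, 16-bit
    println!("ADPCM: {} B/s (4:1)", raw / 4);
    println!("XAP:   {} B/s (10:1)", raw / 10);
}
```

So at 4ch@96 kHz, ADPCM buys CPU headroom at 2.5x the network/storage cost of XAP.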


4. DSP/FPU Acceleration Analysis

Feature                       Parts                         Mechanism                                Speedup vs soft-float
CMSIS-DSP arm_rfft_fast_f32   Cortex-M4F, M33               Dual MAC (SMLAD), paired multiplies      30–40%
ESP32-S3 PIE SIMD             ESP32-S3                      128-bit vector ops, 4x int32 parallel    ~60%
Cosine table (N<=320)         All targets                   Precomputed fixed-point, no trig calls   Eliminates cosf() path
Runtime cosf() (N>320)        All targets (fallback)        Per-coefficient float trig               Baseline (17x slower than table path)
No FPU                        RP2040, ESP32-C3, STM32F103   Soft-float fallback                      XAP not feasible

CMSIS-DSP replaces the software MDCT with arm_rfft_fast_f32, eliminating both the cosine table discontinuity and the cosf() fallback. This is the primary reason MCU budget percentages diverge sharply from host measurements.

Parts without hardware FPU (RP2040 Cortex-M0+, ESP32-C3 RV32IMC, STM32F103 Cortex-M3) cannot run XAP at any practical sample rate within budget. These targets are limited to ADPCM.
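
The table above reduces to a simple capability-based dispatch. The sketch below distills it; the function and its return strings are hypothetical, not SDK API:

```rust
// Which encode path a target can use, per the acceleration table above.
// Illustrative only; names are hypothetical.
fn encode_path(has_fpu: bool, has_dsp: bool, frame_samples: usize) -> &'static str {
    match (has_fpu, has_dsp) {
        (false, _) => "ADPCM only (no FPU; soft-float XAP not feasible)",
        (true, true) => "XAP via hardware FFT (CMSIS-DSP / PIE)",
        (true, false) if frame_samples <= 320 => "XAP via cosine table",
        (true, false) => "XAP via runtime cosf() (slow path)",
    }
}

fn main() {
    println!("RP2040: {}", encode_path(false, false, 960));
    println!("RP2350: {}", encode_path(true, true, 960));
}
```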


5. MCU Feasibility Matrix

CPU% figures use the MCU-scaled encode time with DSP acceleration applied. "Max Config" is the highest XAP or ADPCM configuration that fits within approximately 80% CPU budget.

Target       Clock     SRAM     DSP            Max Config        CPU%    RAM%    Verdict
RP2350       150 MHz   520 KB   M33 DSP+FPU    4ch@96kHz XAP     52.0%   16.9%   COMFORTABLE
ESP32-S3     240 MHz   512 KB   PIE SIMD+FPU   4ch@96kHz XAP     37.1%   17.2%   COMFORTABLE
STM32F411    100 MHz   128 KB   M4F DSP+FPU    4ch@96kHz XAP     71.0%   68.8%   TIGHT
nRF52840      64 MHz   256 KB   M4F DSP+FPU    4ch@48kHz XAP     67.2%   17.2%   FEASIBLE
nRF9160       64 MHz   256 KB   M33 DSP+FPU    4ch@48kHz XAP     73.4%   17.2%   TIGHT
STM32WB55     64 MHz   256 KB   M4F DSP+FPU    4ch@48kHz XAP     60.9%   19.5%   FEASIBLE
STM32WBA55   100 MHz   128 KB   M33 DSP+FPU    4ch@96kHz XAP     74.0%   39.1%   TIGHT
RP2040       133 MHz   264 KB   None           ADPCM 4ch@96kHz    3.0%   10.6%   ADPCM ONLY
ESP32-C3     160 MHz   400 KB   M ext only     ADPCM 4ch@96kHz    2.5%    7.0%   ADPCM ONLY
STM32F103     72 MHz    20 KB   None           ADPCM 2ch@24kHz    1.4%   60.0%   SENSOR ONLY

Verdict definitions:

  • COMFORTABLE: Headroom >= 48%; suitable for production deployment with additional sensor/housekeeping load.
  • FEASIBLE: Roughly 33–67% CPU; workable but leaves limited headroom. Profile under full load before shipping.
  • TIGHT: 67–80% CPU or >35% RAM; technically meets spec but leaves minimal margin. Requires careful interrupt budgeting and no background tasks competing for cycles.
  • ADPCM ONLY: No hardware FPU; XAP is not feasible. ADPCM runs comfortably.
  • SENSOR ONLY: Inadequate SRAM for the full audio pipeline. Suitable for sensor-only deployments (accelerometer, temperature) with ADPCM at reduced channel count and sample rate.
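
The verdict thresholds can be expressed as a small classifier. This is an illustrative sketch only: the boundaries are approximate, the 64 KB SRAM cutoff is an assumption for this sketch, and the matrix itself is authoritative where a target sits exactly on an edge:

```rust
// Illustrative verdict classifier for the thresholds above.
fn verdict(cpu_pct: f64, ram_pct: f64, has_fpu: bool, sram_kb: u32) -> &'static str {
    if sram_kb < 64 {
        return "SENSOR ONLY"; // assumed SRAM cutoff for a full audio pipeline
    }
    if !has_fpu {
        return "ADPCM ONLY";
    }
    if cpu_pct > 80.0 {
        return "OVER BUDGET"; // reduce channel count or sample rate
    }
    if cpu_pct > 67.0 || ram_pct > 35.0 {
        return "TIGHT";
    }
    if cpu_pct > 52.0 {
        return "FEASIBLE";
    }
    "COMFORTABLE"
}

fn main() {
    println!("ESP32-S3:  {}", verdict(37.1, 17.2, true, 512));
    println!("nRF9160:   {}", verdict(73.4, 17.2, true, 256));
    println!("STM32F103: {}", verdict(1.4, 60.0, false, 20));
}
```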


6. Memory Budget per Target

All figures in KB. "SDK+Codec" includes SDK static allocations and codec working buffers. "Ring Buf" is the audio ring buffer sized for 2× the maximum frame. "XMBP" is the metadata framing buffer. "HTTP" is the HTTP client stack. "Stack" is the main task stack plus ISR stack.

Target       SRAM   SDK+Codec   Ring Buf   XMBP   HTTP   Stack   Used   Avail   RAM%
RP2350        520          20         32     16      4      16     88     432   16.9%
ESP32-S3      512          20         64     16      8      16    124     388   24.2%
STM32F411     128          20          8      4      4       8     44      84   34.4%
nRF52840      256          20         16      8      4       8     56     200   21.9%
nRF9160       256          20         16      8      4       8     56     200   21.9%
STM32WB55     256          20         16      2      0      12     50     206   19.5%
STM32WBA55    128          20         16      2      0      12     50      78   39.1%
RP2040        264           8          8      4      4       8     32     232   12.1%
ESP32-C3      400           8          8      4      4       8     32     368    8.0%
STM32F103      20           4          4      2      2       4     16       4   80.0%

Notes:

  • STM32F411 RAM% of 34.4% is deceptively low: the 128 KB total leaves only 84 KB available, which provides no margin for an RTOS heap or additional sensor buffers.
  • STM32F103's 4 KB available is effectively zero headroom. RTOS tick lists and heap fragments will consume it immediately.
  • The ESP32-S3 ring buffer is doubled (64 KB) to accommodate the higher throughput at 4ch@96 kHz without assuming PSRAM.
  • The STM32WB55/WBA55 HTTP column is 0 KB because these targets use BLE transport only; no HTTP client stack is allocated.
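
The Used, Avail, and RAM% columns are pure arithmetic over the component columns; a sketch recomputing the RP2350 row (field names are illustrative):

```rust
// One row of the memory budget table, all figures in KB.
struct MemBudget {
    sram: u32,
    sdk_codec: u32,
    ring: u32,
    xmbp: u32,
    http: u32,
    stack: u32,
}

impl MemBudget {
    fn used(&self) -> u32 {
        self.sdk_codec + self.ring + self.xmbp + self.http + self.stack
    }
    fn avail(&self) -> u32 {
        self.sram - self.used()
    }
    fn ram_pct(&self) -> f64 {
        self.used() as f64 / self.sram as f64 * 100.0
    }
}

fn main() {
    let rp2350 = MemBudget { sram: 520, sdk_codec: 20, ring: 32, xmbp: 16, http: 4, stack: 16 };
    println!(
        "used {} KB, avail {} KB, {:.1}%",
        rp2350.used(),
        rp2350.avail(),
        rp2350.ram_pct()
    );
}
```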


7. Server Concurrency Analysis

7.1 Architecture

The server is an Axum application running on Tokio's multi-threaded scheduler. Key concurrency constructs:

  • Sessions map: RwLock<HashMap<SessionId, Arc<SessionState>>> — read-heavy, write only on session open/close.
  • Per-session buffer: Mutex<StreamBuffer> inside each SessionState — locked only during ingest chunk processing and flush.
  • API key cache: lock-free mini_moka::sync::Cache with 60-second TTL — no contention under read load.
  • DB connection pool: sqlx::PgPool with configurable size (default 20 connections).
  • Transcode queue: Postgres LISTEN/NOTIFY with 10-second polling fallback, semaphore-limited to 2 concurrent jobs.

7.2 Lock Hierarchy

No circular lock acquisition is possible given this ordering:

Arc references (no locks)
  -> Sessions RwLock          (held briefly for map lookup)
    -> Per-Session Mutex      (held during chunk processing)
      -> I/O / DB             (no application locks held during I/O)

The RwLock on the sessions map is released before the per-session Mutex is acquired. DB operations occur after both locks are released. Deadlock potential is zero by construction.
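
A minimal sketch of that ordering, using std sync types in place of the Tokio equivalents; all names here are illustrative, not the actual server code:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

type SessionId = u64;

struct SessionState {
    buffer: Mutex<Vec<u8>>, // stands in for Mutex<StreamBuffer>
}

type Sessions = RwLock<HashMap<SessionId, Arc<SessionState>>>;

fn flush(sessions: &Sessions, id: SessionId) -> Option<Vec<u8>> {
    // Level 1: sessions RwLock, held only for the lookup. The temporary read
    // guard is dropped at the end of this statement, before any other lock.
    let session = sessions.read().unwrap().get(&id).cloned()?;
    // Level 2: per-session Mutex, held only long enough to take the bytes.
    let pending = std::mem::take(&mut *session.buffer.lock().unwrap());
    // Level 3: DB/S3 I/O would happen here with no application locks held.
    Some(pending)
}

fn main() {
    let sessions: Sessions = RwLock::new(HashMap::new());
    sessions.write().unwrap().insert(
        7,
        Arc::new(SessionState { buffer: Mutex::new(b"chunk".to_vec()) }),
    );
    let flushed = flush(&sessions, 7).unwrap();
    println!("flushed {} bytes", flushed.len());
}
```

Cloning the `Arc` out of the map is what lets the read guard drop before the per-session lock is taken; holding both simultaneously is never required.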

7.3 Concurrency Capacity

Concurrent Streams   Assessment   Notes
<50                  SAFE         No lock contention. DB pool at <25% capacity. Sub-50ms flush latency.
50–200               CAUTION      DB pool at 50–100% capacity. Flush latency increases under burst.
200–1,000            RISKY        Pool exhaustion likely. Backpressure kicks in; see hazard #1.
>1,000               UNSAFE       Registration failures probable. Memory spikes from buffered upload segments.

These thresholds assume the default 20-connection DB pool. Increasing the pool (see recommendations) shifts the CAUTION boundary upward proportionally.
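
As a rough sketch of that proportional shift (illustrative arithmetic only, assuming the boundary scales linearly with pool size as stated):

```rust
// CAUTION boundary in concurrent streams, scaled from the 20-connection
// baseline above. A sketch, not a measured capacity model.
fn caution_boundary(pool_size: u32) -> u32 {
    const BASELINE_STREAMS: u32 = 50;
    const BASELINE_POOL: u32 = 20;
    BASELINE_STREAMS * pool_size / BASELINE_POOL
}

fn main() {
    for pool in [20, 50, 100] {
        println!("pool {:>3} -> CAUTION starts near {} streams", pool, caution_boundary(pool));
    }
}
```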

7.4 Bottlenecks

1. DB connection pool (20 connections by default)

The pool is the primary chokepoint. Each ingest flush and each transcode job dispatch consumes a connection for the duration of the query. At 100+ concurrent streams, simultaneous flush windows exhaust all 20 connections; subsequent flush attempts block until a connection is returned. This produces latency spikes at the flush boundary, not uniform backpressure.

2. Simultaneous flush window contention

Sessions flushing at the same wall-clock time (e.g., all opened within the same second) concentrate DB connection demand into a narrow window. The flush interval is per-session, not globally staggered, so initial connection spikes are possible at startup.

3. Broadcast channel overflow

Live event subscribers use a broadcast channel with capacity 256. When a subscriber cannot drain the channel fast enough, messages are silently dropped. There is no back-channel notification to the subscriber that events were lost.


8. Hazards and Recommendations

8.1 Hazards

#   Hazard                          Severity   Detail
1   Backpressure returns HTTP 500   High       DB pool exhaustion surfaces as a generic 500 response. Clients cannot distinguish overload from a server bug. Retry-after semantics are impossible.
2   FFmpeg process orphaning        Medium     If the transcode worker task panics after spawning FFmpeg, the child process continues running with no supervisor. Process accumulation degrades host performance silently.
3   No metrics endpoint             Medium     No Prometheus endpoint exists for concurrent session count, DB pool utilization, queue depth, or lock contention. The capacity limits in section 7.3 cannot be verified in production without adding instrumentation first.
4   Transcode job orphaning         Low        A worker crash leaves the job in a "claimed" state. The 1-hour timeout is the only recovery mechanism. At scale, multiple orphaned jobs accumulate and delay legitimate work.
5   S3 large-file full buffering    Low        Upload files up to 256 MB are fully buffered in RAM before the S3 multipart upload begins. At >1,000 concurrent streams this contributes to the memory spikes noted in section 7.3.

8.2 Recommendations

Priority   Action                                                                                                            Addresses
High       Increase DB pool: 20 → 50+ connections for production deployments                                                 Hazard #1, bottleneck #1
High       Return HTTP 429 with Retry-After header when pool is exhausted or ingest queue is full                            Hazard #1
Medium     Add Prometheus metrics endpoint: session count, pool utilization, queue depth                                     Hazard #3
Medium     Spawn FFmpeg in a process group; kill the group on worker exit or panic                                           Hazard #2
Low        Add transcode worker heartbeat: periodic DB timestamp update; reaper thread reclaims jobs with stale heartbeats   Hazard #4
Low        Stream large S3 uploads using multipart from the first write rather than buffering the full object               Hazard #5

9. Related Documents

  • PERFORMANCE-EVALUATION.md — Detailed host benchmark tables for XAP and ADPCM at all sample rates, frame durations, and channel counts. Source data for sections 2–3 of this document.
  • PERFORMANCE-PROFILE.md — Profiling methodology, measurement setup, and raw timing distributions.
  • CODEC-ANALYSIS.md — Codec comparison (XAP vs Opus, AAC, ADPCM, SBC) including MIPS estimates, compression ratios, quality assessment, and licensing.