Xylolabs API — Platform Evaluation: MCU Codec Feasibility and Server Concurrency

Revision: 2026-03-29


1. Overview

This document covers two distinct performance domains:

  1. MCU codec feasibility — whether target MCUs can run XAP or ADPCM encoding within a 10ms frame budget, and whether SRAM is sufficient.
  2. Server concurrency limits — how many simultaneous ingest streams the Axum server handles before contention becomes a problem, and where the failure modes are.

Benchmark data in sections 2–4 was measured on an Apple M-series host in --release mode. MCU figures in sections 5–6 are extrapolated from host measurements using published MIPS estimates and DSP acceleration factors. Server analysis in section 7 is based on static code review of lock hierarchy and resource allocation.


2. XAP Codec Performance

2.1 Encode Time Per Frame — Mono, 10ms

Measured on host. MCU encode time scales with the host-to-MCU clock ratio: halving the clock roughly doubles the encode time (DSP acceleration factors are applied separately; see section 4).

Rate (Hz)   Samples   Host avg (us)   Budget% (host)
8,000            80             0.6           0.006%
16,000          160             2.3           0.023%
24,000          240             5.0           0.050%
32,000          320             9.4           0.094%
48,000          480           163.0           1.630%
96,000          960           630.9           6.309%
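
The Budget% column is simply encode time as a fraction of the 10 ms frame window. A minimal sketch of that arithmetic (illustrative helper, not SDK code):

```rust
// Budget% = encode time over the 10 ms (10,000 us) frame window.
fn budget_pct(encode_us: f64) -> f64 {
    encode_us / 10_000.0 * 100.0
}

fn main() {
    // 48 kHz mono: 163.0 us of a 10 ms frame.
    println!("{:.3}%", budget_pct(163.0)); // 1.630%
}
```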

2.2 Cosine Table Threshold

The precomputed cosine table covers N <= 320 samples. This means:

  • 8–32 kHz at 10ms (80–320 samples): table lookup path, zero trig calls, fast.
  • 48 kHz and above at 10ms (480+ samples): falls back to runtime cosf() per MDCT coefficient.

The discontinuity at 48 kHz is not a proportional increase: samples per frame grow only 1.5x (320 to 480), but encode time jumps roughly 17x, from 9.4 us (32 kHz) to 163.0 us (48 kHz), reflecting the table boundary exactly. On MCU targets with CMSIS-DSP or hardware FFT, this path is replaced entirely and the discontinuity disappears.

2.3 Channel Scaling — 16 kHz, 10ms

Channels are processed independently with shared header/setup overhead, producing sub-linear scaling.

Channels   Host avg (us)   Per-channel (us)   Scaling vs 1ch
1                    2.4               2.40            1.00x
2                    4.7               2.35            1.96x
3                    7.1               2.37            2.96x
4                    9.2               2.30            3.83x

4-channel scaling is 3.83x rather than 4.0x. The sub-linearity comes from amortized de-interleave and header write cost shared across all channels.
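
The shared-overhead model can be made concrete with a rough linear fit to the measurements above. The constants below are fitted by hand for illustration, not measured:

```rust
// total_us ≈ fixed overhead (de-interleave + header write, shared across
// channels) + per-channel encode cost. Constants are an illustrative fit.
const FIXED_US: f64 = 0.13;
const PER_CH_US: f64 = 2.27;

fn predicted_us(channels: u32) -> f64 {
    FIXED_US + PER_CH_US * channels as f64
}

fn main() {
    for ch in 1..=4 {
        println!(
            "{ch}ch: ~{:.2} us (scaling {:.2}x)",
            predicted_us(ch),
            predicted_us(ch) / predicted_us(1)
        );
    }
}
```

The fixed term is why 4-channel scaling lands at ~3.8x instead of 4.0x: the shared cost is paid once regardless of channel count.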

2.4 Four-Channel Stress — Key Sample Rates, 10ms

Rate (Hz)   Channels   Host avg (us)   Host Budget%
16,000             4             9.1         0.091%
48,000             4           684.2         6.842%
96,000             4         2,520.6        25.206%

At 4ch@96 kHz on the host, 25.2% of the 10ms window is spent encoding. On MCU targets, DSP acceleration brings this to a manageable range for M4F and PIE-capable parts (see section 5).


3. ADPCM Codec Performance

3.1 ADPCM vs XAP at 4ch@96 kHz, 10ms

Codec   Host avg (us)   Host Budget%   vs ADPCM
ADPCM           22.72         0.227%   1.0x (baseline)
XAP           2,520.6        25.206%   110x slower

ADPCM is 110x cheaper than XAP at the same configuration. For MCUs without DSP extensions, ADPCM is the only viable codec at 96 kHz. The tradeoff is audio quality: ADPCM is 4:1 compression with audible artifacts under some conditions; XAP achieves 10:1 with perceptually transparent output.
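
The bandwidth side of that tradeoff can be worked through from the stated compression ratios, assuming 16-bit PCM input:

```rust
// Raw PCM throughput, 2 bytes per 16-bit sample.
fn raw_bytes_per_sec(rate_hz: u32, channels: u32) -> u32 {
    rate_hz * 2 * channels
}

fn main() {
    let raw = raw_bytes_per_sec(96_000, 4);
    println!("raw:   {} B/s", raw);          // 4ch @ 96 kHz, 16-bit
    println!("ADPCM: {} B/s (4:1)", raw / 4);
    println!("XAP:   {} B/s (10:1)", raw / 10);
}
```

So at 4ch@96 kHz, ADPCM buys CPU headroom at 2.5x the network/storage cost of XAP.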


4. DSP/FPU Acceleration Analysis

Feature                       Parts                         Mechanism                                Speedup vs soft-float
CMSIS-DSP arm_rfft_fast_f32   Cortex-M4F, M33               Dual MAC (SMLAD), paired multiplies      30–40%
ESP32-S3 PIE SIMD             ESP32-S3                      128-bit vector ops, 4x int32 parallel    ~60%
Cosine table (N<=320)         All targets                   Precomputed fixed-point, no trig calls   Eliminates cosf() path
Runtime cosf() (N>320)        All targets (fallback)        Per-coefficient float trig               Baseline (17x slower than table path)
No FPU                        RP2040, ESP32-C3, STM32F103   Soft-float fallback                      XAP not feasible

CMSIS-DSP replaces the software MDCT with arm_rfft_fast_f32, eliminating both the cosine table discontinuity and the cosf() fallback. This is the primary reason MCU budget percentages diverge sharply from host measurements.

Parts without hardware FPU (RP2040 Cortex-M0+, ESP32-C3 RV32IMC, STM32F103 Cortex-M3) cannot run XAP at any practical sample rate within budget. These targets are limited to ADPCM.
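
The table above reduces to a simple capability-based dispatch. The sketch below distills it; the function and its return strings are hypothetical, not SDK API:

```rust
// Which encode path a target can use, per the acceleration table above.
// Illustrative only; names are hypothetical.
fn encode_path(has_fpu: bool, has_dsp: bool, frame_samples: usize) -> &'static str {
    match (has_fpu, has_dsp) {
        (false, _) => "ADPCM only (no FPU; soft-float XAP not feasible)",
        (true, true) => "XAP via hardware FFT (CMSIS-DSP / PIE)",
        (true, false) if frame_samples <= 320 => "XAP via cosine table",
        (true, false) => "XAP via runtime cosf() (slow path)",
    }
}

fn main() {
    println!("RP2040: {}", encode_path(false, false, 960));
    println!("RP2350: {}", encode_path(true, true, 960));
}
```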


5. MCU Feasibility Matrix

CPU% figures use the MCU-scaled encode time with DSP acceleration applied. "Max Config" is the highest XAP or ADPCM configuration that fits within approximately 80% CPU budget.

Target       Clock     SRAM     DSP            Max Config        CPU%    RAM%    Verdict
RP2350       150 MHz   520 KB   M33 DSP+FPU    4ch@96kHz XAP     52.0%   16.9%   COMFORTABLE
ESP32-S3     240 MHz   512 KB   PIE SIMD+FPU   4ch@96kHz XAP     37.1%   17.2%   COMFORTABLE
STM32F411    100 MHz   128 KB   M4F DSP+FPU    4ch@96kHz XAP     71.0%   68.8%   TIGHT
nRF52840      64 MHz   256 KB   M4F DSP+FPU    4ch@48kHz XAP     67.2%   17.2%   FEASIBLE
nRF9160       64 MHz   256 KB   M33 DSP+FPU    4ch@48kHz XAP     73.4%   17.2%   TIGHT
STM32WB55     64 MHz   256 KB   M4F DSP+FPU    4ch@48kHz XAP     60.9%   19.5%   FEASIBLE
STM32WBA55   100 MHz   128 KB   M33 DSP+FPU    4ch@96kHz XAP     74.0%   39.1%   TIGHT
RP2040       133 MHz   264 KB   None           ADPCM 4ch@96kHz    3.0%   10.6%   ADPCM ONLY
ESP32-C3     160 MHz   400 KB   M ext only     ADPCM 4ch@96kHz    2.5%    7.0%   ADPCM ONLY
STM32F103     72 MHz    20 KB   None           ADPCM 2ch@24kHz    1.4%   60.0%   SENSOR ONLY

Verdict definitions:

  • COMFORTABLE: Headroom >= 48%; suitable for production deployment with additional sensor/housekeeping load.
  • FEASIBLE: Roughly 33–67% CPU; workable but leaves limited headroom. Profile under full load before shipping.
  • TIGHT: 67–80% CPU or >35% RAM; technically meets spec but leaves minimal margin. Requires careful interrupt budgeting and no background tasks competing for cycles.
  • ADPCM ONLY: No hardware FPU; XAP is not feasible. ADPCM runs comfortably.
  • SENSOR ONLY: Inadequate SRAM for the full audio pipeline. Suitable for sensor-only deployments (accelerometer, temperature) with ADPCM at reduced channel count and sample rate.
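
The verdict thresholds can be expressed as a small classifier. This is an illustrative sketch only: the boundaries are approximate, the 64 KB SRAM cutoff is an assumption for this sketch, and the matrix itself is authoritative where a target sits exactly on an edge:

```rust
// Illustrative verdict classifier for the thresholds above.
fn verdict(cpu_pct: f64, ram_pct: f64, has_fpu: bool, sram_kb: u32) -> &'static str {
    if sram_kb < 64 {
        return "SENSOR ONLY"; // assumed SRAM cutoff for a full audio pipeline
    }
    if !has_fpu {
        return "ADPCM ONLY";
    }
    if cpu_pct > 80.0 {
        return "OVER BUDGET"; // reduce channel count or sample rate
    }
    if cpu_pct > 67.0 || ram_pct > 35.0 {
        return "TIGHT";
    }
    if cpu_pct > 52.0 {
        return "FEASIBLE";
    }
    "COMFORTABLE"
}

fn main() {
    println!("ESP32-S3:  {}", verdict(37.1, 17.2, true, 512));
    println!("nRF9160:   {}", verdict(73.4, 17.2, true, 256));
    println!("STM32F103: {}", verdict(1.4, 60.0, false, 20));
}
```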


6. Memory Budget per Target

All figures in KB. "SDK+Codec" includes SDK static allocations and codec working buffers. "Ring Buf" is the audio ring buffer sized for 2× the maximum frame. "XMBP" is the metadata framing buffer. "HTTP" is the HTTP client stack. "Stack" is the main task stack plus ISR stack.

Target       SRAM   SDK+Codec   Ring Buf   XMBP   HTTP   Stack   Used   Avail   RAM%
RP2350        520          20         32     16      4      16     88     432   16.9%
ESP32-S3      512          20         64     16      8      16    124     388   24.2%
STM32F411     128          20          8      4      4       8     44      84   34.4%
nRF52840      256          20         16      8      4       8     56     200   21.9%
nRF9160       256          20         16      8      4       8     56     200   21.9%
STM32WB55     256          20         16      2      0      12     50     206   19.5%
STM32WBA55    128          20         16      2      0      12     50      78   39.1%
RP2040        264           8          8      4      4       8     32     232   12.1%
ESP32-C3      400           8          8      4      4       8     32     368    8.0%
STM32F103      20           4          4      2      2       4     16       4   80.0%

Notes:

  • STM32F411 RAM% of 34.4% is deceptively low: the 128 KB total leaves only 84 KB available, which provides no margin for an RTOS heap or additional sensor buffers.
  • STM32F103's 4 KB available is effectively zero headroom. RTOS tick lists and heap fragments will consume it immediately.
  • The ESP32-S3 ring buffer is doubled (64 KB) to accommodate the higher throughput at 4ch@96 kHz without assuming PSRAM.
  • The STM32WB55/WBA55 HTTP column is 0 KB because these targets use BLE transport only; no HTTP client stack is allocated.
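
The Used, Avail, and RAM% columns are pure arithmetic over the component columns; a sketch recomputing the RP2350 row (field names are illustrative):

```rust
// One row of the memory budget table, all figures in KB.
struct MemBudget {
    sram: u32,
    sdk_codec: u32,
    ring: u32,
    xmbp: u32,
    http: u32,
    stack: u32,
}

impl MemBudget {
    fn used(&self) -> u32 {
        self.sdk_codec + self.ring + self.xmbp + self.http + self.stack
    }
    fn avail(&self) -> u32 {
        self.sram - self.used()
    }
    fn ram_pct(&self) -> f64 {
        self.used() as f64 / self.sram as f64 * 100.0
    }
}

fn main() {
    let rp2350 = MemBudget { sram: 520, sdk_codec: 20, ring: 32, xmbp: 16, http: 4, stack: 16 };
    println!(
        "used {} KB, avail {} KB, {:.1}%",
        rp2350.used(),
        rp2350.avail(),
        rp2350.ram_pct()
    );
}
```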


7. Server Concurrency Analysis

7.1 Architecture

The server is an Axum application running on Tokio's multi-threaded scheduler. Key concurrency constructs:

  • Sessions map: RwLock<HashMap<SessionId, Arc<SessionState>>> — read-heavy, write only on session open/close.
  • Per-session buffer: Mutex<StreamBuffer> inside each SessionState — locked only during ingest chunk processing and flush.
  • API key cache: lock-free mini_moka::sync::Cache with 60-second TTL — no contention under read load.
  • DB connection pool: sqlx::PgPool with configurable size (default 20 connections).
  • Transcode queue: Postgres LISTEN/NOTIFY with 10-second polling fallback, semaphore-limited to 2 concurrent jobs.

7.2 Lock Hierarchy

No circular lock acquisition is possible given this ordering:

Arc references (no locks)
  -> Sessions RwLock          (held briefly for map lookup)
    -> Per-Session Mutex      (held during chunk processing)
      -> I/O / DB             (no application locks held during I/O)

The RwLock on the sessions map is released before the per-session Mutex is acquired. DB operations occur after both locks are released. Deadlock potential is zero by construction.
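
A minimal sketch of that ordering, using std sync types in place of the Tokio equivalents; all names here are illustrative, not the actual server code:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

type SessionId = u64;

struct SessionState {
    buffer: Mutex<Vec<u8>>, // stands in for Mutex<StreamBuffer>
}

type Sessions = RwLock<HashMap<SessionId, Arc<SessionState>>>;

fn flush(sessions: &Sessions, id: SessionId) -> Option<Vec<u8>> {
    // Level 1: sessions RwLock, held only for the lookup. The temporary read
    // guard is dropped at the end of this statement, before any other lock.
    let session = sessions.read().unwrap().get(&id).cloned()?;
    // Level 2: per-session Mutex, held only long enough to take the bytes.
    let pending = std::mem::take(&mut *session.buffer.lock().unwrap());
    // Level 3: DB/S3 I/O would happen here with no application locks held.
    Some(pending)
}

fn main() {
    let sessions: Sessions = RwLock::new(HashMap::new());
    sessions.write().unwrap().insert(
        7,
        Arc::new(SessionState { buffer: Mutex::new(b"chunk".to_vec()) }),
    );
    let flushed = flush(&sessions, 7).unwrap();
    println!("flushed {} bytes", flushed.len());
}
```

Cloning the `Arc` out of the map is what lets the read guard drop before the per-session lock is taken; holding both simultaneously is never required.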

7.3 Concurrency Capacity

Concurrent Streams   Assessment   Notes
<50                  SAFE         No lock contention. DB pool at <25% capacity. Sub-50ms flush latency.
50–200               CAUTION      DB pool at 50–100% capacity. Flush latency increases under burst.
200–1,000            RISKY        Pool exhaustion likely. Backpressure kicks in; see hazard #1.
>1,000               UNSAFE       Registration failures probable. Memory spikes from buffered upload segments.

These thresholds assume the default 20-connection DB pool. Increasing the pool (see recommendations) shifts the CAUTION boundary upward proportionally.
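
As a rough sketch of that proportional shift (illustrative arithmetic only, assuming the boundary scales linearly with pool size as stated):

```rust
// CAUTION boundary in concurrent streams, scaled from the 20-connection
// baseline above. A sketch, not a measured capacity model.
fn caution_boundary(pool_size: u32) -> u32 {
    const BASELINE_STREAMS: u32 = 50;
    const BASELINE_POOL: u32 = 20;
    BASELINE_STREAMS * pool_size / BASELINE_POOL
}

fn main() {
    for pool in [20, 50, 100] {
        println!("pool {:>3} -> CAUTION starts near {} streams", pool, caution_boundary(pool));
    }
}
```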

7.4 Bottlenecks

1. DB connection pool (20 connections by default)

The pool is the primary chokepoint. Each ingest flush and each transcode job dispatch consumes a connection for the duration of the query. At 100+ concurrent streams, simultaneous flush windows exhaust all 20 connections; subsequent flush attempts block until a connection is returned. This produces latency spikes at the flush boundary, not uniform backpressure.

2. Simultaneous flush window contention

Sessions flushing at the same wall-clock time (e.g., all opened within the same second) concentrate DB connection demand into a narrow window. The flush interval is per-session, not globally staggered, so initial connection spikes are possible at startup.

3. Broadcast channel overflow

Live event subscribers use a broadcast channel with capacity 256. When a subscriber cannot drain the channel fast enough, messages are silently dropped. There is no back-channel notification to the subscriber that events were lost.


8. Hazards and Recommendations

8.1 Hazards

#   Hazard                          Severity   Detail
1   Backpressure returns HTTP 500   High       DB pool exhaustion surfaces as a generic 500 response. Clients cannot distinguish overload from a server bug. Retry-after semantics are impossible.
2   FFmpeg process orphaning        Medium     If the transcode worker task panics after spawning FFmpeg, the child process continues running with no supervisor. Process accumulation degrades host performance silently.
3   No metrics endpoint             Medium     No Prometheus endpoint exists for concurrent session count, DB pool utilization, queue depth, or lock contention. The capacity limits in section 7.3 cannot be verified in production without adding instrumentation first.
4   Transcode job orphaning         Low        A worker crash leaves the job in a "claimed" state. The 1-hour timeout is the only recovery mechanism. At scale, multiple orphaned jobs accumulate and delay legitimate work.
5   S3 large-file full buffering    Low        Upload files up to 256 MB are fully buffered in RAM before the S3 multipart upload begins. At >1,000 concurrent streams this contributes to the memory spikes noted in section 7.3.

8.2 Recommendations

Priority   Action                                                                                                            Addresses
High       Increase DB pool: 20 → 50+ connections for production deployments                                                 Hazard #1, bottleneck #1
High       Return HTTP 429 with Retry-After header when pool is exhausted or ingest queue is full                            Hazard #1
Medium     Add Prometheus metrics endpoint: session count, pool utilization, queue depth                                     Hazard #3
Medium     Spawn FFmpeg in a process group; kill the group on worker exit or panic                                           Hazard #2
Low        Add transcode worker heartbeat: periodic DB timestamp update; reaper thread reclaims jobs with stale heartbeats   Hazard #4
Low        Stream large S3 uploads using multipart from the first write rather than buffering the full object               Hazard #5

9. Related Documents

  • PERFORMANCE-EVALUATION.md — Detailed host benchmark tables for XAP and ADPCM at all sample rates, frame durations, and channel counts. Source data for sections 2–3 of this document.
  • PERFORMANCE-PROFILE.md — Profiling methodology, measurement setup, and raw timing distributions.
  • CODEC-ANALYSIS.md — Codec comparison (XAP vs Opus, AAC, ADPCM, SBC) including MIPS estimates, compression ratios, quality assessment, and licensing.