Platform Evaluation — MCU Feasibility & Server Concurrency¶
Xylolabs API — Platform evaluation: MCU codec feasibility and server concurrency analysis. Revision: 2026-03-29.
1. Overview¶
This document covers two distinct performance domains:
- MCU codec feasibility — whether target MCUs can run XAP or ADPCM encoding within a 10ms frame budget, and whether SRAM is sufficient.
- Server concurrency limits — how many simultaneous ingest streams the Axum server handles before contention becomes a problem, and where the failure modes are.
Benchmark data in sections 2–4 was measured on an Apple M-series host in --release mode. MCU figures in sections 5–6 are extrapolated from host measurements using published MIPS estimates and DSP acceleration factors. Server analysis in section 7 is based on static code review of lock hierarchy and resource allocation.
2. XAP Codec Performance¶
2.1 Encode Time Per Frame — Mono, 10ms¶
Measured on host. MCU encode time is estimated by scaling the host figure inversely with the MCU/host clock-rate ratio.
| Rate (Hz) | Samples | Host avg (us) | Budget% (host) |
|---|---|---|---|
| 8,000 | 80 | 0.6 | 0.006% |
| 16,000 | 160 | 2.3 | 0.023% |
| 24,000 | 240 | 5.0 | 0.050% |
| 32,000 | 320 | 9.4 | 0.094% |
| 48,000 | 480 | 163.0 | 1.630% |
| 96,000 | 960 | 630.9 | 6.309% |
2.2 Cosine Table Threshold¶
The precomputed cosine table covers N <= 320 samples. This means:
- 8–32 kHz at 10ms (80–320 samples): table lookup path, zero trig calls, fast.
- 48 kHz and above at 10ms (480+ samples): falls back to runtime `cosf()` per MDCT coefficient.
The jump at 48 kHz is a discontinuity, not a proportional increase: 9.4 us at 32 kHz to 163.0 us at 48 kHz is roughly 17x, and it reflects the table boundary exactly. On MCU targets with CMSIS-DSP or a hardware FFT, this path is replaced entirely and the discontinuity disappears.
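The two code paths can be sketched as follows. This is illustrative only, not the actual XAP implementation: the `COS_TABLE_MAX` constant, the flattened table layout, and the `basis_angle` indexing are assumptions standing in for the real codec internals.

```rust
// Illustrative sketch of the table-vs-runtime-trig split from section 2.2:
// frames up to COS_TABLE_MAX samples use a precomputed table (zero trig
// calls), larger frames call cosf() once per coefficient.
const COS_TABLE_MAX: usize = 320; // hypothetical table limit

/// Hypothetical MDCT-style basis angle for coefficient k, sample n, frame len.
fn basis_angle(k: usize, n: usize, len: usize) -> f32 {
    std::f32::consts::PI / len as f32
        * (n as f32 + 0.5 + len as f32 / 2.0)
        * (k as f32 + 0.5)
}

struct CosBasis {
    table: Option<Vec<f32>>, // flattened len*len table when len <= COS_TABLE_MAX
    len: usize,
}

impl CosBasis {
    fn new(len: usize) -> Self {
        let table = (len <= COS_TABLE_MAX).then(|| {
            (0..len * len)
                .map(|i| basis_angle(i / len, i % len, len).cos())
                .collect()
        });
        CosBasis { table, len }
    }

    fn get(&self, k: usize, n: usize) -> f32 {
        match &self.table {
            Some(t) => t[k * self.len + n],            // fast path: lookup, no trig
            None => basis_angle(k, n, self.len).cos(), // slow path: runtime cosine
        }
    }
}
```

The per-coefficient `cos()` call in the fallback arm is what the 48 kHz row pays for on every frame; precomputation moves that cost to startup for the small-frame rates.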
2.3 Channel Scaling — 16 kHz, 10ms¶
Channels are processed independently with shared header/setup overhead, producing sub-linear scaling.
| Channels | Host avg (us) | Per-channel (us) | Scaling vs 1ch |
|---|---|---|---|
| 1 | 2.4 | 2.4 | 1.00x |
| 2 | 4.7 | 2.35 | 1.96x |
| 3 | 7.1 | 2.37 | 2.96x |
| 4 | 9.2 | 2.30 | 3.83x |
4-channel scaling is 3.83x rather than 4.0x. The sub-linearity comes from amortized de-interleave and header write cost shared across all channels.
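The sub-linearity is consistent with a simple fixed-plus-per-channel cost model. The constants below are eyeballed from the 16 kHz table above, not measured, and are illustrative only:

```rust
// Hypothetical decomposition: total(c) = fixed overhead + c * per-channel cost.
// Constants fitted by hand to the 16 kHz channel-scaling table (illustrative).
const FIXED_US: f64 = 0.1;       // amortized de-interleave + header write
const PER_CHANNEL_US: f64 = 2.3; // one channel's encode work

fn predicted_total_us(channels: u32) -> f64 {
    FIXED_US + channels as f64 * PER_CHANNEL_US
}

fn scaling_vs_1ch(channels: u32) -> f64 {
    predicted_total_us(channels) / predicted_total_us(1)
}
```

With these constants the model predicts 9.3 us for 4 channels against the measured 9.2 us, and a scaling factor just under 4.0x, matching the shape of the table.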
2.4 Four-Channel Stress — Key Sample Rates, 10ms¶
| Rate (Hz) | Channels | Host avg (us) | Host Budget% |
|---|---|---|---|
| 16,000 | 4 | 9.1 | 0.091% |
| 48,000 | 4 | 684.2 | 6.842% |
| 96,000 | 4 | 2,520.6 | 25.206% |
At 4ch@96 kHz on the host, 25.2% of the 10ms window is spent encoding. On MCU targets, DSP acceleration brings this to a manageable range for M4F and PIE-capable parts (see section 5).
3. ADPCM Codec Performance¶
3.1 ADPCM vs XAP at 4ch@96 kHz, 10ms¶
| Codec | Host avg (us) | Host Budget% | vs XAP |
|---|---|---|---|
| ADPCM | 22.72 | 0.227% | — |
| XAP | 2,520.6 | 25.206% | 110x slower |
ADPCM is 110x cheaper than XAP at the same configuration. For MCUs without DSP extensions, ADPCM is the only viable codec at 96 kHz. The tradeoff is audio quality: ADPCM is 4:1 compression with audible artifacts under some conditions; XAP achieves 10:1 with perceptually transparent output.
4. DSP/FPU Acceleration Analysis¶
| Feature | Parts | Mechanism | Speedup vs soft-float |
|---|---|---|---|
| CMSIS-DSP arm_rfft_fast_f32 | Cortex-M4F, M33 | Dual MAC (SMLAD), paired multiplies | 30–40% |
| ESP32-S3 PIE SIMD | ESP32-S3 | 128-bit vector ops, 4x int32 parallel | ~60% |
| Cosine table (N<=320) | All targets | Precomputed fixed-point, no trig calls | Eliminates cosf() path |
| Runtime cosf() (N>320) | All targets (fallback) | Per-coefficient float trig | Baseline (17x slower than table path) |
| No FPU (M3, RP2040, C3) | RP2040, ESP32-C3, STM32F103 | Soft-float fallback | XAP not feasible |
CMSIS-DSP replaces the software MDCT with arm_rfft_fast_f32, eliminating both the cosine table discontinuity and the cosf() fallback. This is the primary reason MCU budget percentages diverge sharply from host measurements.
Parts without hardware FPU (RP2040 Cortex-M0+, ESP32-C3 RV32IMC, STM32F103 Cortex-M3) cannot run XAP at any practical sample rate within budget. These targets are limited to ADPCM.
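A firmware build might gate codec choice on these capabilities. The sketch below is an assumption about how that rule could look, not SDK API; in particular the 48 kHz cutoff for FPU-without-DSP parts is an illustrative guess, since every FPU part in the matrix also has DSP acceleration:

```rust
#[derive(Debug, PartialEq)]
enum Codec {
    Xap,
    Adpcm,
}

/// Hypothetical selection rule distilled from sections 4-5: XAP requires a
/// hardware FPU (plus DSP/SIMD for the highest rates); soft-float targets
/// always get ADPCM.
fn select_codec(has_fpu: bool, has_dsp: bool, sample_rate_hz: u32) -> Codec {
    match (has_fpu, has_dsp) {
        (false, _) => Codec::Adpcm, // soft-float: XAP not feasible at any rate
        (true, false) if sample_rate_hz > 48_000 => Codec::Adpcm, // illustrative cutoff
        _ => Codec::Xap,
    }
}
```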
5. MCU Feasibility Matrix¶
CPU% figures use the MCU-scaled encode time with DSP acceleration applied. "Max Config" is the highest XAP or ADPCM configuration that fits within approximately 80% CPU budget.
| Target | Clock | SRAM | DSP | Max Config | CPU% | RAM% | Verdict |
|---|---|---|---|---|---|---|---|
| RP2350 | 150 MHz | 520 KB | M33 DSP+FPU | 4ch@96kHz XAP | 52.0% | 16.9% | COMFORTABLE |
| ESP32-S3 | 240 MHz | 512 KB | PIE SIMD+FPU | 4ch@96kHz XAP | 37.1% | 17.2% | COMFORTABLE |
| STM32F411 | 100 MHz | 128 KB | M4F DSP+FPU | 4ch@96kHz XAP | 71.0% | 68.8% | TIGHT |
| nRF52840 | 64 MHz | 256 KB | M4F DSP+FPU | 4ch@48kHz XAP | 67.2% | 17.2% | FEASIBLE |
| nRF9160 | 64 MHz | 256 KB | M33 DSP+FPU | 4ch@48kHz XAP | 73.4% | 17.2% | TIGHT |
| STM32WB55 | 64 MHz | 256 KB | M4F DSP+FPU | 4ch@48kHz XAP | 60.9% | 19.5% | FEASIBLE |
| STM32WBA55 | 100 MHz | 128 KB | M33 DSP+FPU | 4ch@96kHz XAP | 74.0% | 39.1% | TIGHT |
| RP2040 | 133 MHz | 264 KB | None | ADPCM 4ch@96kHz | 3.0% | 10.6% | ADPCM ONLY |
| ESP32-C3 | 160 MHz | 400 KB | M ext only | ADPCM 4ch@96kHz | 2.5% | 7.0% | ADPCM ONLY |
| STM32F103 | 72 MHz | 20 KB | None | ADPCM 2ch@24kHz | 1.4% | 60.0% | SENSOR ONLY |
Verdict definitions:
- COMFORTABLE: Headroom >= 48%; suitable for production deployment with additional sensor/housekeeping load.
- FEASIBLE: 33–67% CPU; workable but leaves limited headroom. Profile under full load before shipping.
- TIGHT: 67–80% CPU or >35% RAM; technically meets spec but leaves minimal margin. Requires careful interrupt budgeting and no background tasks competing for cycles.
- ADPCM ONLY: No hardware FPU; XAP is not feasible. ADPCM runs comfortably.
- SENSOR ONLY: Inadequate SRAM for the full audio pipeline. Suitable for sensor-only deployments (accelerometer, temperature) with ADPCM at reduced channel count and sample rate.
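The CPU% column can be reproduced, in principle, as a clock-ratio extrapolation with a DSP speedup factor applied. The document cites published MIPS estimates rather than a closed formula, so the host clock and speedup inputs below are placeholders:

```rust
/// Extrapolate a host-measured encode time to an MCU and express it as a
/// percentage of the 10 ms frame budget. All scaling inputs are hypothetical;
/// the real figures were derived from published MIPS estimates.
fn mcu_budget_pct(host_us: f64, host_mhz: f64, mcu_mhz: f64, dsp_speedup: f64) -> f64 {
    let mcu_us = host_us * (host_mhz / mcu_mhz) / dsp_speedup;
    mcu_us / 10_000.0 * 100.0 // 10 ms frame = 10,000 us
}
```

For example, a 100 us host measurement scaled from a notional 1,000 MHz host to a 100 MHz MCU with a 2x DSP speedup yields 500 us, i.e. 5% of the frame budget.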
6. Memory Budget per Target¶
All figures in KB. "SDK+Codec" includes SDK static allocations and codec working buffers. "Ring Buf" is the audio ring buffer sized for 2× the maximum frame. "XMBP" is the metadata framing buffer. "HTTP" is the XMBP/HTTP client stack. "Stack" is the main task stack plus ISR stack.
| Target | SRAM | SDK+Codec | Ring Buf | XMBP | HTTP | Stack | Used | Avail | RAM% |
|---|---|---|---|---|---|---|---|---|---|
| RP2350 | 520 | 20 | 32 | 16 | 4 | 16 | 88 | 432 | 16.9% |
| ESP32-S3 | 512 | 20 | 64 | 16 | 8 | 16 | 124 | 388 | 24.2% |
| STM32F411 | 128 | 20 | 8 | 4 | 4 | 8 | 44 | 84 | 34.4% |
| nRF52840 | 256 | 20 | 16 | 8 | 4 | 8 | 56 | 200 | 21.9% |
| nRF9160 | 256 | 20 | 16 | 8 | 4 | 8 | 56 | 200 | 21.9% |
| STM32WB55 | 256 | 20 | 16 | 2 | 0 | 12 | 50 | 206 | 19.5% |
| STM32WBA55 | 128 | 20 | 16 | 2 | 0 | 12 | 50 | 78 | 39.1% |
| RP2040 | 264 | 8 | 8 | 4 | 4 | 8 | 32 | 232 | 12.1% |
| ESP32-C3 | 400 | 8 | 8 | 4 | 4 | 8 | 32 | 368 | 8.0% |
| STM32F103 | 20 | 4 | 4 | 2 | 2 | 4 | 16 | 4 | 80.0% |
Notes:
- STM32F411 RAM% of 34.4% is deceptively low — the 128 KB total leaves only 84 KB available, which provides no margin for an RTOS heap or additional sensor buffers.
- STM32F103's 4 KB available is effectively zero headroom. RTOS tick lists and heap fragments will consume this immediately.
- ESP32-S3 ring buffer is doubled (64 KB) to accommodate the higher throughput at 4ch@96 kHz with PSRAM not assumed.
- STM32WB55/WBA55 HTTP column is 0 KB because these targets use BLE transport only; no HTTP client stack is allocated.
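The "Ring Buf" column follows from frame-size arithmetic, assuming 16-bit PCM samples — the sample width is an assumption here, not stated in the document:

```rust
/// Ring buffer sized for 2x the maximum frame (section 6 sizing rule),
/// assuming 16-bit PCM samples (2 bytes each) — an assumption on our part.
fn ring_buffer_bytes(sample_rate_hz: u32, channels: u32, frame_ms: u32) -> u32 {
    let samples_per_frame = sample_rate_hz * frame_ms / 1000;
    let frame_bytes = samples_per_frame * channels * 2; // 2 bytes per i16 sample
    2 * frame_bytes
}
```

At 4ch@96 kHz with a 10 ms frame this gives 15,360 bytes, or about 15 KB, consistent with the 16 KB rows after rounding up to an allocation boundary.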
7. Server Concurrency Analysis¶
7.1 Architecture¶
The server is an Axum application running on Tokio's multi-threaded scheduler. Key concurrency constructs:
- Sessions map: `RwLock<HashMap<SessionId, Arc<SessionState>>>` — read-heavy, write only on session open/close.
- Per-session buffer: `Mutex<StreamBuffer>` inside each `SessionState` — locked only during ingest chunk processing and flush.
- API key cache: lock-free `mini_moka::sync::Cache` with 60-second TTL — no contention under read load.
- DB connection pool: `sqlx::PgPool` with configurable size (default 20 connections).
- Transcode queue: Postgres `LISTEN/NOTIFY` with 10-second polling fallback, semaphore-limited to 2 concurrent jobs.
7.2 Lock Hierarchy¶
No circular lock acquisition is possible given this ordering:
Arc references (no locks)
-> Sessions RwLock (held briefly for map lookup)
-> Per-Session Mutex (held during chunk processing)
-> I/O / DB (no application locks held during I/O)
The RwLock on the sessions map is released before the per-session Mutex is acquired. DB operations occur after both locks are released. Deadlock potential is zero by construction.
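The ordering can be sketched in std-only Rust. This is a simplification of the pattern, not the server's actual types (the real server presumably uses tokio's async locks): clone the `Arc` under a short read lock, drop the guard, then take the per-session lock.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

type SessionId = u64;

struct StreamBuffer { bytes: Vec<u8> }
struct SessionState { buffer: Mutex<StreamBuffer> }
struct Server { sessions: RwLock<HashMap<SessionId, Arc<SessionState>>> }

impl Server {
    fn new() -> Server {
        Server { sessions: RwLock::new(HashMap::new()) }
    }

    fn open_session(&self, id: SessionId) {
        let state = Arc::new(SessionState { buffer: Mutex::new(StreamBuffer { bytes: Vec::new() }) });
        self.sessions.write().unwrap().insert(id, state); // write lock: open/close only
    }

    fn ingest_chunk(&self, id: SessionId, chunk: &[u8]) -> bool {
        // 1. Hold the sessions RwLock only long enough to clone the Arc.
        let session = match self.sessions.read().unwrap().get(&id) {
            Some(s) => Arc::clone(s),
            None => return false,
        }; // read guard dropped here, before the per-session lock is taken
        // 2. Per-session Mutex held only for chunk processing.
        session.buffer.lock().unwrap().bytes.extend_from_slice(chunk);
        // 3. I/O / DB would happen here, with no application locks held.
        true
    }
}
```

Because the read guard never overlaps the `Mutex` acquisition, the hierarchy in the diagram above is enforced by construction rather than by convention.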
7.3 Concurrency Capacity¶
| Concurrent Streams | Assessment | Notes |
|---|---|---|
| <50 | SAFE | No lock contention. DB pool at <25% capacity. Sub-50ms flush latency. |
| 50–200 | CAUTION | DB pool at 50–100% capacity. Flush latency increases under burst. |
| 200–1,000 | RISKY | Pool exhaustion likely. Backpressure kicks in; see hazard #1. |
| >1,000 | UNSAFE | Registration failures probable. Memory spikes from buffered upload segments. |
These thresholds assume the default 20-connection DB pool. Increasing the pool (see recommendations) shifts the CAUTION boundary upward proportionally.
7.4 Bottlenecks¶
1. DB connection pool (20 connections by default)
The pool is the primary chokepoint. Each ingest flush and each transcode job dispatch consumes a connection for the duration of the query. At 100+ concurrent streams, simultaneous flush windows exhaust all 20 connections; subsequent flush attempts block until a connection is returned. This produces latency spikes at the flush boundary, not uniform backpressure.
2. Simultaneous flush window contention
Sessions flushing at the same wall-clock time (e.g., all opened within the same second) concentrate DB connection demand into a narrow window. The flush interval is per-session, not globally staggered, so initial connection spikes are possible at startup.
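One mitigation (not currently implemented, per the above) is deterministic per-session jitter, so flush times decorrelate even for sessions opened in the same second. A minimal std-only sketch; the function name and the 0..interval jitter range are our choices, not the server's API:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Offset each session's first flush by a jitter derived deterministically
/// from its id, spreading flushes across the interval instead of letting
/// them align on a shared wall-clock boundary.
fn flush_jitter_ms(session_id: u64, flush_interval_ms: u64) -> u64 {
    let mut h = DefaultHasher::new();
    session_id.hash(&mut h);
    h.finish() % flush_interval_ms
}
```

Deterministic jitter (rather than random) keeps a given session's flush phase stable across reconnects, which makes latency traces easier to correlate.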
3. Broadcast channel overflow
Live event subscribers use a broadcast channel with capacity 256. When a subscriber cannot drain the channel fast enough, messages are silently dropped. There is no back-channel notification to the subscriber that events were lost.
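The drop mechanics can be modeled with a small std-only sketch; this is not the server's channel API, just an illustration of the hazard. Tagging each message with a sequence number gives the subscriber exactly the loss signal the current design lacks:

```rust
use std::collections::VecDeque;

/// Minimal model of a bounded broadcast buffer: when full, the publisher
/// overwrites the oldest entry, and a subscriber can detect the gap by
/// comparing sequence numbers. Illustrative only.
struct BroadcastBuf {
    buf: VecDeque<(u64, String)>,
    cap: usize,
    next_seq: u64,
}

impl BroadcastBuf {
    fn new(cap: usize) -> Self {
        BroadcastBuf { buf: VecDeque::new(), cap, next_seq: 0 }
    }

    fn publish(&mut self, msg: String) {
        if self.buf.len() == self.cap {
            self.buf.pop_front(); // oldest message silently dropped
        }
        self.buf.push_back((self.next_seq, msg));
        self.next_seq += 1;
    }

    /// Returns (lost_count, message): a slow subscriber sees how many
    /// events it missed instead of losing them invisibly.
    fn recv(&mut self, expected_seq: u64) -> Option<(u64, String)> {
        let (seq, msg) = self.buf.pop_front()?;
        Some((seq - expected_seq, msg))
    }
}
```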
8. Hazards and Recommendations¶
8.1 Hazards¶
| # | Hazard | Severity | Detail |
|---|---|---|---|
| 1 | Backpressure returns HTTP 500 | High | DB pool exhaustion surfaces as a generic 500 response. Clients cannot distinguish overload from a server bug. Retry-after semantics are impossible. |
| 2 | FFmpeg process orphaning | Medium | If the transcode worker task panics after spawning FFmpeg, the child process continues running with no supervisor. Process accumulation degrades host performance silently. |
| 3 | No metrics endpoint | Medium | No Prometheus endpoint exists for concurrent session count, DB pool utilization, queue depth, or lock contention. Capacity limits in section 7.3 cannot be verified in production without adding instrumentation first. |
| 4 | Transcode job orphaning | Low | A worker crash leaves the job in a "claimed" state. The 1-hour timeout is the only recovery mechanism. At scale, multiple orphaned jobs accumulate and delay legitimate work. |
| 5 | S3 large file full buffering | Low | Upload files up to 256 MB are fully buffered in RAM before S3 multipart upload begins. At >1,000 concurrent streams this contributes to memory spikes noted in section 7.3. |
8.2 Recommendations¶
| Priority | Action | Addresses |
|---|---|---|
| High | Increase DB pool: 20 → 50+ connections for production deployments | Hazard #1, bottleneck #1 |
| High | Return HTTP 429 with `Retry-After` header when pool is exhausted or ingest queue is full | Hazard #1 |
| Medium | Add Prometheus metrics endpoint: session count, pool utilization, queue depth | Hazard #3 |
| Medium | Spawn FFmpeg in a process group; kill the group on worker exit or panic | Hazard #2 |
| Low | Add transcode worker heartbeat: periodic DB timestamp update; reaper thread reclaims jobs with stale heartbeats | Hazard #4 |
| Low | Stream large S3 uploads using multipart from the first write rather than buffering the full object | Hazard #5 |
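The HTTP 429 recommendation reduces to a small decision at the ingest handler, sketched here as plain logic. The status code and retry hint come from the table above; the function shape and the 2-second backoff threshold are illustrative assumptions, not the server's API:

```rust
/// What an overloaded ingest endpoint should answer, per the recommendation:
/// 429 plus a Retry-After hint when the DB pool or ingest queue is saturated,
/// instead of a generic 500. Returns (status code, Retry-After seconds).
fn overload_response(pool_in_use: u32, pool_size: u32, queue_full: bool) -> (u16, Option<u32>) {
    if pool_in_use >= pool_size || queue_full {
        (429, Some(2)) // Retry-After: 2s is an illustrative backoff hint
    } else {
        (200, None)
    }
}
```

Surfacing overload as 429 lets clients implement exponential backoff against `Retry-After` rather than treating the condition as a server bug.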
9. Related Documents¶
- PERFORMANCE-EVALUATION.md — Detailed host benchmark tables for XAP and ADPCM at all sample rates, frame durations, and channel counts. Source data for sections 2–3 of this document.
- PERFORMANCE-PROFILE.md — Profiling methodology, measurement setup, and raw timing distributions.
- CODEC-ANALYSIS.md — Codec comparison (XAP vs Opus, AAC, ADPCM, SBC) including MIPS estimates, compression ratios, quality assessment, and licensing.