Performance Evaluation — MCU Targets & Server Concurrency¶
Xylolabs API — Revision: 2026-03-23
1. XAP Codec Benchmark Results¶
All measurements taken on Apple M-series host in --release mode. The XAP encoder is implemented in crates/xylolabs-sdk/src/codec/xap.rs. ADPCM encoder is the IMA-ADPCM implementation in crates/xylolabs-sdk/src/codec/adpcm.rs.
1.1 Encode Time Per Frame — Mono, 10ms¶
| Rate | Ch | Samples | Frame | Avg (us) | Min (us) | Max (us) | Budget% |
|---|---|---|---|---|---|---|---|
| 8000 | 1 | 80 | 10ms | 0.5 | 0.4 | 12.7 | 0.005% |
| 16000 | 1 | 160 | 10ms | 1.1 | 1.0 | 1.2 | 0.011% |
| 24000 | 1 | 240 | 10ms | 1.9 | 1.6 | 6.8 | 0.019% |
| 32000 | 1 | 320 | 10ms | 3.0 | 2.6 | 9.0 | 0.030% |
| 48000 | 1 | 480 | 10ms | 512.3 | 451.1 | 700.0 | 5.123% |
| 96000 | 1 | 960 | 10ms | 1954.0 | 1763.8 | 2207.8 | 19.540% |
Critical Observation — Cosine Table Threshold: There is a 170x discontinuity between 32kHz (3.0 us) and 48kHz (512.3 us). This is caused by the precomputed cosine table cutoff at XAP_PRECOMPUTE_MAX_N = 320 samples. Frames with N <= 320 (sample rates up to 32kHz at 10ms) use O(N^2) table lookup with zero trigonometric function calls. Frames with N > 320 (48kHz = 480, 96kHz = 960) fall back to runtime cosf() per MDCT coefficient, which is dramatically slower.
On real MCU targets with DSP extensions, the CMSIS-DSP arm_rfft_fast_f32 replaces this entire MDCT path, eliminating the discontinuity. The host benchmark reflects the software-only encoder behavior.
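The cutoff behavior can be sketched in a few lines. This is an illustrative reconstruction, not the SDK's actual code: `XAP_PRECOMPUTE_MAX_N` is named in the source, but the function signatures, table layout, and angle formula here are assumptions. Both paths compute identical coefficients; only the cost differs.

```rust
use std::f32::consts::PI;

/// Cutoff from the SDK: frames up to this length use the precomputed table.
const XAP_PRECOMPUTE_MAX_N: usize = 320;

/// Hypothetical MDCT basis angle for an N-sample frame.
fn mdct_angle(n: usize, k: usize, t: usize) -> f32 {
    let half = n as f32 / 2.0;
    PI / half * (t as f32 + 0.5 + half / 2.0) * (k as f32 + 0.5)
}

/// Builds a flat [N/2 x N] cosine table (illustrative layout).
fn build_cos_table(n: usize) -> Vec<f32> {
    let half = n / 2;
    let mut table = vec![0.0f32; half * n];
    for k in 0..half {
        for t in 0..n {
            table[k * n + t] = mdct_angle(n, k, t).cos();
        }
    }
    table
}

/// Naive O(N^2) forward MDCT mirroring the benchmark discontinuity.
fn mdct_forward(input: &[f32], table: Option<&[f32]>) -> Vec<f32> {
    let n = input.len();
    (0..n / 2)
        .map(|k| {
            (0..n)
                .map(|t| {
                    let c = match table {
                        // fast path: table lookup, zero trig calls
                        Some(tab) if n <= XAP_PRECOMPUTE_MAX_N => tab[k * n + t],
                        // slow path: one runtime cos() per (k, t) term
                        _ => mdct_angle(n, k, t).cos(),
                    };
                    input[t] * c
                })
                .sum()
        })
        .collect()
}
```

The discontinuity follows directly: for N <= 320 every coefficient term is a table load, while for N > 320 each of the N^2/2 terms pays a trig call.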
1.2 Channel Scaling — 16kHz, 10ms¶
| Rate | Ch | Samples/ch | Avg (us) | Per-ch (us) | Scaling |
|---|---|---|---|---|---|
| 16000 | 1 | 160 | 1.0 | 1.0 | 1.00x |
| 16000 | 2 | 160 | 1.9 | 1.0 | 1.90x |
| 16000 | 3 | 160 | 2.8 | 0.9 | 2.80x |
| 16000 | 4 | 160 | 3.7 | 0.9 | 3.70x |
Channel scaling is sub-linear (3.7x for 4 channels instead of 4.0x). The encoder processes channels independently with shared infrastructure (header write, de-interleave). The per-channel MDCT cost dominates, and the slight sub-linearity comes from amortized header/setup overhead.
1.3 Four-Channel Stress — Key Sample Rates¶
| Rate | Ch | Samples/ch | Avg (us) | Min (us) | Max (us) | Budget% |
|---|---|---|---|---|---|---|
| 16000 | 4 | 160 | 3.6 | 3.2 | 4.6 | 0.036% |
| 48000 | 4 | 480 | 2004.7 | 1829.7 | 2378.2 | 20.047% |
| 96000 | 4 | 960 | 7930.9 | 7390.5 | 10632.4 | 79.309% |
The 4ch@96kHz configuration consumes 79.3% of the 10ms frame budget on the host. This leaves only 2.1ms headroom for XMBP encoding, I/O, sensors, and housekeeping on the host. On MCU targets with DSP acceleration, the MDCT is replaced by hardware-accelerated FFT paths that reduce this to 37-56 MIPS (see Section 2).
1.4 Frame Duration Comparison — 48kHz, 2ch¶
| Rate | Ch | Duration | Samples/ch | Avg (us) | Budget% |
|---|---|---|---|---|---|
| 48000 | 2 | 7.5ms | 360 | 581.7 | 7.756% |
| 48000 | 2 | 10ms | 480 | 1007.6 | 10.076% |
The 7.5ms frame processes fewer samples (360 vs 480), takes 42% less absolute time, and consumes a smaller fraction of its shorter budget (7.8% vs 10.1%), because the O(N^2) MDCT cost grows faster than the frame duration. The 10ms frame remains preferred for MCU targets: the longer budget window provides more headroom for scheduling jitter and interrupt latency.
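The Budget% column is simple arithmetic (encode time over frame duration); a quick check against the table values:

```rust
/// Percent of the frame budget consumed: encode time / frame duration.
fn budget_pct(encode_us: f64, frame_ms: f64) -> f64 {
    encode_us / (frame_ms * 1000.0) * 100.0
}
```

For the two rows above: budget_pct(581.7, 7.5) = 7.756% and budget_pct(1007.6, 10.0) = 10.076%.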
1.5 ADPCM Encode Time Per Frame — 10ms¶
| Rate | Ch | Samples | Avg (us) | Min (us) | Max (us) | Budget% |
|---|---|---|---|---|---|---|
| 16000 | 1 | 160 | 0.93 | 0.83 | 7.21 | 0.009% |
| 16000 | 2 | 320 | 1.83 | 1.71 | 2.58 | 0.018% |
| 16000 | 4 | 640 | 3.70 | 3.50 | 10.21 | 0.037% |
| 48000 | 1 | 480 | 2.87 | 2.62 | 9.58 | 0.029% |
| 48000 | 2 | 960 | 5.68 | 5.38 | 12.33 | 0.057% |
| 48000 | 4 | 1920 | 11.56 | 10.79 | 19.67 | 0.116% |
| 96000 | 1 | 960 | 5.79 | 5.38 | 13.50 | 0.058% |
| 96000 | 2 | 1920 | 11.51 | 10.83 | 20.42 | 0.115% |
| 96000 | 4 | 3840 | 23.00 | 21.75 | 48.79 | 0.230% |
ADPCM is trivially cheap at all configurations. Even 4ch@96kHz costs only 23 us (0.23% of budget). This is because ADPCM uses pure integer arithmetic with no spectral transform -- it encodes sample-by-sample deltas using a lookup table.
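As a concrete sketch of that sample-by-sample scheme, here is a minimal standard IMA-ADPCM encode step. The step and index tables are the standard IMA ones; the function shape is illustrative (one 4-bit code per sample, no block headers or nibble packing), not the SDK's actual `adpcm.rs` API.

```rust
/// Standard IMA-ADPCM step-size table (89 entries).
const STEP_TABLE: [i32; 89] = [
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31, 34, 37, 41,
    45, 50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143, 157, 173, 190,
    209, 230, 253, 279, 307, 337, 371, 408, 449, 494, 544, 598, 658, 724,
    796, 876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066, 2272,
    2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358, 5894, 6484, 7132,
    7845, 8630, 9493, 10442, 11487, 12635, 13899, 15289, 16818, 18500,
    20350, 22385, 24623, 27086, 29794, 32767,
];
/// Step-index adjustment per 3-bit magnitude code.
const INDEX_TABLE: [i32; 8] = [-1, -1, -1, -1, 2, 4, 6, 8];

/// Emits one 4-bit code per i16 sample -- pure integer math, no transform.
fn adpcm_encode(samples: &[i16]) -> Vec<u8> {
    let (mut predictor, mut index) = (0i32, 0i32);
    samples
        .iter()
        .map(|&s| {
            let step = STEP_TABLE[index as usize];
            let mut diff = s as i32 - predictor;
            let sign: u8 = if diff < 0 { 8 } else { 0 };
            if diff < 0 {
                diff = -diff;
            }
            // Quantize |diff| to a 3-bit magnitude against the current step.
            let mut code: u8 = 0;
            let mut vpdiff = step >> 3;
            if diff >= step {
                code |= 4;
                diff -= step;
                vpdiff += step;
            }
            if diff >= step >> 1 {
                code |= 2;
                diff -= step >> 1;
                vpdiff += step >> 1;
            }
            if diff >= step >> 2 {
                code |= 1;
                vpdiff += step >> 2;
            }
            // Update the predictor exactly as the decoder will reconstruct it.
            predictor += if sign != 0 { -vpdiff } else { vpdiff };
            predictor = predictor.clamp(-32768, 32767);
            index = (index + INDEX_TABLE[code as usize]).clamp(0, 88);
            sign | code
        })
        .collect()
}
```

Every sample costs a handful of integer compares, shifts, and adds, which is why the cost in Section 1.5 scales linearly with sample count and never approaches the budget.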
1.6 XAP vs ADPCM Cost Ratio¶
| Config | XAP (us) | ADPCM (us) | Ratio |
|---|---|---|---|
| 1ch@16kHz | 1.0 | 0.93 | 1x |
| 4ch@16kHz | 3.6 | 3.66 | 1x |
| 4ch@48kHz | 2054.1 | 11.72 | 175x |
| 4ch@96kHz | 7973.9 | 22.91 | 348x |
At low sample rates (<=32kHz), XAP and ADPCM have comparable cost because the precomputed cosine table eliminates runtime trig. At 48kHz and above, XAP becomes 175-348x more expensive due to the O(N^2) MDCT with runtime cosf(). On MCU targets, DSP-accelerated FFT paths close this gap significantly (to approximately 20-40x), but ADPCM remains the lightest option for CPU-constrained targets.
2. MCU Feasibility Matrix¶
2.1 Maximum Sustainable Configuration Per Target¶
Evaluated using documented MIPS profiles from PERFORMANCE-PROFILE.md and CODEC-ANALYSIS.md, cross-referenced with benchmark measurements. CPU% includes codec encoding, I/O stack (I2S DMA, transport protocol, sensor sampling), and system overhead. RAM% includes SDK client state, codec buffers, ring buffers, XMBP framing, transport stack, and task stacks.
| Target | Clock | SRAM | DSP/FPU | Max Audio Config | CPU% | RAM% | Verdict |
|---|---|---|---|---|---|---|---|
| RP2350 (Pico 2) | 150 MHz | 520 KB | M33 DSP+FPU | 4ch @96kHz XAP | 46.0% | 16.9% | COMFORTABLE |
| ESP32-S3 | 240 MHz | 512 KB + 8MB PSRAM | PIE SIMD+FPU | 4ch @96kHz XAP | 17.7% | 24.2% | COMFORTABLE |
| STM32F411 | 100 MHz | 128 KB | M4F DSP+FPU | 4ch @48kHz XAP | 40.0% | 34.4% | FEASIBLE |
| nRF52840 | 64 MHz | 256 KB | M4F DSP+FPU | 2ch @48kHz XAP | 42.2% | 21.9% | FEASIBLE |
| nRF9160 | 64 MHz | 256 KB | M33 DSP+FPU | 2ch @48kHz XAP | 44.5% | 21.9% | FEASIBLE |
| STM32WB55 | 64 MHz | 256 KB | M4F DSP+FPU | 2ch @48kHz XAP | 42.2% | 17.2% | FEASIBLE |
| RP2040 (Pico) | 133 MHz | 264 KB | None | ADPCM 4ch @96kHz | 3.0% | 12.1% | ADPCM ONLY |
| ESP32-C3 | 160 MHz | 400 KB | M ext only | ADPCM 4ch @96kHz | 2.5% | 8.0% | ADPCM ONLY |
| STM32F103 | 72 MHz | 20 KB | None | ADPCM 2ch @24kHz | 1.4% | 80.0% | SENSOR ONLY |
2.2 Verdict Definitions¶
| Verdict | CPU Utilization | Meaning |
|---|---|---|
| COMFORTABLE | < 50% | Ample headroom for additional processing, OTA updates, or future features. |
| FEASIBLE | 50-70% | Sufficient headroom for stable operation with careful scheduling. |
| TIGHT | 70-85% | Operational but may exhibit jitter under worst-case interrupt latency. |
| MARGINAL | 85-100% | Risk of frame drops under load. Not recommended for production. |
| ADPCM ONLY | N/A (XAP infeasible) | No DSP/FPU; cannot run XAP encoder. ADPCM at 4:1 compression only. |
| SENSOR ONLY | N/A (audio limited) | Extreme SRAM constraint. ADPCM 1-2ch + sensor telemetry only. |
2.3 Detailed CPU Budget Per Target¶
RP2350 (Pico 2) — 4ch XAP @96kHz¶
Dual-core Cortex-M33 at 150 MHz. Core 0 handles I2S DMA + XAP encoding, Core 1 handles XMBP + HTTP + sensors.
| Component | Baseline MIPS | With DSP MIPS | % of 150 MHz |
|---|---|---|---|
| I2S DMA handling | 2 | 2 | 1.3% |
| XAP MDCT forward | 50 | 35 | 23.3% |
| XAP quantize+pack | 15 | 10 | 6.7% |
| XMBP batch encode | 5 | 5 | 3.3% |
| HTTP transport | 10 | 10 | 6.7% |
| Sensor sampling (26ch) | 5 | 5 | 3.3% |
| Watchdog + housekeeping | 2 | 2 | 1.3% |
| Total | 89 | 69 | 46.0% |
| Available headroom | 61 | 81 | 54.0% |
With the dual-core split described above, overall system utilization drops to ~39.3%.
ESP32-S3 — 4ch XAP @96kHz¶
Dual-core Xtensa LX7 at 240 MHz (480 MIPS total).
| Component | Baseline MIPS | With PIE MIPS | % of 480 MIPS |
|---|---|---|---|
| I2S DMA handling | 2 | 2 | 0.4% |
| XAP MDCT forward | 50 | 20 | 4.2% |
| XAP quantize+pack | 15 | 6 | 1.3% |
| WiFi stack (FreeRTOS) | 30 | 30 | 6.3% |
| XMBP batch encode | 5 | 5 | 1.0% |
| HTTP/TLS transport | 20 | 12 | 2.5% |
| Sensor sampling (26ch) | 5 | 5 | 1.0% |
| PSRAM DMA management | 3 | 3 | 0.6% |
| Watchdog + housekeeping | 2 | 2 | 0.4% |
| Total | 132 | 85 | 17.7% |
| Available headroom | 348 | 395 | 82.3% |
The ESP32-S3 has the most comfortable margin by far, primarily due to 128-bit PIE SIMD on the MDCT inner loop and hardware AES/SHA offload for TLS.
STM32F411 — 4ch XAP @48kHz¶
Single-core Cortex-M4F at 100 MHz.
| Component | Baseline MIPS | With DSP MIPS | % of 100 MHz |
|---|---|---|---|
| I2S DMA handling | 2 | 2 | 2.0% |
| XAP MDCT forward | 25 | 15 | 15.0% |
| XAP quantize+pack | 8 | 5 | 5.0% |
| XMBP batch encode | 3 | 3 | 3.0% |
| UART LTE-M1 transport | 8 | 8 | 8.0% |
| Sensor sampling (26ch) | 5 | 5 | 5.0% |
| Watchdog + housekeeping | 2 | 2 | 2.0% |
| Total | 53 | 40 | 40.0% |
| Available headroom | 47 | 60 | 60.0% |
The M4F FPU enables the XAP floating-point encoder path, which is faster than fixed-point on this core. 4ch@96kHz is not recommended (would exceed 80% utilization).
nRF52840 — 2ch XAP @48kHz¶
Single-core Cortex-M4F at 64 MHz.
| Component | Baseline MIPS | With DSP MIPS | % of 64 MHz |
|---|---|---|---|
| I2S DMA handling | 1 | 1 | 1.6% |
| XAP MDCT forward | 12 | 7 | 10.9% |
| XAP quantize+pack | 4 | 3 | 4.7% |
| BLE GATT stack | 10 | 10 | 15.6% |
| XMBP batch encode | 2 | 2 | 3.1% |
| Sensor sampling (4ch) | 2 | 2 | 3.1% |
| Watchdog + housekeeping | 2 | 2 | 3.1% |
| Total | 33 | 27 | 42.2% |
| Available headroom | 31 | 37 | 57.8% |
BLE stack overhead is the single largest non-codec consumer (15.6%). 4ch@48kHz is possible (~28 MIPS with DSP, 44% utilization) but leaves minimal headroom.
STM32F103 — Sensor-only + ADPCM fallback¶
Single-core Cortex-M3 at 72 MHz, 20 KB SRAM.
| Component | MIPS | % of 72 MHz |
|---|---|---|
| ADPCM encode 2ch @24kHz | 1 | 1.4% |
| XMBP batch encode | 2 | 2.8% |
| UART LTE-M1 transport | 8 | 11.1% |
| Sensor sampling (4ch) | 3 | 4.2% |
| Watchdog + housekeeping | 2 | 2.8% |
| Total | 16 | 22.2% |
| Available headroom | 56 | 77.8% |
XAP is not feasible: the 32 KB encoder state for 4 channels exceeds the entire 20 KB SRAM. Even mono XAP (8 KB encoder state) leaves only 12 KB for everything else. ADPCM at 2ch @24kHz is the maximum practical audio configuration.
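The SRAM arithmetic behind that verdict, using the ~8 KB-per-channel state figure quoted above (the helper names are hypothetical, just for the check):

```rust
/// Rough XAP encoder state footprint, using the ~8 KB/channel figure
/// from the text (an assumption for this check, not a measured constant).
fn xap_state_bytes(channels: usize) -> usize {
    channels * 8 * 1024
}

/// True when the encoder state alone already exceeds total SRAM.
fn xap_infeasible(channels: usize, sram_bytes: usize) -> bool {
    xap_state_bytes(channels) > sram_bytes
}
```

Four channels need 32 KB of state against 20 KB of SRAM; even mono fits only by leaving 12 KB for everything else, matching the text.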
3. DSP/FPU Impact Analysis¶
3.1 Speedup by DSP Architecture¶
| DSP Architecture | Platforms | Key Instructions | XAP Speedup | Mechanism |
|---|---|---|---|---|
| ARMv8-M DSP (Cortex-M33) | RP2350, nRF9160 | SMLAD dual MAC, QADD, SSAT | ~30% | Dual 16x16 MAC doubles throughput on 16-bit audio. Saturating arithmetic eliminates branch-based clipping. Fixed-point (Q15) path preferred. |
| Cortex-M4F FPU+DSP | STM32F411, STM32WB55, nRF52840 | FPU + SMLAD + SDIV | ~35-40% | Hardware float multiply-accumulate in 1-3 cycles. Float encoder path is faster than fixed-point. CMSIS-DSP arm_rfft_fast_f32 provides 3-5x speedup for MDCT. |
| Xtensa PIE SIMD (ESP32-S3) | ESP32-S3 | 128-bit 4x f32, 8x i16 | ~60% | 4-wide vector operations on dedicated 128-bit registers. Hardware AES/SHA offloads TLS from CPU. PSRAM DMA for large buffer transfers. |
| No DSP | RP2040, ESP32-C3, STM32F103 | Software multiply only | 0% (baseline) | All operations in software. No SIMD, no hardware float. XAP MDCT requires software cosf() which is 10-50x slower than DSP-accelerated FFT. |
3.2 MDCT Path Selection¶
The MDCT forward transform is the encoder hot path, consuming 60-70% of total XAP encode time. The SDK selects the optimal path at compile time:
| Core | FPU | Compile Path | MDCT Strategy |
|---|---|---|---|
| Cortex-M33 (RP2350, nRF9160) | Single-precision | cmsis-dsp feature | Fixed-point Q15 with SMLAD dual MAC. Precomputed cosine table in Q15 format. |
| Cortex-M4F (F411, WB55, nRF52840) | Single-precision | cmsis-dsp feature | Floating-point with arm_rfft_fast_f32. Hardware FPU makes float path faster than fixed-point. |
| Xtensa LX7 (ESP32-S3) | Single-precision | esp32-simd feature | Float with PIE 128-bit SIMD. 4 f32 values processed per vector instruction. |
| Cortex-M0+ (RP2040) | None | default (no DSP) | Not feasible. Software float MDCT exceeds CPU budget. ADPCM only. |
| Cortex-M3 (STM32F103) | None | default (no DSP) | Not feasible. Same as M0+. ADPCM only. |
| RISC-V (ESP32-C3) | None | default (no DSP) | Not feasible. M extension provides integer multiply but no SIMD. ADPCM only. |
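Compile-time selection of this kind is typically done with Cargo feature gates. The `cmsis-dsp` and `esp32-simd` feature names come from the table above, but the cfg wiring and function name here are illustrative, not the SDK's actual module layout:

```rust
// Illustrative compile-time MDCT path dispatch. The features are assumed
// to be mutually exclusive, as in the table above.

#[cfg(feature = "cmsis-dsp")]
fn mdct_strategy() -> &'static str {
    // On Cortex-M targets this path would call CMSIS-DSP via FFI:
    // arm_rfft_fast_f32 on M4F, Q15 fixed-point with SMLAD on M33.
    "cmsis-dsp"
}

#[cfg(feature = "esp32-simd")]
fn mdct_strategy() -> &'static str {
    // On ESP32-S3 this path would use PIE 128-bit SIMD intrinsics,
    // processing 4 f32 values per vector instruction.
    "esp32-simd"
}

#[cfg(not(any(feature = "cmsis-dsp", feature = "esp32-simd")))]
fn mdct_strategy() -> &'static str {
    // Portable software fallback: fine on hosts, too slow on no-DSP
    // MCUs (RP2040, ESP32-C3, STM32F103) -- hence "ADPCM only".
    "software"
}
```

Because the selection happens at compile time, no-DSP targets never link the DSP code paths at all.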
3.3 Benchmark-Measured Cosine Table Impact¶
The host benchmark reveals the dramatic impact of the precomputed cosine table:
| Sample Rate | Frame Samples | Table Used | Encode Time (1ch) | Notes |
|---|---|---|---|---|
| 8 kHz | 80 | Yes | 0.5 us | O(N^2) table lookup, zero trig calls |
| 16 kHz | 160 | Yes | 1.1 us | O(N^2) table lookup |
| 24 kHz | 240 | Yes | 1.9 us | O(N^2) table lookup |
| 32 kHz | 320 | Yes (limit) | 3.0 us | Max N for precomputed table |
| 48 kHz | 480 | No | 512.3 us | Runtime cosf() per coefficient -- 170x slower |
| 96 kHz | 960 | No | 1954.0 us | Runtime cosf() -- 651x slower than 32kHz |
The table limit (XAP_PRECOMPUTE_MAX_N = 320) caps memory usage at 200 KB (51,200 i32 entries * 4 bytes = 200 KB). Extending to 480 would require (480/2)*480 = 115,200 entries = 450 KB, which exceeds the SRAM of most MCU targets. On MCU targets, CMSIS-DSP replaces the entire MDCT with hardware-accelerated FFT, making this table irrelevant.
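The sizing arithmetic is easy to verify: the table holds (N/2)*N entries at 4 bytes each.

```rust
/// Cosine-table footprint for an N-sample frame: (N/2) * N entries, 4 bytes each.
fn cos_table_bytes(n: usize) -> usize {
    (n / 2) * n * 4
}
```

cos_table_bytes(320) is 204,800 bytes (200 KB) and cos_table_bytes(480) is 460,800 bytes (450 KB), matching the figures above.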
4. Memory Budget Analysis¶
4.1 Per-Target Memory Breakdown¶
All values in KB. Configurations shown are the maximum recommended per Section 2.
| Target | SRAM | SDK+Codec | Ring Buf | XMBP | HTTP | Stack | Used | Avail | RAM% |
|---|---|---|---|---|---|---|---|---|---|
| RP2350 (Pico 2) | 520 | 20 | 32 | 16 | 4 | 16 | 88 | 432 | 16.9% |
| ESP32-S3 | 512+8M | 20 | 64* | 16 | 8 | 16 | 124 | 388+ | 24.2% |
| STM32F411 | 128 | 20 | 8 | 4 | 4 | 8 | 44 | 84 | 34.4% |
| nRF52840 | 256 | 20 | 16 | 8 | 4 | 8 | 56 | 200 | 21.9% |
| nRF9160 | 256 | 20 | 16 | 8 | 4 | 8 | 56 | 200 | 21.9% |
| STM32WB55 | 256 | 20 | 8 | 4 | 4 | 8 | 44 | 212 | 17.2% |
| RP2040 (Pico) | 264 | 8 | 8 | 4 | 4 | 8 | 32 | 232 | 12.1% |
| ESP32-C3 | 400 | 8 | 8 | 4 | 4 | 8 | 32 | 368 | 8.0% |
| STM32F103 | 20 | 4 | 4 | 2 | 2 | 4 | 16 | 4 | 80.0% |
*ESP32-S3 ring buffer placed in PSRAM via DMA.
4.2 Memory-Limited Scenarios¶
RP2350 (520 KB SRAM): 4ch @96kHz + 26 sensors + HTTP = 88 KB (16.9%). Has 432 KB remaining for application logic, OTA staging buffer, and filesystem. The generous SRAM headroom makes RP2350 the best balanced target for feature-rich deployments.
STM32F411 (128 KB SRAM): 4ch @48kHz XAP = 44 KB (34.4%). The 84 KB remaining is adequate for the application but leaves no room for OTA staging. Firmware updates must use external flash or a swap-based bootloader. Dropping to 2ch@48kHz reduces usage to 36 KB, freeing 8 KB.
nRF52840 (256 KB SRAM): 2ch @48kHz + BLE GATT stack = 56 KB (21.9%). The SoftDevice BLE stack itself consumes an additional ~30 KB. With SoftDevice: total ~86 KB used, 170 KB available. Comfortable for BLE-based deployments.
STM32F103 (20 KB SRAM): Sensor-only ADPCM = 16 KB (80.0%). Only 4 KB remains for application logic. This is the absolute minimum viable configuration. Even adding a single additional sensor stream would require careful stack optimization. XAP is completely impossible: the encoder state alone (8 KB per channel, 32 KB for 4ch) exceeds total SRAM.
ESP32-S3 (512 KB + 8 MB PSRAM): The PSRAM massively extends available memory. Audio ring buffers (64 KB+) and XMBP batch buffers can reside in PSRAM with DMA access, keeping fast SRAM free for codec state and stack. This makes ESP32-S3 the only target that could feasibly support configurations beyond 4ch@96kHz (e.g., 8ch@48kHz for surround monitoring) without memory pressure.
5. Server Concurrency Evaluation¶
5.1 Architecture Overview¶
The Xylolabs API server is built on Tokio async runtime with Axum. The ingest pipeline (crates/xylolabs-server/src/ingest/manager.rs) processes XMBP batches from MCU devices, buffers samples in memory, compresses via zstd in spawn_blocking, and flushes to S3 and PostgreSQL.
5.2 Connection Parameters¶
| Parameter | Value | Notes |
|---|---|---|
| Runtime | Tokio multi-threaded | Worker threads = CPU cores |
| DB connection pool | 20 (default, configurable) | DATABASE_MAX_CONNECTIONS env var |
| Per-session memory | ~1-4 KB per stream buffer | Excluding accumulated samples |
| Upload body limit | Up to 2 GB | Currently full-buffer (not streaming) |
| SSE live connections | Unbounded | Broadcast channels per session |
| HTTP keep-alive | 75 seconds | Axum default |
| Auth rate limit | 10 attempts/IP/minute | In-memory cache, 60s TTL, 10K IP capacity |
| Ingest flush window | 10 seconds (configurable) | Accumulates samples before S3 write |
| Session timeout | 300 seconds (configurable) | Auto-close stale sessions |
5.3 Ingest Pipeline Throughput¶
| Stage | Latency | Concurrency Model | Notes |
|---|---|---|---|
| XMBP batch decode | < 100 us per batch | Inline async | Pure CPU, no I/O |
| Sample buffering | < 1 us per sample | Mutex-guarded HashMap | Per-stream Vec |
| Live event broadcast | < 10 us per event | broadcast::channel | Only if subscribers exist |
| zstd compression | ~200 us per chunk | spawn_blocking | Offloaded from async runtime |
| S3 upload | ~5-20 ms per chunk | Async I/O | Network-bound; MinIO local ~2 ms |
| DB insert (chunk record) | ~1-2 ms per record | Async sqlx | Batched within flush window |
| DB stats update | ~1 ms per batch | Inline async | Single UPDATE per batch |
Estimated throughput per core: ~500 XMBP batches/sec (CPU-bound stage is compression at ~200 us/chunk, but offloaded to blocking pool).
5.4 Concurrent Session Capacity¶
| Scenario | Sessions | Audio Config | Sensor Streams | Server CPU | DB Load | RAM |
|---|---|---|---|---|---|---|
| Light | 10 | 10 x 2ch @16kHz | 40 @100Hz | < 5% | Low | ~40 KB |
| Standard | 50 | 50 x 4ch @48kHz | 200 @100Hz | ~20% | Medium | ~200 KB |
| Heavy | 100 | 100 x 4ch @96kHz | 2600 @100Hz | ~60% | High | ~400 KB |
| Limit | ~200 | Limited by DB pool | Limited by DB pool | ~90% | Saturated | ~800 KB |
The limiting factor is the PostgreSQL connection pool (default 20). Each flush operation requires a DB connection for the chunk INSERT. With 100 sessions flushing every 10 seconds, the pool handles ~10 flush operations/sec in aggregate, which is well within capacity. At 200+ sessions with aggressive flush windows, pool exhaustion becomes the bottleneck.
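The aggregate flush rate behind this estimate is straightforward arithmetic (a sketch, using the session counts and flush window quoted above):

```rust
/// Aggregate flush operations per second: one flush per session per window.
fn flushes_per_sec(sessions: u32, flush_window_s: u32) -> f64 {
    sessions as f64 / flush_window_s as f64
}
```

At 100 sessions with the default 10-second window this is 10 flushes/sec across the whole pool; 200 sessions with an aggressive 5-second window quadruples it to 40/sec, each flush competing for one of 20 pooled connections.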
5.5 Concurrency Hazards — Identified and Resolved¶
| Issue | Severity | Status | Resolution |
|---|---|---|---|
| ConfigManager blocking RwLock in async context | P2 | Fixed | Migrated from std::sync::RwLock to tokio::sync::RwLock. Blocking lock in async runtime caused thread starvation under contention. |
| Ingest flush data loss on S3 failure | P0 | Fixed | Previously drained samples before confirming S3 write. Now clones samples before flush; originals retained on failure for retry. flushing flag prevents concurrent flushes of same buffer. |
| N+1 tag queries in list_uploads | P1 | Fixed | Replaced per-upload tag fetch (1+N queries) with single JOIN batch query. |
| Sequential stats_overview queries | P2 | Fixed | Six independent COUNT queries now run concurrently via tokio::try_join!. |
| S3 full-file buffering on upload | P1 | Open | 2 GB upload = 2 GB RAM. Needs streaming multipart upload with backpressure. |
| Upload body buffered in memory | P1 | Open | Large upload bodies held entirely in memory before S3 write. Needs streaming body handling. |
| No connection rate limiting (general) | P3 | Open | Auth endpoints have IP-based rate limiting (10/min). General API endpoints lack rate limiting. |
| File descriptor leak via mem::forget | P0 | Fixed | Temp files leaked via mem::forget. Fixed with into_temp_path for deterministic cleanup. |
| Session lock contention | P2 | Mitigated | IngestManager.sessions uses tokio::sync::RwLock. Individual sessions use tokio::sync::Mutex. Flush operations drop the session lock before performing I/O. Stale session check uses atomic last_activity_ms without locking session mutexes. |
5.6 Remaining Performance Risks¶
S3 full-file buffering: The most significant open issue. A single 2 GB upload consumes 2 GB of server RAM. With 10 concurrent large uploads, the server requires 20 GB RAM just for upload buffering. The fix requires streaming multipart upload to S3 with bounded memory buffers and backpressure signaling to the client.
DB pool starvation: The default pool of 20 connections supports approximately 200 concurrent sessions at the current flush interval. Beyond this, flush operations queue behind the pool, increasing latency and sample buffer memory. Increasing the pool size requires corresponding PostgreSQL max_connections tuning.
Broadcast channel unbounded subscribers: The SSE live event endpoint creates broadcast receivers with no limit on subscriber count. A malicious client opening thousands of SSE connections could exhaust memory with broadcast channel buffers.
6. Recommendations¶
6.1 Platform Selection by Use Case¶
| Use Case | Recommended Platform | Codec | Max Config | Rationale |
|---|---|---|---|---|
| Full-spectrum industrial monitoring | ESP32-S3 | XAP | 4ch @96kHz | 82% headroom, WiFi built-in, PSRAM for large buffers |
| Battery-powered field sensor | RP2350 (Pico 2) | XAP | 4ch @96kHz | 54% headroom, lowest active power (25 mA), dual-core |
| Compact industrial node | STM32F411 | XAP | 4ch @48kHz | 60% headroom, proven M4F ecosystem, UART LTE-M |
| BLE wearable / beacon | nRF52840 | XAP | 2ch @48kHz | 58% headroom, BLE transport, ultra-low sleep (1.5 uA) |
| Cellular IoT (LTE-M) | nRF9160 | XAP | 2ch @48kHz | 55% headroom, integrated LTE-M modem |
| BLE + Thread mesh | STM32WB55 | XAP | 2ch @48kHz | 58% headroom, dual-protocol BLE + 802.15.4 |
| Voice-only / legacy sensor | STM32F103 | IMA-ADPCM | 2ch @24kHz | 78% headroom, ADPCM only, 20 KB SRAM limit |
| Low-cost WiFi sensor | ESP32-C3 | IMA-ADPCM | 4ch @96kHz | 98% headroom, ADPCM only, RISC-V, WiFi built-in |
| Education / prototyping | RP2040 (Pico) | IMA-ADPCM | 4ch @96kHz | 97% headroom, ADPCM only, lowest cost ($1) |
6.2 Server Scaling Recommendations¶
| Load Tier | Sessions | Server Config | DB Pool | Notes |
|---|---|---|---|---|
| Development | 1-10 | Single instance, 2 CPU | 10 | Default configuration sufficient |
| Small deployment | 10-50 | Single instance, 4 CPU | 20 | Default pool adequate |
| Medium deployment | 50-200 | Single instance, 8 CPU | 40 | Increase pool, monitor flush latency |
| Large deployment | 200-1000 | Multiple instances + LB | 60/instance | Requires streaming S3 upload fix, horizontal scaling |
| Enterprise | 1000+ | Kubernetes pods, auto-scale | Connection pooler (PgBouncer) | Requires all P1 issues resolved |
6.3 Priority Engineering Work¶
- P1 — Streaming S3 upload: Replace full-file buffering with streaming multipart upload. Eliminates the O(file_size) memory consumption. Critical for any deployment handling files larger than 100 MB.
- P1 — Streaming upload body: Implement backpressure-aware streaming from HTTP body to S3, never holding more than a bounded buffer (e.g., 4 MB) in memory.
- P2 — SSE subscriber limits: Cap broadcast channel receivers per session (e.g., 100). Return 429 when exceeded.
- P3 — General API rate limiting: Add Tower rate-limit middleware to all API endpoints, not just auth.
- Optimization — Cosine table extension: Consider extending XAP_PRECOMPUTE_MAX_N to 480 on targets with sufficient SRAM (ESP32-S3 with PSRAM) for 48kHz table-based encoding without DSP dependency. This would eliminate the 170x discontinuity for the 48kHz use case.
7. Related Documents¶
- Performance Profile -- DSP acceleration matrix, per-target CPU/memory budgets
- Codec Analysis -- 16 audio codecs compared across 5 MCU platforms
- RP2350 Feasibility -- 4ch 96kHz architecture, detailed CPU/memory budget
- Pico 2 Platform Guide -- RP2350 hardware setup and build
- STM32 Platform Guide -- F103/F411/WB55/WBA55 configuration
- ESP32 Platform Guide -- S3/C3 WiFi, ESP-IDF integration
- SDK Overview -- Rust-first embedded SDK architecture