Performance Profiling — DSP Acceleration & Resource Budgets¶
Xylolabs API — Performance profiling: DSP acceleration and resource budgets Revision: 2026-03-23
1. DSP Acceleration Matrix¶
Each platform's DSP capabilities determine codec performance. Speedup percentages give the reduction in MIPS relative to a pure C baseline implementation (for example, 80 -> 56 MIPS is ~30%).
| Platform | Core | Clock | DSP Extensions | FPU | XAP Speedup | ADPCM Speedup |
|---|---|---|---|---|---|---|
| RP2350 (Pico 2) | Cortex-M33 | 150 MHz | ARMv8-M DSP (SMLAD, SMLAL, QADD, SSAT) | Single-precision | ~30% (80->56 MIPS) | ~50% |
| RP2040 (Pico) | Cortex-M0+ | 133 MHz | None | None | N/A (not feasible) | Minimal |
| ESP32-S3 | 2x Xtensa LX7 | 240 MHz | 128-bit SIMD (PIE): 4x f32, 8x i16 vector | Single-precision | ~60% | ~40% |
| ESP32-C3 | RISC-V RV32IMC | 160 MHz | M extension (multiply only) | None | Marginal | Minimal |
| STM32F411 | Cortex-M4F | 100 MHz | FPU + DSP: SMLAD, barrel shifter | Single-precision | ~40% (float path) | ~30% |
| STM32F103 | Cortex-M3 | 72 MHz | None | None | N/A (not feasible) | Minimal |
| STM32WB55 | Cortex-M4F | 64 MHz | FPU + DSP (same as F411) | Single-precision | ~40% | ~30% |
| nRF52840 | Cortex-M4F | 64 MHz | FPU + DSP (same as F411) | Single-precision | ~30-40% | ~30% |
| nRF9160 | Cortex-M33 | 64 MHz | ARMv8-M DSP (same as RP2350) | Single-precision | ~30% | ~50% |
DSP Instruction Summary¶
ARMv8-M DSP (Cortex-M33: RP2350, nRF9160, STM32U585):
- SMLAL / UMLAL -- single-cycle 32x32->64 MAC for FIR and MDCT accumulation
- SMLAD / SMUAD -- dual 16x16->32 MAC, doubles throughput for 16-bit audio
- QADD / QSUB / SSAT / USAT -- saturating arithmetic, eliminates branch-based clipping
- SBFX / UBFX -- bit-field extract for XMBP binary protocol parsing
Cortex-M4F DSP (STM32F411, STM32WB55, nRF52840):
- All ARMv8-M DSP instructions above, plus:
- Single-precision FPU -- hardware float multiply-accumulate in 1-3 cycles
- SDIV / UDIV -- hardware integer divide in 2-12 cycles
Xtensa PIE (ESP32-S3):
- 128-bit SIMD -- 4x f32 or 8x i16 per instruction
- 16 x 128-bit dedicated vector registers
- Hardware AES/SHA offloads TLS from CPU
- PSRAM DMA for large audio buffer transfers
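To make the instruction semantics above concrete, here is a portable Rust model of three of them. It is a sketch for reasoning about results, not SDK code: the names `smlad_model`, `ssat16`, `ubfx`, and `dot_q15` are hypothetical, and on a Cortex-M33 or Cortex-M4F target each operation may map to a single instruction.

```rust
/// Portable model of SMLAD: acc + a[0]*b[0] + a[1]*b[1] on a pair of
/// 16-bit lanes. The accumulator wraps on overflow, as the 32-bit
/// register result does.
fn smlad_model(a: [i16; 2], b: [i16; 2], acc: i32) -> i32 {
    acc.wrapping_add(a[0] as i32 * b[0] as i32)
        .wrapping_add(a[1] as i32 * b[1] as i32)
}

/// Portable model of SSAT to 16 bits: clamp instead of branching on overflow.
fn ssat16(x: i32) -> i16 {
    x.clamp(i16::MIN as i32, i16::MAX as i32) as i16
}

/// Portable model of UBFX: extract `width` bits starting at bit `lsb`.
fn ubfx(word: u32, lsb: u32, width: u32) -> u32 {
    debug_assert!(width >= 1 && lsb + width <= 32);
    (word >> lsb) & (u32::MAX >> (32 - width))
}

/// Dot product of 16-bit samples, consuming two lanes per step the way an
/// SMLAD loop would. This sketch assumes both slices have the same even
/// length; a trailing odd sample is ignored.
fn dot_q15(a: &[i16], b: &[i16]) -> i32 {
    let mut acc = 0i32;
    for (pa, pb) in a.chunks_exact(2).zip(b.chunks_exact(2)) {
        acc = smlad_model([pa[0], pa[1]], [pb[0], pb[1]], acc);
    }
    acc
}
```

For instance, `ssat16(40_000)` clamps to `i16::MAX` without a branch in the source, and `dot_q15(&[1, 2, 3, 4], &[5, 6, 7, 8])` accumulates 5 + 12 + 21 + 32.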
2. Per-Target Performance Budget¶
RP2350 (Pico 2) — 4ch XAP @96kHz¶
Single core budget (Core 0 for codec, Core 1 for I/O):
| Component | MIPS (baseline) | MIPS (with DSP) | % of 150 MHz |
|---|---|---|---|
| I2S DMA handling | 2 | 2 | 1.3% |
| XAP MDCT forward | 50 | 35 | 23.3% |
| XAP quantize+pack | 15 | 10 | 6.7% |
| XMBP batch encode | 5 | 5 | 3.3% |
| HTTP transport | 10 | 10 | 6.7% |
| Sensor sampling (26ch) | 5 | 5 | 3.3% |
| Watchdog + housekeeping | 2 | 2 | 1.3% |
| TOTAL | 89 | 69 | 46.0% |
| Available headroom | 61 | 81 | 54.0% |
Dual-core split: Core 0 handles I2S DMA + XAP encoding (~47 MIPS with DSP, 31.3% of that core). Core 1 handles XMBP, HTTP, sensors, and housekeeping (~22 MIPS, 14.7% of that core). Combined load: ~69 MIPS, or 46.0% of a single 150 MHz core.
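The TOTAL and headroom rows of the RP2350 table are plain column sums, which the following sketch reproduces for the "with DSP" column. The `Component` type and helper names are hypothetical and exist only for this illustration; the figures are copied from the table.

```rust
/// One row of a CPU budget table: component name and MIPS with DSP enabled.
struct Component {
    name: &'static str,
    mips: u32,
}

/// "MIPS (with DSP)" column of the RP2350 4ch XAP @96kHz table.
const RP2350_WITH_DSP: [Component; 7] = [
    Component { name: "I2S DMA handling", mips: 2 },
    Component { name: "XAP MDCT forward", mips: 35 },
    Component { name: "XAP quantize+pack", mips: 10 },
    Component { name: "XMBP batch encode", mips: 5 },
    Component { name: "HTTP transport", mips: 10 },
    Component { name: "Sensor sampling (26ch)", mips: 5 },
    Component { name: "Watchdog + housekeeping", mips: 2 },
];

/// Sum of all component costs in MIPS.
fn total_mips(budget: &[Component]) -> u32 {
    budget.iter().map(|c| c.mips).sum()
}

/// Load as a percentage of the available MIPS.
fn utilization_pct(mips: u32, capacity_mips: u32) -> f64 {
    f64::from(mips) / f64::from(capacity_mips) * 100.0
}

/// Name of the most expensive component, if the budget is non-empty.
fn heaviest(budget: &[Component]) -> Option<&'static str> {
    budget.iter().max_by_key(|c| c.mips).map(|c| c.name)
}
```

With these rows, `total_mips(&RP2350_WITH_DSP)` is 69 and `utilization_pct(69, 150)` is 46.0 (up to floating-point rounding), which leaves 81 MIPS of headroom. `heaviest` identifies the MDCT as the first place to look for further savings.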
ESP32-S3 — 4ch XAP @96kHz¶
Dual-core budget (480 MIPS total):
| Component | MIPS (baseline) | MIPS (with PIE) | % of 480 MIPS |
|---|---|---|---|
| I2S DMA handling | 2 | 2 | 0.4% |
| XAP MDCT forward | 50 | 20 | 4.2% |
| XAP quantize+pack | 15 | 6 | 1.3% |
| WiFi stack (FreeRTOS) | 30 | 30 | 6.3% |
| XMBP batch encode | 5 | 5 | 1.0% |
| HTTP/TLS transport | 20 | 12 | 2.5% |
| Sensor sampling (26ch) | 5 | 5 | 1.0% |
| PSRAM DMA management | 3 | 3 | 0.6% |
| Watchdog + housekeeping | 2 | 2 | 0.4% |
| TOTAL | 132 | 85 | 17.7% |
| Available headroom | 348 | 395 | 82.3% |
PIE SIMD provides the largest absolute gain. TLS overhead is reduced by hardware AES/SHA acceleration.
STM32F411 — 4ch XAP @48kHz¶
Single-core budget (100 MIPS):
| Component | MIPS (baseline) | MIPS (with DSP) | % of 100 MHz |
|---|---|---|---|
| I2S DMA handling | 2 | 2 | 2.0% |
| XAP MDCT forward | 25 | 15 | 15.0% |
| XAP quantize+pack | 8 | 5 | 5.0% |
| XMBP batch encode | 3 | 3 | 3.0% |
| UART LTE-M1 transport | 8 | 8 | 8.0% |
| Sensor sampling (26ch) | 5 | 5 | 5.0% |
| Watchdog + housekeeping | 2 | 2 | 2.0% |
| TOTAL | 53 | 40 | 40.0% |
| Available headroom | 47 | 60 | 60.0% |
The M4F FPU enables the XAP floating-point encoder path, which is faster than fixed-point on this core. CMSIS-DSP arm_rfft_fast_f32 provides 3-5x speedup for the MDCT.
nRF52840 — 2ch XAP @48kHz¶
Single-core budget (64 MIPS):
| Component | MIPS (baseline) | MIPS (with DSP) | % of 64 MHz |
|---|---|---|---|
| I2S DMA handling | 1 | 1 | 1.6% |
| XAP MDCT forward | 12 | 7 | 10.9% |
| XAP quantize+pack | 4 | 3 | 4.7% |
| BLE GATT stack | 10 | 10 | 15.6% |
| XMBP batch encode | 2 | 2 | 3.1% |
| Sensor sampling (4ch) | 2 | 2 | 3.1% |
| Watchdog + housekeeping | 2 | 2 | 3.1% |
| TOTAL | 33 | 27 | 42.2% |
| Available headroom | 31 | 37 | 57.8% |
BLE stack overhead is significant. 4ch XAP @48kHz (~28 MIPS with DSP, 44% utilization) is feasible but tight.
STM32F103 — Sensor-only + ADPCM fallback¶
Single-core budget (72 MIPS, 20 KB SRAM):
| Component | MIPS (baseline) | MIPS (optimized) | % of 72 MHz |
|---|---|---|---|
| ADPCM encode 2ch @24kHz | 1 | 1 | 1.4% |
| XMBP batch encode | 2 | 2 | 2.8% |
| UART LTE-M1 transport | 8 | 8 | 11.1% |
| Sensor sampling (4ch) | 3 | 3 | 4.2% |
| Watchdog + housekeeping | 2 | 2 | 2.8% |
| TOTAL | 16 | 16 | 22.2% |
| Available headroom | 56 | 56 | 77.8% |
No DSP extension available. XAP is not feasible (32 KB encoder state exceeds 20 KB total SRAM). ADPCM at 2ch @24kHz is the maximum audio capability.
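The SRAM argument can be written as a quick feasibility check. This is a minimal sketch: the 8 KB per-channel figure comes from the memory budget notes in Section 3, while the constant and function names are hypothetical.

```rust
/// XAP encoder state per channel in KB (8 KB per channel, per the Section 3 notes).
const XAP_STATE_KB_PER_CHANNEL: u32 = 8;

/// Encoder state required for `channels` channels, in KB.
fn xap_state_kb(channels: u32) -> u32 {
    channels * XAP_STATE_KB_PER_CHANNEL
}

/// True when the encoder state alone fits in SRAM. This is a necessary
/// condition only: buffers, stack, and the SDK client also need memory.
fn xap_state_fits(total_sram_kb: u32, channels: u32) -> bool {
    xap_state_kb(channels) <= total_sram_kb
}
```

For the STM32F103, `xap_state_kb(4)` is 32, so `xap_state_fits(20, 4)` is false before any other allocation is considered.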
3. Memory Budget Per Target¶
All values in KB. "SDK Client" includes XylolabsClient state machine, session state, and configuration.
| Target | Total SRAM | SDK Client | XAP Encoder | Ring Buffer | XMBP | HTTP | Stack | Available |
|---|---|---|---|---|---|---|---|---|
| RP2350 | 520 KB | 12 KB | 8 KB | 32 KB | 16 KB | 4 KB | 16 KB | 432 KB |
| ESP32-S3 | 512 KB + 8MB PSRAM | 12 KB | 8 KB | 64 KB (PSRAM) | 16 KB | 8 KB | 16 KB | ~8 MB |
| STM32F411 | 128 KB | 12 KB | 8 KB | 8 KB | 4 KB | 4 KB | 8 KB | 84 KB |
| STM32U585 | 786 KB | 12 KB | 8 KB | 32 KB | 16 KB | 4 KB | 16 KB | 698 KB |
| nRF52840 | 256 KB | 12 KB | 8 KB | 16 KB | 8 KB | 4 KB | 8 KB | 200 KB |
| nRF9160 | 256 KB | 12 KB | 8 KB | 16 KB | 8 KB | 4 KB | 8 KB | 200 KB |
| STM32WB55 | 256 KB | 12 KB | 8 KB | 8 KB | 4 KB | 4 KB | 8 KB | 212 KB |
| STM32F103 | 20 KB | 4 KB | -- | 4 KB | 2 KB | 2 KB | 4 KB | 4 KB |
| RP2040 | 264 KB | 4 KB | -- | 8 KB | 4 KB | 4 KB | 8 KB | 236 KB |
| ESP32-C3 | 400 KB | 4 KB | -- | 8 KB | 4 KB | 4 KB | 8 KB | 372 KB |
Notes¶
- XAP Encoder: 8 KB per channel for XAP encoder state; table shows 4ch total (32 KB) amortized. Platforms marked `--` cannot run XAP.
- Ring Buffer: Audio DMA double buffer for I2S capture. ESP32-S3 places this in PSRAM via DMA.
- Stack: Dual-core platforms (RP2350, ESP32-S3) allocate 8 KB per core.
- STM32F103: Extremely constrained. Only sensor + ADPCM 2ch fits. The 4 KB "available" is the absolute minimum for application logic.
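The "Available" column is the total SRAM minus the reserved regions. The sketch below recomputes it for two rows of the table; `MemoryBudget` and its field names are hypothetical and only mirror the table columns.

```rust
/// One row of the memory budget table. All values are in KB.
struct MemoryBudget {
    total_kb: u32,
    sdk_client_kb: u32,
    xap_encoder_kb: u32, // 0 for platforms marked "--"
    ring_buffer_kb: u32,
    xmbp_kb: u32,
    http_kb: u32,
    stack_kb: u32,
}

impl MemoryBudget {
    /// Sum of all reserved regions.
    fn reserved_kb(&self) -> u32 {
        self.sdk_client_kb
            + self.xap_encoder_kb
            + self.ring_buffer_kb
            + self.xmbp_kb
            + self.http_kb
            + self.stack_kb
    }

    /// Remaining SRAM, or `None` if the reservations exceed the total.
    fn available_kb(&self) -> Option<u32> {
        self.total_kb.checked_sub(self.reserved_kb())
    }
}

/// RP2350 row of the table.
const RP2350: MemoryBudget = MemoryBudget {
    total_kb: 520, sdk_client_kb: 12, xap_encoder_kb: 8, ring_buffer_kb: 32,
    xmbp_kb: 16, http_kb: 4, stack_kb: 16,
};

/// STM32F103 row of the table (no XAP encoder).
const STM32F103: MemoryBudget = MemoryBudget {
    total_kb: 20, sdk_client_kb: 4, xap_encoder_kb: 0, ring_buffer_kb: 4,
    xmbp_kb: 2, http_kb: 2, stack_kb: 4,
};
```

`RP2350.available_kb()` yields `Some(432)` and `STM32F103.available_kb()` yields `Some(4)`, matching the table. Returning an `Option` makes an over-committed configuration show up as `None` instead of an underflow.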
4. API Server Concurrency Profile¶
Connection Handling¶
The Xylolabs API server is built on the Tokio async runtime with Axum:
| Parameter | Value | Notes |
|---|---|---|
| Runtime | Tokio multi-threaded | Worker threads = CPU cores |
| DB pool | 20 connections (configurable) | DATABASE_MAX_CONNECTIONS env var |
| Per-request memory | ~1-2 KB (metadata) | Excluding upload body |
| Upload body limit | Up to 2 GB | Full-file buffering (to be fixed) |
| SSE connections | Unbounded | Per-session broadcast channels |
| HTTP keep-alive | 75 seconds | Axum default |
Ingest Pipeline Throughput¶
| Stage | Latency | Notes |
|---|---|---|
| XMBP decode | <100 us per 2kHz batch | Benchmarked on x86-64 server |
| XAP frame decode | ~5 us per frame | XAP decoder (server-side, no constraints) |
| zstd compression | ~200 us per chunk | Offloaded to spawn_blocking |
| S3 write | ~5-20 ms per chunk | Network-bound, MinIO local ~2 ms |
| DB insert | ~1-2 ms per record | Batched within flush window |
| Flush window | 10 seconds (configurable) | Accumulates samples before write |
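The flush window in the last row can be pictured as a buffer that collects records and releases them as one batch once the window has elapsed. The following is a minimal sketch of that idea, assuming a hypothetical `FlushBuffer` type; the server's actual types are not shown in this document. The current time is passed in explicitly so the behaviour is easy to test.

```rust
use std::time::{Duration, Instant};

/// Accumulates records and releases them as one batch per flush window.
struct FlushBuffer<T> {
    window: Duration,
    window_start: Instant,
    pending: Vec<T>,
}

impl<T> FlushBuffer<T> {
    /// Create an empty buffer whose first window starts at `now`.
    fn new(window: Duration, now: Instant) -> Self {
        Self { window, window_start: now, pending: Vec::new() }
    }

    /// Queue a record. Returns the accumulated batch (including this
    /// record) when the window has elapsed, and starts a new window.
    fn push(&mut self, record: T, now: Instant) -> Option<Vec<T>> {
        self.pending.push(record);
        if now.duration_since(self.window_start) >= self.window {
            self.window_start = now;
            Some(std::mem::take(&mut self.pending))
        } else {
            None
        }
    }
}
```

Batching this way trades up to one window of write latency for fewer S3 writes and DB inserts, which is where the per-operation costs in the table are highest.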
Concurrent Session Capacity¶
| Scenario | Sessions | Audio Streams | Sensor Streams | Server CPU | DB Load |
|---|---|---|---|---|---|
| Light | 10 | 10 x 2ch @16kHz | 40 @100Hz | <5% | Low |
| Standard | 50 | 50 x 4ch @48kHz | 200 @100Hz | ~20% | Medium |
| Heavy | 100 | 100 x 4ch @96kHz | 2600 @100Hz | ~60% | High |
| Limit | ~200 | Limited by DB pool | Limited by DB pool | ~90% | Saturated |
Bottlenecks Identified (Performance Review)¶
| Issue | Severity | Status | Fix |
|---|---|---|---|
| N+1 tag queries in `list_uploads` | P1 | Fixed | Batch fetch with single JOIN query |
| Sequential `stats_overview` queries | P2 | Fixed | `tokio::try_join!` parallel execution |
| S3 full-file buffering on upload | P1 | TODO | Streaming multipart upload |
| `ConfigManager` blocking `RwLock` | P2 | Fixed | Migrated to `tokio::sync::RwLock` |
| Upload body buffered in memory | P1 | TODO | Streaming body with backpressure |
| No connection rate limiting | P3 | TODO | Tower rate-limit middleware |
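The N+1 fix replaces one tag query per upload with a single JOIN whose rows are then grouped in memory. The sketch below shows only that grouping step, with no database access. The `(upload_id, tag)` row shape and the `group_tags` name are assumptions made for illustration, not the server's actual schema or code.

```rust
use std::collections::BTreeMap;

/// Group `(upload_id, tag)` rows, as a single JOIN might return them,
/// into one tag list per upload. Tag order within an upload follows
/// the order of the input rows.
fn group_tags(rows: &[(u64, &str)]) -> BTreeMap<u64, Vec<String>> {
    let mut by_upload: BTreeMap<u64, Vec<String>> = BTreeMap::new();
    for &(upload_id, tag) in rows {
        by_upload.entry(upload_id).or_default().push(tag.to_string());
    }
    by_upload
}
```

The cost is one pass over the joined rows, independent of how many uploads are listed, instead of one database round trip per upload.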
5. Burn-In Test Results¶
Native Platform (Apple M4)¶
Measured on Apple M4 with native Rust compilation (not cross-compiled):
| Metric | Value | Notes |
|---|---|---|
| XAP encode (per frame) | avg = 9 us | 10 ms frame budget -> 0.09% utilization |
| Client tick (full cycle) | avg = 56 us | Includes XMBP encode + buffer management |
| MCU headroom | 99.3% | Validates algorithmic efficiency |
| Memory (peak RSS) | ~2 MB | SDK client + test harness |
| Frames dropped | 0 | Across all scenarios |
Burn-In Scenarios¶
| Scenario | Duration | Audio Config | Sensors | Devices | Result |
|---|---|---|---|---|---|
| standard | 60s | 4ch @16kHz | 4 @100Hz | 1 | PASS |
| stress | 120s | 4ch @96kHz | 26 @100Hz | 1 | PASS |
| endurance | 120s+ | 2ch @16kHz | 4 @10Hz | 1 | PASS |
| multi-device | 60s | 4ch @16kHz | 4 @100Hz | 10 | PASS |
QEMU ARM Throttled¶
Simulated RP2350 performance using ARM QEMU with CPU throttling to approximate 150 MHz Cortex-M33:
| Metric | Value | Notes |
|---|---|---|
| XAP encode (per frame) | avg = ~650 us | ~6.5% of 10 ms budget |
| Client tick (full cycle) | avg = ~3.8 ms | ~38% of 10 ms budget |
| Estimated MCU headroom | ~62% | Conservative (QEMU overhead included) |
The QEMU results are broadly consistent with the CPU budget analysis in Section 2 (RP2350 at ~59% baseline, ~46% with DSP).
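The percentages in the Notes columns of the burn-in tables are the measured time divided by the 10 ms frame budget. A small sketch of that arithmetic, with hypothetical helper names:

```rust
use std::time::Duration;

/// Share of the frame budget consumed by a measured duration, in percent.
fn budget_pct(measured: Duration, frame_budget: Duration) -> f64 {
    measured.as_secs_f64() / frame_budget.as_secs_f64() * 100.0
}

/// Remaining share of the frame budget, in percent.
fn headroom_pct(measured: Duration, frame_budget: Duration) -> f64 {
    100.0 - budget_pct(measured, frame_budget)
}
```

With a 10 ms budget, a 650 us encode is about 6.5% and a 3.8 ms client tick leaves about 62% headroom, as reported for the throttled QEMU run.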
6. CMSIS-DSP Integration¶
The SDK automatically links CMSIS-DSP on Cortex-M targets when XYLOLABS_USE_CMSIS_DSP=1 is set.
Key Optimized Functions¶
| CMSIS-DSP Function | Used For | Speedup vs C |
|---|---|---|
| `arm_rfft_fast_f32` | MDCT / spectral analysis | 3-5x |
| `arm_fir_f32` / `arm_fir_q15` | FIR downsampling filter | 2-4x |
| `arm_dot_prod_f32` | Inner products in quantization | 2-3x |
| `arm_scale_f32` | Gain normalization | 2x |
| `arm_fill_f32` / `arm_copy_f32` | Buffer management | 1.5-2x |
Encoder Path Selection¶
| Core | FPU | Recommended Path | Rationale |
|---|---|---|---|
| Cortex-M33 (RP2350) | Yes (single) | Fixed-point (q15) | DSP SIMD instructions (SMLAD) are optimized for 16-bit fixed-point |
| Cortex-M4F (F411) | Yes (single) | Floating-point (f32) | FPU makes float path faster than fixed-point |
| Cortex-M4F (nRF52840) | Yes (single) | Floating-point (f32) | Same as F411 |
| Cortex-M33 (nRF9160) | Yes (single) | Fixed-point (q15) | Same as RP2350 |
| Xtensa LX7 (ESP32-S3) | Yes (single) | Floating-point with PIE | PIE SIMD on f32 vectors |
| Cortex-M0+ (RP2040) | No | N/A (no XAP) | No DSP, no FPU |
| Cortex-M3 (F103) | No | N/A (no XAP) | No DSP, no FPU |
| RISC-V (ESP32-C3) | No | N/A (no XAP) | M extension only (multiply) |
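The selection table can be encoded as a lookup, which is handy when a build script or test needs to assert that the expected path is in use. This is a sketch only: the `Core` and `EncoderPath` enums and `recommended_path` are hypothetical names, not SDK types.

```rust
/// CPU cores covered by the selection table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Core {
    CortexM0Plus,
    CortexM3,
    CortexM4F,
    CortexM33,
    XtensaLx7,
    RiscVRv32Imc,
}

/// XAP encoder paths named in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EncoderPath {
    FixedQ15,
    FloatF32,
    FloatF32Pie,
}

/// Recommended XAP encoder path per the table; `None` means XAP is not
/// supported on that core.
fn recommended_path(core: Core) -> Option<EncoderPath> {
    match core {
        Core::CortexM33 => Some(EncoderPath::FixedQ15),
        Core::CortexM4F => Some(EncoderPath::FloatF32),
        Core::XtensaLx7 => Some(EncoderPath::FloatF32Pie),
        Core::CortexM0Plus | Core::CortexM3 | Core::RiscVRv32Imc => None,
    }
}
```

Because the `match` is exhaustive, adding a new core to the enum forces a decision about its encoder path at compile time.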
7. Enabling DSP Features Per Target¶
Rust SDK (Cargo Feature Flags)¶
The Rust SDK (crates/xylolabs-sdk/) exposes two DSP feature flags:
| Feature | Targets | Effect |
|---|---|---|
| `cmsis-dsp` | Cortex-M33 (RP2350, nRF9160), Cortex-M4F (STM32F411, nRF52840, STM32WB55) | Enables CMSIS-DSP optimized MDCT and FIR paths |
| `esp32-simd` | ESP32-S3 (Xtensa LX7) | Enables PIE SIMD optimized MDCT paths |
Per-target Cargo.toml examples:
```toml
# RP2350 / nRF9160 (Cortex-M33)
xylolabs-sdk = { path = "../../crates/xylolabs-sdk", features = ["xap", "cmsis-dsp"] }

# STM32F411 / nRF52840 / STM32WB55 (Cortex-M4F)
xylolabs-sdk = { path = "../../crates/xylolabs-sdk", features = ["xap", "cmsis-dsp"] }

# ESP32-S3 (Xtensa LX7 with PIE)
xylolabs-sdk = { path = "../../crates/xylolabs-sdk", features = ["xap", "esp32-simd"] }

# STM32F103 / ESP32-C3 (no DSP -- ADPCM only)
xylolabs-sdk = { path = "../../crates/xylolabs-sdk", default-features = false, features = ["adpcm"] }
```
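Inside a crate, Cargo feature flags such as these are typically consumed through `cfg` checks. The sketch below shows that general mechanism using the feature names from the table; how the SDK actually structures its backends is not described in this document, and `mdct_backend` is a hypothetical function.

```rust
/// Name of the MDCT backend compiled into this build.
/// The function and the returned labels are illustrative, not SDK API.
fn mdct_backend() -> &'static str {
    if cfg!(feature = "cmsis-dsp") {
        // Selected by features = ["xap", "cmsis-dsp"].
        "cmsis-dsp"
    } else if cfg!(feature = "esp32-simd") {
        // Selected by features = ["xap", "esp32-simd"].
        "esp32-simd"
    } else {
        // No DSP feature enabled: portable fallback.
        "portable"
    }
}
```

`cfg!` is evaluated at compile time, so a build without either feature always takes the portable branch and the other branches cost nothing at run time.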
C SDK (Build Defines)¶
The C SDK uses compile-time defines in config.h, auto-detected from compiler flags:
| Define | Targets | Auto-detect Condition |
|---|---|---|
| `XYLOLABS_USE_CMSIS_DSP=1` | Cortex-M33, Cortex-M4F | `__ARM_FEATURE_DSP` defined |
| `XYLOLABS_USE_ESP32S3_SIMD=1` | ESP32-S3 | `__XTENSA__` + `CONFIG_IDF_TARGET_ESP32S3` |
Override explicitly via CMake if auto-detection is insufficient:
```cmake
# CMake -- force-enable CMSIS-DSP
target_compile_definitions(my_firmware PRIVATE XYLOLABS_USE_CMSIS_DSP=1)

# CMake -- force-enable ESP32-S3 SIMD
target_compile_definitions(my_firmware PRIVATE XYLOLABS_USE_ESP32S3_SIMD=1)
```
Summary: Which Flag for Which Target¶
| Target | Rust Feature | C Define | Expected Speedup |
|---|---|---|---|
| RP2350 (Cortex-M33) | `cmsis-dsp` | `XYLOLABS_USE_CMSIS_DSP=1` | ~30% |
| STM32F411 (Cortex-M4F) | `cmsis-dsp` | `XYLOLABS_USE_CMSIS_DSP=1` | ~40% |
| nRF52840 (Cortex-M4F) | `cmsis-dsp` | `XYLOLABS_USE_CMSIS_DSP=1` | ~30-40% |
| nRF9160 (Cortex-M33) | `cmsis-dsp` | `XYLOLABS_USE_CMSIS_DSP=1` | ~30% |
| STM32WB55 (Cortex-M4F) | `cmsis-dsp` | `XYLOLABS_USE_CMSIS_DSP=1` | ~40% |
| ESP32-S3 (Xtensa LX7) | `esp32-simd` | `XYLOLABS_USE_ESP32S3_SIMD=1` | ~60% |
| STM32F103 / RP2040 / ESP32-C3 | N/A | N/A | No DSP available |
8. Recommendations¶
Platform Selection by Use Case¶
| Use Case | Recommended Platform | Codec | Rationale |
|---|---|---|---|
| 4ch @96kHz full-spectrum | RP2350, ESP32-S3 | XAP | The only platforms with sufficient compute and bandwidth |
| 4ch @48kHz standard | STM32F411, RP2350, ESP32-S3 | XAP | F411 at 40% CPU with DSP |
| 2ch @48kHz compact | nRF52840 | XAP | BLE transport, 42% CPU with DSP |
| Sensor-only (no audio) | Any platform | N/A | XMBP metadata only |
| Voice/ADPCM fallback | STM32F103, ESP32-C3 | IMA-ADPCM | No FPU/DSP required |
DSP Optimization Checklist¶
- Always enable CMSIS-DSP on Cortex-M targets (`XYLOLABS_USE_CMSIS_DSP=1`). Drop-in replacement, no code changes needed.
- Use XAP floating-point path on Cortex-M4F (STM32F411, nRF52840). The FPU makes float faster than fixed-point.
- Use XAP fixed-point path on Cortex-M33 (RP2350, nRF9160). DSP SIMD instructions are optimized for 16-bit operations.
- Enable PIE intrinsics on ESP32-S3. Auto-vectorization helps, but explicit PIE intrinsics in the XAP hot path yield 20-30% additional gain.
- For 4ch @96kHz, only RP2350 and ESP32-S3 have sufficient compute; STM32F411 is budgeted for 4ch @48kHz.
- Cortex-M3 and Cortex-M0+: ADPCM only. XAP encoder state (32 KB for 4ch) exceeds available SRAM on STM32F103 (20 KB).
- ESP32-C3 (RISC-V): sensor-only or ADPCM fallback. The M extension provides multiply but no SIMD or DSP acceleration.
Power vs Performance Tradeoffs¶
| Platform | Active Power | Sleep Mode | Best For |
|---|---|---|---|
| RP2350 | ~25 mA @150MHz | ~1.3 mA (dormant) | Battery-powered field sensors |
| ESP32-S3 | ~80 mA @240MHz (WiFi) | ~10 uA (deep sleep) | Mains-powered, WiFi available |
| STM32F411 | ~30 mA @100MHz | ~2.4 uA (standby) | Industrial, low-power |
| nRF52840 | ~5 mA @64MHz | ~1.5 uA (system off) | BLE wearable / beacon |
| STM32F103 | ~25 mA @72MHz | ~3.6 uA (standby) | Legacy sensor nodes |
9. Related Documents¶
- Codec Analysis -- 16 audio codecs compared across 5 MCU platforms
- RP2350 Feasibility -- 4ch 96kHz architecture, CPU/memory budget
- Pico 2 Platform Guide -- RP2350 hardware setup and build
- STM32 Platform Guide -- F103/F411/WB55/WBA55 configuration
- ESP32 Platform Guide -- S3/C3 WiFi, ESP-IDF integration
- SDK Overview -- Rust-first embedded SDK architecture