
Performance Profiling — DSP Acceleration & Resource Budgets

Xylolabs API — Performance profiling: DSP acceleration and resource budgets Revision: 2026-03-23


1. DSP Acceleration Matrix

Each platform's DSP capabilities determine codec performance. Speedup percentages are measured against pure C baseline implementations.

Platform        | Core           | Clock   | DSP Extensions                            | FPU              | XAP Speedup        | ADPCM Speedup
----------------|----------------|---------|-------------------------------------------|------------------|--------------------|--------------
RP2350 (Pico 2) | Cortex-M33     | 150 MHz | ARMv8-M DSP (SMLAD, SMLAL, QADD, SSAT)    | Single-precision | ~30% (80->56 MIPS) | ~50%
RP2040 (Pico)   | Cortex-M0+     | 133 MHz | None                                      | None             | N/A (not feasible) | Minimal
ESP32-S3        | 2x Xtensa LX7  | 240 MHz | 128-bit SIMD (PIE): 4x f32, 8x i16 vector | Single-precision | ~60%               | ~40%
ESP32-C3        | RISC-V RV32IMC | 160 MHz | M extension (multiply only)               | None             | Marginal           | Minimal
STM32F411       | Cortex-M4F     | 100 MHz | FPU + DSP: SMLAD, barrel shifter          | Single-precision | ~40% (float path)  | ~30%
STM32F103       | Cortex-M3      | 72 MHz  | None                                      | None             | N/A (not feasible) | Minimal
STM32WB55       | Cortex-M4F     | 64 MHz  | FPU + DSP (same as F411)                  | Single-precision | ~40%               | ~30%
nRF52840        | Cortex-M4F     | 64 MHz  | FPU + DSP (same as F411)                  | Single-precision | ~30-40%            | ~30%
nRF9160         | Cortex-M33     | 64 MHz  | ARMv8-M DSP (same as RP2350)              | Single-precision | ~30%               | ~50%
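
The speedup columns appear to express the reduction in MIPS cost relative to the C baseline, which is how the 80->56 MIPS figure for the RP2350 maps to ~30%. A minimal sketch of that arithmetic follows; the function name `speedup_percent` is illustrative and not part of the SDK.

```rust
/// Reduction in MIPS cost relative to a pure C baseline, as a percentage.
/// Returns `None` when the baseline is not positive, since the ratio is undefined.
/// (Illustrative helper; not an SDK function.)
pub fn speedup_percent(baseline_mips: f64, optimized_mips: f64) -> Option<f64> {
    if baseline_mips <= 0.0 {
        return None;
    }
    Some((baseline_mips - optimized_mips) / baseline_mips * 100.0)
}
```

For the RP2350 row, `speedup_percent(80.0, 56.0)` evaluates to approximately 30.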

DSP Instruction Summary

ARMv8-M DSP (Cortex-M33: RP2350, nRF9160, STM32U585):

  • SMLAL / UMLAL -- single-cycle 32x32->64 MAC for FIR and MDCT accumulation
  • SMLAD / SMUAD -- dual 16x16->32 MAC, doubles throughput for 16-bit audio
  • QADD / QSUB / SSAT / USAT -- saturating arithmetic, eliminates branch-based clipping
  • SBFX / UBFX -- bit-field extract for XMBP binary protocol parsing

Cortex-M4F DSP (STM32F411, STM32WB55, nRF52840). Supports the same DSP instructions listed above for the Cortex-M33, plus:

  • Single-precision FPU -- hardware float multiply-accumulate in 1-3 cycles
  • SDIV / UDIV -- hardware integer divide in 2-12 cycles

Xtensa PIE (ESP32-S3):

  • 128-bit SIMD -- 4x f32 or 8x i16 per instruction
  • 16 x 128-bit dedicated vector registers
  • Hardware AES/SHA offloads TLS from CPU
  • PSRAM DMA for large audio buffer transfers
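
To make the dual-MAC and saturation entries concrete, the sketch below models in portable Rust what SMLAD and a 16-bit SSAT compute. It illustrates the arithmetic only: the helper names (`pack_q15`, `smlad`, `ssat16`, `dot_q15`) are hypothetical, and on real hardware the compiler or CMSIS would emit the actual instructions.

```rust
/// Pack two signed 16-bit samples into one 32-bit word (low halfword first).
pub fn pack_q15(lo: i16, hi: i16) -> u32 {
    (lo as u16 as u32) | ((hi as u16 as u32) << 16)
}

/// Portable model of SMLAD: multiply the two low halfwords and the two high
/// halfwords as signed 16-bit values, then add both products to `acc`.
/// Accumulation wraps on overflow in this model.
pub fn smlad(x: u32, y: u32, acc: i32) -> i32 {
    let x_lo = x as u16 as i16 as i32;
    let x_hi = (x >> 16) as u16 as i16 as i32;
    let y_lo = y as u16 as i16 as i32;
    let y_hi = (y >> 16) as u16 as i16 as i32;
    acc.wrapping_add(x_lo * y_lo).wrapping_add(x_hi * y_hi)
}

/// Portable model of a signed saturate to 16 bits: the value is clamped
/// to the i16 range, so the caller needs no overflow branches.
pub fn ssat16(value: i32) -> i16 {
    value.clamp(i16::MIN as i32, i16::MAX as i32) as i16
}

/// Dot product of two 16-bit sample slices, consuming two samples per step.
/// An odd trailing sample and any length mismatch are ignored in this sketch.
pub fn dot_q15(a: &[i16], b: &[i16]) -> i32 {
    a.chunks_exact(2)
        .zip(b.chunks_exact(2))
        .fold(0i32, |acc, (pa, pb)| {
            smlad(pack_q15(pa[0], pa[1]), pack_q15(pb[0], pb[1]), acc)
        })
}
```

Each `smlad` call handles two sample pairs, which is where the "doubles throughput for 16-bit audio" claim comes from: a four-tap dot product needs two steps instead of four.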


2. Per-Target Performance Budget

RP2350 (Pico 2) — 4ch XAP @96kHz

Budget measured against a single 150 MHz core (in deployment, Core 0 runs the codec and Core 1 runs I/O):

Component               | MIPS (baseline) | MIPS (with DSP) | % of 150MHz
------------------------|-----------------|-----------------|------------
I2S DMA handling        |        2        |        2        |    1.3%
XAP MDCT forward        |       50        |       35        |   23.3%
XAP quantize+pack       |       15        |       10        |    6.7%
XMBP batch encode       |        5        |        5        |    3.3%
HTTP transport          |       10        |       10        |    6.7%
Sensor sampling (26ch)  |        5        |        5        |    3.3%
Watchdog + housekeeping |        2        |        2        |    1.3%
------------------------|-----------------|-----------------|------------
TOTAL                   |       89        |       69        |   46.0%
Available headroom      |       61        |       81        |   54.0%

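Dual-core split: Core 0 handles I2S DMA + XAP encoding (2 + 35 + 10 = ~47 MIPS with DSP, 31.3% of one core). Core 1 handles XMBP, HTTP, sensors, and housekeeping (~22 MIPS, 14.7% of one core). Combined load: ~69 MIPS, or 46.0% of a single 150 MHz core, matching the table total.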
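
The totals in the RP2350 table can be reproduced by summing the per-component costs and dividing by the core clock. The sketch below does this for the "with DSP" column, assuming 1 MIPS per MHz, which is the convention the budget tables appear to use. The type and function names are illustrative, not SDK APIs.

```rust
/// One row of a CPU budget table (illustrative type, not part of the SDK).
pub struct BudgetLine {
    pub name: &'static str,
    pub mips: u32,
}

/// Sum of all component costs in MIPS.
pub fn total_mips(lines: &[BudgetLine]) -> u32 {
    lines.iter().map(|line| line.mips).sum()
}

/// Utilization as a percentage of the core clock, assuming 1 MIPS per MHz.
pub fn utilization_percent(mips: u32, core_mhz: u32) -> f64 {
    f64::from(mips) / f64::from(core_mhz) * 100.0
}

/// The "with DSP" column of the RP2350 table, copied row by row.
pub fn rp2350_with_dsp() -> Vec<BudgetLine> {
    vec![
        BudgetLine { name: "I2S DMA handling", mips: 2 },
        BudgetLine { name: "XAP MDCT forward", mips: 35 },
        BudgetLine { name: "XAP quantize+pack", mips: 10 },
        BudgetLine { name: "XMBP batch encode", mips: 5 },
        BudgetLine { name: "HTTP transport", mips: 10 },
        BudgetLine { name: "Sensor sampling (26ch)", mips: 5 },
        BudgetLine { name: "Watchdog + housekeeping", mips: 2 },
    ]
}
```

Summing the rows gives 69 MIPS, which is 46% of 150 MHz and leaves 81 MIPS of headroom. The same helpers apply to the other per-target tables by swapping the rows and the clock.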

ESP32-S3 — 4ch XAP @96kHz

Dual-core budget (480 MIPS total):

Component               | MIPS (baseline) | MIPS (with PIE) | % of 480 MIPS
------------------------|-----------------|-----------------|--------------
I2S DMA handling        |        2        |        2        |    0.4%
XAP MDCT forward        |       50        |       20        |    4.2%
XAP quantize+pack       |       15        |        6        |    1.3%
WiFi stack (FreeRTOS)   |       30        |       30        |    6.3%
XMBP batch encode       |        5        |        5        |    1.0%
HTTP/TLS transport      |       20        |       12        |    2.5%
Sensor sampling (26ch)  |        5        |        5        |    1.0%
PSRAM DMA management    |        3        |        3        |    0.6%
Watchdog + housekeeping |        2        |        2        |    0.4%
------------------------|-----------------|-----------------|--------------
TOTAL                   |      132        |       85        |   17.7%
Available headroom      |      348        |      395        |   82.3%

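PIE SIMD provides the largest absolute gain of the profiled targets: XAP encoding (MDCT plus quantize+pack) drops from 65 to 26 MIPS. Hardware AES/SHA acceleration reduces HTTP/TLS transport overhead from 20 to 12 MIPS.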

STM32F411 — 4ch XAP @48kHz

Single-core budget (100 MIPS):

Component               | MIPS (baseline) | MIPS (with DSP) | % of 100MHz
------------------------|-----------------|-----------------|------------
I2S DMA handling        |        2        |        2        |    2.0%
XAP MDCT forward        |       25        |       15        |   15.0%
XAP quantize+pack       |        8        |        5        |    5.0%
XMBP batch encode       |        3        |        3        |    3.0%
UART LTE-M1 transport   |        8        |        8        |    8.0%
Sensor sampling (26ch)  |        5        |        5        |    5.0%
Watchdog + housekeeping |        2        |        2        |    2.0%
------------------------|-----------------|-----------------|------------
TOTAL                   |       53        |       40        |   40.0%
Available headroom      |       47        |       60        |   60.0%

The M4F FPU enables the XAP floating-point encoder path, which is faster than fixed-point on this core. CMSIS-DSP arm_rfft_fast_f32 provides 3-5x speedup for the MDCT.

nRF52840 — 2ch XAP @48kHz

Single-core budget (64 MIPS):

Component               | MIPS (baseline) | MIPS (with DSP) | % of 64MHz
------------------------|-----------------|-----------------|------------
I2S DMA handling        |        1        |        1        |    1.6%
XAP MDCT forward        |       12        |        7        |   10.9%
XAP quantize+pack       |        4        |        3        |    4.7%
BLE GATT stack          |       10        |       10        |   15.6%
XMBP batch encode       |        2        |        2        |    3.1%
Sensor sampling (4ch)   |        2        |        2        |    3.1%
Watchdog + housekeeping |        2        |        2        |    3.1%
------------------------|-----------------|-----------------|------------
TOTAL                   |       33        |       27        |   42.2%
Available headroom      |       31        |       37        |   57.8%

BLE stack overhead is significant: at 10 MIPS (15.6%) it is the largest single component once DSP is enabled. 4ch XAP @48kHz (~28 MIPS with DSP, 44% utilization) is feasible but tight.

STM32F103 — Sensor-only + ADPCM fallback

Single-core budget (72 MIPS, 20 KB SRAM):

Component               | MIPS (baseline) | MIPS (optimized) | % of 72MHz
------------------------|-----------------|------------------|------------
ADPCM encode 2ch @24kHz |        1        |         1        |    1.4%
XMBP batch encode       |        2        |         2        |    2.8%
UART LTE-M1 transport   |        8        |         8        |   11.1%
Sensor sampling (4ch)   |        3        |         3        |    4.2%
Watchdog + housekeeping |        2        |         2        |    2.8%
------------------------|-----------------|------------------|------------
TOTAL                   |       16        |        16        |   22.2%
Available headroom      |       56        |        56        |   77.8%

No DSP extension available. XAP is not feasible (32 KB encoder state exceeds 20 KB total SRAM). ADPCM at 2ch @24kHz is the maximum audio capability.


3. Memory Budget Per Target

All values in KB. "SDK Client" includes XylolabsClient state machine, session state, and configuration.

Target    | Total SRAM          | SDK Client | XAP Encoder | Ring Buffer   | XMBP  | HTTP | Stack | Available
----------|---------------------|------------|-------------|---------------|-------|------|-------|----------
RP2350    | 520 KB              | 12 KB      | 8 KB        | 32 KB         | 16 KB | 4 KB | 16 KB | 432 KB
ESP32-S3  | 512 KB + 8 MB PSRAM | 12 KB      | 8 KB        | 64 KB (PSRAM) | 16 KB | 8 KB | 16 KB | ~8 MB
STM32F411 | 128 KB              | 12 KB      | 8 KB        | 8 KB          | 4 KB  | 4 KB | 8 KB  | 84 KB
STM32U585 | 786 KB              | 12 KB      | 8 KB        | 32 KB         | 16 KB | 4 KB | 16 KB | 698 KB
nRF52840  | 256 KB              | 12 KB      | 8 KB        | 16 KB         | 8 KB  | 4 KB | 8 KB  | 200 KB
nRF9160   | 256 KB              | 12 KB      | 8 KB        | 16 KB         | 8 KB  | 4 KB | 8 KB  | 200 KB
STM32WB55 | 256 KB              | 12 KB      | 8 KB        | 8 KB          | 4 KB  | 4 KB | 8 KB  | 212 KB
STM32F103 | 20 KB               | 4 KB       | --          | 4 KB          | 2 KB  | 2 KB | 4 KB  | 4 KB
RP2040    | 264 KB              | 4 KB       | --          | 8 KB          | 4 KB  | 4 KB | 8 KB  | 236 KB
ESP32-C3  | 400 KB              | 4 KB       | --          | 8 KB          | 4 KB  | 4 KB | 8 KB  | 372 KB

Notes

  • XAP Encoder: 8 KB per channel for XAP encoder state; table shows 4ch total (32 KB) amortized. Platforms marked -- cannot run XAP.
  • Ring Buffer: Audio DMA double buffer for I2S capture. ESP32-S3 places this in PSRAM via DMA.
  • Stack: Dual-core platforms (RP2350, ESP32-S3) allocate 8 KB per core.
  • STM32F103: Extremely constrained. Only sensor + ADPCM 2ch fits. The 4 KB "available" is the absolute minimum for application logic.
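
The "Available" column is the total SRAM minus the listed allocations, and the XAP feasibility check follows from the 8 KB per channel figure in the notes. The sketch below reproduces both under two assumptions: the field names are illustrative, and XMBP and HTTP are read as separate columns of the table. It models on-chip SRAM only, so a buffer placed in PSRAM (as on the ESP32-S3) would be entered as 0.

```rust
/// Per-target SRAM allocation in KB, mirroring the table columns.
/// (Illustrative type; not part of the SDK.)
pub struct MemoryBudgetKb {
    pub total_sram: u32,
    pub sdk_client: u32,
    pub xap_encoder: u32,
    pub ring_buffer: u32,
    pub xmbp: u32,
    pub http: u32,
    pub stack: u32,
}

impl MemoryBudgetKb {
    /// Sum of all listed allocations.
    pub fn allocated(&self) -> u32 {
        self.sdk_client + self.xap_encoder + self.ring_buffer + self.xmbp + self.http + self.stack
    }

    /// Remaining SRAM, or `None` if the allocations exceed the total.
    pub fn available(&self) -> Option<u32> {
        self.total_sram.checked_sub(self.allocated())
    }
}

/// XAP encoder state at 8 KB per channel, per the notes above.
pub fn xap_state_kb(channels: u32) -> u32 {
    channels * 8
}
```

With the RP2350 row this yields 432 KB available, and with the STM32F103 row 4 KB. `xap_state_kb(4)` is 32 KB, which exceeds the STM32F103's 20 KB of total SRAM.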

4. API Server Concurrency Profile

Connection Handling

The Xylolabs API server is built on the Tokio async runtime with Axum:

Parameter          | Value                         | Notes
-------------------|-------------------------------|----------------------------------
Runtime            | Tokio multi-threaded          | Worker threads = CPU cores
DB pool            | 20 connections (configurable) | DATABASE_MAX_CONNECTIONS env var
Per-request memory | ~1-2 KB (metadata)            | Excluding upload body
Upload body limit  | Up to 2 GB                    | Full-file buffering (to be fixed)
SSE connections    | Unbounded                     | Per-session broadcast channels
HTTP keep-alive    | 75 seconds                    | Axum default

Ingest Pipeline Throughput

Stage            | Latency                   | Notes
-----------------|---------------------------|------------------------------------------
XMBP decode      | <100 us per 2kHz batch    | Benchmarked on x86-64 server
XAP frame decode | ~5 us per frame           | XAP decoder (server-side, no constraints)
zstd compression | ~200 us per chunk         | Offloaded to spawn_blocking
S3 write         | ~5-20 ms per chunk        | Network-bound, MinIO local ~2 ms
DB insert        | ~1-2 ms per record        | Batched within flush window
Flush window     | 10 seconds (configurable) | Accumulates samples before write
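
The flush window row describes accumulating samples and writing them as one batch. The server's actual implementation is not shown here, so the following is a hypothetical sketch of the pattern: `FlushWindow` and its methods are invented for illustration, and timestamps are passed in as milliseconds so the logic stays deterministic.

```rust
/// Accumulates items and releases them as one batch once the window has elapsed.
/// (Hypothetical illustration of a flush window; not the server's code.)
pub struct FlushWindow<T> {
    window_ms: u64,
    window_start_ms: Option<u64>,
    pending: Vec<T>,
}

impl<T> FlushWindow<T> {
    pub fn new(window_ms: u64) -> Self {
        FlushWindow { window_ms, window_start_ms: None, pending: Vec::new() }
    }

    /// Adds an item. Returns the accumulated batch when `now_ms` is at least
    /// `window_ms` past the first item of the current window, else `None`.
    pub fn push(&mut self, item: T, now_ms: u64) -> Option<Vec<T>> {
        // The window opens with the first item pushed after a flush.
        let start = *self.window_start_ms.get_or_insert(now_ms);
        self.pending.push(item);
        if now_ms.saturating_sub(start) >= self.window_ms {
            self.window_start_ms = None;
            Some(std::mem::take(&mut self.pending))
        } else {
            None
        }
    }
}
```

With a 10-second window, items pushed at 0 ms and 5,000 ms are held, and the push at 10,000 ms returns all three items as one batch. A single batched write is what lets the per-record DB insert cost be amortized.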

Concurrent Session Capacity

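Scenario | Sessions | Audio Streams      | Sensor Streams     | Server CPU | DB Load
---------|----------|--------------------|--------------------|------------|----------
Light    | 10       | 10 x 2ch @16kHz    | 40 @100Hz          | <5%        | Low
Standard | 50       | 50 x 4ch @48kHz    | 200 @100Hz         | ~20%       | Medium
Heavy    | 100      | 100 x 4ch @96kHz   | 2600 @100Hz        | ~60%       | High
Limit    | ~200     | Limited by DB pool | Limited by DB pool | ~90%       | Saturated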
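
The stream counts in the table scale linearly with the session count. The sketch below reproduces them, assuming 4 sensors per session for Light and Standard and 26 for Heavy; those per-session counts are inferred from the table (40/10, 200/50, 2600/100) rather than stated. The `Scenario` type is illustrative.

```rust
/// Load parameters for one capacity scenario (illustrative type).
pub struct Scenario {
    pub sessions: u64,
    pub channels: u64,
    pub sample_rate_hz: u64,
    pub sensors_per_session: u64,
}

impl Scenario {
    /// Total sensor streams across all sessions.
    pub fn sensor_streams(&self) -> u64 {
        self.sessions * self.sensors_per_session
    }

    /// Aggregate audio samples per second arriving at the server.
    pub fn audio_samples_per_sec(&self) -> u64 {
        self.sessions * self.channels * self.sample_rate_hz
    }
}
```

Under these assumptions the Heavy scenario carries 2600 sensor streams and 38.4 million audio samples per second, against 40 streams and 320,000 samples per second for Light.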

Bottlenecks Identified (Performance Review)

Issue                             | Severity | Status | Fix
----------------------------------|----------|--------|------------------------------------
N+1 tag queries in list_uploads   | P1       | Fixed  | Batch fetch with single JOIN query
Sequential stats_overview queries | P2       | Fixed  | tokio::try_join! parallel execution
S3 full-file buffering on upload  | P1       | TODO   | Streaming multipart upload
ConfigManager blocking RwLock     | P2       | Fixed  | Migrated to tokio::sync::RwLock
Upload body buffered in memory    | P1       | TODO   | Streaming body with backpressure
No connection rate limiting       | P3       | TODO   | Tower rate-limit middleware

5. Burn-In Test Results

Native Platform (Apple M4)

Measured on Apple M4 with native Rust compilation (not cross-compiled):

Metric                   | Value       | Notes
-------------------------|-------------|-----------------------------------------
XAP encode (per frame)   | avg = 9 us  | 10 ms frame budget -> 0.09% utilization
Client tick (full cycle) | avg = 56 us | Includes XMBP encode + buffer management
MCU headroom             | 99.3%       | Validates algorithmic efficiency
Memory (peak RSS)        | ~2 MB       | SDK client + test harness
Frames dropped           | 0           | Across all scenarios

Burn-In Scenarios

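Scenario     | Duration | Audio Config | Sensors   | Devices | Result
-------------|----------|--------------|-----------|---------|-------
standard     | 60s      | 4ch @16kHz   | 4 @100Hz  | 1       | PASS
stress       | 120s     | 4ch @96kHz   | 26 @100Hz | 1       | PASS
endurance    | 120s+    | 2ch @16kHz   | 4 @10Hz   | 1       | PASS
multi-device | 60s      | 4ch @16kHz   | 4 @100Hz  | 10      | PASS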

QEMU ARM Throttled

Simulated RP2350 performance using ARM QEMU with CPU throttling to approximate 150 MHz Cortex-M33:

Metric                   | Value         | Notes
-------------------------|---------------|--------------------------------------
XAP encode (per frame)   | avg = ~650 us | ~6.5% of 10 ms budget
Client tick (full cycle) | avg = ~3.8 ms | ~38% of 10 ms budget
Estimated MCU headroom   | ~62%          | Conservative (QEMU overhead included)

The QEMU estimate (~38% utilization) is in the same range as the CPU budget analysis in Section 2, where the RP2350 table totals 89 MIPS (~59%) at baseline and 69 MIPS (~46%) with DSP.
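
The utilization and headroom figures in both burn-in tables are the average time divided by the 10 ms frame budget. A minimal sketch of that calculation, with illustrative function names:

```rust
use std::time::Duration;

/// Share of the frame budget consumed by an average measured time, in percent.
pub fn budget_utilization_percent(avg: Duration, frame_budget: Duration) -> f64 {
    avg.as_secs_f64() / frame_budget.as_secs_f64() * 100.0
}

/// Headroom left after the given utilization, in percent.
pub fn headroom_percent(utilization_percent: f64) -> f64 {
    100.0 - utilization_percent
}
```

With a 10 ms budget, 9 us gives about 0.09% (the native XAP encode), 650 us gives about 6.5%, and 3.8 ms gives about 38%, leaving roughly 62% headroom for the throttled QEMU client tick.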


6. CMSIS-DSP Integration

The SDK automatically links CMSIS-DSP on Cortex-M targets when XYLOLABS_USE_CMSIS_DSP=1 is set.

Key Optimized Functions

CMSIS-DSP Function          | Used For                       | Speedup vs C
----------------------------|--------------------------------|-------------
arm_rfft_fast_f32           | MDCT / spectral analysis       | 3-5x
arm_fir_f32 / arm_fir_q15   | FIR downsampling filter        | 2-4x
arm_dot_prod_f32            | Inner products in quantization | 2-3x
arm_scale_f32               | Gain normalization             | 2x
arm_fill_f32 / arm_copy_f32 | Buffer management              | 1.5-2x

Encoder Path Selection

Core                  | FPU          | Recommended Path        | Rationale
----------------------|--------------|-------------------------|-------------------------------------------------------------------
Cortex-M33 (RP2350)   | Yes (single) | Fixed-point (q15)       | DSP SIMD instructions (SMLAD) are optimized for 16-bit fixed-point
Cortex-M4F (F411)     | Yes (single) | Floating-point (f32)    | FPU makes float path faster than fixed-point
Cortex-M4F (nRF52840) | Yes (single) | Floating-point (f32)    | Same as F411
Cortex-M33 (nRF9160)  | Yes (single) | Fixed-point (q15)       | Same as RP2350
Xtensa LX7 (ESP32-S3) | Yes (single) | Floating-point with PIE | PIE SIMD on f32 vectors
Cortex-M0+ (RP2040)   | No           | N/A (no XAP)            | No DSP, no FPU
Cortex-M3 (F103)      | No           | N/A (no XAP)            | No DSP, no FPU
RISC-V (ESP32-C3)     | No           | N/A (no XAP)            | M extension only (multiply)
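
The selection table maps directly onto a match over the core type. The sketch below encodes it; the enum and function names are hypothetical and do not reflect how the SDK itself chooses a path.

```rust
/// CPU cores covered by the selection table (illustrative enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Core {
    CortexM33,
    CortexM4F,
    XtensaLx7,
    CortexM0Plus,
    CortexM3,
    RiscvRv32imc,
}

/// Encoder paths named in the table; `NoXap` stands for "N/A (no XAP)".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EncoderPath {
    FixedPointQ15,
    FloatF32,
    FloatF32Pie,
    NoXap,
}

/// Recommended XAP encoder path for a core, per the table above.
pub fn recommended_path(core: Core) -> EncoderPath {
    match core {
        Core::CortexM33 => EncoderPath::FixedPointQ15,
        Core::CortexM4F => EncoderPath::FloatF32,
        Core::XtensaLx7 => EncoderPath::FloatF32Pie,
        Core::CortexM0Plus | Core::CortexM3 | Core::RiscvRv32imc => EncoderPath::NoXap,
    }
}
```

Because the match is exhaustive, adding a new core variant forces an explicit decision about its encoder path at compile time.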

7. Enabling DSP Features Per Target

Rust SDK (Cargo Feature Flags)

The Rust SDK (crates/xylolabs-sdk/) exposes two DSP feature flags:

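Feature    | Targets                                                                   | Effect
-----------|---------------------------------------------------------------------------|-----------------------------------------------
cmsis-dsp  | Cortex-M33 (RP2350, nRF9160), Cortex-M4F (STM32F411, nRF52840, STM32WB55) | Enables CMSIS-DSP optimized MDCT and FIR paths
esp32-simd | ESP32-S3 (Xtensa LX7)                                                     | Enables PIE SIMD optimized MDCT paths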

Per-target Cargo.toml examples:

# RP2350 / nRF9160 (Cortex-M33)
xylolabs-sdk = { path = "../../crates/xylolabs-sdk", features = ["xap", "cmsis-dsp"] }

# STM32F411 / nRF52840 / STM32WB55 (Cortex-M4F)
xylolabs-sdk = { path = "../../crates/xylolabs-sdk", features = ["xap", "cmsis-dsp"] }

# ESP32-S3 (Xtensa LX7 with PIE)
xylolabs-sdk = { path = "../../crates/xylolabs-sdk", features = ["xap", "esp32-simd"] }

# STM32F103 / ESP32-C3 (no DSP -- ADPCM only)
xylolabs-sdk = { path = "../../crates/xylolabs-sdk", default-features = false, features = ["adpcm"] }
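
Cargo feature flags like these are typically consumed through `cfg` checks that pick a code path at compile time. How the SDK gates its own code is not shown here, so the following is only a sketch of the common pattern: the function `mdct_backend` is hypothetical, and only the feature names come from the table above.

```rust
/// Name of the MDCT backend selected at compile time.
/// (Hypothetical helper; the SDK's internal gating may differ.)
// The features are declared by the SDK crate, not by this standalone sketch,
// so the unexpected-cfg lint is silenced here.
#[allow(unexpected_cfgs)]
pub fn mdct_backend() -> &'static str {
    if cfg!(feature = "cmsis-dsp") {
        "cmsis-dsp"
    } else if cfg!(feature = "esp32-simd") {
        "esp32-simd"
    } else {
        // No DSP feature enabled: fall back to a portable implementation.
        "portable"
    }
}
```

Built without either feature, the function returns `"portable"`. Enabling one of the flags in Cargo.toml, as in the examples above, would switch the result without any source changes.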

C SDK (Build Defines)

The C SDK uses compile-time defines in config.h, auto-detected from compiler flags:

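Define                      | Targets                | Auto-detect Condition
----------------------------|------------------------|---------------------------------------
XYLOLABS_USE_CMSIS_DSP=1    | Cortex-M33, Cortex-M4F | __ARM_FEATURE_DSP defined
XYLOLABS_USE_ESP32S3_SIMD=1 | ESP32-S3               | __XTENSA__ + CONFIG_IDF_TARGET_ESP32S3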

Override explicitly via CMake if auto-detection is insufficient:

# CMake -- force-enable CMSIS-DSP
target_compile_definitions(my_firmware PRIVATE XYLOLABS_USE_CMSIS_DSP=1)

# CMake -- force-enable ESP32-S3 SIMD
target_compile_definitions(my_firmware PRIVATE XYLOLABS_USE_ESP32S3_SIMD=1)

Summary: Which Flag for Which Target

Target                        | Rust Feature | C Define                    | Expected Speedup
------------------------------|--------------|-----------------------------|-----------------
RP2350 (Cortex-M33)           | cmsis-dsp    | XYLOLABS_USE_CMSIS_DSP=1    | ~30%
STM32F411 (Cortex-M4F)        | cmsis-dsp    | XYLOLABS_USE_CMSIS_DSP=1    | ~40%
nRF52840 (Cortex-M4F)         | cmsis-dsp    | XYLOLABS_USE_CMSIS_DSP=1    | ~30-40%
nRF9160 (Cortex-M33)          | cmsis-dsp    | XYLOLABS_USE_CMSIS_DSP=1    | ~30%
STM32WB55 (Cortex-M4F)        | cmsis-dsp    | XYLOLABS_USE_CMSIS_DSP=1    | ~40%
ESP32-S3 (Xtensa LX7)         | esp32-simd   | XYLOLABS_USE_ESP32S3_SIMD=1 | ~60%
STM32F103 / RP2040 / ESP32-C3 | N/A          | N/A                         | No DSP available

8. Recommendations

Platform Selection by Use Case

Use Case                 | Recommended Platform        | Codec     | Rationale
-------------------------|-----------------------------|-----------|-------------------------------------------------------
4ch @96kHz full-spectrum | RP2350, ESP32-S3            | XAP       | Only platforms with sufficient compute + bandwidth fit
4ch @48kHz standard      | STM32F411, RP2350, ESP32-S3 | XAP       | F411 at 40% CPU with DSP
2ch @48kHz compact       | nRF52840                    | XAP       | BLE transport, 42% CPU with DSP
Sensor-only (no audio)   | Any platform                | N/A       | XMBP metadata only
Voice/ADPCM fallback     | STM32F103, ESP32-C3         | IMA-ADPCM | No FPU/DSP required

DSP Optimization Checklist

  1. Always enable CMSIS-DSP on Cortex-M targets (XYLOLABS_USE_CMSIS_DSP=1). Drop-in replacement, no code changes needed.
  2. Use XAP floating-point path on Cortex-M4F (STM32F411, nRF52840). The FPU makes float faster than fixed-point.
  3. Use XAP fixed-point path on Cortex-M33 (RP2350, nRF9160). DSP SIMD instructions are optimized for 16-bit operations.
  4. Enable PIE intrinsics on ESP32-S3. Auto-vectorization helps, but explicit PIE intrinsics in the XAP hot path yield 20-30% additional gain.
  5. For 4ch @96kHz, only RP2350 and ESP32-S3 have sufficient compute. STM32F411 handles 4ch at 48kHz.
  6. Cortex-M3 and Cortex-M0+: ADPCM only. XAP encoder state (32 KB for 4ch) exceeds available SRAM on STM32F103 (20 KB).
  7. ESP32-C3 (RISC-V): sensor-only or ADPCM fallback. The M extension provides multiply but no SIMD or DSP acceleration.

Power vs Performance Tradeoffs

Platform  | Active Current        | Sleep Current        | Best For
----------|-----------------------|----------------------|------------------------------
RP2350    | ~25 mA @150MHz        | ~1.3 mA (dormant)    | Battery-powered field sensors
ESP32-S3  | ~80 mA @240MHz (WiFi) | ~10 uA (deep sleep)  | Mains-powered, WiFi available
STM32F411 | ~30 mA @100MHz        | ~2.4 uA (standby)    | Industrial, low-power
nRF52840  | ~5 mA @64MHz          | ~1.5 uA (system off) | BLE wearable / beacon
STM32F103 | ~25 mA @72MHz         | ~3.6 uA (standby)    | Legacy sensor nodes
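
For duty-cycled devices the tradeoff depends on how the active and sleep figures combine. The sketch below computes a time-weighted average current; the 10% active fraction used in the usage note is a hypothetical duty cycle, not a measured value, and the function name is illustrative.

```rust
/// Time-weighted average current in mA for a device that is active for
/// `active_fraction` of the time (clamped to 0.0..=1.0) and asleep otherwise.
pub fn average_current_ma(active_ma: f64, sleep_ma: f64, active_fraction: f64) -> f64 {
    let duty = active_fraction.clamp(0.0, 1.0);
    duty * active_ma + (1.0 - duty) * sleep_ma
}
```

At a 10% duty cycle, the nRF52840 figures (5 mA active, 1.5 uA sleep) average to about 0.50 mA, while the RP2350 figures (25 mA active, 1.3 mA dormant) average to about 3.67 mA, of which roughly a third comes from the dormant current alone.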