Performance Profiling — DSP Acceleration & Resource Budgets¶

Xylolabs API performance profiling: DSP acceleration and resource budgets.

Revision: 2026-03-23

1. DSP Acceleration Matrix¶

Each platform's DSP capabilities determine codec performance. Speedup percentages express the reduction in MIPS relative to a pure C baseline implementation; for example, 80 -> 56 MIPS is a ~30% speedup.

Platform	Core	Clock	DSP Extensions	FPU	XAP Speedup	ADPCM Speedup
RP2350 (Pico 2)	Cortex-M33	150 MHz	ARMv8-M DSP (SMLAD, SMLAL, QADD, SSAT)	Single-precision	~30% (80->56 MIPS)	~50%
RP2040 (Pico)	Cortex-M0+	133 MHz	None	None	N/A (not feasible)	Minimal
ESP32-S3	2x Xtensa LX7	240 MHz	128-bit SIMD (PIE): 4x f32, 8x i16 vector	Single-precision	~60%	~40%
ESP32-C3	RISC-V RV32IMC	160 MHz	M extension (multiply only)	None	Marginal	Minimal
STM32F411	Cortex-M4F	100 MHz	FPU + DSP: SMLAD, barrel shifter	Single-precision	~40% (float path)	~30%
STM32F103	Cortex-M3	72 MHz	None	None	N/A (not feasible)	Minimal
STM32WB55	Cortex-M4F	64 MHz	FPU + DSP (same as F411)	Single-precision	~40%	~30%
nRF52840	Cortex-M4F	64 MHz	FPU + DSP (same as F411)	Single-precision	~30-40%	~30%
nRF9160	Cortex-M33	64 MHz	ARMv8-M DSP (same as RP2350)	Single-precision	~30%	~50%
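
The percentage in the RP2350 row can be reproduced from its MIPS figures. The following is a minimal sketch, assuming "speedup" means the relative reduction in MIPS against the C baseline; `speedup_percent` is an illustrative name and not part of the SDK.

```rust
/// Relative reduction in MIPS versus the pure C baseline, in percent.
/// Hypothetical helper for illustration; not an SDK function.
pub fn speedup_percent(baseline_mips: f64, optimized_mips: f64) -> f64 {
    (baseline_mips - optimized_mips) / baseline_mips * 100.0
}
```

With the RP2350 figures, `speedup_percent(80.0, 56.0)` evaluates to about 30, matching the ~30% in the table.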

DSP Instruction Summary¶

ARMv8-M DSP (Cortex-M33: RP2350, nRF9160, STM32U585):

- SMLAL / UMLAL -- single-cycle 32x32->64 MAC for FIR and MDCT accumulation
- SMLAD / SMUAD -- dual 16x16->32 MAC, doubles throughput for 16-bit audio
- QADD / QSUB / SSAT / USAT -- saturating arithmetic, eliminates branch-based clipping
- SBFX / UBFX -- bit-field extract for XMBP binary protocol parsing

Cortex-M4F DSP (STM32F411, STM32WB55, nRF52840):

- All ARMv8-M DSP instructions above, plus:
- Single-precision FPU -- hardware float multiply-accumulate in 1-3 cycles
- SDIV / UDIV -- hardware integer divide in 2-12 cycles

Xtensa PIE (ESP32-S3):

- 128-bit SIMD -- 4x f32 or 8x i16 per instruction
- 16 x 128-bit dedicated vector registers
- Hardware AES/SHA offloads TLS from CPU
- PSRAM DMA for large audio buffer transfers
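
The two operations that matter most for 16-bit audio, multiply-accumulate and saturation, can be modeled in portable Rust. This is only a sketch of the arithmetic: the function names are illustrative, and whether the compiler actually emits SMLAD, SMLAL, or SSAT for this code depends on the target and optimization level, which is not verified here.

```rust
/// Dot product of two 16-bit sample slices with a wide accumulator.
/// Models the accumulate pattern that SMLAD / SMLAL accelerate in FIR
/// and MDCT loops. Hypothetical helper, not an SDK function.
pub fn dot_q15(a: &[i16], b: &[i16]) -> i64 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| i64::from(x) * i64::from(y))
        .sum()
}

/// Clips a 32-bit intermediate to the 16-bit sample range without an
/// explicit branch in the source, the operation SSAT performs in hardware.
pub fn saturate_i16(value: i32) -> i16 {
    value.clamp(i32::from(i16::MIN), i32::from(i16::MAX)) as i16
}
```

Keeping the accumulator wider than the samples avoids overflow inside the loop, so saturation is only needed once, when the result is written back as a 16-bit sample.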

2. Per-Target Performance Budget¶

RP2350 (Pico 2) — 4ch XAP @96kHz¶

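Single-core budget, with all tasks counted against one 150 MHz core (in deployment, Core 0 runs the codec and Core 1 handles I/O). Percentages refer to the with-DSP column: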

Component               | MIPS (baseline) | MIPS (with DSP) | % of 150MHz
------------------------|-----------------|-----------------|------------
I2S DMA handling        |        2        |        2        |    1.3%
XAP MDCT forward        |       50        |       35        |   23.3%
XAP quantize+pack       |       15        |       10        |    6.7%
XMBP batch encode       |        5        |        5        |    3.3%
HTTP transport          |       10        |       10        |    6.7%
Sensor sampling (26ch)  |        5        |        5        |    3.3%
Watchdog + housekeeping |        2        |        2        |    1.3%
------------------------|-----------------|-----------------|------------
TOTAL                   |       89        |       69        |   46.0%
Available headroom      |       61        |       81        |   54.0%

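Dual-core split: Core 0 handles I2S DMA + XAP encoding (MDCT plus quantize+pack, ~47 MIPS with DSP, 31.3% of its 150 MHz). Core 1 handles XMBP, HTTP, sensors, and housekeeping (~22 MIPS, 14.7%). Together this is the 69 MIPS total from the table: 46.0% of a single core, or ~23.0% of the combined 300 MIPS.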
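
The totals and percentages in the table follow from summing the per-component MIPS and dividing by the core clock. The following is a minimal sketch of that arithmetic, assuming 1 MIPS per MHz as the table does; the type, constant, and function names are illustrative.

```rust
/// One row of a per-target budget table (hypothetical type for illustration).
pub struct Component {
    pub name: &'static str,
    pub mips_with_dsp: u32,
}

/// RP2350 rows, with-DSP column, copied from the table above.
pub const RP2350_BUDGET: [Component; 7] = [
    Component { name: "I2S DMA handling", mips_with_dsp: 2 },
    Component { name: "XAP MDCT forward", mips_with_dsp: 35 },
    Component { name: "XAP quantize+pack", mips_with_dsp: 10 },
    Component { name: "XMBP batch encode", mips_with_dsp: 5 },
    Component { name: "HTTP transport", mips_with_dsp: 10 },
    Component { name: "Sensor sampling (26ch)", mips_with_dsp: 5 },
    Component { name: "Watchdog + housekeeping", mips_with_dsp: 2 },
];

/// Sum of the per-component MIPS.
pub fn total_mips(rows: &[Component]) -> u32 {
    rows.iter().map(|row| row.mips_with_dsp).sum()
}

/// Utilization in percent, assuming 1 MIPS per MHz of core clock.
pub fn utilization_percent(total_mips: u32, clock_mhz: u32) -> f64 {
    f64::from(total_mips) / f64::from(clock_mhz) * 100.0
}
```

For the RP2350 rows, `total_mips` gives 69 and `utilization_percent(69, 150)` gives about 46, the TOTAL line of the table. The same two functions apply to the other target tables by swapping in their rows and clock.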

ESP32-S3 — 4ch XAP @96kHz¶

Dual-core budget (480 MIPS total):

Component               | MIPS (baseline) | MIPS (with PIE) | % of 480MHz
------------------------|-----------------|-----------------|------------
I2S DMA handling        |        2        |        2        |    0.4%
XAP MDCT forward        |       50        |       20        |    4.2%
XAP quantize+pack       |       15        |        6        |    1.3%
WiFi stack (FreeRTOS)   |       30        |       30        |    6.3%
XMBP batch encode       |        5        |        5        |    1.0%
HTTP/TLS transport      |       20        |       12        |    2.5%
Sensor sampling (26ch)  |        5        |        5        |    1.0%
PSRAM DMA management    |        3        |        3        |    0.6%
Watchdog + housekeeping |        2        |        2        |    0.4%
------------------------|-----------------|-----------------|------------
TOTAL                   |      132        |       85        |   17.7%
Available headroom      |      348        |      395        |   82.3%

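PIE SIMD provides the largest absolute gain of the profiled targets: XAP encoding (MDCT plus quantize+pack) drops from 65 to 26 MIPS. Hardware AES/SHA acceleration reduces HTTP/TLS transport from 20 to 12 MIPS.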

STM32F411 — 4ch XAP @48kHz¶

Single-core budget (100 MIPS):

Component               | MIPS (baseline) | MIPS (with DSP) | % of 100MHz
------------------------|-----------------|-----------------|------------
I2S DMA handling        |        2        |        2        |    2.0%
XAP MDCT forward        |       25        |       15        |   15.0%
XAP quantize+pack       |        8        |        5        |    5.0%
XMBP batch encode       |        3        |        3        |    3.0%
UART LTE-M1 transport   |        8        |        8        |    8.0%
Sensor sampling (26ch)  |        5        |        5        |    5.0%
Watchdog + housekeeping |        2        |        2        |    2.0%
------------------------|-----------------|-----------------|------------
TOTAL                   |       53        |       40        |   40.0%
Available headroom      |       47        |       60        |   60.0%

The M4F FPU enables the XAP floating-point encoder path, which is faster than fixed-point on this core. CMSIS-DSP arm_rfft_fast_f32 provides 3-5x speedup for the MDCT.

nRF52840 — 2ch XAP @48kHz¶

Single-core budget (64 MIPS):

Component               | MIPS (baseline) | MIPS (with DSP) | % of 64MHz
------------------------|-----------------|-----------------|------------
I2S DMA handling        |        1        |        1        |    1.6%
XAP MDCT forward        |       12        |        7        |   10.9%
XAP quantize+pack       |        4        |        3        |    4.7%
BLE GATT stack          |       10        |       10        |   15.6%
XMBP batch encode       |        2        |        2        |    3.1%
Sensor sampling (4ch)   |        2        |        2        |    3.1%
Watchdog + housekeeping |        2        |        2        |    3.1%
------------------------|-----------------|-----------------|------------
TOTAL                   |       33        |       27        |   42.2%
Available headroom      |       31        |       37        |   57.8%

BLE stack overhead is significant: at 10 MIPS (15.6%), the BLE GATT stack costs as much as the entire 2ch XAP encoder with DSP. 4ch XAP @48kHz (~28 MIPS with DSP, 44% utilization) is feasible but tight.

STM32F103 — Sensor-only + ADPCM fallback¶

Single-core budget (72 MIPS, 20 KB SRAM):

Component               | MIPS (baseline) | MIPS (optimized) | % of 72MHz
------------------------|-----------------|------------------|------------
ADPCM encode 2ch @24kHz |        1        |         1        |    1.4%
XMBP batch encode       |        2        |         2        |    2.8%
UART LTE-M1 transport   |        8        |         8        |   11.1%
Sensor sampling (4ch)   |        3        |         3        |    4.2%
Watchdog + housekeeping |        2        |         2        |    2.8%
------------------------|-----------------|------------------|------------
TOTAL                   |       16        |        16        |   22.2%
Available headroom      |       56        |        56        |   77.8%

No DSP extension available. XAP is not feasible (32 KB encoder state exceeds 20 KB total SRAM). ADPCM at 2ch @24kHz is the maximum audio capability.

3. Memory Budget Per Target¶

All values in KB. "SDK Client" includes XylolabsClient state machine, session state, and configuration.

Target	Total SRAM	SDK Client	XAP Encoder	Ring Buffer	XMBP	HTTP	Stack	Available
RP2350	520 KB	12 KB	8 KB	32 KB	16 KB	4 KB	16 KB	432 KB
ESP32-S3	512 KB + 8MB PSRAM	12 KB	8 KB	64 KB (PSRAM)	16 KB	8 KB	16 KB	~8 MB
STM32F411	128 KB	12 KB	8 KB	8 KB	4 KB	4 KB	8 KB	84 KB
STM32U585	786 KB	12 KB	8 KB	32 KB	16 KB	4 KB	16 KB	698 KB
nRF52840	256 KB	12 KB	8 KB	16 KB	8 KB	4 KB	8 KB	200 KB
nRF9160	256 KB	12 KB	8 KB	16 KB	8 KB	4 KB	8 KB	200 KB
STM32WB55	256 KB	12 KB	8 KB	8 KB	4 KB	4 KB	8 KB	212 KB
STM32F103	20 KB	4 KB	--	4 KB	2 KB	2 KB	4 KB	4 KB
RP2040	264 KB	4 KB	--	8 KB	4 KB	4 KB	8 KB	236 KB
ESP32-C3	400 KB	4 KB	--	8 KB	4 KB	4 KB	8 KB	372 KB

Notes¶

XAP Encoder: encoder state is 8 KB per channel, and the table lists this per-channel figure. A 4ch configuration needs 32 KB in total, which reduces the Available column by a further 24 KB. Platforms marked -- cannot run XAP.
Ring Buffer: Audio DMA double buffer for I2S capture. ESP32-S3 places this in PSRAM via DMA.
Stack: Dual-core platforms (RP2350, ESP32-S3) allocate 8 KB per core.
STM32F103: Extremely constrained. Only sensor + ADPCM 2ch fits. The 4 KB "available" is the absolute minimum for application logic.
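
The Available column is the total SRAM minus the listed allocations. The following is a minimal sketch of that calculation, using the RP2350 and STM32F103 rows from the table; `MemoryBudget` and its field names are illustrative and do not correspond to an SDK type.

```rust
/// Static SRAM allocations for one target, in KB (hypothetical type).
pub struct MemoryBudget {
    pub total_sram_kb: u32,
    pub sdk_client_kb: u32,
    pub xap_encoder_kb: u32,
    pub ring_buffer_kb: u32,
    pub xmbp_kb: u32,
    pub http_kb: u32,
    pub stack_kb: u32,
}

impl MemoryBudget {
    /// KB left for application logic, or `None` if the allocations
    /// do not fit into the target's SRAM at all.
    pub fn available_kb(&self) -> Option<u32> {
        let used = self.sdk_client_kb
            + self.xap_encoder_kb
            + self.ring_buffer_kb
            + self.xmbp_kb
            + self.http_kb
            + self.stack_kb;
        self.total_sram_kb.checked_sub(used)
    }
}

/// RP2350 row from the table above.
pub const RP2350: MemoryBudget = MemoryBudget {
    total_sram_kb: 520, sdk_client_kb: 12, xap_encoder_kb: 8,
    ring_buffer_kb: 32, xmbp_kb: 16, http_kb: 4, stack_kb: 16,
};

/// STM32F103 row from the table above (no XAP encoder).
pub const STM32F103: MemoryBudget = MemoryBudget {
    total_sram_kb: 20, sdk_client_kb: 4, xap_encoder_kb: 0,
    ring_buffer_kb: 4, xmbp_kb: 2, http_kb: 2, stack_kb: 4,
};
```

`RP2350.available_kb()` yields `Some(432)` and `STM32F103.available_kb()` yields `Some(4)`. Setting `xap_encoder_kb` to 32 on the STM32F103 budget returns `None`, which is the arithmetic behind the statement that 4ch XAP state cannot fit on that part.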

4. API Server Concurrency Profile¶

Connection Handling¶

The Xylolabs API server is built on the Tokio async runtime with Axum:

Parameter	Value	Notes
Runtime	Tokio multi-threaded	Worker threads = CPU cores
DB pool	20 connections (configurable)	`DATABASE_MAX_CONNECTIONS` env var
Per-request memory	~1-2 KB (metadata)	Excluding upload body
Upload body limit	Up to 2 GB	Full-file buffering (to be fixed)
SSE connections	Unbounded	Per-session broadcast channels
HTTP keep-alive	75 seconds	Axum default

Ingest Pipeline Throughput¶

Stage	Latency	Notes
XMBP decode	<100 us per 2kHz batch	Benchmarked on x86-64 server
XAP frame decode	~5 us per frame	XAP decoder (server-side, no constraints)
zstd compression	~200 us per chunk	Offloaded to `spawn_blocking`
S3 write	~5-20 ms per chunk	Network-bound, MinIO local ~2 ms
DB insert	~1-2 ms per record	Batched within flush window
Flush window	10 seconds (configurable)	Accumulates samples before write
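
The flush window in the last row batches samples so that the S3 write and DB insert happen once per window rather than once per sample. The following is a minimal, standard-library-only sketch of that accumulation pattern. `FlushWindow` is a hypothetical type, and the server's actual implementation (which runs on Tokio) is not shown in this document.

```rust
use std::time::Duration;

/// Accumulates samples and releases them once per flush window.
/// Hypothetical type for illustration only.
pub struct FlushWindow {
    window: Duration,
    window_start: Duration,
    pending: Vec<f32>,
}

impl FlushWindow {
    pub fn new(window: Duration) -> Self {
        Self { window, window_start: Duration::ZERO, pending: Vec::new() }
    }

    /// Adds a sample observed at `now` (time since stream start).
    /// Returns the accumulated batch once the window has elapsed,
    /// otherwise keeps buffering and returns `None`.
    pub fn push(&mut self, now: Duration, sample: f32) -> Option<Vec<f32>> {
        self.pending.push(sample);
        if now.saturating_sub(self.window_start) >= self.window {
            self.window_start = now;
            // Hand the batch to the caller and start with an empty buffer.
            Some(std::mem::take(&mut self.pending))
        } else {
            None
        }
    }
}
```

Passing the timestamp in as a parameter, instead of reading a clock inside `push`, keeps the sketch deterministic and easy to test. A caller would compress and write each returned batch.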

Concurrent Session Capacity¶

Scenario	Sessions	Audio Streams	Sensor Streams	Server CPU	DB Load
Light	10	10 x 2ch @16kHz	40 @100Hz	<5%	Low
Standard	50	50 x 4ch @48kHz	200 @100Hz	~20%	Medium
Heavy	100	100 x 4ch @96kHz	2600 @100Hz	~60%	High
Limit	~200	Limited by DB pool	Limited by DB pool	~90%	Saturated

Bottlenecks Identified (Performance Review)¶

Issue	Severity	Status	Fix
N+1 tag queries in `list_uploads`	P1	Fixed	Batch fetch with single JOIN query
Sequential `stats_overview` queries	P2	Fixed	`tokio::try_join!` parallel execution
S3 full-file buffering on upload	P1	TODO	Streaming multipart upload
`ConfigManager` blocking `RwLock`	P2	Fixed	Migrated to `tokio::sync::RwLock`
Upload body buffered in memory	P1	TODO	Streaming body with backpressure
No connection rate limiting	P3	TODO	Tower rate-limit middleware
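
The N+1 fix in the first row replaces one tag query per upload with a single JOIN whose flat rows are grouped in memory. The following is a minimal sketch of that grouping step only; the row shape `(upload_id, tag)` and the name `group_tags` are assumptions for illustration, and the real `list_uploads` query is not shown in this document.

```rust
use std::collections::HashMap;

/// Groups the flat (upload_id, tag) rows returned by a single JOIN query
/// into one tag list per upload. Hypothetical helper for illustration.
pub fn group_tags(rows: &[(u64, &str)]) -> HashMap<u64, Vec<String>> {
    let mut by_upload: HashMap<u64, Vec<String>> = HashMap::new();
    for &(upload_id, tag) in rows {
        by_upload.entry(upload_id).or_default().push(tag.to_string());
    }
    by_upload
}
```

With this shape the database is queried once regardless of how many uploads are listed, and the per-upload work becomes a hash map insert.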

5. Burn-In Test Results¶

Native Platform (Apple M4)¶

Measured on Apple M4 with native Rust compilation (not cross-compiled):

Metric	Value	Notes
XAP encode (per frame)	avg = 9 us	10 ms frame budget -> 0.09% utilization
Client tick (full cycle)	avg = 56 us	Includes XMBP encode + buffer management
MCU headroom	99.3%	Validates algorithmic efficiency
Memory (peak RSS)	~2 MB	SDK client + test harness
Frames dropped	0	Across all scenarios

Burn-In Scenarios¶

Scenario	Duration	Audio Config	Sensors	Devices	Result
standard	60s	4ch @16kHz	4 @100Hz	1	PASS
stress	120s	4ch @96kHz	26 @100Hz	1	PASS
endurance	120s+	2ch @16kHz	4 @10Hz	1	PASS
multi-device	60s	4ch @16kHz	4 @100Hz	10	PASS

QEMU ARM Throttled¶

Simulated RP2350 performance using ARM QEMU with CPU throttling to approximate a 150 MHz Cortex-M33:

Metric	Value	Notes
XAP encode (per frame)	avg = ~650 us	~6.5% of 10 ms budget
Client tick (full cycle)	avg = ~3.8 ms	~38% of 10 ms budget
Estimated MCU headroom	~62%	Conservative (QEMU overhead included)

The QEMU results are in line with the CPU budget analysis in Section 2, which puts the RP2350 at ~59% of one core for the baseline and ~46% with DSP.
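
The percentages in the burn-in tables are elapsed time divided by the 10 ms frame budget. The following is a minimal sketch of that conversion; the function names are illustrative.

```rust
use std::time::Duration;

/// Share of the frame budget consumed, in percent.
/// Hypothetical helper for illustration.
pub fn budget_used_percent(elapsed: Duration, frame_budget: Duration) -> f64 {
    elapsed.as_secs_f64() / frame_budget.as_secs_f64() * 100.0
}

/// Remaining headroom, in percent of the frame budget.
pub fn headroom_percent(elapsed: Duration, frame_budget: Duration) -> f64 {
    100.0 - budget_used_percent(elapsed, frame_budget)
}
```

For the throttled QEMU run, a 650 us encode against a 10 ms budget is about 6.5%, and a 3.8 ms client tick leaves about 62% headroom, matching the table.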

6. CMSIS-DSP Integration¶

The SDK automatically links CMSIS-DSP on Cortex-M targets with DSP extensions (Cortex-M4F, Cortex-M33) when `XYLOLABS_USE_CMSIS_DSP=1` is set.

Key Optimized Functions¶

CMSIS-DSP Function	Used For	Speedup vs C
`arm_rfft_fast_f32`	MDCT / spectral analysis	3-5x
`arm_fir_f32` / `arm_fir_q15`	FIR downsampling filter	2-4x
`arm_dot_prod_f32`	Inner products in quantization	2-3x
`arm_scale_f32`	Gain normalization	2x
`arm_fill_f32` / `arm_copy_f32`	Buffer management	1.5-2x

Encoder Path Selection¶

Core	FPU	Recommended Path	Rationale
Cortex-M33 (RP2350)	Yes (single)	Fixed-point (`q15`)	DSP SIMD instructions (`SMLAD`) are optimized for 16-bit fixed-point
Cortex-M4F (F411)	Yes (single)	Floating-point (`f32`)	FPU makes float path faster than fixed-point
Cortex-M4F (nRF52840)	Yes (single)	Floating-point (`f32`)	Same as F411
Cortex-M33 (nRF9160)	Yes (single)	Fixed-point (`q15`)	Same as RP2350
Xtensa LX7 (ESP32-S3)	Yes (single)	Floating-point with PIE	PIE SIMD on f32 vectors
Cortex-M0+ (RP2040)	No	N/A (no XAP)	No DSP, no FPU
Cortex-M3 (F103)	No	N/A (no XAP)	No DSP, no FPU
RISC-V (ESP32-C3)	No	N/A (no XAP)	M extension only (multiply)
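
The table above amounts to a lookup from core type to encoder path. The following sketch encodes it as a `match`; the enum and function names are hypothetical and are not taken from the SDK.

```rust
/// CPU cores covered by the table above (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Core {
    CortexM33,
    CortexM4F,
    XtensaLx7,
    CortexM0Plus,
    CortexM3,
    RiscVRv32imc,
}

/// Encoder paths named in the table above (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EncoderPath {
    FixedQ15,
    FloatF32,
    FloatPie,
    NoXap,
}

/// Returns the recommended XAP encoder path for a core.
pub fn recommended_path(core: Core) -> EncoderPath {
    match core {
        // DSP SIMD (SMLAD) favors 16-bit fixed-point.
        Core::CortexM33 => EncoderPath::FixedQ15,
        // The FPU makes the float path faster than fixed-point.
        Core::CortexM4F => EncoderPath::FloatF32,
        // PIE SIMD operates on f32 vectors.
        Core::XtensaLx7 => EncoderPath::FloatPie,
        // No DSP and no FPU: XAP is not available.
        Core::CortexM0Plus | Core::CortexM3 | Core::RiscVRv32imc => EncoderPath::NoXap,
    }
}
```

An exhaustive `match` means that adding a new core to the enum forces a decision about its encoder path at compile time.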

7. Enabling DSP Features Per Target¶

Rust SDK (Cargo Feature Flags)¶

The Rust SDK (crates/xylolabs-sdk/) exposes two DSP feature flags:

Feature	Targets	Effect
`cmsis-dsp`	Cortex-M33 (RP2350, nRF9160), Cortex-M4F (STM32F411, nRF52840, STM32WB55)	Enables CMSIS-DSP optimized MDCT and FIR paths
`esp32-simd`	ESP32-S3 (Xtensa LX7)	Enables PIE SIMD optimized MDCT paths

Per-target Cargo.toml examples:

# RP2350 / nRF9160 (Cortex-M33)
xylolabs-sdk = { path = "../../crates/xylolabs-sdk", features = ["xap", "cmsis-dsp"] }

# STM32F411 / nRF52840 / STM32WB55 (Cortex-M4F)
xylolabs-sdk = { path = "../../crates/xylolabs-sdk", features = ["xap", "cmsis-dsp"] }

# ESP32-S3 (Xtensa LX7 with PIE)
xylolabs-sdk = { path = "../../crates/xylolabs-sdk", features = ["xap", "esp32-simd"] }

# STM32F103 / ESP32-C3 (no DSP -- ADPCM only)
xylolabs-sdk = { path = "../../crates/xylolabs-sdk", default-features = false, features = ["adpcm"] }
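
Inside a crate, Cargo features such as these are usually consumed through `cfg` checks. The following sketch shows one way such gating could look. It is an assumption about the pattern, not the SDK's actual source, and `mdct_backend` is a hypothetical function.

```rust
/// Reports which MDCT backend was selected at compile time.
/// Hypothetical helper; the feature names match the flags listed above.
pub fn mdct_backend() -> &'static str {
    if cfg!(feature = "cmsis-dsp") {
        "cmsis-dsp"
    } else if cfg!(feature = "esp32-simd") {
        "esp32-simd"
    } else {
        // Neither DSP feature enabled: fall back to the portable path.
        "portable"
    }
}
```

Because `cfg!` is evaluated at compile time, the branches that are not taken can be removed by the optimizer, so a build with neither feature carries only the portable path.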

C SDK (Build Defines)¶

The C SDK uses compile-time defines in config.h, auto-detected from compiler flags:

Define	Targets	Auto-detect Condition
`XYLOLABS_USE_CMSIS_DSP=1`	Cortex-M33, Cortex-M4F	`__ARM_FEATURE_DSP` defined
`XYLOLABS_USE_ESP32S3_SIMD=1`	ESP32-S3	`__XTENSA__` + `CONFIG_IDF_TARGET_ESP32S3`

Override explicitly via CMake if auto-detection is insufficient:

# CMake -- force-enable CMSIS-DSP
target_compile_definitions(my_firmware PRIVATE XYLOLABS_USE_CMSIS_DSP=1)

# CMake -- force-enable ESP32-S3 SIMD
target_compile_definitions(my_firmware PRIVATE XYLOLABS_USE_ESP32S3_SIMD=1)

Summary: Which Flag for Which Target¶

Target	Rust Feature	C Define	Expected Speedup
RP2350 (Cortex-M33)	`cmsis-dsp`	`XYLOLABS_USE_CMSIS_DSP=1`	~30%
STM32F411 (Cortex-M4F)	`cmsis-dsp`	`XYLOLABS_USE_CMSIS_DSP=1`	~40%
nRF52840 (Cortex-M4F)	`cmsis-dsp`	`XYLOLABS_USE_CMSIS_DSP=1`	~30-40%
nRF9160 (Cortex-M33)	`cmsis-dsp`	`XYLOLABS_USE_CMSIS_DSP=1`	~30%
STM32WB55 (Cortex-M4F)	`cmsis-dsp`	`XYLOLABS_USE_CMSIS_DSP=1`	~40%
ESP32-S3 (Xtensa LX7)	`esp32-simd`	`XYLOLABS_USE_ESP32S3_SIMD=1`	~60%
STM32F103 / RP2040 / ESP32-C3	N/A	N/A	No DSP available

8. Recommendations¶

Platform Selection by Use Case¶

Use Case	Recommended Platform	Codec	Rationale
4ch @96kHz full-spectrum	RP2350, ESP32-S3	XAP	Only platforms with sufficient compute + bandwidth fit
4ch @48kHz standard	STM32F411, RP2350, ESP32-S3	XAP	F411 at 40% CPU with DSP
2ch @48kHz compact	nRF52840	XAP	BLE transport, 42% CPU with DSP
Sensor-only (no audio)	Any platform	N/A	XMBP metadata only
Voice/ADPCM fallback	STM32F103, ESP32-C3	IMA-ADPCM	No FPU/DSP required

DSP Optimization Checklist¶

Always enable CMSIS-DSP on Cortex-M4F and Cortex-M33 targets (`XYLOLABS_USE_CMSIS_DSP=1`). It is a drop-in replacement; no code changes are needed.
Use XAP floating-point path on Cortex-M4F (STM32F411, nRF52840). The FPU makes float faster than fixed-point.
Use XAP fixed-point path on Cortex-M33 (RP2350, nRF9160). DSP SIMD instructions are optimized for 16-bit operations.
Enable PIE intrinsics on ESP32-S3. Auto-vectorization helps, but explicit PIE intrinsics in the XAP hot path yield 20-30% additional gain.
For 4ch @96kHz, only RP2350 and ESP32-S3 have sufficient compute. STM32F411 handles 4ch at 48kHz.
Cortex-M3 and Cortex-M0+: ADPCM only. XAP encoder state (32 KB for 4ch) exceeds available SRAM on STM32F103 (20 KB).
ESP32-C3 (RISC-V): sensor-only or ADPCM fallback. The M extension provides multiply but no SIMD or DSP acceleration.

Power vs Performance Tradeoffs¶

Platform	Active Power	Sleep Mode	Best For
RP2350	~25 mA @150MHz	~1.3 mA (dormant)	Battery-powered field sensors
ESP32-S3	~80 mA @240MHz (WiFi)	~10 uA (deep sleep)	Mains-powered, WiFi available
STM32F411	~30 mA @100MHz	~2.4 uA (standby)	Industrial, low-power
nRF52840	~5 mA @64MHz	~1.5 uA (system off)	BLE wearable / beacon
STM32F103	~25 mA @72MHz	~3.6 uA (standby)	Legacy sensor nodes
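
For battery-powered deployments, the two current columns combine into an average current once a duty cycle is chosen. The following sketch uses a simple two-state model. The duty-cycle model and the 10% figure in the usage note are assumptions for illustration, not measurements from this document.

```rust
/// Average current in mA for a device that is active for `duty_cycle`
/// (0.0 to 1.0) of the time and asleep otherwise.
/// Hypothetical helper; ignores wake-up and radio transients.
pub fn average_current_ma(active_ma: f64, sleep_ma: f64, duty_cycle: f64) -> f64 {
    let duty = duty_cycle.clamp(0.0, 1.0);
    active_ma * duty + sleep_ma * (1.0 - duty)
}
```

With the nRF52840 row (~5 mA active, ~1.5 uA in system off) and an assumed 10% duty cycle, `average_current_ma(5.0, 0.0015, 0.1)` is about 0.5 mA. The active current dominates, so reducing active time matters more than the choice of sleep mode.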

Codec Analysis -- 16 audio codecs compared across 5 MCU platforms
RP2350 Feasibility -- 4ch 96kHz architecture, CPU/memory budget
Pico 2 Platform Guide -- RP2350 hardware setup and build
STM32 Platform Guide -- F103/F411/WB55/WBA55 configuration
ESP32 Platform Guide -- S3/C3 WiFi, ESP-IDF integration
SDK Overview -- Rust-first embedded SDK architecture