Performance Evaluation — MCU Targets & Server Concurrency¶
Xylolabs API — Revision: 2026-03-23
1. XAP Codec Benchmark Results¶
All measurements taken on Apple M-series host in --release mode. The XAP encoder is implemented in crates/xylolabs-sdk/src/codec/xap.rs. ADPCM encoder is the IMA-ADPCM implementation in crates/xylolabs-sdk/src/codec/adpcm.rs.
1.1 Encode Time Per Frame — Mono, 10ms¶
| Rate | Ch | Samples | Frame | Avg (us) | Min (us) | Max (us) | Budget% |
|---|---|---|---|---|---|---|---|
| 8000 | 1 | 80 | 10ms | 0.5 | 0.4 | 12.7 | 0.005% |
| 16000 | 1 | 160 | 10ms | 1.1 | 1.0 | 1.2 | 0.011% |
| 24000 | 1 | 240 | 10ms | 1.9 | 1.6 | 6.8 | 0.019% |
| 32000 | 1 | 320 | 10ms | 3.0 | 2.6 | 9.0 | 0.030% |
| 48000 | 1 | 480 | 10ms | 512.3 | 451.1 | 700.0 | 5.123% |
| 96000 | 1 | 960 | 10ms | 1954.0 | 1763.8 | 2207.8 | 19.540% |
Critical Observation — Cosine Table Threshold: There is a 170x discontinuity between 32kHz (3.0 us) and 48kHz (512.3 us). This is caused by the precomputed cosine table cutoff at XAP_PRECOMPUTE_MAX_N = 320 samples. Frames with N <= 320 (sample rates up to 32kHz at 10ms) use O(N^2) table lookup with zero trigonometric function calls. Frames with N > 320 (48kHz = 480, 96kHz = 960) fall back to runtime cosf() per MDCT coefficient, which is dramatically slower.
On real MCU targets with DSP extensions, the CMSIS-DSP arm_rfft_fast_f32 replaces this entire MDCT path, eliminating the discontinuity. The host benchmark reflects the software-only encoder behavior.
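The cutoff behavior can be sketched in a few lines. This is an illustrative reconstruction, not the SDK's actual code: `XAP_PRECOMPUTE_MAX_N` is named in the source, but the function signatures, table layout, and angle formula here are assumptions. Both paths compute identical coefficients; only the cost differs.

```rust
use std::f32::consts::PI;

/// Cutoff from the SDK: frames up to this length use the precomputed table.
const XAP_PRECOMPUTE_MAX_N: usize = 320;

/// Hypothetical MDCT basis angle for an N-sample frame.
fn mdct_angle(n: usize, k: usize, t: usize) -> f32 {
    let half = n as f32 / 2.0;
    PI / half * (t as f32 + 0.5 + half / 2.0) * (k as f32 + 0.5)
}

/// Builds a flat [N/2 x N] cosine table (illustrative layout).
fn build_cos_table(n: usize) -> Vec<f32> {
    let half = n / 2;
    let mut table = vec![0.0f32; half * n];
    for k in 0..half {
        for t in 0..n {
            table[k * n + t] = mdct_angle(n, k, t).cos();
        }
    }
    table
}

/// Naive O(N^2) forward MDCT mirroring the benchmark discontinuity.
fn mdct_forward(input: &[f32], table: Option<&[f32]>) -> Vec<f32> {
    let n = input.len();
    (0..n / 2)
        .map(|k| {
            (0..n)
                .map(|t| {
                    let c = match table {
                        // fast path: table lookup, zero trig calls
                        Some(tab) if n <= XAP_PRECOMPUTE_MAX_N => tab[k * n + t],
                        // slow path: one runtime cos() per (k, t) term
                        _ => mdct_angle(n, k, t).cos(),
                    };
                    input[t] * c
                })
                .sum()
        })
        .collect()
}
```

The discontinuity follows directly: for N <= 320 every coefficient term is a table load, while for N > 320 each of the N^2/2 terms pays a trig call.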
1.2 Channel Scaling — 16kHz, 10ms¶
| Rate | Ch | Samples/ch | Avg (us) | Per-ch (us) | Scaling |
|---|---|---|---|---|---|
| 16000 | 1 | 160 | 1.0 | 1.0 | 1.00x |
| 16000 | 2 | 160 | 1.9 | 1.0 | 1.90x |
| 16000 | 3 | 160 | 2.8 | 0.9 | 2.80x |
| 16000 | 4 | 160 | 3.7 | 0.9 | 3.70x |
Channel scaling is sub-linear (3.7x for 4 channels instead of 4.0x). The encoder processes channels independently with shared infrastructure (header write, de-interleave). The per-channel MDCT cost dominates, and the slight sub-linearity comes from amortized header/setup overhead.
1.3 Four-Channel Stress — Key Sample Rates¶
| Rate | Ch | Samples/ch | Avg (us) | Min (us) | Max (us) | Budget% |
|---|---|---|---|---|---|---|
| 16000 | 4 | 160 | 3.6 | 3.2 | 4.6 | 0.036% |
| 48000 | 4 | 480 | 2004.7 | 1829.7 | 2378.2 | 20.047% |
| 96000 | 4 | 960 | 7930.9 | 7390.5 | 10632.4 | 79.309% |
The 4ch@96kHz configuration consumes 79.3% of the 10ms frame budget on the host. This leaves only 2.1ms headroom for XMBP encoding, I/O, sensors, and housekeeping on the host. On MCU targets with DSP acceleration, the MDCT is replaced by hardware-accelerated FFT paths that reduce this to 37-56 MIPS (see Section 2).
1.4 Frame Duration Comparison — 48kHz, 2ch¶
| Rate | Ch | Duration | Samples/ch | Avg (us) | Budget% |
|---|---|---|---|---|---|
| 48000 | 2 | 7.5ms | 360 | 581.7 | 7.756% |
| 48000 | 2 | 10ms | 480 | 1007.6 | 10.076% |
The 7.5ms frame processes fewer samples (360 vs 480), takes 42% less absolute time, and consumes a smaller fraction of its shorter budget (7.8% vs 10.1%), because the O(N^2) MDCT cost grows faster than the frame duration. The 10ms frame remains preferred for MCU targets: the longer budget window provides more headroom for scheduling jitter and interrupt latency.
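The Budget% column is simple arithmetic (encode time over frame duration); a quick check against the table values:

```rust
/// Percent of the frame budget consumed: encode time / frame duration.
fn budget_pct(encode_us: f64, frame_ms: f64) -> f64 {
    encode_us / (frame_ms * 1000.0) * 100.0
}
```

For the two rows above: budget_pct(581.7, 7.5) = 7.756% and budget_pct(1007.6, 10.0) = 10.076%.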
1.5 ADPCM Encode Time Per Frame — 10ms¶
| Rate | Ch | Samples | Avg (us) | Min (us) | Max (us) | Budget% |
|---|---|---|---|---|---|---|
| 16000 | 1 | 160 | 0.93 | 0.83 | 7.21 | 0.009% |
| 16000 | 2 | 320 | 1.83 | 1.71 | 2.58 | 0.018% |
| 16000 | 4 | 640 | 3.70 | 3.50 | 10.21 | 0.037% |
| 48000 | 1 | 480 | 2.87 | 2.62 | 9.58 | 0.029% |
| 48000 | 2 | 960 | 5.68 | 5.38 | 12.33 | 0.057% |
| 48000 | 4 | 1920 | 11.56 | 10.79 | 19.67 | 0.116% |
| 96000 | 1 | 960 | 5.79 | 5.38 | 13.50 | 0.058% |
| 96000 | 2 | 1920 | 11.51 | 10.83 | 20.42 | 0.115% |
| 96000 | 4 | 3840 | 23.00 | 21.75 | 48.79 | 0.230% |
ADPCM is trivially cheap at all configurations. Even 4ch@96kHz costs only 23 us (0.23% of budget). This is because ADPCM uses pure integer arithmetic with no spectral transform -- it encodes sample-by-sample deltas using a lookup table.
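As a concrete sketch of that sample-by-sample scheme, here is a minimal standard IMA-ADPCM encode step. The step and index tables are the standard IMA ones; the function shape is illustrative (one 4-bit code per sample, no block headers or nibble packing), not the SDK's actual `adpcm.rs` API.

```rust
/// Standard IMA-ADPCM step-size table (89 entries).
const STEP_TABLE: [i32; 89] = [
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31, 34, 37, 41,
    45, 50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143, 157, 173, 190,
    209, 230, 253, 279, 307, 337, 371, 408, 449, 494, 544, 598, 658, 724,
    796, 876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066, 2272,
    2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358, 5894, 6484, 7132,
    7845, 8630, 9493, 10442, 11487, 12635, 13899, 15289, 16818, 18500,
    20350, 22385, 24623, 27086, 29794, 32767,
];
/// Step-index adjustment per 3-bit magnitude code.
const INDEX_TABLE: [i32; 8] = [-1, -1, -1, -1, 2, 4, 6, 8];

/// Emits one 4-bit code per i16 sample -- pure integer math, no transform.
fn adpcm_encode(samples: &[i16]) -> Vec<u8> {
    let (mut predictor, mut index) = (0i32, 0i32);
    samples
        .iter()
        .map(|&s| {
            let step = STEP_TABLE[index as usize];
            let mut diff = s as i32 - predictor;
            let sign: u8 = if diff < 0 { 8 } else { 0 };
            if diff < 0 {
                diff = -diff;
            }
            // Quantize |diff| to a 3-bit magnitude against the current step.
            let mut code: u8 = 0;
            let mut vpdiff = step >> 3;
            if diff >= step {
                code |= 4;
                diff -= step;
                vpdiff += step;
            }
            if diff >= step >> 1 {
                code |= 2;
                diff -= step >> 1;
                vpdiff += step >> 1;
            }
            if diff >= step >> 2 {
                code |= 1;
                vpdiff += step >> 2;
            }
            // Update the predictor exactly as the decoder will reconstruct it.
            predictor += if sign != 0 { -vpdiff } else { vpdiff };
            predictor = predictor.clamp(-32768, 32767);
            index = (index + INDEX_TABLE[code as usize]).clamp(0, 88);
            sign | code
        })
        .collect()
}
```

Every sample costs a handful of integer compares, shifts, and adds, which is why the cost in Section 1.5 scales linearly with sample count and never approaches the budget.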
1.6 XAP vs ADPCM Cost Ratio¶
| Config | XAP (us) | ADPCM (us) | Ratio |
|---|---|---|---|
| 1ch@16kHz | 1.0 | 0.93 | 1x |
| 4ch@16kHz | 3.6 | 3.66 | 1x |
| 4ch@48kHz | 2054.1 | 11.72 | 175x |
| 4ch@96kHz | 7973.9 | 22.91 | 348x |
At low sample rates (<=32kHz), XAP and ADPCM have comparable cost because the precomputed cosine table eliminates runtime trig. At 48kHz and above, XAP becomes 175-348x more expensive due to the O(N^2) MDCT with runtime cosf(). On MCU targets, DSP-accelerated FFT paths close this gap significantly (to approximately 20-40x), but ADPCM remains the lightest option for CPU-constrained targets.
2. MCU Feasibility Matrix¶
2.1 Maximum Sustainable Configuration Per Target¶
Evaluated using documented MIPS profiles from PERFORMANCE-PROFILE.md and CODEC-ANALYSIS.md, cross-referenced with benchmark measurements. CPU% includes codec encoding, I/O stack (I2S DMA, transport protocol, sensor sampling), and system overhead. RAM% includes SDK client state, codec buffers, ring buffers, XMBP framing, transport stack, and task stacks.
| Target | Clock | SRAM | DSP/FPU | Max Audio Config | CPU% | RAM% | Verdict |
|---|---|---|---|---|---|---|---|
| RP2350 (Pico 2) | 150 MHz | 520 KB | M33 DSP+FPU | 4ch @96kHz XAP | 46.0% | 16.9% | COMFORTABLE |
| ESP32-S3 | 240 MHz | 512 KB + 8MB PSRAM | PIE SIMD+FPU | 4ch @96kHz XAP | 17.7% | 24.2% | COMFORTABLE |
| STM32F411 | 100 MHz | 128 KB | M4F DSP+FPU | 4ch @48kHz XAP | 40.0% | 34.4% | FEASIBLE |
| nRF52840 | 64 MHz | 256 KB | M4F DSP+FPU | 2ch @48kHz XAP | 42.2% | 21.9% | FEASIBLE |
| nRF9160 | 64 MHz | 256 KB | M33 DSP+FPU | 2ch @48kHz XAP | 44.5% | 21.9% | FEASIBLE |
| STM32WB55 | 64 MHz | 256 KB | M4F DSP+FPU | 2ch @48kHz XAP | 42.2% | 17.2% | FEASIBLE |
| RP2040 (Pico) | 133 MHz | 264 KB | None | ADPCM 4ch @96kHz | 3.0% | 12.1% | ADPCM ONLY |
| ESP32-C3 | 160 MHz | 400 KB | M ext only | ADPCM 4ch @96kHz | 2.5% | 8.0% | ADPCM ONLY |
| STM32F103 | 72 MHz | 20 KB | None | ADPCM 2ch @24kHz | 1.4% | 80.0% | SENSOR ONLY |
2.2 Verdict Definitions¶
| Verdict | CPU Utilization | Meaning |
|---|---|---|
| COMFORTABLE | < 50% | Ample headroom for additional processing, OTA updates, or future features. |
| FEASIBLE | 50-70% | Sufficient headroom for stable operation with careful scheduling. |
| TIGHT | 70-85% | Operational but may exhibit jitter under worst-case interrupt latency. |
| MARGINAL | 85-100% | Risk of frame drops under load. Not recommended for production. |
| ADPCM ONLY | N/A (XAP infeasible) | No DSP/FPU; cannot run XAP encoder. ADPCM at 4:1 compression only. |
| SENSOR ONLY | N/A (audio limited) | Extreme SRAM constraint. ADPCM 1-2ch + sensor telemetry only. |
2.3 Detailed CPU Budget Per Target¶
RP2350 (Pico 2) — 4ch XAP @96kHz¶
Dual-core Cortex-M33 at 150 MHz. Core 0 handles I2S DMA + XAP encoding, Core 1 handles XMBP + HTTP + sensors.
| Component | Baseline MIPS | With DSP MIPS | % of 150 MHz |
|---|---|---|---|
| I2S DMA handling | 2 | 2 | 1.3% |
| XAP MDCT forward | 50 | 35 | 23.3% |
| XAP quantize+pack | 15 | 10 | 6.7% |
| XMBP batch encode | 5 | 5 | 3.3% |
| HTTP transport | 10 | 10 | 6.7% |
| Sensor sampling (26ch) | 5 | 5 | 3.3% |
| Watchdog + housekeeping | 2 | 2 | 1.3% |
| Total | 89 | 69 | 46.0% |
| Available headroom | 61 | 81 | 54.0% |
With the dual-core split described above, overall system utilization drops to ~39.3%.
ESP32-S3 — 4ch XAP @96kHz¶
Dual-core Xtensa LX7 at 240 MHz (480 MIPS total).
| Component | Baseline MIPS | With PIE MIPS | % of 480 MIPS |
|---|---|---|---|
| I2S DMA handling | 2 | 2 | 0.4% |
| XAP MDCT forward | 50 | 20 | 4.2% |
| XAP quantize+pack | 15 | 6 | 1.3% |
| WiFi stack (FreeRTOS) | 30 | 30 | 6.3% |
| XMBP batch encode | 5 | 5 | 1.0% |
| HTTP/TLS transport | 20 | 12 | 2.5% |
| Sensor sampling (26ch) | 5 | 5 | 1.0% |
| PSRAM DMA management | 3 | 3 | 0.6% |
| Watchdog + housekeeping | 2 | 2 | 0.4% |
| Total | 132 | 85 | 17.7% |
| Available headroom | 348 | 395 | 82.3% |
The ESP32-S3 has the most comfortable margin by far, primarily due to 128-bit PIE SIMD on the MDCT inner loop and hardware AES/SHA offload for TLS.
STM32F411 — 4ch XAP @48kHz¶
Single-core Cortex-M4F at 100 MHz.
| Component | Baseline MIPS | With DSP MIPS | % of 100 MHz |
|---|---|---|---|
| I2S DMA handling | 2 | 2 | 2.0% |
| XAP MDCT forward | 25 | 15 | 15.0% |
| XAP quantize+pack | 8 | 5 | 5.0% |
| XMBP batch encode | 3 | 3 | 3.0% |
| UART LTE-M1 transport | 8 | 8 | 8.0% |
| Sensor sampling (26ch) | 5 | 5 | 5.0% |
| Watchdog + housekeeping | 2 | 2 | 2.0% |
| Total | 53 | 40 | 40.0% |
| Available headroom | 47 | 60 | 60.0% |
The M4F FPU enables the XAP floating-point encoder path, which is faster than fixed-point on this core. 4ch@96kHz is not recommended (would exceed 80% utilization).
nRF52840 — 2ch XAP @48kHz¶
Single-core Cortex-M4F at 64 MHz.
| Component | Baseline MIPS | With DSP MIPS | % of 64 MHz |
|---|---|---|---|
| I2S DMA handling | 1 | 1 | 1.6% |
| XAP MDCT forward | 12 | 7 | 10.9% |
| XAP quantize+pack | 4 | 3 | 4.7% |
| BLE GATT stack | 10 | 10 | 15.6% |
| XMBP batch encode | 2 | 2 | 3.1% |
| Sensor sampling (4ch) | 2 | 2 | 3.1% |
| Watchdog + housekeeping | 2 | 2 | 3.1% |
| Total | 33 | 27 | 42.2% |
| Available headroom | 31 | 37 | 57.8% |
BLE stack overhead is the single largest non-codec consumer (15.6%). 4ch@48kHz is possible (~28 MIPS with DSP, 44% utilization) but leaves minimal headroom.
STM32F103 — Sensor-only + ADPCM fallback¶
Single-core Cortex-M3 at 72 MHz, 20 KB SRAM.
| Component | MIPS | % of 72 MHz |
|---|---|---|
| ADPCM encode 2ch @24kHz | 1 | 1.4% |
| XMBP batch encode | 2 | 2.8% |
| UART LTE-M1 transport | 8 | 11.1% |
| Sensor sampling (4ch) | 3 | 4.2% |
| Watchdog + housekeeping | 2 | 2.8% |
| Total | 16 | 22.2% |
| Available headroom | 56 | 77.8% |
XAP is not feasible: the 32 KB encoder state for 4 channels exceeds the entire 20 KB SRAM. Even mono XAP (8 KB encoder state) leaves only 12 KB for everything else. ADPCM at 2ch @24kHz is the maximum practical audio configuration.
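The SRAM arithmetic behind that verdict, using the ~8 KB-per-channel state figure quoted above (the helper names are hypothetical, just for the check):

```rust
/// Rough XAP encoder state footprint, using the ~8 KB/channel figure
/// from the text (an assumption for this check, not a measured constant).
fn xap_state_bytes(channels: usize) -> usize {
    channels * 8 * 1024
}

/// True when the encoder state alone already exceeds total SRAM.
fn xap_infeasible(channels: usize, sram_bytes: usize) -> bool {
    xap_state_bytes(channels) > sram_bytes
}
```

Four channels need 32 KB of state against 20 KB of SRAM; even mono fits only by leaving 12 KB for everything else, matching the text.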
3. DSP/FPU Impact Analysis¶
3.1 Speedup by DSP Architecture¶
| DSP Architecture | Platforms | Key Instructions | XAP Speedup | Mechanism |
|---|---|---|---|---|
| ARMv8-M DSP (Cortex-M33) | RP2350, nRF9160 | SMLAD dual MAC, QADD, SSAT | ~30% | Dual 16x16 MAC doubles throughput on 16-bit audio. Saturating arithmetic eliminates branch-based clipping. Fixed-point (Q15) path preferred. |
| Cortex-M4F FPU+DSP | STM32F411, STM32WB55, nRF52840 | FPU + SMLAD + SDIV | ~35-40% | Hardware float multiply-accumulate in 1-3 cycles. Float encoder path is faster than fixed-point. CMSIS-DSP arm_rfft_fast_f32 provides 3-5x speedup for MDCT. |
| Xtensa PIE SIMD (ESP32-S3) | ESP32-S3 | 128-bit 4x f32, 8x i16 | ~60% | 4-wide vector operations on dedicated 128-bit registers. Hardware AES/SHA offloads TLS from CPU. PSRAM DMA for large buffer transfers. |
| No DSP | RP2040, ESP32-C3, STM32F103 | Software multiply only | 0% (baseline) | All operations in software. No SIMD, no hardware float. XAP MDCT requires software cosf() which is 10-50x slower than DSP-accelerated FFT. |
3.2 MDCT Path Selection¶
The MDCT forward transform is the encoder hot path, consuming 60-70% of total XAP encode time. The SDK selects the optimal path at compile time:
| Core | FPU | Compile Path | MDCT Strategy |
|---|---|---|---|
| Cortex-M33 (RP2350, nRF9160) | Single-precision | cmsis-dsp feature | Fixed-point Q15 with SMLAD dual MAC. Precomputed cosine table in Q15 format. |
| Cortex-M4F (F411, WB55, nRF52840) | Single-precision | cmsis-dsp feature | Floating-point with arm_rfft_fast_f32. Hardware FPU makes float path faster than fixed-point. |
| Xtensa LX7 (ESP32-S3) | Single-precision | esp32-simd feature | Float with PIE 128-bit SIMD. 4 f32 values processed per vector instruction. |
| Cortex-M0+ (RP2040) | None | default (no DSP) | Not feasible. Software float MDCT exceeds CPU budget. ADPCM only. |
| Cortex-M3 (STM32F103) | None | default (no DSP) | Not feasible. Same as M0+. ADPCM only. |
| RISC-V (ESP32-C3) | None | default (no DSP) | Not feasible. M extension provides integer multiply but no SIMD. ADPCM only. |
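Compile-time selection of this kind is typically done with Cargo feature gates. The `cmsis-dsp` and `esp32-simd` feature names come from the table above, but the cfg wiring and function name here are illustrative, not the SDK's actual module layout:

```rust
// Illustrative compile-time MDCT path dispatch. The features are assumed
// to be mutually exclusive, as in the table above.

#[cfg(feature = "cmsis-dsp")]
fn mdct_strategy() -> &'static str {
    // On Cortex-M targets this path would call CMSIS-DSP via FFI:
    // arm_rfft_fast_f32 on M4F, Q15 fixed-point with SMLAD on M33.
    "cmsis-dsp"
}

#[cfg(feature = "esp32-simd")]
fn mdct_strategy() -> &'static str {
    // On ESP32-S3 this path would use PIE 128-bit SIMD intrinsics,
    // processing 4 f32 values per vector instruction.
    "esp32-simd"
}

#[cfg(not(any(feature = "cmsis-dsp", feature = "esp32-simd")))]
fn mdct_strategy() -> &'static str {
    // Portable software fallback: fine on hosts, too slow on no-DSP
    // MCUs (RP2040, ESP32-C3, STM32F103) -- hence "ADPCM only".
    "software"
}
```

Because the selection happens at compile time, no-DSP targets never link the DSP code paths at all.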
3.3 Benchmark-Measured Cosine Table Impact¶
The host benchmark reveals the dramatic impact of the precomputed cosine table:
| Sample Rate | Frame Samples | Table Used | Encode Time (1ch) | Notes |
|---|---|---|---|---|
| 8 kHz | 80 | Yes | 0.5 us | O(N^2) table lookup, zero trig calls |
| 16 kHz | 160 | Yes | 1.1 us | O(N^2) table lookup |
| 24 kHz | 240 | Yes | 1.9 us | O(N^2) table lookup |
| 32 kHz | 320 | Yes (limit) | 3.0 us | Max N for precomputed table |
| 48 kHz | 480 | No | 512.3 us | Runtime cosf() per coefficient -- 170x slower |
| 96 kHz | 960 | No | 1954.0 us | Runtime cosf() -- 651x slower than 32kHz |
The table limit (XAP_PRECOMPUTE_MAX_N = 320) caps memory usage at 200 KB (51,200 i32 entries * 4 bytes = 200 KB). Extending to 480 would require (480/2)*480 = 115,200 entries = 450 KB, which exceeds the SRAM of most MCU targets. On MCU targets, CMSIS-DSP replaces the entire MDCT with hardware-accelerated FFT, making this table irrelevant.
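The sizing arithmetic is easy to verify: the table holds (N/2)*N entries at 4 bytes each.

```rust
/// Cosine-table footprint for an N-sample frame: (N/2) * N entries, 4 bytes each.
fn cos_table_bytes(n: usize) -> usize {
    (n / 2) * n * 4
}
```

cos_table_bytes(320) is 204,800 bytes (200 KB) and cos_table_bytes(480) is 460,800 bytes (450 KB), matching the figures above.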
4. Memory Budget Analysis¶
4.1 Per-Target Memory Breakdown¶
All values in KB. Configurations shown are the maximum recommended per Section 2.
| Target | SRAM | SDK+Codec | Ring Buf | XMBP | HTTP | Stack | Used | Avail | RAM% |
|---|---|---|---|---|---|---|---|---|---|
| RP2350 (Pico 2) | 520 | 20 | 32 | 16 | 4 | 16 | 88 | 432 | 16.9% |
| ESP32-S3 | 512+8M | 20 | 64* | 16 | 8 | 16 | 124 | 388+ | 24.2% |
| STM32F411 | 128 | 20 | 8 | 4 | 4 | 8 | 44 | 84 | 34.4% |
| nRF52840 | 256 | 20 | 16 | 8 | 4 | 8 | 56 | 200 | 21.9% |
| nRF9160 | 256 | 20 | 16 | 8 | 4 | 8 | 56 | 200 | 21.9% |
| STM32WB55 | 256 | 20 | 8 | 4 | 4 | 8 | 44 | 212 | 17.2% |
| RP2040 (Pico) | 264 | 8 | 8 | 4 | 4 | 8 | 32 | 232 | 12.1% |
| ESP32-C3 | 400 | 8 | 8 | 4 | 4 | 8 | 32 | 368 | 8.0% |
| STM32F103 | 20 | 4 | 4 | 2 | 2 | 4 | 16 | 4 | 80.0% |
*ESP32-S3 ring buffer placed in PSRAM via DMA.
4.2 Memory-Limited Scenarios¶
RP2350 (520 KB SRAM): 4ch @96kHz + 26 sensors + HTTP = 88 KB (16.9%). Has 432 KB remaining for application logic, OTA staging buffer, and filesystem. The generous SRAM headroom makes RP2350 the best balanced target for feature-rich deployments.
STM32F411 (128 KB SRAM): 4ch @48kHz XAP = 44 KB (34.4%). The 84 KB remaining is adequate for the application but leaves no room for OTA staging. Firmware updates must use external flash or a swap-based bootloader. Dropping to 2ch@48kHz reduces usage to 36 KB, freeing 8 KB.
nRF52840 (256 KB SRAM): 2ch @48kHz + BLE GATT stack = 56 KB (21.9%). The SoftDevice BLE stack itself consumes an additional ~30 KB. With SoftDevice: total ~86 KB used, 170 KB available. Comfortable for BLE-based deployments.
STM32F103 (20 KB SRAM): Sensor-only ADPCM = 16 KB (80.0%). Only 4 KB remains for application logic. This is the absolute minimum viable configuration. Even adding a single additional sensor stream would require careful stack optimization. XAP is completely impossible: the encoder state alone (8 KB per channel, 32 KB for 4ch) exceeds total SRAM.
ESP32-S3 (512 KB + 8 MB PSRAM): The PSRAM massively extends available memory. Audio ring buffers (64 KB+) and XMBP batch buffers can reside in PSRAM with DMA access, keeping fast SRAM free for codec state and stack. This makes ESP32-S3 the only target that could feasibly support configurations beyond 4ch@96kHz (e.g., 8ch@48kHz for surround monitoring) without memory pressure.
5. Server Concurrency Evaluation¶
5.1 Architecture Overview¶
The Xylolabs API server is built on Tokio async runtime with Axum. The ingest pipeline (crates/xylolabs-server/src/ingest/manager.rs) processes XMBP batches from MCU devices, buffers samples in memory, compresses via zstd in spawn_blocking, and flushes to S3 and PostgreSQL.
5.2 Connection Parameters¶
| Parameter | Value | Notes |
|---|---|---|
| Runtime | Tokio multi-threaded | Worker threads = CPU cores |
| DB connection pool | 20 (default, configurable) | DATABASE_MAX_CONNECTIONS env var |
| Per-session memory | ~1-4 KB per stream buffer | Excluding accumulated samples |
| Upload body limit | Up to 2 GB | Currently full-buffer (not streaming) |
| SSE live connections | Unbounded | Broadcast channels per session |
| HTTP keep-alive | 75 seconds | Axum default |
| Auth rate limit | 10 attempts/IP/minute | In-memory cache, 60s TTL, 10K IP capacity |
| Ingest flush window | 10 seconds (configurable) | Accumulates samples before S3 write |
| Session timeout | 300 seconds (configurable) | Auto-close stale sessions |
5.3 Ingest Pipeline Throughput¶
| Stage | Latency | Concurrency Model | Notes |
|---|---|---|---|
| XMBP batch decode | < 100 us per batch | Inline async | Pure CPU, no I/O |
| Sample buffering | < 1 us per sample | Mutex-guarded HashMap | Per-stream Vec |
| Live event broadcast | < 10 us per event | broadcast::channel | Only if subscribers exist |
| zstd compression | ~200 us per chunk | spawn_blocking | Offloaded from async runtime |
| S3 upload | ~5-20 ms per chunk | Async I/O | Network-bound; MinIO local ~2 ms |
| DB insert (chunk record) | ~1-2 ms per record | Async sqlx | Batched within flush window |
| DB stats update | ~1 ms per batch | Inline async | Single UPDATE per batch |
Estimated throughput per core: ~500 XMBP batches/sec (CPU-bound stage is compression at ~200 us/chunk, but offloaded to blocking pool).
5.4 Concurrent Session Capacity¶
| Scenario | Sessions | Audio Config | Sensor Streams | Server CPU | DB Load | RAM |
|---|---|---|---|---|---|---|
| Light | 10 | 10 x 2ch @16kHz | 40 @100Hz | < 5% | Low | ~40 KB |
| Standard | 50 | 50 x 4ch @48kHz | 200 @100Hz | ~20% | Medium | ~200 KB |
| Heavy | 100 | 100 x 4ch @96kHz | 2600 @100Hz | ~60% | High | ~400 KB |
| Limit | ~200 | Limited by DB pool | Limited by DB pool | ~90% | Saturated | ~800 KB |
The limiting factor is the PostgreSQL connection pool (default 20). Each flush operation requires a DB connection for the chunk INSERT. With 100 sessions flushing every 10 seconds, the pool handles ~10 flush operations/sec in aggregate, which is well within capacity. At 200+ sessions with aggressive flush windows, pool exhaustion becomes the bottleneck.
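The aggregate flush rate behind this estimate is straightforward arithmetic (a sketch, using the session counts and flush window quoted above):

```rust
/// Aggregate flush operations per second: one flush per session per window.
fn flushes_per_sec(sessions: u32, flush_window_s: u32) -> f64 {
    sessions as f64 / flush_window_s as f64
}
```

At 100 sessions with the default 10-second window this is 10 flushes/sec across the whole pool; 200 sessions with an aggressive 5-second window quadruples it to 40/sec, each flush competing for one of 20 pooled connections.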
5.5 Concurrency Hazards — Identified and Resolved¶
| Issue | Severity | Status | Resolution |
|---|---|---|---|
| ConfigManager blocking RwLock in async context | P2 | Fixed | Migrated from std::sync::RwLock to tokio::sync::RwLock. Blocking lock in async runtime caused thread starvation under contention. |
| Ingest flush data loss on S3 failure | P0 | Fixed | Previously drained samples before confirming S3 write. Now clones samples before flush; originals retained on failure for retry. flushing flag prevents concurrent flushes of same buffer. |
| N+1 tag queries in list_uploads | P1 | Fixed | Replaced per-upload tag fetch (1+N queries) with single JOIN batch query. |
| Sequential stats_overview queries | P2 | Fixed | Six independent COUNT queries now run concurrently via tokio::try_join!. |
| S3 full-file buffering on upload | P1 | Open | 2 GB upload = 2 GB RAM. Needs streaming multipart upload with backpressure. |
| Upload body buffered in memory | P1 | Open | Large upload bodies held entirely in memory before S3 write. Needs streaming body handling. |
| No connection rate limiting (general) | P3 | Open | Auth endpoints have IP-based rate limiting (10/min). General API endpoints lack rate limiting. |
| File descriptor leak via mem::forget | P0 | Fixed | Temp files leaked via mem::forget. Fixed with into_temp_path for deterministic cleanup. |
| Session lock contention | P2 | Mitigated | IngestManager.sessions uses tokio::sync::RwLock. Individual sessions use tokio::sync::Mutex. Flush operations drop the session lock before performing I/O. Stale session check uses atomic last_activity_ms without locking session mutexes. |
5.6 Remaining Performance Risks¶
S3 full-file buffering: The most significant open issue. A single 2 GB upload consumes 2 GB of server RAM. With 10 concurrent large uploads, the server requires 20 GB RAM just for upload buffering. The fix requires streaming multipart upload to S3 with bounded memory buffers and backpressure signaling to the client.
DB pool starvation: The default pool of 20 connections supports approximately 200 concurrent sessions at the current flush interval. Beyond this, flush operations queue behind the pool, increasing latency and sample buffer memory. Increasing the pool size requires corresponding PostgreSQL max_connections tuning.
Broadcast channel unbounded subscribers: The SSE live event endpoint creates broadcast receivers with no limit on subscriber count. A malicious client opening thousands of SSE connections could exhaust memory with broadcast channel buffers.
6. Recommendations¶
6.1 Platform Selection by Use Case¶
| Use Case | Recommended Platform | Codec | Max Config | Rationale |
|---|---|---|---|---|
| Full-spectrum industrial monitoring | ESP32-S3 | XAP | 4ch @96kHz | 82% headroom, WiFi built-in, PSRAM for large buffers |
| Battery-powered field sensor | RP2350 (Pico 2) | XAP | 4ch @96kHz | 54% headroom, lowest active power (25 mA), dual-core |
| Compact industrial node | STM32F411 | XAP | 4ch @48kHz | 60% headroom, proven M4F ecosystem, UART LTE-M |
| BLE wearable / beacon | nRF52840 | XAP | 2ch @48kHz | 58% headroom, BLE transport, ultra-low sleep (1.5 uA) |
| Cellular IoT (LTE-M) | nRF9160 | XAP | 2ch @48kHz | 55% headroom, integrated LTE-M modem |
| BLE + Thread mesh | STM32WB55 | XAP | 2ch @48kHz | 58% headroom, dual-protocol BLE + 802.15.4 |
| Voice-only / legacy sensor | STM32F103 | IMA-ADPCM | 2ch @24kHz | 78% headroom, ADPCM only, 20 KB SRAM limit |
| Low-cost WiFi sensor | ESP32-C3 | IMA-ADPCM | 4ch @96kHz | 98% headroom, ADPCM only, RISC-V, WiFi built-in |
| Education / prototyping | RP2040 (Pico) | IMA-ADPCM | 4ch @96kHz | 97% headroom, ADPCM only, lowest cost ($1) |
6.2 Server Scaling Recommendations¶
| Load Tier | Sessions | Server Config | DB Pool | Notes |
|---|---|---|---|---|
| Development | 1-10 | Single instance, 2 CPU | 10 | Default configuration sufficient |
| Small deployment | 10-50 | Single instance, 4 CPU | 20 | Default pool adequate |
| Medium deployment | 50-200 | Single instance, 8 CPU | 40 | Increase pool, monitor flush latency |
| Large deployment | 200-1000 | Multiple instances + LB | 60/instance | Requires streaming S3 upload fix, horizontal scaling |
| Enterprise | 1000+ | Kubernetes pods, auto-scale | Connection pooler (PgBouncer) | Requires all P1 issues resolved |
6.3 Priority Engineering Work¶
- P1 — Streaming S3 upload: Replace full-file buffering with streaming multipart upload. Eliminates the O(file_size) memory consumption. Critical for any deployment handling files larger than 100 MB.
- P1 — Streaming upload body: Implement backpressure-aware streaming from HTTP body to S3, never holding more than a bounded buffer (e.g., 4 MB) in memory.
- P2 — SSE subscriber limits: Cap broadcast channel receivers per session (e.g., 100). Return 429 when exceeded.
- P3 — General API rate limiting: Add Tower rate-limit middleware to all API endpoints, not just auth.
- Optimization — Cosine table extension: Consider extending XAP_PRECOMPUTE_MAX_N to 480 on targets with sufficient SRAM (ESP32-S3 with PSRAM) for 48kHz table-based encoding without DSP dependency. This would eliminate the 170x discontinuity for the 48kHz use case.
7. Related Documents¶
- Performance Profile -- DSP acceleration matrix, per-target CPU/memory budgets
- Codec Analysis -- 16 audio codecs compared across 5 MCU platforms
- RP2350 Feasibility -- 4ch 96kHz architecture, detailed CPU/memory budget
- Pico 2 Platform Guide -- RP2350 hardware setup and build
- STM32 Platform Guide -- F103/F411/WB55/WBA55 configuration
- ESP32 Platform Guide -- S3/C3 WiFi, ESP-IDF integration
- SDK Overview -- Rust-first embedded SDK architecture