Audio codec analysis for MCU-based industrial streaming¶
Xylolabs API — Codec Feasibility Study Revision: 2026-03-22
PATENT PENDING — XAP (Xylolabs Audio Protocol) and XMBP (Xylolabs Metadata Binary Protocol) are proprietary technologies of Xylolabs Inc. Patent applications have been filed.
1. Overview¶
Use case¶
Industrial audio monitoring with four-channel microphone arrays (two stereo pairs) sampled at 96 kHz. Audio is captured on an embedded MCU, compressed in real time, and streamed over LTE-M1 to a cloud ingest server, where it is decoded, stored, and made available for analysis.
LTE-M1 bandwidth constraint¶
LTE-M1 (LTE Cat-M1) provides approximately 375 kbps uplink throughput in good conditions, degrading to ~100–200 kbps under typical field conditions. With protocol overhead, the practical sustained audio payload budget is:
- Best case: ~47 KB/s (375 kbps ÷ 8, no overhead)
- Practical target: ~30–40 KB/s (allowing for retransmissions, headers, metadata)
Requirements¶
| Requirement | Value |
|---|---|
| Channels | 4 (two stereo pairs) |
| Sample rate | 96 kHz |
| Bit depth | 16-bit (24-bit optional) |
| Encode location | MCU (no OS, bare-metal or RTOS) |
| Decode location | Cloud server (Linux, no resource constraints) |
| Max sustained bandwidth | ~40 KB/s |
| Latency budget | < 100 ms end-to-end |
| Quality priority | Industrial monitoring — preserve full spectrum |
Raw PCM baseline¶
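Uncompressed 16-bit PCM for 4 channels at 96 kHz requires 4 × 96,000 samples/s × 2 bytes = 768 KB/s (6,144 kbps). This is roughly 19× over the ~40 KB/s budget, so significant compression is required.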
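The baseline arithmetic can be reproduced with a minimal sketch. It assumes the figures from the requirements table (4 channels, 96 kHz, 16-bit, ~40 KB/s budget) and decimal kilobytes (1 KB = 1000 bytes), which matches the 375 kbps ÷ 8 ≈ 47 KB/s conversion used earlier. The function names are illustrative and not part of the Xylolabs SDK.

```rust
/// Raw PCM data rate in bytes per second.
fn raw_pcm_bytes_per_sec(channels: u32, sample_rate_hz: u32, bits_per_sample: u32) -> u32 {
    channels * sample_rate_hz * bits_per_sample / 8
}

/// How many times a data rate exceeds a budget (both in bytes per second).
fn over_budget_factor(rate_bytes_per_sec: u32, budget_bytes_per_sec: u32) -> f64 {
    rate_bytes_per_sec as f64 / budget_bytes_per_sec as f64
}
```

With the optional 24-bit depth the same functions give 1,152,000 bytes/s, which widens the gap further.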
2. Codec comparison table¶
| Codec | Type | MIPS/ch @48 kHz | MIPS/ch @96 kHz | RAM/ch | Compression | Quality | License | MCU Feasible? |
|---|---|---|---|---|---|---|---|---|
| XAP | Lossy | 5–10 | 10–20 | ~8 KB | 10:1 | Excellent | Xylolabs proprietary (patent pending) | Yes |
| Opus (SILK) | Lossy | 15–25 | 30–50 | ~20 KB | 10:1+ | Excellent | BSD (royalty-free) | Marginal |
| Opus (CELT) | Lossy | 30–50 | 60–100 | ~30 KB | 10:1+ | Excellent | BSD (royalty-free) | No |
| MP3 (LAME) | Lossy | 20–40 | N/A (max 48 kHz) | ~30 KB | 10:1 | Good | Patent-free (2017+) | Marginal @48 kHz |
| AAC-LC | Lossy | 25–40 | 50–80 | ~25 KB | 10:1 | Very Good | ISO (Via Licensing) | No @96 kHz |
| HE-AAC v1 | Lossy | 30–50 | 60–100 | ~30 KB | 15:1 | Very Good | ISO (licensing required) | No |
| HE-AAC v2 | Lossy | 40–60 | 80–120 | ~40 KB | 20:1 | Good (stereo only) | ISO (licensing required) | No |
| xHE-AAC (USAC) | Lossy | 50–80 | N/A | ~50 KB | 25:1 | Excellent | ISO (licensing required) | No |
| SBC | Lossy | 3–5 | 6–10 | ~4 KB | 4–5:1 | Fair | Bluetooth SIG (royalty-free) | Yes |
| aptX | Lossy | 5–8 | 10–16 | ~8 KB | 4:1 | Very Good | Qualcomm (licensing required) | Yes |
| aptX HD | Lossy | 8–12 | 16–24 | ~12 KB | 4:1 | Excellent | Qualcomm (licensing required) | Yes (F411/ESP32-S3) |
| IMA-ADPCM | Lossy | < 1 | < 2 | < 1 KB | 4:1 | Fair | Public domain | Yes (all) |
| Speex | Lossy | 3–8 | N/A (max 32 kHz) | ~4 KB | 8–11:1 | Good (speech) | BSD (royalty-free) | Yes (voice only) |
| G.722 | Lossy | 3–5 | N/A (max 16 kHz) | ~2 KB | 4:1 | Good (wideband) | ITU-T (royalty-free) | Yes (voice only) |
| 3GPP EVS | Lossy | 10–20 | 15–30 | ~15 KB | 8–12:1 | Excellent (voice) | 3GPP (licensing required) | Marginal |
| AAC-ELD | Lossy | 15–25 | 20–30 | ~20 KB | 8–10:1 | Very Good | ISO (licensing required) | No @96 kHz |
| aptX Lossless | Lossless | N/A | N/A | N/A | ~2:1 | Perfect | Qualcomm (licensing required) | No |
| FLAC | Lossless | 30–50 | 60–100 | ~50 KB | 2:1 | Perfect | BSD (royalty-free) | No (too slow) |
| ALAC | Lossless | 25–40 | 50–80 | ~40 KB | 2:1 | Perfect | Apache (royalty-free) | No |
3. Detailed analysis per codec¶
3.1 XAP — Xylolabs Audio Protocol¶
- Sample rates: 8, 16, 24, 32, 44.1, 48, 96 kHz
- Frame sizes: 7.5 ms or 10 ms
- Bitrate range: 16–320 kbps per channel (encoder-configurable)
- MCU implementation: XAP decoder library — open-source reference implementation, ~5,000 lines of C, zero external dependencies, fixed-point arithmetic available
- 4ch @96 kHz bandwidth: 4 × 80 kbps = 320 kbps = 40 KB/s — fits within LTE-M1 budget
- CPU requirement: ~10 MIPS/ch @48 kHz, ~20 MIPS/ch @96 kHz → 80 MIPS total for 4ch @96 kHz → fits RP2350 (300 MIPS dual-core), fits ESP32-S3 (480 MIPS)
- With DSP acceleration: ~7 MIPS/ch @48 kHz, ~14 MIPS/ch @96 kHz → 56 MIPS for 4ch @96 kHz using CMSIS-DSP optimized paths on Cortex-M33/M4F (see Section 6)
- RAM: ~8 KB per channel encoder state → 32 KB for 4ch → fits all target MCUs
- Latency: 7.5–10 ms algorithmic delay per frame
- Quality: Near-transparent at 64 kbps/ch; excellent, broadcast-grade at 80 kbps/ch
- Licensing: Xylolabs proprietary (patent pending)
- Hardware acceleration: The XAP decoder library includes ARM-optimized fixed-point paths that use Cortex-M33/M4F DSP instructions (single-cycle MAC, saturating arithmetic). The MDCT, spectral analysis, and quantization loops see 25–40% speedup with DSP intrinsics.
Verdict: RECOMMENDED — best balance of quality, compression efficiency, CPU budget, RAM, and licensing for this use case. Already integrated in the Xylolabs Rust SDK (crates/xylolabs-sdk/) as the XAP codec module. DSP acceleration reduces CPU usage further on Cortex-M33/M4F platforms.
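For sizing transmit buffers, the bitrate and frame duration listed above determine the encoded payload per frame. The sketch below assumes constant-bitrate operation and ignores any XMBP or transport framing overhead; the function names are illustrative, not SDK APIs.

```rust
/// Encoded payload bytes for one frame of one channel, assuming constant bitrate.
fn frame_payload_bytes(bitrate_bps: u64, frame_us: u64) -> u64 {
    // bits = bitrate * seconds; divide by 8 for bytes and by 1e6 for microseconds.
    bitrate_bps * frame_us / 8_000_000
}

/// Aggregate codec payload rate in bytes per second for `channels` channels.
fn stream_bytes_per_sec(channels: u64, bitrate_bps: u64) -> u64 {
    channels * bitrate_bps / 8
}
```

At 80 kbps with 10 ms frames each channel produces 100 bytes per frame, so a four-channel stream carries 400 bytes of codec payload every 10 ms.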
3.2 Opus¶
- Standard: RFC 6716 (IETF)
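- Modes:
  - SILK: Low complexity, designed for speech. Accepts input sampled at up to 48 kHz, but SILK-only mode codes at most wideband audio (8 kHz audio bandwidth)
  - CELT: Higher complexity, full-band music audio, up to 48 kHz
  - Hybrid: Combination of SILK + CELT for 10 or 20 ms frames (super-wideband and full-band)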
- Sample rates: 8, 12, 16, 24, 48 kHz — no native 96 kHz support
- SILK mode CPU: ~20 MIPS/ch → 80 MIPS for 4ch → fits RP2350, but limited to 48 kHz
- CELT mode CPU: ~40 MIPS/ch → 160 MIPS for 4ch → tight on RP2350, exceeds STM32
- RAM (encoder): ~20–30 KB/ch → 80–120 KB for 4ch — tight on MCUs with limited SRAM
- 96 kHz limitation: The Opus encoder accepts only 8, 12, 16, 24, or 48 kHz input, so a 96 kHz capture must be downsampled to 48 kHz before encoding. Content above 24 kHz is discarded, and encoding "at 96 kHz" is not meaningful.
- Quality: Best-in-class for 48 kHz audio; indistinguishable from original at 96–128 kbps/ch in CELT mode
- Licensing: BSD-style (royalty-free, no patent encumbrances)
Verdict: GOOD for 48 kHz streams. Not suitable for true 96 kHz capture because the encoder discards high-frequency content. If the application requirement is relaxed to 48 kHz, Opus SILK is viable on RP2350.
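To illustrate what moving from 96 kHz to 48 kHz involves, here is a deliberately crude 2:1 decimator: a 3-tap low-pass filter followed by dropping every other sample. It is a sketch under simplifying assumptions, not what libopus or the Xylolabs SDK does; a production resampler would use a much longer filter to control aliasing.

```rust
/// Crude 2:1 decimator: 3-tap low-pass (0.25, 0.5, 0.25), then keep every
/// second sample. Illustration only.
fn decimate_by_2(input: &[f32]) -> Vec<f32> {
    let n = input.len();
    let mut out = Vec::with_capacity(n / 2);
    for k in 0..n / 2 {
        let i = 2 * k;
        // Clamp neighbour indices at the buffer edges.
        let prev = input[i.saturating_sub(1)];
        let next = input[(i + 1).min(n - 1)];
        out.push(0.25 * prev + 0.5 * input[i] + 0.25 * next);
    }
    out
}

/// Highest representable frequency at a given sample rate.
fn nyquist_hz(sample_rate_hz: u32) -> u32 {
    sample_rate_hz / 2
}
```

A constant (DC) input passes through unchanged, while a tone at the input Nyquist frequency (alternating +1, -1) is cancelled by the filter, which is the content that no longer fits once the rate is halved.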
3.3 MP3 (MPEG-1/2 Audio Layer III)¶
- Standard: ISO 11172-3 (MPEG-1) / ISO 13818-3 (MPEG-2)
- Maximum sample rate: 48 kHz (MPEG-1) — no 96 kHz support
- CPU (encoder, e.g., LAME): ~30 MIPS/ch → 120 MIPS for 4ch @48 kHz — tight on most MCUs
- RAM: ~30 KB/ch — significant
- Quality: Acceptable at 128–192 kbps/ch; audible artifacts at ≤ 96 kbps
- Patents: All relevant patents expired by 2017; fully royalty-free
- MCU implementations: Very few production-quality MCU MP3 encoders exist; LAME is not designed for embedded
Verdict: NOT RECOMMENDED — no 96 kHz support, high CPU and RAM requirements, dated codec superseded by XAP and Opus. Encoding on MCU is not practical.
3.4 AAC-LC (Advanced Audio Coding — Low Complexity)¶
- Standard: ISO/IEC 14496-3 (MPEG-4 Audio)
- Sample rates: Up to 96 kHz supported natively
- CPU: ~30 MIPS/ch @48 kHz, ~60 MIPS/ch @96 kHz → 240 MIPS for 4ch @96 kHz. This consumes 80% of the RP2350 (300 MIPS), leaving too little headroom, and far exceeds the STM32F411 and nRF52840
- RAM: ~25 KB/ch → 100 KB for 4ch
- Quality: Very good at 128 kbps; excellent at 192 kbps
- Licensing: ISO/IEC patent pool managed by Via Licensing — royalty required per unit sold
- MCU encoders: FDK-AAC by Fraunhofer is the best open-source encoder, but it is heavy and not optimized for bare-metal MCUs
Verdict: NOT RECOMMENDED for MCU — too CPU-heavy at 96 kHz, licensing costs are non-trivial for hardware products, and no well-tested MCU encoder libraries exist.
3.5 HE-AAC v1 (High Efficiency AAC + Spectral Band Replication)¶
- Standard: ISO/IEC 14496-3
- Enhancement: Spectral Band Replication (SBR) regenerates high frequencies from low-frequency data, enabling high quality at half the bitrate of AAC-LC
- CPU: Higher than AAC-LC due to SBR analysis — ~50 MIPS/ch @48 kHz
- Best operating range: 48–80 kbps/ch (delivers AAC-LC @96 kbps quality at half the bitrate)
- Licensing: ISO/IEC + Dolby patent pool — royalty required
- MCU encoder complexity: SBR encoder analysis is computationally intensive and does not fit in typical MCU SRAM
Verdict: NOT FEASIBLE on MCU — encoder is significantly more complex than AAC-LC. Server-side decode is fine (FFmpeg), but encoding on embedded hardware is not practical.
3.6 HE-AAC v2 (HE-AAC v1 + Parametric Stereo)¶
- Standard: ISO/IEC 14496-3
- Enhancement: Adds Parametric Stereo (PS) on top of HE-AAC v1, encoding stereo as mono + steering parameters
- CPU: Highest among AAC variants — ~60–80 MIPS/ch equivalent
- Best at: 24–48 kbps for stereo content (remarkable compression for speech and music)
- Limitation: Designed for stereo (2-channel) only — not applicable to 4-channel configurations
- Quality at low bitrate: Very good for speech; stereo imaging degrades at low bitrates
Verdict: NOT FEASIBLE — too computationally intensive for MCU encoding, stereo-only limitation excludes 4-channel use case.
3.7 xHE-AAC / USAC (Extended HE-AAC / Unified Speech and Audio Coding)¶
- Standard: ISO/IEC 23003-3 (MPEG-D)
- Technology: Combines ACELP (speech coding) with MDCT (music coding), with smooth switching and advanced noise shaping
- CPU: Very high — designed for ARM Cortex-A class processors, not microcontrollers
- Quality: State of the art; near-transparent at 32 kbps stereo
- Licensing: ISO/IEC patent pool — royalty required
Verdict: NOT FEASIBLE on any MCU — encoder requires a full-featured CPU with OS-level memory management. Not relevant for this use case.
3.8 SBC (Subband Coding)¶
- Standard: Bluetooth A2DP specification
- Algorithm: Simple subband filter bank with ADPCM-like quantization
- CPU: Very low — ~5 MIPS/ch, trivial even on STM32F103
- Compression: Only 4–5:1, less efficient than XAP, so it needs more bandwidth for comparable quality
- RAM: ~4 KB/ch — excellent
- Quality: Adequate for Bluetooth speakers; not suitable for high-fidelity archival or spectrum analysis
- Licensing: Royalty-free under Bluetooth SIG terms
Verdict: FEASIBLE but inferior to XAP in compression efficiency and quality; its only advantages are simplicity and a smaller CPU and RAM footprint. SBC was designed as a lowest-common-denominator codec before XAP existed. Use XAP when available; use SBC only as a final fallback if XAP cannot be integrated.
3.9 aptX / aptX HD¶
- Standard: Qualcomm proprietary
- Algorithm: ADPCM with enhanced subband processing
- aptX CPU: ~8 MIPS/ch @48 kHz → 32 MIPS for 4ch (roughly double at 96 kHz); very feasible
- aptX HD CPU: ~12 MIPS/ch @48 kHz → 48 MIPS for 4ch (roughly double at 96 kHz); feasible on most MCUs
- Compression: 4:1 (same ratio as ADPCM but significantly better perceptual quality)
- aptX quality: Very good — comparable to 16-bit CD quality
- aptX HD quality: Near-lossless for human perception; 24-bit input support
- RAM: ~8–12 KB/ch — manageable
- Licensing: Qualcomm license agreement required — cost-prohibitive for custom hardware without Qualcomm silicon, and SDK access is restricted
Verdict: GOOD quality, impractical licensing. The Qualcomm licensing model ties aptX to their chipsets commercially. Not suitable for open custom hardware development.
3.10 IMA-ADPCM (Interactive Multimedia Association Adaptive Differential PCM)¶
- Standard: IMA/DVI ADPCM specification
- Algorithm: Encodes sample-to-sample deltas using a 4-bit quantizer with adaptive step size
- CPU: Trivial (< 1 MIPS/ch @48 kHz, < 2 MIPS/ch @96 kHz) → < 8 MIPS for 4ch @96 kHz
- Compression: Fixed 4:1 (16-bit input → 4-bit output)
- RAM: < 1 KB/ch — fits on any MCU
- Quality: Fair — audible quantization noise, high-frequency artifacts, no psychoacoustic modeling
- 4ch @96 kHz bandwidth: 768 KB/s ÷ 4 = 192 KB/s — still over the LTE-M1 budget
- 4ch @48 kHz bandwidth: 384 KB/s ÷ 4 = 96 KB/s — still over budget
- Licensing: Public domain
Verdict: BASELINE FALLBACK. Always feasible computationally and runs on any MCU. However, 4:1 compression is insufficient for the 96 kHz / 4-channel use case over LTE-M1: even 4ch @48 kHz needs 96 KB/s. Only viable if both channel count and sample rate are reduced (e.g., 2ch @24 kHz for voice-only monitoring, 24 KB/s).
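The algorithm described above (4-bit codes with an adaptive step size) can be sketched as follows. This is a minimal single-channel illustration using the standard IMA step and index tables; nibble packing, block headers, and channel interleaving are omitted, and the type and method names are hypothetical rather than taken from the Xylolabs SDK.

```rust
const STEP_TABLE: [i32; 89] = [
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31, 34, 37, 41, 45,
    50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143, 157, 173, 190, 209, 230,
    253, 279, 307, 337, 371, 408, 449, 494, 544, 598, 658, 724, 796, 876, 963,
    1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066, 2272, 2499, 2749, 3024, 3327,
    3660, 4026, 4428, 4871, 5358, 5894, 6484, 7132, 7845, 8630, 9493, 10442,
    11487, 12635, 13899, 15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794,
    32767,
];
const INDEX_TABLE: [i32; 8] = [-1, -1, -1, -1, 2, 4, 6, 8];

#[derive(Clone, Copy, Default)]
struct AdpcmState {
    predictor: i32, // last reconstructed sample
    index: i32,     // current position in STEP_TABLE
}

impl AdpcmState {
    /// Decoder step: apply a 4-bit code and return the reconstructed sample.
    fn apply(&mut self, code: u8) -> i16 {
        let step = STEP_TABLE[self.index as usize];
        let mut delta = step >> 3;
        if (code & 4) != 0 { delta += step; }
        if (code & 2) != 0 { delta += step >> 1; }
        if (code & 1) != 0 { delta += step >> 2; }
        if (code & 8) != 0 { self.predictor -= delta; } else { self.predictor += delta; }
        self.predictor = self.predictor.clamp(-32768, 32767);
        // Large codes grow the step size, small codes shrink it.
        self.index = (self.index + INDEX_TABLE[(code & 7) as usize]).clamp(0, 88);
        self.predictor as i16
    }

    /// Encoder step: quantize one 16-bit sample to a 4-bit code.
    fn encode(&mut self, sample: i16) -> u8 {
        let step = STEP_TABLE[self.index as usize];
        let mut diff = sample as i32 - self.predictor;
        let mut code = 0u8;
        if diff < 0 { code = 8; diff = -diff; }
        if diff >= step { code |= 4; diff -= step; }
        if diff >= (step >> 1) { code |= 2; diff -= step >> 1; }
        if diff >= (step >> 2) { code |= 1; }
        self.apply(code); // keep encoder state in lockstep with the decoder
        code
    }
}
```

Each 16-bit sample becomes one 4-bit code, which is where the fixed 4:1 ratio comes from. The encoder reuses the decoder step, so both sides track an identical predictor without any side information.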
3.11 Speex¶
- Standard: Xiph.Org open specification (RFC 5574)
- Algorithm: CELP (Code-Excited Linear Prediction)
- Maximum sample rate: 32 kHz (ultra-wideband) — no 48 kHz or 96 kHz support
- Bitrate: 2.15–44.2 kbps (variable, mode-dependent)
- CPU: ~3–8 MIPS/ch (narrowband to ultra-wideband)
- RAM: ~4 KB/ch
- Quality: Good for speech; includes noise suppression, echo cancellation, and VAD (Voice Activity Detection) — features absent from G.722
- Licensing: BSD — fully royalty-free, no patent encumbrance
- Note: The Speex project officially recommends Opus as its successor for new deployments. However, Speex remains a valid choice for MCU voice-only use cases where Opus is too heavy (>30 MIPS/ch) and G.722's 16 kHz ceiling is limiting.
Verdict: VOICE-ONLY ALTERNATIVE — viable for speech monitoring at 8–32 kHz with better compression than G.722 (8–11:1 vs 4:1) and built-in noise suppression. Superseded by Opus for general audio; XAP remains the recommendation for wideband industrial audio.
3.12 G.722¶
- Standard: ITU-T G.722
- Algorithm: Subband ADPCM, two-subband (low and high)
- Maximum sample rate: 16 kHz (wideband voice) — no 48 kHz or 96 kHz support
- CPU: ~5 MIPS/ch
- Quality: Good for voice; completely unsuitable for full-band audio
- Licensing: ITU-T — royalty-free
Verdict: VOICE-ONLY. Useful only if the application is wideband speech monitoring (16 kHz sampling). Not applicable for 96 kHz industrial audio.
3.13 FLAC (Free Lossless Audio Codec)¶
- Standard: Xiph.Org open specification
- Algorithm: Linear predictive coding + Rice entropy coding
- CPU: ~60–100 MIPS/ch @96 kHz → ~400 MIPS for 4ch — far exceeds all target MCUs
- Compression: ~2:1 (lossless, content-dependent)
- RAM: ~50 KB/ch — significant
- Bandwidth for 4ch @96 kHz: ~384 KB/s — still 10× over LTE-M1 budget
- Quality: Perfect — mathematically lossless
Verdict: NOT FEASIBLE — both the CPU cost and the resulting bandwidth exceed the constraints. FLAC is excellent for server-side storage transcoding but is not an encoding codec for this embedded use case.
3.14 ALAC (Apple Lossless Audio Codec)¶
- Standard: Apple open-sourced, Apache 2.0 license
- Algorithm: Similar to FLAC; integer polynomial predictor + Rice coding
- CPU: ~50–80 MIPS/ch @96 kHz → ~320 MIPS for 4ch — exceeds most MCUs
- RAM: ~40 KB/ch
- Quality: Perfect (lossless)
- Licensing: Apache 2.0 (royalty-free)
Verdict: NOT FEASIBLE — same constraints as FLAC. No benefit over FLAC for this use case and less tooling support on Linux servers.
3.15 3GPP EVS (Enhanced Voice Services)¶
- Standard: 3GPP TS 26.441 (Release 12+)
- Sample rates: 8, 16, 32, 48 kHz — no native 96 kHz support
- Bitrate: 5.9–128 kbps (mode-dependent)
- CPU: ~15–30 MIPS/ch — moderate complexity
- RAM: ~15 KB/ch
- Quality: Excellent for voice — state-of-the-art speech codec with super-wideband support
- Licensing: 3GPP patent pool — per-unit royalty required, complex licensing terms
- MCU feasibility: Marginal on RP2350 for 4ch @48 kHz (~120 MIPS), but no 96 kHz support eliminates it from consideration
Verdict: NOT RECOMMENDED — excellent voice codec but limited to 48 kHz maximum, requires 3GPP licensing, and offers no advantage over XAP for industrial wideband audio monitoring at 96 kHz.
3.16 AAC-ELD (Advanced Audio Coding — Enhanced Low Delay)¶
- Standard: ISO/IEC 14496-3 (MPEG-4 Audio, low-delay profile)
- Algorithm: AAC-LC core with low-delay MDCT window and optional SBR
- CPU: ~20–30 MIPS/ch @96 kHz — lower latency than AAC-LC but similar computational cost
- RAM: ~20 KB/ch
- Latency: ~15–30 ms (significantly lower than standard AAC-LC)
- Quality: Very good — designed for real-time communication (FaceTime, VoLTE)
- Licensing: ISO/IEC patent pool — royalty required
Verdict: NOT FEASIBLE for MCU @96 kHz — 4ch @96 kHz requires ~80–120 MIPS, tight on RP2350 and exceeding most single-core MCUs. Licensing cost and lack of MCU-optimized encoders make it impractical. XAP achieves similar low-latency goals with lower complexity.
3.17 aptX Lossless¶
- Standard: Qualcomm proprietary (Snapdragon Sound)
- Algorithm: Adaptive lossless/lossy hybrid — lossless at ~1 Mbps, lossy fallback at ~300 kbps
- Bandwidth: ~1 Mbps for lossless mode — far exceeds LTE-M1 budget
- CPU: High — designed for Qualcomm application processors, not MCUs
- Licensing: Qualcomm proprietary — restricted to Qualcomm silicon ecosystem
Verdict: NOT FEASIBLE. The lossless bitrate (~1 Mbps) is roughly 3× the 320 kbps LTE-M1 audio budget, the codec requires Qualcomm silicon, and no MCU implementation exists. For lossless needs, record to SD card and batch upload.
4. Bandwidth analysis table¶
| Codec | Config | Bitrate | KB/s | Fits LTE-M1? | Quality |
|---|---|---|---|---|---|
| Raw PCM 16-bit | 4ch @96 kHz | 6144 kbps | 768 KB/s | No | Perfect |
| Raw PCM 16-bit | 4ch @48 kHz | 3072 kbps | 384 KB/s | No | Perfect |
| FLAC | 4ch @96 kHz | ~3072 kbps | ~384 KB/s | No | Lossless |
| FLAC | 4ch @48 kHz | ~1536 kbps | ~192 KB/s | No | Lossless |
| IMA-ADPCM | 4ch @96 kHz | 1536 kbps | 192 KB/s | No | Fair |
| IMA-ADPCM | 4ch @48 kHz | 768 kbps | 96 KB/s | No (2× over) | Fair |
| aptX | 4ch @96 kHz | 1536 kbps | 192 KB/s | No | Very Good |
| aptX | 4ch @48 kHz | 768 kbps | 96 KB/s | No | Very Good |
| SBC | 4ch @48 kHz | ~600 kbps | ~77 KB/s | No | Fair |
| MP3 @128k | 4ch @48 kHz | 512 kbps | 64 KB/s | No (1.6× over) | Good |
| AAC-LC @128k | 4ch @96 kHz | 512 kbps | 64 KB/s | No (1.6× over) | Very Good |
| Opus SILK @64k | 4ch @48 kHz | 256 kbps | 32 KB/s | YES | Excellent |
| XAP @80k | 4ch @96 kHz | 320 kbps | 40 KB/s | YES | Excellent |
| XAP @64k | 4ch @96 kHz | 256 kbps | 32 KB/s | YES | Very Good |
| HE-AAC @48k | 4ch @48 kHz | 192 kbps | 24 KB/s | Yes | Very Good |
| Speex @20k | 4ch @32 kHz | 80 kbps | 10 KB/s | YES | Good (voice) |
Key finding: XAP at 80 kbps/ch is the only codec that simultaneously achieves 96 kHz support, fits LTE-M1 bandwidth, and can run on target MCUs.
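The "Fits LTE-M1?" column can be re-derived mechanically from per-channel bitrate and channel count. The following sketch assumes the ~40 KB/s practical budget from Section 1 and decimal kilobytes; the struct, the function names, and the selection of rows are illustrative.

```rust
const BUDGET_KB_PER_S: f64 = 40.0;

struct Config {
    name: &'static str,
    channels: u32,
    kbps_per_channel: f64,
}

/// Aggregate stream rate in KB/s (1 KB = 1000 bytes).
fn stream_kb_per_s(c: &Config) -> f64 {
    c.channels as f64 * c.kbps_per_channel / 8.0
}

/// Names of the configurations whose aggregate rate fits the budget.
fn fitting(configs: &[Config], budget_kb_per_s: f64) -> Vec<&'static str> {
    configs
        .iter()
        .filter(|c| stream_kb_per_s(c) <= budget_kb_per_s)
        .map(|c| c.name)
        .collect()
}

/// A few rows from the table above (IMA-ADPCM: 48,000 samples/s x 4 bits = 192 kbps/ch).
fn table_rows() -> Vec<Config> {
    vec![
        Config { name: "IMA-ADPCM 4ch @48 kHz", channels: 4, kbps_per_channel: 192.0 },
        Config { name: "MP3 @128k", channels: 4, kbps_per_channel: 128.0 },
        Config { name: "Opus SILK @64k", channels: 4, kbps_per_channel: 64.0 },
        Config { name: "XAP @80k", channels: 4, kbps_per_channel: 80.0 },
        Config { name: "XAP @64k", channels: 4, kbps_per_channel: 64.0 },
    ]
}
```

Of these rows only the 64 and 80 kbps/ch configurations pass; whether a passing codec also supports 96 kHz and runs on the target MCU is a separate check.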
5. MCU platform feasibility matrix¶
CPU budgets assume worst-case single-core utilization. Dual-core MCUs (RP2350, ESP32-S3) can split encoder work across cores.
| Codec / Config | RP2350 (300 MIPS, 520 KB SRAM) | STM32F103 (72 MIPS, 20 KB) | STM32F411 (100 MIPS, 128 KB) | ESP32-S3 (480 MIPS, 512 KB+PSRAM) | nRF52840 (64 MIPS, 256 KB) |
|---|---|---|---|---|---|
| XAP 4ch @96 kHz (80 MIPS, 32 KB) | YES | NO (RAM) | Marginal | YES | NO |
| XAP 4ch @48 kHz (40 MIPS, 32 KB) | YES | NO (RAM) | YES | YES | Marginal |
| XAP 2ch @48 kHz (20 MIPS, 16 KB) | YES | YES | YES | YES | YES |
| Opus SILK 4ch @48 kHz (80 MIPS, 80 KB) | YES | NO | Marginal | YES | NO |
| SBC 4ch @48 kHz (20 MIPS, 16 KB) | YES | YES | YES | YES | YES |
| IMA-ADPCM 4ch @96 kHz (< 8 MIPS, 4 KB) | YES | YES | YES | YES | YES |
| FLAC 4ch @96 kHz (~400 MIPS, 200 KB) | NO | NO | NO | Marginal | NO |
| AAC-LC 4ch @96 kHz (240 MIPS, 100 KB) | NO | NO | NO | Marginal | NO |
Notes¶
- RP2350: Dual Cortex-M33 cores at 150 MHz each; 300 MIPS total. Each core includes the ARMv8-M DSP extension: single-cycle 32x32→64 MAC (SMLAL), dual 16x16 MAC (SMLAD/SMUAD), saturating arithmetic (QADD/QSUB/SSAT), and bit-field extract (SBFX/UBFX). These DSP instructions accelerate XAP's MDCT, spectral shaping, and FIR downsampling by 25-40%. With DSP-optimized XAP, 4ch @96 kHz drops from ~80 MIPS to ~56 MIPS, leaving ample headroom on a single core.
- STM32F103: Cortex-M3 at 72 MHz, only 20 KB SRAM. No DSP extension, no FPU. Even XAP's 32 KB per 4ch encoder state exceeds available RAM. Maximum: 2ch XAP @48 kHz (16 KB state), or IMA-ADPCM (trivial CPU, < 1 KB state).
- STM32F411: 100 MHz Cortex-M4F with single-precision FPU and full DSP extension. The FPU accelerates XAP floating-point paths, and the DSP instructions (the same set as on the M33) speed up fixed-point paths. With DSP-optimized XAP, 4ch @96 kHz uses ~56 MIPS, which is feasible at 56% CPU utilization and a significant improvement over the 80% baseline estimate. The CMSIS-DSP library provides drop-in optimized FFT and FIR routines.
- ESP32-S3: Dual Xtensa LX7 at 240 MHz (480 MIPS total) + 8 MB PSRAM. Features 128-bit SIMD via Processor Instruction Extensions (PIE), capable of processing 4x f32 or 8x i16 in parallel. PIE provides 2-4x speedup for FFT/MDCT batch operations. Hardware AES/SHA accelerators offload TLS overhead. Most capable platform: can run Opus CELT for 2ch, or XAP 4ch @96 kHz at under 20% CPU utilization with SIMD.
- nRF52840: Cortex-M4F at 64 MHz with FPU and DSP extension. Adequate RAM (256 KB) but limited CPU. With DSP optimization, XAP 2ch @48 kHz uses ~14 MIPS (22% utilization), which is comfortable. XAP 4ch @48 kHz (~28 MIPS, 44%) becomes feasible with DSP acceleration, though tight.
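A first-pass version of the matrix can be computed from the MIPS and RAM figures in the table headers. The sketch below is an assumption-based approximation: the 60% CPU threshold for "Marginal" is hypothetical, chosen to reproduce the XAP rows, and the matrix itself also reflects judgment calls (for example, headroom for networking) that a simple rule does not capture.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Yes,
    Marginal,
    No,
}

struct Mcu {
    mips: u32,
    sram_kb: u32,
}

struct Load {
    mips: u32,
    ram_kb: u32,
}

/// Classify a codec load against an MCU. The 60% CPU threshold is an
/// illustrative assumption, not a figure from the SDK.
fn assess(mcu: &Mcu, load: &Load) -> Verdict {
    if load.ram_kb > mcu.sram_kb || load.mips > mcu.mips {
        return Verdict::No;
    }
    let cpu_pct = 100 * load.mips / mcu.mips;
    if cpu_pct > 60 {
        Verdict::Marginal
    } else {
        Verdict::Yes
    }
}
```

For XAP 4ch @96 kHz (80 MIPS, 32 KB) this yields Yes on the RP2350, Marginal on the STM32F411, and No on the STM32F103 (RAM) and nRF52840 (CPU), matching the first row of the matrix.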
CMSIS-DSP integration¶
ARM's CMSIS-DSP library provides hardware-optimized DSP routines for all Cortex-M cores. Key functions used by the XAP encoder:
| CMSIS-DSP Function | Used For | Speedup vs C |
|---|---|---|
| arm_rfft_fast_f32 | MDCT/spectral analysis | 3-5x |
| arm_fir_f32 / arm_fir_q15 | FIR downsampling filter | 2-4x |
| arm_dot_prod_f32 | Inner products in quantization | 2-3x |
| arm_scale_f32 | Gain normalization | 2x |
| arm_fill_f32 / arm_copy_f32 | Buffer management | 1.5-2x |
The Xylolabs SDK links CMSIS-DSP on Cortex-M targets automatically when XYLOLABS_USE_CMSIS_DSP=1 is set in the build configuration.
6. DSP and hardware acceleration¶
Each target MCU offers hardware features that reduce codec CPU cost below the baseline MIPS estimates. These are not theoretical gains -- they reflect real instruction-level speedups from the silicon itself.
Cortex-M33 DSP extension (RP2350, nRF9160)¶
The Cortex-M33 (ARMv8-M Mainline) cores on these targets implement the DSP extension, which is an optional feature of the core. It provides:
- Single-cycle 32x32→64 MAC (SMLAL, UMLAL) -- the workhorse for FIR filter and MDCT accumulation
- Dual 16x16→32 MAC (SMLAD, SMUAD) -- processes two 16-bit sample pairs per cycle, doubling throughput for 16-bit audio paths
- Saturating arithmetic (QADD, QSUB, SSAT, USAT) -- eliminates branch-based clipping in ADPCM step-size adaptation and XAP quantization
- Bit-field extract (SBFX, UBFX) -- efficient unpacking for XMBP binary protocol parsing
Impact on codecs:
| Codec | Baseline MIPS/ch @96 kHz | With DSP | Reduction |
|---|---|---|---|
| XAP | ~20 | ~14 | -30% |
| IMA-ADPCM | < 2 | < 1 | -50% |
| SBC | ~10 | ~7 | -30% |
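To make the instruction list concrete, the sketch below shows the arithmetic patterns those instructions target: a Q15 multiply-accumulate loop and saturating narrowing. It is plain portable Rust with illustrative function names; whether the compiler actually emits SMLAD or SSAT for it depends on the target and optimization settings.

```rust
/// Dot product of two Q15 vectors with a wide accumulator. Each product is Q30;
/// this is the multiply-accumulate pattern that MAC instructions speed up.
fn dot_q15(a: &[i16], b: &[i16]) -> i64 {
    a.iter().zip(b).map(|(&x, &y)| x as i64 * y as i64).sum()
}

/// Narrow a Q30 accumulator to Q15, saturating instead of wrapping around.
fn q30_to_q15_sat(acc: i64) -> i16 {
    (acc >> 15).clamp(i16::MIN as i64, i16::MAX as i64) as i16
}

/// Saturating 16-bit add: clips at the i16 limits without a branch in source.
fn add_sat(x: i16, y: i16) -> i16 {
    x.saturating_add(y)
}
```

For example, with 0.5 represented as 16384 in Q15, the dot product of [0.5, 0.5] with itself narrows back to 16384 (0.25 + 0.25), while a sum that exceeds the Q15 range is clipped to 32767 rather than wrapping to a negative value.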
Cortex-M4F FPU + DSP (STM32F411, STM32WB55, nRF52840)¶
Provides the same DSP instructions as the Cortex-M33 extension. Features relevant to codec work:
- Single-precision FPU -- hardware float multiply-accumulate in 1-3 cycles vs 10-20 cycle software emulation
- Hardware integer divide (SDIV, UDIV) -- 2-12 cycles vs 20-100 cycle software divide
- FPU impact: XAP has both fixed-point and floating-point encoder paths. On M4F, the floating-point path can actually be faster than fixed-point because the FPU handles normalization, scaling, and spectral analysis natively
Impact on codecs:
| Codec | Baseline MIPS/ch @96 kHz | With FPU+DSP | Reduction |
|---|---|---|---|
| XAP (float path) | ~20 | ~12 | -40% |
| XAP (fixed path) | ~20 | ~14 | -30% |
| Opus SILK | ~50 | ~35 | -30% |
ESP32-S3 vector extensions (PIE)¶
The Xtensa LX7 cores include Processor Instruction Extensions (PIE):
- 128-bit SIMD -- process 4x f32 or 8x i16 in a single instruction
- Dedicated vector registers -- 16 x 128-bit registers for parallel DSP pipelines
- Hardware AES/SHA -- offloads TLS handshake and encryption from the CPU, freeing cycles for codec work
- PSRAM DMA -- large audio buffers can live in PSRAM with DMA transfer to/from internal SRAM for processing
Impact on codecs:
| Codec | Baseline MIPS/ch @96 kHz | With PIE SIMD | Reduction |
|---|---|---|---|
| XAP | ~20 | ~8 | -60% |
| Opus CELT | ~100 | ~50 | -50% |
| FLAC | ~100 | ~60 | -40% |
The ESP32-S3 with PIE can run codecs that are infeasible on other platforms -- Opus CELT 2ch @48 kHz, or even experimental FLAC 2ch @48 kHz for lossless capture.
Rust SDK DSP feature flags¶
The Rust SDK (crates/xylolabs-sdk/) exposes DSP acceleration via Cargo feature flags:
| Feature | Target | Effect |
|---|---|---|
| cmsis-dsp | Cortex-M33 (RP2350, nRF9160), Cortex-M4F (STM32F411, nRF52840) | Enables CMSIS-DSP optimized MDCT and FIR paths |
| esp32-simd | ESP32-S3 (Xtensa LX7) | Enables PIE SIMD optimized MDCT paths |
```toml
# Cargo.toml -- enable DSP acceleration for RP2350
[dependencies]
xylolabs-sdk = { path = "../../crates/xylolabs-sdk", features = ["xap", "cmsis-dsp"] }
```

```toml
# Cargo.toml -- enable SIMD acceleration for ESP32-S3
[dependencies]
xylolabs-sdk = { path = "../../crates/xylolabs-sdk", features = ["xap", "esp32-simd"] }
```
For the C SDK, set XYLOLABS_USE_CMSIS_DSP=1 or XYLOLABS_USE_ESP32S3_SIMD=1 in the build configuration (auto-detected from compiler flags when not set explicitly).
See Performance Profile for per-target CPU budgets and DSP optimization details.
Practical recommendations¶
- Always enable CMSIS-DSP on Cortex-M targets: Rust SDK features = ["cmsis-dsp"], C SDK XYLOLABS_USE_CMSIS_DSP=1. The optimized FFT and FIR routines are drop-in replacements that require no code changes.
- Use the XAP floating-point path on Cortex-M4F targets (STM32F411). The FPU makes it faster than the fixed-point path despite the overhead of float conversion.
- Use the XAP fixed-point path as the baseline on Cortex-M33 targets (RP2350); the DSP extension provides significant speedup for MAC-heavy operations. The RP2350's Cortex-M33 cores also include a single-precision FPU, so the floating-point path is worth benchmarking there as well.
- Enable PIE intrinsics on ESP32-S3 builds: Rust SDK features = ["esp32-simd"], C SDK XYLOLABS_USE_ESP32S3_SIMD=1. The ESP-IDF compiler auto-vectorizes some loops, but explicit PIE intrinsics in the XAP hot path yield an additional 20-30% gain.
7. Recommendation¶
Primary choice: XAP at 64–80 kbps/ch¶
XAP is the optimal codec for this use case across all evaluation dimensions:
| Criterion | XAP Result |
|---|---|
| Native 96 kHz support | Yes |
| 4ch @96 kHz fits LTE-M1 (@80k) | 40 KB/s — Yes |
| CPU for 4ch @96 kHz | 80 MIPS — fits RP2350, ESP32-S3 |
| RAM for 4ch | 32 KB — fits all target MCUs |
| Licensing | Xylolabs proprietary (patent pending) |
| Latency | 7.5–10 ms frame — low latency |
| SDK integration | XAP codec module in Rust SDK (crates/xylolabs-sdk/) |
Recommended operating point:
- Bitrate: 80 kbps/ch for highest quality; reduce to 64 kbps/ch to save bandwidth at acceptable quality loss
- Frame size: 10 ms (lower CPU than 7.5 ms, acceptable latency)
- Total stream: 4 × 80 kbps = 320 kbps = 40 KB/s
Fallback: IMA-ADPCM¶
When XAP integration is not possible (resource-constrained platforms, STM32F103, nRF52840 with 4ch):
- CPU: trivial (< 1 MIPS/ch)
- RAM: < 1 KB/ch
- Tradeoff: 4:1 compression only — requires reducing channel count or sample rate to fit LTE-M1
- 2ch @24 kHz ADPCM = 192 kbps = 24 KB/s — fits budget
For 48 kHz-only use: Opus SILK¶
When the application permits 48 kHz sample rate and maximum codec quality is required:
- Best-in-class quality for 8–48 kHz audio
- Rich server-side ecosystem (FFmpeg, libopus, web browser native)
- BSD license — royalty-free
- SILK mode: 80 MIPS for 4ch @48 kHz — fits RP2350 and ESP32-S3
8. Server-side decode support matrix¶
| Codec | FFmpeg Support | Decode Library | Storage Format | Notes |
|---|---|---|---|---|
| XAP | Yes (XAP decoder library, FFmpeg 7.1+) | XAP decoder library (C, Apache 2.0) | Transcode to FLAC/WAV | Build FFmpeg with XAP decoder support; supports 7.5/10 ms frames + HR mode (48/96 kHz) |
| Opus | Yes (libopus) | Native FFmpeg | Direct web playback | Best-supported codec on modern platforms |
| MP3 | Yes (libmp3lame) | Native FFmpeg | Direct web playback | Mature, universal support |
| AAC-LC | Yes (libfdk_aac) | Native FFmpeg | Direct web playback | fdk-aac is GPL-incompatible; use a non-free FFmpeg build |
| HE-AAC | Yes (libfdk_aac) | Native FFmpeg | Direct web playback | Same fdk-aac requirement |
| SBC | Yes (built-in) | Native FFmpeg | Needs transcode to AAC/Opus | Lower quality — transcode for storage |
| Speex | Yes (libspeex) | libspeex (C, BSD) | Transcode to Opus/WAV | Build FFmpeg with --enable-libspeex; voice-only, max 32 kHz |
| IMA-ADPCM | Yes (adpcm_ima_wav) | Native FFmpeg | WAV container | Trivial decode, built into all FFmpeg builds |
| FLAC | Yes (built-in) | Native FFmpeg | Direct lossless storage | Standard archival format |
XAP server integration¶
Recommended approach: Use the XAP decoder library for server-side XAP decoding, through the Rust bindings in the xap crate:

```rust
// Rust (recommended)
// Server-side XAP decode using the xap crate (bindings to the XAP decoder library).
// frame_us, sample_rate, and xap_frame are supplied by the caller.
use xap::{Decoder, PcmFormat};

let mut decoder = Decoder::new(frame_us, sample_rate);
let pcm_output = decoder.decode::<i16>(xap_frame).unwrap();
```

The legacy C API is an alternative that links the XAP decoder library directly for lower-level programmatic control.
After decoding, write to FLAC or WAV for archival, or transcode to Opus for web delivery.
9. Implementation roadmap¶
Current state¶
- Rust SDK (crates/xylolabs-sdk/) is the recommended implementation with XAP codec module
- Legacy C SDK (sdk/c/common/src/xap_encoder.c) available but scheduled for deprecation
- IMA-ADPCM encoder is available as a fallback
- E2E burn-in tests validate XAP encoding stability (4 scenarios, 10-device concurrency, QEMU ARM emulation)
- Server-side XAP decoder integration pending
Planned steps¶
| Step | Action | Priority |
|---|---|---|
| 1 | Integrate XAP decoder library into Axum ingest server | High |
| 2 | Define XMBP chunk type for XAP frames (codec ID, bitrate, frame_us fields) | High |
| 3 | Implement XAP → FLAC transcode pipeline for archival storage | High |
| 4 | Add Opus encoder to ESP32-S3 SDK target (sufficient CPU) | Medium |
| 5 | Deprecate legacy C SDK (sdk/c/) in favor of Rust SDK | Medium |
Codec ID assignment (XMBP protocol)¶
| Codec ID | Codec | Notes |
|---|---|---|
| 0x01 | Raw PCM S16LE | Debug/development only |
| 0x02 | IMA-ADPCM | Fallback baseline |
| 0x03 | XAP | Primary production codec |
| 0x04 | Opus | ESP32-S3 / high-CPU platforms |
| 0x05 | Speex | Voice-only (max 32 kHz) |
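On the ingest side, the codec ID byte can be mapped to a typed value before dispatching to a decoder. The enum below is a hypothetical sketch built only from the ID assignments in the table; the type name, variant names, and error type are assumptions, and the actual XMBP chunk layout is not shown here.

```rust
use std::convert::TryFrom;

/// Codec identifiers as assigned in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum CodecId {
    RawPcmS16Le = 0x01,
    ImaAdpcm = 0x02,
    Xap = 0x03,
    Opus = 0x04,
    Speex = 0x05,
}

impl TryFrom<u8> for CodecId {
    /// The unrecognized byte is returned so the caller can log or reject it.
    type Error = u8;

    fn try_from(byte: u8) -> Result<Self, Self::Error> {
        match byte {
            0x01 => Ok(CodecId::RawPcmS16Le),
            0x02 => Ok(CodecId::ImaAdpcm),
            0x03 => Ok(CodecId::Xap),
            0x04 => Ok(CodecId::Opus),
            0x05 => Ok(CodecId::Speex),
            other => Err(other),
        }
    }
}
```

Rejecting unknown IDs explicitly, rather than defaulting to a codec, keeps a malformed or newer-protocol stream from being fed to the wrong decoder.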
References¶
- RFC 6716 (Opus): https://www.rfc-editor.org/rfc/rfc6716
- IMA ADPCM: IMA Digital Audio Focus and Technical Working Group, 1992
- LTE-M1 (LTE Cat-M1) specifications: 3GPP TS 36.300, Release 13