A comprehensive reference for deploying speech-to-text locally — covering Whisper variants, NVIDIA NeMo models, Moonshine, and edge STT options with DGX Spark compatibility notes.
Overview
The STT (speech-to-text) stage of a voice pipeline converts spoken audio into text for the LLM to process. While TTS quality gets more attention, STT latency and accuracy directly gate the entire pipeline — you can't respond to what you haven't transcribed.
The 2026 offline STT landscape is dominated by three families:
- Whisper ecosystem (OpenAI): The default starting point. 9 model sizes, 100 languages, MIT license, massive community tooling
- NVIDIA NeMo models (Parakeet, Canary, Nemotron Speech): State-of-the-art accuracy and speed, open-weight, native DGX Spark support
- Edge-optimized models (Moonshine, SenseVoice, Vosk): Purpose-built for low-power and embedded deployment
NVIDIA Parakeet-TDT-0.6B-v2 processes audio at 3,380× real-time — transcribing 60 minutes of audio per second at batch size 128. Moonshine v2 Medium Streaming achieves lower WER than Whisper Large-v3 at 43× the speed. Mozilla DeepSpeech was officially discontinued in June 2025.
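Real-time factor (RTFx) figures like these translate directly into throughput. A quick sanity check of the Parakeet headline number (plain arithmetic, nothing model-specific):

```python
# Real-time factor (RTFx): seconds of audio transcribed per second of compute.
rtfx = 3380

# Time to transcribe one hour (3,600 s) of audio at this rate:
compute_time = 3600 / rtfx
print(f"{compute_time:.2f} s per hour of audio")  # 1.07 s per hour of audio

# Equivalently, audio-minutes transcribed per compute-second:
minutes_per_second = rtfx / 60
print(f"{minutes_per_second:.1f} audio-minutes per second")  # 56.3 audio-minutes per second
```

That works out to roughly an hour of audio per second of compute, which is where the "60 minutes per second" headline comes from.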
The Whisper Family
OpenAI's Whisper (MIT license) spans nine model sizes from 39M to 1.55B parameters, supporting 100 languages.
Model Sizes
| Model | Params | English WER (LS-clean) | Relative Speed | Est. VRAM | Languages |
|---|---|---|---|---|---|
| tiny / tiny.en | 39M | ~7.6% / ~5.6% | ~10× | ~1 GB | 100 / EN only |
| base / base.en | 74M | ~5.0% / ~4.3% | ~7× | ~1 GB | 100 / EN only |
| small / small.en | 244M | ~3.4% / ~3.0% | ~4× | ~2 GB | 100 / EN only |
| medium / medium.en | 769M | ~2.9% / ~2.8% | ~2× | ~5 GB | 100 / EN only |
| large-v2 | 1,550M | ~2.7% | 1× | ~10 GB | 100 |
| large-v3 | 1,550M | ~2.4% | 1× | ~10 GB | 100 |
| large-v3-turbo | 809M | ~2.5% | ~8× | ~6 GB | 100 |
large-v3 (Nov 2023) added 128 Mel filter banks (vs 80 in v2), improving accuracy 10–20%. large-v3-turbo (Oct 2024) shrunk the decoder from 32 to 4 layers, delivering 8× speedup at near-identical accuracy. MLCommons selected Whisper Large-v3 as the MLPerf Inference ASR benchmark in September 2025.
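The WER column above (and throughout this guide) is word error rate: word-level edit distance — substitutions, insertions, and deletions — divided by the number of reference words. A minimal sketch of the metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Production evaluations normalize text first (casing, punctuation, number formatting), which is why reported WERs for the same model can differ between benchmarks.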
Distil-Whisper
HuggingFace's distilled Whisper variants compress the decoder to 2 layers:
| Variant | Speed vs Teacher | Size Reduction | English WER Impact | License |
|---|---|---|---|---|
| distil-large-v2 | 6.3× faster | 51% smaller | Within 1% WER | MIT |
| distil-large-v3 | 6× faster | 49% smaller | Within 1% WER | MIT |
| distil-large-v3.5 | 1.5× faster than turbo | ~50% smaller | Within 1% WER | MIT |
distil-large-v3.5 (trained on 98K hours) is the current best distilled option. All distil variants are English-only. For multilingual, use large-v3-turbo directly.
Whisper Inference Engines
whisper.cpp
The CPU-first C/C++ Whisper implementation by ggml-org. MIT license, v1.8.3 (January 2026).
Key features:
- GGML quantization (Q4_0 through Q8_0) — up to 65% disk reduction
- Vulkan iGPU support (12× speedup on AMD/Intel integrated graphics)
- Native aarch64 build — confirmed working on DGX Spark Grace CPU
- Streaming via whisper-stream (0.5 s sampling intervals)
- No Python dependency
Use CUDA architecture 121 for the Blackwell GB10 GPU on DGX Spark.
```shell
git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121
cmake --build build -j4
./build/bin/whisper-cli -m models/ggml-large-v3-turbo-q8_0.bin -f audio.wav
```
faster-whisper
SYSTRAN's CTranslate2-based implementation. MIT license. 4× faster than original Whisper with identical accuracy, plus int8 quantization halving VRAM usage.
Benchmarks (RTX 3070 Ti, large-v3):
- Original Whisper: 2m23s
- faster-whisper FP16: 52s
- faster-whisper int8: ~45s
- distil-large-v3 int8: 22.5s
DGX Spark: Confirmed working via NVIDIA's official Arm Learning Paths tutorial (CPU int8 mode). Community Docker images available for CUDA 13 on Blackwell.
```shell
pip install faster-whisper
```

```python
# CPU inference (confirmed on DGX Spark):
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
NVIDIA NeMo Speech Models
All models below are open-weight, most under CC-BY-4.0 (exceptions noted below), and available on HuggingFace and NGC.
Parakeet-TDT-0.6B
| Version | WER (LS-clean) | WER (LS-other) | Speed (RTFx) | Languages | License |
|---|---|---|---|---|---|
| v2 | 1.6% | 3.3% | 3,380× (batch 128) | EN only | CC-BY-4.0 |
| v3 | ~1.5% | ~3.1% | Similar | 25 EU languages | CC-BY-4.0 |
| 1.1B (English) | ~1.4% | ~2.9% | ~2,000× | EN only | CC-BY-4.0 |
Parakeet-TDT uses a FastConformer encoder with Token-and-Duration Transducer decoder. The v3 variant adds automatic language detection across 25 European languages and supports up to 3 hours of audio via local attention.
Canary Flash
| Variant | Params | WER (LS-clean) | WER (LS-other) | Translation | Languages |
|---|---|---|---|---|---|
| Canary-1B-Flash | 883M | 1.48% | 2.87% | ✅ EN↔DE/FR/ES | 4 |
| Canary-180M-Flash | 180M | 1.87% | 3.83% | ✅ Same | 4 |
| Canary-1B-v2 | ~1B | ~1.5% | ~2.9% | ✅ | 25 EU languages |
Canary models combine ASR with speech translation. The 180M Flash variant is small enough for edge deployment. Note: the original Canary-1B used CC-BY-NC-4.0, but all Flash and v2 variants use permissive CC-BY-4.0.
Nemotron Speech Streaming 0.6B
Released January 2026 under NVIDIA Open Model License. Purpose-built for low-latency voice agent STT with cache-aware FastConformer-RNNT architecture.
Key specs:
- Each audio frame encoded exactly once with cached context
- Configurable chunk sizes: 80ms to 1,120ms
- Median time-to-final-transcription: ~24ms
- Handles 560 concurrent streams on H100
- Confirmed on DGX Spark (demonstrated in pipecat-ai project)
Nemotron Speech Streaming 0.6B is the recommended STT model for real-time voice pipelines on NVIDIA hardware.
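The configurable chunk size is the latency knob in this design: smaller chunks mean faster partial transcripts but more encoder invocations. The trade-off is easy to see with a plain sample-count calculation (16 kHz mono assumed for illustration; this is not the model's actual API):

```python
SAMPLE_RATE = 16_000  # Hz, typical ASR input

def chunk_samples(chunk_ms: int) -> int:
    """Number of audio samples in one streaming chunk."""
    return SAMPLE_RATE * chunk_ms // 1000

def chunk_stream(audio, chunk_ms=80):
    """Yield fixed-size chunks; in a cache-aware model each chunk is
    encoded exactly once, reusing cached left context from prior chunks."""
    n = chunk_samples(chunk_ms)
    for start in range(0, len(audio), n):
        yield audio[start:start + n]

# 80 ms chunks -> 1,280 samples each; 1 s of audio -> 13 chunks (last one partial)
one_second = [0.0] * SAMPLE_RATE
chunks = list(chunk_stream(one_second, chunk_ms=80))
print(chunk_samples(80), len(chunks))  # 1280 13
```

At the 1,120 ms maximum chunk size the same second of audio arrives in a single chunk, trading responsiveness for throughput.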
Edge and Specialized STT
SenseVoice Small (Alibaba FunAudioLLM)
234M parameters. Unique for combining ASR + speech emotion recognition + audio event detection (applause, laughter, coughing) in a single model.
- Languages: Chinese, English, Cantonese, Japanese, Korean
- Speed: 10 seconds of audio in 70ms on A800 (5× faster than Whisper-Small)
- Chinese/Cantonese accuracy: over 50% lower error rate than Whisper
- ONNX export supported, integrated into sherpa-onnx
- License: Apache 2.0 (model), MIT (code)
Moonshine (Moonshine AI)
The standout edge STT model. MIT license.
| Variant | Params | Disk Size | WER vs Whisper | Latency | Streaming |
|---|---|---|---|---|---|
| tiny | 27M | 26 MB | Matches Whisper Tiny at 5× less compute | <100ms | ❌ |
| base | 61M | 60 MB | Matches Whisper Small | <150ms | ❌ |
| v2 Medium Streaming | ~200M | ~200 MB | Lower WER than Whisper Large-v3 | 258ms | ✅ |
| Language-specialized tiny | 27M | 26 MB | 48% lower error than Whisper Tiny | <100ms | ❌ |
Moonshine v2 (February 2026) introduced sliding-window self-attention for bounded-latency streaming. All models ship in ONNX format, run on Raspberry Pi and mobile, and can operate in as little as 8 MB RAM.
Language-specialized tiny models available for: Arabic, Chinese, Japanese, Korean, Ukrainian, Vietnamese.
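Sliding-window self-attention is what bounds Moonshine v2's streaming latency: each frame attends only to a fixed number of recent frames, so per-step compute stays constant no matter how long the audio runs. A toy mask construction (pure Python; the window size here is illustrative, not Moonshine's actual configuration):

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True if frame i may attend to frame j.

    Causal + sliding window: each frame sees itself and the previous
    window - 1 frames, never the full history.
    """
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
# Frame 5 attends only to frames 3, 4, 5 — not all of frames 0–5:
print([j for j in range(6) if mask[5][j]])  # [3, 4, 5]
```

With full (unwindowed) self-attention, attention cost per new frame grows with the audio length, which is why vanilla Whisper encoders batch 30-second windows instead of streaming.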
Vosk (Alpha Cephei)
Apache 2.0 license. Ultra-lightweight with 50 MB portable models across 20+ languages and true zero-latency streaming. Accuracy (~10–15% WER on general English) is far behind Whisper-class models but excellent for command-and-control applications.
sherpa-onnx STT
Not a model but a cross-platform runtime (Apache 2.0, v1.12.29, March 2026) supporting 12+ STT architectures, including Zipformer, Paraformer, Whisper, Moonshine, SenseVoice, Parakeet CTC, Canary Flash, and FunASR-Nano. Runs on x86, ARM, RISC-V, Rockchip NPU, Ascend NPU, and Jetson, with bindings for 12 programming languages.
Mozilla DeepSpeech — Discontinued
Officially discontinued June 2025, GitHub repository archived. Last release: v0.9.3 (December 2020), 7.06% WER on LibriSpeech clean. Migrate to Whisper, Moonshine, or Vosk.
STT Model Comparison Table
| Model | Params | WER (LS-clean) | Streaming | Languages | ONNX/GGUF | License | Best For |
|---|---|---|---|---|---|---|---|
| Whisper large-v3-turbo | 809M | ~2.5% | ❌ (batch) | 100 | GGML ✅ | MIT | General purpose |
| distil-large-v3.5 | ~750M | ~2.6% | ❌ | EN | GGML ✅ | MIT | Fast English |
| Parakeet-TDT-0.6B-v2 | 600M | 1.6% | ❌ | EN | ONNX ✅ | CC-BY-4.0 | Accuracy + speed |
| Parakeet-TDT-0.6B-v3 | 600M | ~1.5% | ❌ | 25 EU | ONNX ✅ | CC-BY-4.0 | Multilingual accuracy |
| Canary-1B-Flash | 883M | 1.48% | ❌ | 4 + translation | ONNX ✅ | CC-BY-4.0 | Best accuracy |
| Canary-180M-Flash | 180M | 1.87% | ❌ | 4 + translation | ONNX ✅ | CC-BY-4.0 | Edge + accuracy |
| Nemotron Speech 0.6B | 600M | ~2% | ✅ 80ms chunks | EN | ❌ | NVIDIA Open | Real-time voice agents |
| SenseVoice Small | 234M | ~3% (EN) | ❌ | 5 + emotion | ONNX ✅ | Apache 2.0 | Chinese + emotion |
| Moonshine v2 Medium | ~200M | <2.4% | ✅ 258ms | EN | ONNX ✅ | MIT | Edge streaming |
| Moonshine tiny | 27M | ~7.6% | ❌ | EN | ONNX ✅ | MIT | Ultra-edge |
| Vosk (small EN) | ~50M | ~10-15% | ✅ Zero-latency | 20+ | ❌ | Apache 2.0 | Commands |
DGX Spark Compatibility
| Model/Runtime | Grace CPU | Blackwell GPU | Status |
|---|---|---|---|
| whisper.cpp | ✅ Native aarch64 | ⚠️ CUDA 13 build needed | Confirmed |
| faster-whisper | ✅ CPU int8 confirmed | ⚠️ CTranslate2 aarch64 | Official Arm tutorial |
| NeMo Parakeet/Canary | ✅ | ✅ Native via Riva/NIM | First-party NVIDIA |
| Nemotron Speech | ✅ | ✅ | Confirmed (pipecat demo) |
| Moonshine | ✅ ONNX Runtime | N/A (CPU-only typical) | Designed for ARM |
| SenseVoice | ✅ Via sherpa-onnx | N/A | ONNX on aarch64 |
| Vosk | ✅ ARM support | N/A (CPU only) | Works |
Choosing an STT Model
| Your Situation | Recommended Model | Why |
|---|---|---|
| Real-time voice agent on NVIDIA | Nemotron Speech 0.6B | 24ms median latency, streaming, native Spark |
| Best accuracy, batch processing | Canary-1B-Flash | 1.48% WER, best open-weight accuracy |
| General purpose, any hardware | Whisper large-v3-turbo via faster-whisper | 100 languages, MIT, massive ecosystem |
| Fast English transcription | distil-large-v3.5 | 1.5× faster than turbo, within 1% WER |
| Raspberry Pi / edge device | Moonshine tiny or base | 26–60 MB, runs in 8 MB RAM |
| Edge with streaming needed | Moonshine v2 Medium | Better than Whisper Large-v3 at 43× speed |
| Chinese + emotion detection | SenseVoice Small | 5× faster than Whisper Small, emotion tags |
| Ultra-lightweight commands | Vosk | 50 MB, zero-latency streaming |
| Cross-platform deployment | sherpa-onnx + any ONNX model | Bindings for 12 programming languages, embedded to server |
References
| Resource | URL |
|---|---|
| OpenAI Whisper | github.com/openai/whisper |
| whisper.cpp | github.com/ggml-org/whisper.cpp |
| faster-whisper | github.com/SYSTRAN/faster-whisper |
| Distil-Whisper | github.com/huggingface/distil-whisper |
| Parakeet-TDT-0.6B-v2 | huggingface.co/nvidia/parakeet-tdt-0.6b-v2 |
| Parakeet-TDT-0.6B-v3 | huggingface.co/nvidia/parakeet-tdt-0.6b-v3 |
| Canary-1B-Flash | huggingface.co/nvidia/canary-1b-flash |
| Nemotron Speech Streaming | huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b |
| Nemotron Speech blog | huggingface.co/blog/nvidia/nemotron-speech-asr-scaling-voice-agents |
| SenseVoice | github.com/FunAudioLLM/SenseVoice |
| Moonshine | github.com/moonshine-ai/moonshine |
| Moonshine v2 paper | arxiv.org/abs/2602.12241 |
| Vosk | github.com/alphacep/vosk-api |
| sherpa-onnx | github.com/k2-fsa/sherpa-onnx |
| Arm DGX Spark voice chatbot | learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/ |
Related Guides
Complete Offline TTS Guide 2026
Comprehensive guide to offline text-to-speech: 30+ models, optimization techniques, deployment configurations, and DGX Spark compatibility.
TTS Model Catalog & Optimization
Complete catalog of 30+ offline TTS models with DGX Spark compatibility, optimization techniques, voice pipeline architecture, and deployment configurations.
Hardware Comparison Guide
Side-by-side comparison of 8 hardware platforms for TTS deployment: DGX Spark, Jetson Orin, Mac Studio, consumer GPUs, and cloud instances.