A comprehensive reference for deploying speech-to-text locally — covering Whisper variants, NVIDIA NeMo models, Moonshine, and edge STT options with DGX Spark compatibility notes.
Overview
The STT (speech-to-text) stage of a voice pipeline converts spoken audio into text for the LLM to process. While TTS quality gets more attention, STT latency and accuracy directly gate the entire pipeline — you can't respond to what you haven't transcribed.
The 2026 offline STT landscape is dominated by three families:
- Whisper ecosystem (OpenAI): The default starting point. 9 model sizes, 100 languages, MIT license, massive community tooling
- NVIDIA NeMo models (Parakeet, Canary, Nemotron Speech): State-of-the-art accuracy and speed, open-weight, native DGX Spark support
- Edge-optimized models (Moonshine, SenseVoice, Vosk): Purpose-built for low-power and embedded deployment
NVIDIA Parakeet-TDT-0.6B-v2 processes audio at 3,380× real-time — transcribing 60 minutes of audio per second at batch size 128. Moonshine v2 Medium Streaming achieves lower WER than Whisper Large-v3 at 43× the speed. Mozilla DeepSpeech was officially discontinued in June 2025.
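Real-time factor (RTFx) figures like these translate directly into throughput. A quick sanity check of the Parakeet headline number (plain arithmetic, nothing model-specific):

```python
# Real-time factor (RTFx): seconds of audio transcribed per second of compute.
rtfx = 3380

# Time to transcribe one hour (3,600 s) of audio at this rate:
compute_time = 3600 / rtfx
print(f"{compute_time:.2f} s per hour of audio")  # 1.07 s per hour of audio

# Equivalently, audio-minutes transcribed per compute-second:
minutes_per_second = rtfx / 60
print(f"{minutes_per_second:.1f} audio-minutes per second")  # 56.3 audio-minutes per second
```

That works out to roughly an hour of audio per second of compute, which is where the "60 minutes per second" headline comes from.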
The Whisper Family
OpenAI's Whisper (MIT license) spans nine model sizes from 39M to 1.55B parameters, supporting 100 languages.
Model Sizes
| Model | Params | English WER (LS-clean) | Relative Speed | Est. VRAM | Languages |
|---|---|---|---|---|---|
| tiny / tiny.en | 39M | ~7.6% / ~5.6% | ~10× | ~1 GB | 100 / EN only |
| base / base.en | 74M | ~5.0% / ~4.3% | ~7× | ~1 GB | 100 / EN only |
| small / small.en | 244M | ~3.4% / ~3.0% | ~4× | ~2 GB | 100 / EN only |
| medium / medium.en | 769M | ~2.9% / ~2.8% | ~2× | ~5 GB | 100 / EN only |
| large-v2 | 1,550M | ~2.7% | 1× | ~10 GB | 100 |
| large-v3 | 1,550M | ~2.4% | 1× | ~10 GB | 100 |
| large-v3-turbo | 809M | ~2.5% | ~8× | ~6 GB | 100 |
large-v3 (Nov 2023) added 128 Mel filter banks (vs 80 in v2), improving accuracy 10–20%. large-v3-turbo (Oct 2024) shrunk the decoder from 32 to 4 layers, delivering 8× speedup at near-identical accuracy. MLCommons selected Whisper Large-v3 as the MLPerf Inference ASR benchmark in September 2025.
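The WER column above (and throughout this guide) is word error rate: word-level edit distance — substitutions, insertions, and deletions — divided by the number of reference words. A minimal sketch of the metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Production evaluations normalize text first (casing, punctuation, number formatting), which is why reported WERs for the same model can differ between benchmarks.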
Distil-Whisper
HuggingFace's distilled Whisper variants compress the decoder to 2 layers:
| Variant | Speed vs Teacher | Size Reduction | English WER Impact | License |
|---|---|---|---|---|
| distil-large-v2 | 6.3× faster | 51% smaller | Within 1% WER | MIT |
| distil-large-v3 | 6× faster | 49% smaller | Within 1% WER | MIT |
| distil-large-v3.5 | 1.5× faster than turbo | ~50% smaller | Within 1% WER | MIT |
distil-large-v3.5 (trained on 98K hours) is the current best distilled option. All distil variants are English-only. For multilingual, use large-v3-turbo directly.
Whisper Inference Engines
whisper.cpp
The CPU-first C/C++ Whisper implementation by ggml-org. MIT license, v1.8.3 (January 2026).
Key features:
- GGML quantization (Q4_0 through Q8_0) — up to 65% disk reduction
- Vulkan iGPU support (12× speedup on AMD/Intel integrated graphics)
- Native aarch64 build — confirmed working on DGX Spark Grace CPU
- Streaming via whisper-stream (0.5 s sampling intervals)
- No Python dependency
Use CUDA architecture 121 for the Blackwell GB10 GPU on DGX Spark.
```shell
git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121
cmake --build build -j4
./build/bin/whisper-cli -m models/ggml-large-v3-turbo-q8_0.bin -f audio.wav
```
faster-whisper
SYSTRAN's CTranslate2-based implementation. MIT license. 4× faster than original Whisper with identical accuracy, plus int8 quantization halving VRAM usage.
Benchmarks (RTX 3070 Ti, large-v3):
- Original Whisper: 2m23s
- faster-whisper FP16: 52s
- faster-whisper int8: ~45s
- distil-large-v3 int8: 22.5s
DGX Spark: Confirmed working via NVIDIA's official Arm Learning Paths tutorial (CPU int8 mode). Community Docker images available for CUDA 13 on Blackwell.
```shell
pip install faster-whisper
```

```python
# CPU inference (confirmed on DGX Spark):
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
NVIDIA NeMo Speech Models
All models below are open-weight, most under CC-BY-4.0 (exceptions noted below), and available on HuggingFace and NGC.
Parakeet-TDT-0.6B
| Version | WER (LS-clean) | WER (LS-other) | Speed (RTFx) | Languages | License |
|---|---|---|---|---|---|
| v2 | 1.6% | 3.3% | 3,380× (batch 128) | EN only | CC-BY-4.0 |
| v3 | ~1.5% | ~3.1% | Similar | 25 EU languages | CC-BY-4.0 |
| 1.1B (English) | ~1.4% | ~2.9% | ~2,000× | EN only | CC-BY-4.0 |
Parakeet-TDT uses a FastConformer encoder with Token-and-Duration Transducer decoder. The v3 variant adds automatic language detection across 25 European languages and supports up to 3 hours of audio via local attention.
Canary Flash
| Variant | Params | WER (LS-clean) | WER (LS-other) | Translation | Languages |
|---|---|---|---|---|---|
| Canary-1B-Flash | 883M | 1.48% | 2.87% | ✅ EN↔DE/FR/ES | 4 |
| Canary-180M-Flash | 180M | 1.87% | 3.83% | ✅ Same | 4 |
| Canary-1B-v2 | ~1B | ~1.5% | ~2.9% | ✅ | 25 EU languages |
Canary models combine ASR with speech translation. The 180M Flash variant is small enough for edge deployment. Note: the original Canary-1B used CC-BY-NC-4.0, but all Flash and v2 variants use permissive CC-BY-4.0.
Nemotron Speech Streaming 0.6B
Released January 2026 under NVIDIA Open Model License. Purpose-built for low-latency voice agent STT with cache-aware FastConformer-RNNT architecture.
Key specs:
- Each audio frame encoded exactly once with cached context
- Configurable chunk sizes: 80ms to 1,120ms
- Median time-to-final-transcription: ~24ms
- Handles 560 concurrent streams on H100
- Confirmed on DGX Spark (demonstrated in pipecat-ai project)
Nemotron Speech Streaming 0.6B is the recommended STT model for real-time voice pipelines on NVIDIA hardware.
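The configurable chunk size is the latency knob in this design: smaller chunks mean faster partial transcripts but more encoder invocations. The trade-off is easy to see with a plain sample-count calculation (16 kHz mono assumed for illustration; this is not the model's actual API):

```python
SAMPLE_RATE = 16_000  # Hz, typical ASR input

def chunk_samples(chunk_ms: int) -> int:
    """Number of audio samples in one streaming chunk."""
    return SAMPLE_RATE * chunk_ms // 1000

def chunk_stream(audio, chunk_ms=80):
    """Yield fixed-size chunks; in a cache-aware model each chunk is
    encoded exactly once, reusing cached left context from prior chunks."""
    n = chunk_samples(chunk_ms)
    for start in range(0, len(audio), n):
        yield audio[start:start + n]

# 80 ms chunks -> 1,280 samples each; 1 s of audio -> 13 chunks (last one partial)
one_second = [0.0] * SAMPLE_RATE
chunks = list(chunk_stream(one_second, chunk_ms=80))
print(chunk_samples(80), len(chunks))  # 1280 13
```

At the 1,120 ms maximum chunk size the same second of audio arrives in a single chunk, trading responsiveness for throughput.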
Edge and Specialized STT
SenseVoice Small (Alibaba FunAudioLLM)
234M parameters. Unique for combining ASR + speech emotion recognition + audio event detection (applause, laughter, coughing) in a single model.
- Languages: Chinese, English, Cantonese, Japanese, Korean
- Speed: 10 seconds of audio in 70ms on A800 (5× faster than Whisper-Small)
- Chinese/Cantonese accuracy: over 50% lower error rate than Whisper
- ONNX export supported, integrated into sherpa-onnx
- License: Apache 2.0 (model), MIT (code)
Moonshine (Moonshine AI)
The standout edge STT model. MIT license.
| Variant | Params | Disk Size | WER vs Whisper | Latency | Streaming |
|---|---|---|---|---|---|
| tiny | 27M | 26 MB | Matches Whisper Tiny at 5× less compute | <100ms | ❌ |
| base | 61M | 60 MB | Matches Whisper Small | <150ms | ❌ |
| v2 Medium Streaming | ~200M | ~200 MB | Lower WER than Whisper Large-v3 | 258ms | ✅ |
| Language-specialized tiny | 27M | 26 MB | 48% lower error than Whisper Tiny | <100ms | ❌ |
Moonshine v2 (February 2026) introduced sliding-window self-attention for bounded-latency streaming. All models ship in ONNX format, run on Raspberry Pi and mobile, and can operate in as little as 8 MB RAM.
Language-specialized tiny models available for: Arabic, Chinese, Japanese, Korean, Ukrainian, Vietnamese.
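Sliding-window self-attention is what bounds Moonshine v2's streaming latency: each frame attends only to a fixed number of recent frames, so per-step compute stays constant no matter how long the audio runs. A toy mask construction (pure Python; the window size here is illustrative, not Moonshine's actual configuration):

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True if frame i may attend to frame j.

    Causal + sliding window: each frame sees itself and the previous
    window - 1 frames, never the full history.
    """
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
# Frame 5 attends only to frames 3, 4, 5 — not all of frames 0–5:
print([j for j in range(6) if mask[5][j]])  # [3, 4, 5]
```

With full (unwindowed) self-attention, attention cost per new frame grows with the audio length, which is why vanilla Whisper encoders batch 30-second windows instead of streaming.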
Vosk (Alpha Cephei)
Apache 2.0 license. Ultra-lightweight with 50 MB portable models across 20+ languages and true zero-latency streaming. Accuracy (~10–15% WER on general English) is far behind Whisper-class models but excellent for command-and-control applications.
sherpa-onnx STT
Not a model but a cross-platform runtime (Apache 2.0, v1.12.29, March 2026) supporting 12+ STT architectures, including Zipformer, Paraformer, Whisper, Moonshine, SenseVoice, Parakeet CTC, Canary Flash, and FunASR-Nano. Runs on x86, ARM, RISC-V, Rockchip NPU, Ascend NPU, and Jetson, with bindings for 12 programming languages.
Mozilla DeepSpeech — Discontinued
Officially discontinued June 2025, GitHub repository archived. Last release: v0.9.3 (December 2020), 7.06% WER on LibriSpeech clean. Migrate to Whisper, Moonshine, or Vosk.
STT Model Comparison Table
| Model | Params | WER (LS-clean) | Streaming | Languages | ONNX/GGUF | License | Best For |
|---|---|---|---|---|---|---|---|
| Whisper large-v3-turbo | 809M | ~2.5% | ❌ (batch) | 100 | GGML ✅ | MIT | General purpose |
| distil-large-v3.5 | ~750M | ~2.6% | ❌ | EN | GGML ✅ | MIT | Fast English |
| Parakeet-TDT-0.6B-v2 | 600M | 1.6% | ❌ | EN | ONNX ✅ | CC-BY-4.0 | Accuracy + speed |
| Parakeet-TDT-0.6B-v3 | 600M | ~1.5% | ❌ | 25 EU | ONNX ✅ | CC-BY-4.0 | Multilingual accuracy |
| Canary-1B-Flash | 883M | 1.48% | ❌ | 4 + translation | ONNX ✅ | CC-BY-4.0 | Best accuracy |
| Canary-180M-Flash | 180M | 1.87% | ❌ | 4 + translation | ONNX ✅ | CC-BY-4.0 | Edge + accuracy |
| Nemotron Speech 0.6B | 600M | ~2% | ✅ 80ms chunks | EN | ❌ | NVIDIA Open | Real-time voice agents |
| SenseVoice Small | 234M | ~3% (EN) | ❌ | 5 + emotion | ONNX ✅ | Apache 2.0 | Chinese + emotion |
| Moonshine v2 Medium | ~200M | <2.4% | ✅ 258ms | EN | ONNX ✅ | MIT | Edge streaming |
| Moonshine tiny | 27M | ~7.6% | ❌ | EN | ONNX ✅ | MIT | Ultra-edge |
| Vosk (small EN) | ~50M | ~10-15% | ✅ Zero-latency | 20+ | ❌ | Apache 2.0 | Commands |
DGX Spark Compatibility
| Model/Runtime | Grace CPU | Blackwell GPU | Status |
|---|---|---|---|
| whisper.cpp | ✅ Native aarch64 | ⚠️ CUDA 13 build needed | Confirmed |
| faster-whisper | ✅ CPU int8 confirmed | ⚠️ CTranslate2 aarch64 | Official Arm tutorial |
| NeMo Parakeet/Canary | ✅ | ✅ Native via Riva/NIM | First-party NVIDIA |
| Nemotron Speech | ✅ | ✅ | Confirmed (pipecat demo) |
| Moonshine | ✅ ONNX Runtime | N/A (CPU-only typical) | Designed for ARM |
| SenseVoice | ✅ Via sherpa-onnx | N/A | ONNX on aarch64 |
| Vosk | ✅ ARM support | N/A (CPU only) | Works |
Choosing an STT Model
| Your Situation | Recommended Model | Why |
|---|---|---|
| Real-time voice agent on NVIDIA | Nemotron Speech 0.6B | 24ms median latency, streaming, native Spark |
| Best accuracy, batch processing | Canary-1B-Flash | 1.48% WER, best open-weight accuracy |
| General purpose, any hardware | Whisper large-v3-turbo via faster-whisper | 100 languages, MIT, massive ecosystem |
| Fast English transcription | distil-large-v3.5 | 1.5× faster than turbo, within 1% WER |
| Raspberry Pi / edge device | Moonshine tiny or base | 26–60 MB, runs in 8 MB RAM |
| Edge with streaming needed | Moonshine v2 Medium | Better than Whisper Large-v3 at 43× speed |
| Chinese + emotion detection | SenseVoice Small | 5× faster than Whisper Small, emotion tags |
| Ultra-lightweight commands | Vosk | 50 MB, zero-latency streaming |
| Cross-platform deployment | sherpa-onnx + any ONNX model | Bindings for 12 programming languages, embedded to server |
References
| Resource | URL |
|---|---|
| OpenAI Whisper | github.com/openai/whisper |
| whisper.cpp | github.com/ggml-org/whisper.cpp |
| faster-whisper | github.com/SYSTRAN/faster-whisper |
| Distil-Whisper | github.com/huggingface/distil-whisper |
| Parakeet-TDT-0.6B-v2 | huggingface.co/nvidia/parakeet-tdt-0.6b-v2 |
| Parakeet-TDT-0.6B-v3 | huggingface.co/nvidia/parakeet-tdt-0.6b-v3 |
| Canary-1B-Flash | huggingface.co/nvidia/canary-1b-flash |
| Nemotron Speech Streaming | huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b |
| Nemotron Speech blog | huggingface.co/blog/nvidia/nemotron-speech-asr-scaling-voice-agents |
| SenseVoice | github.com/FunAudioLLM/SenseVoice |
| Moonshine | github.com/moonshine-ai/moonshine |
| Moonshine v2 paper | arxiv.org/abs/2602.12241 |
| Vosk | github.com/alphacep/vosk-api |
| sherpa-onnx | github.com/k2-fsa/sherpa-onnx |
| Arm DGX Spark voice chatbot | learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/ |
Related Guides
Complete Offline TTS Guide 2026
Comprehensive guide to offline text-to-speech: 30+ models, optimization techniques, deployment configurations, and DGX Spark compatibility.
TTS Model Catalog & Optimization
Complete catalog of 30+ offline TTS models with DGX Spark compatibility, optimization techniques, voice pipeline architecture, and deployment configurations.
Hardware Comparison Guide
Side-by-side comparison of 8 hardware platforms for TTS deployment: DGX Spark, Jetson Orin, Mac Studio, consumer GPUs, and cloud instances.