
Offline STT Guide

A comprehensive reference for deploying speech-to-text locally — covering Whisper variants, NVIDIA NeMo models, Moonshine, and edge STT options with DGX Spark compatibility notes.

Overview

The STT (speech-to-text) stage of a voice pipeline converts spoken audio into text for the LLM to process. While TTS quality gets more attention, STT latency and accuracy directly gate the entire pipeline — you can't respond to what you haven't transcribed.

The 2026 offline STT landscape is dominated by three families:

  • Whisper ecosystem (OpenAI): The default starting point. 9 model sizes, 100 languages, MIT license, massive community tooling
  • NVIDIA NeMo models (Parakeet, Canary, Nemotron Speech): State-of-the-art accuracy and speed, open-weight, native DGX Spark support
  • Edge-optimized models (Moonshine, SenseVoice, Vosk): Purpose-built for low-power and embedded deployment

Key Finding

NVIDIA Parakeet-TDT-0.6B-v2 processes audio at 3,380× real-time — transcribing 60 minutes of audio per second at batch size 128. Moonshine v2 Medium Streaming achieves lower WER than Whisper Large-v3 at 43× the speed. Mozilla DeepSpeech was officially discontinued in June 2025.
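
Those throughput figures are inverse real-time factors (RTFx): seconds of audio processed per wall-clock second. A quick sanity check of the headline number (a throwaway helper written for this guide, not any library's API):

```python
def wall_clock_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimate wall-clock time to transcribe `audio_seconds` of audio
    at an inverse real-time factor of `rtfx` (audio-seconds per second)."""
    return audio_seconds / rtfx

# 60 minutes of audio at Parakeet's reported 3,380x RTFx:
print(f"{wall_clock_seconds(3600, 3380):.2f} s")  # about 1.07 s
```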

The Whisper Family

OpenAI's Whisper (MIT license) spans nine model sizes from 39M to 1.55B parameters, supporting 100 languages.

Model Sizes

| Model | Params | English WER (LS-clean) | Relative Speed | Est. VRAM | Languages |
|---|---|---|---|---|---|
| tiny / tiny.en | 39M | ~7.6% / ~5.6% | ~10× | ~1 GB | 100 / EN only |
| base / base.en | 74M | ~5.0% / ~4.3% | ~7× | ~1 GB | 100 / EN only |
| small / small.en | 244M | ~3.4% / ~3.0% | ~4× | ~2 GB | 100 / EN only |
| medium / medium.en | 769M | ~2.9% / ~2.8% | ~2× | ~5 GB | 100 / EN only |
| large-v2 | 1,550M | ~2.7% | 1× | ~10 GB | 100 |
| large-v3 | 1,550M | ~2.4% | 1× | ~10 GB | 100 |
| large-v3-turbo | 809M | ~2.5% | ~8× | ~6 GB | 100 |

large-v3 (Nov 2023) added 128 Mel filter banks (vs 80 in v2), improving accuracy by 10–20%. large-v3-turbo (Oct 2024) shrank the decoder from 32 to 4 layers, delivering an 8× speedup at near-identical accuracy. MLCommons selected Whisper Large-v3 as the MLPerf Inference ASR benchmark in September 2025.
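
Picking a size is mostly a VRAM/accuracy trade-off. A small helper that encodes the table above (illustrative only — the model names match OpenAI's released checkpoints, the VRAM numbers are the estimates from the table):

```python
# (model, params in B, est. VRAM in GB, English-only ".en" variant exists)
WHISPER_MODELS = [
    ("tiny",           0.039,  1, True),
    ("base",           0.074,  1, True),
    ("small",          0.244,  2, True),
    ("medium",         0.769,  5, True),
    ("large-v3-turbo", 0.809,  6, False),
    ("large-v3",       1.55,  10, False),
]

def pick_whisper(vram_gb: float, english_only: bool = False) -> str:
    """Return the most accurate Whisper model that fits the VRAM budget
    (the list above is ordered from least to most accurate)."""
    best = None
    for name, _params, vram, has_en in WHISPER_MODELS:
        if vram <= vram_gb:
            best = name + (".en" if english_only and has_en else "")
    return best or "tiny"  # nothing fits: fall back to the smallest model

print(pick_whisper(8))        # large-v3-turbo
print(pick_whisper(12))       # large-v3
print(pick_whisper(2, True))  # small.en
```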

Distil-Whisper

HuggingFace's distilled Whisper variants compress the decoder to 2 layers:

| Variant | Speed vs Teacher | Size Reduction | English WER Impact | License |
|---|---|---|---|---|
| distil-large-v2 | 6.3× faster | 51% smaller | Within 1% WER | MIT |
| distil-large-v3 | 6× faster | 49% smaller | Within 1% WER | MIT |
| distil-large-v3.5 | 1.5× faster than turbo | ~50% smaller | Within 1% WER | MIT |

distil-large-v3.5 (trained on 98K hours) is the current best distilled option. All distil variants are English-only. For multilingual, use large-v3-turbo directly.

Whisper Inference Engines

whisper.cpp

The CPU-first C/C++ Whisper implementation by ggml-org. MIT license, v1.8.3 (January 2026).

Key features:

  • GGML quantization (Q4_0 through Q8_0) — up to 65% disk reduction
  • Vulkan iGPU support (12× speedup on AMD/Intel integrated graphics)
  • Native aarch64 build — confirmed working on DGX Spark Grace CPU
  • Streaming via whisper-stream (0.5s sampling intervals)
  • No Python dependency

Build for DGX Spark

Use CUDA architecture 121 for the Blackwell GB10 GPU on DGX Spark.

git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121
cmake --build build -j4
./build/bin/whisper-cli -m models/ggml-large-v3-turbo-q8_0.bin -f audio.wav

faster-whisper

SYSTRAN's CTranslate2-based implementation. MIT license. 4× faster than original Whisper with identical accuracy, plus int8 quantization halving VRAM usage.

Benchmarks (RTX 3070 Ti, large-v3):

  • Original Whisper: 2m23s
  • faster-whisper FP16: 52s
  • faster-whisper int8: ~45s
  • distil-large-v3 int8: 22.5s

DGX Spark: Confirmed working via NVIDIA's official Arm Learning Paths tutorial (CPU int8 mode). Community Docker images available for CUDA 13 on Blackwell.

# Install first: pip install faster-whisper
from faster_whisper import WhisperModel

# CPU int8 inference (the configuration confirmed on DGX Spark)
model = WhisperModel("large-v3-turbo", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
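
The `segments` generator yields objects with `start`, `end`, and `text` attributes, which makes subtitle export straightforward. A minimal SRT formatter (generic sketch, not part of faster-whisper — it works on any (start, end, text) records):

```python
def to_srt(segments) -> str:
    """Render (start, end, text) segments as an SRT subtitle document."""
    def stamp(t: float) -> str:
        # SRT timestamps look like HH:MM:SS,mmm
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        ms = int(round((t - int(t)) * 1000))
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{stamp(start)} --> {stamp(end)}\n{text.strip()}\n")
    return "\n".join(blocks)

# With faster-whisper: to_srt((s.start, s.end, s.text) for s in segments)
```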

NVIDIA NeMo Speech Models

All models below are open-weight, most under CC-BY-4.0 (exceptions noted below), and available on HuggingFace and NGC.

Parakeet-TDT-0.6B

| Version | WER (LS-clean) | WER (LS-other) | Speed (RTFx) | Languages | License |
|---|---|---|---|---|---|
| v2 | 1.6% | 3.3% | 3,380× (batch 128) | EN only | CC-BY-4.0 |
| v3 | ~1.5% | ~3.1% | Similar | 25 EU languages | CC-BY-4.0 |
| 1.1B (English) | ~1.4% | ~2.9% | ~2,000× | EN only | CC-BY-4.0 |

Parakeet-TDT uses a FastConformer encoder with Token-and-Duration Transducer decoder. The v3 variant adds automatic language detection across 25 European languages and supports up to 3 hours of audio via local attention.

Canary Flash

| Variant | Params | WER (LS-clean) | WER (LS-other) | Translation | Languages |
|---|---|---|---|---|---|
| Canary-1B-Flash | 883M | 1.48% | 2.87% | ✅ EN↔DE/FR/ES | 4 |
| Canary-180M-Flash | 180M | 1.87% | 3.83% | ✅ Same | 4 |
| Canary-1B-v2 | ~1B | ~1.5% | ~2.9% | — | 25 EU languages |

Canary models combine ASR with speech translation. The 180M Flash variant is small enough for edge deployment. Note: the original Canary-1B used CC-BY-NC-4.0, but all Flash and v2 variants use permissive CC-BY-4.0.

Nemotron Speech Streaming 0.6B

Released January 2026 under NVIDIA Open Model License. Purpose-built for low-latency voice agent STT with cache-aware FastConformer-RNNT architecture.

Key specs:

  • Each audio frame encoded exactly once with cached context
  • Configurable chunk sizes: 80ms to 1,120ms
  • Median time-to-final-transcription: ~24ms
  • Handles 560 concurrent streams on H100
  • Confirmed on DGX Spark (demonstrated in pipecat-ai project)

Recommended for Real-Time Pipelines

Nemotron Speech Streaming 0.6B is the recommended STT model for real-time voice pipelines on NVIDIA hardware.
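
A chunked streaming loop can be sketched generically: slice incoming 16 kHz PCM into fixed-duration chunks and hand each to the recognizer as it arrives. The chunker below is plain Python; the recognizer call it would feed is omitted, since the exact NeMo streaming API is not shown here:

```python
from typing import Iterator, Sequence

def chunk_audio(samples: Sequence[float], sample_rate: int = 16000,
                chunk_ms: int = 80) -> Iterator[Sequence[float]]:
    """Yield fixed-duration chunks of PCM samples (last chunk may be short)."""
    step = sample_rate * chunk_ms // 1000   # samples per chunk
    for i in range(0, len(samples), step):
        yield samples[i:i + step]

# 1 second of 16 kHz audio in 80 ms chunks -> 1,280 samples per chunk
chunks = list(chunk_audio([0.0] * 16000))
print(len(chunks), len(chunks[0]))  # 13 chunks: twelve of 1280, one of 640
```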

Edge and Specialized STT

SenseVoice Small (Alibaba FunAudioLLM)

234M parameters. Unique for combining ASR + speech emotion recognition + audio event detection (applause, laughter, coughing) in a single model.

  • Languages: Chinese, English, Cantonese, Japanese, Korean
  • Speed: 10 seconds of audio in 70ms on A800 (5× faster than Whisper-Small)
  • Chinese/Cantonese accuracy: >50% better than Whisper
  • ONNX export supported, integrated into sherpa-onnx
  • License: Apache 2.0 (model), MIT (code)

Moonshine (Moonshine AI)

The standout edge STT model. MIT license.

| Variant | Params | Disk Size | WER vs Whisper | Latency | Streaming |
|---|---|---|---|---|---|
| tiny | 27M | 26 MB | Matches Whisper Tiny at 5× less compute | <100ms | — |
| base | 61M | 60 MB | Matches Whisper Small | <150ms | — |
| v2 Medium Streaming | ~200M | ~200 MB | Lower WER than Whisper Large-v3 | 258ms | ✅ |
| Language-specialized tiny | 27M | 26 MB | 48% lower error than Whisper Tiny | <100ms | — |

Moonshine v2 (February 2026) introduced sliding-window self-attention for bounded-latency streaming. All models ship in ONNX format, run on Raspberry Pi and mobile, and can operate in as little as 8 MB RAM.

Language-specialized tiny models available for: Arabic, Chinese, Japanese, Korean, Ukrainian, Vietnamese.
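
The bounded-latency property comes from capping how far back each frame can attend. A toy illustration of a fixed left-context window (illustrative of the idea only, not Moonshine's actual implementation):

```python
def attention_window(frame: int, window: int) -> range:
    """Frames a bounded sliding-window self-attention layer may attend to:
    the current frame plus up to `window - 1` frames of left context.
    Compute and memory per step stay constant regardless of stream length."""
    return range(max(0, frame - window + 1), frame + 1)

print(list(attention_window(2, 8)))   # early in the stream: only [0, 1, 2]
print(len(attention_window(100, 8)))  # steady state: always 8 frames
```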

Vosk (Alpha Cephei)

Apache 2.0 license. Ultra-lightweight with 50 MB portable models across 20+ languages and true zero-latency streaming. Accuracy (~10–15% WER on general English) is far behind Whisper-class models but excellent for command-and-control applications.

sherpa-onnx STT

Not a model but a cross-platform runtime (Apache 2.0, v1.12.29, March 2026) supporting 12+ STT architectures, including Zipformer, Paraformer, Whisper, Moonshine, SenseVoice, Parakeet CTC, Canary Flash, and FunASR-Nano. Runs on x86, ARM, RISC-V, Rockchip NPU, Ascend NPU, and Jetson, with bindings for 12 programming languages.

Mozilla DeepSpeech — Discontinued

Officially discontinued June 2025, GitHub repository archived. Last release: v0.9.3 (December 2020), 7.06% WER on LibriSpeech clean. Migrate to Whisper, Moonshine, or Vosk.

STT Model Comparison Table

| Model | Params | WER (LS-clean) | Streaming | Languages | ONNX/GGUF | License | Best For |
|---|---|---|---|---|---|---|---|
| Whisper large-v3-turbo | 809M | ~2.5% | ❌ (batch) | 100 | GGML ✅ | MIT | General purpose |
| distil-large-v3.5 | ~400M | ~2.6% | — | EN | GGML ✅ | MIT | Fast English |
| Parakeet-TDT-0.6B-v2 | 600M | 1.6% | — | EN | ONNX ✅ | CC-BY-4.0 | Accuracy + speed |
| Parakeet-TDT-0.6B-v3 | 600M | ~1.5% | — | 25 EU | ONNX ✅ | CC-BY-4.0 | Multilingual accuracy |
| Canary-1B-Flash | 883M | 1.48% | — | 4 + translation | ONNX ✅ | CC-BY-4.0 | Best accuracy |
| Canary-180M-Flash | 180M | 1.87% | — | 4 + translation | ONNX ✅ | CC-BY-4.0 | Edge + accuracy |
| Nemotron Speech 0.6B | 600M | ~2% | ✅ 80ms chunks | EN | — | NVIDIA Open | Real-time voice agents |
| SenseVoice Small | 234M | ~3% (EN) | — | 5 + emotion | ONNX ✅ | Apache 2.0 | Chinese + emotion |
| Moonshine v2 Medium | ~200M | <2.4% | ✅ 258ms | EN | ONNX ✅ | MIT | Edge streaming |
| Moonshine tiny | 27M | ~7.6% | — | EN | ONNX ✅ | MIT | Ultra-edge |
| Vosk (small EN) | ~50M | ~10–15% | ✅ Zero-latency | 20+ | — | Apache 2.0 | Commands |

DGX Spark Compatibility

| Model/Runtime | Grace CPU | Blackwell GPU | Status |
|---|---|---|---|
| whisper.cpp | ✅ Native aarch64 | ⚠️ CUDA 13 build needed | Confirmed |
| faster-whisper | ✅ CPU int8 confirmed | ⚠️ CTranslate2 aarch64 | Official Arm tutorial |
| NeMo Parakeet/Canary | — | ✅ Native via Riva/NIM | First-party NVIDIA |
| Nemotron Speech | — | — | Confirmed (pipecat demo) |
| Moonshine | ✅ ONNX Runtime | N/A (CPU-only typical) | Designed for ARM |
| SenseVoice | ✅ Via sherpa-onnx | N/A | ONNX on aarch64 |
| Vosk | ✅ ARM support | N/A (CPU only) | Works |

Choosing an STT Model

| Your Situation | Recommended Model | Why |
|---|---|---|
| Real-time voice agent on NVIDIA | Nemotron Speech 0.6B | 24ms median latency, streaming, native Spark |
| Best accuracy, batch processing | Canary-1B-Flash | 1.48% WER, best open-weight accuracy |
| General purpose, any hardware | Whisper large-v3-turbo via faster-whisper | 100 languages, MIT, massive ecosystem |
| Fast English transcription | distil-large-v3.5 | 1.5× faster than turbo, within 1% WER |
| Raspberry Pi / edge device | Moonshine tiny or base | 26–60 MB, runs in as little as 8 MB RAM |
| Edge with streaming needed | Moonshine v2 Medium | Lower WER than Whisper Large-v3 at 43× the speed |
| Chinese + emotion detection | SenseVoice Small | 5× faster than Whisper Small, emotion tags |
| Ultra-lightweight commands | Vosk | 50 MB, zero-latency streaming |
| Cross-platform deployment | sherpa-onnx + any ONNX model | 12 language bindings, embedded to server |
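
The same decisions can be expressed as a lookup, e.g. for wiring into a pipeline config (the scenario keys and the fallback are ad hoc choices for this sketch, and the model identifiers are approximate):

```python
# Scenario -> model suggested by the table above (ad hoc labels)
STT_PICKS = {
    "realtime_nvidia": "nemotron-speech-streaming-en-0.6b",
    "best_accuracy":   "canary-1b-flash",
    "fast_english":    "distil-large-v3.5",
    "edge_device":     "moonshine-tiny",
    "edge_streaming":  "moonshine-v2-medium-streaming",
    "chinese_emotion": "sensevoice-small",
    "commands":        "vosk-model-small-en-us",
}

def recommend(scenario: str) -> str:
    """Map a deployment scenario to a model; unknown scenarios fall back
    to the general-purpose default from the table."""
    return STT_PICKS.get(scenario, "whisper-large-v3-turbo")

print(recommend("commands"))   # vosk-model-small-en-us
print(recommend("voicemail"))  # whisper-large-v3-turbo
```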

References

| Resource | URL |
|---|---|
| OpenAI Whisper | github.com/openai/whisper |
| whisper.cpp | github.com/ggml-org/whisper.cpp |
| faster-whisper | github.com/SYSTRAN/faster-whisper |
| Distil-Whisper | github.com/huggingface/distil-whisper |
| Parakeet-TDT-0.6B-v2 | huggingface.co/nvidia/parakeet-tdt-0.6b-v2 |
| Parakeet-TDT-0.6B-v3 | huggingface.co/nvidia/parakeet-tdt-0.6b-v3 |
| Canary-1B-Flash | huggingface.co/nvidia/canary-1b-flash |
| Nemotron Speech Streaming | huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b |
| Nemotron Speech blog | huggingface.co/blog/nvidia/nemotron-speech-asr-scaling-voice-agents |
| SenseVoice | github.com/FunAudioLLM/SenseVoice |
| Moonshine | github.com/moonshine-ai/moonshine |
| Moonshine v2 paper | arxiv.org/abs/2602.12241 |
| Vosk | github.com/alphacep/vosk-api |
| sherpa-onnx | github.com/k2-fsa/sherpa-onnx |
| Arm DGX Spark voice chatbot | learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/ |
