Choose the right TTS model for your use case. Includes decision flowcharts, ranked recommendations by scenario, and step-by-step quickstart guides for the top models.
Decision Flowchart
Which TTS Should I Use?
DGX Spark Picks (confirmed or high-confidence)
| Priority | Model | Why | Status |
|---|---|---|---|
| 1st to test | Qwen3-TTS-12Hz-0.6B | 97ms TTFA, 12.5 Hz codec = fewest tokens/sec, confirmed on Spark via vLLM-Omni | Confirmed |
| 2nd to test | VibeVoice-Realtime-0.5B | Confirmed 0.48× RTF on Spark, streaming, 7.5 Hz codec | Confirmed |
| 3rd to test | Orpheus 3B (GGUF Q4) | llama.cpp has excellent Spark support; broadest ecosystem | High confidence |
| 4th to test | TADA-1B | Raw PyTorch + Llama backbone; ultra-low 2-3 Hz codec rate | High confidence |
| 5th to test | Fun-CosyVoice 3 | Similar architecture to working Qwen3-TTS; TRT-LLM path | High confidence |
| Fallback | NVIDIA Magpie 357M | Confirmed on Spark but no voice cloning in open weights | Confirmed (no cloning) |
Preset Voice Models (no cloning needed)
| Situation | Model | Why |
|---|---|---|
| Absolute smallest footprint | Kokoro-82M | 50 MB ONNX, runs in browser, #17 on TTS Arena |
| CPU-only / embedded | Piper | aarch64 binaries, hundreds of voices, ONNX |
| Need multiple accents | MeloTTS | 6 languages, 4 English accents, MIT |
| Tiniest possible | KittenTTS Nano | 25 MB, 15M params, INT8 available |
Edge Picks (CPU / Raspberry Pi / Mobile)
| Situation | Model | Why |
|---|---|---|
| Voice cloning on CPU | NeuTTS Air (Q4 GGUF) | 500 MB, real-time on mid-range CPU, Apache 2.0 |
| High quality without cloning | Kokoro-82M (ONNX) | Runs on Pi 4, browser, phone |
| Maximum language coverage | sherpa-onnx + Piper voices | Pre-built ARM64 binaries, 30+ languages |
| Accessibility / screen reader | eSpeak-NG | <5 MB, instant, 100+ languages |
Recommendations by Use Case
Voice Assistant / Conversational AI
Goal: Sub-1-second voice-to-voice response with natural conversation flow.
| Rank | Model | Why | Est. Pipeline Latency |
|---|---|---|---|
| 1 | Qwen3-TTS-12Hz-0.6B | 97ms TTFA, streaming, 10 languages, Apache 2.0 | ~700-900ms |
| 2 | Fun-CosyVoice 3 | 150ms TTFA, bidirectional streaming, emotion control | ~700-1000ms |
| 3 | TADA-1B | 0.09 RTF, zero hallucinations, MIT | ~800-1200ms (no streaming yet) |
| 4 | Chatterbox Turbo | Sub-200ms, paralinguistic tags, MIT | ~800-1100ms |
Use Pipecat with sentence-level aggregation. Add filler words ("umm," "so") via LLM prompting to mask TTS latency.
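Pipecat's aggregators implement this pattern for you, but the core idea is framework-independent. A minimal sketch in plain Python (no Pipecat APIs assumed): buffer streamed LLM tokens and release only complete sentences to the TTS engine, so synthesis starts as soon as the first sentence is done.

```python
import re

# Sentence boundary: split after ., !, or ? followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def aggregate_sentences(token_stream):
    """Buffer streamed LLM tokens and yield complete sentences.

    Sending whole sentences (not raw tokens) to TTS gives the model
    enough context for natural prosody while keeping latency low: the
    first sentence ships as soon as it is complete.
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = _BOUNDARY.split(buffer)
        # Everything except the last part is a complete sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():  # flush whatever remains at end of stream
        yield buffer.strip()

# Example: tokens as an LLM might stream them
tokens = ["Umm, ", "sure", "! Let", " me", " check", " that", ".", " One", " sec."]
print(list(aggregate_sentences(tokens)))
# → ['Umm, sure!', 'Let me check that.', 'One sec.']
```

A real pipeline would feed each yielded sentence to the TTS engine's streaming endpoint as it arrives.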
Audiobook / Long-Form Narration
Goal: Highest quality, consistent voice over 10+ minutes, emotion range.
| Rank | Model | Why |
|---|---|---|
| 1 | Fish Audio S2 Pro | SOTA quality, 80+ languages, 15,000+ emotion tags, voice cloning |
| 2 | Higgs Audio V2 | Best emotion benchmarks, multi-speaker dialogue |
| 3 | Qwen3-TTS-12Hz-1.7B | High quality, 10 languages, long-form stable |
| 4 | TADA-3B-ML | Zero hallucinations, 700s context, 9 languages |
RTF doesn't need to be below 1.0 for audiobooks; generation can run slower than real time. Prioritize quality over speed.
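Real-time factor makes this tradeoff concrete: RTF is synthesis time divided by the duration of audio produced, so values above 1.0 only rule out live streaming, not batch rendering. A small illustrative helper:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of audio produced.

    RTF < 1.0 means faster than real time (required for live streaming);
    RTF > 1.0 is acceptable for offline jobs like audiobook rendering.
    """
    return synthesis_seconds / audio_seconds

# A 10-hour audiobook at RTF 0.5 takes 5 hours of compute; at RTF 1.5 it
# takes 15 hours - slow, but quality is what matters for narration.
rtf = real_time_factor(synthesis_seconds=90.0, audio_seconds=60.0)
print(f"RTF {rtf:.2f}: {'real-time capable' if rtf < 1.0 else 'offline only'}")
```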
Voice Cloning for Persona Preservation
Goal: Capture a specific person's voice from a recording for ongoing use.
| Rank | Model | Ref Audio Needed | Clone Quality |
|---|---|---|---|
| 1 | Fish Audio S2 Pro | 10-30s | Excellent (SOTA speaker sim) |
| 2 | Qwen3-TTS-12Hz-1.7B | 3-15s | Excellent (0.789 avg SIM) |
| 3 | Fun-CosyVoice 3 | Few seconds | Very good |
| 4 | NeuTTS Air | 3-15s | Good (85-95% depending on length) |
| 5 | Chatterbox | 5-10s | Good (emotion preservation) |
Game / Interactive Media
Goal: Multiple distinct characters, emotional range, real-time.
| Rank | Model | Why |
|---|---|---|
| 1 | Dia2 | Multi-speaker ([S1]/[S2] tags), nonverbal sounds, streaming |
| 2 | Chatterbox Turbo | Paralinguistic tags ([laugh], [cough]), emotion control, MIT |
| 3 | Fish Audio S2 Pro | 15,000+ inline emotion/prosody tags |
| 4 | Orpheus 3B | 8+ emotive tags, easy Ollama deployment |
Multilingual Applications
Goal: High quality across many languages.
| Languages | Best Model | Coverage |
|---|---|---|
| 2 (ZH+EN) | Spark-TTS | Purpose-built bilingual |
| 5-10 | Fun-CosyVoice 3 | 9 languages + 18 Chinese dialects |
| 10 | Qwen3-TTS-12Hz | 10 languages with voice cloning |
| 23 | Chatterbox Multilingual | 23 languages with emotion control |
| 80+ | Fish Audio S2 Pro | Broadest coverage (80+ languages) |
The 5-Minute Comparison
Top 10 models ranked by "deployability on DGX Spark"
This ranking prioritizes: (1) confirmed/likely Spark compatibility, (2) voice cloning support, (3) low latency, (4) permissive license.
| Rank | Model | Params | License | Voice Clone | Best RTF / TTFA | Spark Status | Overall Score |
|---|---|---|---|---|---|---|---|
| 1 | Qwen3-TTS-12Hz-0.6B | 0.6B | Apache 2.0 | Yes (3s) | ~97ms TTFA | Confirmed | 5/5 |
| 2 | Orpheus 3B (GGUF) | 3B | Apache 2.0 | Yes | ~200ms | llama.cpp path | 4.5/5 |
| 3 | TADA-1B | 1B | MIT | Yes | 0.09 | Raw PyTorch | 4/5 |
| 4 | Fun-CosyVoice 3 | 0.5B | Apache 2.0 | Yes | ~150ms TTFA | Similar to Qwen3 | 4/5 |
| 5 | VibeVoice-RT-0.5B | 0.88B | MIT | Presets only | 0.48 (Spark) | Confirmed | 3.5/5 |
| 6 | Chatterbox Turbo | 350M | MIT | Yes (5s) | Sub-200ms | PyTorch | 3.5/5 |
| 7 | OuteTTS 1.0 (GGUF) | 0.6B | Apache 2.0 | Yes (10s) | Competitive | llama.cpp path | 3/5 |
| 8 | NeuTTS Air (GGUF) | 748M | Apache 2.0 | Yes (3s) | CPU RT | llama.cpp path | 3/5 |
| 9 | Fish S2 Pro | 4.4B | Research License | Yes (10s) | 0.195 (H200) | Hard build | 2.5/5 |
| 10 | Magpie TTS 357M | 357M | NVIDIA License | Not in open weights | ~600ms (Spark) | Confirmed | 2.5/5 |
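The overall scores above can be approximated as a weighted sum over the four stated criteria. The weights below are purely illustrative assumptions, not the ones used to build the table:

```python
# Hypothetical weights reflecting the priority order stated above:
# Spark compatibility > voice cloning > low latency > permissive license.
WEIGHTS = {"spark": 0.4, "cloning": 0.3, "latency": 0.2, "license": 0.1}

def deployability_score(spark: float, cloning: float,
                        latency: float, license: float) -> float:
    """Weighted 0-5 score from four 0-5 sub-scores (weights are illustrative)."""
    raw = (WEIGHTS["spark"] * spark + WEIGHTS["cloning"] * cloning
           + WEIGHTS["latency"] * latency + WEIGHTS["license"] * license)
    return round(raw * 2) / 2  # round to the nearest half point, as in the table

# e.g. a model confirmed on Spark (5), with cloning (5), fast (5), Apache-2.0 (5)
print(deployability_score(5, 5, 5, 5))  # → 5.0
```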
Quickstart Guides
Quickstart 1: Qwen3-TTS on DGX Spark (Recommended Starting Point)
Quickstart 2: Orpheus 3B via llama.cpp (Most Portable)
Quickstart 3: TADA-1B (Fastest Raw Inference)
Quickstart 4: Chatterbox Turbo (Easiest Setup)
Quickstart 5: Kokoro-82M (Lightest Weight — No Cloning)
Voice Cloning Best Practices
Recording reference audio
The quality of your voice clone depends heavily on the reference audio. Follow these guidelines:
Duration:
- 3 seconds: Minimum viable. Captures basic timbre but misses speaking style
- 10-15 seconds: Sweet spot for most models. Captures timbre + rhythm + style
- 30 seconds: Diminishing returns for most models; some (Fish S2 Pro) benefit
- Over 60 seconds: Can actually degrade quality on some models (Qwen3-TTS)
Recording quality:
- Use a decent microphone (even a modern phone in a quiet room works)
- Record in WAV format, 16-bit, 24 kHz or higher, mono
- Minimize background noise (no music, TV, fans)
- Avoid room echo (record in a carpeted room, not a bathroom)
- Speak naturally — like a conversation, not a voiceover
Content:
- Natural, continuous speech with varied intonation
- Include some questions, statements, and exclamations for prosody range
- Avoid long pauses (> 2 seconds)
- Speech should occupy at least 60% of the total duration
- Provide the accurate transcript if the model requires it
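The 60% speech-occupancy guideline is easy to check programmatically. A rough sketch using frame-level RMS energy; the 50 ms frame size and 0.02 silence threshold are assumptions to tune for your recording chain:

```python
import math

def speech_ratio(samples, sample_rate, frame_ms=50, silence_thresh=0.02):
    """Fraction of frames whose RMS exceeds a silence threshold.

    `samples` are floats in [-1, 1]. The 0.02 threshold (~-34 dBFS)
    is an assumption - adjust it for your microphone and room.
    """
    frame = max(1, int(sample_rate * frame_ms / 1000))
    voiced = total = 0
    for i in range(0, len(samples) - frame + 1, frame):
        chunk = samples[i:i + frame]
        rms = math.sqrt(sum(s * s for s in chunk) / frame)
        total += 1
        if rms > silence_thresh:
            voiced += 1
    return voiced / total if total else 0.0

# Synthetic check: 1 s of tone followed by 1 s of silence -> ratio ~0.50,
# which would fall below the 60% guideline.
sr = 16000
tone = [0.3 * math.sin(2 * math.pi * 220 * t / sr) for t in range(sr)]
clip = tone + [0.0] * sr
print(f"speech ratio: {speech_ratio(clip, sr):.2f}")
```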
Post-processing:
- Trim silence from beginning and end (leave ~0.5s)
- Normalize to -18 dB LUFS average
- Remove any clicks, pops, or noise artifacts
- Do NOT apply compression, EQ, or effects — models work best with natural audio
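Silence trimming with the ~0.5 s pad can be sketched in a few lines. True LUFS normalization needs a dedicated tool (e.g. pyloudnorm or ffmpeg's loudnorm filter); this sketch handles only trimming, with an assumed amplitude threshold:

```python
def trim_silence(samples, sample_rate, thresh=0.01, pad_seconds=0.5):
    """Trim leading/trailing silence, keeping ~pad_seconds on each side.

    Operates on float samples in [-1, 1]. The 0.01 amplitude threshold
    is an assumption; tune it to your noise floor.
    """
    pad = int(pad_seconds * sample_rate)
    start = next((i for i, s in enumerate(samples) if abs(s) > thresh), 0)
    end = next((len(samples) - i for i, s in enumerate(reversed(samples))
                if abs(s) > thresh), len(samples))
    return samples[max(0, start - pad):min(len(samples), end + pad)]

# 2 s silence + 1 s signal + 2 s silence at 16 kHz
sr = 16000
clip = [0.0] * (2 * sr) + [0.5] * sr + [0.0] * (2 * sr)
trimmed = trim_silence(clip, sr)
print(len(trimmed) / sr)  # 1 s of signal + 0.5 s pad on each side = 2.0 s
```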
Per-model cloning tips
| Model | Tip |
|---|---|
| Qwen3-TTS | Append 0.5s silence to end of reference. Quality plateaus around 15s then degrades |
| Fish S2 Pro | SGLang prefix caching makes repeated use of same voice nearly free. 10-30s reference |
| NeuTTS Air | 85-90% similarity at 3s; 95%+ at 15s. Same-language reference strongly recommended |
| Orpheus | Built-in voices only in released model. Fine-tuning required for true custom voices |
| Chatterbox | Clean recording quality matters more than duration. cfg_weight ~0.3 for natural rhythm |
| OuteTTS | Speaker profiles save as reusable JSON files — pre-compute for production |
Troubleshooting Common Issues
DGX Spark Specific
Problem: "Torch not compiled with CUDA enabled" or "no kernel image is available"

```shell
# Solution: Install PyTorch from the cu130 index
pip install torch==2.9.1 torchaudio==2.9.1 \
  --force-reinstall --index-url https://download.pytorch.org/whl/cu130
```

Problem: White noise after ~2 seconds ("Blackwell Noise")

This affects xTTS-v2, F5-TTS, Kokoro (GPU), and others. It's a numerical instability in FP16/BF16 on sm_121 Tensor Cores.

```python
# Workaround 1: Force CPU inference
device = "cpu"  # in your Python code

# Workaround 2: Use a different model that's confirmed working on Spark
# (Qwen3-TTS, VibeVoice-RT, Magpie)
```

Problem: torchaudio fails to install on aarch64

```shell
# Solution 1: Use the cu130 index
pip install torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu130

# Solution 2: Replace torchaudio with soundfile
pip install soundfile
# Then use: import soundfile as sf; sf.write("out.wav", audio, sr)
```

Problem: FlashAttention build or runtime errors

FA3 is incompatible with Blackwell. Use FA2 or rely on the SDPA fallback.

```shell
pip install flash-attn==2.* --no-build-isolation
```

Problem: vLLM crashes with CUDA errors

```shell
# Solution: Force eager mode
vllm serve model_name --enforce-eager --trust-remote-code
```

Problem: SGLang sgl-kernel import errors

```shell
# Solution: Build from source for sm_121a
cd sglang/sgl-kernel
TORCH_CUDA_ARCH_LIST="12.1a" MAX_JOBS=4 CMAKE_BUILD_PARALLEL_LEVEL=1 \
  python -m build --wheel --no-isolation
pip install --no-deps dist/sgl_kernel*.whl
```

General TTS Issues
Problem: Generated speech sounds robotic or choppy
- Increase reference audio duration (if using voice cloning)
- Reduce speaking rate (lower the rate parameter)
- Try a different model — quality varies significantly between models
- Ensure you're running on GPU (CPU inference can introduce quality artifacts on some models)
Problem: Words are skipped or hallucinated
- Try TADA (zero hallucinations by design)
- Reduce temperature parameter
- Split long text into shorter segments
- Some models hallucinate more at the end of long sequences — limit output length
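Splitting long text at sentence boundaries is the easiest of these remedies to automate. A minimal sketch; the 300-character limit is an arbitrary assumption to tune per model:

```python
import re

def split_for_tts(text: str, max_chars: int = 300):
    """Split text into chunks under max_chars, breaking at sentence ends.

    Shorter inputs reduce skipped words and end-of-sequence hallucinations
    on models that drift over long generations. max_chars is arbitrary -
    tune it per model.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

# Demo: 20 short sentences packed into ~100-character chunks
text = "First sentence. " * 20
for chunk in split_for_tts(text, max_chars=100):
    print(len(chunk), chunk[:40])
```

Synthesize each chunk separately and concatenate the audio, keeping the same reference voice for every chunk.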
Problem: Voice clone doesn't sound like the reference
- Ensure reference audio is clean (no background noise, echo)
- Check that reference transcript matches the audio exactly
- Try a longer reference clip (10-15 seconds)
- Some models work better with same-language references
- Reduce cfg_weight or classifier-free guidance scale
Problem: High latency / slow generation
- Check RTF — if > 1.0, generation is slower than real-time
- Enable streaming output to reduce perceived latency
- Use smaller model variant (0.6B instead of 1.7B)
- Apply quantization (INT4 or GGUF Q4)
- Ensure the GPU is being used (check nvidia-smi)
- Pre-warm the model on startup (run a dummy generation)
This is a companion document to "The Complete Offline TTS Guide (2026 Edition)." Return to the main guide for comprehensive model comparisons and deployment configurations.
Related Guides
Complete Offline TTS Guide 2026
Comprehensive guide to offline text-to-speech: 30+ models, optimization techniques, deployment configurations, and DGX Spark compatibility.
TTS Model Catalog & Optimization
Complete catalog of 30+ offline TTS models with DGX Spark compatibility, optimization techniques, voice pipeline architecture, and deployment configurations.
Hardware Comparison Guide
Side-by-side comparison of 8 hardware platforms for TTS deployment: DGX Spark, Jetson Orin, Mac Studio, consumer GPUs, and cloud instances.