
TTS Decision Guide

Choose the right TTS model for your use case. Includes decision flowcharts, ranked recommendations by scenario, and step-by-step quickstart guides for the top models.

Decision Flowchart

Which TTS Should I Use?

DGX Spark Picks (confirmed or high-confidence)

| Priority | Model | Why | Status |
|---|---|---|---|
| 1st to test | Qwen3-TTS-12Hz-0.6B | 97ms TTFA, 12.5 Hz codec = fewest tokens/sec, confirmed on Spark via vLLM-Omni | Confirmed |
| 2nd to test | VibeVoice-Realtime-0.5B | Confirmed 0.48× RTF on Spark, streaming, 7.5 Hz codec | Confirmed |
| 3rd to test | Orpheus 3B (GGUF Q4) | llama.cpp has excellent Spark support; broadest ecosystem | High confidence |
| 4th to test | TADA-1B | Raw PyTorch + Llama backbone; ultra-low 2-3 Hz codec rate | High confidence |
| 5th to test | Fun-CosyVoice 3 | Similar architecture to the working Qwen3-TTS; TRT-LLM path | High confidence |
| Fallback | NVIDIA Magpie 357M | Confirmed on Spark but no voice cloning in open weights | Confirmed (no cloning) |

Preset Voice Models (no cloning needed)

| Situation | Model | Why |
|---|---|---|
| Absolute smallest footprint | Kokoro-82M | 50 MB ONNX, runs in browser, #17 on TTS Arena |
| CPU-only / embedded | Piper | aarch64 binaries, hundreds of voices, ONNX |
| Need multiple accents | MeloTTS | 6 languages, 4 English accents, MIT |
| Tiniest possible | KittenTTS Nano | 25 MB, 15M params, INT8 available |

Edge Picks (CPU / Raspberry Pi / Mobile)

| Situation | Model | Why |
|---|---|---|
| Voice cloning on CPU | NeuTTS Air (Q4 GGUF) | 500 MB, real-time on mid-range CPU, Apache 2.0 |
| High quality without cloning | Kokoro-82M (ONNX) | Runs on Pi 4, browser, phone |
| Maximum language coverage | sherpa-onnx + Piper voices | Pre-built ARM64 binaries, 30+ languages |
| Accessibility / screen reader | eSpeak-NG | <5 MB, instant, 100+ languages |

Recommendations by Use Case

Voice Assistant / Conversational AI

Goal: Sub-1-second voice-to-voice response with natural conversation flow.

| Rank | Model | Why | Est. Pipeline Latency |
|---|---|---|---|
| 1 | Qwen3-TTS-12Hz-0.6B | 97ms TTFA, streaming, 10 languages, Apache 2.0 | ~700-900ms |
| 2 | Fun-CosyVoice 3 | 150ms TTFA, bidirectional streaming, emotion control | ~700-1000ms |
| 3 | TADA-1B | 0.09 RTF, zero hallucinations, MIT | ~800-1200ms (no streaming yet) |
| 4 | Chatterbox Turbo | Sub-200ms, paralinguistic tags, MIT | ~800-1100ms |

Pipeline Tip

Use Pipecat with sentence-level aggregation. Add filler words ("umm," "so") via LLM prompting to mask TTS latency.
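Sentence-level aggregation can be sketched framework-agnostically. This is an illustrative implementation, not Pipecat's API (Pipecat ships its own aggregators); `token_stream` stands in for any streaming LLM output:

```python
import re

def aggregate_sentences(token_stream):
    """Buffer streamed LLM text and yield complete sentences for TTS.

    Yields a chunk as soon as sentence-ending punctuation followed by
    whitespace appears, so the TTS engine can start speaking before
    the LLM finishes generating.
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        # Split on ., !, or ? followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        # Everything except the last part is a complete sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()

# Example: tokens as they might arrive from a streaming LLM.
tokens = ["Sure", ", I", " can help", ". What", " time", " works", "?"]
print(list(aggregate_sentences(tokens)))
# → ['Sure, I can help.', 'What time works?']
```

The first sentence is dispatched to TTS while the LLM is still producing the second, which is where most of the perceived latency reduction comes from.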

Audiobook / Long-Form Narration

Goal: Highest quality, consistent voice over 10+ minutes, emotion range.

| Rank | Model | Why |
|---|---|---|
| 1 | Fish Audio S2 Pro | SOTA quality, 80+ languages, 15,000+ emotion tags, voice cloning |
| 2 | Higgs Audio V2 | Best emotion benchmarks, multi-speaker dialogue |
| 3 | Qwen3-TTS-12Hz-1.7B | High quality, 10 languages, long-form stable |
| 4 | TADA-3B-ML | Zero hallucinations, 700s context, 9 languages |

Audiobook Tip

Generation doesn't need to run in real time for audiobooks (RTF above 1.0 is acceptable). Prioritize quality over speed.

Voice Cloning for Persona Preservation

Goal: Capture a specific person's voice from a recording for ongoing use.

| Rank | Model | Ref Audio Needed | Clone Quality |
|---|---|---|---|
| 1 | Fish Audio S2 Pro | 10-30s | Excellent (SOTA speaker sim) |
| 2 | Qwen3-TTS-12Hz-1.7B | 3-15s | Excellent (0.789 avg SIM) |
| 3 | Fun-CosyVoice 3 | Few seconds | Very good |
| 4 | NeuTTS Air | 3-15s | Good (85-95% depending on length) |
| 5 | Chatterbox | 5-10s | Good (emotion preservation) |

Game / Interactive Media

Goal: Multiple distinct characters, emotional range, real-time.

| Rank | Model | Why |
|---|---|---|
| 1 | Dia2 | Multi-speaker ([S1]/[S2] tags), nonverbal sounds, streaming |
| 2 | Chatterbox Turbo | Paralinguistic tags ([laugh], [cough]), emotion control, MIT |
| 3 | Fish Audio S2 Pro | 15,000+ inline emotion/prosody tags |
| 4 | Orpheus 3B | 8+ emotive tags, easy Ollama deployment |

Multilingual Applications

Goal: High quality across many languages.

| Languages | Best Model | Coverage |
|---|---|---|
| 2 (ZH+EN) | Spark-TTS | Purpose-built bilingual |
| 5-10 | Fun-CosyVoice 3 | 9 languages + 18 Chinese dialects |
| 10 | Qwen3-TTS-12Hz | 10 languages with voice cloning |
| 23 | Chatterbox Multilingual | 23 languages with emotion control |
| 80+ | Fish Audio S2 Pro | Broadest coverage (80+ languages) |

The 5-Minute Comparison

Top 10 models ranked by "deployability on DGX Spark"

This ranking prioritizes: (1) confirmed/likely Spark compatibility, (2) voice cloning support, (3) low latency, (4) permissive license.

| Rank | Model | Params | License | Voice Clone | Best RTF / TTFA | Spark Status | Overall Score |
|---|---|---|---|---|---|---|---|
| 1 | Qwen3-TTS-12Hz-0.6B | 0.6B | Apache 2.0 | Yes (3s) | ~97ms TTFA | Confirmed | 5/5 |
| 2 | Orpheus 3B (GGUF) | 3B | Apache 2.0 | Yes | ~200ms | llama.cpp path | 4.5/5 |
| 3 | TADA-1B | 1B | MIT | Yes | 0.09 | Raw PyTorch | 4/5 |
| 4 | Fun-CosyVoice 3 | 0.5B | Apache 2.0 | Yes | ~150ms TTFA | Similar to Qwen3 | 4/5 |
| 5 | VibeVoice-RT-0.5B | 0.88B | MIT | Presets only | 0.48 (Spark) | Confirmed | 3.5/5 |
| 6 | Chatterbox Turbo | 350M | MIT | Yes (5s) | Sub-200ms | PyTorch | 3.5/5 |
| 7 | OuteTTS 1.0 (GGUF) | 0.6B | Apache 2.0 | Yes (10s) | Competitive | llama.cpp path | 3/5 |
| 8 | NeuTTS Air (GGUF) | 748M | Apache 2.0 | Yes (3s) | CPU RT | llama.cpp path | 3/5 |
| 9 | Fish S2 Pro | 4.4B | Research License | Yes (10s) | 0.195 (H200) | Hard build | 2.5/5 |
| 10 | Magpie TTS 357M | 357M | NVIDIA License | No (open weights) | ~600ms (Spark) | Confirmed | 2.5/5 |

Quickstart Guides

Quickstart 2: Orpheus 3B via llama.cpp (Most Portable)

Quickstart 3: TADA-1B (Fastest Raw Inference)

Quickstart 4: Chatterbox Turbo (Easiest Setup)

Quickstart 5: Kokoro-82M (Lightest Weight — No Cloning)


Voice Cloning Best Practices

Recording reference audio

The quality of your voice clone depends heavily on the reference audio. Follow these guidelines:

Duration:

  • 3 seconds: Minimum viable. Captures basic timbre but misses speaking style
  • 10-15 seconds: Sweet spot for most models. Captures timbre + rhythm + style
  • 30 seconds: Diminishing returns for most models; some (Fish S2 Pro) benefit
  • Over 60 seconds: Can actually degrade quality on some models (Qwen3-TTS)

Recording quality:

  • Use a decent microphone (even a modern phone in a quiet room works)
  • Record in WAV format, 16-bit, 24 kHz or higher, mono
  • Minimize background noise (no music, TV, fans)
  • Avoid room echo (record in a carpeted room, not a bathroom)
  • Speak naturally — like a conversation, not a voiceover
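The format requirements above can be sanity-checked with Python's standard-library `wave` module. A minimal sketch; the function name is illustrative:

```python
import wave

def check_reference_wav(path):
    """Verify a reference clip meets common requirements:
    mono, 16-bit PCM, and at least 24 kHz sample rate.
    Returns a list of human-readable issues (empty = OK)."""
    issues = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            issues.append(f"expected mono, got {w.getnchannels()} channels")
        if w.getsampwidth() != 2:  # 2 bytes per sample = 16-bit PCM
            issues.append(f"expected 16-bit, got {w.getsampwidth() * 8}-bit")
        if w.getframerate() < 24000:
            issues.append(f"sample rate {w.getframerate()} Hz is below 24 kHz")
    return issues
```

Run it on each reference clip before cloning; an empty list means the file passes all three checks.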

Content:

  • Natural, continuous speech with varied intonation
  • Include some questions, statements, and exclamations for prosody range
  • Avoid long pauses (> 2 seconds)
  • Speech should occupy at least 60% of the total duration
  • Provide the accurate transcript if the model requires it
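The 60% speech-occupancy guideline can be estimated mechanically. A rough sketch using an RMS energy gate as a stand-in for a real voice-activity detector; the -40 dBFS threshold and 1024-sample frame are illustrative assumptions, not model requirements:

```python
import numpy as np

def speech_ratio(audio, frame_len=1024, threshold_db=-40.0):
    """Estimate the fraction of frames containing speech.

    Frames whose RMS level (in dBFS, for float audio in [-1, 1])
    exceeds `threshold_db` are counted as speech. A crude proxy
    for a proper VAD, but enough to flag mostly-silent clips.
    """
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    db = 20 * np.log10(np.maximum(rms, 1e-10))  # floor avoids log(0)
    return float(np.mean(db > threshold_db))
```

A clip returning well under 0.6 likely has too much dead air and should be re-recorded or trimmed.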

Post-processing:

  • Trim silence from beginning and end (leave ~0.5s)
  • Normalize to -18 LUFS (integrated loudness)
  • Remove any clicks, pops, or noise artifacts
  • Do NOT apply compression, EQ, or effects — models work best with natural audio
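The trimming and leveling steps above can be sketched in a few lines of NumPy. True LUFS normalization requires K-weighting filters (e.g. the pyloudnorm library); the RMS-based version here is only an approximation, and the amplitude threshold is an illustrative choice:

```python
import numpy as np

def trim_silence(audio, sr, threshold=1e-3, pad_s=0.5):
    """Trim leading/trailing silence, leaving ~0.5s of padding.
    `threshold` is a simple amplitude gate on float audio in [-1, 1]."""
    voiced = np.nonzero(np.abs(audio) > threshold)[0]
    if len(voiced) == 0:
        return audio  # nothing above the gate; leave untouched
    pad = int(pad_s * sr)
    start = max(voiced[0] - pad, 0)
    end = min(voiced[-1] + pad, len(audio))
    return audio[start:end]

def normalize_rms(audio, target_db=-18.0):
    """Scale to a target RMS level in dBFS. Approximates, but is not
    identical to, LUFS loudness normalization."""
    rms = np.sqrt(np.mean(audio ** 2))
    gain = 10 ** (target_db / 20) / max(rms, 1e-10)
    return audio * gain
```

Apply trimming before normalization so long silent tails don't drag the measured level down.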

Per-model cloning tips

| Model | Tip |
|---|---|
| Qwen3-TTS | Append 0.5s of silence to the end of the reference. Quality plateaus around 15s, then degrades |
| Fish S2 Pro | SGLang prefix caching makes repeated use of the same voice nearly free. 10-30s reference |
| NeuTTS Air | 85-90% similarity at 3s; 95%+ at 15s. Same-language reference strongly recommended |
| Orpheus | Built-in voices only in the released model. Fine-tuning required for true custom voices |
| Chatterbox | Clean recording quality matters more than duration. cfg_weight ~0.3 for natural rhythm |
| OuteTTS | Speaker profiles save as reusable JSON files; pre-compute them for production |

Troubleshooting Common Issues

DGX Spark Specific

Problem: "Torch not compiled with CUDA enabled" or "no kernel image is available"

# Solution: Install PyTorch from the cu130 index
pip install torch==2.9.1 torchaudio==2.9.1 \
  --force-reinstall --index-url https://download.pytorch.org/whl/cu130

Problem: White noise after ~2 seconds ("Blackwell Noise")

This affects xTTS-v2, F5-TTS, Kokoro (GPU), and others. It is a numerical instability in FP16/BF16 on sm_121 Tensor Cores.

# Workaround 1: Force CPU inference
device = "cpu"  # in your Python code
 
# Workaround 2: Use a different model that's confirmed working on Spark
# (Qwen3-TTS, VibeVoice-RT, Magpie)

Problem: torchaudio fails to install on aarch64

# Solution 1: Use the cu130 index
pip install torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu130
 
# Solution 2: Replace torchaudio with soundfile
pip install soundfile
# Then use: import soundfile as sf; sf.write("out.wav", audio, sr)

Problem: Flash Attention 3 (FA3) incompatibility

FA3 is incompatible with Blackwell. Use FA2 or rely on the SDPA fallback.

# Solution: Install Flash Attention 2 instead
pip install flash-attn==2.* --no-build-isolation

Problem: vLLM crashes with CUDA errors

# Solution: Force eager mode
vllm serve model_name --enforce-eager --trust-remote-code

Problem: SGLang sgl-kernel import errors

# Solution: Build from source for sm_121a
cd sglang/sgl-kernel
TORCH_CUDA_ARCH_LIST="12.1a" MAX_JOBS=4 CMAKE_BUILD_PARALLEL_LEVEL=1 \
  python -m build --wheel --no-isolation
pip install --no-deps dist/sgl_kernel*.whl

General TTS Issues

Problem: Generated speech sounds robotic or choppy

  • Increase reference audio duration (if using voice cloning)
  • Reduce speaking rate (lower rate parameter)
  • Try a different model — quality varies significantly between models
  • Ensure you're running on GPU (CPU inference can introduce quality artifacts on some models)

Problem: Words are skipped or hallucinated

  • Try TADA (zero hallucinations by design)
  • Reduce temperature parameter
  • Split long text into shorter segments
  • Some models hallucinate more at the end of long sequences — limit output length
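Splitting long text can be as simple as grouping sentences up to a character budget. A hypothetical `chunk_text` helper; the 300-character default is an illustrative choice, not a documented model limit:

```python
import re

def chunk_text(text, max_chars=300):
    """Split text into TTS-sized chunks at sentence boundaries,
    keeping each chunk under max_chars where possible. A single
    sentence longer than max_chars is kept whole."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Synthesize each chunk separately (optionally with crossfaded joins) to keep the model well inside the sequence lengths where it stays stable.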

Problem: Voice clone doesn't sound like the reference

  • Ensure reference audio is clean (no background noise, echo)
  • Check that reference transcript matches the audio exactly
  • Try a longer reference clip (10-15 seconds)
  • Some models work better with same-language references
  • Reduce cfg_weight or classifier-free guidance scale

Problem: High latency / slow generation

  • Check RTF — if > 1.0, generation is slower than real-time
  • Enable streaming output to reduce perceived latency
  • Use smaller model variant (0.6B instead of 1.7B)
  • Apply quantization (INT4 or GGUF Q4)
  • Ensure GPU is being used (check nvidia-smi)
  • Pre-warm model on startup (run a dummy generation)
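Checking RTF is straightforward to script. A sketch, where `tts_generate` stands in for whatever synthesis call your model exposes; the same call doubles as a startup pre-warm:

```python
import time

def measure_rtf(tts_generate, text, sample_rate):
    """Time one generation and return the real-time factor:
    wall-clock seconds per second of audio produced.
    RTF < 1.0 means generation is faster than real time."""
    start = time.perf_counter()
    audio = tts_generate(text)  # returns a sequence of samples
    elapsed = time.perf_counter() - start
    duration = len(audio) / sample_rate
    return elapsed / duration

# Calling this once at startup also serves as the pre-warm dummy
# generation, so the first real request doesn't pay JIT/cache costs.
fake_tts = lambda text: [0.0] * 24000  # dummy model: 1s of silence
rtf = measure_rtf(fake_tts, "warm-up sentence", 24000)
```

Note the first measurement after process start is usually the slowest; measure again after the warm-up call to get a representative number.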

This is a companion document to "The Complete Offline TTS Guide (2026 Edition)." Return to the main guide for comprehensive model comparisons and deployment configurations.
