
TTS Decision Guide

Choose the right TTS model for your use case. Includes decision flowcharts, ranked recommendations by scenario, and step-by-step quickstart guides for the top models.

Decision Flowchart

Which TTS Should I Use?

DGX Spark Picks (confirmed or high-confidence)

| Priority | Model | Why | Status |
|---|---|---|---|
| 1st to test | Qwen3-TTS-12Hz-0.6B | 97ms TTFA, 12.5 Hz codec = fewest tokens/sec, confirmed on Spark via vLLM-Omni | Confirmed |
| 2nd to test | VibeVoice-Realtime-0.5B | Confirmed 0.48× RTF on Spark, streaming, 7.5 Hz codec | Confirmed |
| 3rd to test | Orpheus 3B (GGUF Q4) | llama.cpp has excellent Spark support; broadest ecosystem | High confidence |
| 4th to test | TADA-1B | Raw PyTorch + Llama backbone; ultra-low 2-3 Hz codec rate | High confidence |
| 5th to test | Fun-CosyVoice 3 | Similar architecture to the working Qwen3-TTS; TRT-LLM path | High confidence |
| Fallback | NVIDIA Magpie 357M | Confirmed on Spark but no voice cloning in open weights | Confirmed (no cloning) |

Preset Voice Models (no cloning needed)

| Situation | Model | Why |
|---|---|---|
| Absolute smallest footprint | Kokoro-82M | 50 MB ONNX, runs in browser, #17 on TTS Arena |
| CPU-only / embedded | Piper | aarch64 binaries, hundreds of voices, ONNX |
| Need multiple accents | MeloTTS | 6 languages, 4 English accents, MIT |
| Tiniest possible | KittenTTS Nano | 25 MB, 15M params, INT8 available |

Edge Picks (CPU / Raspberry Pi / Mobile)

| Situation | Model | Why |
|---|---|---|
| Voice cloning on CPU | NeuTTS Air (Q4 GGUF) | 500 MB, real-time on mid-range CPU, Apache 2.0 |
| High quality without cloning | Kokoro-82M (ONNX) | Runs on Pi 4, browser, phone |
| Maximum language coverage | sherpa-onnx + Piper voices | Pre-built ARM64 binaries, 30+ languages |
| Accessibility / screen reader | eSpeak-NG | <5 MB, instant, 100+ languages |

Recommendations by Use Case

Voice Assistant / Conversational AI

Goal: Sub-1-second voice-to-voice response with natural conversation flow.

| Rank | Model | Why | Est. Pipeline Latency |
|---|---|---|---|
| 1 | Qwen3-TTS-12Hz-0.6B | 97ms TTFA, streaming, 10 languages, Apache 2.0 | ~700-900ms |
| 2 | Fun-CosyVoice 3 | 150ms TTFA, bidirectional streaming, emotion control | ~700-1000ms |
| 3 | TADA-1B | 0.09 RTF, zero hallucinations, MIT | ~800-1200ms (no streaming yet) |
| 4 | Chatterbox Turbo | Sub-200ms, paralinguistic tags, MIT | ~800-1100ms |

Pipeline Tip

Use Pipecat with sentence-level aggregation. Add filler words ("umm," "so") via LLM prompting to mask TTS latency.
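Sentence-level aggregation can be sketched framework-agnostically. This is an illustrative implementation, not Pipecat's API (Pipecat ships its own aggregators); `token_stream` stands in for any streaming LLM output:

```python
import re

def aggregate_sentences(token_stream):
    """Buffer streamed LLM text and yield complete sentences for TTS.

    Yields a chunk as soon as sentence-ending punctuation followed by
    whitespace appears, so the TTS engine can start speaking before
    the LLM finishes generating.
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        # Split on ., !, or ? followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        # Everything except the last part is a complete sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()

# Example: tokens as they might arrive from a streaming LLM.
tokens = ["Sure", ", I", " can help", ". What", " time", " works", "?"]
print(list(aggregate_sentences(tokens)))
# → ['Sure, I can help.', 'What time works?']
```

The first sentence is dispatched to TTS while the LLM is still producing the second, which is where most of the perceived latency reduction comes from.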

Audiobook / Long-Form Narration

Goal: Highest quality, consistent voice over 10+ minutes, emotion range.

| Rank | Model | Why |
|---|---|---|
| 1 | Fish Audio S2 Pro | SOTA quality, 80+ languages, 15,000+ emotion tags, voice cloning |
| 2 | Higgs Audio V2 | Best emotion benchmarks, multi-speaker dialogue |
| 3 | Qwen3-TTS-12Hz-1.7B | High quality, 10 languages, long-form stable |
| 4 | TADA-3B-ML | Zero hallucinations, 700s context, 9 languages |

Audiobook Tip

Generation doesn't need to run in real time for audiobooks (RTF above 1.0 is acceptable). Prioritize quality over speed.

Voice Cloning for Persona Preservation

Goal: Capture a specific person's voice from a recording for ongoing use.

| Rank | Model | Ref Audio Needed | Clone Quality |
|---|---|---|---|
| 1 | Fish Audio S2 Pro | 10-30s | Excellent (SOTA speaker sim) |
| 2 | Qwen3-TTS-12Hz-1.7B | 3-15s | Excellent (0.789 avg SIM) |
| 3 | Fun-CosyVoice 3 | Few seconds | Very good |
| 4 | NeuTTS Air | 3-15s | Good (85-95% depending on length) |
| 5 | Chatterbox | 5-10s | Good (emotion preservation) |

Game / Interactive Media

Goal: Multiple distinct characters, emotional range, real-time.

| Rank | Model | Why |
|---|---|---|
| 1 | Dia2 | Multi-speaker ([S1]/[S2] tags), nonverbal sounds, streaming |
| 2 | Chatterbox Turbo | Paralinguistic tags ([laugh], [cough]), emotion control, MIT |
| 3 | Fish Audio S2 Pro | 15,000+ inline emotion/prosody tags |
| 4 | Orpheus 3B | 8+ emotive tags, easy Ollama deployment |

Multilingual Applications

Goal: High quality across many languages.

| Languages | Best Model | Coverage |
|---|---|---|
| 2 (ZH+EN) | Spark-TTS | Purpose-built bilingual |
| 5-10 | Fun-CosyVoice 3 | 9 languages + 18 Chinese dialects |
| 10 | Qwen3-TTS-12Hz | 10 languages with voice cloning |
| 23 | Chatterbox Multilingual | 23 languages with emotion control |
| 80+ | Fish Audio S2 Pro | Broadest coverage (80+ languages) |

The 5-Minute Comparison

Top 10 models ranked by "deployability on DGX Spark"

This ranking prioritizes: (1) confirmed/likely Spark compatibility, (2) voice cloning support, (3) low latency, (4) permissive license.

| Rank | Model | Params | License | Voice Clone | Best RTF / TTFA | Spark Status | Overall Score |
|---|---|---|---|---|---|---|---|
| 1 | Qwen3-TTS-12Hz-0.6B | 0.6B | Apache 2.0 | Yes (3s) | ~97ms TTFA | Confirmed | 5/5 |
| 2 | Orpheus 3B (GGUF) | 3B | Apache 2.0 | Yes | ~200ms | llama.cpp path | 4.5/5 |
| 3 | TADA-1B | 1B | MIT | Yes | 0.09 | Raw PyTorch | 4/5 |
| 4 | Fun-CosyVoice 3 | 0.5B | Apache 2.0 | Yes | ~150ms TTFA | Similar to Qwen3 | 4/5 |
| 5 | VibeVoice-RT-0.5B | 0.88B | MIT | Presets only | 0.48 (Spark) | Confirmed | 3.5/5 |
| 6 | Chatterbox Turbo | 350M | MIT | Yes (5s) | Sub-200ms | PyTorch | 3.5/5 |
| 7 | OuteTTS 1.0 (GGUF) | 0.6B | Apache 2.0 | Yes (10s) | Competitive | llama.cpp path | 3/5 |
| 8 | NeuTTS Air (GGUF) | 748M | Apache 2.0 | Yes (3s) | CPU RT | llama.cpp path | 3/5 |
| 9 | Fish S2 Pro | 4.4B | Research License | Yes (10s) | 0.195 (H200) | Hard build | 2.5/5 |
| 10 | Magpie TTS 357M | 357M | NVIDIA License | No (open weights) | ~600ms (Spark) | Confirmed | 2.5/5 |

Quickstart Guides

Quickstart 2: Orpheus 3B via llama.cpp (Most Portable)

Quickstart 3: TADA-1B (Fastest Raw Inference)

Quickstart 4: Chatterbox Turbo (Easiest Setup)

Quickstart 5: Kokoro-82M (Lightest Weight — No Cloning)


Voice Cloning Best Practices

Recording reference audio

The quality of your voice clone depends heavily on the reference audio. Follow these guidelines:

Duration:

  • 3 seconds: Minimum viable. Captures basic timbre but misses speaking style
  • 10-15 seconds: Sweet spot for most models. Captures timbre + rhythm + style
  • 30 seconds: Diminishing returns for most models; some (Fish S2 Pro) benefit
  • Over 60 seconds: Can actually degrade quality on some models (Qwen3-TTS)

Recording quality:

  • Use a decent microphone (even a modern phone in a quiet room works)
  • Record in WAV format, 16-bit, 24 kHz or higher, mono
  • Minimize background noise (no music, TV, fans)
  • Avoid room echo (record in a carpeted room, not a bathroom)
  • Speak naturally — like a conversation, not a voiceover
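The format requirements above can be sanity-checked with Python's standard-library `wave` module. A minimal sketch; the function name is illustrative:

```python
import wave

def check_reference_wav(path):
    """Verify a reference clip meets common requirements:
    mono, 16-bit PCM, and at least 24 kHz sample rate.
    Returns a list of human-readable issues (empty = OK)."""
    issues = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            issues.append(f"expected mono, got {w.getnchannels()} channels")
        if w.getsampwidth() != 2:  # 2 bytes per sample = 16-bit PCM
            issues.append(f"expected 16-bit, got {w.getsampwidth() * 8}-bit")
        if w.getframerate() < 24000:
            issues.append(f"sample rate {w.getframerate()} Hz is below 24 kHz")
    return issues
```

Run it on each reference clip before cloning; an empty list means the file passes all three checks.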

Content:

  • Natural, continuous speech with varied intonation
  • Include some questions, statements, and exclamations for prosody range
  • Avoid long pauses (> 2 seconds)
  • Speech should occupy at least 60% of the total duration
  • Provide the accurate transcript if the model requires it
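The 60% speech-occupancy guideline can be estimated mechanically. A rough sketch using an RMS energy gate as a stand-in for a real voice-activity detector; the -40 dBFS threshold and 1024-sample frame are illustrative assumptions, not model requirements:

```python
import numpy as np

def speech_ratio(audio, frame_len=1024, threshold_db=-40.0):
    """Estimate the fraction of frames containing speech.

    Frames whose RMS level (in dBFS, for float audio in [-1, 1])
    exceeds `threshold_db` are counted as speech. A crude proxy
    for a proper VAD, but enough to flag mostly-silent clips.
    """
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    db = 20 * np.log10(np.maximum(rms, 1e-10))  # floor avoids log(0)
    return float(np.mean(db > threshold_db))
```

A clip returning well under 0.6 likely has too much dead air and should be re-recorded or trimmed.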

Post-processing:

  • Trim silence from beginning and end (leave ~0.5s)
  • Normalize to -18 LUFS (integrated loudness)
  • Remove any clicks, pops, or noise artifacts
  • Do NOT apply compression, EQ, or effects — models work best with natural audio
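The trimming and leveling steps above can be sketched in a few lines of NumPy. True LUFS normalization requires K-weighting filters (e.g. the pyloudnorm library); the RMS-based version here is only an approximation, and the amplitude threshold is an illustrative choice:

```python
import numpy as np

def trim_silence(audio, sr, threshold=1e-3, pad_s=0.5):
    """Trim leading/trailing silence, leaving ~0.5s of padding.
    `threshold` is a simple amplitude gate on float audio in [-1, 1]."""
    voiced = np.nonzero(np.abs(audio) > threshold)[0]
    if len(voiced) == 0:
        return audio  # nothing above the gate; leave untouched
    pad = int(pad_s * sr)
    start = max(voiced[0] - pad, 0)
    end = min(voiced[-1] + pad, len(audio))
    return audio[start:end]

def normalize_rms(audio, target_db=-18.0):
    """Scale to a target RMS level in dBFS. Approximates, but is not
    identical to, LUFS loudness normalization."""
    rms = np.sqrt(np.mean(audio ** 2))
    gain = 10 ** (target_db / 20) / max(rms, 1e-10)
    return audio * gain
```

Apply trimming before normalization so long silent tails don't drag the measured level down.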

Per-model cloning tips

| Model | Tip |
|---|---|
| Qwen3-TTS | Append 0.5s of silence to the end of the reference. Quality plateaus around 15s, then degrades |
| Fish S2 Pro | SGLang prefix caching makes repeated use of the same voice nearly free. 10-30s reference |
| NeuTTS Air | 85-90% similarity at 3s; 95%+ at 15s. Same-language reference strongly recommended |
| Orpheus | Built-in voices only in the released model. Fine-tuning required for true custom voices |
| Chatterbox | Clean recording quality matters more than duration. cfg_weight ~0.3 for natural rhythm |
| OuteTTS | Speaker profiles save as reusable JSON files; pre-compute them for production |

Troubleshooting Common Issues

DGX Spark Specific

Problem: "Torch not compiled with CUDA enabled" or "no kernel image is available"

# Solution: Install PyTorch from the cu130 index
pip install torch==2.9.1 torchaudio==2.9.1 \
  --force-reinstall --index-url https://download.pytorch.org/whl/cu130

Problem: White noise after ~2 seconds ("Blackwell Noise")

This affects xTTS-v2, F5-TTS, Kokoro (GPU), and others. It is a numerical instability in FP16/BF16 on sm_121 Tensor Cores.

# Workaround 1: Force CPU inference
device = "cpu"  # in your Python code
 
# Workaround 2: Use a different model that's confirmed working on Spark
# (Qwen3-TTS, VibeVoice-RT, Magpie)

Problem: torchaudio fails to install on aarch64

# Solution 1: Use the cu130 index
pip install torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu130
 
# Solution 2: Replace torchaudio with soundfile
pip install soundfile
# Then use: import soundfile as sf; sf.write("out.wav", audio, sr)

Problem: Flash Attention 3 (FA3) incompatibility

FA3 is incompatible with Blackwell. Use FA2 or rely on the SDPA fallback.

# Solution: Install Flash Attention 2 instead
pip install flash-attn==2.* --no-build-isolation

Problem: vLLM crashes with CUDA errors

# Solution: Force eager mode
vllm serve model_name --enforce-eager --trust-remote-code

Problem: SGLang sgl-kernel import errors

# Solution: Build from source for sm_121a
cd sglang/sgl-kernel
TORCH_CUDA_ARCH_LIST="12.1a" MAX_JOBS=4 CMAKE_BUILD_PARALLEL_LEVEL=1 \
  python -m build --wheel --no-isolation
pip install --no-deps dist/sgl_kernel*.whl

General TTS Issues

Problem: Generated speech sounds robotic or choppy

  • Increase reference audio duration (if using voice cloning)
  • Reduce speaking rate (lower rate parameter)
  • Try a different model — quality varies significantly between models
  • Ensure you're running on GPU (CPU inference can introduce quality artifacts on some models)

Problem: Words are skipped or hallucinated

  • Try TADA (zero hallucinations by design)
  • Reduce temperature parameter
  • Split long text into shorter segments
  • Some models hallucinate more at the end of long sequences — limit output length
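Splitting long text can be as simple as grouping sentences up to a character budget. A hypothetical `chunk_text` helper; the 300-character default is an illustrative choice, not a documented model limit:

```python
import re

def chunk_text(text, max_chars=300):
    """Split text into TTS-sized chunks at sentence boundaries,
    keeping each chunk under max_chars where possible. A single
    sentence longer than max_chars is kept whole."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Synthesize each chunk separately (optionally with crossfaded joins) to keep the model well inside the sequence lengths where it stays stable.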

Problem: Voice clone doesn't sound like the reference

  • Ensure reference audio is clean (no background noise, echo)
  • Check that reference transcript matches the audio exactly
  • Try a longer reference clip (10-15 seconds)
  • Some models work better with same-language references
  • Reduce cfg_weight or classifier-free guidance scale

Problem: High latency / slow generation

  • Check RTF — if > 1.0, generation is slower than real-time
  • Enable streaming output to reduce perceived latency
  • Use smaller model variant (0.6B instead of 1.7B)
  • Apply quantization (INT4 or GGUF Q4)
  • Ensure GPU is being used (check nvidia-smi)
  • Pre-warm model on startup (run a dummy generation)
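Checking RTF is straightforward to script. A sketch, where `tts_generate` stands in for whatever synthesis call your model exposes; the same call doubles as a startup pre-warm:

```python
import time

def measure_rtf(tts_generate, text, sample_rate):
    """Time one generation and return the real-time factor:
    wall-clock seconds per second of audio produced.
    RTF < 1.0 means generation is faster than real time."""
    start = time.perf_counter()
    audio = tts_generate(text)  # returns a sequence of samples
    elapsed = time.perf_counter() - start
    duration = len(audio) / sample_rate
    return elapsed / duration

# Calling this once at startup also serves as the pre-warm dummy
# generation, so the first real request doesn't pay JIT/cache costs.
fake_tts = lambda text: [0.0] * 24000  # dummy model: 1s of silence
rtf = measure_rtf(fake_tts, "warm-up sentence", 24000)
```

Note the first measurement after process start is usually the slowest; measure again after the warm-up call to get a representative number.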

This is a companion document to "The Complete Offline TTS Guide (2026 Edition)." Return to the main guide for comprehensive model comparisons and deployment configurations.
