Complete TTS Model Catalog
4A. LLM-Backbone Models with Voice Cloning
These are the current state of the art for quality, expressiveness, and zero-shot voice cloning. Most require GPU acceleration for real-time performance; the CPU-first NeuTTS family is the main exception.
| Model | Developer | Params | Architecture | Codec / Frame Rate | Voice Cloning | Ref Audio | TTFA | RTF | Streaming | Languages | Emotion Control | License |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hume TADA-1B | Hume AI | 1B | Llama 3.2 + flow-matching | TADA ~2-3 Hz | ✅ Zero-shot | Varies | Not published | 0.09 | Not yet | EN (1B); 9 lang (3B-ML) | Via ref audio | MIT |
| Hume TADA-3B-ML | Hume AI | 3B | Llama 3.2 + flow-matching | TADA ~2-3 Hz | ✅ Zero-shot | Varies | Not published | ~0.09 | Not yet | 9 (ar,ch,de,es,fr,it,ja,pl,pt) | Via ref audio | MIT |
| Qwen3-TTS-12Hz-0.6B | Alibaba Qwen | 0.6B | Qwen3 LLM + MTP | 12.5 Hz, 16-layer multi-codebook | ✅ 3-second | 3s min, 10-15s optimal | 97ms | Not published | ✅ Bidirectional | 10 | NL instruction | Apache 2.0 |
| Qwen3-TTS-12Hz-1.7B | Alibaba Qwen | 1.7B | Qwen3 LLM + MTP | 12.5 Hz, 16-layer multi-codebook | ✅ 3-second | 3s min, 10-15s optimal | 101ms | Not published | ✅ Bidirectional | 10 | NL instruction | Apache 2.0 |
| Fun-CosyVoice 3 | Alibaba FunAudioLLM | 0.5B + DiT | LLM + conditional flow matching | 25 Hz | ✅ Few-shot | Few seconds | ~150ms | Not published | ✅ Bidirectional | 9 + 18 dialects | Instruction-based | Apache 2.0 |
| Fish Audio S2 Pro | Fish Audio | ~4.4B (4B slow + 400M fast) | Dual-AR (Qwen3 backbone) | ~21 Hz, 10-codebook RVQ | ✅ 10-30s ref | 10-30s | ~100ms (H200) | 0.195 (H200) | ✅ SGLang | 80+ | 15,000+ NL tags | Fish Research License |
| VibeVoice-1.5B | Microsoft | ~1.5B total | Qwen2.5 + σ-VAE + diffusion | σ-VAE 7.5 Hz | ✅ Up to 4 speakers | ~10s | ~200ms | 0.2 | ✅ | 2 + 9 experimental | Context-aware | MIT |
| VibeVoice-Realtime-0.5B | Microsoft | ~880M total | Qwen2.5 + σ-VAE + diffusion | σ-VAE 7.5 Hz | ❌ Pre-computed only | N/A (7 presets) | ~300ms | 0.48 (Spark) | ✅ Core design | 2 | Limited | MIT |
| Orpheus TTS 3B | Canopy Labs | 3B | Llama 3.2 + SNAC codec | SNAC 7 tok/frame @ 21 Hz | ✅ Built-in voices; fine-tune for custom | 10-30s | ~200ms | Competitive | ✅ SNAC chunks | 9 (EN primary) | 8+ emotive tags | Apache 2.0 |
| Chatterbox Turbo | Resemble AI | 350M | AR + 1-step distilled diffusion | Custom | ✅ Zero-shot | 5-10s | Sub-200ms | Up to 6× RT | Chunked | EN only | [laugh] [cough] tags + exaggeration param | MIT |
| Chatterbox Multilingual | Resemble AI | ~550M | AR + diffusion | Custom | ✅ Zero-shot | 5-10s | Sub-200ms | ~2× RT | Chunked | 23 | Emotion exaggeration | MIT |
| Spark-TTS | SparkAudio / HKUST | 0.5B | Qwen2.5 + BiCodec | BiCodec 50 Hz | ✅ Zero-shot | 5-30s | Not published | 0.07 (L20, 4× conc) / 0.14 (single) | ✅ Triton gRPC | 2 (ZH, EN) | Gender/pitch/speed | Apache 2.0 |
| OuteTTS 1.0 (0.6B) | OuteAI | 0.6B | Pure AR (Qwen3) + DAC | DAC ~150 TPS | ✅ Zero-shot | 10-15s | Not published | Competitive | Partial (async) | 14 | Via ref speaker | Apache 2.0 |
| OuteTTS 1.0 (1B) | OuteAI | 1B | Pure AR (Llama 3.2) + DAC | DAC ~150 TPS | ✅ Zero-shot | 10-15s | Not published | Competitive | Partial (async) | 23+ | Via ref speaker | Apache 2.0 |
| NeuTTS Air | Neuphonic | 748M | Qwen2 + NeuCodec | NeuCodec 50 Hz | ✅ 3-15s | 3-15s | Not published | <0.5 CPU | ✅ GGUF stream | EN primary | None | Apache 2.0 |
| NeuTTS Nano | Neuphonic | ~229M | Compact LM + NeuCodec | NeuCodec 50 Hz | ✅ Same-language ref | ~10s | Not published | CPU real-time | ✅ GGUF stream | EN, ES, DE, FR | None | Apache 2.0 |
| Sesame CSM-1B | Sesame AI Labs | 1.1B | Dual-AR (Llama) | Mimi 12.5 Hz | ✅ Conversational context | Audio prompt | Not published | Not published | Partial | EN | Conversational prosody | Apache 2.0 |
| Dia2 | Nari Labs | 1B / 2B | AR streaming transformer | Mimi 12.5 Hz | ✅ Audio prefix | ≤10s | Not published | Not published | ✅ Core feature | EN | 16+ nonverbal tags | Apache 2.0 |
| Zonos | Zyphra | 1.6B | Transformer / Transformer+Mamba2 SSM hybrid | DAC ~86 Hz | ✅ 5-30s | 5-30s | Not published | ~2× on 4090 | ❌ | EN primary (+JA,ZH,FR,DE) | Rich (rate, pitch, 5 emotions) | Apache 2.0 |
| Higgs Audio V2 | Boson AI | ~5.8B (3.6B LLM + 2.2B DualFFN) | Llama-3.2-3B + DualFFN | Custom | ✅ ~3s | ~3s | 2-3s e2e | Not published | Limited | 20+ | 75.7% vs gpt-4o-mini-tts (Emotions) | Apache 2.0 (generation weights) |
| Higgs Audio V2.5 | Boson AI | ~1B condensed | Condensed architecture + GRPO | Custom | ✅ | Short ref | <150ms claimed | Faster than V2 | ✅ OpenAI-compat | 20+ | Improved over V2 | Apache 2.0 |
| Kani-TTS-2 | nineninesix.ai | ~400M | LiquidAI LFM2 + NanoCodec | NanoCodec 12.5 fps | ✅ Zero-shot | 10-20s | Not published | 0.2 (RTX 5080) | Via OpenAI server | EN, PT | Accent/dialect | Apache 2.0 |
| F5-TTS | SWivid / SJTU | 335M | Flow Matching DiT | Mel spectrogram + vocoder | ✅ 1-5s | 1-5s | Not published | 0.15 (3090); 0.030 (Fast variant) | ✅ Chunked | 2 (+10 community) | Via ref audio | Code: MIT; Weights: CC-BY-NC |
| GPT-SoVITS v2 ProPlus | RVC-Boss | ~210M (v2Pro) | GPT + SoVITS hybrid | Custom | ✅ 5s zero-shot; 1min best | 5s / 1min | Not published | 0.014 claimed (4090, optimal) | WebUI | 5 (ZH/EN/JA/KR/Cantonese) | DPO-trained | MIT |
| IndexTTS 1.5 | Bilibili | ~600M GPT + vocoder | GPT AR + BigVGAN2 | Custom | ✅ Zero-shot | 5-10s | Not published | Faster than CV2 claimed | ❌ documented | 2 (ZH, EN) | Limited | Code: Apache 2.0; Weights: bilibili license (commercial auth required) |
| IndexTTS-2 | Bilibili | ~600M+ | GPT AR + BigVGAN2 + Qwen3 | Custom | ✅ Zero-shot | 5-10s | Not published | Not published | ❌ documented | 3 (ZH, EN, JP) | Full timbre-emotion disentanglement | bilibili Model Use License (non-commercial for improving AI) |
| MaskGCT | Amphion / CUHK-SZ | ~1B+ total | Masked Generative Codec Transformer | Custom | ✅ 5-10s | 5-10s | Not published | Not published | ❌ | 2 (EN, ZH) | Via prompt audio | CC-BY-NC-4.0 |
| XTTS-v2 | Coqui AI (community) | ~467M | GPT-2 + VQ-VAE + HiFi-GAN | Custom | ✅ 3-6s | 3-6s | <150ms claimed | Not published | ✅ | 17 | Via ref audio | CPML (non-commercial) |
| Mars5-TTS | Camb.ai | ~1.2B | AR Transformer + DDPM | Custom | ✅ 2-12s | 2-12s | Not published | Not published | ❌ | EN only | Via punctuation | AGPL 3.0 |
| StyleTTS2 | Columbia University | ~100M est. | Diffusion + ISTFTNet + WavLM | Mel + ISTFTNet | ✅ 5-10s | 5-10s | Not published | ~2× VITS | Experimental | EN only | Style vectors (strong) | Code: MIT; Pretrained: must disclose synthetic |
4B. Models Without Voice Cloning (Preset Voices)
| Model | Developer | Params | Cloning | Languages | Latency | License | Notes |
|---|---|---|---|---|---|---|---|
| Kokoro-82M | Hexgrad | 82M | ❌ 54 voicepacks | 9 | <0.1 RTF | Apache 2.0 | #17 on TTS Arena V2 (highest open-source). ONNX, runs in browser |
| NVIDIA Magpie TTS | NVIDIA | 357M | ❌ (open weights); ✅ via Riva NIM | 9 | ~600ms/sentence (Spark) | NVIDIA Open Model License | Confirmed on DGX Spark. Frame stacking + MaskGit parallel decode |
| Parler-TTS | HuggingFace | 2.2B (large) / 880M (mini) | ❌ NL voice description | EN primary | Not published | Apache 2.0 | Control voice via text description instead of reference audio |
| MeloTTS | MyShell.ai | ~80M est. | ❌ Preset speakers | 6 (EN/ES/FR/ZH/JA/KR) | CPU real-time | MIT | 4 English accents. ONNX via sherpa-onnx |
| KittenTTS | KittenML | 15M (nano) / 40M (micro) / 80M (mini) | ❌ Built-in voices | EN | CPU real-time | Apache 2.0 | INT8 checkpoints available. 25 MB smallest variant |
| Piper TTS | Open Home Foundation | VITS-based (varies) | ❌ (fine-tune for custom) | 30+ | CPU real-time | GPL-3.0 | aarch64 binaries. Hundreds of voices. ONNX-based |
4C. Classical / Formant Engines
| Engine | Type | SSML Support | x86/arm64 | Streaming | License | Notes |
|---|---|---|---|---|---|---|
| eSpeak-NG | Formant synthesis | ✅ SSML subset | Both | ✅ stdout | GPL-3.0 | Foundation for many neural model phonemizers |
| Festival | Unit selection + parametric | SABLE markup | Unix platforms | Batch + client/server | X11-type (unrestricted) | Scheme scriptable. Multiple synthesis modes |
| Flite | Lightweight Festival-derived synthesizer | Basic | x86/ARM/Android | Batch | BSD-like (verify COPYING) | >70× real-time on legacy hardware. Thread-safe |
| MARY TTS | Java framework | ✅ SSML prosody | JVM cross-arch | Client/server | Verify repo | Research-oriented |
4D. Closed-Source Offline-Capable Options
| Product | Voice Cloning | Latency | Languages | Offline Method | DGX Spark | Pricing |
|---|---|---|---|---|---|---|
| NVIDIA Riva/Magpie NIM | ✅ (Zeroshot/Flow via NIM) | ~100-150ms/utterance | 9+ | Docker container | x86_64 only currently | NVIDIA AI Enterprise license |
| Resemble AI On-Prem | ✅ Rapid clone (10s) + Professional (25+ sentences) | Sub-200ms | 149+ | Air-gapped Python package | Likely (Python-based) | Enterprise sales |
| Cartesia Sonic 3 | ✅ ~3s | ~40ms TTFA claimed | 15 | On-premise + on-device | Not confirmed | Enterprise sales |
| Microsoft Embedded Speech | ❌ Neural preset voices | Not published | Multiple | Speech SDK offline | Linux x64+Arm64, macOS x64+Arm64, Win11 | Commercial license |
| Acapela TTS Embedded | ❌ | Not published | Multiple | Vendor SDK | ARM + x86 | Commercial |
| ElevenLabs | ✅ | Low | 29+ | No offline option | N/A | Cloud-only |
4E. Corrections Log
Seven material errors were identified in our initial research and corrected:
- Fish Audio S2 Pro: Parameter count is ~4.4B, not 5B (4B Slow AR + 400M Fast AR)
- VibeVoice Spark benchmark: RTF 0.48 and 766ms pipeline latency are from a single community developer's forum post (logosflux, Jan 2026), not official NVIDIA/Microsoft figures
- Spark-TTS RTF: 0.07 is achieved at 4× concurrency on L20; single-stream RTF is ~0.14
- Higgs Audio V2 win rate: 75.7% is specifically against gpt-4o-mini-tts (not GPT-4o) in the Emotions category only
- CosyVoice 2 language count: CosyVoice 2 supports ~4 primary languages; the "9 languages" applies to Fun-CosyVoice 3
- Orpheus variant names: Correct names are Nano (150M), Tiny (400M), Small (1B), Medium (3B) — not "Large." Only the 3B model appears publicly released
- IndexTTS 1.x license: Code is Apache 2.0 but model weights carry bilibili license restrictions requiring written authorization for commercial use
DGX Spark Compatibility Status
Confirmed working on GPU (March 2026)
| Model | Method | Performance | Source |
|---|---|---|---|
| VibeVoice-Realtime-0.5B | Native PyTorch (SDPA) | RTF 0.48×, ~766ms full pipeline | Community benchmark (logosflux, NVIDIA Forums, Jan 2026) |
| NVIDIA Magpie TTS (357M) | NeMo / Pipecat | ~600ms batch per sentence | Daily.co blog + NVIDIA validation |
| Qwen3-TTS-1.7B | vLLM-Omni 0.16.0 | Working (no RTF published) | NVIDIA Forums (shahizat, Feb 2026) |
Confirmed broken on DGX Spark GPU ("Blackwell Noise")
These models produce white noise after ~2 seconds due to FP16/BF16 numerical instability on sm_121 Tensor Cores:
- Coqui xTTS-v2
- F5-TTS
- Kokoro-82M (GPU mode)
- AllTalk TTS
- NVIDIA Riva Magpie (in Docker — container compatibility issue)
- PersonaPlex
Workarounds for broken models
- CPU fallback: Many models work correctly on the Grace ARM CPU (which is quite capable at 4 GHz). Use `device="cpu"` and accept slower generation
- Replace torchaudio with soundfile for audio I/O
- Build Flash Attention 2 from source (~15 min with `MAX_JOBS=4`)
- Use SDPA fallback instead of Flash Attention (VibeVoice works with SDPA and may actually be faster on Blackwell)
Not yet tested but likely compatible
- TADA-1B/3B (raw PyTorch, Llama backbone — should work with cu130 wheels)
- Fun-CosyVoice 3 (PyTorch-based, similar architecture to working Qwen3-TTS)
- Orpheus 3B via llama.cpp GGUF (llama.cpp has excellent DGX Spark support)
- OuteTTS 1.0 via llama.cpp GGUF (same inference path as Orpheus)
- NeuTTS Air/Nano via llama.cpp GGUF (CPU-first architecture)
- Sesame CSM-1B (HuggingFace Transformers, should work with BF16)
Key DGX Spark deployment resources
| Resource | URL |
|---|---|
| VibeVoice pipeline on Spark | forums.developer.nvidia.com/t/dgx-spark-vibevoice-tts-streaming-voice-pipeline-setup-guide/356424 |
| Qwen3-TTS on Spark | forums.developer.nvidia.com/t/running-vllm-omni-for-qwen3-tts-voice-design-voice-clone-on-dgx-spark/361255 |
| xTTS Docker attempt | forums.developer.nvidia.com/t/xtts-in-a-dockercontainer-on-the-dgx-spark/357850 |
| aarch64 compatibility thread | forums.developer.nvidia.com/t/architecture-and-library-compatibility-on-aarch64/350389 |
| SGLang build guide | forums.developer.nvidia.com/t/build-sglang-from-source-on-blackwell-pro-6000-dgx-spark/360785 |
| Community ML setup guide | github.com/martimramos/dgx-spark-ml-guide |
| NVIDIA official playbooks | github.com/NVIDIA/dgx-spark-playbooks |
| Full voice pipeline code | github.com/Logos-Flux/spark-voice-pipeline |
| Daily.co Magpie voice agent | daily.co/blog/building-voice-agents-with-nvidia-open-models/ |
Optimization Techniques Catalog
6A. Model-Level Optimizations
KV Cache Optimization
PagedAttention (vLLM) splits KV cache into fixed-size blocks, reducing waste from 60–80% to under 4% and enabling 2–4× throughput. SGLang's RadixAttention stores KV tensors in a radix tree — when the same voice prompt is reused, cached computation yields 75–95% cache hit rates. Fish Audio S2 Pro achieves 86.4% average prefix-cache hit rate (>90% at peak) via RadixAttention, making repeated voice cloning nearly free. On DGX Spark, NVFP4 KV cache quantization cuts KV memory by ~50%.
Speculative Decoding
A small draft model proposes token sequences that the target model verifies in parallel. On DGX Spark, EAGLE3 speculative decoding delivers ~2× speedup and is supported by SGLang's official Spark Docker image. Speech-specific variants: VADUSA applies Medusa-style draft heads to TTS; PredGen overlaps speculative text/audio generation with user speech for up to 3× perceived latency reduction.
Quantization
On DGX Spark, quantization attacks the 273 GB/s memory-bandwidth wall directly by reducing bytes read per token. INT8 achieves 2–3× speedup with <3% MOS drop. INT4 (W4A16) with Marlin kernels outperforms NVFP4 for decode on DGX Spark — FP8→FP4 disappoints because the bottleneck is memory reads, not compute. Quality hierarchy: BF16 > FP16 > FP8 > INT8 >> INT4. INT4 roughly doubles decode throughput but sits at the bottom of that hierarchy and can introduce prosody artifacts: use INT8 where quality is a priority and reserve INT4 for maximum-throughput scenarios where naturalness is secondary.
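As a minimal sketch of the INT8 path: the AR backbone of most LLM-backbone models loads through HF Transformers, so bitsandbytes INT8 is one config object away. The Orpheus checkpoint below is the one listed in this guide's references; whether a given TTS wrapper tolerates a quantized backbone is model-specific:

```python
# Sketch: INT8 loading of an LLM-backbone TTS model's AR transformer via
# bitsandbytes. Expect ~2-3x decode speedup at <3% MOS cost (see above).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "canopylabs/orpheus-3b-0.1-ft",     # example HF-hosted AR TTS backbone
    quantization_config=quant_config,
    device_map="auto",
)
```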
torch.compile
mode="reduce-overhead" uses CUDA graphs and delivers 30–50% latency reduction. However, on DGX Spark sm_121a, CUDA graph capture via Triton frequently fails — vLLM requires enforce_eager=True. Raw CUDA graphs via SGLang work better (17 ms first-request overhead vs torch.compile's ~4 second warm-up).
Flash Attention
FA2 reduces attention memory from O(n²) to O(n). FA3 is incompatible with Blackwell (sm_121); build FA2 from source or fall back to PyTorch SDPA. The penalty for skipping this is severe: without any FlashAttention, Qwen3-TTS inference is 10× slower even on an RTX 5090.
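A sketch of the SDPA fallback when loading through HF Transformers. `attn_implementation` is standard Transformers API; the model ID comes from this guide, though Qwen3-TTS's recommended serving path (vLLM-Omni) differs:

```python
# Sketch: request SDPA attention explicitly — the safe default on DGX Spark
# until Flash Attention 2 is built from source.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    torch_dtype=torch.bfloat16,          # BF16 is mandatory for Qwen3-TTS
    attn_implementation="sdpa",          # or "flash_attention_2" once built
    trust_remote_code=True,
)
```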
Multi-Token Prediction (MTP)
Qwen3-TTS-12Hz predicts all 16 codebook layers simultaneously per frame via hierarchical MTP. Self-distilled MTP achieves >3× faster decoding with <5% accuracy drop — a decisive advantage for multi-codebook architectures, where sequential per-layer decoding would otherwise multiply the token count.
Continuous Batching
Orca-style iteration-level scheduling lets requests join/leave at each generation step. On DGX Spark, throughput scales nearly linearly: Llama 8B goes from 20.5 tok/s at BS=1 to 368 tok/s at BS=32 via SGLang.
Model Distillation
Chatterbox Turbo's 10→1 step diffusion distillation eliminates the decoder bottleneck. Fast F5-TTS reduces diffusion steps from 32 to 7, achieving RTF 0.030 (4× speedup).
6B. Pipeline and Streaming Optimizations
Sentence-Level Chunking
Split LLM output at punctuation; submit first sentence to TTS while LLM generates the rest. Deepgram explicitly warns against splitting mid-sentence (produces choppy prosody).
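A minimal sketch of the chunker, assuming tokens arrive from any streaming LLM API:

```python
# Sketch: buffer streamed tokens, yield complete sentences at punctuation so
# TTS can start on sentence 1 while the LLM writes sentence 2.
import re

SENTENCE_END = re.compile(r"([.!?])\s")

def sentences_from_stream(token_stream):
    buf = ""
    for token in token_stream:
        buf += token
        while (m := SENTENCE_END.search(buf)):
            yield buf[:m.end(1)].strip()   # up to and including the punctuation
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()                  # flush the trailing fragment
```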
Streaming Text Input
True streaming TTS processes text character-by-character and starts synthesizing 6.5× faster than batch mode. Supported by: Qwen3-TTS, CosyVoice 2/3, VibeVoice-Realtime, Dia2.
Pipeline Stage Overlap
Start TTS once first few LLM tokens are available. Pipecat automates this with sentence-boundary aggregation. The Cresta production approach uses hedging (multiple parallel LLM calls, use whichever returns first).
Voice Prompt Caching
Pre-encode reference audio and reuse across requests. Implementation: SGLang RadixAttention (automatic for S2 Pro), CosyVoice LRU timbre manager, manual dict cache for TADA/others.
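A minimal sketch of the manual-cache pattern; `encode_fn` is whatever reference-audio encoder the engine exposes, so it is injected rather than named:

```python
# Sketch: LRU-cache an engine's reference-audio encoding, keyed on file path.
# The expensive encode runs once per voice; repeat requests are near-free.
from functools import lru_cache

def make_voice_cache(encode_fn, maxsize=64):
    @lru_cache(maxsize=maxsize)
    def cached(ref_wav_path: str):
        return encode_fn(ref_wav_path)
    return cached

# usage: get_voice = make_voice_cache(model.encode_reference)  # hypothetical method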
Warm-Up Passes
Non-negotiable for production. torch.compile warm-up runs from seconds to minutes depending on the model; CUDA graph capture needs under a second. Pre-warm the vocoder/codec separately — it is often the cold-start bottleneck.
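A minimal warm-up sketch, with `synth` standing in for the engine call:

```python
# Sketch: run a few throwaway generations at server start-up so the first
# user request doesn't pay compile/allocation costs. Per-pass time should
# drop sharply and then stabilize.
import time

def warm_up(synth, runs=3):
    for i in range(runs):
        t0 = time.perf_counter()
        synth("Warm-up sentence for kernel compilation and cache allocation.")
        print(f"warm-up pass {i + 1}: {time.perf_counter() - t0:.2f}s")
```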
6C. Audio Codec Frame Rate Impact
The single most impactful architectural choice for TTS speed:
| Frame Rate | Tokens per 10s | Representative Codecs | Speed Impact |
|---|---|---|---|
| 75–150 Hz | 750–1,500 | EnCodec, DAC, APCodec | Impractical for real-time AR |
| 50 Hz | 500 | SpeechTokenizer, NeuCodec | Standard; heavy for AR |
| 25 Hz | 250 | Qwen3-TTS-25Hz, CosyVoice | 2× reduction |
| 21 Hz | 210 | NanoCodec (Magpie), Fish S2 | Good balance |
| 12.5 Hz | 125 | Mimi, Qwen3-TTS-12Hz | 4× fewer tokens — sweet spot |
| 7.5 Hz | 75 | σ-VAE (VibeVoice) | Ultra-low; continuous latents |
| 2–3 Hz | 20–30 | TADA | Extreme compression; RTF 0.09 |
Lower frame rate with more quantization layers outperforms higher frame rate with fewer layers at the same bitrate. Prefer 12.5 Hz codecs (Mimi, Qwen3-TTS-12Hz) over 50 Hz codecs for autoregressive decode on bandwidth-limited hardware like DGX Spark.
6D. Serving Infrastructure Comparison
| Framework | TTS Models | Strength | DGX Spark Status |
|---|---|---|---|
| SGLang | Fish S2 Pro, Orpheus | RadixAttention voice caching | Official Docker; EAGLE3 works |
| vLLM | Qwen3-TTS, Orpheus | Best TTFT; PagedAttention | Requires enforce_eager=True |
| TensorRT-LLM | CosyVoice (4× speedup), Spark-TTS, Magpie | Deepest NVIDIA optimization | Fully supported |
| llama.cpp | OuteTTS, Orpheus, NeuTTS | GGUF; ARM64 native | Excellent; official benchmarks |
| ONNX Runtime | Kokoro, Piper, Chatterbox Turbo | Cross-platform; lightweight | ARM64 with CUDA EP |
| Triton | Multi-model pipelines | Ensemble scheduling | Full NVIDIA support |
6E. Conversational "Feel" Optimizations
Filler Word Injection
Reduces perceived latency by 50–70% (from 1.5–2s to 0.5–0.7s). Instruct LLM to start responses with "umm," "so," or "well." Send tiny filler chunk to TTS instantly, stream the rest behind it.
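A minimal sketch of the trick — `speak` stands in for the TTS service call, and the sentence stream can come from the chunker sketched in 6B:

```python
# Sketch: filler-word injection. The filler is a tiny, pre-chosen chunk that
# reaches TTS (and the user's ears) while the real answer is still generating.
import random

FILLERS = ["Umm, ", "So, ", "Well, ", "Right, "]

def respond(speak, sentence_stream):
    speak(random.choice(FILLERS))      # audible almost immediately
    for sentence in sentence_stream:   # real answer streams in behind it
        speak(sentence)
```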
Turn-Taking Prediction
Beyond VAD silence detection. NVIDIA ACE Agent uses two-pass endpointing: send interim transcript at 240 ms pause, final at 800 ms — saving 500–600 ms overall.
Barge-in Support
Stop TTS playback within 200 ms when user interrupts. Pipecat uses Silero VAD + SmartTurn, flushing downstream buffers instantly.
Backchanneling
Filter short utterances ("yeah," "okay") from triggering full responses. NVIDIA ACE Agent filters words under 3 characters.
6F. DGX Spark Specific Optimizations
- Use BF16 everywhere: Blackwell has native BF16 tensor cores. Both Qwen3-TTS and Fish S2 Pro were trained in BF16
- Prefer low frame-rate codecs: 12.5 Hz (Qwen3-TTS-12Hz, Mimi) → only ~12.5 tokens/sec of audio, well within bandwidth
- INT4 quantization > NVFP4 for decode: Marlin kernel INT4 outperforms FP4 on Spark
- Disable GUI: `sudo systemctl set-default multi-user.target` frees ~60 MB GPU memory
- Unified memory advantage: No cudaMemcpy needed. Models, KV cache, and audio buffers share the same 128 GB pool
- No NUMA topology: Single SoC, no transfer penalty for CPU offloading of text preprocessing
Model Optimization Compatibility Matrix
Serving Framework Support
| Model | SGLang | vLLM | TRT-LLM | llama.cpp | ONNX RT | Raw PyTorch |
|---|---|---|---|---|---|---|
| TADA-1B/3B | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Qwen3-TTS-12Hz | 🔜 PR open | ✅ vLLM-Omni | ❌ | ❌ | ❌ | ✅ |
| Fun-CosyVoice3 | ❌ | ✅ v0.9+ | ✅ 4× speedup | ❌ | Partial | ✅ |
| Chatterbox Turbo | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |
| Fish S2 Pro | ✅ Primary | Compatible | ❌ | ❌ | ❌ | ✅ |
| NeuTTS Air/Nano | ❌ | ✅ | ❌ | ✅ Primary | ✅ | ✅ |
| VibeVoice-RT | ❌ | 🔜 | ❌ | ❌ | ❌ | ✅ |
| Orpheus 3B | ✅ | ✅ | ❌ | ✅ GGUF | ❌ | ✅ |
| Spark-TTS | ❌ | Possible | ✅ Official | ❌ | ❌ | ✅ |
| OuteTTS 1.0 | ❌ | ✅ Exp | ❌ | ✅ Primary | ❌ | ✅ |
| Kokoro-82M | ❌ | ❌ | ❌ | ❌ | ✅ Full | ✅ |
| Sesame CSM-1B | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ HF |
| Dia2 | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Magpie 357M | ❌ | ❌ | ✅ Via Riva | ❌ | ❌ | ✅ NeMo |
Quantization and Optimization Availability
| Model | Quantization Formats | Voice Cache | CUDA Graphs | torch.compile |
|---|---|---|---|---|
| TADA-1B/3B | FP16/BF16 only | ❌ | Likely | Likely |
| Qwen3-TTS-12Hz | BF16 (mandatory), 4-bit MLX | Implicit KV | Unknown | Unknown |
| Fun-CosyVoice3 | FP16/FP32 | ✅ LRU timbre manager | Unknown | Unknown |
| Chatterbox Turbo | FP16, ONNX FP32 | ❌ | Unknown | Unknown |
| Fish S2 Pro | BF16 only | ✅ 86.4% RadixAttention | ✅ SGLang | ✅ --compile |
| NeuTTS Air/Nano | GGUF Q4/Q8, FP16 | ✅ Reference .pt | Unknown | Unknown |
| VibeVoice-RT | FP16/BF16 | ✅ Pre-computed | Unknown | Unknown |
| Orpheus 3B | FP16, FP8, GGUF Q4-Q8, GPTQ, AWQ | Implicit KV | ✅ Via vLLM/SGLang | ✅ |
| Spark-TTS | FP16, 4-bit BnB | Implicit globals | Via TRT-LLM | Unknown |
| OuteTTS 1.0 | GGUF Q2-Q8, FP8, EXL2, FP16 | ✅ JSON profiles | Via backends | Possible |
| Kokoro-82M | ONNX fp32/fp16/q8/q4/q4f16 | Pre-computed voicepacks | ❌ | ⚠️ Issues |
| Sesame CSM-1B | FP16/BF16 | ✅ Conversation ctx | ✅ Documented | ✅ Static cache |
| Dia2 | BF16 only | ✅ Audio prefix | ✅ --cuda-graph | ✅ use_torch_compile |
| Magpie 357M | FP16, TRT optimized | ✅ Context encoder | Likely via Riva | Unknown |
Voice Pipeline Architecture
STT → LLM → TTS Pipeline
The standard real-time voice agent pipeline has three stages, with the LLM consistently dominating total latency:
```
User speaks → [VAD] → STT (~50ms) → LLM (~300-700ms) → TTS (~100-600ms) → Audio playback
                                    ↑ DOMINANT          ↑ YOU ARE HERE
```
Latency budget (conversational threshold: <500ms perceived):
| Setup | Voice-to-Voice Latency |
|---|---|
| GPT-4o Realtime (native audio) | 232–320 ms |
| Optimized cloud pipeline | <500 ms |
| Pipecat local on M-series Mac | <800 ms |
| NVIDIA Blueprint target | 500–1,500 ms |
| VibeVoice pipeline on DGX Spark | ~766 ms |
| Naive cloud pipeline | 3–5 s |
Pipecat Integration
Pipecat is the leading open-source Python framework for real-time voice AI, supporting 25+ TTS backends. Key integration patterns:
- `TextAggregationMode.SENTENCE` (default): Natural prosody, sends complete sentences to TTS
- `TextAggregationMode.TOKEN`: Lowest latency, may produce unnatural prosody
- Interruption handling: Silero VAD + SmartTurn → flush all downstream buffers
- OpenAI-compatible servers bridge any local TTS to Pipecat via `OpenAITTSService(base_url=...)` — see the sketch below
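A sketch of that bridge, pointing Pipecat at a local Kokoro-FastAPI server. The import path varies across Pipecat versions and the voice ID is engine-specific, so treat both as illustrative:

```python
# Sketch: any OpenAI-compatible local TTS server becomes a Pipecat TTS service.
# Assumes Kokoro-FastAPI listening on localhost:8880 (see the table below).
from pipecat.services.openai import OpenAITTSService  # path varies by version

tts = OpenAITTSService(
    api_key="not-needed-locally",         # local servers typically ignore the key
    base_url="http://localhost:8880/v1",  # the community server's OpenAI endpoint
    voice="af_heart",                     # engine-specific voice ID (example)
)
```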
Community TTS API servers for Pipecat integration:
| Server | TTS Backend | API Compatibility |
|---|---|---|
| Chatterbox-TTS-Server | Chatterbox (all 3 variants) | OpenAI TTS |
| Kokoro-FastAPI | Kokoro-82M | OpenAI TTS |
| LocalAI | XTTS, Piper, multiple | OpenAI + ElevenLabs |
| local-openai-tts-server | Silero + Piper | OpenAI TTS |
| LocalOrpheusTTS | Orpheus via Ollama | Custom |
DGX Spark Memory Budget for Full Pipeline
| Component | Model Example | Est. Memory |
|---|---|---|
| STT | Whisper Large v3 Turbo | ~3 GB |
| LLM | Llama-3.2-8B Q4 | ~5 GB |
| TTS (primary) | Qwen3-TTS-12Hz-0.6B | ~1.2 GB |
| TTS (alternate) | Orpheus 3B GGUF Q4 | ~2 GB |
| TTS (lightweight) | Kokoro-82M ONNX | ~0.05 GB |
| Framework overhead | KV cache, CUDA context | ~4-8 GB |
| Total | | ~15-20 GB |
| Free for voice profiles, context, upgrades | | ~108-113 GB |
Configuration Reference
Model-Specific Deployment Settings
TADA-1B:
```python
# Ultra-low frame rate is the optimization itself
import torch

model = TadaForCausalLM.from_pretrained("HumeAI/tada-1b").to("cuda").to(torch.bfloat16)
# Flow matching steps configurable for quality/speed tradeoff
# ~700 seconds continuous audio with 2048 token context
```

Qwen3-TTS-12Hz (0.6B/1.7B):
```bash
# BF16 is MANDATORY — FP16 causes NaN/overflow in MTP sampling
# vLLM-Omni on DGX Spark:
vllm serve Qwen/Qwen3-TTS-12Hz-0.6B-Base --enforce-eager --trust-remote-code \
    --gpu-memory-utilization 0.9 --dtype bfloat16
# Reference audio: 10-15s optimal; append 0.5s silence to end
# 3s minimum for cloning
```

BF16 is mandatory for Qwen3-TTS-12Hz: FP16 causes NaN/overflow in the multi-token prediction (MTP) sampling step. Always pass `--dtype bfloat16` when serving via vLLM.
Fun-CosyVoice 3:
```bash
# LightTTS framework for optimized serving:
# github.com/ModelTC/LightTTS
# TRT-LLM gives 4× speedup on LLM component
# Shared memory timbre manager with LRU for repeated speakers
```

Fish S2 Pro (via SGLang):
```bash
# Build sgl-kernel for Spark:
TORCH_CUDA_ARCH_LIST="12.1a" MAX_JOBS=4 CMAKE_BUILD_PARALLEL_LEVEL=1 \
    python -m build --wheel --no-isolation
# CRITICAL: BF16 RoPE truncation patch required
# --mem-fraction-static 0.15 (keeps allocation small for shared memory)
# --max-running-requests 2 (bandwidth-limited on Spark)
```

Orpheus 3B (via llama.cpp):
```bash
# Build for DGX Spark:
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DCMAKE_CUDA_ARCHITECTURES=121
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
# Serving:
./llama-server -m orpheus-3b-q4_k_m.gguf -c 8192 -ngl 99 \
    --cache-type-k q8_0 --cache-type-v q8_0 -fa --mlock
# repetition_penalty >= 1.1 REQUIRED for stable output
# Stream every 7 SNAC tokens (= 1 audio frame)
```

Chatterbox Turbo:
```python
model = ChatterboxTurboTTS.from_pretrained(device="cuda")
# cfg_weight ~0.3 for slower, more natural rhythm
# exaggeration 0.5 default; higher = more expressive + faster speaking
# ONNX export available for cross-platform
```

OuteTTS 1.0 (via llama.cpp):
```bash
# GGUF Q4-Q8 available
# Windowed repetition penalty (64-token window) is CRITICAL
# Full context penalty breaks output
# llama.cpp llama-tts binary (PR #12794 pending for 1.0)
```

Essential Environment Variables
```bash
# DGX Spark general
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda
export TORCH_CUDA_ARCH_LIST="12.1a"

# PyTorch optimization
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"

# llama.cpp unified memory
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

# Disable GUI for headless serving
sudo systemctl set-default multi-user.target
```

```python
# PyTorch inference optimization (all models)
import torch

torch.set_grad_enabled(False)
torch.set_float32_matmul_precision('high')  # TF32 on Blackwell
torch.backends.cudnn.benchmark = True       # for vocoder conv layers
torch.backends.cuda.matmul.allow_tf32 = True
```

Canonical TTSConfig Schema
This portable, offline-first configuration schema can drive multiple engines. It separates static capabilities from request-time controls and deployment settings:
```json
{
"engine": "qwen3_tts | cosyvoice3 | orpheus | tada | chatterbox | fish_s2 | ...",
"voice": {
"locale": "en-US",
"voice_id": "engine-specific voice name or voicepack ID",
"model_id": "HuggingFace model hub identifier",
"model_path": "/path/for/air-gapped/deployment",
"speaker_ref_wav": "/path/to/reference.wav",
"speaker_index": 0
},
"audio": {
"format": "wav | raw_pcm_s16le | flac | opus",
"sample_rate_hz": 24000,
"channels": 1,
"normalize": true,
"volume_gain": 1.0,
"postfx": { "lufs_target": -18, "limiter": true, "resample_quality": "high" }
},
"controls": {
"rate": 1.0,
"pitch": 0.0,
"temperature": 0.7,
"noise_scale": 0.667,
"style": "warm and conversational",
"seed": 42
},
"segmentation": {
"mode": "sentence",
"max_chars": 500,
"split_pattern": "\\n+",
"pause_ms_between_segments": 200
},
"execution": {
"device": "cuda",
"precision": "bf16",
"threads": 4,
"streaming": {
"enabled": true,
"chunk_ms": 40,
"ttfa_budget_ms": 300,
"buffer_strategy": "low_latency"
},
"cache": {
"model_cache_dir": "/models",
"warm_start": true,
"max_models_loaded": 3
}
}
}
```

Validation and Testing
Metrics to measure
| Metric | Definition | Target (conversational) |
|---|---|---|
| TTFA | Time from text input to first playable audio chunk | <300 ms |
| RTF | Wall-clock seconds / audio seconds generated | <0.5 (2× real-time) |
| WER | Word error rate (TTS output → STT → compare to input) | <3% |
| SIM | Speaker similarity (cosine similarity of speaker embeddings) | >0.85 for cloning |
| Peak VRAM | Maximum GPU memory usage during generation | Per-model budget |
| Pipeline latency | Voice-in to voice-out total time | <1,000 ms |
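Both WER and SIM are scriptable. A minimal sketch of the WER round trip using the `jiwer` package — `synth` and `transcribe` are stand-ins for your TTS engine and STT model (e.g. Whisper):

```python
# Sketch: WER round trip — synthesize, transcribe, compare to the input text.
from jiwer import wer

def tts_round_trip_wer(synth, transcribe, sentences):
    scores = []
    for text in sentences:
        audio = synth(text)              # engine-specific synthesis call
        hypothesis = transcribe(audio)   # e.g. Whisper Large v3 Turbo
        scores.append(wer(text.lower(), hypothesis.lower()))
    return sum(scores) / len(scores)     # target: < 0.03 for conversational use
```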
Test suite structure
Unit tests: Text normalization determinism, phoneme override verification, SSML parsing for supported engines.
Integration tests: Golden-audio spectral fingerprinting (compare spectral features rather than raw waveforms, since generation varies run to run), cross-engine output duration bounds, architecture smoke tests (x86_64, aarch64).
Perceptual tests: MOS listening panels, ABX preference tests, accessibility evaluation at high speech rates.
Benchmark harness
Capture cold-start vs warm-start TTFA, steady-state RTF at 1/2/4/8 concurrent requests, CPU thread scaling, quantization impact (BF16 vs INT8 vs INT4), and output sample rate normalization. sherpa-onnx publishes per-model RTF tables including Raspberry Pi measurements, providing a useful cross-platform reference.
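A minimal harness sketch for the TTFA and RTF metrics above — `stream_synth` stands in for any engine's streaming generator yielding raw sample chunks:

```python
# Sketch: measure TTFA (time to first audio chunk) and RTF (wall time divided
# by seconds of audio produced) for one streaming synthesis request.
import time

def benchmark(stream_synth, text, sample_rate=24000):
    t0 = time.perf_counter()
    ttfa, samples = None, 0
    for chunk in stream_synth(text):          # chunk: 1-D array of PCM samples
        if ttfa is None:
            ttfa = time.perf_counter() - t0   # first playable audio
        samples += len(chunk)
    wall = time.perf_counter() - t0
    return {"ttfa_ms": ttfa * 1000, "rtf": wall / (samples / sample_rate)}
```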
References and Links
Model repositories
| Model | GitHub | HuggingFace |
|---|---|---|
| TADA | github.com/HumeAI/tada | huggingface.co/HumeAI/tada-1b |
| Qwen3-TTS | github.com/QwenLM/Qwen3-TTS | huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base |
| CosyVoice / Fun-CosyVoice3 | github.com/FunAudioLLM/CosyVoice | huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512 |
| Fish Audio S2 Pro | github.com/fishaudio/fish-speech | huggingface.co/fishaudio/s2-pro |
| SGLang-Omni (S2 Pro) | github.com/sgl-project/sglang-omni | — |
| VibeVoice | github.com/microsoft/VibeVoice | huggingface.co/microsoft/VibeVoice-Realtime-0.5B |
| Orpheus TTS | github.com/canopyai/Orpheus-TTS | huggingface.co/canopylabs/orpheus-3b-0.1-ft |
| Chatterbox | github.com/resemble-ai/chatterbox | huggingface.co/ResembleAI/chatterbox-turbo |
| Spark-TTS | github.com/SparkAudio/Spark-TTS | huggingface.co/SparkAudio/Spark-TTS-0.5B |
| OuteTTS | github.com/edwko/OuteTTS | huggingface.co/OuteAI/OuteTTS-1.0-0.6B |
| NeuTTS | github.com/neuphonic/neutts | huggingface.co/neuphonic/neutts-air |
| Kokoro | github.com/hexgrad/kokoro | huggingface.co/hexgrad/Kokoro-82M |
| Sesame CSM | github.com/SesameAILabs/csm | — |
| Dia / Dia2 | github.com/nari-labs/dia2 | huggingface.co/nari-labs/Dia2-2B |
| Zonos | github.com/Zyphra/Zonos | huggingface.co/Zyphra/Zonos-v0.1-transformer |
| Higgs Audio | github.com/boson-ai/higgs-audio | huggingface.co/bosonai/higgs-audio-v2-generation-3B-base |
| IndexTTS | github.com/index-tts/index-tts | huggingface.co/IndexTeam/IndexTTS-1.5 |
| Kani-TTS-2 | github.com/nineninesix-ai/kani-tts-2 | huggingface.co/nineninesix/kani-tts-2-en |
| F5-TTS | github.com/SWivid/F5-TTS | huggingface.co/SWivid/F5-TTS |
| GPT-SoVITS | github.com/RVC-Boss/GPT-SoVITS | — |
| MaskGCT / Amphion | github.com/open-mmlab/Amphion | huggingface.co/amphion/MaskGCT |
| Piper TTS | github.com/OHF-Voice/piper1-gpl | — |
| Kokoro-82M | github.com/hexgrad/kokoro | huggingface.co/hexgrad/Kokoro-82M |
| MeloTTS | github.com/myshell-ai/MeloTTS | — |
| KittenTTS | github.com/KittenML/KittenTTS | — |
| Magpie TTS | — | huggingface.co/nvidia/magpie_tts_multilingual_357m |
| Parler-TTS | github.com/huggingface/parler-tts | huggingface.co/parler-tts/parler-tts-mini-v1 |
| sherpa-onnx | github.com/k2-fsa/sherpa-onnx | — |
Classical engines
| Engine | URL |
|---|---|
| eSpeak-NG | github.com/espeak-ng/espeak-ng |
| Festival | github.com/festvox/festival |
| Flite | github.com/festvox/flite |
| MARY TTS | github.com/marytts/marytts |
Frameworks and tools
| Tool | URL |
|---|---|
| Pipecat | github.com/pipecat-ai/pipecat |
| nvidia-pipecat | github.com/pipecat-ai/nemotron-january-2026 |
| LightTTS (CosyVoice optimizer) | github.com/ModelTC/LightTTS |
| Chatterbox-TTS-Server | github.com/devnen/Chatterbox-TTS-Server |
| Kokoro-FastAPI | github.com/remsky/Kokoro-FastAPI |
| LocalOrpheusTTS | github.com/legraphista/LocalOrpheusTTS |
| TTS Arena V2 Leaderboard | huggingface.co/spaces/TTS-AGI/TTS-Arena |
DGX Spark resources
| Resource | URL |
|---|---|
| NVIDIA DGX Spark documentation | docs.nvidia.com/dgx/dgx-spark/ |
| SGLang Spark Docker | lmsysorg/sglang:spark |
| LMSYS Spark review | lmsys.org/blog/2025-10-13-nvidia-dgx-spark/ |
| ARM learning path (llama.cpp) | learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/ |
| NVIDIA Spark playbooks | github.com/NVIDIA/dgx-spark-playbooks |
| Community ML guide | github.com/martimramos/dgx-spark-ml-guide |
| Natolambert setup guide | github.com/natolambert/dgx-spark-setup |
| llama.cpp Spark benchmarks | github.com/DandinPower/llama.cpp_bench |
Key papers
| Paper | URL |
|---|---|
| TADA: Text-Acoustic Dual Alignment | arxiv.org/abs/2602.23068 |
| Qwen3-TTS Technical Report | arxiv.org/abs/2601.15621 |
| Fish Audio S2 Technical Report | arxiv.org/abs/2603.08823 |
| Spark-TTS | arxiv.org/abs/2503.01710 |
| MaskGCT | arxiv.org/abs/2409.00750 |
| F5-TTS | arxiv.org/abs/2410.06261 |
| CosyVoice 3 | arxiv.org/abs/2505.17589 |
| Speculative Decoding for TTS | arxiv.org/abs/2410.21951 |
Related Guides
Complete Offline TTS Guide 2026
Comprehensive guide to offline text-to-speech: 30+ models, optimization techniques, deployment configurations, and DGX Spark compatibility.
TTS Decision Guide
Interactive decision flowchart, model recommendations by use case, and 5-minute quickstart guides for the top TTS models.
Hardware Comparison Guide
Side-by-side comparison of 8 hardware platforms for TTS deployment: DGX Spark, Jetson Orin, Mac Studio, consumer GPUs, and cloud instances.