TTS Model Catalog & Optimization

Complete TTS Model Catalog

4A. LLM-Backbone Models with Voice Cloning

These are the current state-of-the-art for quality, expressiveness, and zero-shot voice cloning. Most require GPU acceleration for real-time performance; the NeuTTS family is the CPU-first exception.

| Model | Developer | Params | Architecture | Codec / Frame Rate | Voice Cloning | Ref Audio | TTFA | RTF | Streaming | Languages | Emotion Control | License |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hume TADA-1B | Hume AI | 1B | Llama 3.2 + flow matching | TADA ~2–3 Hz | ✅ Zero-shot | Varies | Not published | 0.09 | Not yet | EN (1B); 9 lang (3B-ML) | Via ref audio | MIT |
| Hume TADA-3B-ML | Hume AI | 3B | Llama 3.2 + flow matching | TADA ~2–3 Hz | ✅ Zero-shot | Varies | Not published | ~0.09 | Not yet | 9 (ar, ch, de, es, fr, it, ja, pl, pt) | Via ref audio | MIT |
| Qwen3-TTS-12Hz-0.6B | Alibaba Qwen | 0.6B | Qwen3 LLM + MTP | 12.5 Hz, 16-layer multi-codebook | ✅ 3-second | 3s min, 10–15s optimal | 97 ms | Not published | ✅ Bidirectional | 10 | NL instruction | Apache 2.0 |
| Qwen3-TTS-12Hz-1.7B | Alibaba Qwen | 1.7B | Qwen3 LLM + MTP | 12.5 Hz, 16-layer multi-codebook | ✅ 3-second | 3s min, 10–15s optimal | 101 ms | Not published | ✅ Bidirectional | 10 | NL instruction | Apache 2.0 |
| Fun-CosyVoice 3 | Alibaba FunAudioLLM | 0.5B + DiT | LLM + conditional flow matching | 25 Hz | ✅ Few-shot | Few seconds | ~150 ms | Not published | ✅ Bidirectional | 9 + 18 dialects | Instruction-based | Apache 2.0 |
| Fish Audio S2 Pro | Fish Audio | ~4.4B (4B slow + 400M fast) | Dual-AR (Qwen3 backbone) | ~21 Hz, 10-codebook RVQ | ✅ 10–30s ref | 10–30s | ~100 ms (H200) | 0.195 (H200) | ✅ SGLang | 80+ | 15,000+ NL tags | Fish Research License |
| VibeVoice-1.5B | Microsoft | ~1.5B total | Qwen2.5 + σ-VAE + diffusion | σ-VAE 7.5 Hz | ✅ Up to 4 speakers | ~10s | ~200 ms | 0.2 | — | 2 + 9 experimental | Context-aware | MIT |
| VibeVoice-Realtime-0.5B | Microsoft | ~880M total | Qwen2.5 + σ-VAE + diffusion | σ-VAE 7.5 Hz | ❌ Pre-computed only | N/A (7 presets) | ~300 ms | 0.48 (Spark) | ✅ Core design | 2 | Limited | MIT |
| Orpheus TTS 3B | Canopy Labs | 3B | Llama 3.2 + SNAC codec | SNAC 7 tok/frame @ 21 Hz | ✅ Built-in voices; fine-tune for custom | 10–30s | ~200 ms | Competitive | ✅ SNAC chunks | 9 (EN primary) | 8+ emotive tags | Apache 2.0 |
| Chatterbox Turbo | Resemble AI | 350M | AR + 1-step distilled diffusion | Custom | ✅ Zero-shot | 5–10s | Sub-200 ms | Up to 6× RT | Chunked | EN only | [laugh] [cough] tags + exaggeration param | MIT |
| Chatterbox Multilingual | Resemble AI | ~550M | AR + diffusion | Custom | ✅ Zero-shot | 5–10s | Sub-200 ms | ~2× RT | Chunked | 23 | Emotion exaggeration | MIT |
| Spark-TTS | SparkAudio / HKUST | 0.5B | Qwen2.5 + BiCodec | BiCodec 50 Hz | ✅ Zero-shot | 5–30s | Not published | 0.07 (L20, 4× conc.) / 0.14 (single) | ✅ Triton gRPC | 2 (ZH, EN) | Gender/pitch/speed | Apache 2.0 |
| OuteTTS 1.0 (0.6B) | OuteAI | 0.6B | Pure AR (Qwen3) + DAC | DAC ~150 TPS | ✅ Zero-shot | 10–15s | Not published | Competitive | Partial (async) | 14 | Via ref speaker | Apache 2.0 |
| OuteTTS 1.0 (1B) | OuteAI | 1B | Pure AR (Llama 3.2) + DAC | DAC ~150 TPS | ✅ Zero-shot | 10–15s | Not published | Competitive | Partial (async) | 23+ | Via ref speaker | Apache 2.0 |
| NeuTTS Air | Neuphonic | 748M | Qwen2 + NeuCodec | NeuCodec 50 Hz | ✅ 3–15s | 3–15s | Not published | <0.5 (CPU) | ✅ GGUF stream | EN primary | None | Apache 2.0 |
| NeuTTS Nano | Neuphonic | ~229M | Compact LM + NeuCodec | NeuCodec 50 Hz | ✅ Same-language ref | ~10s | Not published | CPU real-time | ✅ GGUF stream | EN, ES, DE, FR | None | Apache 2.0 |
| Sesame CSM-1B | Sesame AI Labs | 1.1B | Dual-AR (Llama) | Mimi 12.5 Hz | ✅ Conversational context | Audio prompt | Not published | Not published | Partial | EN | Conversational prosody | Apache 2.0 |
| Dia2 | Nari Labs | 1B / 2B | AR streaming transformer | Mimi 12.5 Hz | ✅ Audio prefix | ≤10s | Not published | Not published | ✅ Core feature | EN | 16+ nonverbal tags | Apache 2.0 |
| Zonos | Zyphra | 1.6B | Transformer / Transformer + Mamba2 SSM hybrid | DAC ~86 Hz | ✅ 5–30s | 5–30s | Not published | ~2× on 4090 | — | EN primary (+JA, ZH, FR, DE) | Rich (rate, pitch, 5 emotions) | Apache 2.0 |
| Higgs Audio V2 | Boson AI | ~5.8B (3.6B LLM + 2.2B DualFFN) | Llama-3.2-3B + DualFFN | Custom | ✅ ~3s | ~3s | 2–3s e2e | Not published | Limited | 20+ | 75.7% win vs gpt-4o-mini-tts (Emotions) | Apache 2.0 (generation weights) |
| Higgs Audio V2.5 | Boson AI | ~1B condensed | Condensed architecture + GRPO | Custom | ✅ Short ref | — | <150 ms claimed | Faster than V2 | ✅ OpenAI-compat | 20+ | Improved over V2 | Apache 2.0 |
| Kani-TTS-2 | nineninesix.ai | ~400M | LiquidAI LFM2 + NanoCodec | NanoCodec 12.5 fps | ✅ Zero-shot | 10–20s | Not published | 0.2 (RTX 5080) | Via OpenAI server | EN, PT | Accent/dialect | Apache 2.0 |
| F5-TTS | SWivid / SJTU | 335M | Flow-matching DiT | Mel spectrogram + vocoder | ✅ 1–5s | 1–5s | Not published | 0.15 (3090); 0.030 (Fast variant) | ✅ Chunked | 2 (+10 community) | Via ref audio | Code: MIT; weights: CC-BY-NC |
| GPT-SoVITS v2 ProPlus | RVC-Boss | ~210M (v2Pro) | GPT + SoVITS hybrid | Custom | ✅ 5s zero-shot; 1 min best | 5s / 1 min | Not published | 0.014 claimed (4090, optimal) | WebUI | 5 (ZH/EN/JA/KR/Cantonese) | DPO-trained | MIT |
| IndexTTS 1.5 | Bilibili | ~600M GPT + vocoder | GPT AR + BigVGAN2 | Custom | ✅ Zero-shot | 5–10s | Not published | Faster than CV2 claimed | ❌ documented | 2 (ZH, EN) | Limited | Code: Apache 2.0; weights: bilibili license (commercial auth required) |
| IndexTTS-2 | Bilibili | ~600M+ | GPT AR + BigVGAN2 + Qwen3 | Custom | ✅ Zero-shot | 5–10s | Not published | Not published | ❌ documented | 3 (ZH, EN, JP) | Full timbre-emotion disentanglement | bilibili Model Use License (non-commercial for improving AI) |
| MaskGCT | Amphion / CUHK-SZ | ~1B+ total | Masked generative codec transformer | Custom | ✅ 5–10s | 5–10s | Not published | Not published | — | 2 (EN, ZH) | Via prompt audio | CC-BY-NC-4.0 |
| XTTS-v2 | Coqui AI (community) | ~467M | GPT-2 + VQ-VAE + HiFi-GAN | Custom | ✅ 3–6s | 3–6s | <150 ms claimed | Not published | — | 17 | Via ref audio | CPML (non-commercial) |
| Mars5-TTS | Camb.ai | ~1.2B | AR transformer + DDPM | Custom | ✅ 2–12s | 2–12s | Not published | Not published | — | EN only | Via punctuation | AGPL 3.0 |
| StyleTTS2 | Columbia University | ~100M est. | Diffusion + ISTFTNet + WavLM | Mel + ISTFTNet | ✅ 5–10s | 5–10s | Not published | ~2× VITS | Experimental | EN only | Style vectors (strong) | Code: MIT; pretrained: must disclose synthetic |

4B. Models Without Voice Cloning (Preset Voices)

| Model | Developer | Params | Cloning | Languages | Latency | License | Notes |
|---|---|---|---|---|---|---|---|
| Kokoro-82M | Hexgrad | 82M | ❌ 54 voicepacks | 9 | <0.1 RTF | Apache 2.0 | #17 on TTS Arena V2 (highest open-source). ONNX, runs in browser |
| NVIDIA Magpie TTS | NVIDIA | 357M | ❌ (open weights); ✅ via Riva NIM | 9 | ~600 ms/sentence (Spark) | NVIDIA Open Model License | Confirmed on DGX Spark. Frame stacking + MaskGit parallel decode |
| Parler-TTS | HuggingFace | 880M (large) / 400M (mini) | ❌ NL voice description | EN primary | Not published | Apache 2.0 | Control voice via text description instead of reference audio |
| MeloTTS | MyShell.ai | ~80M est. | ❌ Preset speakers | 6 (EN/ES/FR/ZH/JA/KR) | CPU real-time | MIT | 4 English accents. ONNX via sherpa-onnx |
| KittenTTS | KittenML | 15M (nano) / 40M (micro) / 80M (mini) | ❌ Built-in voices | EN | CPU real-time | Apache 2.0 | INT8 checkpoints available. 25 MB smallest variant |
| Piper TTS | Open Home Foundation | VITS-based (varies) | ❌ (fine-tune for custom) | 30+ | CPU real-time | GPL-3.0 | aarch64 binaries. Hundreds of voices. ONNX-based |

4C. Classical / Formant Engines

| Engine | Type | SSML Support | x86/arm64 | Streaming | License | Notes |
|---|---|---|---|---|---|---|
| eSpeak-NG | Formant synthesis | ✅ SSML subset | Both | ✅ stdout | Various (verify repo) | Foundation for many neural model phonemizers |
| Festival | Unit selection + parametric | SABLE markup | Unix platforms | Batch + client/server | X11-type (unrestricted) | Scheme scriptable. Multiple synthesis modes |
| Flite | Lightweight Festival | Basic | x86/ARM/Android | Batch | BSD-like (verify COPYING) | >70× real-time on legacy hardware. Thread-safe |
| MARY TTS | Java framework | ✅ SSML prosody | JVM cross-arch | Client/server | Verify repo | Research-oriented |

4D. Closed-Source Offline-Capable Options

| Product | Voice Cloning | Latency | Languages | Offline Method | DGX Spark | Pricing |
|---|---|---|---|---|---|---|
| NVIDIA Riva/Magpie NIM | ✅ (Zeroshot/Flow via NIM) | ~100–150 ms/utterance | 9+ | Docker container | x86_64 only currently | NVIDIA AI Enterprise license |
| Resemble AI On-Prem | ✅ Rapid clone (10s) + Professional (25+ sentences) | Sub-200 ms | 149+ | Air-gapped Python package | Likely (Python-based) | Enterprise sales |
| Cartesia Sonic 3 | ✅ ~3s | ~40 ms TTFA claimed | 15 | On-premise + on-device | Not confirmed | Enterprise sales |
| Microsoft Embedded Speech | ❌ Neural preset voices | Not published | Multiple | Speech SDK offline | Linux x64+Arm64, macOS x64+Arm64, Win11 | Commercial license |
| Acapela TTS Embedded | — | Not published | Multiple | Vendor SDK | ARM + x86 | Commercial |
| ElevenLabs | ✅ | Low | 29+ | No offline option | N/A | Cloud-only |

4E. Corrections Log

Seven material errors were identified in our initial research and corrected:

Corrections to widely-circulated claims
  1. Fish Audio S2 Pro: Parameter count is ~4.4B, not 5B (4B Slow AR + 400M Fast AR)
  2. VibeVoice Spark benchmark: RTF 0.48 and 766ms pipeline latency are from a single community developer's forum post (logosflux, Jan 2026), not official NVIDIA/Microsoft figures
  3. Spark-TTS RTF: 0.07 is achieved at 4× concurrency on L20; single-stream RTF is ~0.14
  4. Higgs Audio V2 win rate: 75.7% is specifically against gpt-4o-mini-tts (not GPT-4o) in the Emotions category only
  5. CosyVoice 2 language count: CosyVoice 2 supports ~4 primary languages; the "9 languages" applies to Fun-CosyVoice 3
  6. Orpheus variant names: Correct names are Nano (150M), Tiny (400M), Small (1B), Medium (3B) — not "Large." Only the 3B model appears publicly released
  7. IndexTTS 1.x license: Code is Apache 2.0 but model weights carry bilibili license restrictions requiring written authorization for commercial use

DGX Spark Compatibility Status

Confirmed working on GPU (March 2026)

| Model | Method | Performance | Source |
|---|---|---|---|
| VibeVoice-Realtime-0.5B | Native PyTorch (SDPA) | RTF 0.48×, ~766 ms full pipeline | Community benchmark (logosflux, NVIDIA Forums, Jan 2026) |
| NVIDIA Magpie TTS (357M) | NeMo / Pipecat | ~600 ms batch per sentence | Daily.co blog + NVIDIA validation |
| Qwen3-TTS-1.7B | vLLM-Omni 0.16.0 | Working (no RTF published) | NVIDIA Forums (shahizat, Feb 2026) |

Confirmed broken on DGX Spark GPU ("Blackwell Noise")

These models produce white noise after ~2 seconds due to FP16/BF16 numerical instability on sm_121 Tensor Cores:

  • Coqui xTTS-v2
  • F5-TTS
  • Kokoro-82M (GPU mode)
  • AllTalk TTS
  • NVIDIA Riva Magpie (in Docker — container compatibility issue)
  • PersonaPlex

Workarounds for broken models

  • CPU fallback: Many models work correctly on the Grace ARM CPU (which is quite capable at 4 GHz). Use device="cpu" and accept slower generation
  • Replace torchaudio with soundfile for audio I/O
  • Build Flash Attention 2 from source (~15 min with MAX_JOBS=4)
  • Use SDPA fallback instead of Flash Attention (VibeVoice works with SDPA and may actually be faster on Blackwell)

Not yet tested but likely compatible

  • TADA-1B/3B (raw PyTorch, Llama backbone — should work with cu130 wheels)
  • Fun-CosyVoice 3 (PyTorch-based, similar architecture to working Qwen3-TTS)
  • Orpheus 3B via llama.cpp GGUF (llama.cpp has excellent DGX Spark support)
  • OuteTTS 1.0 via llama.cpp GGUF (same inference path as Orpheus)
  • NeuTTS Air/Nano via llama.cpp GGUF (CPU-first architecture)
  • Sesame CSM-1B (HuggingFace Transformers, should work with BF16)

Key DGX Spark deployment resources

| Resource | URL |
|---|---|
| VibeVoice pipeline on Spark | forums.developer.nvidia.com/t/dgx-spark-vibevoice-tts-streaming-voice-pipeline-setup-guide/356424 |
| Qwen3-TTS on Spark | forums.developer.nvidia.com/t/running-vllm-omni-for-qwen3-tts-voice-design-voice-clone-on-dgx-spark/361255 |
| xTTS Docker attempt | forums.developer.nvidia.com/t/xtts-in-a-dockercontainer-on-the-dgx-spark/357850 |
| aarch64 compatibility thread | forums.developer.nvidia.com/t/architecture-and-library-compatibility-on-aarch64/350389 |
| SGLang build guide | forums.developer.nvidia.com/t/build-sglang-from-source-on-blackwell-pro-6000-dgx-spark/360785 |
| Community ML setup guide | github.com/martimramos/dgx-spark-ml-guide |
| NVIDIA official playbooks | github.com/NVIDIA/dgx-spark-playbooks |
| Full voice pipeline code | github.com/Logos-Flux/spark-voice-pipeline |
| Daily.co Magpie voice agent | daily.co/blog/building-voice-agents-with-nvidia-open-models/ |

Optimization Techniques Catalog

6A. Model-Level Optimizations

KV Cache Optimization

PagedAttention (vLLM) splits KV cache into fixed-size blocks, reducing waste from 60–80% to under 4% and enabling 2–4× throughput. SGLang's RadixAttention stores KV tensors in a radix tree — when the same voice prompt is reused, cached computation yields 75–95% cache hit rates. Fish Audio S2 Pro achieves 86.4% average prefix-cache hit rate (>90% at peak) via RadixAttention, making repeated voice cloning nearly free. On DGX Spark, NVFP4 KV cache quantization cuts KV memory by ~50%.
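
A minimal sketch of enabling automatic prefix caching in stock vLLM follows; the model ID is illustrative, `VOICE_PROMPT` is a placeholder for an engine-specific encoded voice prompt, and TTS models generally need vLLM-Omni rather than vanilla vLLM:

from vllm import LLM, SamplingParams

VOICE_PROMPT = "<encoded reference-audio tokens> "   # placeholder prefix

llm = LLM(model="Qwen/Qwen3-TTS-12Hz-0.6B-Base",     # illustrative model ID
          dtype="bfloat16",
          enable_prefix_caching=True)                # reuse KV blocks across requests

outputs = llm.generate(
    [VOICE_PROMPT + "First sentence to synthesize.",
     VOICE_PROMPT + "Second request; the shared prefix is served from cache."],
    SamplingParams(max_tokens=256),
)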

Speculative Decoding

A small draft model proposes token sequences that the target model verifies in parallel. On DGX Spark, EAGLE3 speculative decoding delivers ~2× speedup and is supported by SGLang's official Spark Docker image. Speech-specific variants: VADUSA applies Medusa-style draft heads to TTS; PredGen overlaps speculative text/audio generation with user speech for up to 3× perceived latency reduction.
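
To make the mechanics concrete, here is a toy sketch of the draft-and-verify loop. The callables are stand-ins, not real models, and production systems verify all draft tokens in one batched forward pass rather than sequentially:

import random

random.seed(0)
VOCAB = list(range(8))

def draft_propose(prefix, k=4):
    # cheap draft model: propose k tokens (random here; a small LM in practice)
    return [random.choice(VOCAB) for _ in range(k)]

def target_next(prefix):
    # expensive target model's greedy next token (toy deterministic rule)
    return sum(prefix) % len(VOCAB)

def speculative_step(prefix):
    # Accept draft tokens while they match the target; on the first mismatch,
    # keep the target's token and stop. Worst case still emits one token.
    out = list(prefix)
    for tok in draft_propose(out):
        correct = target_next(out)
        out.append(correct)
        if tok != correct:
            break
    return out

seq = [1, 2]
for _ in range(3):
    seq = speculative_step(seq)
print(seq)  # grows by 1 to 4 verified tokens per step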

Quantization

Decode on DGX Spark is bound by the 273 GB/s memory-bandwidth wall, so quantization speeds generation roughly in proportion to the reduction in bytes read per token. INT8 achieves 2–3× speedup with <3% MOS drop. INT4 (W4A16) with Marlin kernels outperforms NVFP4 for decode on DGX Spark — FP8→FP4 disappoints because the bottleneck is memory reads, not compute. Quality hierarchy: BF16 > FP16 > FP8 > INT8 >> INT4 (INT4 can cause prosody artifacts).

INT4 prosody artifacts

While INT4 quantization doubles decode throughput on DGX Spark, it sits at the bottom of the quality hierarchy and can introduce prosody artifacts. Use INT8 where quality is a priority; reserve INT4 for maximum-throughput scenarios where naturalness is secondary.
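
A back-of-envelope sketch shows why bytes-per-token dominates. It assumes each decoded token reads every weight once and ignores KV-cache and activation traffic, which would lower all of these ceilings:

BANDWIDTH_BYTES_PER_S = 273e9      # DGX Spark memory bandwidth

def decode_ceiling_tok_s(n_params, bytes_per_param):
    # bandwidth-bound decode: ceiling = (bytes/s) / (bytes read per token)
    return BANDWIDTH_BYTES_PER_S / (n_params * bytes_per_param)

for label, bpp in [("BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(label, round(decode_ceiling_tok_s(0.6e9, bpp)), "tok/s ceiling")
# BF16 ~228, INT8 ~455, INT4 ~910 for a 0.6B-param TTS backbone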

torch.compile

mode="reduce-overhead" uses CUDA graphs and delivers 30–50% latency reduction. However, on DGX Spark sm_121a, CUDA graph capture via Triton frequently fails — vLLM requires enforce_eager=True. Raw CUDA graphs via SGLang work better (17 ms first-request overhead vs torch.compile's ~4 second warm-up).

Flash Attention

FA2 reduces attention memory from O(n²) to O(n). For Qwen3-TTS specifically, without FlashAttention, inference is 10× slower even on RTX 5090. FA3 is incompatible with Blackwell; use FA2 built from source or SDPA fallback.

Flash Attention 3 incompatible with Blackwell

Flash Attention 3 (FA3) does not work on Blackwell (sm_121). Build Flash Attention 2 from source or use the PyTorch SDPA fallback. Without any FlashAttention, Qwen3-TTS is 10× slower.
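
For HuggingFace Transformers models, the SDPA fallback is a one-line switch; a sketch with a placeholder model ID:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "your-org/tts-llm-backbone",          # placeholder model ID
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",           # PyTorch fused attention, no flash-attn
)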

Multi-Token Prediction (MTP)

Qwen3-TTS-12Hz predicts all 16 codebook layers simultaneously per frame via hierarchical MTP. Self-distilled MTP achieves >3× faster decoding with <5% accuracy drop. This is the key enabler for multi-codebook architectures, which would otherwise need one decode step per codebook per frame.

Continuous Batching

Orca-style iteration-level scheduling lets requests join/leave at each generation step. On DGX Spark, throughput scales nearly linearly: Llama 8B goes from 20.5 tok/s at BS=1 to 368 tok/s at BS=32 via SGLang.

Model Distillation

Chatterbox Turbo's 10→1 step diffusion distillation eliminates the decoder bottleneck. Fast F5-TTS reduces diffusion steps from 32 to 7, achieving RTF 0.030 (4× speedup).

6B. Pipeline and Streaming Optimizations

Sentence-Level Chunking

Split LLM output at punctuation; submit first sentence to TTS while LLM generates the rest. Deepgram explicitly warns against splitting mid-sentence (produces choppy prosody).
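
A minimal sketch of the pattern, with `tts_submit` standing in for whatever enqueues text to your TTS engine:

import re

SENT_END = re.compile(r'(?<=[.!?])\s+')

def stream_sentences(llm_tokens, tts_submit):
    # Buffer streamed LLM tokens; flush each complete sentence to TTS immediately.
    buf = ""
    for tok in llm_tokens:
        buf += tok
        parts = SENT_END.split(buf)
        for sentence in parts[:-1]:     # everything but the trailing fragment
            tts_submit(sentence)
        buf = parts[-1]
    if buf.strip():
        tts_submit(buf)                 # flush the final partial sentence

stream_sentences(iter(["Hello", " world. ", "Bye."]), print)  # "Hello world." then "Bye."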

Streaming Text Input

True streaming TTS processes text character-by-character and starts synthesizing 6.5× faster than batch mode. Supported by: Qwen3-TTS, CosyVoice 2/3, VibeVoice-Realtime, Dia2.

Pipeline Stage Overlap

Start TTS once first few LLM tokens are available. Pipecat automates this with sentence-boundary aggregation. The Cresta production approach uses hedging (multiple parallel LLM calls, use whichever returns first).

Voice Prompt Caching

Pre-encode reference audio and reuse across requests. Implementation: SGLang RadixAttention (automatic for S2 Pro), CosyVoice LRU timbre manager, manual dict cache for TADA/others.
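
A minimal sketch of the manual-cache approach, with `encode_fn` standing in for the engine's reference-audio encoder:

from collections import OrderedDict

class VoicePromptCache:
    # Tiny LRU for pre-encoded reference audio (speaker embeddings or codec tokens).
    def __init__(self, max_items=16):
        self._items = OrderedDict()
        self.max_items = max_items

    def get_or_encode(self, ref_path, encode_fn):
        if ref_path in self._items:
            self._items.move_to_end(ref_path)     # mark as recently used
            return self._items[ref_path]
        emb = encode_fn(ref_path)                 # expensive: run encoder once
        self._items[ref_path] = emb
        if len(self._items) > self.max_items:
            self._items.popitem(last=False)       # evict least-recently used
        return emb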

Warm-Up Passes

Non-negotiable for production. torch.compile requires ~3 min warm-up; CUDA graphs need sub-second. Pre-warm vocoder/codec separately — often the cold-start bottleneck.
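
A sketch, with `synthesize` as a placeholder for the engine's text-to-audio call:

def warm_up(synthesize, n=3):
    # Short dummy generations trigger torch.compile / CUDA graph capture and
    # load vocoder weights before the first real request arrives.
    for _ in range(n):
        synthesize("Warm-up utterance; discard the audio.")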

6C. Audio Codec Frame Rate Impact

The single most impactful architectural choice for TTS speed:

| Frame Rate | Tokens per 10 s | Representative Codecs | Speed Impact |
|---|---|---|---|
| 75–150 Hz | 750–1,500 | EnCodec, DAC, APCodec | Impractical for real-time AR |
| 50 Hz | 500 | SpeechTokenizer, NeuCodec | Standard; heavy for AR |
| 25 Hz | 250 | Qwen3-TTS-25Hz, CosyVoice | 2× reduction |
| 21 Hz | 210 | NanoCodec (Magpie), Fish S2 | Good balance |
| 12.5 Hz | 125 | Mimi, Qwen3-TTS-12Hz | 4× fewer tokens — sweet spot |
| 7.5 Hz | 75 | σ-VAE (VibeVoice) | Ultra-low; continuous latents |
| 2–3 Hz | 20–30 | TADA | Extreme compression; RTF 0.09 |

Lower frame rate wins at same bitrate

Lower frame rate with more quantization layers outperforms higher frame rate with fewer layers at the same bitrate. Prefer 12.5 Hz codecs (Mimi, Qwen3-TTS-12Hz) over 50 Hz codecs for autoregressive decode on bandwidth-limited hardware like DGX Spark.

6D. Serving Infrastructure Comparison

| Framework | TTS Models | Strength | DGX Spark Status |
|---|---|---|---|
| SGLang | Fish S2 Pro, Orpheus | RadixAttention voice caching | Official Docker; EAGLE3 works |
| vLLM | Qwen3-TTS, Orpheus | Best TTFT; PagedAttention | Requires enforce_eager=True |
| TensorRT-LLM | CosyVoice (4× speedup), Spark-TTS, Magpie | Deepest NVIDIA optimization | Fully supported |
| llama.cpp | OuteTTS, Orpheus, NeuTTS | GGUF; ARM64 native | Excellent; official benchmarks |
| ONNX Runtime | Kokoro, Piper, Chatterbox Turbo | Cross-platform; lightweight | ARM64 with CUDA EP |
| Triton | Multi-model pipelines | Ensemble scheduling | Full NVIDIA support |

6E. Conversational "Feel" Optimizations

Filler Word Injection

Reduces perceived latency by 50–70% (from 1.5–2s to 0.5–0.7s). Instruct LLM to start responses with "umm," "so," or "well." Send tiny filler chunk to TTS instantly, stream the rest behind it.
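
A sketch of the pattern; `tts_submit` and `llm_sentences` are placeholders for your pipeline's hooks:

import random

FILLERS = ("Umm,", "So,", "Well,")

def respond(user_text, tts_submit, llm_sentences):
    # Tiny filler chunk reaches the speaker almost instantly; the real answer
    # streams in behind it, hiding LLM time-to-first-token.
    tts_submit(random.choice(FILLERS))
    for sentence in llm_sentences(user_text):
        tts_submit(sentence)

# demo with stand-in hooks:
respond("What's the weather?", print, lambda q: iter(["It looks sunny.", "Highs near 20."]))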

Turn-Taking Prediction

Beyond VAD silence detection. NVIDIA ACE Agent uses two-pass endpointing: send interim transcript at 240 ms pause, final at 800 ms — saving 500–600 ms overall.

Barge-in Support

Stop TTS playback within 200 ms when user interrupts. Pipecat uses Silero VAD + SmartTurn, flushing downstream buffers instantly.

Backchanneling

Filter short utterances ("yeah," "okay") from triggering full responses. NVIDIA ACE Agent filters words under 3 characters.

6F. DGX Spark Specific Optimizations

DGX Spark optimization checklist
  • Use BF16 everywhere: Blackwell has native BF16 tensor cores. Both Qwen3-TTS and Fish S2 Pro were trained in BF16
  • Prefer low frame-rate codecs: 12.5 Hz (Qwen3-TTS-12Hz, Mimi) → only ~12.5 tokens/sec of audio, well within bandwidth
  • INT4 quantization > NVFP4 for decode: Marlin kernel INT4 outperforms FP4 on Spark
  • Disable GUI: sudo systemctl set-default multi-user.target frees ~60 MB GPU memory
  • Unified memory advantage: No cudaMemcpy needed. Models, KV cache, and audio buffers share the same 128 GB pool
  • No NUMA topology: Single SoC, no transfer penalty for CPU offloading of text preprocessing

Model Optimization Compatibility Matrix

Serving Framework Support

| Model | SGLang | vLLM | TRT-LLM | llama.cpp | ONNX RT | Raw PyTorch |
|---|---|---|---|---|---|---|
| TADA-1B/3B | — | — | — | — | — | ✅ |
| Qwen3-TTS-12Hz | 🔜 PR open | ✅ vLLM-Omni | — | — | — | — |
| Fun-CosyVoice3 | — | ✅ v0.9+ | ✅ 4× speedup | — | Partial | — |
| Chatterbox Turbo | — | — | — | — | ✅ | ✅ |
| Fish S2 Pro | ✅ Primary | Compatible | — | — | — | — |
| NeuTTS Air/Nano | — | — | — | ✅ Primary | — | — |
| VibeVoice-RT | — | 🔜 | — | — | — | ✅ |
| Orpheus 3B | ✅ | ✅ | — | ✅ GGUF | — | — |
| Spark-TTS | — | Possible | ✅ Official | — | — | — |
| OuteTTS 1.0 | — | ✅ Exp | — | ✅ Primary | — | — |
| Kokoro-82M | — | — | — | — | ✅ Full | — |
| Sesame CSM-1B | — | — | — | — | — | ✅ HF |
| Dia2 | — | — | — | — | — | ✅ |
| Magpie 357M | — | — | ✅ Via Riva | — | — | ✅ NeMo |

Quantization and Optimization Availability

| Model | Quantization Formats | Voice Cache | CUDA Graphs | torch.compile |
|---|---|---|---|---|
| TADA-1B/3B | FP16/BF16 only | — | Likely | Likely |
| Qwen3-TTS-12Hz | BF16 (mandatory), 4-bit MLX | Implicit KV | Unknown | Unknown |
| Fun-CosyVoice3 | FP16/FP32 | ✅ LRU timbre manager | Unknown | Unknown |
| Chatterbox Turbo | FP16, ONNX FP32 | — | Unknown | Unknown |
| Fish S2 Pro | BF16 only | ✅ 86.4% RadixAttention | ✅ SGLang | --compile |
| NeuTTS Air/Nano | GGUF Q4/Q8, FP16 | ✅ Reference .pt | Unknown | Unknown |
| VibeVoice-RT | FP16/BF16 | ✅ Pre-computed | Unknown | Unknown |
| Orpheus 3B | FP16, FP8, GGUF Q4–Q8, GPTQ, AWQ | Implicit KV | ✅ Via vLLM/SGLang | — |
| Spark-TTS | FP16, 4-bit BnB | Implicit globals | Via TRT-LLM | Unknown |
| OuteTTS 1.0 | GGUF Q2–Q8, FP8, EXL2, FP16 | ✅ JSON profiles | Via backends | Possible |
| Kokoro-82M | ONNX fp32/fp16/q8/q4/q4f16 | Pre-computed voicepacks | — | ⚠️ Issues |
| Sesame CSM-1B | FP16/BF16 | ✅ Conversation ctx | ✅ Documented | ✅ Static cache |
| Dia2 | BF16 only | ✅ Audio prefix | --cuda-graph | use_torch_compile |
| Magpie 357M | FP16, TRT optimized | ✅ Context encoder | Likely via Riva | Unknown |

Voice Pipeline Architecture

STT → LLM → TTS Pipeline

The standard real-time voice agent pipeline has three stages, with the LLM consistently dominating total latency:

User speaks → [VAD] → STT (~50ms) → LLM (~300-700ms) → TTS (~100-600ms) → Audio playback
                                      ↑ DOMINANT        ↑ YOU ARE HERE

Latency budget (conversational threshold: <500ms perceived):

| Setup | Voice-to-Voice Latency |
|---|---|
| GPT-4o Realtime (native audio) | 232–320 ms |
| Optimized cloud pipeline | <500 ms |
| Pipecat local on M-series Mac | <800 ms |
| NVIDIA Blueprint target | 500–1,500 ms |
| VibeVoice pipeline on DGX Spark | ~766 ms |
| Naive cloud pipeline | 3–5 s |

Pipecat Integration

Pipecat is the leading open-source Python framework for real-time voice AI, supporting 25+ TTS backends. Key integration patterns:

  • TextAggregationMode.SENTENCE (default): Natural prosody, sends complete sentences to TTS
  • TextAggregationMode.TOKEN: Lowest latency, may produce unnatural prosody
  • Interruption handling: Silero VAD + SmartTurn → flush all downstream buffers
  • OpenAI-compatible servers bridge any local TTS to Pipecat via OpenAITTSService(base_url=...)
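
A sketch of that bridge pattern; the import path varies across Pipecat versions, and the port and voice name are examples drawn from a Kokoro-FastAPI setup:

from pipecat.services.openai import OpenAITTSService   # import path varies by version

tts = OpenAITTSService(
    base_url="http://localhost:8880/v1",   # e.g. Kokoro-FastAPI's default port
    api_key="local",                       # ignored by local servers
    voice="af_bella",                      # example Kokoro voicepack
)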

Community TTS API servers for Pipecat integration:

| Server | TTS Backend | API Compatibility |
|---|---|---|
| Chatterbox-TTS-Server | Chatterbox (all 3 variants) | OpenAI TTS |
| Kokoro-FastAPI | Kokoro-82M | OpenAI TTS |
| LocalAI | XTTS, Piper, multiple | OpenAI + ElevenLabs |
| local-openai-tts-server | Silero + Piper | OpenAI TTS |
| LocalOrpheusTTS | Orpheus via Ollama | Custom |

DGX Spark Memory Budget for Full Pipeline

| Component | Model Example | Est. Memory |
|---|---|---|
| STT | Whisper Large v3 Turbo | ~3 GB |
| LLM | Llama-3.1-8B Q4 | ~5 GB |
| TTS (primary) | Qwen3-TTS-12Hz-0.6B | ~1.2 GB |
| TTS (alternate) | Orpheus 3B GGUF Q4 | ~2 GB |
| TTS (lightweight) | Kokoro-82M ONNX | ~0.05 GB |
| Framework overhead | KV cache, CUDA context | ~4–8 GB |
| Total | | ~15–20 GB |
| Free for voice profiles, context, upgrades | | ~108–113 GB |

Configuration Reference

Model-Specific Deployment Settings

TADA-1B:

# Ultra-low frame rate is the optimization itself
model = TadaForCausalLM.from_pretrained("HumeAI/tada-1b").to("cuda").to(torch.bfloat16)
# Flow matching steps configurable for quality/speed tradeoff
# ~700 seconds continuous audio with 2048 token context

Qwen3-TTS-12Hz (0.6B/1.7B):

# BF16 is MANDATORY — FP16 causes NaN/overflow in MTP sampling
# vLLM-Omni on DGX Spark:
vllm serve Qwen/Qwen3-TTS-12Hz-0.6B-Base --enforce-eager --trust-remote-code \
  --gpu-memory-utilization 0.9 --dtype bfloat16
# Reference audio: 10-15s optimal; append 0.5s silence to end
# 3s minimum for cloning

Qwen3-TTS requires BF16 — FP16 will break it

BF16 is mandatory for Qwen3-TTS-12Hz. Using FP16 causes NaN/overflow in the multi-token prediction (MTP) sampling step. Always pass --dtype bfloat16 when serving via vLLM.

Fun-CosyVoice 3:

# LightTTS framework for optimized serving:
# github.com/ModelTC/LightTTS
# TRT-LLM gives 4× speedup on LLM component
# Shared memory timbre manager with LRU for repeated speakers

Fish S2 Pro (via SGLang):

# Build sgl-kernel for Spark:
TORCH_CUDA_ARCH_LIST="12.1a" MAX_JOBS=4 CMAKE_BUILD_PARALLEL_LEVEL=1 \
  python -m build --wheel --no-isolation
# CRITICAL: BF16 RoPE truncation patch required
# --mem-fraction-static 0.15 (keeps allocation small for shared memory)
# --max-running-requests 2 (bandwidth-limited on Spark)

Orpheus 3B (via llama.cpp):

# Build for DGX Spark:
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DCMAKE_CUDA_ARCHITECTURES=121
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
# Serving:
./llama-server -m orpheus-3b-q4_k_m.gguf -c 8192 -ngl 99 \
  --cache-type-k q8_0 --cache-type-v q8_0 -fa --mlock
# repetition_penalty >= 1.1 REQUIRED for stable output
# Stream every 7 SNAC tokens (= 1 audio frame)

Chatterbox Turbo:

model = ChatterboxTurboTTS.from_pretrained(device="cuda")
# cfg_weight ~0.3 for slower, more natural rhythm
# exaggeration 0.5 default; higher = more expressive + faster speaking
# ONNX export available for cross-platform

OuteTTS 1.0 (via llama.cpp):

# GGUF Q4-Q8 available
# Windowed repetition penalty (64-token window) is CRITICAL
# Full context penalty breaks output
# llama.cpp llama-tts binary (PR #12794 pending for 1.0)

Essential Environment Variables

# DGX Spark general
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda
export TORCH_CUDA_ARCH_LIST="12.1a"
 
# PyTorch optimization
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
 
# llama.cpp unified memory
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
 
# Disable GUI for headless serving
sudo systemctl set-default multi-user.target

# PyTorch inference optimization (all models; these lines are Python, not shell)
torch.set_grad_enabled(False)
torch.set_float32_matmul_precision('high')  # TF32 on Blackwell
torch.backends.cudnn.benchmark = True  # for vocoder conv layers
torch.backends.cuda.matmul.allow_tf32 = True

Canonical TTSConfig Schema

This portable, offline-first configuration schema can drive multiple engines. It separates static capabilities from request-time controls and deployment settings:

{
  "engine": "qwen3_tts | cosyvoice3 | orpheus | tada | chatterbox | fish_s2 | ...",
  "voice": {
    "locale": "en-US",
    "voice_id": "engine-specific voice name or voicepack ID",
    "model_id": "HuggingFace model hub identifier",
    "model_path": "/path/for/air-gapped/deployment",
    "speaker_ref_wav": "/path/to/reference.wav",
    "speaker_index": 0
  },
  "audio": {
    "format": "wav | raw_pcm_s16le | flac | opus",
    "sample_rate_hz": 24000,
    "channels": 1,
    "normalize": true,
    "volume_gain": 1.0,
    "postfx": { "lufs_target": -18, "limiter": true, "resample_quality": "high" }
  },
  "controls": {
    "rate": 1.0,
    "pitch": 0.0,
    "temperature": 0.7,
    "noise_scale": 0.667,
    "style": "warm and conversational",
    "seed": 42
  },
  "segmentation": {
    "mode": "sentence",
    "max_chars": 500,
    "split_pattern": "\\n+",
    "pause_ms_between_segments": 200
  },
  "execution": {
    "device": "cuda",
    "precision": "bf16",
    "threads": 4,
    "streaming": {
      "enabled": true,
      "chunk_ms": 40,
      "ttfa_budget_ms": 300,
      "buffer_strategy": "low_latency"
    },
    "cache": {
      "model_cache_dir": "/models",
      "warm_start": true,
      "max_models_loaded": 3
    }
  }
}
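
A minimal loader sketch for this schema; the checks shown are illustrative, not exhaustive:

import json

REQUIRED_SECTIONS = ("engine", "voice", "audio", "controls", "segmentation", "execution")

def load_tts_config(path):
    # Load the TTSConfig JSON and fail fast on structural problems.
    with open(path) as f:
        cfg = json.load(f)
    missing = [k for k in REQUIRED_SECTIONS if k not in cfg]
    if missing:
        raise ValueError(f"TTSConfig missing sections: {missing}")
    if cfg["execution"]["precision"] not in {"bf16", "fp16", "fp32", "int8", "int4"}:
        raise ValueError("unsupported precision")
    return cfg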

Validation and Testing

Metrics to measure

| Metric | Definition | Target (conversational) |
|---|---|---|
| TTFA | Time from text input to first playable audio chunk | <300 ms |
| RTF | Wall-clock seconds / audio seconds generated | <0.5 (2× real-time) |
| WER | Word error rate (TTS output → STT → compare to input) | <3% |
| SIM | Speaker similarity (cosine similarity of speaker embeddings) | >0.85 for cloning |
| Peak VRAM | Maximum GPU memory usage during generation | Per-model budget |
| Pipeline latency | Voice-in to voice-out total time | <1,000 ms |

Test suite structure

Unit tests: Text normalization determinism, phoneme override verification, SSML parsing for supported engines.

Integration tests: Golden-audio spectral fingerprinting (not raw waveform — lossy variability exists), cross-engine output duration bounds, architecture smoke tests (x86_64, aarch64).

Perceptual tests: MOS listening panels, ABX preference tests, accessibility evaluation at high speech rates.

Benchmark harness

Capture cold-start vs warm-start TTFA, steady-state RTF at 1/2/4/8 concurrent requests, CPU thread scaling, quantization impact (BF16 vs INT8 vs INT4), and output sample rate normalization. sherpa-onnx publishes per-model RTF tables including Raspberry Pi measurements, providing a useful cross-platform reference.
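
A minimal harness sketch for the first two metrics; `tts_stream` is a placeholder for an engine's streaming API yielding PCM sample chunks:

import time

def bench_stream(tts_stream, text, sample_rate=24000):
    # Returns TTFA (ms) and RTF for one request; run both cold and warm passes.
    t0 = time.perf_counter()
    ttfa, n_samples = None, 0
    for chunk in tts_stream(text):
        if ttfa is None:
            ttfa = time.perf_counter() - t0   # first playable audio chunk
        n_samples += len(chunk)
    wall = time.perf_counter() - t0
    audio_s = n_samples / sample_rate
    return {"ttfa_ms": ttfa * 1e3, "rtf": wall / audio_s, "audio_s": audio_s}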


Model repositories

| Model | GitHub | HuggingFace |
|---|---|---|
| TADA | github.com/HumeAI/tada | huggingface.co/HumeAI/tada-1b |
| Qwen3-TTS | github.com/QwenLM/Qwen3-TTS | huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base |
| CosyVoice / Fun-CosyVoice3 | github.com/FunAudioLLM/CosyVoice | huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512 |
| Fish Audio S2 Pro | github.com/fishaudio/fish-speech | huggingface.co/fishaudio/s2-pro |
| SGLang-Omni (S2 Pro) | github.com/sgl-project/sglang-omni | — |
| VibeVoice | github.com/microsoft/VibeVoice | huggingface.co/microsoft/VibeVoice-Realtime-0.5B |
| Orpheus TTS | github.com/canopyai/Orpheus-TTS | huggingface.co/canopylabs/orpheus-3b-0.1-ft |
| Chatterbox | github.com/resemble-ai/chatterbox | huggingface.co/ResembleAI/chatterbox-turbo |
| Spark-TTS | github.com/SparkAudio/Spark-TTS | huggingface.co/SparkAudio/Spark-TTS-0.5B |
| OuteTTS | github.com/edwko/OuteTTS | huggingface.co/OuteAI/OuteTTS-1.0-0.6B |
| NeuTTS | github.com/neuphonic/neutts | huggingface.co/neuphonic/neutts-air |
| Kokoro | github.com/hexgrad/kokoro | huggingface.co/hexgrad/Kokoro-82M |
| Sesame CSM | github.com/SesameAILabs/csm | — |
| Dia / Dia2 | github.com/nari-labs/dia2 | huggingface.co/nari-labs/Dia2-2B |
| Zonos | github.com/Zyphra/Zonos | huggingface.co/Zyphra/Zonos-v0.1-transformer |
| Higgs Audio | github.com/boson-ai/higgs-audio | huggingface.co/bosonai/higgs-audio-v2-generation-3B-base |
| IndexTTS | github.com/index-tts/index-tts | huggingface.co/IndexTeam/IndexTTS-1.5 |
| Kani-TTS-2 | github.com/nineninesix-ai/kani-tts-2 | huggingface.co/nineninesix/kani-tts-2-en |
| F5-TTS | github.com/SWivid/F5-TTS | huggingface.co/SWivid/F5-TTS |
| GPT-SoVITS | github.com/RVC-Boss/GPT-SoVITS | — |
| MaskGCT / Amphion | github.com/open-mmlab/Amphion | huggingface.co/amphion/MaskGCT |
| Piper TTS | github.com/OHF-Voice/piper1-gpl | — |
| MeloTTS | github.com/myshell-ai/MeloTTS | — |
| KittenTTS | github.com/KittenML/KittenTTS | — |
| Magpie TTS | — | huggingface.co/nvidia/magpie_tts_multilingual_357m |
| Parler-TTS | github.com/huggingface/parler-tts | huggingface.co/parler-tts/parler-tts-mini-v1 |
| sherpa-onnx | github.com/k2-fsa/sherpa-onnx | — |

Classical engines

| Engine | URL |
|---|---|
| eSpeak-NG | github.com/espeak-ng/espeak-ng |
| Festival | github.com/festvox/festival |
| Flite | github.com/festvox/flite |
| MARY TTS | github.com/marytts/marytts |

Frameworks and tools

| Tool | URL |
|---|---|
| Pipecat | github.com/pipecat-ai/pipecat |
| nvidia-pipecat | github.com/pipecat-ai/nemotron-january-2026 |
| LightTTS (CosyVoice optimizer) | github.com/ModelTC/LightTTS |
| Chatterbox-TTS-Server | github.com/devnen/Chatterbox-TTS-Server |
| Kokoro-FastAPI | github.com/remsky/Kokoro-FastAPI |
| LocalOrpheusTTS | github.com/legraphista/LocalOrpheusTTS |
| TTS Arena V2 Leaderboard | huggingface.co/spaces/TTS-AGI/TTS-Arena |

DGX Spark resources

| Resource | URL |
|---|---|
| NVIDIA DGX Spark documentation | docs.nvidia.com/dgx/dgx-spark/ |
| SGLang Spark Docker | lmsysorg/sglang:spark |
| LMSYS Spark review | lmsys.org/blog/2025-10-13-nvidia-dgx-spark/ |
| ARM learning path (llama.cpp) | learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/ |
| NVIDIA Spark playbooks | github.com/NVIDIA/dgx-spark-playbooks |
| Community ML guide | github.com/martimramos/dgx-spark-ml-guide |
| Natolambert setup guide | github.com/natolambert/dgx-spark-setup |
| llama.cpp Spark benchmarks | github.com/DandinPower/llama.cpp_bench |

Key papers

| Paper | URL |
|---|---|
| TADA: Text-Acoustic Dual Alignment | arxiv.org/abs/2602.23068 |
| Qwen3-TTS Technical Report | arxiv.org/abs/2601.15621 |
| Fish Audio S2 Technical Report | arxiv.org/abs/2603.08823 |
| Spark-TTS | arxiv.org/abs/2503.01710 |
| MaskGCT | arxiv.org/abs/2409.00750 |
| F5-TTS | arxiv.org/abs/2410.06261 |
| CosyVoice 3 | arxiv.org/abs/2505.17589 |
| Speculative Decoding for TTS | arxiv.org/abs/2410.21951 |
