TTS Model Catalog & Optimization

Complete TTS Model Catalog

4A. LLM-Backbone Models with Voice Cloning

These are the current state-of-the-art for quality, expressiveness, and zero-shot voice cloning. Most require GPU acceleration for real-time performance; the NeuTTS family is the CPU-first exception.

| Model | Developer | Params | Architecture | Codec / Frame Rate | Voice Cloning | Ref Audio | TTFA | RTF | Streaming | Languages | Emotion Control | License |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hume TADA-1B | Hume AI | 1B | Llama 3.2 + flow matching | TADA ~2–3 Hz | ✅ Zero-shot | Varies | Not published | 0.09 | Not yet | EN (1B); 9 lang (3B-ML) | Via ref audio | MIT |
| Hume TADA-3B-ML | Hume AI | 3B | Llama 3.2 + flow matching | TADA ~2–3 Hz | ✅ Zero-shot | Varies | Not published | ~0.09 | Not yet | 9 (ar, ch, de, es, fr, it, ja, pl, pt) | Via ref audio | MIT |
| Qwen3-TTS-12Hz-0.6B | Alibaba Qwen | 0.6B | Qwen3 LLM + MTP | 12.5 Hz, 16-layer multi-codebook | ✅ 3-second | 3s min, 10–15s optimal | 97 ms | Not published | ✅ Bidirectional | 10 | NL instruction | Apache 2.0 |
| Qwen3-TTS-12Hz-1.7B | Alibaba Qwen | 1.7B | Qwen3 LLM + MTP | 12.5 Hz, 16-layer multi-codebook | ✅ 3-second | 3s min, 10–15s optimal | 101 ms | Not published | ✅ Bidirectional | 10 | NL instruction | Apache 2.0 |
| Fun-CosyVoice 3 | Alibaba FunAudioLLM | 0.5B + DiT | LLM + conditional flow matching | 25 Hz | ✅ Few-shot | Few seconds | ~150 ms | Not published | ✅ Bidirectional | 9 + 18 dialects | Instruction-based | Apache 2.0 |
| Fish Audio S2 Pro | Fish Audio | ~4.4B (4B slow + 400M fast) | Dual-AR (Qwen3 backbone) | ~21 Hz, 10-codebook RVQ | ✅ 10–30s ref | 10–30s | ~100 ms (H200) | 0.195 (H200) | ✅ SGLang | 80+ | 15,000+ NL tags | Fish Research License |
| VibeVoice-1.5B | Microsoft | ~1.5B total | Qwen2.5 + σ-VAE + diffusion | σ-VAE 7.5 Hz | ✅ Up to 4 speakers | ~10s | ~200 ms | 0.2 | — | 2 + 9 experimental | Context-aware | MIT |
| VibeVoice-Realtime-0.5B | Microsoft | ~880M total | Qwen2.5 + σ-VAE + diffusion | σ-VAE 7.5 Hz | ❌ Pre-computed only | N/A (7 presets) | ~300 ms | 0.48 (Spark) | ✅ Core design | 2 | Limited | MIT |
| Orpheus TTS 3B | Canopy Labs | 3B | Llama 3.2 + SNAC codec | SNAC 7 tok/frame @ 21 Hz | ✅ Built-in voices; fine-tune for custom | 10–30s | ~200 ms | Competitive | ✅ SNAC chunks | 9 (EN primary) | 8+ emotive tags | Apache 2.0 |
| Chatterbox Turbo | Resemble AI | 350M | AR + 1-step distilled diffusion | Custom | ✅ Zero-shot | 5–10s | Sub-200 ms | Up to 6× RT | Chunked | EN only | [laugh] [cough] tags + exaggeration param | MIT |
| Chatterbox Multilingual | Resemble AI | ~550M | AR + diffusion | Custom | ✅ Zero-shot | 5–10s | Sub-200 ms | ~2× RT | Chunked | 23 | Emotion exaggeration | MIT |
| Spark-TTS | SparkAudio / HKUST | 0.5B | Qwen2.5 + BiCodec | BiCodec 50 Hz | ✅ Zero-shot | 5–30s | Not published | 0.07 (L20, 4× conc.) / 0.14 (single) | ✅ Triton gRPC | 2 (ZH, EN) | Gender/pitch/speed | Apache 2.0 |
| OuteTTS 1.0 (0.6B) | OuteAI | 0.6B | Pure AR (Qwen3) + DAC | DAC ~150 TPS | ✅ Zero-shot | 10–15s | Not published | Competitive | Partial (async) | 14 | Via ref speaker | Apache 2.0 |
| OuteTTS 1.0 (1B) | OuteAI | 1B | Pure AR (Llama 3.2) + DAC | DAC ~150 TPS | ✅ Zero-shot | 10–15s | Not published | Competitive | Partial (async) | 23+ | Via ref speaker | Apache 2.0 |
| NeuTTS Air | Neuphonic | 748M | Qwen2 + NeuCodec | NeuCodec 50 Hz | ✅ 3–15s | 3–15s | Not published | <0.5 (CPU) | ✅ GGUF stream | EN primary | None | Apache 2.0 |
| NeuTTS Nano | Neuphonic | ~229M | Compact LM + NeuCodec | NeuCodec 50 Hz | ✅ Same-language ref | ~10s | Not published | CPU real-time | ✅ GGUF stream | EN, ES, DE, FR | None | Apache 2.0 |
| Sesame CSM-1B | Sesame AI Labs | 1.1B | Dual-AR (Llama) | Mimi 12.5 Hz | ✅ Conversational context | Audio prompt | Not published | Not published | Partial | EN | Conversational prosody | Apache 2.0 |
| Dia2 | Nari Labs | 1B / 2B | AR streaming transformer | Mimi 12.5 Hz | ✅ Audio prefix | ≤10s | Not published | Not published | ✅ Core feature | EN | 16+ nonverbal tags | Apache 2.0 |
| Zonos | Zyphra | 1.6B | Transformer / Transformer + Mamba2 SSM hybrid | DAC ~86 Hz | ✅ 5–30s | 5–30s | Not published | ~2× on 4090 | — | EN primary (+JA, ZH, FR, DE) | Rich (rate, pitch, 5 emotions) | Apache 2.0 |
| Higgs Audio V2 | Boson AI | ~5.8B (3.6B LLM + 2.2B DualFFN) | Llama-3.2-3B + DualFFN | Custom | ✅ ~3s | ~3s | 2–3s e2e | Not published | Limited | 20+ | 75.7% win vs gpt-4o-mini-tts (Emotions) | Apache 2.0 (generation weights) |
| Higgs Audio V2.5 | Boson AI | ~1B condensed | Condensed architecture + GRPO | Custom | ✅ Short ref | — | <150 ms claimed | Faster than V2 | ✅ OpenAI-compat | 20+ | Improved over V2 | Apache 2.0 |
| Kani-TTS-2 | nineninesix.ai | ~400M | LiquidAI LFM2 + NanoCodec | NanoCodec 12.5 fps | ✅ Zero-shot | 10–20s | Not published | 0.2 (RTX 5080) | Via OpenAI server | EN, PT | Accent/dialect | Apache 2.0 |
| F5-TTS | SWivid / SJTU | 335M | Flow-matching DiT | Mel spectrogram + vocoder | ✅ 1–5s | 1–5s | Not published | 0.15 (3090); 0.030 (Fast variant) | ✅ Chunked | 2 (+10 community) | Via ref audio | Code: MIT; weights: CC-BY-NC |
| GPT-SoVITS v2 ProPlus | RVC-Boss | ~210M (v2Pro) | GPT + SoVITS hybrid | Custom | ✅ 5s zero-shot; 1 min best | 5s / 1 min | Not published | 0.014 claimed (4090, optimal) | WebUI | 5 (ZH/EN/JA/KR/Cantonese) | DPO-trained | MIT |
| IndexTTS 1.5 | Bilibili | ~600M GPT + vocoder | GPT AR + BigVGAN2 | Custom | ✅ Zero-shot | 5–10s | Not published | Faster than CV2 claimed | ❌ documented | 2 (ZH, EN) | Limited | Code: Apache 2.0; weights: bilibili license (commercial auth required) |
| IndexTTS-2 | Bilibili | ~600M+ | GPT AR + BigVGAN2 + Qwen3 | Custom | ✅ Zero-shot | 5–10s | Not published | Not published | ❌ documented | 3 (ZH, EN, JP) | Full timbre-emotion disentanglement | bilibili Model Use License (non-commercial for improving AI) |
| MaskGCT | Amphion / CUHK-SZ | ~1B+ total | Masked generative codec transformer | Custom | ✅ 5–10s | 5–10s | Not published | Not published | — | 2 (EN, ZH) | Via prompt audio | CC-BY-NC-4.0 |
| XTTS-v2 | Coqui AI (community) | ~467M | GPT-2 + VQ-VAE + HiFi-GAN | Custom | ✅ 3–6s | 3–6s | <150 ms claimed | Not published | — | 17 | Via ref audio | CPML (non-commercial) |
| Mars5-TTS | Camb.ai | ~1.2B | AR transformer + DDPM | Custom | ✅ 2–12s | 2–12s | Not published | Not published | — | EN only | Via punctuation | AGPL 3.0 |
| StyleTTS2 | Columbia University | ~100M est. | Diffusion + ISTFTNet + WavLM | Mel + ISTFTNet | ✅ 5–10s | 5–10s | Not published | ~2× VITS | Experimental | EN only | Style vectors (strong) | Code: MIT; pretrained: must disclose synthetic |

4B. Models Without Voice Cloning (Preset Voices)

| Model | Developer | Params | Cloning | Languages | Latency | License | Notes |
|---|---|---|---|---|---|---|---|
| Kokoro-82M | Hexgrad | 82M | ❌ 54 voicepacks | 9 | <0.1 RTF | Apache 2.0 | #17 on TTS Arena V2 (highest open-source). ONNX, runs in browser |
| NVIDIA Magpie TTS | NVIDIA | 357M | ❌ (open weights); ✅ via Riva NIM | 9 | ~600 ms/sentence (Spark) | NVIDIA Open Model License | Confirmed on DGX Spark. Frame stacking + MaskGit parallel decode |
| Parler-TTS | HuggingFace | 880M (large) / 400M (mini) | ❌ NL voice description | EN primary | Not published | Apache 2.0 | Control voice via text description instead of reference audio |
| MeloTTS | MyShell.ai | ~80M est. | ❌ Preset speakers | 6 (EN/ES/FR/ZH/JA/KR) | CPU real-time | MIT | 4 English accents. ONNX via sherpa-onnx |
| KittenTTS | KittenML | 15M (nano) / 40M (micro) / 80M (mini) | ❌ Built-in voices | EN | CPU real-time | Apache 2.0 | INT8 checkpoints available. 25 MB smallest variant |
| Piper TTS | Open Home Foundation | VITS-based (varies) | ❌ (fine-tune for custom) | 30+ | CPU real-time | GPL-3.0 | aarch64 binaries. Hundreds of voices. ONNX-based |

4C. Classical / Formant Engines

| Engine | Type | SSML Support | x86/arm64 | Streaming | License | Notes |
|---|---|---|---|---|---|---|
| eSpeak-NG | Formant synthesis | ✅ SSML subset | Both | ✅ stdout | Various (verify repo) | Foundation for many neural model phonemizers |
| Festival | Unit selection + parametric | SABLE markup | Unix platforms | Batch + client/server | X11-type (unrestricted) | Scheme scriptable. Multiple synthesis modes |
| Flite | Lightweight Festival | Basic | x86/ARM/Android | Batch | BSD-like (verify COPYING) | >70× real-time on legacy hardware. Thread-safe |
| MARY TTS | Java framework | ✅ SSML prosody | JVM cross-arch | Client/server | Verify repo | Research-oriented |

4D. Closed-Source Offline-Capable Options

| Product | Voice Cloning | Latency | Languages | Offline Method | DGX Spark | Pricing |
|---|---|---|---|---|---|---|
| NVIDIA Riva/Magpie NIM | ✅ (Zeroshot/Flow via NIM) | ~100–150 ms/utterance | 9+ | Docker container | x86_64 only currently | NVIDIA AI Enterprise license |
| Resemble AI On-Prem | ✅ Rapid clone (10s) + Professional (25+ sentences) | Sub-200 ms | 149+ | Air-gapped Python package | Likely (Python-based) | Enterprise sales |
| Cartesia Sonic 3 | ✅ ~3s | ~40 ms TTFA claimed | 15 | On-premise + on-device | Not confirmed | Enterprise sales |
| Microsoft Embedded Speech | ❌ Neural preset voices | Not published | Multiple | Speech SDK offline | Linux x64+Arm64, macOS x64+Arm64, Win11 | Commercial license |
| Acapela TTS Embedded | — | Not published | Multiple | Vendor SDK | ARM + x86 | Commercial |
| ElevenLabs | ✅ | Low | 29+ | No offline option | N/A | Cloud-only |

4E. Corrections Log

Seven material errors were identified in our initial research and corrected:

Corrections to widely-circulated claims
  1. Fish Audio S2 Pro: Parameter count is ~4.4B, not 5B (4B Slow AR + 400M Fast AR)
  2. VibeVoice Spark benchmark: RTF 0.48 and 766ms pipeline latency are from a single community developer's forum post (logosflux, Jan 2026), not official NVIDIA/Microsoft figures
  3. Spark-TTS RTF: 0.07 is achieved at 4× concurrency on L20; single-stream RTF is ~0.14
  4. Higgs Audio V2 win rate: 75.7% is specifically against gpt-4o-mini-tts (not GPT-4o) in the Emotions category only
  5. CosyVoice 2 language count: CosyVoice 2 supports ~4 primary languages; the "9 languages" applies to Fun-CosyVoice 3
  6. Orpheus variant names: Correct names are Nano (150M), Tiny (400M), Small (1B), Medium (3B) — not "Large." Only the 3B model appears publicly released
  7. IndexTTS 1.x license: Code is Apache 2.0 but model weights carry bilibili license restrictions requiring written authorization for commercial use

DGX Spark Compatibility Status

Confirmed working on GPU (March 2026)

| Model | Method | Performance | Source |
|---|---|---|---|
| VibeVoice-Realtime-0.5B | Native PyTorch (SDPA) | RTF 0.48×, ~766 ms full pipeline | Community benchmark (logosflux, NVIDIA Forums, Jan 2026) |
| NVIDIA Magpie TTS (357M) | NeMo / Pipecat | ~600 ms batch per sentence | Daily.co blog + NVIDIA validation |
| Qwen3-TTS-1.7B | vLLM-Omni 0.16.0 | Working (no RTF published) | NVIDIA Forums (shahizat, Feb 2026) |

Confirmed broken on DGX Spark GPU ("Blackwell Noise")

These models produce white noise after ~2 seconds due to FP16/BF16 numerical instability on sm_121 Tensor Cores:

  • Coqui xTTS-v2
  • F5-TTS
  • Kokoro-82M (GPU mode)
  • AllTalk TTS
  • NVIDIA Riva Magpie (in Docker — container compatibility issue)
  • PersonaPlex

Workarounds for broken models

  • CPU fallback: Many models work correctly on the Grace ARM CPU (which is quite capable at 4 GHz). Use device="cpu" and accept slower generation
  • Replace torchaudio with soundfile for audio I/O
  • Build Flash Attention 2 from source (~15 min with MAX_JOBS=4)
  • Use SDPA fallback instead of Flash Attention (VibeVoice works with SDPA and may actually be faster on Blackwell)

Not yet tested but likely compatible

  • TADA-1B/3B (raw PyTorch, Llama backbone — should work with cu130 wheels)
  • Fun-CosyVoice 3 (PyTorch-based, similar architecture to working Qwen3-TTS)
  • Orpheus 3B via llama.cpp GGUF (llama.cpp has excellent DGX Spark support)
  • OuteTTS 1.0 via llama.cpp GGUF (same inference path as Orpheus)
  • NeuTTS Air/Nano via llama.cpp GGUF (CPU-first architecture)
  • Sesame CSM-1B (HuggingFace Transformers, should work with BF16)

Key DGX Spark deployment resources

| Resource | URL |
|---|---|
| VibeVoice pipeline on Spark | forums.developer.nvidia.com/t/dgx-spark-vibevoice-tts-streaming-voice-pipeline-setup-guide/356424 |
| Qwen3-TTS on Spark | forums.developer.nvidia.com/t/running-vllm-omni-for-qwen3-tts-voice-design-voice-clone-on-dgx-spark/361255 |
| xTTS Docker attempt | forums.developer.nvidia.com/t/xtts-in-a-dockercontainer-on-the-dgx-spark/357850 |
| aarch64 compatibility thread | forums.developer.nvidia.com/t/architecture-and-library-compatibility-on-aarch64/350389 |
| SGLang build guide | forums.developer.nvidia.com/t/build-sglang-from-source-on-blackwell-pro-6000-dgx-spark/360785 |
| Community ML setup guide | github.com/martimramos/dgx-spark-ml-guide |
| NVIDIA official playbooks | github.com/NVIDIA/dgx-spark-playbooks |
| Full voice pipeline code | github.com/Logos-Flux/spark-voice-pipeline |
| Daily.co Magpie voice agent | daily.co/blog/building-voice-agents-with-nvidia-open-models/ |

Optimization Techniques Catalog

6A. Model-Level Optimizations

KV Cache Optimization

PagedAttention (vLLM) splits KV cache into fixed-size blocks, reducing waste from 60–80% to under 4% and enabling 2–4× throughput. SGLang's RadixAttention stores KV tensors in a radix tree — when the same voice prompt is reused, cached computation yields 75–95% cache hit rates. Fish Audio S2 Pro achieves 86.4% average prefix-cache hit rate (>90% at peak) via RadixAttention, making repeated voice cloning nearly free. On DGX Spark, NVFP4 KV cache quantization cuts KV memory by ~50%.
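
A minimal sketch of enabling automatic prefix caching in stock vLLM follows; the model ID is illustrative, `VOICE_PROMPT` is a placeholder for an engine-specific encoded voice prompt, and TTS models generally need vLLM-Omni rather than vanilla vLLM:

from vllm import LLM, SamplingParams

VOICE_PROMPT = "<encoded reference-audio tokens> "   # placeholder prefix

llm = LLM(model="Qwen/Qwen3-TTS-12Hz-0.6B-Base",     # illustrative model ID
          dtype="bfloat16",
          enable_prefix_caching=True)                # reuse KV blocks across requests

outputs = llm.generate(
    [VOICE_PROMPT + "First sentence to synthesize.",
     VOICE_PROMPT + "Second request; the shared prefix is served from cache."],
    SamplingParams(max_tokens=256),
)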

Speculative Decoding

A small draft model proposes token sequences that the target model verifies in parallel. On DGX Spark, EAGLE3 speculative decoding delivers ~2× speedup and is supported by SGLang's official Spark Docker image. Speech-specific variants: VADUSA applies Medusa-style draft heads to TTS; PredGen overlaps speculative text/audio generation with user speech for up to 3× perceived latency reduction.
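
To make the mechanics concrete, here is a toy sketch of the draft-and-verify loop. The callables are stand-ins, not real models, and production systems verify all draft tokens in one batched forward pass rather than sequentially:

import random

random.seed(0)
VOCAB = list(range(8))

def draft_propose(prefix, k=4):
    # cheap draft model: propose k tokens (random here; a small LM in practice)
    return [random.choice(VOCAB) for _ in range(k)]

def target_next(prefix):
    # expensive target model's greedy next token (toy deterministic rule)
    return sum(prefix) % len(VOCAB)

def speculative_step(prefix):
    # Accept draft tokens while they match the target; on the first mismatch,
    # keep the target's token and stop. Worst case still emits one token.
    out = list(prefix)
    for tok in draft_propose(out):
        correct = target_next(out)
        out.append(correct)
        if tok != correct:
            break
    return out

seq = [1, 2]
for _ in range(3):
    seq = speculative_step(seq)
print(seq)  # grows by 1 to 4 verified tokens per step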

Quantization

Decode on DGX Spark is bound by the 273 GB/s memory-bandwidth wall, so quantization speeds generation roughly in proportion to the reduction in bytes read per token. INT8 achieves 2–3× speedup with <3% MOS drop. INT4 (W4A16) with Marlin kernels outperforms NVFP4 for decode on DGX Spark — FP8→FP4 disappoints because the bottleneck is memory reads, not compute. Quality hierarchy: BF16 > FP16 > FP8 > INT8 >> INT4 (INT4 can cause prosody artifacts).

INT4 prosody artifacts

While INT4 quantization doubles decode throughput on DGX Spark, it sits at the bottom of the quality hierarchy and can introduce prosody artifacts. Use INT8 where quality is a priority; reserve INT4 for maximum-throughput scenarios where naturalness is secondary.
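
A back-of-envelope sketch shows why bytes-per-token dominates. It assumes each decoded token reads every weight once and ignores KV-cache and activation traffic, which would lower all of these ceilings:

BANDWIDTH_BYTES_PER_S = 273e9      # DGX Spark memory bandwidth

def decode_ceiling_tok_s(n_params, bytes_per_param):
    # bandwidth-bound decode: ceiling = (bytes/s) / (bytes read per token)
    return BANDWIDTH_BYTES_PER_S / (n_params * bytes_per_param)

for label, bpp in [("BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(label, round(decode_ceiling_tok_s(0.6e9, bpp)), "tok/s ceiling")
# BF16 ~228, INT8 ~455, INT4 ~910 for a 0.6B-param TTS backbone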

torch.compile

mode="reduce-overhead" uses CUDA graphs and delivers 30–50% latency reduction. However, on DGX Spark sm_121a, CUDA graph capture via Triton frequently fails — vLLM requires enforce_eager=True. Raw CUDA graphs via SGLang work better (17 ms first-request overhead vs torch.compile's ~4 second warm-up).

Flash Attention

FA2 reduces attention memory from O(n²) to O(n). For Qwen3-TTS specifically, without FlashAttention, inference is 10× slower even on RTX 5090. FA3 is incompatible with Blackwell; use FA2 built from source or SDPA fallback.

Flash Attention 3 incompatible with Blackwell

Flash Attention 3 (FA3) does not work on Blackwell (sm_121). Build Flash Attention 2 from source or use the PyTorch SDPA fallback. Without any FlashAttention, Qwen3-TTS is 10× slower.
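
For HuggingFace Transformers models, the SDPA fallback is a one-line switch; a sketch with a placeholder model ID:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "your-org/tts-llm-backbone",          # placeholder model ID
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",           # PyTorch fused attention, no flash-attn
)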

Multi-Token Prediction (MTP)

Qwen3-TTS-12Hz predicts all 16 codebook layers simultaneously per frame via hierarchical MTP. Self-distilled MTP achieves >3× faster decoding with <5% accuracy drop. This is the key enabler for multi-codebook architectures, which would otherwise need one decode step per codebook per frame.

Continuous Batching

Orca-style iteration-level scheduling lets requests join/leave at each generation step. On DGX Spark, throughput scales nearly linearly: Llama 8B goes from 20.5 tok/s at BS=1 to 368 tok/s at BS=32 via SGLang.

Model Distillation

Chatterbox Turbo's 10→1 step diffusion distillation eliminates the decoder bottleneck. Fast F5-TTS reduces diffusion steps from 32 to 7, achieving RTF 0.030 (4× speedup).

6B. Pipeline and Streaming Optimizations

Sentence-Level Chunking

Split LLM output at punctuation; submit first sentence to TTS while LLM generates the rest. Deepgram explicitly warns against splitting mid-sentence (produces choppy prosody).
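
A minimal sketch of the pattern, with `tts_submit` standing in for whatever enqueues text to your TTS engine:

import re

SENT_END = re.compile(r'(?<=[.!?])\s+')

def stream_sentences(llm_tokens, tts_submit):
    # Buffer streamed LLM tokens; flush each complete sentence to TTS immediately.
    buf = ""
    for tok in llm_tokens:
        buf += tok
        parts = SENT_END.split(buf)
        for sentence in parts[:-1]:     # everything but the trailing fragment
            tts_submit(sentence)
        buf = parts[-1]
    if buf.strip():
        tts_submit(buf)                 # flush the final partial sentence

stream_sentences(iter(["Hello", " world. ", "Bye."]), print)  # "Hello world." then "Bye."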

Streaming Text Input

True streaming TTS processes text character-by-character and starts synthesizing 6.5× faster than batch mode. Supported by: Qwen3-TTS, CosyVoice 2/3, VibeVoice-Realtime, Dia2.

Pipeline Stage Overlap

Start TTS once first few LLM tokens are available. Pipecat automates this with sentence-boundary aggregation. The Cresta production approach uses hedging (multiple parallel LLM calls, use whichever returns first).

Voice Prompt Caching

Pre-encode reference audio and reuse across requests. Implementation: SGLang RadixAttention (automatic for S2 Pro), CosyVoice LRU timbre manager, manual dict cache for TADA/others.
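
A minimal sketch of the manual-cache approach, with `encode_fn` standing in for the engine's reference-audio encoder:

from collections import OrderedDict

class VoicePromptCache:
    # Tiny LRU for pre-encoded reference audio (speaker embeddings or codec tokens).
    def __init__(self, max_items=16):
        self._items = OrderedDict()
        self.max_items = max_items

    def get_or_encode(self, ref_path, encode_fn):
        if ref_path in self._items:
            self._items.move_to_end(ref_path)     # mark as recently used
            return self._items[ref_path]
        emb = encode_fn(ref_path)                 # expensive: run encoder once
        self._items[ref_path] = emb
        if len(self._items) > self.max_items:
            self._items.popitem(last=False)       # evict least-recently used
        return emb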

Warm-Up Passes

Non-negotiable for production. torch.compile requires ~3 min warm-up; CUDA graphs need sub-second. Pre-warm vocoder/codec separately — often the cold-start bottleneck.
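
A sketch, with `synthesize` as a placeholder for the engine's text-to-audio call:

def warm_up(synthesize, n=3):
    # Short dummy generations trigger torch.compile / CUDA graph capture and
    # load vocoder weights before the first real request arrives.
    for _ in range(n):
        synthesize("Warm-up utterance; discard the audio.")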

6C. Audio Codec Frame Rate Impact

The single most impactful architectural choice for TTS speed:

| Frame Rate | Tokens per 10 s | Representative Codecs | Speed Impact |
|---|---|---|---|
| 75–150 Hz | 750–1,500 | EnCodec, DAC, APCodec | Impractical for real-time AR |
| 50 Hz | 500 | SpeechTokenizer, NeuCodec | Standard; heavy for AR |
| 25 Hz | 250 | Qwen3-TTS-25Hz, CosyVoice | 2× reduction |
| 21 Hz | 210 | NanoCodec (Magpie), Fish S2 | Good balance |
| 12.5 Hz | 125 | Mimi, Qwen3-TTS-12Hz | 4× fewer tokens — sweet spot |
| 7.5 Hz | 75 | σ-VAE (VibeVoice) | Ultra-low; continuous latents |
| 2–3 Hz | 20–30 | TADA | Extreme compression; RTF 0.09 |

Lower frame rate wins at same bitrate

Lower frame rate with more quantization layers outperforms higher frame rate with fewer layers at the same bitrate. Prefer 12.5 Hz codecs (Mimi, Qwen3-TTS-12Hz) over 50 Hz codecs for autoregressive decode on bandwidth-limited hardware like DGX Spark.

6D. Serving Infrastructure Comparison

| Framework | TTS Models | Strength | DGX Spark Status |
|---|---|---|---|
| SGLang | Fish S2 Pro, Orpheus | RadixAttention voice caching | Official Docker; EAGLE3 works |
| vLLM | Qwen3-TTS, Orpheus | Best TTFT; PagedAttention | Requires enforce_eager=True |
| TensorRT-LLM | CosyVoice (4× speedup), Spark-TTS, Magpie | Deepest NVIDIA optimization | Fully supported |
| llama.cpp | OuteTTS, Orpheus, NeuTTS | GGUF; ARM64 native | Excellent; official benchmarks |
| ONNX Runtime | Kokoro, Piper, Chatterbox Turbo | Cross-platform; lightweight | ARM64 with CUDA EP |
| Triton | Multi-model pipelines | Ensemble scheduling | Full NVIDIA support |

6E. Conversational "Feel" Optimizations

Filler Word Injection

Reduces perceived latency by 50–70% (from 1.5–2s to 0.5–0.7s). Instruct LLM to start responses with "umm," "so," or "well." Send tiny filler chunk to TTS instantly, stream the rest behind it.
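
A sketch of the pattern; `tts_submit` and `llm_sentences` are placeholders for your pipeline's hooks:

import random

FILLERS = ("Umm,", "So,", "Well,")

def respond(user_text, tts_submit, llm_sentences):
    # Tiny filler chunk reaches the speaker almost instantly; the real answer
    # streams in behind it, hiding LLM time-to-first-token.
    tts_submit(random.choice(FILLERS))
    for sentence in llm_sentences(user_text):
        tts_submit(sentence)

# demo with stand-in hooks:
respond("What's the weather?", print, lambda q: iter(["It looks sunny.", "Highs near 20."]))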

Turn-Taking Prediction

Beyond VAD silence detection. NVIDIA ACE Agent uses two-pass endpointing: send interim transcript at 240 ms pause, final at 800 ms — saving 500–600 ms overall.

Barge-in Support

Stop TTS playback within 200 ms when user interrupts. Pipecat uses Silero VAD + SmartTurn, flushing downstream buffers instantly.

Backchanneling

Filter short utterances ("yeah," "okay") from triggering full responses. NVIDIA ACE Agent filters words under 3 characters.

6F. DGX Spark Specific Optimizations

DGX Spark optimization checklist
  • Use BF16 everywhere: Blackwell has native BF16 tensor cores. Both Qwen3-TTS and Fish S2 Pro were trained in BF16
  • Prefer low frame-rate codecs: 12.5 Hz (Qwen3-TTS-12Hz, Mimi) → only ~12.5 tokens/sec of audio, well within bandwidth
  • INT4 quantization > NVFP4 for decode: Marlin kernel INT4 outperforms FP4 on Spark
  • Disable GUI: sudo systemctl set-default multi-user.target frees ~60 MB GPU memory
  • Unified memory advantage: No cudaMemcpy needed. Models, KV cache, and audio buffers share the same 128 GB pool
  • No NUMA topology: Single SoC, no transfer penalty for CPU offloading of text preprocessing

Model Optimization Compatibility Matrix

Serving Framework Support

| Model | SGLang | vLLM | TRT-LLM | llama.cpp | ONNX RT | Raw PyTorch |
|---|---|---|---|---|---|---|
| TADA-1B/3B | — | — | — | — | — | ✅ |
| Qwen3-TTS-12Hz | 🔜 PR open | ✅ vLLM-Omni | — | — | — | — |
| Fun-CosyVoice3 | — | ✅ v0.9+ | ✅ 4× speedup | — | Partial | — |
| Chatterbox Turbo | — | — | — | — | ✅ | ✅ |
| Fish S2 Pro | ✅ Primary | Compatible | — | — | — | — |
| NeuTTS Air/Nano | — | — | — | ✅ Primary | — | — |
| VibeVoice-RT | — | 🔜 | — | — | — | ✅ |
| Orpheus 3B | ✅ | ✅ | — | ✅ GGUF | — | — |
| Spark-TTS | — | Possible | ✅ Official | — | — | — |
| OuteTTS 1.0 | — | ✅ Exp | — | ✅ Primary | — | — |
| Kokoro-82M | — | — | — | — | ✅ Full | — |
| Sesame CSM-1B | — | — | — | — | — | ✅ HF |
| Dia2 | — | — | — | — | — | ✅ |
| Magpie 357M | — | — | ✅ Via Riva | — | — | ✅ NeMo |

Quantization and Optimization Availability

| Model | Quantization Formats | Voice Cache | CUDA Graphs | torch.compile |
|---|---|---|---|---|
| TADA-1B/3B | FP16/BF16 only | — | Likely | Likely |
| Qwen3-TTS-12Hz | BF16 (mandatory), 4-bit MLX | Implicit KV | Unknown | Unknown |
| Fun-CosyVoice3 | FP16/FP32 | ✅ LRU timbre manager | Unknown | Unknown |
| Chatterbox Turbo | FP16, ONNX FP32 | — | Unknown | Unknown |
| Fish S2 Pro | BF16 only | ✅ 86.4% RadixAttention | ✅ SGLang | --compile |
| NeuTTS Air/Nano | GGUF Q4/Q8, FP16 | ✅ Reference .pt | Unknown | Unknown |
| VibeVoice-RT | FP16/BF16 | ✅ Pre-computed | Unknown | Unknown |
| Orpheus 3B | FP16, FP8, GGUF Q4–Q8, GPTQ, AWQ | Implicit KV | ✅ Via vLLM/SGLang | — |
| Spark-TTS | FP16, 4-bit BnB | Implicit globals | Via TRT-LLM | Unknown |
| OuteTTS 1.0 | GGUF Q2–Q8, FP8, EXL2, FP16 | ✅ JSON profiles | Via backends | Possible |
| Kokoro-82M | ONNX fp32/fp16/q8/q4/q4f16 | Pre-computed voicepacks | — | ⚠️ Issues |
| Sesame CSM-1B | FP16/BF16 | ✅ Conversation ctx | ✅ Documented | ✅ Static cache |
| Dia2 | BF16 only | ✅ Audio prefix | --cuda-graph | use_torch_compile |
| Magpie 357M | FP16, TRT optimized | ✅ Context encoder | Likely via Riva | Unknown |

Voice Pipeline Architecture

STT → LLM → TTS Pipeline

The standard real-time voice agent pipeline has three stages, with the LLM consistently dominating total latency:

User speaks → [VAD] → STT (~50ms) → LLM (~300-700ms) → TTS (~100-600ms) → Audio playback
                                      ↑ DOMINANT        ↑ YOU ARE HERE

Latency budget (conversational threshold: <500ms perceived):

| Setup | Voice-to-Voice Latency |
|---|---|
| GPT-4o Realtime (native audio) | 232–320 ms |
| Optimized cloud pipeline | <500 ms |
| Pipecat local on M-series Mac | <800 ms |
| NVIDIA Blueprint target | 500–1,500 ms |
| VibeVoice pipeline on DGX Spark | ~766 ms |
| Naive cloud pipeline | 3–5 s |

Pipecat Integration

Pipecat is the leading open-source Python framework for real-time voice AI, supporting 25+ TTS backends. Key integration patterns:

  • TextAggregationMode.SENTENCE (default): Natural prosody, sends complete sentences to TTS
  • TextAggregationMode.TOKEN: Lowest latency, may produce unnatural prosody
  • Interruption handling: Silero VAD + SmartTurn → flush all downstream buffers
  • OpenAI-compatible servers bridge any local TTS to Pipecat via OpenAITTSService(base_url=...)
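
A sketch of that bridge pattern; the import path varies across Pipecat versions, and the port and voice name are examples drawn from a Kokoro-FastAPI setup:

from pipecat.services.openai import OpenAITTSService   # import path varies by version

tts = OpenAITTSService(
    base_url="http://localhost:8880/v1",   # e.g. Kokoro-FastAPI's default port
    api_key="local",                       # ignored by local servers
    voice="af_bella",                      # example Kokoro voicepack
)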

Community TTS API servers for Pipecat integration:

| Server | TTS Backend | API Compatibility |
|---|---|---|
| Chatterbox-TTS-Server | Chatterbox (all 3 variants) | OpenAI TTS |
| Kokoro-FastAPI | Kokoro-82M | OpenAI TTS |
| LocalAI | XTTS, Piper, multiple | OpenAI + ElevenLabs |
| local-openai-tts-server | Silero + Piper | OpenAI TTS |
| LocalOrpheusTTS | Orpheus via Ollama | Custom |

DGX Spark Memory Budget for Full Pipeline

| Component | Model Example | Est. Memory |
|---|---|---|
| STT | Whisper Large v3 Turbo | ~3 GB |
| LLM | Llama-3.1-8B Q4 | ~5 GB |
| TTS (primary) | Qwen3-TTS-12Hz-0.6B | ~1.2 GB |
| TTS (alternate) | Orpheus 3B GGUF Q4 | ~2 GB |
| TTS (lightweight) | Kokoro-82M ONNX | ~0.05 GB |
| Framework overhead | KV cache, CUDA context | ~4–8 GB |
| Total | | ~15–20 GB |
| Free for voice profiles, context, upgrades | | ~108–113 GB |

Configuration Reference

Model-Specific Deployment Settings

TADA-1B:

# Ultra-low frame rate is the optimization itself
model = TadaForCausalLM.from_pretrained("HumeAI/tada-1b").to("cuda").to(torch.bfloat16)
# Flow matching steps configurable for quality/speed tradeoff
# ~700 seconds continuous audio with 2048 token context

Qwen3-TTS-12Hz (0.6B/1.7B):

# BF16 is MANDATORY — FP16 causes NaN/overflow in MTP sampling
# vLLM-Omni on DGX Spark:
vllm serve Qwen/Qwen3-TTS-12Hz-0.6B-Base --enforce-eager --trust-remote-code \
  --gpu-memory-utilization 0.9 --dtype bfloat16
# Reference audio: 10-15s optimal; append 0.5s silence to end
# 3s minimum for cloning

Qwen3-TTS requires BF16 — FP16 will break it

BF16 is mandatory for Qwen3-TTS-12Hz. Using FP16 causes NaN/overflow in the multi-token prediction (MTP) sampling step. Always pass --dtype bfloat16 when serving via vLLM.

Fun-CosyVoice 3:

# LightTTS framework for optimized serving:
# github.com/ModelTC/LightTTS
# TRT-LLM gives 4× speedup on LLM component
# Shared memory timbre manager with LRU for repeated speakers

Fish S2 Pro (via SGLang):

# Build sgl-kernel for Spark:
TORCH_CUDA_ARCH_LIST="12.1a" MAX_JOBS=4 CMAKE_BUILD_PARALLEL_LEVEL=1 \
  python -m build --wheel --no-isolation
# CRITICAL: BF16 RoPE truncation patch required
# --mem-fraction-static 0.15 (keeps allocation small for shared memory)
# --max-running-requests 2 (bandwidth-limited on Spark)

Orpheus 3B (via llama.cpp):

# Build for DGX Spark:
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DCMAKE_CUDA_ARCHITECTURES=121
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
# Serving:
./llama-server -m orpheus-3b-q4_k_m.gguf -c 8192 -ngl 99 \
  --cache-type-k q8_0 --cache-type-v q8_0 -fa --mlock
# repetition_penalty >= 1.1 REQUIRED for stable output
# Stream every 7 SNAC tokens (= 1 audio frame)

Chatterbox Turbo:

model = ChatterboxTurboTTS.from_pretrained(device="cuda")
# cfg_weight ~0.3 for slower, more natural rhythm
# exaggeration 0.5 default; higher = more expressive + faster speaking
# ONNX export available for cross-platform

OuteTTS 1.0 (via llama.cpp):

# GGUF Q4-Q8 available
# Windowed repetition penalty (64-token window) is CRITICAL
# Full context penalty breaks output
# llama.cpp llama-tts binary (PR #12794 pending for 1.0)

Essential Environment Variables

# DGX Spark general
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda
export TORCH_CUDA_ARCH_LIST="12.1a"
 
# PyTorch optimization
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
 
# llama.cpp unified memory
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
 
# Disable GUI for headless serving
sudo systemctl set-default multi-user.target

# PyTorch inference optimization (all models; these lines are Python, not shell)
torch.set_grad_enabled(False)
torch.set_float32_matmul_precision('high')  # TF32 on Blackwell
torch.backends.cudnn.benchmark = True  # for vocoder conv layers
torch.backends.cuda.matmul.allow_tf32 = True

Canonical TTSConfig Schema

This portable, offline-first configuration schema can drive multiple engines. It separates static capabilities from request-time controls and deployment settings:

{
  "engine": "qwen3_tts | cosyvoice3 | orpheus | tada | chatterbox | fish_s2 | ...",
  "voice": {
    "locale": "en-US",
    "voice_id": "engine-specific voice name or voicepack ID",
    "model_id": "HuggingFace model hub identifier",
    "model_path": "/path/for/air-gapped/deployment",
    "speaker_ref_wav": "/path/to/reference.wav",
    "speaker_index": 0
  },
  "audio": {
    "format": "wav | raw_pcm_s16le | flac | opus",
    "sample_rate_hz": 24000,
    "channels": 1,
    "normalize": true,
    "volume_gain": 1.0,
    "postfx": { "lufs_target": -18, "limiter": true, "resample_quality": "high" }
  },
  "controls": {
    "rate": 1.0,
    "pitch": 0.0,
    "temperature": 0.7,
    "noise_scale": 0.667,
    "style": "warm and conversational",
    "seed": 42
  },
  "segmentation": {
    "mode": "sentence",
    "max_chars": 500,
    "split_pattern": "\\n+",
    "pause_ms_between_segments": 200
  },
  "execution": {
    "device": "cuda",
    "precision": "bf16",
    "threads": 4,
    "streaming": {
      "enabled": true,
      "chunk_ms": 40,
      "ttfa_budget_ms": 300,
      "buffer_strategy": "low_latency"
    },
    "cache": {
      "model_cache_dir": "/models",
      "warm_start": true,
      "max_models_loaded": 3
    }
  }
}
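
A minimal loader sketch for this schema; the checks shown are illustrative, not exhaustive:

import json

REQUIRED_SECTIONS = ("engine", "voice", "audio", "controls", "segmentation", "execution")

def load_tts_config(path):
    # Load the TTSConfig JSON and fail fast on structural problems.
    with open(path) as f:
        cfg = json.load(f)
    missing = [k for k in REQUIRED_SECTIONS if k not in cfg]
    if missing:
        raise ValueError(f"TTSConfig missing sections: {missing}")
    if cfg["execution"]["precision"] not in {"bf16", "fp16", "fp32", "int8", "int4"}:
        raise ValueError("unsupported precision")
    return cfg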

Validation and Testing

Metrics to measure

| Metric | Definition | Target (conversational) |
|---|---|---|
| TTFA | Time from text input to first playable audio chunk | <300 ms |
| RTF | Wall-clock seconds / audio seconds generated | <0.5 (2× real-time) |
| WER | Word error rate (TTS output → STT → compare to input) | <3% |
| SIM | Speaker similarity (cosine similarity of speaker embeddings) | >0.85 for cloning |
| Peak VRAM | Maximum GPU memory usage during generation | Per-model budget |
| Pipeline latency | Voice-in to voice-out total time | <1,000 ms |

Test suite structure

Unit tests: Text normalization determinism, phoneme override verification, SSML parsing for supported engines.

Integration tests: Golden-audio spectral fingerprinting (not raw waveform — lossy variability exists), cross-engine output duration bounds, architecture smoke tests (x86_64, aarch64).

Perceptual tests: MOS listening panels, ABX preference tests, accessibility evaluation at high speech rates.

Benchmark harness

Capture cold-start vs warm-start TTFA, steady-state RTF at 1/2/4/8 concurrent requests, CPU thread scaling, quantization impact (BF16 vs INT8 vs INT4), and output sample rate normalization. sherpa-onnx publishes per-model RTF tables including Raspberry Pi measurements, providing a useful cross-platform reference.
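
A minimal harness sketch for the first two metrics; `tts_stream` is a placeholder for an engine's streaming API yielding PCM sample chunks:

import time

def bench_stream(tts_stream, text, sample_rate=24000):
    # Returns TTFA (ms) and RTF for one request; run both cold and warm passes.
    t0 = time.perf_counter()
    ttfa, n_samples = None, 0
    for chunk in tts_stream(text):
        if ttfa is None:
            ttfa = time.perf_counter() - t0   # first playable audio chunk
        n_samples += len(chunk)
    wall = time.perf_counter() - t0
    audio_s = n_samples / sample_rate
    return {"ttfa_ms": ttfa * 1e3, "rtf": wall / audio_s, "audio_s": audio_s}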


Model repositories

| Model | GitHub | HuggingFace |
|---|---|---|
| TADA | github.com/HumeAI/tada | huggingface.co/HumeAI/tada-1b |
| Qwen3-TTS | github.com/QwenLM/Qwen3-TTS | huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base |
| CosyVoice / Fun-CosyVoice3 | github.com/FunAudioLLM/CosyVoice | huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512 |
| Fish Audio S2 Pro | github.com/fishaudio/fish-speech | huggingface.co/fishaudio/s2-pro |
| SGLang-Omni (S2 Pro) | github.com/sgl-project/sglang-omni | — |
| VibeVoice | github.com/microsoft/VibeVoice | huggingface.co/microsoft/VibeVoice-Realtime-0.5B |
| Orpheus TTS | github.com/canopyai/Orpheus-TTS | huggingface.co/canopylabs/orpheus-3b-0.1-ft |
| Chatterbox | github.com/resemble-ai/chatterbox | huggingface.co/ResembleAI/chatterbox-turbo |
| Spark-TTS | github.com/SparkAudio/Spark-TTS | huggingface.co/SparkAudio/Spark-TTS-0.5B |
| OuteTTS | github.com/edwko/OuteTTS | huggingface.co/OuteAI/OuteTTS-1.0-0.6B |
| NeuTTS | github.com/neuphonic/neutts | huggingface.co/neuphonic/neutts-air |
| Kokoro | github.com/hexgrad/kokoro | huggingface.co/hexgrad/Kokoro-82M |
| Sesame CSM | github.com/SesameAILabs/csm | — |
| Dia / Dia2 | github.com/nari-labs/dia2 | huggingface.co/nari-labs/Dia2-2B |
| Zonos | github.com/Zyphra/Zonos | huggingface.co/Zyphra/Zonos-v0.1-transformer |
| Higgs Audio | github.com/boson-ai/higgs-audio | huggingface.co/bosonai/higgs-audio-v2-generation-3B-base |
| IndexTTS | github.com/index-tts/index-tts | huggingface.co/IndexTeam/IndexTTS-1.5 |
| Kani-TTS-2 | github.com/nineninesix-ai/kani-tts-2 | huggingface.co/nineninesix/kani-tts-2-en |
| F5-TTS | github.com/SWivid/F5-TTS | huggingface.co/SWivid/F5-TTS |
| GPT-SoVITS | github.com/RVC-Boss/GPT-SoVITS | — |
| MaskGCT / Amphion | github.com/open-mmlab/Amphion | huggingface.co/amphion/MaskGCT |
| Piper TTS | github.com/OHF-Voice/piper1-gpl | — |
| MeloTTS | github.com/myshell-ai/MeloTTS | — |
| KittenTTS | github.com/KittenML/KittenTTS | — |
| Magpie TTS | — | huggingface.co/nvidia/magpie_tts_multilingual_357m |
| Parler-TTS | github.com/huggingface/parler-tts | huggingface.co/parler-tts/parler-tts-mini-v1 |
| sherpa-onnx | github.com/k2-fsa/sherpa-onnx | — |

Classical engines

| Engine | URL |
|---|---|
| eSpeak-NG | github.com/espeak-ng/espeak-ng |
| Festival | github.com/festvox/festival |
| Flite | github.com/festvox/flite |
| MARY TTS | github.com/marytts/marytts |

Frameworks and tools

| Tool | URL |
|---|---|
| Pipecat | github.com/pipecat-ai/pipecat |
| nvidia-pipecat | github.com/pipecat-ai/nemotron-january-2026 |
| LightTTS (CosyVoice optimizer) | github.com/ModelTC/LightTTS |
| Chatterbox-TTS-Server | github.com/devnen/Chatterbox-TTS-Server |
| Kokoro-FastAPI | github.com/remsky/Kokoro-FastAPI |
| LocalOrpheusTTS | github.com/legraphista/LocalOrpheusTTS |
| TTS Arena V2 Leaderboard | huggingface.co/spaces/TTS-AGI/TTS-Arena |

DGX Spark resources

| Resource | URL |
|---|---|
| NVIDIA DGX Spark documentation | docs.nvidia.com/dgx/dgx-spark/ |
| SGLang Spark Docker | lmsysorg/sglang:spark |
| LMSYS Spark review | lmsys.org/blog/2025-10-13-nvidia-dgx-spark/ |
| ARM learning path (llama.cpp) | learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/ |
| NVIDIA Spark playbooks | github.com/NVIDIA/dgx-spark-playbooks |
| Community ML guide | github.com/martimramos/dgx-spark-ml-guide |
| Natolambert setup guide | github.com/natolambert/dgx-spark-setup |
| llama.cpp Spark benchmarks | github.com/DandinPower/llama.cpp_bench |

Key papers

| Paper | URL |
|---|---|
| TADA: Text-Acoustic Dual Alignment | arxiv.org/abs/2602.23068 |
| Qwen3-TTS Technical Report | arxiv.org/abs/2601.15621 |
| Fish Audio S2 Technical Report | arxiv.org/abs/2603.08823 |
| Spark-TTS | arxiv.org/abs/2503.01710 |
| MaskGCT | arxiv.org/abs/2409.00750 |
| F5-TTS | arxiv.org/abs/2410.06261 |
| CosyVoice 3 | arxiv.org/abs/2505.17589 |
| Speculative Decoding for TTS | arxiv.org/abs/2410.21951 |
