Executive Summary
Offline TTS in 2026 spans three distinct tiers of technology:
LLM-backbone neural models (CosyVoice 3, Qwen3-TTS, Fish Audio S2 Pro, Orpheus, TADA) represent the current state of the art. These use large language model architectures to generate discrete audio tokens, which are then decoded by neural codecs into waveforms. They offer zero-shot voice cloning, emotion control, and streaming output, but typically require GPU acceleration and careful dependency management on ARM64 platforms.
Lightweight neural models (Kokoro-82M, Piper, MeloTTS, KittenTTS) prioritize deployment simplicity over maximum expressiveness. These run on CPUs, Raspberry Pis, and browsers via ONNX, with sub-100MB footprints. Most lack voice cloning but offer high-quality preset voices with real-time or faster-than-real-time synthesis.
Classical engines (eSpeak-NG, Festival, Flite) remain relevant for accessibility applications, SSML-driven workflows, and ultra-constrained embedded deployments where neural models are impractical.
For the NVIDIA DGX Spark specifically, the 128 GB unified memory accommodates any model (or several simultaneously), but the ~273 GB/s memory bandwidth — not VRAM capacity — is the primary constraint for autoregressive TTS decode throughput. Models with low codec frame rates (12.5 Hz or lower) perform disproportionately well on this hardware.
Key findings
- Only 3 TTS models are confirmed working on DGX Spark GPU as of March 2026: VibeVoice-Realtime-0.5B, Qwen3-TTS (via vLLM-Omni), and NVIDIA Magpie TTS (via NeMo)
- "Blackwell Noise" — numerical instability in FP16/BF16 on sm_121 Tensor Cores — breaks xTTS-v2, F5-TTS, Kokoro, and several other popular models on DGX Spark GPU
- TADA-1B requires the fewest audio tokens per second of generated speech (2–3 tokens/sec) of any model, making it theoretically optimal for bandwidth-constrained hardware
- Fun-CosyVoice 3 is the recommended successor to CosyVoice 2, with 150ms first-packet latency, 9 languages, and RL-optimized prosody
- Orpheus 3B offers the broadest deployment ecosystem (vLLM, SGLang, llama.cpp, Ollama, ONNX) among voice-cloning-capable models
- Fish Audio S2 Pro leads quality benchmarks (lowest WER on Seed-TTS Eval among all models including closed-source) but requires SGLang source builds for DGX Spark
How to Use This Guide
This guide is organized for two reading patterns:
For model selection: Start with the Models Comparison guide to identify candidates matching your requirements, then check the DGX Spark Compatibility section if deploying on GB10 hardware, and the Optimization Matrix for deployment-specific capabilities.
For pipeline engineering: Start with the Optimization Techniques catalog for the full range of latency-reduction approaches, then the Voice Pipeline Architecture section for STT→LLM→TTS integration patterns, and the Configuration Reference for concrete deployment settings.
Accuracy notes
All claims have been validated against primary sources (GitHub repos, HuggingFace model cards, arXiv papers, NVIDIA Developer Forums). Where claims could not be independently verified, they are marked with source attribution. Seven material corrections were identified during validation — see the Corrections Log in the Models Comparison guide.
Platform Reference: DGX Spark
Hardware specifications
| Specification | Value |
|---|---|
| SoC | NVIDIA GB10 Grace Blackwell Superchip |
| CPU | 20-core ARM (10× Cortex-X925 @ 4.0 GHz + 10× Cortex-A725 @ 2.86 GHz) |
| GPU | Blackwell, 6,144 CUDA cores, 192 Tensor Cores (5th gen), compute capability 12.1 (sm_121a) |
| Memory | 128 GB unified LPDDR5x |
| Memory bandwidth | ~273 GB/s (measured; shared between CPU and GPU) |
| CPU↔GPU interconnect | NVLink-C2C, 900 GB/s bidirectional |
| Compute (BF16) | ~99.8 TFLOPS (measured) |
| Compute (FP8) | ~207.7 TFLOPS |
| Compute (FP4, sparse) | ~1 PFLOPS (marketing; non-sparse ~500 TFLOPS) |
| CUDA version | 13.0 |
| OS | DGX OS (Ubuntu 24.04 base) |
| Architecture | aarch64 (ARM v9.2-A) |
| Power | <100W typical AI workload |
| Form factor | 1.13 liters |
Why bandwidth matters more than compute for TTS
For autoregressive TTS decode, every generated token requires reading the full model weights from memory. The formula:
Max tokens/sec ≈ Memory bandwidth ÷ Model size (bytes)
| Model size (BF16) | Bytes | Max tok/s on Spark | Real-time viable at... |
|---|---|---|---|
| 82M (Kokoro) | 164 MB | ~1,665 | Any codec rate |
| 350M (Chatterbox Turbo) | 700 MB | ~390 | Up to ~50 Hz |
| 500M (CosyVoice/Qwen3-0.6B) | 1 GB | ~273 | Up to ~25 Hz |
| 1B (TADA/Orpheus) | 2 GB | ~136 | Up to ~12.5 Hz |
| 3B (Orpheus-3B) | 6 GB | ~45 | Tight at 21 Hz (SNAC) |
| 4.4B (S2 Pro) | ~9 GB | ~30 | Needs prefix caching |
INT4 weight quantization cuts the bytes read per token to a quarter of BF16, raising these ceilings correspondingly (up to ~4× in the ideal case, before dequantization overhead). On DGX Spark, INT4 (W4A16 with Marlin kernels) actually outperforms NVFP4 for decode throughput.
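The ceiling formula can be sanity-checked directly. A minimal sketch, using the 273 GB/s figure and the model sizes from the table above (the helper name is illustrative, not from any library):

```python
# Back-of-envelope decode ceiling for bandwidth-bound autoregressive TTS.
# Assumes every generated token reads the full weight set from memory; real
# throughput is lower (KV-cache traffic, activations, kernel launch overhead).

BANDWIDTH_GB_S = 273.0  # DGX Spark measured memory bandwidth

def max_tokens_per_sec(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Upper bound: bandwidth / model size in bytes (default BF16 = 2 B/param)."""
    model_size_gb = params_billion * bytes_per_param
    return BANDWIDTH_GB_S / model_size_gb

# BF16 ceilings matching the table rows above:
print(int(max_tokens_per_sec(1.0)))  # 1B model -> 136
print(int(max_tokens_per_sec(3.0)))  # 3B model -> 45
# 4-bit weights (0.5 bytes/param) lift the 1B ceiling roughly 4x:
print(int(max_tokens_per_sec(1.0, bytes_per_param=0.5)))  # -> 546
```

Dividing a model's ceiling by its codec frame rate gives the headroom available for real-time synthesis, which is why low-frame-rate codecs pay off on this hardware.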
Critical software compatibility notes
- PyTorch: Official cu130 index has aarch64 wheels (`pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu130`)
- Flash Attention 3: Incompatible with Blackwell. Use FA2 (build from source) or SDPA fallback
- Triton compiler: Not fully mature for sm_121a. Many compiled kernels fail
- vLLM: Requires `enforce_eager=True` (CUDA graphs fail via Triton)
- SGLang: Use official Docker image `lmsysorg/sglang:spark` or build sgl-kernel with `TORCH_CUDA_ARCH_LIST="12.1a"`
- torchaudio: Replace with `soundfile` for audio I/O — torchaudio CUDA kernels often fail on aarch64
- "Blackwell Noise": FP16/BF16 numerical instability on sm_121 Tensor Cores causes white noise after ~2 seconds in several TTS frameworks
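The notes above can be assembled into an environment bootstrap along these lines. This is a sketch built from the commands quoted above; the sgl-kernel build invocation is an assumption, and versions and image tags may drift:

```shell
# PyTorch aarch64 wheels from the official cu130 index
pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu130

# Audio I/O via soundfile instead of torchaudio
pip install soundfile

# SGLang: pull the official Spark image...
docker pull lmsysorg/sglang:spark
# ...or, from an sgl-kernel checkout, build targeting sm_121a
# (build command is illustrative; consult the sgl-kernel README)
TORCH_CUDA_ARCH_LIST="12.1a" pip install -v .

# vLLM: remember to pass enforce_eager=True when constructing the engine
```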
Related Guides
TTS Model Catalog & Optimization
Complete catalog of 30+ offline TTS models with DGX Spark compatibility, optimization techniques, voice pipeline architecture, and deployment configurations.
TTS Decision Guide
Interactive decision flowchart, model recommendations by use case, and 5-minute quickstart guides for the top TTS models.
Hardware Comparison Guide
Side-by-side comparison of 8 hardware platforms for TTS deployment: DGX Spark, Jetson Orin, Mac Studio, consumer GPUs, and cloud instances.