Complete Offline TTS Guide 2026

Executive Summary

Offline TTS in 2026 spans three distinct tiers of technology:

LLM-backbone neural models (CosyVoice 3, Qwen3-TTS, Fish Audio S2 Pro, Orpheus, TADA) represent the current state of the art. These use large language model architectures to generate discrete audio tokens, which are then decoded by neural codecs into waveforms. They offer zero-shot voice cloning, emotion control, and streaming output, but typically require GPU acceleration and careful dependency management on ARM64 platforms.

Lightweight neural models (Kokoro-82M, Piper, MeloTTS, KittenTTS) prioritize deployment simplicity over maximum expressiveness. These run on CPUs, Raspberry Pis, and browsers via ONNX, with sub-100MB footprints. Most lack voice cloning but offer high-quality preset voices with real-time or faster-than-real-time synthesis.

Classical engines (eSpeak-NG, Festival, Flite) remain relevant for accessibility applications, SSML-driven workflows, and ultra-constrained embedded deployments where neural models are impractical.

For the NVIDIA DGX Spark specifically, the 128 GB unified memory accommodates any model (or several simultaneously), but the ~273 GB/s memory bandwidth — not VRAM capacity — is the primary constraint for autoregressive TTS decode throughput. Models with low codec frame rates (12.5 Hz or lower) perform disproportionately well on this hardware.

Key findings

  • Only 3 TTS models are confirmed working on DGX Spark GPU as of March 2026: VibeVoice-Realtime-0.5B, Qwen3-TTS (via vLLM-Omni), and NVIDIA Magpie TTS (via NeMo)
  • "Blackwell Noise" — numerical instability in FP16/BF16 on sm_121 Tensor Cores — breaks xTTS-v2, F5-TTS, Kokoro, and several other popular models on DGX Spark GPU
  • TADA-1B requires the fewest audio tokens per second of output speech (2–3 tokens/sec of audio) of any model, making it theoretically optimal for bandwidth-constrained hardware
  • Fun-CosyVoice 3 is the recommended successor to CosyVoice 2, with 150ms first-packet latency, 9 languages, and RL-optimized prosody
  • Orpheus 3B offers the broadest deployment ecosystem (vLLM, SGLang, llama.cpp, Ollama, ONNX) among voice-cloning-capable models
  • Fish Audio S2 Pro leads quality benchmarks (lowest WER on Seed-TTS Eval among all models including closed-source) but requires SGLang source builds for DGX Spark

How to Use This Guide

This guide is organized for two reading patterns:

For model selection: Start with the Models Comparison guide to identify candidates matching your requirements, then check the DGX Spark Compatibility section if deploying on GB10 hardware, and the Optimization Matrix for deployment-specific capabilities.

For pipeline engineering: Start with the Optimization Techniques catalog for the full catalog of latency reduction approaches, then the Voice Pipeline Architecture section for STT→LLM→TTS integration patterns, and the Configuration Reference for concrete deployment settings.

Accuracy notes

All claims have been validated against primary sources (GitHub repos, HuggingFace model cards, arXiv papers, NVIDIA Developer Forums). Where claims could not be independently verified, they are marked with source attribution. Seven material corrections were identified during validation — see the Corrections Log in the Models Comparison guide.


Platform Reference: DGX Spark

Hardware specifications

| Specification | Value |
|---|---|
| SoC | NVIDIA GB10 Grace Blackwell Superchip |
| CPU | 20-core ARM (10× Cortex-X925 @ 4.0 GHz + 10× Cortex-A725 @ 2.86 GHz) |
| GPU | Blackwell, 6,144 CUDA cores, 192 Tensor Cores (5th gen), compute capability 12.1 (sm_121a) |
| Memory | 128 GB unified LPDDR5x |
| Memory bandwidth | ~273 GB/s (measured; shared between CPU and GPU) |
| CPU↔GPU interconnect | NVLink-C2C, 900 GB/s bidirectional |
| Compute (BF16) | ~99.8 TFLOPS (measured) |
| Compute (FP8) | ~207.7 TFLOPS |
| Compute (FP4, sparse) | ~1 PFLOPS (marketing; non-sparse ~500 TFLOPS) |
| CUDA version | 13.0 |
| OS | DGX OS (Ubuntu 24.04 base) |
| Architecture | aarch64 (ARM v9.2-A) |
| Power | <100 W typical AI workload |
| Form factor | 1.13 liters |

Why bandwidth matters more than compute for TTS

For autoregressive TTS decode, every generated token requires reading the full model weights from memory. The formula:

Max tokens/sec ≈ Memory bandwidth ÷ Model size (bytes)

| Model size (BF16) | Bytes | Max tok/s on Spark | Real-time viable at… |
|---|---|---|---|
| 82M (Kokoro) | 164 MB | ~1,665 | Any codec rate |
| 350M (Chatterbox Turbo) | 700 MB | ~390 | Up to ~50 Hz |
| 500M (CosyVoice/Qwen3-0.6B) | 1 GB | ~273 | Up to ~25 Hz |
| 1B (TADA/Orpheus) | 2 GB | ~136 | Up to ~12.5 Hz |
| 3B (Orpheus-3B) | 6 GB | ~45 | Tight at 21 Hz (SNAC) |
| 4.4B (S2 Pro) | ~9 GB | ~30 | Needs prefix caching |

INT4 quantization raises these ceilings in proportion to the reduction in bytes read per token: 4-bit weights are roughly a quarter the size of BF16, though quantization scales, unquantized layers, and KV-cache traffic keep the realized gain smaller. On DGX Spark, INT4 (W4A16 with Marlin kernels) actually outperforms NVFP4 for decode throughput.
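The ceiling estimates in the table above can be reproduced with a few lines of arithmetic. The sketch below assumes decode is purely weight-bandwidth bound; the constants and the `max_tokens_per_sec` helper are illustrative, not from any library:

```python
BANDWIDTH_BYTES_PER_SEC = 273e9  # measured DGX Spark memory bandwidth

def max_tokens_per_sec(params: float, bytes_per_param: float = 2.0) -> float:
    """Upper bound on decode tokens/sec: bandwidth / bytes read per token.

    Assumes every token re-reads all model weights and ignores KV-cache
    traffic, activations, and kernel overhead, so real throughput is lower.
    """
    return BANDWIDTH_BYTES_PER_SEC / (params * bytes_per_param)

print(max_tokens_per_sec(82e6))       # Kokoro-82M, BF16  -> ~1665 tok/s
print(max_tokens_per_sec(3e9))        # Orpheus-3B, BF16  -> ~45 tok/s
print(max_tokens_per_sec(3e9, 0.5))   # same model, 4-bit weights
```

The codec frame rate is the real-time threshold: a 1B model's ~136 tok/s ceiling clears a 12.5 Hz codec by an order of magnitude, while a 3B model at ~45 tok/s has little headroom over SNAC's 21 Hz.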

Critical software compatibility notes

  • PyTorch: Official cu130 index has aarch64 wheels (pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu130)
  • Flash Attention 3: Incompatible with Blackwell. Use FA2 (build from source) or SDPA fallback
  • Triton compiler: Not fully mature for sm_121a. Many compiled kernels fail
  • vLLM: Requires enforce_eager=True (CUDA graphs fail via Triton)
  • SGLang: Use official Docker image lmsysorg/sglang:spark or build sgl-kernel with TORCH_CUDA_ARCH_LIST="12.1a"
  • torchaudio: Replace with soundfile for audio I/O — torchaudio CUDA kernels often fail on aarch64
  • "Blackwell Noise": FP16/BF16 numerical instability on sm_121 Tensor Cores causes white noise after ~2 seconds in several TTS frameworks
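The checklist above can be folded into a small pre-flight script run before launching a TTS stack. This is a hypothetical sketch (the `preflight` helper is illustrative); it uses only public PyTorch calls and degrades gracefully when packages are absent:

```python
import platform

def preflight() -> list[str]:
    """Collect DGX Spark-specific warnings before starting a TTS service."""
    notes = []
    if platform.machine() != "aarch64":
        notes.append("not aarch64: the DGX Spark wheel/index advice does not apply")
    try:
        import torch
        if torch.cuda.is_available():
            major, _minor = torch.cuda.get_device_capability()
            if major == 12:  # Blackwell sm_121a
                notes.append("sm_121 GPU: skip FlashAttention-3, prefer SDPA; "
                             "watch FP16/BF16 output for 'Blackwell Noise'")
    except ImportError:
        notes.append("torch missing: install from the cu130 aarch64 index")
    try:
        import soundfile  # noqa: F401  # preferred over torchaudio for audio I/O
    except ImportError:
        notes.append("soundfile missing: pip install soundfile")
    return notes

for note in preflight():
    print("WARNING:", note)
```

Running this at container start surfaces the common failure modes (wrong wheel index, missing `soundfile`, sm_121 quirks) before they show up as garbled audio mid-inference.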
