This document explains the key technologies, frameworks, and concepts referenced throughout the main guide. It's designed for readers who may be new to TTS deployment or need a refresher on the rapidly evolving ecosystem.
How Modern TTS Works
The three generations of TTS
Generation 1 — Formant Synthesis (1960s–2000s): Engines like eSpeak generate speech from hand-crafted rules that model the resonant frequencies (formants) of the human vocal tract. They're tiny (under 5 MB), extremely fast, and run on anything — but they sound robotic. Still used today for accessibility (screen readers) and as phonemizers (converting text to pronunciation symbols) inside neural systems.
Generation 2 — Neural Network TTS (2018–2023): Models like VITS, Tacotron, and FastSpeech use neural networks to generate mel spectrograms (visual representations of audio frequency over time); a separate "vocoder" network then converts those spectrograms into actual audio waveforms. Piper, MeloTTS, and Kokoro-82M use this approach. They sound much more natural than formant synthesis but typically can't clone arbitrary voices.
Generation 3 — LLM-Based TTS (2024–present): The current state of the art. These models treat speech generation as a language modeling problem: text goes in, and discrete "audio tokens" come out, which are then decoded into waveforms by a neural audio codec. Fish Audio S2 Pro, Qwen3-TTS, CosyVoice, Orpheus, TADA, and most models in this guide use this approach. They can clone voices, express emotions, and generate highly natural speech — but they need more compute.
The two-stage pipeline (LLM-based TTS)
Text → [LLM Backbone: "what to say"] → Audio Tokens → [Neural Codec Decoder: "how it sounds"] → Waveform → Speaker
Stage 1 — The LLM backbone reads your text and generates a sequence of numbers called "audio tokens." These tokens encode what the speech should sound like — the words, rhythm, pitch, emotion, and speaker identity. This is the slow part because it generates tokens one at a time (autoregressive).
Stage 2 — The codec decoder takes those audio tokens and reconstructs actual audio waveforms. This is fast because it runs in parallel. Think of it like decompressing a ZIP file — the hard work was creating the compressed version.
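As a toy sketch of the two stages (every function and constant here is an illustrative stand-in, not any real model's API):

```python
# Toy two-stage LLM-TTS pipeline. Stage 1 is sequential (slow);
# Stage 2 expands tokens to samples in one pass (fast).

def llm_generate_audio_tokens(text: str) -> list[int]:
    """Stage 1 (autoregressive): one audio token at a time."""
    tokens = []
    for step in range(len(text) * 2):           # toy budget: ~2 tokens per character
        next_token = hash((text, step)) % 1024  # stand-in for model sampling
        tokens.append(next_token)
    return tokens

def codec_decode(tokens: list[int], frame_rate: int = 50,
                 sample_rate: int = 24_000) -> list[float]:
    """Stage 2 (parallel): audio tokens -> waveform samples."""
    samples_per_token = sample_rate // frame_rate
    return [0.0] * (len(tokens) * samples_per_token)  # silent placeholder audio

tokens = llm_generate_audio_tokens("Hello")
waveform = codec_decode(tokens)
print(len(tokens), len(waveform))  # 10 tokens expand to 4800 samples
```

The point of the sketch: the loop in stage 1 is the bottleneck, while stage 2 is a cheap expansion, like decompressing the ZIP file.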
Three architecture families
Autoregressive (AR): Generates audio tokens one at a time, left to right, like typing one letter at a time. Produces natural prosody but is inherently sequential. Used by: Fish S2 Pro, Orpheus, OuteTTS, GPT-SoVITS.
- Pros: Most natural prosody, best long-range coherence, can stream output
- Cons: Speed limited by sequential generation, can hallucinate (skip or repeat words)
Flow Matching / Diffusion: Generates all audio tokens simultaneously by iteratively refining random noise into speech. Like starting with TV static and gradually sharpening it into a clear picture. Used by: F5-TTS, CosyVoice (acoustic stage), Chatterbox (decoder), Zonos.
- Pros: Parallel generation (fast), high audio fidelity
- Cons: Multiple refinement steps needed, harder to stream, less natural prosody for long utterances
Hybrid: Combines both — typically AR for temporal structure (deciding what comes when) and flow matching for acoustic detail (making it sound good). This is the dominant approach in 2026. Used by: CosyVoice 2/3, Qwen3-TTS, TADA, VibeVoice, Kani-TTS-2.
- Pros: Best of both worlds — natural flow + high fidelity
- Cons: More complex architectures, two-stage latency
Inference Frameworks Explained
These are the software systems that actually run the TTS models on your GPU or CPU. Choosing the right one significantly impacts speed, memory usage, and deployment complexity.
vLLM — The Production Standard
What it is: An open-source inference engine originally built for serving large language models. It's become the most widely-used framework for deploying LLM-based applications in production.
How it helps TTS: Since modern TTS models use LLM backbones (Llama, Qwen), vLLM's optimizations directly apply. Its key innovation is PagedAttention — a memory management system that reduces GPU memory waste from 60–80% to under 4% by splitting the KV cache (the model's "working memory") into efficiently-managed blocks.
Key features:
- PagedAttention (efficient memory management)
- Continuous batching (handles multiple requests efficiently)
- OpenAI-compatible API server (drop-in replacement)
- Broad model support
TTS models supported: Qwen3-TTS (via vLLM-Omni), Orpheus, Fun-CosyVoice3, OuteTTS (experimental)
Pros:
- Most mature and well-documented
- Largest community and ecosystem
- Best time-to-first-token (TTFT) performance
- OpenAI-compatible API makes integration easy
Cons:
- Requires enforce_eager=True on DGX Spark (CUDA graphs fail via Triton)
- vLLM-Omni (TTS extension) is separate from main vLLM and less mature
- Heavy installation — full PyTorch dependency chain
- Not all TTS models have vLLM backends yet
DGX Spark status: works with workarounds; must build from source for aarch64.
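Because vLLM speaks the OpenAI wire format, a client request can be assembled with only the standard library. The URL, route, and model id below are assumptions for a hypothetical local deployment; check your server's documentation for the actual TTS endpoint it exposes:

```python
import json
from urllib import request

# Hypothetical local endpoint and model id -- adjust for your deployment.
URL = "http://localhost:8000/v1/audio/speech"  # OpenAI-style TTS route (assumed)

payload = {
    "model": "orpheus-3b",  # assumed model id registered with the server
    "input": "Hello from the local TTS server.",
    "voice": "default",
}
body = json.dumps(payload).encode()
req = request.Request(URL, data=body,
                      headers={"Content-Type": "application/json"})

# Uncomment once a server is actually running:
# with request.urlopen(req) as resp:
#     audio_bytes = resp.read()

print(req.get_method(), req.full_url)
```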
SGLang — The Performance Leader
What it is: An inference framework developed by the LMSYS team (the people behind Chatbot Arena). It emphasizes maximum throughput and minimum latency.
How it helps TTS: SGLang's killer feature for TTS is RadixAttention — a system that caches computed results in a tree structure. When you use the same voice reference clip across multiple TTS requests, SGLang automatically reuses the cached computation instead of re-processing it. Fish Audio S2 Pro achieves 86.4% cache hit rates with this, making repeated voice cloning nearly free.
Key features:
- RadixAttention (prefix caching — huge for voice cloning)
- CUDA graph replay (faster individual token generation)
- Zero-overhead CPU scheduler
- EAGLE3 speculative decoding support
TTS models supported: Fish Audio S2 Pro (primary), Orpheus
Pros:
- Fastest per-token generation speed
- RadixAttention is a game-changer for voice cloning workloads
- Official DGX Spark Docker image (lmsysorg/sglang:spark)
- EAGLE3 speculative decoding (~2× speedup)
Cons:
- Smaller community than vLLM
- Fewer TTS model integrations
- Requires building sgl-kernel from source for DGX Spark (30–60 min build)
- Documentation is less comprehensive
DGX Spark status: works via the official Docker image; sgl-kernel must be built with TORCH_CUDA_ARCH_LIST="12.1a".
TensorRT-LLM — NVIDIA's Optimized Runtime
What it is: NVIDIA's proprietary inference optimization toolkit. It compiles models into highly optimized execution plans that squeeze maximum performance from NVIDIA GPUs.
How it helps TTS: TRT-LLM achieves the deepest hardware-specific optimizations, including NVFP4 quantization, custom CUDA kernels, and hardware-aware scheduling. CosyVoice achieves a 4× speedup with TRT-LLM compared to raw PyTorch.
Key features:
- Maximum NVIDIA GPU utilization
- NVFP4 / FP8 quantization with quality preservation
- Integrated with NVIDIA Triton Inference Server
- Part of NVIDIA NIM ecosystem
TTS models supported: CosyVoice 2/3, Spark-TTS, NVIDIA Magpie
Pros:
- Fastest absolute throughput on NVIDIA hardware
- Native DGX Spark support
- NVIDIA's full engineering team behind it
- Best quantization quality at lowest precision (FP4)
Cons:
- NVIDIA-only (no AMD, no Apple Silicon)
- Complex setup and compilation process
- Fewer community-contributed model conversions
- Proprietary components limit community contribution
DGX Spark status: fully supported; the best path for NVIDIA-ecosystem models.
llama.cpp — The Universal Runner
What it is: A C/C++ implementation of LLM inference that runs on virtually any hardware — CPUs, GPUs (NVIDIA, AMD, Apple), and even phones. It uses its own model format called GGUF.
How it helps TTS: Since TTS models increasingly use LLM backbones (Llama, Qwen), llama.cpp can run them with aggressive quantization (Q4, Q8) and minimal dependencies. It has native TTS support via the llama-tts binary for OuteTTS models, and Orpheus runs through it via Ollama.
Key features:
- Runs everywhere (CPU, CUDA, Metal, Vulkan, ROCm)
- GGUF quantization (Q2 through Q8, plus mixed precision)
- Minimal dependencies (no Python, no PyTorch)
- Memory-mapped model loading (instant startup)
- ARM64 / aarch64 native support
TTS models supported: OuteTTS 1.0 (native), Orpheus (via SNAC wrapper), NeuTTS Air/Nano
Pros:
- Best DGX Spark compatibility — excellent ARM64 support with official benchmarks
- Tiny footprint, no Python dependency chain
- GGUF quantization offers finest-grained size/quality tradeoff
- Works offline with zero network requirements
- Active development with frequent releases
Cons:
- Not all TTS models have GGUF conversions
- No built-in audio codec support (SNAC decoding requires separate Python script)
- Lower throughput than GPU-native frameworks for large models
- Less sophisticated batching and scheduling
DGX Spark status: excellent. Build with -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 and set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1.
ONNX Runtime — The Lightweight Option
What it is: Microsoft's cross-platform inference engine for models exported to the ONNX (Open Neural Network Exchange) format. Think of ONNX as a "universal model format" that many frameworks can export to.
How it helps TTS: Lightweight models (Kokoro-82M, Piper, KittenTTS) export to ONNX for deployment on CPUs, mobile devices, and browsers. The sherpa-onnx project provides a curated TTS model zoo with pre-converted ONNX models and cross-platform binaries.
Key features:
- Cross-platform (Windows, Linux, macOS, Android, iOS, browser)
- CPU and GPU execution providers
- Small runtime footprint
- Pre-built binaries for many architectures
TTS models supported: Kokoro-82M, Piper, KittenTTS, Matcha-TTS, Chatterbox Turbo (ONNX export), MeloTTS (via sherpa-onnx)
Pros:
- Simplest deployment path for lightweight models
- Runs in browsers via WebAssembly
- No Python or PyTorch required for inference
- sherpa-onnx provides ready-to-use cross-platform binaries
Cons:
- Only works with models that have ONNX exports
- LLM-backbone TTS models generally don't export well to ONNX
- Limited optimization compared to vLLM/SGLang for large models
- No sophisticated KV cache management
DGX Spark status: the ARM64 CUDA execution provider works; community wheels from Ultralytics fill the gap left by the lack of official aarch64 binaries.
Framework decision matrix
| Your situation | Recommended framework | Why |
|---|---|---|
| Deploying Fish S2 Pro for maximum quality | SGLang | RadixAttention voice caching is essential |
| Deploying Qwen3-TTS or CosyVoice | vLLM-Omni | Day-0 support with streaming |
| Maximum speed on NVIDIA hardware | TensorRT-LLM | Deepest hardware optimization |
| DGX Spark with minimal build pain | llama.cpp | Best ARM64 support, GGUF models work |
| Lightweight models (Kokoro, Piper) | ONNX Runtime | Simplest deployment, runs anywhere |
| Multiple models, production serving | Triton Inference Server | Ensemble scheduling, per-stage batching |
| Quick prototyping, any model | Raw PyTorch | Works with everything, no framework overhead |
Audio Codecs and Tokenizers
What is a neural audio codec?
A neural audio codec is a neural network that compresses audio into a compact sequence of numbers (tokens) and decompresses them back into audio. Think of it like MP3 compression, but powered by AI and designed specifically for speech.
Audio waveform → [Encoder] → Tokens ("compressed speech") → [Decoder] → Audio waveform
The codec is what makes LLM-based TTS possible: the LLM generates tokens (which it's good at), and the codec decoder converts those tokens into actual sound (which it's good at).
Why frame rate matters so much
The "frame rate" of a codec is how many tokens it produces per second of audio. This directly determines how fast TTS can run:
- 50 Hz codec (NeuCodec): 50 tokens per second of audio. To generate 10 seconds of speech, the LLM must produce 500 tokens.
- 12.5 Hz codec (Qwen3-TTS-12Hz, Mimi): Only 12.5 tokens per second. Same 10 seconds needs only 125 tokens — 4× less work.
- 2–3 Hz codec (TADA): Only 20–30 tokens for 10 seconds — 17× less work than 50 Hz.
On the DGX Spark, where memory bandwidth limits token generation to ~45–273 tok/s depending on model size, this difference is the difference between real-time and non-real-time operation.
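The arithmetic above can be sketched directly (ignoring multi-codebook overhead, which multiplies the token count):

```python
# How many tokens the LLM must generate for a given audio length,
# and whether a given decode speed can keep up with real time.

def tokens_needed(seconds: float, frame_rate_hz: float) -> float:
    return seconds * frame_rate_hz

def is_realtime(frame_rate_hz: float, decode_tok_per_s: float) -> bool:
    # Real time requires producing tokens at least as fast as audio plays.
    return decode_tok_per_s >= frame_rate_hz

print(tokens_needed(10, 50))    # 500 tokens for 10 s at a 50 Hz codec
print(tokens_needed(10, 12.5))  # 125 tokens at 12.5 Hz -- 4x less work
print(is_realtime(50, 45))      # a 45 tok/s decoder can't feed a 50 Hz codec
print(is_realtime(12.5, 45))    # but easily feeds a 12.5 Hz codec
```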
Major codecs in the TTS ecosystem
| Codec | Frame Rate | Creator | Used By | Key Feature |
|---|---|---|---|---|
| SNAC | Multi-scale (12/24/48/96 Hz) | Independent | Orpheus | Hierarchical structure captures detail at multiple scales |
| DAC (Descript Audio Codec) | ~86 Hz | Descript | OuteTTS, Zonos | High fidelity, widely adopted |
| Mimi | 12.5 Hz | Kyutai | Sesame CSM, Dia2 | Ultra-low frame rate, good quality |
| EnCodec | 24/48 Hz | Meta | MusicGen, early TTS | First widely-used neural codec |
| NeuCodec | 50 Hz | Neuphonic | NeuTTS Air | Single codebook, simple decoding |
| NanoCodec | 21.5 Hz | NVIDIA | Magpie TTS, Kani-TTS-2 | Optimized for NVIDIA hardware |
| BiCodec | 50 Hz | HKUST | Spark-TTS | Decouples speaker identity from content |
| σ-VAE | 7.5 Hz | Microsoft | VibeVoice | Continuous (not discrete) tokens |
| RVQ (Fish S2) | ~21 Hz, 10 codebooks | Fish Audio | Fish S2 Pro | 10-layer residual quantization |
Single-codebook vs multi-codebook
Single-codebook: One token per time step. Simpler for the LLM to generate. Used by CosyVoice (25 Hz), Kokoro.
Multi-codebook (RVQ): Multiple tokens per time step, each adding finer acoustic detail. Like a base layer + enhancement layers. The LLM generates the first codebook, then either another network or MTP generates the rest. Used by Qwen3-TTS (16 layers), Fish S2 Pro (10 layers), SNAC (4 layers).
Practical impact: Multi-codebook models sound better but need strategies (MTP, fast-AR heads, or flow matching) to handle the extra tokens efficiently.
Quantization Formats
What is quantization?
Quantization means storing model weights using fewer bits per number. A model trained in 32-bit floating point (FP32) can often run nearly as well in 16-bit (FP16/BF16) or even 4-bit (INT4) format. This matters because:
- Smaller model = less memory needed (a 3B model goes from ~12 GB in FP32 to ~6 GB in FP16 to ~1.5 GB in INT4)
- Smaller model = faster inference (less data to read from memory each token)
- On bandwidth-limited hardware like DGX Spark, this is the single biggest lever for performance
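The size arithmetic is simple enough to sketch (weights only, ignoring KV cache and activations):

```python
# Approximate weight-memory footprint: parameters x bits / 8 = bytes.

def model_size_gb(params_billions: float, bits: int) -> float:
    bytes_total = params_billions * 1e9 * bits / 8
    return bytes_total / 1e9

for bits in (32, 16, 4):
    print(f"3B model @ {bits}-bit: {model_size_gb(3, bits):.1f} GB")
# Matches the figures above: 12.0 GB (FP32), 6.0 GB (FP16), 1.5 GB (INT4)
```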
Format comparison
| Format | Bits | Size Reduction | Quality Impact | Where Used |
|---|---|---|---|---|
| FP32 | 32 | 1× (baseline) | Perfect | Training, reference |
| BF16 | 16 | 2× | Negligible | Default for Blackwell. Used by Qwen3-TTS, Fish S2 Pro |
| FP16 | 16 | 2× | Negligible for most models | General GPU inference |
| FP8 | 8 | 4× | <1% quality loss typically | vLLM, SGLang (newer GPUs) |
| INT8 | 8 | 4× | 1–3% MOS drop | Good tradeoff. KittenTTS offers INT8 |
| INT4 (W4A16) | 4 | 8× | Noticeable on some models; prosody artifacts possible | Best decode speed on DGX Spark |
| NVFP4 | 4 | 8× | Similar to INT4 | NVIDIA-specific; underperforms INT4 for decode on Spark |
| GGUF Q4 | ~4 | ~8× | Good with K-quant variants | llama.cpp ecosystem (Orpheus, OuteTTS, NeuTTS) |
| GGUF Q8 | ~8 | ~4× | Nearly lossless | Higher quality GGUF option |
| ONNX Q4 | 4 | ~8× | Acceptable for small models | Browser/mobile deployment (Kokoro) |
GGUF explained
GGUF (GPT-Generated Unified Format) is the model format used by llama.cpp. It packages model weights, tokenizer, and metadata into a single file with built-in quantization. GGUF files are self-contained — no Python, no PyTorch, no separate tokenizer files needed.
Key GGUF variants:
- Q2_K: Smallest (~2-bit average), significant quality loss
- Q4_K_M: Best balance of size and quality for most use cases
- Q5_K_M: Slightly larger, slightly better than Q4
- Q8_0: Near-lossless, ~4× compression
- F16: Full FP16, largest GGUF option
Which format for DGX Spark?
For decode-bound TTS inference on the Spark's 273 GB/s bandwidth:
- If quality is paramount: BF16 (native Blackwell format)
- If speed matters most: INT4 with Marlin kernels (fastest decode)
- If using llama.cpp: GGUF Q4_K_M (best quality-per-byte)
NVFP4 is counterintuitively slower than INT4 on the DGX Spark for decode-bound TTS inference.
Attention Mechanisms
What is attention and why does it matter for TTS?
Attention is the mechanism that lets a model consider all previous context when generating each new token. In TTS, this means the model looks at the entire text input, the voice reference audio, and all previously generated audio tokens to decide what comes next.
The problem: attention's memory usage grows with the square of the sequence length (O(n²)). A 1,000-token sequence needs ~1 million attention computations; 10,000 tokens needs ~100 million. For TTS, where audio token sequences can exceed 500 tokens plus text and reference audio, this becomes a real bottleneck.
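A quick back-of-the-envelope for the quadratic blow-up, counting the full attention matrix per head and layer at 2 bytes per FP16 entry:

```python
# Naive attention materializes an n x n score matrix.

def attn_matrix_mb(n_tokens: int, bytes_per_entry: int = 2) -> float:
    return n_tokens * n_tokens * bytes_per_entry / 1e6

print(attn_matrix_mb(1_000))   # 2.0 MB per head/layer at FP16
print(attn_matrix_mb(10_000))  # 200.0 MB -- 100x more for 10x the tokens
```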
Flash Attention
What it is: An optimized implementation of the attention computation that reduces memory usage from O(n²) to O(n) by processing attention in tiles rather than materializing the full attention matrix. Developed by Tri Dao at Princeton.
Flash Attention 2 (FA2): The current production standard. Required by Qwen3-TTS (without it, inference is 10× slower). Must be built from source on DGX Spark (~15 min).
Flash Attention 3 (FA3) does NOT work on DGX Spark. Use FA2 instead.
Practical impact: For a typical TTS generation with 500 audio tokens plus 100 text tokens, FA2 reduces attention memory from ~1.4 GB to ~24 MB and speeds up the attention step by 2–4×.
SDPA (Scaled Dot-Product Attention)
What it is: PyTorch's built-in optimized attention implementation (torch.nn.functional.scaled_dot_product_attention). It automatically selects the best available backend (Flash Attention, memory-efficient attention, or math fallback).
Why it matters for DGX Spark: When Flash Attention builds fail on aarch64, SDPA provides a reliable fallback that still offers significant speedups. VibeVoice on DGX Spark actually runs well with SDPA and may be faster than FA2 on Blackwell in some cases.
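To make the operation concrete, here is the computation SDPA performs, written out in plain Python on tiny toy matrices. PyTorch does all of this in one fused call (torch.nn.functional.scaled_dot_product_attention); this sketch is for understanding, not for speed:

```python
import math

# softmax(Q @ K^T / sqrt(d)) @ V -- the attention equation, step by step.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    m = max(row)                             # subtract max for stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def sdpa(q, k, v):
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]     # K^T
    scores = matmul(q, k_t)                  # Q @ K^T
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]  # each row sums to 1
    return matmul(weights, v)                # weighted sum of the values

q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
out = sdpa(q, k, v)
print(out)  # the query matches the first key, so the output leans toward [10, 0]
```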
FlashInfer
What it is: A library of specialized attention kernels optimized for serving (as opposed to training). Used by SGLang for its paged KV cache attention implementation.
DGX Spark note: FlashInfer kernels may not all have sm_121 targets yet. SGLang's Spark Docker image includes workarounds.
PagedAttention
What it is: Not an attention algorithm per se, but a memory management strategy for the KV cache (the model's stored attention state). Instead of allocating one contiguous block per request, PagedAttention splits the KV cache into fixed-size "pages" that can be allocated on demand and shared across requests.
Why it matters: Reduces GPU memory waste from 60–80% to under 4%, allowing more requests or larger contexts. Essential for vLLM's efficiency.
RadixAttention
What it is: SGLang's extension of PagedAttention that stores cached KV states in a radix tree data structure. When multiple requests share the same prefix (like the same voice reference audio), the cached computation is automatically reused.
Why it's critical for TTS voice cloning: If you're generating speech in the same cloned voice repeatedly, the voice reference audio produces the same KV cache entries every time. RadixAttention caches these and skips re-computation, achieving 86.4% cache hit rates for Fish S2 Pro. This means the voice reference audio is effectively "free" after the first request.
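A dict-level sketch of the idea (real RadixAttention caches KV pages in a radix tree over token sequences, not whole strings, but the economics are the same):

```python
# Toy prefix cache: requests that share a prefix (the voice reference)
# reuse the expensive computation done for the first request.

class PrefixCache:
    def __init__(self):
        self.cache: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def process(self, prefix: str, text: str) -> str:
        if prefix in self.cache:
            self.hits += 1                        # reuse cached KV state
        else:
            self.misses += 1
            self.cache[prefix] = f"kv({prefix})"  # expensive, done only once
        return f"{self.cache[prefix]} + tokens for {text!r}"

cache = PrefixCache()
voice_ref = "<reference clip for speaker A>"
for line in ["Hello!", "How are you?", "Goodbye."]:
    cache.process(voice_ref, line)

print(cache.hits, cache.misses)  # 2 hits, 1 miss: the reference is paid for once
```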
Key Metrics Explained
RTF (Real-Time Factor)
Definition: The ratio of wall-clock time to generated audio duration.
RTF = Time to generate audio / Duration of generated audio
- RTF 1.0: Takes 10 seconds to generate 10 seconds of audio (exactly real-time)
- RTF 0.5: Takes 5 seconds to generate 10 seconds of audio (2× faster than real-time)
- RTF 0.1: Takes 1 second to generate 10 seconds of audio (10× faster than real-time)
- RTF 2.0: Takes 20 seconds to generate 10 seconds of audio (too slow for real-time)
For real-time conversation: You need RTF < 1.0, ideally < 0.5 to leave headroom for pipeline overhead.
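The definition as a one-liner:

```python
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: values below 1.0 are faster than real time."""
    return generation_seconds / audio_seconds

print(rtf(5, 10))   # 0.5 -> 2x faster than real time
print(rtf(20, 10))  # 2.0 -> too slow for live conversation
```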
TTFA (Time-to-First-Audio)
Definition: The time from submitting text to receiving the first playable audio chunk.
This is arguably more important than RTF for conversational feel. A model with RTF 0.3 but TTFA 2 seconds feels sluggish because the user waits 2 seconds of silence before hearing anything. A model with RTF 0.8 but TTFA 100 ms feels responsive because audio starts almost immediately.
Targets:
- < 100 ms: Imperceptible delay (Qwen3-TTS achieves 97 ms)
- 100–300 ms: Feels responsive
- 300–500 ms: Noticeable but acceptable for conversation
- 500–1000 ms: Feels slow
- > 1000 ms: Feels broken
WER (Word Error Rate)
Definition: Percentage of words incorrectly reproduced when TTS output is fed through a speech-to-text system. Measures intelligibility and accuracy.
WER = (Substitutions + Insertions + Deletions) / Total words × 100%
A WER of 0% means every word was perfectly reproduced. The best models achieve 1–3% WER.
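A minimal WER implementation via word-level edit distance, using the standard dynamic-programming formulation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (subs + ins + dels) / reference words x 100."""
    ref, hyp = reference.split(), hypothesis.split()
    # Edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref) * 100

# One substitution ("the" -> "a") out of six words, roughly 16.7% WER:
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```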
SIM (Speaker Similarity)
Definition: Cosine similarity between speaker embeddings extracted from the reference audio and the generated audio. Measures how well voice cloning preserves the original speaker's identity.
- > 0.90: Excellent cloning fidelity
- 0.80–0.90: Good — recognizably the same speaker
- 0.70–0.80: Moderate — some characteristics preserved
- < 0.70: Poor — different speaker
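SIM is just cosine similarity between the two embedding vectors. A stdlib sketch with toy low-dimensional embeddings (real speaker embeddings typically have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dim "speaker embeddings" pointing in nearly the same direction:
ref_embedding = [0.9, 0.1, 0.3, 0.2]
gen_embedding = [0.8, 0.2, 0.3, 0.1]
sim = cosine_similarity(ref_embedding, gen_embedding)
print(round(sim, 3))  # close to 1.0 -> would count as excellent cloning
```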
MOS (Mean Opinion Score)
Definition: Average subjective quality rating on a 1–5 scale from human listeners.
- 5.0: Indistinguishable from human speech (impossible to achieve consistently)
- 4.0–4.5: Excellent quality, natural-sounding
- 3.5–4.0: Good quality, minor artifacts
- 3.0–3.5: Acceptable, noticeable synthesis artifacts
- < 3.0: Poor quality
Voice Pipeline Concepts
What is a voice pipeline?
A voice pipeline is the complete system that enables spoken conversation with an AI. It chains three (or more) AI models together:
You speak → [STT: "listen"] → Text → [LLM: "think"] → Response text → [TTS: "speak"] → AI speaks
STT (Speech-to-Text): Converts your spoken words into text. Common models: Whisper, NVIDIA Parakeet/Canary.
LLM (Large Language Model): Processes the text and generates a response. Common models: Llama, Qwen, GPT-OSS.
TTS (Text-to-Speech): Converts the response text into spoken audio. This is what this guide covers.
What is Pipecat?
Pipecat is an open-source Python framework by Daily.co that orchestrates voice pipelines. Instead of manually connecting STT → LLM → TTS yourself, Pipecat provides a pipeline abstraction where you plug in "Frame Processors" and it handles:
- Streaming: Automatically streams LLM tokens into TTS as they're generated
- Interruption: Detects when the user starts speaking and stops the AI mid-sentence
- Turn-taking: Determines when the user has finished speaking
- Audio transport: Handles WebRTC, WebSocket, or local audio device I/O
What is "streaming" in TTS?
There are two kinds of streaming in TTS:
Text input streaming: The TTS model starts generating audio before it receives the complete text. As the LLM produces words, the TTS starts speaking them. Not all models support this — it requires architectures that can begin synthesis from partial input.
Audio output streaming: The system starts playing audio before the TTS has finished generating the entire response. The first chunk of audio plays while subsequent chunks are still being generated. Most modern TTS models support this.
Bidirectional streaming: Both text input and audio output streaming simultaneously. The TTS accepts text character by character AND outputs audio chunks as they're ready. Supported by: Qwen3-TTS, CosyVoice 2/3, VibeVoice-Realtime, Dia2. This is the gold standard for conversational latency.
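Audio output streaming can be sketched with a generator: each chunk is consumed (and, in a real pipeline, played) while later chunks are still being produced. generate_chunks is an illustrative stand-in, not any model's API:

```python
# Toy audio-output streaming: yield fixed-size token chunks as they are
# "generated" instead of waiting for the full utterance.

def generate_chunks(text: str, chunk_tokens: int = 25):
    total = len(text) * 2  # toy token budget: ~2 tokens per character
    for start in range(0, total, chunk_tokens):
        yield f"audio[{start}:{min(start + chunk_tokens, total)}]"

played = []
for chunk in generate_chunks("Hello there, how can I help you today?"):
    played.append(chunk)  # real pipeline: decode + send to the speaker now

print(len(played), played[0])  # playback starts after chunk 1, not chunk 4
```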
What is VAD?
VAD (Voice Activity Detection) determines when someone is speaking vs when there's silence. It's the gatekeeper that triggers STT processing and determines turn-taking.
Silero VAD is the most commonly used open-source VAD. It's tiny (~2 MB), runs on CPU, and processes audio in real-time with ~100 ms frame windows. Pipecat uses it by default.
SmartTurn is Pipecat's proprietary turn-detection model that goes beyond simple silence detection to understand when a user has actually finished their thought (vs just pausing mid-sentence).
What is "barge-in"?
Barge-in is when the user interrupts the AI while it's speaking. A good voice pipeline:
- Detects the interruption within ~200 ms
- Immediately stops TTS audio playback
- Flushes any buffered audio that hasn't been played yet
- Starts processing the user's new speech
- Tracks exactly what the AI had said before being interrupted (for context)
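The steps above can be sketched as a tiny playback queue (the Playback class is illustrative, not part of any pipeline framework):

```python
# Toy barge-in handling: stop playback, flush unplayed audio, and keep
# what the user actually heard as conversational context.

class Playback:
    def __init__(self, chunks: list[str]):
        self.queue = list(chunks)    # generated but not yet played
        self.spoken: list[str] = []  # what the user has actually heard

    def play_next(self):
        if self.queue:
            self.spoken.append(self.queue.pop(0))

    def barge_in(self) -> list[str]:
        self.queue.clear()           # flush buffered, unplayed audio
        return self.spoken           # context: what the AI managed to say

p = Playback(["Sure, ", "the weather ", "today is ", "sunny with..."])
p.play_next()
p.play_next()            # user interrupts after two chunks
heard = p.barge_in()
print(heard)             # ['Sure, ', 'the weather ']
```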
Licensing Explained
Why TTS licensing is tricky
A TTS model has (at least) two separately-licensed components:
- The code (inference scripts, training scripts, server code)
- The model weights (the actual neural network parameters)
A project can have MIT-licensed code but non-commercial weights. You must check both.
Common license types in TTS
| License | Commercial Use | Redistribution | Key Restriction |
|---|---|---|---|
| MIT | Yes | Yes | None |
| Apache 2.0 | Yes | Yes | Patent grant; must include notice |
| CC-BY-NC-4.0 | No | Yes (non-commercial) | No commercial use of model or outputs |
| CPML (Coqui) | No | Limited | Non-commercial use of model AND generated audio |
| Fish Audio Research | No (research free) | Research only | Commercial requires separate license from Fish Audio |
| GPL-3.0 | Yes (with conditions) | Must share source | Derivative works must also be GPL |
| AGPL-3.0 | Yes (with conditions) | Must share source | Network use triggers copyleft (strictest GPL variant) |
| bilibili Model License | Requires authorization | Limited | Written auth needed for commercial; can't improve other AI models |
| NVIDIA Open Model | Check terms | Limited | May restrict commercial use on HuggingFace |
| Llama-derivative | Yes (Meta terms) | Yes (Meta terms) | Inherits Meta's Llama license requirements |
The "dual license" trap
Always check both code and weights. Common traps:
- F5-TTS: Code is MIT but weights are CC-BY-NC (because of training data)
- IndexTTS 1.x: Code is Apache 2.0 but weights have bilibili restrictions
- XTTS-v2: Code is MPL-2.0 but model weights are CPML (non-commercial)
- StyleTTS2: Code is MIT but pretrained models require disclosure that speech is synthetic
Safe commercial options
For commercial deployment without licensing concerns, these models have both permissive code AND weight licenses:
Qwen3-TTS (Apache 2.0), TADA (MIT), Chatterbox (MIT), Orpheus (Apache 2.0), OuteTTS (Apache 2.0), Kokoro (Apache 2.0), NeuTTS Air (Apache 2.0), Kani-TTS-2 (Apache 2.0), Dia/Dia2 (Apache 2.0), Spark-TTS (Apache 2.0), Sesame CSM (Apache 2.0), CosyVoice/Fun-CosyVoice3 (Apache 2.0)
Complete Glossary
| Term | Definition |
|---|---|
| aarch64 / ARM64 | The CPU architecture used by DGX Spark, Apple Silicon Macs, Raspberry Pi, and most smartphones |
| AR (Autoregressive) | Generating output one token at a time, where each token depends on all previous ones |
| Attention | The mechanism that lets a model consider all context when generating each output token |
| BF16 (bfloat16) | A 16-bit floating-point format with the same range as FP32 but less precision. Native on Blackwell |
| Barge-in | When a user interrupts the AI while it's speaking |
| Blackwell | NVIDIA's GPU architecture used in DGX Spark (GB10) and RTX 50-series |
| Codec (neural audio) | A neural network that compresses/decompresses audio to/from discrete tokens |
| Continuous batching | Dynamically adding/removing requests during inference rather than fixed batch sizes |
| CUDA | NVIDIA's parallel computing platform for GPU programming |
| CUDA graphs | Pre-recorded sequences of GPU operations that can be replayed with minimal overhead |
| DAC | Descript Audio Codec — a widely-used neural audio codec |
| DGX Spark | NVIDIA's compact AI inference device with GB10 Grace Blackwell chip and 128 GB unified memory |
| Diffusion | A generation method that iteratively refines noise into the target output |
| DiT | Diffusion Transformer — a transformer-based diffusion model |
| Dual-AR | Two autoregressive components: one for temporal (time-axis) and one for depth (codebook layers) |
| EAGLE3 | A speculative decoding method that uses a small draft model for ~2× speedup |
| eSpeak-NG | A lightweight formant-based speech synthesizer, often used as a phonemizer in neural TTS |
| FA2 / FA3 | Flash Attention version 2 / 3. FA3 does NOT work on Blackwell |
| Flow matching | A generation method that learns a smooth path from noise to output (faster than diffusion) |
| FP4 / FP8 / FP16 / FP32 | Floating-point number formats using 4/8/16/32 bits per number |
| GGUF | Model format used by llama.cpp with built-in quantization |
| Grace | NVIDIA's ARM-based CPU architecture used in DGX Spark |
| INT4 / INT8 | Integer quantization using 4/8 bits per weight |
| KV cache | Stored attention key-value pairs from previous tokens, reused during generation |
| LLM | Large Language Model — the backbone of modern TTS systems |
| Mel spectrogram | A visual representation of audio frequency content, used as intermediate representation in older TTS |
| Mimi | Kyutai's 12.5 Hz neural audio codec used by Sesame CSM and Dia2 |
| MOS | Mean Opinion Score — subjective quality rating (1–5 scale) |
| MTP | Multi-Token Prediction — generating multiple codec layers in a single model step |
| NVLink-C2C | High-speed interconnect between CPU and GPU dies in DGX Spark (900 GB/s) |
| ONNX | Open Neural Network Exchange — a universal model format for inference |
| PagedAttention | vLLM's memory management for KV cache using fixed-size pages |
| Pipecat | Open-source Python framework for real-time voice AI pipelines |
| Phonemizer | Converts written text to pronunciation symbols (phonemes) |
| Prosody | The rhythm, stress, and intonation patterns in speech |
| RadixAttention | SGLang's prefix-caching system using radix trees for KV cache reuse |
| RTF | Real-Time Factor — ratio of generation time to audio duration |
| RVQ | Residual Vector Quantization — multi-layer codec quantization |
| SDPA | Scaled Dot-Product Attention — PyTorch's built-in optimized attention |
| SGLang | High-performance inference framework with RadixAttention |
| sm_121 / sm_121a | NVIDIA GPU compute capability for Blackwell GB10 |
| SNAC | Multi-scale Neural Audio Codec with hierarchical structure |
| Speculative decoding | Using a small draft model to propose tokens verified by the main model |
| SSML | Speech Synthesis Markup Language — XML-based speech control standard |
| STT | Speech-to-Text (also called ASR — Automatic Speech Recognition) |
| Tensor Cores | Specialized GPU units for fast matrix multiplication |
| torch.compile | PyTorch's JIT compilation for optimized inference |
| TRT-LLM | TensorRT-LLM — NVIDIA's optimized LLM inference engine |
| TTFA | Time-to-First-Audio — latency before first audio chunk is ready |
| Unified memory | DGX Spark's shared memory pool accessible by both CPU and GPU |
| VAD | Voice Activity Detection — determines when someone is speaking |
| vLLM | Open-source LLM inference framework with PagedAttention |
| vLLM-Omni | Extension of vLLM for multimodal models including TTS |
| Vocoder | Converts mel spectrograms or latent representations into audio waveforms |
| Voice cloning | Generating speech that matches a specific speaker's voice from a reference recording |
| WER | Word Error Rate — measures TTS accuracy via STT round-trip |
| Zero-shot | Voice cloning without fine-tuning — just provide a reference audio clip |
This is a companion document to "The Complete Offline TTS Guide (2026 Edition)." Return to the main guide for model comparisons, deployment configurations, and DGX Spark-specific instructions.
Related Guides
Complete Offline TTS Guide 2026
Comprehensive guide to offline text-to-speech: 30+ models, optimization techniques, deployment configurations, and DGX Spark compatibility.
TTS Model Catalog & Optimization
Complete catalog of 30+ offline TTS models with DGX Spark compatibility, optimization techniques, voice pipeline architecture, and deployment configurations.
TTS Decision Guide
Interactive decision flowchart, model recommendations by use case, and 5-minute quickstart guides for the top TTS models.