This document explains the key technologies, frameworks, and concepts referenced throughout the main guide. It's designed for readers who may be new to TTS deployment or need a refresher on the rapidly evolving ecosystem.
How Modern TTS Works
The three generations of TTS
Generation 1 — Formant Synthesis (1960s–2000s): Engines like eSpeak generate speech from hand-crafted rules that model the resonant frequencies (formants) of the human vocal tract. They're tiny (under 5 MB), extremely fast, and run on anything — but they sound robotic. Still used today for accessibility (screen readers) and as phonemizers (converting text to pronunciation symbols) inside neural systems.
Generation 2 — Neural Network TTS (2018–2023): Models like VITS, Tacotron, and FastSpeech use neural networks to generate mel spectrograms (visual representations of audio frequency over time); a separate "vocoder" network then converts those spectrograms into actual audio waveforms. Piper, MeloTTS, and Kokoro-82M use this approach. They sound much more natural than formant synthesis but typically can't clone arbitrary voices.
Generation 3 — LLM-Based TTS (2024–present): The current state of the art. These models treat speech generation as a language modeling problem: text goes in, and discrete "audio tokens" come out, which are then decoded into waveforms by a neural audio codec. Fish Audio S2 Pro, Qwen3-TTS, CosyVoice, Orpheus, TADA, and most models in this guide use this approach. They can clone voices, express emotions, and generate highly natural speech — but they need more compute.
The two-stage pipeline (LLM-based TTS)
Text → [LLM Backbone: "what to say"] → Audio Tokens → [Neural Codec Decoder: "how it sounds"] → Waveform → Speaker
Stage 1 — The LLM backbone reads your text and generates a sequence of numbers called "audio tokens." These tokens encode what the speech should sound like — the words, rhythm, pitch, emotion, and speaker identity. This is the slow part because it generates tokens one at a time (autoregressive).
Stage 2 — The codec decoder takes those audio tokens and reconstructs actual audio waveforms. This is fast because it runs in parallel. Think of it like decompressing a ZIP file — the hard work was creating the compressed version.
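As a toy sketch of the two stages (every function and constant here is an illustrative stand-in, not any real model's API):

```python
# Toy two-stage LLM-TTS pipeline. Stage 1 is sequential (slow);
# Stage 2 expands tokens to samples in one pass (fast).

def llm_generate_audio_tokens(text: str) -> list[int]:
    """Stage 1 (autoregressive): one audio token at a time."""
    tokens = []
    for step in range(len(text) * 2):           # toy budget: ~2 tokens per character
        next_token = hash((text, step)) % 1024  # stand-in for model sampling
        tokens.append(next_token)
    return tokens

def codec_decode(tokens: list[int], frame_rate: int = 50,
                 sample_rate: int = 24_000) -> list[float]:
    """Stage 2 (parallel): audio tokens -> waveform samples."""
    samples_per_token = sample_rate // frame_rate
    return [0.0] * (len(tokens) * samples_per_token)  # silent placeholder audio

tokens = llm_generate_audio_tokens("Hello")
waveform = codec_decode(tokens)
print(len(tokens), len(waveform))  # 10 tokens expand to 4800 samples
```

The point of the sketch: the loop in stage 1 is the bottleneck, while stage 2 is a cheap expansion, like decompressing the ZIP file.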
Three architecture families
Autoregressive (AR): Generates audio tokens one at a time, left to right, like typing one letter at a time. Produces natural prosody but is inherently sequential. Used by: Fish S2 Pro, Orpheus, OuteTTS, GPT-SoVITS.
- Pros: Most natural prosody, best long-range coherence, can stream output
- Cons: Speed limited by sequential generation, can hallucinate (skip or repeat words)
Flow Matching / Diffusion: Generates all audio tokens simultaneously by iteratively refining random noise into speech. Like starting with TV static and gradually sharpening it into a clear picture. Used by: F5-TTS, CosyVoice (acoustic stage), Chatterbox (decoder), Zonos.
- Pros: Parallel generation (fast), high audio fidelity
- Cons: Multiple refinement steps needed, harder to stream, less natural prosody for long utterances
Hybrid: Combines both — typically AR for temporal structure (deciding what comes when) and flow matching for acoustic detail (making it sound good). This is the dominant approach in 2026. Used by: CosyVoice 2/3, Qwen3-TTS, TADA, VibeVoice, Kani-TTS-2.
- Pros: Best of both worlds — natural flow + high fidelity
- Cons: More complex architectures, two-stage latency
Inference Frameworks Explained
These are the software systems that actually run the TTS models on your GPU or CPU. Choosing the right one significantly impacts speed, memory usage, and deployment complexity.
vLLM — The Production Standard
What it is: An open-source inference engine originally built for serving large language models. It's become the most widely-used framework for deploying LLM-based applications in production.
How it helps TTS: Since modern TTS models use LLM backbones (Llama, Qwen), vLLM's optimizations directly apply. Its key innovation is PagedAttention — a memory management system that reduces GPU memory waste from 60–80% to under 4% by splitting the KV cache (the model's "working memory") into efficiently-managed blocks.
Key features:
- PagedAttention (efficient memory management)
- Continuous batching (handles multiple requests efficiently)
- OpenAI-compatible API server (drop-in replacement)
- Broad model support
TTS models supported: Qwen3-TTS (via vLLM-Omni), Orpheus, Fun-CosyVoice3, OuteTTS (experimental)
Pros:
- Most mature and well-documented
- Largest community and ecosystem
- Best time-to-first-token (TTFT) performance
- OpenAI-compatible API makes integration easy
Cons:
- Requires enforce_eager=True on DGX Spark (CUDA graphs fail via Triton)
- vLLM-Omni (TTS extension) is separate from main vLLM and less mature
- Heavy installation — full PyTorch dependency chain
- Not all TTS models have vLLM backends yet
DGX Spark status: works with workarounds; must build from source for aarch64.
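Because vLLM speaks the OpenAI wire format, a client request can be assembled with only the standard library. The URL, route, and model id below are assumptions for a hypothetical local deployment; check your server's documentation for the actual TTS endpoint it exposes:

```python
import json
from urllib import request

# Hypothetical local endpoint and model id -- adjust for your deployment.
URL = "http://localhost:8000/v1/audio/speech"  # OpenAI-style TTS route (assumed)

payload = {
    "model": "orpheus-3b",  # assumed model id registered with the server
    "input": "Hello from the local TTS server.",
    "voice": "default",
}
body = json.dumps(payload).encode()
req = request.Request(URL, data=body,
                      headers={"Content-Type": "application/json"})

# Uncomment once a server is actually running:
# with request.urlopen(req) as resp:
#     audio_bytes = resp.read()

print(req.get_method(), req.full_url)
```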
SGLang — The Performance Leader
What it is: An inference framework developed by the LMSYS team (the people behind Chatbot Arena). It emphasizes maximum throughput and minimum latency.
How it helps TTS: SGLang's killer feature for TTS is RadixAttention — a system that caches computed results in a tree structure. When you use the same voice reference clip across multiple TTS requests, SGLang automatically reuses the cached computation instead of re-processing it. Fish Audio S2 Pro achieves 86.4% cache hit rates with this, making repeated voice cloning nearly free.
Key features:
- RadixAttention (prefix caching — huge for voice cloning)
- CUDA graph replay (faster individual token generation)
- Zero-overhead CPU scheduler
- EAGLE3 speculative decoding support
TTS models supported: Fish Audio S2 Pro (primary), Orpheus
Pros:
- Fastest per-token generation speed
- RadixAttention is a game-changer for voice cloning workloads
- Official DGX Spark Docker image (lmsysorg/sglang:spark)
- EAGLE3 speculative decoding (~2× speedup)
Cons:
- Smaller community than vLLM
- Fewer TTS model integrations
- Requires building sgl-kernel from source for DGX Spark (30–60 min build)
- Documentation is less comprehensive
DGX Spark status: works via the official Docker image; sgl-kernel must be built with TORCH_CUDA_ARCH_LIST="12.1a".
TensorRT-LLM — NVIDIA's Optimized Runtime
What it is: NVIDIA's proprietary inference optimization toolkit. It compiles models into highly optimized execution plans that squeeze maximum performance from NVIDIA GPUs.
How it helps TTS: TRT-LLM achieves the deepest hardware-specific optimizations, including NVFP4 quantization, custom CUDA kernels, and hardware-aware scheduling. CosyVoice achieves a 4× speedup with TRT-LLM compared to raw PyTorch.
Key features:
- Maximum NVIDIA GPU utilization
- NVFP4 / FP8 quantization with quality preservation
- Integrated with NVIDIA Triton Inference Server
- Part of NVIDIA NIM ecosystem
TTS models supported: CosyVoice 2/3, Spark-TTS, NVIDIA Magpie
Pros:
- Fastest absolute throughput on NVIDIA hardware
- Native DGX Spark support
- NVIDIA's full engineering team behind it
- Best quantization quality at lowest precision (FP4)
Cons:
- NVIDIA-only (no AMD, no Apple Silicon)
- Complex setup and compilation process
- Fewer community-contributed model conversions
- Proprietary components limit community contribution
DGX Spark status: fully supported; the best path for NVIDIA-ecosystem models.
llama.cpp — The Universal Runner
What it is: A C/C++ implementation of LLM inference that runs on virtually any hardware — CPUs, GPUs (NVIDIA, AMD, Apple), and even phones. It uses its own model format called GGUF.
How it helps TTS: Since TTS models increasingly use LLM backbones (Llama, Qwen), llama.cpp can run them with aggressive quantization (Q4, Q8) and minimal dependencies. It has native TTS support via the llama-tts binary for OuteTTS models, and Orpheus runs through it via Ollama.
Key features:
- Runs everywhere (CPU, CUDA, Metal, Vulkan, ROCm)
- GGUF quantization (Q2 through Q8, plus mixed precision)
- Minimal dependencies (no Python, no PyTorch)
- Memory-mapped model loading (instant startup)
- ARM64 / aarch64 native support
TTS models supported: OuteTTS 1.0 (native), Orpheus (via SNAC wrapper), NeuTTS Air/Nano
Pros:
- Best DGX Spark compatibility — excellent ARM64 support with official benchmarks
- Tiny footprint, no Python dependency chain
- GGUF quantization offers finest-grained size/quality tradeoff
- Works offline with zero network requirements
- Active development with frequent releases
Cons:
- Not all TTS models have GGUF conversions
- No built-in audio codec support (SNAC decoding requires separate Python script)
- Lower throughput than GPU-native frameworks for large models
- Less sophisticated batching and scheduling
DGX Spark status: excellent. Build with -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 and set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1.
ONNX Runtime — The Lightweight Option
What it is: Microsoft's cross-platform inference engine for models exported to the ONNX (Open Neural Network Exchange) format. Think of ONNX as a "universal model format" that many frameworks can export to.
How it helps TTS: Lightweight models (Kokoro-82M, Piper, KittenTTS) export to ONNX for deployment on CPUs, mobile devices, and browsers. The sherpa-onnx project provides a curated TTS model zoo with pre-converted ONNX models and cross-platform binaries.
Key features:
- Cross-platform (Windows, Linux, macOS, Android, iOS, browser)
- CPU and GPU execution providers
- Small runtime footprint
- Pre-built binaries for many architectures
TTS models supported: Kokoro-82M, Piper, KittenTTS, Matcha-TTS, Chatterbox Turbo (ONNX export), MeloTTS (via sherpa-onnx)
Pros:
- Simplest deployment path for lightweight models
- Runs in browsers via WebAssembly
- No Python or PyTorch required for inference
- sherpa-onnx provides ready-to-use cross-platform binaries
Cons:
- Only works with models that have ONNX exports
- LLM-backbone TTS models generally don't export well to ONNX
- Limited optimization compared to vLLM/SGLang for large models
- No sophisticated KV cache management
DGX Spark status: the ARM64 CUDA execution provider works; community wheels from Ultralytics fill the gap left by the lack of official aarch64 binaries.
Framework decision matrix
| Your situation | Recommended framework | Why |
|---|---|---|
| Deploying Fish S2 Pro for maximum quality | SGLang | RadixAttention voice caching is essential |
| Deploying Qwen3-TTS or CosyVoice | vLLM-Omni | Day-0 support with streaming |
| Maximum speed on NVIDIA hardware | TensorRT-LLM | Deepest hardware optimization |
| DGX Spark with minimal build pain | llama.cpp | Best ARM64 support, GGUF models work |
| Lightweight models (Kokoro, Piper) | ONNX Runtime | Simplest deployment, runs anywhere |
| Multiple models, production serving | Triton Inference Server | Ensemble scheduling, per-stage batching |
| Quick prototyping, any model | Raw PyTorch | Works with everything, no framework overhead |
Audio Codecs and Tokenizers
What is a neural audio codec?
A neural audio codec is a neural network that compresses audio into a compact sequence of numbers (tokens) and decompresses them back into audio. Think of it like MP3 compression, but powered by AI and designed specifically for speech.
Audio waveform → [Encoder] → Tokens ("compressed speech") → [Decoder] → Audio waveform
The codec is what makes LLM-based TTS possible: the LLM generates tokens (which it's good at), and the codec decoder converts those tokens into actual sound (which it's good at).
Why frame rate matters so much
The "frame rate" of a codec is how many tokens it produces per second of audio. This directly determines how fast TTS can run:
- 50 Hz codec (NeuCodec): 50 tokens per second of audio. To generate 10 seconds of speech, the LLM must produce 500 tokens.
- 12.5 Hz codec (Qwen3-TTS-12Hz, Mimi): Only 12.5 tokens per second. Same 10 seconds needs only 125 tokens — 4× less work.
- 2–3 Hz codec (TADA): Only 20–30 tokens for 10 seconds — 17× less work than 50 Hz.
On the DGX Spark, where memory bandwidth limits token generation to ~45–273 tok/s depending on model size, this difference is the difference between real-time and non-real-time operation.
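The arithmetic above can be sketched directly (ignoring multi-codebook overhead, which multiplies the token count):

```python
# How many tokens the LLM must generate for a given audio length,
# and whether a given decode speed can keep up with real time.

def tokens_needed(seconds: float, frame_rate_hz: float) -> float:
    return seconds * frame_rate_hz

def is_realtime(frame_rate_hz: float, decode_tok_per_s: float) -> bool:
    # Real time requires producing tokens at least as fast as audio plays.
    return decode_tok_per_s >= frame_rate_hz

print(tokens_needed(10, 50))    # 500 tokens for 10 s at a 50 Hz codec
print(tokens_needed(10, 12.5))  # 125 tokens at 12.5 Hz -- 4x less work
print(is_realtime(50, 45))      # a 45 tok/s decoder can't feed a 50 Hz codec
print(is_realtime(12.5, 45))    # but easily feeds a 12.5 Hz codec
```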
Major codecs in the TTS ecosystem
| Codec | Frame Rate | Creator | Used By | Key Feature |
|---|---|---|---|---|
| SNAC | Multi-scale (12/24/48/96 Hz) | Independent | Orpheus | Hierarchical structure captures detail at multiple scales |
| DAC (Descript Audio Codec) | ~86 Hz | Descript | OuteTTS, Zonos | High fidelity, widely adopted |
| Mimi | 12.5 Hz | Kyutai | Sesame CSM, Dia2 | Ultra-low frame rate, good quality |
| EnCodec | 24/48 Hz | Meta | MusicGen, early TTS | First widely-used neural codec |
| NeuCodec | 50 Hz | Neuphonic | NeuTTS Air | Single codebook, simple decoding |
| NanoCodec | 21.5 Hz | NVIDIA | Magpie TTS, Kani-TTS-2 | Optimized for NVIDIA hardware |
| BiCodec | 50 Hz | HKUST | Spark-TTS | Decouples speaker identity from content |
| σ-VAE | 7.5 Hz | Microsoft | VibeVoice | Continuous (not discrete) tokens |
| RVQ (Fish S2) | ~21 Hz, 10 codebooks | Fish Audio | Fish S2 Pro | 10-layer residual quantization |
Single-codebook vs multi-codebook
Single-codebook: One token per time step. Simpler for the LLM to generate. Used by CosyVoice (25 Hz), Kokoro.
Multi-codebook (RVQ): Multiple tokens per time step, each adding finer acoustic detail. Like a base layer + enhancement layers. The LLM generates the first codebook, then either another network or MTP generates the rest. Used by Qwen3-TTS (16 layers), Fish S2 Pro (10 layers), SNAC (4 layers).
Practical impact: Multi-codebook models sound better but need strategies (MTP, fast-AR heads, or flow matching) to handle the extra tokens efficiently.
Quantization Formats
What is quantization?
Quantization means storing model weights using fewer bits per number. A model trained in 32-bit floating point (FP32) can often run nearly as well in 16-bit (FP16/BF16) or even 4-bit (INT4) format. This matters because:
- Smaller model = less memory needed (a 3B model goes from ~12 GB in FP32 to ~6 GB in FP16 to ~1.5 GB in INT4)
- Smaller model = faster inference (less data to read from memory each token)
- On bandwidth-limited hardware like DGX Spark, this is the single biggest lever for performance
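The size arithmetic is simple enough to sketch (weights only, ignoring KV cache and activations):

```python
# Approximate weight-memory footprint: parameters x bits / 8 = bytes.

def model_size_gb(params_billions: float, bits: int) -> float:
    bytes_total = params_billions * 1e9 * bits / 8
    return bytes_total / 1e9

for bits in (32, 16, 4):
    print(f"3B model @ {bits}-bit: {model_size_gb(3, bits):.1f} GB")
# Matches the figures above: 12.0 GB (FP32), 6.0 GB (FP16), 1.5 GB (INT4)
```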
Format comparison
| Format | Bits | Size Reduction | Quality Impact | Where Used |
|---|---|---|---|---|
| FP32 | 32 | 1× (baseline) | Perfect | Training, reference |
| BF16 | 16 | 2× | Negligible | Default for Blackwell. Used by Qwen3-TTS, Fish S2 Pro |
| FP16 | 16 | 2× | Negligible for most models | General GPU inference |
| FP8 | 8 | 4× | <1% quality loss typically | vLLM, SGLang (newer GPUs) |
| INT8 | 8 | 4× | 1–3% MOS drop | Good tradeoff. KittenTTS offers INT8 |
| INT4 (W4A16) | 4 | 8× | Noticeable on some models; prosody artifacts possible | Best decode speed on DGX Spark |
| NVFP4 | 4 | 8× | Similar to INT4 | NVIDIA-specific; underperforms INT4 for decode on Spark |
| GGUF Q4 | ~4 | ~8× | Good with K-quant variants | llama.cpp ecosystem (Orpheus, OuteTTS, NeuTTS) |
| GGUF Q8 | ~8 | ~4× | Nearly lossless | Higher quality GGUF option |
| ONNX Q4 | 4 | ~8× | Acceptable for small models | Browser/mobile deployment (Kokoro) |
GGUF explained
GGUF (GPT-Generated Unified Format) is the model format used by llama.cpp. It packages model weights, tokenizer, and metadata into a single file with built-in quantization. GGUF files are self-contained — no Python, no PyTorch, no separate tokenizer files needed.
Key GGUF variants:
- Q2_K: Smallest (~2-bit average), significant quality loss
- Q4_K_M: Best balance of size and quality for most use cases
- Q5_K_M: Slightly larger, slightly better than Q4
- Q8_0: Near-lossless, ~4× compression
- F16: Full FP16, largest GGUF option
Which format for DGX Spark?
For decode-bound TTS inference on the Spark's 273 GB/s bandwidth:
- If quality is paramount: BF16 (native Blackwell format)
- If speed matters most: INT4 with Marlin kernels (fastest decode)
- If using llama.cpp: GGUF Q4_K_M (best quality-per-byte)
NVFP4 is counterintuitively slower than INT4 on the DGX Spark for decode-bound TTS inference.
Attention Mechanisms
What is attention and why does it matter for TTS?
Attention is the mechanism that lets a model consider all previous context when generating each new token. In TTS, this means the model looks at the entire text input, the voice reference audio, and all previously generated audio tokens to decide what comes next.
The problem: attention's memory usage grows with the square of the sequence length (O(n²)). A 1,000-token sequence needs ~1 million attention computations; 10,000 tokens needs ~100 million. For TTS, where audio token sequences can exceed 500 tokens plus text and reference audio, this becomes a real bottleneck.
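A quick back-of-the-envelope for the quadratic blow-up, counting the full attention matrix per head and layer at 2 bytes per FP16 entry:

```python
# Naive attention materializes an n x n score matrix.

def attn_matrix_mb(n_tokens: int, bytes_per_entry: int = 2) -> float:
    return n_tokens * n_tokens * bytes_per_entry / 1e6

print(attn_matrix_mb(1_000))   # 2.0 MB per head/layer at FP16
print(attn_matrix_mb(10_000))  # 200.0 MB -- 100x more for 10x the tokens
```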
Flash Attention
What it is: An optimized implementation of the attention computation that reduces memory usage from O(n²) to O(n) by processing attention in tiles rather than materializing the full attention matrix. Developed by Tri Dao at Princeton.
Flash Attention 2 (FA2): The current production standard. Required by Qwen3-TTS (without it, inference is 10× slower). Must be built from source on DGX Spark (~15 min).
Flash Attention 3 (FA3) does NOT work on DGX Spark. Use FA2 instead.
Practical impact: For a typical TTS generation with 500 audio tokens plus 100 text tokens, FA2 reduces attention memory from ~1.4 GB to ~24 MB and speeds up the attention step by 2–4×.
SDPA (Scaled Dot-Product Attention)
What it is: PyTorch's built-in optimized attention implementation (torch.nn.functional.scaled_dot_product_attention). It automatically selects the best available backend (Flash Attention, memory-efficient attention, or math fallback).
Why it matters for DGX Spark: When Flash Attention builds fail on aarch64, SDPA provides a reliable fallback that still offers significant speedups. VibeVoice on DGX Spark actually runs well with SDPA and may be faster than FA2 on Blackwell in some cases.
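To make the operation concrete, here is the computation SDPA performs, written out in plain Python on tiny toy matrices. PyTorch does all of this in one fused call (torch.nn.functional.scaled_dot_product_attention); this sketch is for understanding, not for speed:

```python
import math

# softmax(Q @ K^T / sqrt(d)) @ V -- the attention equation, step by step.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    m = max(row)                             # subtract max for stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def sdpa(q, k, v):
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]     # K^T
    scores = matmul(q, k_t)                  # Q @ K^T
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]  # each row sums to 1
    return matmul(weights, v)                # weighted sum of the values

q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
out = sdpa(q, k, v)
print(out)  # the query matches the first key, so the output leans toward [10, 0]
```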
FlashInfer
What it is: A library of specialized attention kernels optimized for serving (as opposed to training). Used by SGLang for its paged KV cache attention implementation.
DGX Spark note: FlashInfer kernels may not all have sm_121 targets yet. SGLang's Spark Docker image includes workarounds.
PagedAttention
What it is: Not an attention algorithm per se, but a memory management strategy for the KV cache (the model's stored attention state). Instead of allocating one contiguous block per request, PagedAttention splits the KV cache into fixed-size "pages" that can be allocated on demand and shared across requests.
Why it matters: Reduces GPU memory waste from 60–80% to under 4%, allowing more requests or larger contexts. Essential for vLLM's efficiency.
RadixAttention
What it is: SGLang's extension of PagedAttention that stores cached KV states in a radix tree data structure. When multiple requests share the same prefix (like the same voice reference audio), the cached computation is automatically reused.
Why it's critical for TTS voice cloning: If you're generating speech in the same cloned voice repeatedly, the voice reference audio produces the same KV cache entries every time. RadixAttention caches these and skips re-computation, achieving 86.4% cache hit rates for Fish S2 Pro. This means the voice reference audio is effectively "free" after the first request.
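A dict-level sketch of the idea (real RadixAttention caches KV pages in a radix tree over token sequences, not whole strings, but the economics are the same):

```python
# Toy prefix cache: requests that share a prefix (the voice reference)
# reuse the expensive computation done for the first request.

class PrefixCache:
    def __init__(self):
        self.cache: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def process(self, prefix: str, text: str) -> str:
        if prefix in self.cache:
            self.hits += 1                        # reuse cached KV state
        else:
            self.misses += 1
            self.cache[prefix] = f"kv({prefix})"  # expensive, done only once
        return f"{self.cache[prefix]} + tokens for {text!r}"

cache = PrefixCache()
voice_ref = "<reference clip for speaker A>"
for line in ["Hello!", "How are you?", "Goodbye."]:
    cache.process(voice_ref, line)

print(cache.hits, cache.misses)  # 2 hits, 1 miss: the reference is paid for once
```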
Key Metrics Explained
RTF (Real-Time Factor)
Definition: The ratio of wall-clock time to generated audio duration.
RTF = Time to generate audio / Duration of generated audio
- RTF 1.0: Takes 10 seconds to generate 10 seconds of audio (exactly real-time)
- RTF 0.5: Takes 5 seconds to generate 10 seconds of audio (2× faster than real-time)
- RTF 0.1: Takes 1 second to generate 10 seconds of audio (10× faster than real-time)
- RTF 2.0: Takes 20 seconds to generate 10 seconds of audio (too slow for real-time)
For real-time conversation: You need RTF < 1.0, ideally < 0.5 to leave headroom for pipeline overhead.
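The definition as a one-liner:

```python
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: values below 1.0 are faster than real time."""
    return generation_seconds / audio_seconds

print(rtf(5, 10))   # 0.5 -> 2x faster than real time
print(rtf(20, 10))  # 2.0 -> too slow for live conversation
```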
TTFA (Time-to-First-Audio)
Definition: The time from submitting text to receiving the first playable audio chunk.
This is arguably more important than RTF for conversational feel. A model with RTF 0.3 but TTFA 2 seconds feels sluggish because the user waits 2 seconds of silence before hearing anything. A model with RTF 0.8 but TTFA 100 ms feels responsive because audio starts almost immediately.
Targets:
- < 100 ms: Imperceptible delay (Qwen3-TTS achieves 97 ms)
- 100–300 ms: Feels responsive
- 300–500 ms: Noticeable but acceptable for conversation
- 500–1000 ms: Feels slow
- > 1000 ms: Feels broken
WER (Word Error Rate)
Definition: Percentage of words incorrectly reproduced when TTS output is fed through a speech-to-text system. Measures intelligibility and accuracy.
WER = (Substitutions + Insertions + Deletions) / Total words × 100%
A WER of 0% means every word was perfectly reproduced. The best models achieve 1–3% WER.
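A minimal WER implementation via word-level edit distance, using the standard dynamic-programming formulation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (subs + ins + dels) / reference words x 100."""
    ref, hyp = reference.split(), hypothesis.split()
    # Edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref) * 100

# One substitution ("the" -> "a") out of six words, roughly 16.7% WER:
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```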
SIM (Speaker Similarity)
Definition: Cosine similarity between speaker embeddings extracted from the reference audio and the generated audio. Measures how well voice cloning preserves the original speaker's identity.
- > 0.90: Excellent cloning fidelity
- 0.80–0.90: Good — recognizably the same speaker
- 0.70–0.80: Moderate — some characteristics preserved
- < 0.70: Poor — different speaker
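SIM is just cosine similarity between the two embedding vectors. A stdlib sketch with toy low-dimensional embeddings (real speaker embeddings typically have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dim "speaker embeddings" pointing in nearly the same direction:
ref_embedding = [0.9, 0.1, 0.3, 0.2]
gen_embedding = [0.8, 0.2, 0.3, 0.1]
sim = cosine_similarity(ref_embedding, gen_embedding)
print(round(sim, 3))  # close to 1.0 -> would count as excellent cloning
```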
MOS (Mean Opinion Score)
Definition: Average subjective quality rating on a 1–5 scale from human listeners.
- 5.0: Indistinguishable from human speech (impossible to achieve consistently)
- 4.0–4.5: Excellent quality, natural-sounding
- 3.5–4.0: Good quality, minor artifacts
- 3.0–3.5: Acceptable, noticeable synthesis artifacts
- < 3.0: Poor quality
Voice Pipeline Concepts
What is a voice pipeline?
A voice pipeline is the complete system that enables spoken conversation with an AI. It chains three (or more) AI models together:
You speak → [STT: "listen"] → Text → [LLM: "think"] → Response text → [TTS: "speak"] → AI speaks
STT (Speech-to-Text): Converts your spoken words into text. Common models: Whisper, NVIDIA Parakeet/Canary.
LLM (Large Language Model): Processes the text and generates a response. Common models: Llama, Qwen, GPT-OSS.
TTS (Text-to-Speech): Converts the response text into spoken audio. This is what this guide covers.
What is Pipecat?
Pipecat is an open-source Python framework by Daily.co that orchestrates voice pipelines. Instead of manually connecting STT → LLM → TTS yourself, Pipecat provides a pipeline abstraction where you plug in "Frame Processors" and it handles:
- Streaming: Automatically streams LLM tokens into TTS as they're generated
- Interruption: Detects when the user starts speaking and stops the AI mid-sentence
- Turn-taking: Determines when the user has finished speaking
- Audio transport: Handles WebRTC, WebSocket, or local audio device I/O
What is "streaming" in TTS?
There are two kinds of streaming in TTS:
Text input streaming: The TTS model starts generating audio before it receives the complete text. As the LLM produces words, the TTS starts speaking them. Not all models support this — it requires architectures that can begin synthesis from partial input.
Audio output streaming: The system starts playing audio before the TTS has finished generating the entire response. The first chunk of audio plays while subsequent chunks are still being generated. Most modern TTS models support this.
Bidirectional streaming: Both text input and audio output streaming simultaneously. The TTS accepts text character by character AND outputs audio chunks as they're ready. Supported by: Qwen3-TTS, CosyVoice 2/3, VibeVoice-Realtime, Dia2. This is the gold standard for conversational latency.
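Audio output streaming can be sketched with a generator: each chunk is consumed (and, in a real pipeline, played) while later chunks are still being produced. generate_chunks is an illustrative stand-in, not any model's API:

```python
# Toy audio-output streaming: yield fixed-size token chunks as they are
# "generated" instead of waiting for the full utterance.

def generate_chunks(text: str, chunk_tokens: int = 25):
    total = len(text) * 2  # toy token budget: ~2 tokens per character
    for start in range(0, total, chunk_tokens):
        yield f"audio[{start}:{min(start + chunk_tokens, total)}]"

played = []
for chunk in generate_chunks("Hello there, how can I help you today?"):
    played.append(chunk)  # real pipeline: decode + send to the speaker now

print(len(played), played[0])  # playback starts after chunk 1, not chunk 4
```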
What is VAD?
VAD (Voice Activity Detection) determines when someone is speaking vs when there's silence. It's the gatekeeper that triggers STT processing and determines turn-taking.
Silero VAD is the most commonly used open-source VAD. It's tiny (~2 MB), runs on CPU, and processes audio in real-time with ~100 ms frame windows. Pipecat uses it by default.
SmartTurn is Pipecat's proprietary turn-detection model that goes beyond simple silence detection to understand when a user has actually finished their thought (vs just pausing mid-sentence).
What is "barge-in"?
Barge-in is when the user interrupts the AI while it's speaking. A good voice pipeline:
- Detects the interruption within ~200 ms
- Immediately stops TTS audio playback
- Flushes any buffered audio that hasn't been played yet
- Starts processing the user's new speech
- Tracks exactly what the AI had said before being interrupted (for context)
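The steps above can be sketched as a tiny playback queue (the Playback class is illustrative, not part of any pipeline framework):

```python
# Toy barge-in handling: stop playback, flush unplayed audio, and keep
# what the user actually heard as conversational context.

class Playback:
    def __init__(self, chunks: list[str]):
        self.queue = list(chunks)    # generated but not yet played
        self.spoken: list[str] = []  # what the user has actually heard

    def play_next(self):
        if self.queue:
            self.spoken.append(self.queue.pop(0))

    def barge_in(self) -> list[str]:
        self.queue.clear()           # flush buffered, unplayed audio
        return self.spoken           # context: what the AI managed to say

p = Playback(["Sure, ", "the weather ", "today is ", "sunny with..."])
p.play_next()
p.play_next()            # user interrupts after two chunks
heard = p.barge_in()
print(heard)             # ['Sure, ', 'the weather ']
```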
Licensing Explained
Why TTS licensing is tricky
A TTS model has (at least) two separately-licensed components:
- The code (inference scripts, training scripts, server code)
- The model weights (the actual neural network parameters)
A project can have MIT-licensed code but non-commercial weights. You must check both.
Common license types in TTS
| License | Commercial Use | Redistribution | Key Restriction |
|---|---|---|---|
| MIT | Yes | Yes | None |
| Apache 2.0 | Yes | Yes | Patent grant; must include notice |
| CC-BY-NC-4.0 | No | Yes (non-commercial) | No commercial use of model or outputs |
| CPML (Coqui) | No | Limited | Non-commercial use of model AND generated audio |
| Fish Audio Research | No (research free) | Research only | Commercial requires separate license from Fish Audio |
| GPL-3.0 | Yes (with conditions) | Must share source | Derivative works must also be GPL |
| AGPL-3.0 | Yes (with conditions) | Must share source | Network use triggers copyleft (strictest GPL variant) |
| bilibili Model License | Requires authorization | Limited | Written auth needed for commercial; can't improve other AI models |
| NVIDIA Open Model | Check terms | Limited | May restrict commercial use on HuggingFace |
| Llama-derivative | Yes (Meta terms) | Yes (Meta terms) | Inherits Meta's Llama license requirements |
The "dual license" trap
Always check both code and weights. Common traps:
- F5-TTS: Code is MIT but weights are CC-BY-NC (because of training data)
- IndexTTS 1.x: Code is Apache 2.0 but weights have bilibili restrictions
- XTTS-v2: Code is MPL-2.0 but model weights are CPML (non-commercial)
- StyleTTS2: Code is MIT but pretrained models require disclosure that speech is synthetic
Safe commercial options
For commercial deployment without licensing concerns, these models have both permissive code AND weight licenses:
Qwen3-TTS (Apache 2.0), TADA (MIT), Chatterbox (MIT), Orpheus (Apache 2.0), OuteTTS (Apache 2.0), Kokoro (Apache 2.0), NeuTTS Air (Apache 2.0), Kani-TTS-2 (Apache 2.0), Dia/Dia2 (Apache 2.0), Spark-TTS (Apache 2.0), Sesame CSM (Apache 2.0), CosyVoice/Fun-CosyVoice3 (Apache 2.0)
Complete Glossary
| Term | Definition |
|---|---|
| aarch64 / ARM64 | The CPU architecture used by DGX Spark, Apple Silicon Macs, Raspberry Pi, and most smartphones |
| AR (Autoregressive) | Generating output one token at a time, where each token depends on all previous ones |
| Attention | The mechanism that lets a model consider all context when generating each output token |
| BF16 (bfloat16) | A 16-bit floating-point format with the same range as FP32 but less precision. Native on Blackwell |
| Barge-in | When a user interrupts the AI while it's speaking |
| Blackwell | NVIDIA's GPU architecture used in DGX Spark (GB10) and RTX 50-series |
| Codec (neural audio) | A neural network that compresses/decompresses audio to/from discrete tokens |
| Continuous batching | Dynamically adding/removing requests during inference rather than fixed batch sizes |
| CUDA | NVIDIA's parallel computing platform for GPU programming |
| CUDA graphs | Pre-recorded sequences of GPU operations that can be replayed with minimal overhead |
| DAC | Descript Audio Codec — a widely-used neural audio codec |
| DGX Spark | NVIDIA's compact AI inference device with GB10 Grace Blackwell chip and 128 GB unified memory |
| Diffusion | A generation method that iteratively refines noise into the target output |
| DiT | Diffusion Transformer — a transformer-based diffusion model |
| Dual-AR | Two autoregressive components: one for temporal (time-axis) and one for depth (codebook layers) |
| EAGLE3 | A speculative decoding method that uses a small draft model for ~2× speedup |
| eSpeak-NG | A lightweight formant-based speech synthesizer, often used as a phonemizer in neural TTS |
| FA2 / FA3 | Flash Attention version 2 / 3. FA3 does NOT work on Blackwell |
| Flow matching | A generation method that learns a smooth path from noise to output (faster than diffusion) |
| FP4 / FP8 / FP16 / FP32 | Floating-point number formats using 4/8/16/32 bits per number |
| GGUF | Model format used by llama.cpp with built-in quantization |
| Grace | NVIDIA's ARM-based CPU architecture used in DGX Spark |
| INT4 / INT8 | Integer quantization using 4/8 bits per weight |
| KV cache | Stored attention key-value pairs from previous tokens, reused during generation |
| LLM | Large Language Model — the backbone of modern TTS systems |
| Mel spectrogram | A visual representation of audio frequency content, used as intermediate representation in older TTS |
| Mimi | Kyutai's 12.5 Hz neural audio codec used by Sesame CSM and Dia2 |
| MOS | Mean Opinion Score — subjective quality rating (1–5 scale) |
| MTP | Multi-Token Prediction — generating multiple codec layers in a single model step |
| NVLink-C2C | High-speed interconnect between CPU and GPU dies in DGX Spark (900 GB/s) |
| ONNX | Open Neural Network Exchange — a universal model format for inference |
| PagedAttention | vLLM's memory management for KV cache using fixed-size pages |
| Pipecat | Open-source Python framework for real-time voice AI pipelines |
| Phonemizer | Converts written text to pronunciation symbols (phonemes) |
| Prosody | The rhythm, stress, and intonation patterns in speech |
| RadixAttention | SGLang's prefix-caching system using radix trees for KV cache reuse |
| RTF | Real-Time Factor — ratio of generation time to audio duration |
| RVQ | Residual Vector Quantization — multi-layer codec quantization |
| SDPA | Scaled Dot-Product Attention — PyTorch's built-in optimized attention |
| SGLang | High-performance inference framework with RadixAttention |
| sm_121 / sm_121a | NVIDIA GPU compute capability for Blackwell GB10 |
| SNAC | Multi-scale Neural Audio Codec with hierarchical structure |
| Speculative decoding | Using a small draft model to propose tokens verified by the main model |
| SSML | Speech Synthesis Markup Language — XML-based speech control standard |
| STT | Speech-to-Text (also called ASR — Automatic Speech Recognition) |
| Tensor Cores | Specialized GPU units for fast matrix multiplication |
| torch.compile | PyTorch's JIT compilation for optimized inference |
| TRT-LLM | TensorRT-LLM — NVIDIA's optimized LLM inference engine |
| TTFA | Time-to-First-Audio — latency before first audio chunk is ready |
| Unified memory | DGX Spark's shared memory pool accessible by both CPU and GPU |
| VAD | Voice Activity Detection — determines when someone is speaking |
| vLLM | Open-source LLM inference framework with PagedAttention |
| vLLM-Omni | Extension of vLLM for multimodal models including TTS |
| Vocoder | Converts mel spectrograms or latent representations into audio waveforms |
| Voice cloning | Generating speech that matches a specific speaker's voice from a reference recording |
| WER | Word Error Rate — measures TTS accuracy via STT round-trip |
| Zero-shot | Voice cloning without fine-tuning — just provide a reference audio clip |
This is a companion document to "The Complete Offline TTS Guide (2026 Edition)." Return to the main guide for model comparisons, deployment configurations, and DGX Spark-specific instructions.
Related Guides
Complete Offline TTS Guide 2026
Comprehensive guide to offline text-to-speech: 30+ models, optimization techniques, deployment configurations, and DGX Spark compatibility.
TTS Model Catalog & Optimization
Complete catalog of 30+ offline TTS models with DGX Spark compatibility, optimization techniques, voice pipeline architecture, and deployment configurations.
TTS Decision Guide
Interactive decision flowchart, model recommendations by use case, and 5-minute quickstart guides for the top TTS models.