TTS Concepts & Glossary

This document explains the key technologies, frameworks, and concepts referenced throughout the main guide. It's designed for readers who may be new to TTS deployment or need a refresher on the rapidly evolving ecosystem.

How Modern TTS Works

The three generations of TTS

Generation 1 — Formant Synthesis (1960s–2000s) Engines like eSpeak generate speech by simulating the physics of the human vocal tract. They're tiny (under 5 MB), extremely fast, and run on anything — but they sound robotic. Still used today for accessibility (screen readers) and as phonemizers (converting text to pronunciation symbols) inside neural systems.

Generation 2 — Neural Network TTS (2018–2023) Models like VITS, Tacotron, and FastSpeech use neural networks to generate mel spectrograms (visual representations of audio frequency over time), then a separate "vocoder" neural network converts those spectrograms into actual audio waveforms. Piper, MeloTTS, and Kokoro-82M use this approach. They sound much more natural than formant synthesis but typically can't clone arbitrary voices.

Generation 3 — LLM-Based TTS (2024–present) The current state of the art. These models treat speech generation like a language modeling problem: text goes in, and discrete "audio tokens" come out, which are then decoded into waveforms by a neural audio codec. Fish Audio S2 Pro, Qwen3-TTS, CosyVoice, Orpheus, TADA, and most models in this guide use this approach. They can clone voices, express emotions, and generate highly natural speech — but they need more compute.

The two-stage pipeline (LLM-based TTS)

Text → [LLM Backbone] → Audio Tokens → [Neural Codec Decoder] → Waveform → Speaker
        "What to say"                    "How it sounds"

Stage 1 — The LLM backbone reads your text and generates a sequence of numbers called "audio tokens." These tokens encode what the speech should sound like — the words, rhythm, pitch, emotion, and speaker identity. This is the slow part because it generates tokens one at a time (autoregressive).

Stage 2 — The codec decoder takes those audio tokens and reconstructs actual audio waveforms. This is fast because it runs in parallel. Think of it like decompressing a ZIP file — the hard work was creating the compressed version.
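
The two stages can be sketched as toy code (all function names, token values, and the duration heuristic here are illustrative stand-ins, not any real model's API):

```python
def llm_backbone(text: str, frame_rate: float = 12.5) -> list[int]:
    """Stage 1 (slow, autoregressive): turn text into a sequence of audio tokens.
    The duration estimate and token values are fake, for illustration only."""
    est_seconds = len(text.split()) / 3              # ~3 spoken words per second
    n_tokens = max(1, int(est_seconds * frame_rate))
    return [hash((text, i)) % 1024 for i in range(n_tokens)]

def codec_decoder(tokens: list[int], sample_rate: int = 24_000,
                  frame_rate: float = 12.5) -> list[float]:
    """Stage 2 (fast, parallel): expand each token into a block of samples."""
    samples_per_token = int(sample_rate / frame_rate)    # 1920 samples per token
    return [0.0] * (len(tokens) * samples_per_token)     # silent placeholder audio

tokens = llm_backbone("Hello there, this is a two stage pipeline demo.")
waveform = codec_decoder(tokens)
```

Note the asymmetry: Stage 1 produces one token per step, while Stage 2 expands every token into ~2,000 samples in a single parallel pass.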

Three architecture families

Autoregressive (AR): Generates audio tokens one at a time, left to right, like typing one letter at a time. Produces natural prosody but is inherently sequential. Used by: Fish S2 Pro, Orpheus, OuteTTS, GPT-SoVITS.

  • Pros: Most natural prosody, best long-range coherence, can stream output
  • Cons: Speed limited by sequential generation, can hallucinate (skip or repeat words)
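
The sequential bottleneck is visible in a toy loop (`next_token` is a deterministic stand-in for sampling from a real model):

```python
def next_token(context: list[int]) -> int:
    """Deterministic stand-in for sampling the next audio token from an LLM."""
    return (sum(context) * 31 + len(context)) % 1024

def generate_ar(prompt_tokens: list[int], n_new: int) -> list[int]:
    seq = list(prompt_tokens)
    for _ in range(n_new):       # inherently sequential: one model step per token
        seq.append(next_token(seq))
    return seq

out = generate_ar([1, 2, 3], n_new=5)   # 5 steps, none of them parallelizable
```

Because each call to `next_token` consumes everything generated so far, the steps cannot be batched along the time axis; this is why AR decode speed is bounded by per-step latency.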

Flow Matching / Diffusion: Generates all audio tokens simultaneously by iteratively refining random noise into speech. Like starting with TV static and gradually sharpening it into a clear picture. Used by: F5-TTS, CosyVoice (acoustic stage), Chatterbox (decoder), Zonos.

  • Pros: Parallel generation (fast), high audio fidelity
  • Cons: Multiple refinement steps needed, harder to stream, less natural prosody for long utterances

Hybrid: Combines both — typically AR for temporal structure (deciding what comes when) and flow matching for acoustic detail (making it sound good). This is the dominant approach in 2026. Used by: CosyVoice 2/3, Qwen3-TTS, TADA, VibeVoice, Kani-TTS-2.

  • Pros: Best of both worlds — natural flow + high fidelity
  • Cons: More complex architectures, two-stage latency

Inference Frameworks Explained

These are the software systems that actually run the TTS models on your GPU or CPU. Choosing the right one significantly impacts speed, memory usage, and deployment complexity.

vLLM — The Production Standard

What it is: An open-source inference engine originally built for serving large language models. It has become the most widely used framework for deploying LLM-based applications in production.

How it helps TTS: Since modern TTS models use LLM backbones (Llama, Qwen), vLLM's optimizations directly apply. Its key innovation is PagedAttention — a memory management system that reduces GPU memory waste from 60–80% to under 4% by splitting the KV cache (the model's "working memory") into efficiently managed blocks.

Key features:

  • PagedAttention (efficient memory management)
  • Continuous batching (handles multiple requests efficiently)
  • OpenAI-compatible API server (drop-in replacement)
  • Broad model support

TTS models supported: Qwen3-TTS (via vLLM-Omni), Orpheus, Fun-CosyVoice3, OuteTTS (experimental)

Pros:

  • Most mature and well-documented
  • Largest community and ecosystem
  • Best time-to-first-token (TTFT) performance
  • OpenAI-compatible API makes integration easy

Cons:

  • Requires enforce_eager=True on DGX Spark (CUDA graphs fail via Triton)
  • vLLM-Omni (TTS extension) is separate from main vLLM and less mature
  • Heavy installation — full PyTorch dependency chain
  • Not all TTS models have vLLM backends yet

DGX Spark Status

Works with workarounds. Must build from source for aarch64.

SGLang — The Performance Leader

What it is: An inference framework developed by the LMSYS team (the people behind Chatbot Arena). It emphasizes maximum throughput and minimum latency.

How it helps TTS: SGLang's killer feature for TTS is RadixAttention — a system that caches computed results in a tree structure. When you use the same voice reference clip across multiple TTS requests, SGLang automatically reuses the cached computation instead of re-processing it. Fish Audio S2 Pro achieves 86.4% cache hit rates with this, making repeated voice cloning nearly free.

Key features:

  • RadixAttention (prefix caching — huge for voice cloning)
  • CUDA graph replay (faster individual token generation)
  • Zero-overhead CPU scheduler
  • EAGLE3 speculative decoding support

TTS models supported: Fish Audio S2 Pro (primary), Orpheus

Pros:

  • Fastest per-token generation speed
  • RadixAttention is a game-changer for voice cloning workloads
  • Official DGX Spark Docker image (lmsysorg/sglang:spark)
  • EAGLE3 speculative decoding (~2× speedup)

Cons:

  • Smaller community than vLLM
  • Fewer TTS model integrations
  • Requires building sgl-kernel from source for DGX Spark (30–60 min build)
  • Documentation is less comprehensive

DGX Spark Status

Works via official Docker image. sgl-kernel must be built with TORCH_CUDA_ARCH_LIST="12.1a".

TensorRT-LLM — NVIDIA's Optimized Runtime

What it is: NVIDIA's proprietary inference optimization toolkit. It compiles models into highly optimized execution plans that squeeze maximum performance from NVIDIA GPUs.

How it helps TTS: TRT-LLM achieves the deepest hardware-specific optimizations, including NVFP4 quantization, custom CUDA kernels, and hardware-aware scheduling. CosyVoice achieves a 4× speedup with TRT-LLM compared to raw PyTorch.

Key features:

  • Maximum NVIDIA GPU utilization
  • NVFP4 / FP8 quantization with quality preservation
  • Integrated with NVIDIA Triton Inference Server
  • Part of NVIDIA NIM ecosystem

TTS models supported: CosyVoice 2/3, Spark-TTS, NVIDIA Magpie

Pros:

  • Fastest absolute throughput on NVIDIA hardware
  • Native DGX Spark support
  • NVIDIA's full engineering team behind it
  • Best quantization quality at lowest precision (FP4)

Cons:

  • NVIDIA-only (no AMD, no Apple Silicon)
  • Complex setup and compilation process
  • Fewer community-contributed model conversions
  • Proprietary components limit community contribution

DGX Spark Status

Fully supported. Best path for NVIDIA-ecosystem models.

llama.cpp — The Universal Runner

What it is: A C/C++ implementation of LLM inference that runs on virtually any hardware — CPUs, GPUs (NVIDIA, AMD, Apple), and even phones. It uses its own model format called GGUF.

How it helps TTS: Since TTS models increasingly use LLM backbones (Llama, Qwen), llama.cpp can run them with aggressive quantization (Q4, Q8) and minimal dependencies. It has native TTS support via the llama-tts binary for OuteTTS models, and Orpheus runs through it via Ollama.

Key features:

  • Runs everywhere (CPU, CUDA, Metal, Vulkan, ROCm)
  • GGUF quantization (Q2 through Q8, plus mixed precision)
  • Minimal dependencies (no Python, no PyTorch)
  • Memory-mapped model loading (instant startup)
  • ARM64 / aarch64 native support

TTS models supported: OuteTTS 1.0 (native), Orpheus (via SNAC wrapper), NeuTTS Air/Nano

Pros:

  • Best DGX Spark compatibility — excellent ARM64 support with official benchmarks
  • Tiny footprint, no Python dependency chain
  • GGUF quantization offers finest-grained size/quality tradeoff
  • Works offline with zero network requirements
  • Active development with frequent releases

Cons:

  • Not all TTS models have GGUF conversions
  • No built-in audio codec support (SNAC decoding requires separate Python script)
  • Lower throughput than GPU-native frameworks for large models
  • Less sophisticated batching and scheduling

DGX Spark Status

Excellent. Build with -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121. Set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1.

ONNX Runtime — The Lightweight Option

What it is: Microsoft's cross-platform inference engine for models exported to the ONNX (Open Neural Network Exchange) format. Think of ONNX as a "universal model format" that many frameworks can export to.

How it helps TTS: Lightweight models (Kokoro-82M, Piper, KittenTTS) export to ONNX for deployment on CPUs, mobile devices, and browsers. The sherpa-onnx project provides a curated TTS model zoo with pre-converted ONNX models and cross-platform binaries.

Key features:

  • Cross-platform (Windows, Linux, macOS, Android, iOS, browser)
  • CPU and GPU execution providers
  • Small runtime footprint
  • Pre-built binaries for many architectures

TTS models supported: Kokoro-82M, Piper, KittenTTS, Matcha-TTS, Chatterbox Turbo (ONNX export), MeloTTS (via sherpa-onnx)

Pros:

  • Simplest deployment path for lightweight models
  • Runs in browsers via WebAssembly
  • No Python or PyTorch required for inference
  • sherpa-onnx provides ready-to-use cross-platform binaries

Cons:

  • Only works with models that have ONNX exports
  • LLM-backbone TTS models generally don't export well to ONNX
  • Limited optimization compared to vLLM/SGLang for large models
  • No sophisticated KV cache management

DGX Spark Status

ARM64 CUDA execution provider works. Community wheels from Ultralytics fill the gap for official aarch64 binaries.

Framework decision matrix

| Your situation | Recommended framework | Why |
|---|---|---|
| Deploying Fish S2 Pro for maximum quality | SGLang | RadixAttention voice caching is essential |
| Deploying Qwen3-TTS or CosyVoice | vLLM-Omni | Day-0 support with streaming |
| Maximum speed on NVIDIA hardware | TensorRT-LLM | Deepest hardware optimization |
| DGX Spark with minimal build pain | llama.cpp | Best ARM64 support, GGUF models work |
| Lightweight models (Kokoro, Piper) | ONNX Runtime | Simplest deployment, runs anywhere |
| Multiple models, production serving | Triton Inference Server | Ensemble scheduling, per-stage batching |
| Quick prototyping, any model | Raw PyTorch | Works with everything, no framework overhead |

Audio Codecs and Tokenizers

What is a neural audio codec?

A neural audio codec is a neural network that compresses audio into a compact sequence of numbers (tokens) and decompresses them back into audio. Think of it like MP3 compression, but powered by AI and designed specifically for speech.

Audio waveform → [Encoder] → Tokens (small numbers) → [Decoder] → Audio waveform
                              "Compressed speech"

The codec is what makes LLM-based TTS possible: the LLM generates tokens (which it's good at), and the codec decoder converts those tokens into actual sound (which it's good at).

Why frame rate matters so much

The "frame rate" of a codec is how many tokens it produces per second of audio. This directly determines how fast TTS can run:

  • 50 Hz codec (NeuCodec): 50 tokens per second of audio. To generate 10 seconds of speech, the LLM must produce 500 tokens.
  • 12.5 Hz codec (Qwen3-TTS-12Hz, Mimi): Only 12.5 tokens per second. Same 10 seconds needs only 125 tokens — 4× less work.
  • 2–3 Hz codec (TADA): Only 20–30 tokens for 10 seconds — 17× less work than 50 Hz.

On the DGX Spark, where memory bandwidth limits token generation to ~45–273 tok/s depending on model size, this difference is the difference between real-time and non-real-time operation.
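
The arithmetic is simple enough to check directly; the throughput figures are the ones quoted above:

```python
def tokens_needed(seconds_of_audio: float, frame_rate_hz: float) -> int:
    return round(seconds_of_audio * frame_rate_hz)

def is_realtime(frame_rate_hz: float, tokens_per_second: float) -> bool:
    # Real-time operation requires generating at least frame_rate tokens per second.
    return tokens_per_second >= frame_rate_hz

assert tokens_needed(10, 50) == 500      # 50 Hz codec: 500 tokens for 10 s
assert tokens_needed(10, 12.5) == 125    # 12.5 Hz codec: 4x less work
assert not is_realtime(50, 45)           # 45 tok/s cannot feed a 50 Hz codec
assert is_realtime(12.5, 45)             # but easily feeds a 12.5 Hz codec
```

The same model backbone at the same tok/s can thus be real-time or not, purely depending on which codec it targets.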

Major codecs in the TTS ecosystem

| Codec | Frame Rate | Creator | Used By | Key Feature |
|---|---|---|---|---|
| SNAC | Multi-scale (12/24/48/96 Hz) | Independent | Orpheus | Hierarchical structure captures detail at multiple scales |
| DAC (Descript Audio Codec) | ~86 Hz | Descript | OuteTTS, Zonos | High fidelity, widely adopted |
| Mimi | 12.5 Hz | Kyutai | Sesame CSM, Dia2 | Ultra-low frame rate, good quality |
| EnCodec | 24/48 Hz | Meta | MusicGen, early TTS | First widely used neural codec |
| NeuCodec | 50 Hz | Neuphonic | NeuTTS Air | Single codebook, simple decoding |
| NanoCodec | 21.5 Hz | NVIDIA | Magpie TTS, Kani-TTS-2 | Optimized for NVIDIA hardware |
| BiCodec | 50 Hz | HKUST | Spark-TTS | Decouples speaker identity from content |
| σ-VAE | 7.5 Hz | Microsoft | VibeVoice | Continuous (not discrete) tokens |
| RVQ (Fish S2) | ~21 Hz, 10 codebooks | Fish Audio | Fish S2 Pro | 10-layer residual quantization |

Single-codebook vs multi-codebook

Single-codebook: One token per time step. Simpler for the LLM to generate. Used by CosyVoice (25 Hz), Kokoro.

Multi-codebook (RVQ): Multiple tokens per time step, each adding finer acoustic detail. Like a base layer + enhancement layers. The LLM generates the first codebook, then either another network or MTP generates the rest. Used by Qwen3-TTS (16 layers), Fish S2 Pro (10 layers), SNAC (4 layers).

Practical impact: Multi-codebook models sound better but need strategies (MTP, fast-AR heads, or flow matching) to handle the extra tokens efficiently.


Quantization Formats

What is quantization?

Quantization means storing model weights using fewer bits per number. A model trained in 32-bit floating point (FP32) can often run nearly as well in 16-bit (FP16/BF16) or even 4-bit (INT4) format. This matters because:

  1. Smaller model = less memory needed (a 3B model goes from ~12 GB in FP32 to ~6 GB in FP16 to ~1.5 GB in INT4)
  2. Smaller model = faster inference (less data to read from memory each token)
  3. On bandwidth-limited hardware like DGX Spark, this is the single biggest lever for performance
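
The size arithmetic for the 3B example above (weight storage only; real files add metadata and quantization scale factors, so they run slightly larger):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-storage size in GB for a model at a given precision."""
    return n_params * bits_per_weight / 8 / 1e9

n = 3e9                          # the 3B-parameter example from the text
fp32 = model_size_gb(n, 32)      # 12.0 GB
fp16 = model_size_gb(n, 16)      #  6.0 GB
int4 = model_size_gb(n, 4)       #  1.5 GB
```

On bandwidth-bound decode, the same ratios apply to per-token read traffic, which is why quantization speeds up generation roughly in proportion to the size reduction.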

Format comparison

| Format | Bits | Size Reduction | Quality Impact | Where Used |
|---|---|---|---|---|
| FP32 | 32 | 1× (baseline) | Perfect | Training, reference |
| BF16 | 16 | 2× | Negligible | Default for Blackwell; Qwen3-TTS, Fish S2 Pro |
| FP16 | 16 | 2× | Negligible for most models | General GPU inference |
| FP8 | 8 | 4× | <1% quality loss typically | vLLM, SGLang (newer GPUs) |
| INT8 | 8 | 4× | 1–3% MOS drop | Good tradeoff; KittenTTS offers INT8 |
| INT4 (W4A16) | 4 | 8× | Noticeable on some models; prosody artifacts possible | Best decode speed on DGX Spark |
| NVFP4 | 4 | 8× | Similar to INT4 | NVIDIA-specific; underperforms INT4 for decode on Spark |
| GGUF Q4 | ~4 | ~8× | Good with K-quant variants | llama.cpp ecosystem (Orpheus, OuteTTS, NeuTTS) |
| GGUF Q8 | ~8 | ~4× | Nearly lossless | Higher-quality GGUF option |
| ONNX Q4 | 4 | ~8× | Acceptable for small models | Browser/mobile deployment (Kokoro) |

GGUF explained

GGUF (GPT-Generated Unified Format) is the model format used by llama.cpp. It packages model weights, tokenizer, and metadata into a single file with built-in quantization. GGUF files are self-contained — no Python, no PyTorch, no separate tokenizer files needed.

Key GGUF variants:

  • Q2_K: Smallest (~2-bit average), significant quality loss
  • Q4_K_M: Best balance of size and quality for most use cases
  • Q5_K_M: Slightly larger, slightly better than Q4
  • Q8_0: Near-lossless, ~4× compression
  • F16: Full FP16, largest GGUF option

Which format for DGX Spark?

For decode-bound TTS inference on the Spark's 273 GB/s bandwidth:

  • If quality is paramount: BF16 (native Blackwell format)
  • If speed matters most: INT4 with Marlin kernels (fastest decode)
  • If using llama.cpp: GGUF Q4_K_M (best quality-per-byte)

Avoid NVFP4 for Decode-Heavy Workloads

NVFP4 is counterintuitively slower than INT4 on the DGX Spark for decode-bound TTS inference.


Attention Mechanisms

What is attention and why does it matter for TTS?

Attention is the mechanism that lets a model consider all previous context when generating each new token. In TTS, this means the model looks at the entire text input, the voice reference audio, and all previously generated audio tokens to decide what comes next.

The problem: attention's memory usage grows with the square of the sequence length (O(n²)). A 1,000-token sequence needs ~1 million attention computations; a 10,000-token sequence needs ~100 million. For TTS, where audio token sequences can exceed 500 tokens plus text and reference audio, this becomes a real bottleneck.
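
The quadratic growth is easy to verify:

```python
def attention_scores(n_tokens: int) -> int:
    # Every token attends to every token: n * n pairwise score computations.
    return n_tokens * n_tokens

assert attention_scores(1_000) == 1_000_000        # ~1 million
assert attention_scores(10_000) == 100_000_000     # ~100 million: 100x the work for 10x the length
```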

Flash Attention

What it is: An optimized implementation of the attention computation that reduces memory usage from O(n²) to O(n) by processing attention in tiles rather than materializing the full attention matrix. Developed by Tri Dao at Princeton.

Flash Attention 2 (FA2): The current production standard. Required by Qwen3-TTS (without it, inference is 10× slower). Must be built from source on DGX Spark (~15 min).

Flash Attention 3 is Incompatible with Blackwell

Flash Attention 3 (FA3) does NOT work on DGX Spark. Use FA2 instead.

Practical impact: For a typical TTS generation with 500 audio tokens plus 100 text tokens, FA2 reduces attention memory from ~1.4 GB to ~24 MB and speeds up the attention step by 2–4×.

SDPA (Scaled Dot-Product Attention)

What it is: PyTorch's built-in optimized attention implementation (torch.nn.functional.scaled_dot_product_attention). It automatically selects the best available backend (Flash Attention, memory-efficient attention, or math fallback).

Why it matters for DGX Spark: When Flash Attention builds fail on aarch64, SDPA provides a reliable fallback that still offers significant speedups. VibeVoice on DGX Spark actually runs well with SDPA and may be faster than FA2 on Blackwell in some cases.

FlashInfer

What it is: A library of specialized attention kernels optimized for serving (as opposed to training). Used by SGLang for its paged KV cache attention implementation.

DGX Spark note: FlashInfer kernels may not all have sm_121 targets yet. SGLang's Spark Docker image includes workarounds.

PagedAttention

What it is: Not an attention algorithm per se, but a memory management strategy for the KV cache (the model's stored attention state). Instead of allocating one contiguous block per request, PagedAttention splits the KV cache into fixed-size "pages" that can be allocated on demand and shared across requests.

Why it matters: Reduces GPU memory waste from 60–80% to under 4%, allowing more requests or larger contexts. Essential for vLLM's efficiency.
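
A sketch of why paging helps, comparing worst-case waste under contiguous reservation versus fixed-size pages (the 16-token page size here is illustrative, close to vLLM's default block size):

```python
PAGE_SIZE = 16  # tokens per KV-cache page; illustrative value

def contiguous_waste(actual_tokens: int, reserved_tokens: int) -> int:
    """Contiguous allocation must reserve space for the maximum possible length."""
    return reserved_tokens - actual_tokens

def paged_waste(actual_tokens: int) -> int:
    """With paging, only the final, partially filled page can be wasted."""
    remainder = actual_tokens % PAGE_SIZE
    return 0 if remainder == 0 else PAGE_SIZE - remainder

# A request that generates 600 tokens after reserving room for 2048:
assert contiguous_waste(600, 2048) == 1448    # ~70% of the reservation wasted
assert paged_waste(600) == 8                  # at most PAGE_SIZE - 1 tokens wasted
```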

RadixAttention

What it is: SGLang's extension of PagedAttention that stores cached KV states in a radix tree data structure. When multiple requests share the same prefix (like the same voice reference audio), the cached computation is automatically reused.

Why it's critical for TTS voice cloning: If you're generating speech in the same cloned voice repeatedly, the voice reference audio produces the same KV cache entries every time. RadixAttention caches these and skips re-computation, achieving 86.4% cache hit rates for Fish S2 Pro. This means the voice reference audio is effectively "free" after the first request.
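
A minimal prefix cache in the spirit of RadixAttention (a real implementation uses a radix tree and stores actual KV tensors; every name and value below is a stand-in):

```python
cache: dict[tuple, str] = {}
prefill_calls = 0

def prefill(tokens: tuple) -> str:
    """Stands in for the expensive KV-cache computation over a token prefix."""
    global prefill_calls
    prefill_calls += 1
    return f"kv[{len(tokens)}]"

def run_request(voice_ref: tuple, text: tuple) -> str:
    if voice_ref not in cache:          # cache miss: pay for the prefix once
        cache[voice_ref] = prefill(voice_ref)
    return cache[voice_ref] + f"+new[{len(text)}]"

ref = tuple(range(200))                 # the same voice reference clip, tokenized
run_request(ref, (1, 2, 3))
run_request(ref, (4, 5, 6, 7))          # prefix KV reused; no second prefill
assert prefill_calls == 1
```

The second request skips the reference-audio prefill entirely, which is the mechanism behind the "voice reference is effectively free after the first request" behavior described above.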


Key Metrics Explained

RTF (Real-Time Factor)

Definition: The ratio of wall-clock time to generated audio duration.

RTF = Time to generate audio / Duration of generated audio
  • RTF 1.0: Takes 10 seconds to generate 10 seconds of audio (exactly real-time)
  • RTF 0.5: Takes 5 seconds to generate 10 seconds of audio (2× faster than real-time)
  • RTF 0.1: Takes 1 second to generate 10 seconds of audio (10× faster than real-time)
  • RTF 2.0: Takes 20 seconds to generate 10 seconds of audio (too slow for real-time)

For real-time conversation: You need RTF < 1.0, ideally < 0.5 to leave headroom for pipeline overhead.
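
As a formula in code:

```python
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: wall-clock generation time over audio duration."""
    return generation_seconds / audio_seconds

assert rtf(5, 10) == 0.5     # 2x faster than real-time
assert rtf(20, 10) == 2.0    # too slow for real-time use
assert rtf(4, 10) < 0.5      # leaves headroom for pipeline overhead
```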

TTFA (Time-to-First-Audio)

Definition: The time from submitting text to receiving the first playable audio chunk.

This is arguably more important than RTF for conversational feel. A model with RTF 0.3 but TTFA 2 seconds feels sluggish because the user waits 2 seconds of silence before hearing anything. A model with RTF 0.8 but TTFA 100 ms feels responsive because audio starts almost immediately.

Targets:

  • < 100 ms: Imperceptible delay (Qwen3-TTS achieves 97 ms)
  • 100–300 ms: Feels responsive
  • 300–500 ms: Noticeable but acceptable for conversation
  • 500–1000 ms: Feels slow
  • > 1000 ms: Feels broken
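
Measuring TTFA correctly means stopping the clock at the first chunk of a streaming response, not at the end of generation. A sketch with a fake generator standing in for a real TTS stream:

```python
import time

def fake_tts_stream(n_chunks: int = 5, chunk_delay: float = 0.01):
    """Fake streaming TTS: yields fixed-size chunks with simulated compute time."""
    for _ in range(n_chunks):
        time.sleep(chunk_delay)       # stands in for per-chunk model compute
        yield b"\x00" * 960           # 20 ms of 24 kHz 16-bit mono silence

start = time.perf_counter()
stream = fake_tts_stream()
first_chunk = next(stream)            # TTFA is measured up to this point only
ttfa = time.perf_counter() - start
remaining = list(stream)              # the rest keeps arriving during playback
```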

WER (Word Error Rate)

Definition: Percentage of words incorrectly reproduced when TTS output is fed through a speech-to-text system. Measures intelligibility and accuracy.

WER = (Substitutions + Insertions + Deletions) / Total words × 100%

A WER of 0% means every word was perfectly reproduced. The best models achieve 1–3% WER.
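
The formula above can be computed with a straightforward word-level edit distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER (%) via dynamic-programming edit distance over words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref) * 100

assert wer("the cat sat", "the cat sat") == 0.0
assert wer("the cat sat on the mat", "the cat sat on mat") > 0  # one deletion
```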

SIM (Speaker Similarity)

Definition: Cosine similarity between speaker embeddings extracted from the reference audio and the generated audio. Measures how well voice cloning preserves the original speaker's identity.

  • > 0.90: Excellent cloning fidelity
  • 0.80–0.90: Good — recognizably the same speaker
  • 0.70–0.80: Moderate — some characteristics preserved
  • < 0.70: Poor — different speaker
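
The similarity computation itself is simple; the hard part (extracting speaker embeddings) is done by a separate model. With made-up toy vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

ref_embedding = [0.6, 0.8, 0.0]                         # reference speaker
gen_embedding = [0.6, 0.8, 0.1]                         # generated speech, close match
sim = cosine_similarity(ref_embedding, gen_embedding)   # just above 0.99
```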

MOS (Mean Opinion Score)

Definition: Average subjective quality rating on a 1–5 scale from human listeners.

  • 5.0: Indistinguishable from human speech (impossible to achieve consistently)
  • 4.0–4.5: Excellent quality, natural-sounding
  • 3.5–4.0: Good quality, minor artifacts
  • 3.0–3.5: Acceptable, noticeable synthesis artifacts
  • < 3.0: Poor quality

Voice Pipeline Concepts

What is a voice pipeline?

A voice pipeline is the complete system that enables spoken conversation with an AI. It chains three (or more) AI models together:

You speak → [STT] → Text → [LLM] → Response text → [TTS] → AI speaks
             ↑                ↑                        ↑
         "Listen"          "Think"                  "Speak"

STT (Speech-to-Text): Converts your spoken words into text. Common models: Whisper, NVIDIA Parakeet/Canary.

LLM (Large Language Model): Processes the text and generates a response. Common models: Llama, Qwen, GPT-OSS.

TTS (Text-to-Speech): Converts the response text into spoken audio. This is what this guide covers.

What is Pipecat?

Pipecat is an open-source Python framework by Daily.co that orchestrates voice pipelines. Instead of manually connecting STT → LLM → TTS yourself, Pipecat provides a pipeline abstraction where you plug in "Frame Processors" and it handles:

  • Streaming: Automatically streams LLM tokens into TTS as they're generated
  • Interruption: Detects when the user starts speaking and stops the AI mid-sentence
  • Turn-taking: Determines when the user has finished speaking
  • Audio transport: Handles WebRTC, WebSocket, or local audio device I/O

What is "streaming" in TTS?

There are two kinds of streaming in TTS:

Text input streaming: The TTS model starts generating audio before it receives the complete text. As the LLM produces words, the TTS starts speaking them. Not all models support this — it requires architectures that can begin synthesis from partial input.

Audio output streaming: The system starts playing audio before the TTS has finished generating the entire response. The first chunk of audio plays while subsequent chunks are still being generated. Most modern TTS models support this.

Bidirectional streaming: Both text input and audio output streaming simultaneously. The TTS accepts text character by character AND outputs audio chunks as they're ready. Supported by: Qwen3-TTS, CosyVoice 2/3, VibeVoice-Realtime, Dia2. This is the gold standard for conversational latency.
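
The text-in, audio-out flow can be sketched with generators (`synth_chunk` is a made-up stand-in for incremental synthesis):

```python
def synth_chunk(text_piece: str) -> bytes:
    """Made-up stand-in: pretend each text piece becomes some audio bytes."""
    return bytes(len(text_piece))

def bidirectional_tts(text_stream):
    # Audio is emitted per text piece, before the full sentence is known.
    for piece in text_stream:
        yield synth_chunk(piece)

llm_output = iter(["Hello ", "there, ", "how are ", "you?"])
chunks = [chunk for chunk in bidirectional_tts(llm_output)]
# In a real pipeline, each chunk would be played as soon as it is yielded.
```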

What is VAD?

VAD (Voice Activity Detection) determines when someone is speaking vs when there's silence. It's the gatekeeper that triggers STT processing and determines turn-taking.

Silero VAD is the most commonly used open-source VAD. It's tiny (~2 MB), runs on CPU, and processes audio in real-time with ~100 ms frame windows. Pipecat uses it by default.

SmartTurn is Pipecat's proprietary turn-detection model that goes beyond simple silence detection to understand when a user has actually finished their thought (vs just pausing mid-sentence).

What is "barge-in"?

Barge-in is when the user interrupts the AI while it's speaking. A good voice pipeline:

  1. Detects the interruption within ~200 ms
  2. Immediately stops TTS audio playback
  3. Flushes any buffered audio that hasn't been played yet
  4. Starts processing the user's new speech
  5. Tracks exactly what the AI had said before being interrupted (for context)
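
The steps above, reduced to a minimal state sketch (all names are illustrative; real pipelines such as Pipecat manage this internally):

```python
class BargeInHandler:
    """Minimal barge-in state: stop, flush, and keep context on interruption."""

    def __init__(self) -> None:
        self.speaking = False
        self.buffered_audio: list[bytes] = []
        self.spoken_text = ""

    def start_speaking(self, text: str, chunks: list[bytes]) -> None:
        self.speaking = True
        self.spoken_text = text            # what the AI is in the middle of saying
        self.buffered_audio = list(chunks)

    def on_user_speech(self) -> str:
        # Steps 1-2: interruption detected; stop playback immediately.
        self.speaking = False
        # Step 3: flush buffered audio that was never played.
        self.buffered_audio.clear()
        # Step 5: return what had been said so far, for conversation context.
        return self.spoken_text

h = BargeInHandler()
h.start_speaking("The weather today is", [b"a", b"b", b"c"])
context = h.on_user_speech()
```

A real implementation also needs step 4 (routing the user's new speech into STT) and must track playback position so the stored context reflects only the words actually heard.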

Licensing Explained

Why TTS licensing is tricky

A TTS model has (at least) two separately-licensed components:

  1. The code (inference scripts, training scripts, server code)
  2. The model weights (the actual neural network parameters)

A project can have MIT-licensed code but non-commercial weights. You must check both.

Common license types in TTS

| License | Commercial Use | Redistribution | Key Restriction |
|---|---|---|---|
| MIT | Yes | Yes | None |
| Apache 2.0 | Yes | Yes | Patent grant; must include notice |
| CC-BY-NC-4.0 | No | Yes (non-commercial) | No commercial use of model or outputs |
| CPML (Coqui) | No | Limited | Non-commercial use of model AND generated audio |
| Fish Audio Research | No (research free) | Research only | Commercial requires separate license from Fish Audio |
| GPL-3.0 | Yes (with conditions) | Must share source | Derivative works must also be GPL |
| AGPL-3.0 | Yes (with conditions) | Must share source | Network use triggers copyleft (strictest GPL variant) |
| bilibili Model License | Requires authorization | Limited | Written auth needed for commercial; can't improve other AI models |
| NVIDIA Open Model | Check terms | Limited | May restrict commercial use on HuggingFace |
| Llama-derivative | Yes (Meta terms) | Yes (Meta terms) | Inherits Meta's Llama license requirements |

The "dual license" trap

Always Check Both Code and Weight Licenses

Always check both code and weights. Common traps:

  • F5-TTS: Code is MIT but weights are CC-BY-NC (because of training data)
  • IndexTTS 1.x: Code is Apache 2.0 but weights have bilibili restrictions
  • XTTS-v2: Code is MPL-2.0 but model weights are CPML (non-commercial)
  • StyleTTS2: Code is MIT but pretrained models require disclosure that speech is synthetic

Safe commercial options

For commercial deployment without licensing concerns, these models have both permissive code AND weight licenses:

Qwen3-TTS (Apache 2.0), TADA (MIT), Chatterbox (MIT), Orpheus (Apache 2.0), OuteTTS (Apache 2.0), Kokoro (Apache 2.0), NeuTTS Air (Apache 2.0), Kani-TTS-2 (Apache 2.0), Dia/Dia2 (Apache 2.0), Spark-TTS (Apache 2.0), Sesame CSM (Apache 2.0), CosyVoice/Fun-CosyVoice3 (Apache 2.0)


Complete Glossary

| Term | Definition |
|---|---|
| aarch64 / ARM64 | The CPU architecture used by DGX Spark, Apple Silicon Macs, Raspberry Pi, and most smartphones |
| AR (Autoregressive) | Generating output one token at a time, where each token depends on all previous ones |
| Attention | The mechanism that lets a model consider all context when generating each output token |
| BF16 (bfloat16) | A 16-bit floating-point format with the same range as FP32 but less precision. Native on Blackwell |
| Barge-in | When a user interrupts the AI while it's speaking |
| Blackwell | NVIDIA's GPU architecture used in DGX Spark (GB10) and RTX 50-series |
| Codec (neural audio) | A neural network that compresses/decompresses audio to/from discrete tokens |
| Continuous batching | Dynamically adding/removing requests during inference rather than fixed batch sizes |
| CUDA | NVIDIA's parallel computing platform for GPU programming |
| CUDA graphs | Pre-recorded sequences of GPU operations that can be replayed with minimal overhead |
| DAC | Descript Audio Codec — a widely used neural audio codec |
| DGX Spark | NVIDIA's compact AI inference device with GB10 Grace Blackwell chip and 128 GB unified memory |
| Diffusion | A generation method that iteratively refines noise into the target output |
| DiT | Diffusion Transformer — a transformer-based diffusion model |
| Dual-AR | Two autoregressive components: one for temporal (time-axis) and one for depth (codebook layers) |
| EAGLE3 | A speculative decoding method that uses a small draft model for ~2× speedup |
| eSpeak-NG | A lightweight formant-based speech synthesizer, often used as a phonemizer in neural TTS |
| FA2 / FA3 | Flash Attention version 2 / 3. FA3 does NOT work on Blackwell |
| Flow matching | A generation method that learns a smooth path from noise to output (faster than diffusion) |
| FP4 / FP8 / FP16 / FP32 | Floating-point number formats using 4/8/16/32 bits per number |
| GGUF | Model format used by llama.cpp with built-in quantization |
| Grace | NVIDIA's ARM-based CPU architecture used in DGX Spark |
| INT4 / INT8 | Integer quantization using 4/8 bits per weight |
| KV cache | Stored attention key-value pairs from previous tokens, reused during generation |
| LLM | Large Language Model — the backbone of modern TTS systems |
| Mel spectrogram | A visual representation of audio frequency content, used as intermediate representation in older TTS |
| Mimi | Kyutai's 12.5 Hz neural audio codec used by Sesame CSM and Dia2 |
| MOS | Mean Opinion Score — subjective quality rating (1–5 scale) |
| MTP | Multi-Token Prediction — generating multiple codec layers in a single model step |
| NVLink-C2C | High-speed interconnect between CPU and GPU dies in DGX Spark (900 GB/s) |
| ONNX | Open Neural Network Exchange — a universal model format for inference |
| PagedAttention | vLLM's memory management for KV cache using fixed-size pages |
| Pipecat | Open-source Python framework for real-time voice AI pipelines |
| Phonemizer | Converts written text to pronunciation symbols (phonemes) |
| Prosody | The rhythm, stress, and intonation patterns in speech |
| RadixAttention | SGLang's prefix-caching system using radix trees for KV cache reuse |
| RTF | Real-Time Factor — ratio of generation time to audio duration |
| RVQ | Residual Vector Quantization — multi-layer codec quantization |
| SDPA | Scaled Dot-Product Attention — PyTorch's built-in optimized attention |
| SGLang | High-performance inference framework with RadixAttention |
| sm_121 / sm_121a | NVIDIA GPU compute capability for Blackwell GB10 |
| SNAC | Multi-scale Neural Audio Codec with hierarchical structure |
| Speculative decoding | Using a small draft model to propose tokens verified by the main model |
| SSML | Speech Synthesis Markup Language — XML-based speech control standard |
| STT | Speech-to-Text (also called ASR — Automatic Speech Recognition) |
| Tensor Cores | Specialized GPU units for fast matrix multiplication |
| torch.compile | PyTorch's JIT compilation for optimized inference |
| TRT-LLM | TensorRT-LLM — NVIDIA's optimized LLM inference engine |
| TTFA | Time-to-First-Audio — latency before first audio chunk is ready |
| Unified memory | DGX Spark's shared memory pool accessible by both CPU and GPU |
| VAD | Voice Activity Detection — determines when someone is speaking |
| vLLM | Open-source LLM inference framework with PagedAttention |
| vLLM-Omni | Extension of vLLM for multimodal models including TTS |
| Vocoder | Converts mel spectrograms or latent representations into audio waveforms |
| Voice cloning | Generating speech that matches a specific speaker's voice from a reference recording |
| WER | Word Error Rate — measures TTS accuracy via STT round-trip |
| Zero-shot | Voice cloning without fine-tuning — just provide a reference audio clip |

This is a companion document to "The Complete Offline TTS Guide (2026 Edition)." Return to the main guide for model comparisons, deployment configurations, and DGX Spark-specific instructions.
