Choosing the right hardware for self-hosted TTS — from Raspberry Pi to RTX 5090, ranked by the metric that matters most: memory bandwidth.
Why Memory Bandwidth Determines TTS Speed
In autoregressive TTS decode (how LLM-backbone TTS models generate audio tokens), producing each token requires reading the full model weights from memory. At batch size 1, the GPU spends most of its time loading weights, not computing.
The formula:
Max tokens/sec ≈ Memory bandwidth (GB/s) ÷ Model size (GB)
This means:
- A 1 GB model on hardware with 1,000 GB/s bandwidth → ~1,000 tok/s max
- The same 1 GB model on 273 GB/s bandwidth → ~273 tok/s max
- 3.7× bandwidth difference = 3.7× speed difference (at batch size 1)
Compute (TFLOPS) only matters for prefill (processing input text and reference audio). For the actual audio token generation loop, bandwidth is king.
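As a rough sketch, the bandwidth ceiling can be computed directly. This assumes the decode loop is purely weight-bound and ignores KV-cache traffic, activations, and kernel overhead, so real throughput lands below this bound:

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on autoregressive decode speed for a weight-bound loop.

    Every generated token reads the full model weights once, so the ceiling
    is simply bandwidth divided by model size. Real throughput is lower.
    """
    return bandwidth_gb_s / model_size_gb

# Examples from the text: a 1 GB model on two bandwidth tiers
max_tokens_per_sec(1000, 1.0)  # ~1,000 tok/s
max_tokens_per_sec(273, 1.0)   # ~273 tok/s
```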
Hardware Specifications Comparison
GPU / Accelerator Tier
| Platform | Memory Type | Capacity | Bandwidth | Compute (BF16) | TDP | Price (2026) |
|---|---|---|---|---|---|---|
| RTX 5090 | GDDR7, 512-bit | 32 GB | 1,792 GB/s | 419 TFLOPS | 575W | ~$3,000 street |
| RTX 4090 | GDDR6X, 384-bit | 24 GB | 1,008 GB/s | 330 TFLOPS | 450W | $1,599 (discontinued) |
| RTX 3090 | GDDR6X, 384-bit | 24 GB | 936 GB/s | 71 TFLOPS | 350W | ~$700 used |
| RTX 4080 Super | GDDR6X, 256-bit | 16 GB | 736 GB/s | 198 TFLOPS | 320W | $999 |
| RTX 3060 12GB | GDDR6, 192-bit | 12 GB | 360 GB/s | 25 TFLOPS | 170W | ~$250 used |
Unified Memory Platforms
| Platform | Memory Type | Capacity | Bandwidth | Compute | TDP | Price |
|---|---|---|---|---|---|---|
| Mac Studio M3 Ultra | LPDDR5 | up to 512 GB | 800 GB/s | ~22 TFLOPS (GPU) | ~200W | $3,999–$13,000+ |
| Mac Studio M4 Max | LPDDR5x | up to 128 GB | 546 GB/s | ~18 TFLOPS (GPU) | ~120W | $1,999–$5,999 |
| DGX Spark (GB10) | LPDDR5x | 128 GB | 273 GB/s | ~100 TFLOPS BF16 | ~170W | $3,999–$4,699 |
Apple M4 Ultra does not exist as of March 2026. The current Mac Studio options are M4 Max and M3 Ultra.
Edge / Embedded Tier
| Platform | Memory | Capacity | Bandwidth | GPU/NPU | TDP | Price |
|---|---|---|---|---|---|---|
| Jetson AGX Orin | LPDDR5 | 32/64 GB | 205 GB/s | 2048 CUDA / 275 TOPS | 15–60W | $999–$1,599 |
| Jetson Orin Nano Super | LPDDR5 | 8 GB | 102 GB/s | 1024 CUDA / 67 TOPS | 7–25W | $249 |
| Raspberry Pi 5 | LPDDR4x | 4/8 GB | ~34 GB/s | CPU only | ~12W | $60–$80 |
| Orange Pi 5 (RK3588) | LPDDR5 | 8/16 GB | ~25–50 GB/s | CPU + 6 TOPS NPU | ~18W | $100–$160 |
TTS Benchmarks by Platform
Qwen3-TTS Performance (estimated from community reports)
| Platform | Qwen3-TTS 1.7B RTF | Qwen3-TTS 0.6B RTF | First-Token Latency |
|---|---|---|---|
| RTX 5090 | 0.48–0.55 | 0.32–0.38 | 45–62ms |
| RTX 4090 | 0.65–0.85 | 0.38–0.45 | 52–97ms |
| RTX 3090 | 0.95–1.26 | 0.52–0.68 | 78–145ms |
| RTX 4080 Super | 0.82–1.15 | 0.48–0.62 | — |
| RTX 3060 12GB | 1.65+ | 0.85–1.15 | — |
RTF < 1.0 means faster than real-time. Lower is better.
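For reference, RTF is just wall-clock generation time divided by the duration of the audio produced, a one-line helper:

```python
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time to generate / duration of audio produced.

    RTF < 1.0 means faster than real-time; lower is better.
    """
    return generation_seconds / audio_seconds

# 10 s of audio generated in 5 s of wall-clock time -> RTF 0.5
```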
Cross-Model Performance (NVIDIA L4 benchmark, Inferless)
| Model | 50 words | 100 words | 200 words |
|---|---|---|---|
| Kokoro-82M | <0.1s | <0.2s | <0.3s |
| MeloTTS | <0.5s | <1s | ~1.5s |
| Parler-TTS mini | ~2s | ~4s | ~8s |
| F5-TTS | ~3s | ~6s | ~12s |
| XTTS-v2 | ~8s | ~18s | ~35s |
Other Published Benchmarks
| Model | Platform | RTF | Notes |
|---|---|---|---|
| VITS (non-cloning) | RTX 3090 | 0.015 | 67× real-time — feed-forward, not autoregressive |
| Kani-TTS-2 | RTX 5080 | 0.19 | New LFM2 architecture |
| Kani-TTS-2 | RTX 4080 | ~0.20 | Similar to 5080 (bandwidth-limited) |
| VibeVoice-RT 0.5B | DGX Spark | 0.48 | Community benchmark |
| Chatterbox Turbo | Consumer GPU | ~0.17 | Up to 6× real-time — 1-step distilled decoder |
| TADA-1B | A100 | 0.09 | Ultra-low codec frame rate |
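Published RTF figures vary with text length, sampling settings, and warm-up, so it is worth measuring on your own hardware. A minimal harness follows; the `synthesize` callable is a placeholder for whatever engine you wrap (Piper, Kokoro, etc.) that returns audio samples and a sample rate:

```python
import time

def measure_rtf(synthesize, text: str) -> float:
    """Return the RTF of one synthesis call.

    `synthesize(text)` must return (samples, sample_rate), where samples is
    any sequence of audio samples. RTF < 1.0 means faster than real-time.
    """
    start = time.perf_counter()
    samples, sample_rate = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / (len(samples) / sample_rate)
```

Run it a few times and discard the first call, since model load and CUDA warm-up inflate the initial measurement.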
LLM Token Generation as TTS Proxy
Since modern TTS models generate audio tokens autoregressively on LLM-style backbones, llama.cpp decode benchmarks serve as a useful proxy for TTS decode throughput.
Llama 2 7B Q4_0 (single-stream decode)
| Platform | tok/s | vs RTX 4090 |
|---|---|---|
| RTX 5090 | 264–274 | 1.4× |
| RTX 4090 | 188–190 | 1.0× (baseline) |
| RTX 3090 | 160–162 | 0.85× |
| DGX Spark (batch 1) | ~20.5 | 0.11× |
| DGX Spark (batch 32) | 368 (total) | — |
| Mac Studio M3 Ultra (est.) | ~120–150 | ~0.7× |
| Mac Studio M4 Max (est.) | ~80–100 | ~0.5× |
What this means for TTS
If a TTS model generates audio tokens at the same rate as LLM text tokens:
- RTX 5090: ~270 audio tok/s → real-time at codecs up to ~270 Hz
- RTX 4090: ~190 audio tok/s → real-time at codecs up to ~190 Hz
- DGX Spark: ~20 audio tok/s (batch 1) → real-time only at ~20 Hz or lower
This is why codec frame rate matters so much on the DGX Spark: a ~12.5 Hz codec (as in Qwen3-TTS) needs only ~12.5 tok/s and TADA needs 2–3 tok/s, but a 50 Hz codec needs 50 tok/s — the Spark can't keep up at batch 1 for larger models.
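The arithmetic above can be checked in a couple of lines. This assumes one token per codec frame, which holds for single-codebook codecs; multi-codebook models need proportionally more tokens per frame:

```python
def can_stream_realtime(decode_tok_s: float, codec_hz: float,
                        tokens_per_frame: int = 1) -> bool:
    """True if single-stream decode keeps pace with the codec frame rate."""
    return decode_tok_s >= codec_hz * tokens_per_frame

# DGX Spark at ~20.5 tok/s (batch 1, 7B-class model):
# a 12.5 Hz codec streams in real time, a 50 Hz codec does not.
```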
Edge and ARM Hardware
Raspberry Pi 5
The default platform for Piper TTS, the primary local TTS engine in Home Assistant.
- Piper medium English: estimated RTF 0.1–0.2 (5–10× real-time) with 4 threads
- Kokoro-82M int8 ONNX: approaching real-time on Pi 5 8GB (slower-than-real-time on Pi 4 2GB)
- Home Assistant 2025.7+ added TTS streaming via Wyoming protocol, improving perceived response by ~10×
Rockchip RK3588 (Orange Pi 5, Rock 5B)
Same quad Cortex-A76 cores as the Pi 5, plus four A55 efficiency cores and a 6 TOPS NPU that can accelerate TTS decoders.
- RKLLaMA project: Piper decoder on NPU, encoder via ONNX
- MMS-TTS entirely on NPU
- Potentially faster than Pi 5 for neural TTS (no published RTF benchmarks yet)
Jetson Orin Nano Super ($249)
GPU-accelerated edge option: 1,024 CUDA cores, 8 GB, 67 TOPS.
- Community voice pipelines: Piper + faster-whisper + small LLMs
- sherpa-onnx explicitly supports Jetson Orin (CPU + CUDA)
- 8 GB memory limits model size
Intel N100 Mini-PCs
~$150, ~25W under load. Kokoro via OpenVINO on Intel iGPU achieves 3× CPU speed at 15W TDP. More compute than Pi 5 at ~2× power.
Recommended Hardware by Use Case
| Use Case | Hardware | TTS Model | Est. RTF | Cost | Power |
|---|---|---|---|---|---|
| Budget voice assistant | RPi 5 (8 GB) | Piper medium | 0.1–0.2 | ~$80 | ~12W |
| Higher quality edge | RPi 5 (8 GB) | Kokoro int8 ONNX | ~0.5–1.0 | ~$80 | ~12W |
| NPU-accelerated edge | Orange Pi 5 (16 GB) | Piper on RKNN | <0.2 est. | ~$130 | ~18W |
| GPU edge device | Jetson Orin Nano Super | Piper + Kokoro | Real-time | $249 | 15–25W |
| Quiet desktop/dev | Mac Studio M4 Max | Kokoro (CoreML) | 0.02–0.1 | $1,999+ | ~120W |
| Best budget GPU TTS | PC + RTX 3090 (used) | Any TTS model | 0.5–0.7 | ~$1,200 total | ~400W |
| Production TTS server | PC + RTX 4090 | Qwen3-TTS / F5-TTS | 0.4–0.9 | ~$2,500 total | ~550W |
| Maximum throughput | PC + RTX 5090 | Qwen3-TTS 1.7B | 0.48–0.55 | ~$4,000+ total | ~700W |
| Large model prototyping | DGX Spark | 30B+ TTS/LLM combos | Varies | $3,999 | ~170W |
| Huge model capacity | Mac Studio M3 Ultra | Any (via MLX) | 0.5–2.0 | $3,999+ | ~200W |
The RTX 3090 value proposition
At ~$700 used, the RTX 3090's 936 GB/s of bandwidth is only ~7% below the RTX 4090's 1,008 GB/s, so per-token generation speed is near-identical. That makes the 3090 the best price-to-performance option for TTS in 2026. The main downsides are power draw (350W TDP) and no FP8/NVFP4 quantization support (Ampere predates those formats).
DGX Spark: Capacity vs Speed
The DGX Spark's value proposition is model capacity, not decode speed.
Strengths
- 128 GB unified memory — fits models that won't run on any consumer GPU
- Two units can be linked over ConnectX-7 for 256 GB (run 405B-class models)
- 128 GB for simultaneous STT + LLM + TTS with room to spare
- Silent operation, Mac Mini form factor, <100W AI workload
- Native NVIDIA CUDA ecosystem (vs Apple's MLX)
Weaknesses
- 273 GB/s bandwidth — single-stream decode is slow (20.5 tok/s for 7B model)
- $3,999–$4,699 for bandwidth that a ~$700 used RTX 3090 exceeds by 3.4×
- aarch64 software ecosystem still maturing
- The "Blackwell noise" bug corrupts GPU audio output from several TTS models
When DGX Spark makes sense for TTS
- Running large voice cloning models (Fish S2 Pro at 4.4B, Higgs Audio at 5.8B) that don't fit in 24 GB
- Complete voice pipelines (STT + large LLM + TTS) in one device
- Development and testing before deploying to production GPU servers
- Concurrent inference (batch 32 throughput is competitive)
- Air-gapped deployments where silence and low power matter
When DGX Spark does NOT make sense for TTS
- Single-user real-time voice conversation (bandwidth-limited)
- Maximum generation speed per dollar (RTX 3090 wins)
- Models that fit in 24 GB (use a 4090 instead)
References
| Resource | URL |
|---|---|
| LMSYS DGX Spark Review | lmsys.org/blog/2025-10-13-nvidia-dgx-spark/ |
| RTX 5090 specifications | vast.ai/article/nvidia-geforce-rtx-5090-specs |
| RTX 3090 vs 4090 AI comparison | bestgpusforai.com/gpu-comparison/3090-vs-4090 |
| llama.cpp benchmarks | github.com/DandinPower/llama.cpp_bench |
| DGX Spark price comparison | glukhov.org/post/2025/10/nvidia-dgx-spark-prices/ |
| DGX Spark vs alternatives | aimultiple.com/dgx-spark-alternatives |
| Mac Studio M3 Ultra specs | lowendmac.com/2025/mac-studio-early-2025/ |
| Jetson Orin specs | nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/ |
| Kokoro on Raspberry Pi | mikeesto.com/posts/kokoro-82m-pi/ |
| Home Assistant voice | home-assistant.io/blog/2025/09/11/ai-in-home-assistant/ |
| sherpa-onnx ARM support | github.com/k2-fsa/sherpa-onnx |
| Inferless TTS benchmark | inferless.com/learn/comparing-different-text-to-speech---tts--models-part-2 |