# TTS Backends
cc-vox supports five TTS backends. In auto mode (default), it tries them in priority order and uses the first one available.
## Backend Comparison
| Feature | Fish Speech | Chatterbox | Qwen3-TTS | Kokoro | pocket-tts |
|---|---|---|---|---|---|
| Priority | 10 (first) | 12 | 14 | 20 | 30 (last) |
| Quality | Best | Great | Great | Great | Good |
| Hardware | NVIDIA GPU | NVIDIA GPU | NVIDIA GPU | CPU only | CPU only |
| Setup | Docker + GPU | Docker + GPU | Docker + GPU | Docker | Zero (auto-starts) |
| Voices | Single voice | Voice cloning | Default voice | 9 voices | 8 voices (aliased) |
| Speed control | No | No | Yes | Yes (0.5–2.0) | No |
| Port | 32611 | 32613 | 32614 | 32612 | 8000 |
## Fish Speech
GPU-accelerated TTS via Docker with a Gradio API. Uses the openaudio-s1-mini model (0.5B parameters, 13 languages).
Setup:

```shell
# 1. Download the model
hf download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini

# 2. Start the Docker container
docker run -d --name fish-speech \
  --gpus all \
  -p 32611:7860 \
  -v ./checkpoints:/app/checkpoints \
  fishaudio/fish-speech:latest
```
!!! note
    The model is licensed under CC-BY-NC-SA-4.0. You may need to accept the license on Hugging Face before downloading.
Key behaviors:

- Uses its own voice model (ignores the `voice` parameter)
- Ignores the `speed` parameter
- cc-vox checks GPU utilization before using it; if GPU usage exceeds the threshold (default 80%), Fish Speech is skipped
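The GPU check can be sketched as a small gate. The `gpu_gate` helper and the `nvidia-smi` query below are illustrative, not cc-vox's actual implementation:

```shell
# Sketch of the GPU gate (hypothetical helper; cc-vox's internals may differ).
# Skips Fish Speech when current GPU utilization exceeds GPU_THRESHOLD.
gpu_gate() {
  local util="$1"
  local threshold="${GPU_THRESHOLD:-80}"   # default from the table below
  if [ "$util" -gt "$threshold" ]; then
    echo "skip"   # GPU busy: fall through to the next backend
  else
    echo "use"    # GPU idle enough: use Fish Speech
  fi
}

# On a real host, the current utilization could be read with:
#   util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)
gpu_gate 45   # prints "use"
gpu_gate 95   # prints "skip"
```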
Environment variables:
| Variable | Default | Description |
|---|---|---|
| `FISH_SPEECH_PORT` | `32611` | Docker container port |
| `GPU_THRESHOLD` | `80` | GPU % above which Fish Speech is skipped |
## Chatterbox
GPU-accelerated TTS with voice cloning via Docker. Uses the Chatterbox model with an OpenAI-compatible API.
Key behaviors:

- Uses voice cloning (ignores the `voice` parameter, sends `"default"`)
- Ignores the `speed` parameter
- OpenAI-compatible `/v1/audio/speech` endpoint
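The endpoint can be exercised directly with curl. A minimal sketch, assuming the container runs on its default port; the JSON fields mirror the OpenAI speech API and the output filename is illustrative:

```shell
# Hypothetical request against a local Chatterbox container.
port="${CHATTERBOX_PORT:-32613}"
url="http://localhost:${port}/v1/audio/speech"

# Fetch synthesized audio; "|| true" keeps the sketch harmless when no server is running.
curl -s "$url" \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello from Chatterbox", "voice": "default"}' \
  -o chatterbox.wav || true

echo "$url"
```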
Environment variables:
| Variable | Default | Description |
|---|---|---|
| `CHATTERBOX_PORT` | `32613` | Docker container port |
## Qwen3-TTS
GPU-accelerated multilingual TTS via Docker. Uses the Qwen3-TTS model with a FastAPI REST API.
Setup:

```shell
# Clone and build
cd tools/tts && git clone https://github.com/ValyrianTech/Qwen3-TTS_server qwen3-tts
docker compose -f tts/docker-compose.yml --profile gpu up -d qwen3-tts
```
Key behaviors:

- Uses the default English voice (ignores the `voice` parameter)
- Supports the `speed` parameter via query string
- REST API at `GET /base_tts/?text=...&speed=...`
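The query-string API can be driven with curl's `--get`/`--data-urlencode` flags, which build and URL-encode the query string. A sketch under the same assumptions (default port; text, speed value, and filename are illustrative):

```shell
# Hypothetical request against a local Qwen3-TTS container.
port="${QWEN3_TTS_PORT:-32614}"
base="http://localhost:${port}/base_tts/"

# -G turns the --data-urlencode pairs into an encoded query string;
# "|| true" keeps the sketch harmless when no server is running.
curl -sG "$base" \
  --data-urlencode "text=Hello from Qwen3-TTS" \
  --data-urlencode "speed=1.2" \
  -o qwen3.wav || true

echo "$base"
```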
Environment variables:
| Variable | Default | Description |
|---|---|---|
| `QWEN3_TTS_PORT` | `32614` | Docker container port |
## Kokoro
CPU-based TTS with an OpenAI-compatible API. Recommended for most users.
Key behaviors:

- Supports all 9 voices from the voice catalog
- Supports speed control (0.5–2.0 via the `speed` config)
- Always available on CPU; no GPU required
- Stable, consistent quality
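Kokoro's OpenAI-compatible endpoint can be called with curl; unlike the GPU backends, both `voice` and `speed` are honored. A minimal sketch (default port; the voice name `af_heart` is borrowed from the aliasing example in the pocket-tts section and is otherwise an assumption):

```shell
# Hypothetical request against a local Kokoro container.
port="${KOKORO_PORT:-32612}"
url="http://localhost:${port}/v1/audio/speech"

# Both "voice" and "speed" are respected by this backend;
# "|| true" keeps the sketch harmless when no server is running.
curl -s "$url" \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello from Kokoro", "voice": "af_heart", "speed": 1.5}' \
  -o kokoro.wav || true

echo "$url"
```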
Environment variables:
| Variable | Default | Description |
|---|---|---|
| `KOKORO_PORT` | `32612` | Docker container port |
## pocket-tts
Lightweight CPU fallback using the pocket-tts model by Kyutai (100M parameters, English). Auto-starts via uvx — zero configuration required.
Setup:

```shell
# Pre-download the model (optional; it is downloaded automatically on first use)
hf download kyutai/pocket-tts
```
The model is downloaded to the Hugging Face cache (~/.cache/huggingface/) and reused across sessions.
Key behaviors:

- Auto-starts if not running; cc-vox spawns `uvx pocket-tts serve` automatically
- Uses aliased voice names (e.g., `alba` instead of `af_heart`)
- Ignores the `speed` parameter
- First startup can take up to 60 seconds (model download + initialization)
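The auto-start behavior amounts to spawning the server and polling until it answers. A minimal sketch of the wait loop; the `wait_ready` helper is hypothetical (cc-vox's actual retry logic may differ, and a real probe would also `sleep` between attempts):

```shell
# Generic readiness wait: retry a probe command up to N times
# (hypothetical helper, for illustration only).
wait_ready() {
  local tries="$1"; shift
  local i=1
  while [ "$i" -le "$tries" ]; do
    if "$@"; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
  done
  echo "timeout"
  return 1
}

# Real usage would spawn the server and probe its port, e.g.:
#   uvx pocket-tts serve &
#   wait_ready 60 curl -s -o /dev/null "http://localhost:${TTS_PORT:-8000}/"
wait_ready 3 true   # prints "ready"
```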
Environment variables:
| Variable | Default | Description |
|---|---|---|
| `TTS_PORT` | `8000` | Server port |
## Backend Selection Logic
```mermaid
flowchart TD
    Start([select_backend]) --> Forced{Backend forced?}
    Forced -->|Yes| TryForced[Try forced backend]
    TryForced --> ForcedOk{Available?}
    ForcedOk -->|Yes| UseForced([Use forced])
    ForcedOk -->|No| Fallback{Fallback enabled?}
    Fallback -->|Yes| Auto
    Fallback -->|No| None([No backend])
    Forced -->|No / auto| Auto[Sort by priority]
    Auto --> Fish{Fish Speech available<br>& GPU < threshold?}
    Fish -->|Yes| UseFish([Use Fish Speech])
    Fish -->|No| Chatterbox{Chatterbox available?}
    Chatterbox -->|Yes| UseChatterbox([Use Chatterbox])
    Chatterbox -->|No| Qwen3{Qwen3-TTS available?}
    Qwen3 -->|Yes| UseQwen3([Use Qwen3-TTS])
    Qwen3 -->|No| Kokoro{Kokoro available?}
    Kokoro -->|Yes| UseKokoro([Use Kokoro])
    Kokoro -->|No| Pocket{pocket-tts available<br>or startable?}
    Pocket -->|Yes| UsePocket([Use pocket-tts])
    Pocket -->|No| None
```
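In auto mode, the fallback chain reduces to "walk the priority list and take the first backend that is up". A minimal sketch; the `name:up` / `name:down` encoding and the `select_backend` helper are illustrative, not cc-vox's actual code:

```shell
# Pick the first available backend from a priority-ordered list of
# "name:up" / "name:down" pairs (hypothetical encoding for illustration).
select_backend() {
  local entry name state
  for entry in "$@"; do
    name="${entry%%:*}"
    state="${entry##*:}"
    if [ "$state" = "up" ]; then
      echo "$name"
      return 0
    fi
  done
  echo "none"   # matches the "No backend" terminal in the flowchart
  return 1
}

select_backend fish-speech:down chatterbox:down qwen3-tts:down kokoro:up pocket-tts:up
# prints "kokoro"
```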
## Forcing a Backend
```shell
# Via slash command (in a Claude Code session)
/voice:speak backend kokoro

# Via environment variable
TTS_BACKEND=fish-speech claude
```

```toml
# Via config file (~/.claude/cc-vox.toml)
[core]
backend = "kokoro"
```
With `fallback = true` (the default), if your forced backend is unavailable, cc-vox automatically tries the next backend in priority order.