Skip to content

TTS Backends

cc-vox supports five TTS backends. In auto mode (default), it tries them in priority order and uses the first one available.

Backend Comparison

Feature Fish Speech Chatterbox Qwen3-TTS Kokoro pocket-tts
Priority 10 (first) 12 14 20 30 (last)
Quality Best Great Great Great Good
Hardware NVIDIA GPU NVIDIA GPU NVIDIA GPU CPU only CPU only
Setup Docker + GPU Docker + GPU Docker + GPU Docker Zero (auto-starts)
Voices Single voice Voice cloning Default voice 9 voices 8 voices (aliased)
Speed control No No Yes Yes (0.5--2.0) No
Port 32611 32613 32614 32612 8000

Fish Speech

GPU-accelerated TTS via Docker with a Gradio API. Uses the openaudio-s1-mini model (0.5B parameters, 13 languages).

Setup:

# 1. Download the model
hf download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini

# 2. Start the Docker container
docker run -d --name fish-speech \
  --gpus all \
  -p 32611:7860 \
  -v ./checkpoints:/app/checkpoints \
  fishaudio/fish-speech:latest

Note

The model is licensed under CC-BY-NC-SA-4.0. You may need to accept the license on Hugging Face before downloading.

Key behaviors:

  • Uses its own voice model (ignores the voice parameter)
  • Ignores speed parameter
  • cc-vox checks GPU utilization before using it — if GPU usage exceeds the threshold (default 80%), Fish Speech is skipped

Environment variables:

Variable Default Description
FISH_SPEECH_PORT 32611 Docker container port
GPU_THRESHOLD 80 GPU % above which Fish Speech is skipped

Chatterbox

GPU-accelerated TTS with voice cloning via Docker. Uses the Chatterbox model with an OpenAI-compatible API.

docker run -d --name chatterbox \
  --gpus all \
  -p 32613:4123 \
  travisvn/chatterbox-tts-api:latest

Key behaviors:

  • Uses voice cloning (ignores the voice parameter, sends "default")
  • Ignores speed parameter
  • OpenAI-compatible /v1/audio/speech endpoint

Environment variables:

Variable Default Description
CHATTERBOX_PORT 32613 Docker container port

Qwen3-TTS

GPU-accelerated multilingual TTS via Docker. Uses the Qwen3-TTS model with a FastAPI REST API.

# Clone and build
cd tools/tts && git clone https://github.com/ValyrianTech/Qwen3-TTS_server qwen3-tts
docker compose -f tts/docker-compose.yml --profile gpu up -d qwen3-tts

Key behaviors:

  • Uses default English voice (ignores the voice parameter)
  • Supports speed parameter via query string
  • REST API at GET /base_tts/?text=...&speed=...

Environment variables:

Variable Default Description
QWEN3_TTS_PORT 32614 Docker container port

Kokoro

CPU-based TTS with an OpenAI-compatible API. Recommended for most users.

docker run -d --name kokoro \
  -p 32612:8880 \
  ghcr.io/remsky/kokoro-fastapi-cpu:latest

Key behaviors:

  • Supports all 9 voices from the voice catalog
  • Supports speed control (0.5--2.0 via the speed config)
  • Always available on CPU — no GPU required
  • Stable, consistent quality

Environment variables:

Variable Default Description
KOKORO_PORT 32612 Docker container port

pocket-tts

Lightweight CPU fallback using the pocket-tts model by Kyutai (100M parameters, English). Auto-starts via uvx — zero configuration required.

Setup:

# Pre-download the model (optional — downloaded automatically on first use)
hf download kyutai/pocket-tts

The model is downloaded to the Hugging Face cache (~/.cache/huggingface/) and reused across sessions.

Key behaviors:

  • Auto-starts if not running — cc-vox spawns uvx pocket-tts serve automatically
  • Uses aliased voice names (e.g., alba instead of af_heart)
  • Ignores speed parameter
  • First startup can take up to 60 seconds (model download + initialization)

Environment variables:

Variable Default Description
TTS_PORT 8000 Server port

Backend Selection Logic

flowchart TD
    Start([select_backend]) --> Forced{Backend forced?}
    Forced -->|Yes| TryForced[Try forced backend]
    TryForced --> ForcedOk{Available?}
    ForcedOk -->|Yes| UseForced([Use forced])
    ForcedOk -->|No| Fallback{Fallback enabled?}
    Fallback -->|Yes| Auto
    Fallback -->|No| None([No backend])
    Forced -->|No / auto| Auto[Sort by priority]
    Auto --> Fish{Fish Speech available<br>& GPU < threshold?}
    Fish -->|Yes| UseFish([Use Fish Speech])
    Fish -->|No| Chatterbox{Chatterbox available?}
    Chatterbox -->|Yes| UseChatterbox([Use Chatterbox])
    Chatterbox -->|No| Qwen3{Qwen3-TTS available?}
    Qwen3 -->|Yes| UseQwen3([Use Qwen3-TTS])
    Qwen3 -->|No| Kokoro{Kokoro available?}
    Kokoro -->|Yes| UseKokoro([Use Kokoro])
    Kokoro -->|No| Pocket{pocket-tts available<br>or startable?}
    Pocket -->|Yes| UsePocket([Use pocket-tts])
    Pocket -->|No| None

Forcing a Backend

# Via slash command
/voice:speak backend kokoro

# Via environment variable
TTS_BACKEND=fish-speech claude

# Via config file (~/.claude/cc-vox.toml)
[core]
backend = "kokoro"

With fallback = true (default), if your forced backend is unavailable, cc-vox automatically tries the next backend in priority order.