chatterbox

chatterbox is an R port of Resemble AI's chatterbox library. It is written entirely in R using torch and has no Python dependencies.

Installation

# From CRAN (once accepted)
install.packages("chatterbox")

# Development version from GitHub
remotes::install_github("cornball-ai/chatterbox")

Usage

library(chatterbox)

# First use downloads ~2GB of model weights from HuggingFace into the
# standard cache. Give it a generous timeout; in an interactive session
# you'll be asked to confirm the download. chatterbox() expects the
# weights to already be present, so download them first.
options(timeout = 600)
download_chatterbox_models()

# Load model (constructs and loads in one call)
model <- chatterbox("cuda")

# Generate speech from a reference voice
jfk <- system.file("audio", "jfk.mp3", package = "chatterbox")
result <- generate(model, "Hello, this is a test!", jfk)
write_audio(result$audio, result$sample_rate, "output.wav")

# Re-render the same words in a different voice (voice conversion)
vc <- voice_convert(model, jfk, "target_voice.wav")
write_audio(vc$audio, vc$sample_rate, "converted.wav")

# One-liner (also needs the weights downloaded first)
quick_tts("Hello world!", jfk, "out.wav")

Serving

serve() runs an OpenAI-compatible TTS server (POST /v1/audio/speech, GET /health) that loads the model once and stays resident on the GPU:

chatterbox::serve(port = 7810L)               # regular model
chatterbox::serve(port = 7810L, turbo = TRUE) # Turbo (fewer FLOPs; fits a tight VRAM budget)

Point any OpenAI-style client at it (e.g. tts.api::set_tts_base()). The server is built on base R sockets, and a systemd unit ships in system.file("chatterbox.service", package = "chatterbox"). That unit is a template that runs the regular model by default; add turbo = TRUE to its ExecStart if you want Turbo (e.g. to co-reside with another model on a small card).
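As a rough illustration of what an OpenAI-style client sends to the two endpoints named above, here is a minimal sketch using the httr2 package. The base URL assumes serve(port = 7810L) running on the same machine. The body uses the OpenAI speech API field names (model, input, voice, response_format); the values "chatterbox" and "default" are placeholders, not values this README documents, and the helper names (speech_body, speech_request, synthesize, server_is_up) are hypothetical. Check what the server actually accepts before relying on it.

```r
library(httr2)

# Assumed base URL: serve(port = 7810L) listening on localhost.
base_url <- "http://127.0.0.1:7810"

# Request body using the OpenAI speech API field names.
# The model and voice values are placeholders, not documented chatterbox values.
speech_body <- function(text, voice = "default", model = "chatterbox",
                        response_format = "wav") {
  stopifnot(is.character(text), length(text) == 1L, nzchar(text))
  list(model = model, input = text, voice = voice,
       response_format = response_format)
}

# Build (but do not send) the POST /v1/audio/speech request.
speech_request <- function(text, ...) {
  request(base_url) |>
    req_url_path_append("v1", "audio", "speech") |>
    req_body_json(speech_body(text, ...))
}

# Send the request and write the returned audio bytes to `path`.
synthesize <- function(text, path, ...) {
  resp <- req_perform(speech_request(text, ...))
  writeBin(resp_body_raw(resp), path)
  invisible(path)
}

# TRUE if GET /health answers with a 2xx status, FALSE on any error.
server_is_up <- function() {
  resp <- tryCatch(
    request(base_url) |> req_url_path_append("health") |> req_perform(),
    error = function(e) NULL
  )
  !is.null(resp) && resp_status(resp) < 300
}
```

With a server running, a call would look like if (server_is_up()) synthesize("Hello world!", "hello.wav"). Nothing is sent until synthesize() or server_is_up() is called, so the builders can be inspected offline.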

Differences from the Python implementation

This package targets behavioral parity with chatterbox-tts 0.1.7, with a few deliberate differences:

  • No audio watermark. Python chatterbox embeds Resemble's Perth imperceptible watermark in every generated clip; this port does not. If you need provenance marking for generated audio, add it downstream.
  • A reference voice is required. Python falls back to a builtin default voice (conds.pt); the R API asks for reference audio explicitly and skips that ~105 MB download.
  • Reliability extras. generate() reports eos_found, n_tokens, and audio_sec, always applies Python-parity punctuation normalization, and stops degenerate token loops early (Python 0.1.4 English generates until the token cap in those cases). The R-only internal-caps mitigation is opt-in via normalize_text = TRUE (default FALSE; the failure it patched was a since-fixed bug).
  • One-call model load. chatterbox("cuda") constructs and loads by default; pass load = FALSE for the bare object. load_chatterbox() is idempotent, so older two-step code still works.
  • Backend token caps. The pure-R and backend = "jit" paths generate up to max_new_tokens (default 1000, ~40 s; jit auto-sizes its KV cache so generation always completes). traced = TRUE is limited by its pre-allocated 350-position cache (roughly 10 s of audio per call). For long texts, use tts_chunked().
  • GC tuning is automatic and matters a lot. With torch's default allocator settings, autoregressive inference spends most of its wall time (~85% on a regular-model run) in R garbage collection. chatterbox("cuda") tunes torch's CUDA allocator GC by default (tune_gc = TRUE), a roughly 6x speedup; pass tune_gc = FALSE to opt out. chatterbox_gc_options() still prints the snippet if you prefer to set the options() yourself, and the performance vignette has the measurements.
  • The multilingual model is not ported. This targets the standard English model and the turbo model. (Voice conversion is ported, via voice_convert().)
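The reliability fields and token caps in the list above lend themselves to a small post-generation check. The sketch below is illustrative only: check_generation() and tokens_to_seconds() are hypothetical helpers, not part of the package, and it assumes eos_found and n_tokens are top-level elements of the generate() result, alongside audio and sample_rate. The rate of 25 tokens per second is simply derived from the stated default (1000 tokens, ~40 s). Mock lists stand in for real results so the code runs without a GPU or the model weights.

```r
# From the list above: max_new_tokens = 1000 corresponds to roughly 40 s,
# i.e. about 25 speech tokens per second of audio.
TOKENS_PER_SEC <- 1000 / 40

# Approximate audio length for a given number of generated tokens.
tokens_to_seconds <- function(n_tokens) n_tokens / TOKENS_PER_SEC

# Hypothetical helper: classify a generate() result from its diagnostics.
# Assumes eos_found and n_tokens are top-level elements of the result.
check_generation <- function(result, max_new_tokens = 1000) {
  if (isTRUE(result$eos_found)) {
    return("complete")
  }
  if (result$n_tokens >= max_new_tokens) {
    return("hit token cap")
  }
  "stopped early"
}

# Mock results standing in for real generate() output.
ok     <- list(eos_found = TRUE,  n_tokens = 150,  audio_sec = 6)
capped <- list(eos_found = FALSE, n_tokens = 1000, audio_sec = 40)

check_generation(ok)
check_generation(capped)

# Upper bound if all 350 cache positions held speech tokens.
tokens_to_seconds(350)
```

At that rate 350 positions would give at most 14 s, while the list quotes roughly 10 s of audio per traced call, so treat the quoted figure as the practical limit. A result that did not reach end-of-sequence is a reasonable signal to split the text and retry with tts_chunked().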