Batched generation: generate_batch() with ragged-batch S3Gen #12
Merged
- generate_batch(model, texts, voice): T3 per text (autoregressive), then ONE batched S3Gen pass (one CFM solve, one vocoder call) over the padded batch; per-row trimming by token count.
- s3gen$inference accepts (B, T) tokens + speech_token_lens; flow builds per-row mel masks, expands single-voice conditioning, and solve_euler generalizes CFG from hardcoded batch-2 to 2B.
- Three padded-batch leaks found and fixed: (1, 1) prompt_token_len broadcasting collapsed the batch in make_pad_mask; the CFM estimator's transformers ran unmasked (key-padding mask added); the conformer pre-lookahead conv read nonzero embedded padding (tail now zeroed first).
- Batch-vs-single parity on identical tokens: encoder <= 2e-4, mel 0.003-0.005 (single-run FP envelope; Python-parity bar is 0.03).
- generate() T3 stage extracted to .t3_text_to_tokens, shared by both.
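The per-row masking and trimming described in this commit can be illustrated with a small sketch. It is written in plain Python rather than the package's R/torch code, and it is only an assumption about the shape of the logic: `make_pad_mask` here follows the usual convention of marking padded positions as True, `trim_rows` is a hypothetical helper, and the scaling from token counts to mel frames or audio samples is left out.

```python
from typing import List


def make_pad_mask(lengths: List[int], max_len: int = 0) -> List[List[bool]]:
    """Return a B x T mask that is True at padded positions, one row per utterance."""
    if max_len <= 0:
        max_len = max(lengths)
    return [[t >= n for t in range(max_len)] for n in lengths]


def trim_rows(batch: List[List[float]], lengths: List[int]) -> List[List[float]]:
    """Cut each padded row back to its own valid length (hypothetical helper)."""
    return [row[:n] for row, n in zip(batch, lengths)]


# Hypothetical ragged batch: three utterances with 2, 4, and 3 tokens.
speech_token_lens = [2, 4, 3]
mask = make_pad_mask(speech_token_lens)

# Toy padded output, one value per position; zeros stand in for padding.
padded = [
    [0.1, 0.2, 0.0, 0.0],
    [0.3, 0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9, 0.0],
]
trimmed = trim_rows(padded, speech_token_lens)

for row_mask, row in zip(mask, trimmed):
    print(row_mask, row)
```

The mask must keep one row per utterance. If the length vector reaches the mask builder with its batch dimension collapsed, every row ends up sharing one mask, which is the kind of leak that stays invisible at B = 1.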
At padded positions the masked estimator leaves dphi = 0, so the generated-region tail was raw initial Gaussian noise; HiFi-GAN's convolutional context smeared it into the end of shorter rows. Tail is now zeroed when speech_token_lens is given (matching the zero padding a single run's convs see past sequence end).

Also documents that traced/autocast apply to the T3 stage only in generate_batch().
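The tail-noise effect can be reproduced with a toy Euler solve. This is a minimal sketch under stated assumptions, not the package's `solve_euler`: the velocity `-x` is a made-up stand-in for the CFM estimator, and `masked_velocity` and `zero_tail` are hypothetical names. The one property it relies on is the one stated above: the velocity is zero at padded positions.

```python
import random
from typing import List


def masked_velocity(x: List[float], valid_len: int) -> List[float]:
    """Toy estimator: velocity -x on valid positions, exactly 0 on padded ones."""
    return [(-xi if t < valid_len else 0.0) for t, xi in enumerate(x)]


def solve_euler(x0: List[float], valid_len: int, n_steps: int = 10) -> List[float]:
    """Fixed-step Euler integration from t = 0 to t = 1."""
    dt = 1.0 / n_steps
    x = list(x0)
    for _ in range(n_steps):
        v = masked_velocity(x, valid_len)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def zero_tail(x: List[float], valid_len: int) -> List[float]:
    """Replace everything past the valid length with zeros."""
    return [xi if t < valid_len else 0.0 for t, xi in enumerate(x)]


random.seed(0)
T, valid_len = 8, 5
noise = [random.gauss(0.0, 1.0) for _ in range(T)]

out = solve_euler(noise, valid_len)
# With zero velocity, padded positions never move away from the initial noise.
tail_is_noise = out[valid_len:] == noise[valid_len:]

fixed = zero_tail(out, valid_len)
print(tail_is_noise, fixed[valid_len:])
```

Zeroing the tail before the vocoder gives its convolutions the same zero context past the sequence end that a single, unpadded run would see.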
Adds `generate_batch(model, texts, voice, ...)`: T3 token generation runs per text (autoregressive generation doesn't batch, since lengths and EOS differ), then ALL utterances synthesize in one batched S3Gen pass: a single CFM solve and a single vocoder call over the padded batch, trimmed per row. This is the upstream 0.1.7 `speech_token_lens` batching, not pipelining.

Plumbing changes:
- `s3gen$inference()` accepts `(B, T)` tokens + `speech_token_lens` for ragged batches
- `solve_euler` CFG generalized from hardcoded batch-2 to 2B (traced falls back to non-traced for B > 1)
- `generate()`'s T3 stage extracted to `.t3_text_to_tokens`, shared by both entry points

Three padded-batch leaks found and fixed (each invisible at B = 1):
- `(1, 1)`-shaped `prompt_token_len` broadcast a `(B)` length vector to `(1, B)`, collapsing the batch in `make_pad_mask`
- the CFM estimator's transformers ran unmasked (key-padding mask added)
- the conformer pre-lookahead conv read nonzero embedded padding (tail now zeroed first)

Validation (GPU, identical token sequences), batch row vs single run: encoder ≤ 2e-4, mel through the full 10-step CFM 0.003–0.005 (the single-run FP envelope; the Python-parity bar is 0.03). End-to-end batch of 3 texts: all EOS, correct durations, samples in ~/Sync.
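A batch-vs-single parity check of this kind could look roughly like the sketch below. The PR does not say which error metric was used, so maximum absolute difference is an assumption here, and `max_abs_diff`, `row_parity`, and the sample values are all made up for illustration; only the 0.03 bar comes from the text above.

```python
from typing import List


def max_abs_diff(a: List[float], b: List[float]) -> float:
    """Largest element-wise absolute difference between two equal-length rows."""
    if len(a) != len(b):
        raise ValueError("rows must have the same length after trimming")
    return max(abs(x - y) for x, y in zip(a, b))


def row_parity(batch_row: List[float], valid_len: int,
               single_run: List[float], tol: float) -> bool:
    """Trim the padded batch row to its valid length, then compare to the single run."""
    return max_abs_diff(batch_row[:valid_len], single_run) <= tol


# Made-up values: one utterance run alone, and the same utterance as a padded batch row.
single_run = [0.50, -1.20, 0.75]
batch_row = [0.502, -1.197, 0.751, 0.0, 0.0]

MEL_TOL = 0.03  # the Python-parity bar quoted above
diff = max_abs_diff(batch_row[:3], single_run)
ok = row_parity(batch_row, 3, single_run, MEL_TOL)
print(round(diff, 4), ok)
```

Trimming before comparing matters: padded positions have no counterpart in the single run, so including them would make the comparison meaningless.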
CPU regression test for the mask/broadcast math (`test_batch_masks.R`); full suite passes. Version 0.1.0.8 + NEWS as separate bump commit.