Hello,
Thank you for the excellent work on DiffRhythm 2 and for releasing the model weights!
We are researchers at VITA Lab (EPFL) exploring the use of DiffRhythm 2's DiT backbone.
We noticed that the released checkpoints include:
- The DiT model (model.safetensors) from ASLP-lab/DiffRhythm2
- The BigVGAN decoder (decoder.bin) from the same repo
- The DiffRhythm v1 oobleck VAE (vae_model.pt) from ASLP-lab/DiffRhythm-vae
However, as described in Section 3.2 of the paper, the DiffRhythm 2 DiT was trained on latents from the v2 Music VAE encoder (24 kHz input, 4800× compression, 5 Hz frame rate), which is architecturally different from the v1 oobleck VAE (44.1 kHz, 2048× compression, ~21.5 Hz). The v2 Music VAE encoder does not appear to be included in the released checkpoints.
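To make the mismatch concrete, here is a minimal sketch of the frame-rate arithmetic, using only the sample rates and compression ratios quoted above. The helper names (`latent_frame_rate`, `num_latent_frames`) are hypothetical and not part of the DiffRhythm codebase, and the frame count assumes simple floor division with no padding, which may differ from what the actual encoders do.

```python
# Illustrative arithmetic only: helper names are hypothetical and the
# floor-division frame count is an assumption (real encoders may pad).

def latent_frame_rate(sample_rate_hz: float, compression: int) -> float:
    """Latent frames per second for a given input sample rate and compression ratio."""
    return sample_rate_hz / compression


def num_latent_frames(duration_s: float, sample_rate_hz: int, compression: int) -> int:
    """Approximate number of latent frames for a clip, assuming no padding."""
    return int(duration_s * sample_rate_hz) // compression


# Figures quoted in the issue text.
v2_rate = latent_frame_rate(24_000, 4800)  # v2 Music VAE
v1_rate = latent_frame_rate(44_100, 2048)  # v1 oobleck VAE

print(f"v2 Music VAE: {v2_rate:.2f} Hz")
print(f"v1 oobleck VAE: {v1_rate:.2f} Hz")

# The same 10-second clip yields very different sequence lengths.
print(num_latent_frames(10, 24_000, 4800), "frames (v2)")
print(num_latent_frames(10, 44_100, 2048), "frames (v1)")
```

Under these assumptions, a 10-second clip maps to roughly 50 latent frames with the v2 encoder but over 200 with the v1 VAE, so v1 latents cannot be substituted as input to the DiT without the temporal resolution being off by more than a factor of four.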
Would it be possible to release the Music VAE encoder checkpoint? We need it to produce 5 Hz latent representations that match the ones the DiT was trained on.
We would greatly appreciate any help with this. Thank you for your time!
Best regards,
Adrien Lefèvre