Access to Music VAE

Hello,

Thank you for the excellent work on DiffRhythm 2 and for releasing the model weights!

We are researchers at VITA Lab (EPFL) exploring the use of DiffRhythm 2's DiT backbone.

We noticed that the released checkpoints include:
- The DiT model (model.safetensors) from [ASLP-lab/DiffRhythm2](https://huggingface.co/ASLP-lab/DiffRhythm2)
- The BigVGAN decoder (decoder.bin) from the same repo
- The DiffRhythm v1 oobleck VAE (vae_model.pt) from [ASLP-lab/DiffRhythm-vae](https://huggingface.co/ASLP-lab/DiffRhythm-vae)

However, as described in Section 3.2 of the paper, the DiffRhythm 2 DiT was trained on latents from the v2 Music VAE encoder (24 kHz input, 4800× compression, 5 Hz frame rate), which is architecturally different from the v1 oobleck VAE (44.1 kHz, 2048× compression, ~21.5 Hz). The v2 Music VAE encoder does not appear to be included in the released checkpoints.
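For reference, here is a quick sanity check of the frame rates quoted above. It is only a sketch: the sample rates and compression factors are the ones we cite from the paper and the v1 VAE, while the helper name `latent_frame_rate` and the 10-second clip length are our own illustrative choices, not anything from the DiffRhythm codebase.

```python
def latent_frame_rate(sample_rate_hz: float, compression: int) -> float:
    """Latent frames per second for a VAE that downsamples audio by `compression`.

    Hypothetical helper for illustration; not part of the DiffRhythm code.
    """
    return sample_rate_hz / compression


# Figures as quoted in this issue.
v2_rate = latent_frame_rate(24_000, 4800)  # v2 Music VAE
v1_rate = latent_frame_rate(44_100, 2048)  # v1 oobleck VAE

# Number of whole latent frames each encoder would produce for the same clip.
clip_seconds = 10  # arbitrary example length
v2_frames = int(clip_seconds * v2_rate)
v1_frames = int(clip_seconds * v1_rate)

print(f"v2 Music VAE: {v2_rate:.2f} Hz, {v2_frames} frames per {clip_seconds} s")
print(f"v1 oobleck VAE: {v1_rate:.2f} Hz, {v1_frames} frames per {clip_seconds} s")
```

The two encoders produce sequences of very different lengths for the same audio, so v1 latents cannot simply be substituted as input to the v2 DiT.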

Would it be possible to release the Music VAE encoder checkpoint? We need it to produce the 5 Hz latent representations that match what the DiT was pretrained on.

We would greatly appreciate any help with this. Thank you for your time!

Best regards,
Adrien Lefèvre

Access to Music VAE #11