Refined Audio-LDM for Audio Synthesis

Overview

This project explores a refined Audio-LDM approach for audio synthesis, focusing on generating high-quality audio representations through improved techniques in latent space. The refined model incorporates advanced components such as HTS-AT and GPT-2 for text encoding, aiming to improve scene classification and the overall semantic representation of audio spectrograms. The full project details are available at this Drive link: drive_link

Motivation

The standard Audio-LDM faced limitations due to its architecture, primarily in its ability to generate semantically meaningful audio representations. Our goal was to enhance Audio-LDM by integrating better pre-trained models and architectures to improve performance and efficiency.

Key Components

1. Audio Representation

Audio signals are represented in the frequency domain using Mel spectrograms, which provide a differentiable scale covering both high- and low-frequency components. This representation is then used to generate embeddings.
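To make this concrete, below is a minimal sketch of converting a waveform into a log-Mel spectrogram with torchaudio; the FFT size, hop length, and number of Mel bins are illustrative assumptions rather than this repository's exact settings.

```python
# Minimal sketch of the Mel-spectrogram front end described above.
# The FFT size, hop length, and number of Mel bins are assumptions
# for illustration, not the exact values used in this repository.
import torch
import torchaudio

def waveform_to_log_mel(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Convert a mono waveform of shape (1, num_samples) to a log-Mel spectrogram."""
    mel_transform = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=1024,      # analysis window size (assumed)
        hop_length=160,  # 10 ms hop at 16 kHz (assumed)
        n_mels=64,       # number of Mel bins (assumed)
    )
    mel = mel_transform(waveform)                 # (1, n_mels, num_frames)
    return torch.log(torch.clamp(mel, min=1e-5))  # log scale for numerical stability

# Example: two seconds of silence as a stand-in waveform
print(waveform_to_log_mel(torch.zeros(1, 32000)).shape)  # torch.Size([1, 64, 201])
```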

2. Refined Audio-LDM

The refined Audio-LDM utilizes:

  • HTS-AT (Hierarchical Token-Semantic Audio Transformer): Pre-trained on 22 different audio tasks, this model efficiently captures and represents audio spectrogram features with lower GPU consumption.
  • GPT-2 Text Encoder: Provides a more robust text encoding mechanism, better suited for mapping text embeddings to the semantic information of spectrograms (a minimal encoding sketch follows this list).
  • LDM: The latent diffusion model that generates spectrogram latents conditioned on the text embeddings.
  • HiFi-GAN: A neural vocoder that converts the generated Mel spectrograms back into waveforms.

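As a rough illustration of the GPT-2 text-encoding component listed above, the sketch below obtains token-level text embeddings with the Hugging Face transformers package; treating the last hidden states as the conditioning sequence is an assumption for illustration, not necessarily how this repository wires the encoder into the LDM.

```python
# Rough sketch of GPT-2 as a text encoder, assuming the Hugging Face
# `transformers` package; using the last hidden states as a conditioning
# sequence is an illustrative choice, not this repository's exact code.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
text_encoder = GPT2Model.from_pretrained("gpt2").eval()

prompts = ["a dog barking on a quiet street"]
tokens = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    # (batch, seq_len, 768): token-level embeddings used as conditioning
    text_embeddings = text_encoder(**tokens).last_hidden_state

print(text_embeddings.shape)
```
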
Training Strategies:

The refined model was trained from scratch to accommodate the larger input size required by the new components. It reached a lower loss at 50k iterations than the original Audio-LDM, which did not achieve a similar improvement even after 500k iterations.
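For context, the sketch below shows what a single latent-diffusion training step typically looks like under the standard noise-prediction (epsilon) objective; `unet`, `alphas_cumprod`, and the latent shapes are placeholders, not this repository's actual training code.

```python
# Illustrative single training step for a latent diffusion model with the
# standard noise-prediction (epsilon) objective. `unet`, `alphas_cumprod`,
# and the latent shape (B, C, H, W) are placeholders, not this repository's
# actual modules or hyperparameters.
import torch
import torch.nn.functional as F

def diffusion_training_step(unet, latents, text_embeddings, alphas_cumprod, num_timesteps=1000):
    batch = latents.shape[0]
    # Sample a random diffusion timestep per example
    t = torch.randint(0, num_timesteps, (batch,), device=latents.device)

    # Add noise to the clean latents according to the forward process
    noise = torch.randn_like(latents)
    a_bar = alphas_cumprod[t].view(batch, 1, 1, 1)
    noisy_latents = a_bar.sqrt() * latents + (1.0 - a_bar).sqrt() * noise

    # The UNet predicts the added noise, conditioned on the text embeddings
    noise_pred = unet(noisy_latents, t, text_embeddings)
    return F.mse_loss(noise_pred, noise)
```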

Results

The refined LDM achieved:

  • Reduced Training Time: The refined LDM achieved lower loss in fewer iterations (50k vs. 500k).
  • Improved Audio Quality: Better semantic representations were generated for the audio spectrograms, leading to higher-quality audio synthesis.

Loss Function Graph

Audio Samples

Here are some of the samples generated using our model: Samples

Usage

  • For training the model, use the script below:

python3 latent_diffusion.py -c audioldm_custom.yaml

  • For inference, use the command below:

python3 infer.py --config_yaml audioldm_custom.yaml --list_inference inference_test.lst --reload_from_ckpt checkpoint.ckpt

  • The checkpoint file can be found here -> Drive Link

PS: Running inference on CUDA currently hits a driver error, which we are trying to fix as soon as possible. For the time being, inference outputs can be obtained from the validation step of training. To generate custom audios, one can change the textual descriptions in the data folder.
