ongdyub/Music-To-Text-Description-Model

Music-Description-Generator

Example Outputs

Ex 1 (audio only; the file contains no video)

example1.mp4
'This is a classical music piece. It could also be playing in the background at a coffee shop.'

Ex 2 (audio only; the file contains no video)

example2.mp4
'The low quality recording features a live performance of a folk song and it consists of groovy bass, shimmering hi hats, soft kick and harmonizing vocals, harmonizing vocals. It sounds energetic.'

Model Architecture

Audio Encoder

Uses the pre-trained facebook/encodec_32khz model from Hugging Face.

The input is 10 seconds of raw audio at a sample rate of 32,000 Hz.

The Audio Encoder converts the raw audio into a discrete sequence of codebook indices, such as [100, 321, 210, 124, ... , 213].

This sequence of codebook indices is the input to the Text Decoder.
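The encoding step described above can be sketched as follows. This is a minimal illustration, not the code from trainer.ipynb: the helper names fit_to_clip and encode_clip are hypothetical, and trimming or zero-padding to exactly 10 seconds is an assumption about how clips are prepared. The Hugging Face calls (AutoProcessor, EncodecModel.encode) are the library's documented interface, but the exact shape of the returned codes depends on the transformers version, so it is not stated here.

```python
SAMPLE_RATE = 32_000   # from the README
CLIP_SECONDS = 10      # from the README
MODEL_ID = "facebook/encodec_32khz"


def fit_to_clip(samples, sample_rate=SAMPLE_RATE, seconds=CLIP_SECONDS):
    """Trim or zero-pad a mono waveform (sequence of floats) to a fixed length.

    Hypothetical helper: the README only says the input is 10 s of audio.
    """
    target = sample_rate * seconds
    clipped = list(samples[:target])
    return clipped + [0.0] * (target - len(clipped))


def encode_clip(samples):
    """Return EnCodec codebook indices for one mono clip.

    Needs torch and transformers installed, and downloads the pre-trained
    weights on first use, so the imports are kept local to this function.
    """
    import torch
    from transformers import AutoProcessor, EncodecModel

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = EncodecModel.from_pretrained(MODEL_ID)
    inputs = processor(
        raw_audio=fit_to_clip(samples),
        sampling_rate=SAMPLE_RATE,
        return_tensors="pt",
    )
    with torch.no_grad():
        encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
    return encoded.audio_codes  # integer tensor of codebook indices
```

A 10-second clip at 32,000 Hz is 320,000 samples, so fit_to_clip always returns a list of that length. How the resulting codes are arranged into the single sequence fed to the Text Decoder is not specified in this README.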

Text Decoder

Uses the Transformer base architecture with the T5 tokenizer.

More details (nLayers, hidden dim, nHeads, etc.) are in trainer.ipynb.

The input is a sequence of codebook indices; the output is one or more sentences describing the music.
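As a rough sketch of this stage, the PyTorch model below maps a sequence of codebook indices to logits over text tokens. Everything specific in it is an assumption for illustration: the class name CodeToTextTransformer, the encoder-decoder layout, the learned positional embeddings, and the default sizes (512 hidden units, 8 heads, 6 layers, as in the original Transformer "base" configuration). The actual values are in trainer.ipynb, and the vocabulary sizes should be taken from the EnCodec codebook and the T5 tokenizer actually used.

```python
import torch
from torch import nn


class CodeToTextTransformer(nn.Module):
    """Hypothetical encoder-decoder: audio code indices in, text-token logits out."""

    def __init__(self, n_codes=2048, n_text_tokens=32100, d_model=512,
                 n_heads=8, n_layers=6, max_len=2048):
        super().__init__()
        self.code_emb = nn.Embedding(n_codes, d_model)        # audio codebook indices
        self.text_emb = nn.Embedding(n_text_tokens, d_model)  # tokenizer ids
        self.pos_emb = nn.Embedding(max_len, d_model)         # shared learned positions
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            dim_feedforward=4 * d_model, batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, n_text_tokens)

    def _embed(self, emb, ids):
        # Add a position embedding to each token embedding.
        pos = torch.arange(ids.size(1), device=ids.device)
        return emb(ids) + self.pos_emb(pos)

    def forward(self, code_ids, text_ids):
        # Causal mask: each text position attends only to earlier positions.
        t = text_ids.size(1)
        causal = torch.triu(
            torch.full((t, t), float("-inf"), device=text_ids.device), diagonal=1
        )
        hidden = self.transformer(
            self._embed(self.code_emb, code_ids),
            self._embed(self.text_emb, text_ids),
            tgt_mask=causal,
        )
        return self.lm_head(hidden)  # (batch, text_len, n_text_tokens)


# Tiny configuration so the example runs quickly on CPU.
tiny = CodeToTextTransformer(n_codes=16, n_text_tokens=20, d_model=32,
                             n_heads=4, n_layers=1, max_len=64)
codes = torch.randint(0, 16, (2, 10))   # batch of 2 code sequences
tokens = torch.randint(0, 20, (2, 5))   # shifted target tokens (teacher forcing)
logits = tiny(codes, tokens)
```

During training, logits of this kind would typically be compared against the next text token with a cross-entropy loss; at inference, tokens are generated one at a time and decoded back to text with the tokenizer.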

Training & Test Loss Graph
