example1.mp4
'This is a classical music piece. It could also be playing in the background at a coffee shop.'example2.mp4
'The low quality recording features a live performance of a folk song and it consists of groovy bass, shimmering hi hats, soft kick and harmonizing vocals, harmonizing vocals. It sounds energetic.'Use facebook/encodec_32khz huggingface pre-trained model.
Input is 10 seconds of raw audio, sample rate is 32000.
Audio Encoder convert raw audio to Discrete sequence of audio like [100, 321, 210, 124, ... , 213].
Sequence of audio codebook is input of Text Decoder.
Use Transformer base architecture and T5 tokenizer.
More details (nLayers, hidden dim, nHeads, etc...) are in trainer.ipynb
Input is sequence of codebook index, Out is sentences.

