A repository for evaluating AudioLLMs in various tasks
AudioBench: A Universal Benchmark for Audio Large Language Models
    
Come and view our live leaderboard on Huggingface Space
  
AudioBench Leaderboard | Huggingface Datasets | AudioLLM Paper Collection
- Mar 2025: Supported the phi_4_multimodal_instruct model and GigaSpeech 2 evaluation (Thai, Vietnamese, and Indonesian).
- Mar 2025: Support the MMAU test set: multiple-choice questions for speech, audio, and music understanding!
- Mar 2025: AudioBench now supports over 50 datasets!
- Mar 2025: Support the SEAME test sets (dev), a code-switching dataset for Chinese and Singapore-accented English.
- Jan 2025: The AudioBench paper is accepted to the NAACL 2025 Main Conference.
- Jan 2025: Support 10+ MNSC (Singlish understanding) datasets; the results are updated on the leaderboard.
- Dec 2024: Support more datasets (35) and more models (2 cascade and 3 fusion models).
- Sep 2024: Add the MuChoMusic dataset for music evaluation (multiple-choice questions).
- Aug 2024: Support 6 speech translation datasets and update the evaluation script for several MCQ evaluations.
- Aug 2024: The leaderboard is live. Check it out here.
- Jul 2024: We are working hard on the leaderboard and speech translation datasets. Stay tuned!
- Jul 2024: Support all initial 26 datasets listed in the AudioBench manuscript.
-  librispeech_test_clean, ASR, English, Metric: wer
-  librispeech_test_other, ASR, English, Metric: wer
-  common_voice_15_en_test, ASR, English, Metric: wer
-  peoples_speech_test, ASR, English, Metric: wer
-  gigaspeech_test, ASR, English, Metric: wer
-  tedlium3_test, ASR, English, Metric: wer
-  tedlium3_long_form_test, ASR, English, Long recording, Metric: wer
-  earnings21_test, ASR, English, Long recording, Metric: wer
-  earnings22_test, ASR, English, Long recording, Metric: wer
-  aishell_asr_zh_test, ASR, Chinese, Metric: wer
-  covost2_en_id_test, Speech Translation, English-Indonesian, Metric: bleu
-  covost2_en_zh_test, Speech Translation, English-Chinese, Metric: bleu
-  covost2_en_ta_test, Speech Translation, English-Tamil, Metric: bleu
-  covost2_id_en_test, Speech Translation, Indonesian-English, Metric: bleu
-  covost2_zh_en_test, Speech Translation, Chinese-English, Metric: bleu
-  covost2_ta_en_test, Speech Translation, Tamil-English, Metric: bleu
-  cn_college_listen_mcq_test, Speech Question Answering, Multiple Choice, Metric: llama3_70b_judge,gpt4o_judge
-  slue_p2_sqa5_test, Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
-  dream_tts_mcq_test, Speech Question Answering, Multiple Choice, Metric: llama3_70b_judge,gpt4o_judge
-  public_sg_speech_qa_test, Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
-  spoken_squad_test, Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
-  openhermes_audio_test, Speech Instruction, Metric: llama3_70b_judge,gpt4o_judge
-  alpaca_audio_test, Speech Instruction, Metric: llama3_70b_judge,gpt4o_judge
-  spoken-mqa_short_digit, Speech Instruction, Metric: acc
-  spoken-mqa_long_digit, Speech Instruction, Metric: acc
-  spoken-mqa_single_step_reasoning, Speech Instruction, Metric: acc
-  spoken-mqa_multi_step_reasoning, Speech Instruction, Metric: acc
-  clotho_aqa_test, Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
-  wavcaps_qa_test, Audio Scene Question Answering, Metric: llama3_70b_judge,gpt4o_judge
-  audiocaps_qa_test, Audio Scene Question Answering, Metric: llama3_70b_judge,gpt4o_judge
-  wavcaps_test, Audio Scene Question Answering, Metric: llama3_70b_judge,meteor,gpt4o_judge
-  audiocaps_test, Audio Scene Question Answering, Metric: llama3_70b_judge,meteor,gpt4o_judge
-  iemocap_emotion_test, Emotion Recognition, Metric: llama3_70b_judge,gpt4o_judge
-  meld_sentiment_test, Emotion Recognition, Metric: llama3_70b_judge,gpt4o_judge
-  meld_emotion_test, Emotion Recognition, Metric: llama3_70b_judge,gpt4o_judge
-  voxceleb_accent_test, Accent Recognition, Metric: llama3_70b_judge,gpt4o_judge
-  voxceleb_gender_test, Gender Recognition, Metric: llama3_70b_judge,gpt4o_judge
-  iemocap_gender_test, Gender Recognition, Metric: llama3_70b_judge,gpt4o_judge
-  muchomusic_test, Music Understanding, Metric: llama3_70b_judge,gpt4o_judge
-  imda_part1_asr_test, Singlish ASR, Metric: wer
-  imda_part2_asr_test, Singlish ASR, Metric: wer
-  imda_part3_30s_asr_test, Singlish ASR, Metric: wer
-  imda_part4_30s_asr_test, Singlish ASR, Metric: wer
-  imda_part5_30s_asr_test, Singlish ASR, Metric: wer
-  imda_part6_30s_asr_test, Singlish ASR, Metric: wer
-  imda_part3_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
-  imda_part4_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
-  imda_part5_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
-  imda_part6_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
-  imda_part3_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge,gpt4o_judge
-  imda_part4_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge,gpt4o_judge
-  imda_part5_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge,gpt4o_judge
-  imda_part6_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge,gpt4o_judge
-  imda_ar_sentence, Singlish, Accent Recognition, Metric: llama3_70b_judge,gpt4o_judge
-  imda_ar_dialogue, Singlish, Accent Recognition, Metric: llama3_70b_judge,gpt4o_judge
-  imda_gr_sentence, Singlish, Gender Recognition, Metric: llama3_70b_judge,gpt4o_judge
-  imda_gr_dialogue, Singlish, Gender Recognition, Metric: llama3_70b_judge,gpt4o_judge
-  seame_dev_man, English-Chinese Code-Switching, Metric: wer
-  seame_dev_sge, English-Chinese Code-Switching, Metric: wer
-  mmau_mini, Audio Understanding and Reasoning, Multiple Choice Questions, Metric: llama3_70b_judge,string_match,gpt4o_judge
-  gigaspeech2_thai, ASR for Thai language, Metric: wer
-  gigaspeech2_indo, ASR for Indonesian language, Metric: wer
-  gigaspeech2_viet, ASR for Vietnamese language, Metric: wer
-  ASCEND, English-Chinese Code-Switching, Metric: wer
- [fleurs] Speech translation
- [AIR-Bench] AIR-Bench tasks
How to evaluate with the supported datasets? It is as simple as replacing the DATASET and METRIC names (the full command is shown in the evaluation example below).
DATASET=librispeech_test_clean
METRIC=wer
Two simple steps:
- Make a copy of one of the customized dataset loaders (example: cn_college_listen_mcq_test) and adapt it to your own dataset.
- Add a new entry in dataset.py (see the sketch after this list).
- Done!
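As a rough illustration, a new entry might look like the sketch below. This is only a sketch under assumptions: the loader function, dataset id, and field names are hypothetical, so mirror the existing entries (e.g. cn_college_listen_mcq_test) in dataset.py rather than copying this verbatim.

# Hypothetical sketch of a new dataset loader registered in dataset.py; names are assumptions.
from datasets import load_dataset  # Hugging Face datasets library

def load_my_new_dataset_test():
    # Load your own audio dataset and map it to the fields the evaluator expects.
    ds = load_dataset("your-org/your-audio-dataset", split="test")  # hypothetical dataset id
    return [
        {
            "audio": sample["audio"],           # {"array": ..., "sampling_rate": ...}
            "instruction": sample["question"],  # prompt given to the AudioLLM
            "reference": sample["answer"],      # gold answer used by the metric
        }
        for sample in ds
    ]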
- cascade_whisper_large_v3_llama_3_8b_instruct
- cascade_whisper_large_v2_gemma2_9b_cpt_sea_lionv3_instruct
- MERaLiON-AudioLLM-Whisper-SEA-LION
- Qwen-Audio-Chat
- Qwen2-Audio-7B-Instruct
- SALMONN_7B: needs an extra git clone.
- WavLLM_fairseq: no longer supported, as inference takes too much effort.
- whisper_large_v3
- whisper_large_v2
- gemini-1.5-flash: key needed
- gemini-2-flash: key needed
- gpt-4o-audio: key needed
- phi_4_multimodal_instruct
- seallms_audio_7b
- ultravox https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_1-8b / https://www.ultravox.ai/
- llama3_s
- audio-flamingo-2
- [GLM4-Voice]
- [Mini-Omni]
- [SLAM-Omni]
- [https://huggingface.co/scb10x/llama3.1-typhoon2-audio-8b-instruct]
- [https://huggingface.co/WillHeld/DiVA-llama-3-v0-8b]
As long as a model can run inference, you can load it and generate responses for evaluation; a minimal wrapper sketch is shown below. To evaluate new models, please refer to adding_new_model.
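The sketch below only illustrates the general idea; the class name, method signature, and fields are assumptions, and the actual interface is described in adding_new_model.

# Hypothetical model wrapper; this interface is an assumption, not AudioBench's actual API.
class MyAudioLLM:
    def __init__(self, model_path: str):
        # Load your checkpoint, processor, and tokenizer here.
        self.model_path = model_path

    def generate(self, audio_array, sampling_rate: int, instruction: str) -> str:
        # Run inference on one (audio, instruction) pair and return the text response.
        # A fixed string keeps this sketch self-contained; replace it with real inference.
        return "placeholder response"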
Installation with pip:
pip install -r requirements.txt

For model-as-judge evaluation, we serve the judgement model as a service via vLLM on port 5000.
The example below hosts a Llama-3-70B-Instruct model as the judge and evaluates the Qwen2-Audio-7B-Instruct model.
# Step 1:
# Serve the judgement model with the vLLM framework (this example uses an int4 quantized version)
# This requires 1 x 80GB GPU
bash vllm_model_judge_llama_3_70b.sh

# Step 2:
# Perform model inference and obtain the evaluation results on the second GPU
GPU=2
BATCH_SIZE=1
OVERWRITE=True
NUMBER_OF_SAMPLES=-1 # -1 evaluates all test samples
MODEL_NAME=Qwen2-Audio-7B-Instruct
DATASET=cn_college_listen_mcq_test
METRICS=llama3_70b_judge
bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES
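Once the judge is being served, it can be queried like any OpenAI-compatible endpoint exposed by vLLM. The sketch below is only illustrative: the served model name and the judging prompt are assumptions, not AudioBench's internal judge prompt.

# Illustrative only: query the locally served judge (Step 1) through vLLM's OpenAI-compatible API.
# The model name and prompt are assumptions, not AudioBench's internals.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")  # judge server from Step 1
response = client.chat.completions.create(
    model="llama-3-70b-instruct",  # hypothetical served model name; use whatever the script registers
    messages=[{
        "role": "user",
        "content": "Reference answer: Paris.\nModel answer: The capital of France is Paris.\n"
                   "Score the model answer from 0 to 5 and explain briefly.",
    }],
    temperature=0.0,
)
print(response.choices[0].message.content)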
If you find our work useful, please consider citing our paper!
@article{wang2024audiobench,
  title={AudioBench: A Universal Benchmark for Audio Large Language Models},
  author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
  journal={NAACL},
  year={2025}
}

Email: [email protected]
- Llama3-S: When Llama Learns to Listen
- [lmms-eval] https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/lmms-eval-0.3.md
- More to come...
-  Features
- Evaluation with audio/speech generation
- Evaluation with multi-round chatbots
- Support other model-as-judge options and report the results
- Update AISHELL from WER to CER
 
-  Bugs
- Threading for model-as-judge evaluation
- Post-processing script for IMDA Part 4, which contains code-switching in 4 languages.
 
- Xue Cong Tey (MMAU-mini Dataset)
