@shruthan
Collaborator

📌 Description

Adds GPQA Diamond Audio.

Of the 198 GPQA Diamond samples, the 155 that are speakable have been converted to speech for evaluation. The audio dataset is at ServiceNow-AI/gpqa_audio.

On a parallel text-only run, GPT 4o mini scores 39 (reported as 40.8 at https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/).

On audio:
GPT 4o mini: 28.9 ± 0.86 (5 runs)
Voxtral Small: 27.1
Phi 4 Multimodal Instruct: 22.58
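
For reference, a figure of the form "mean ± spread (N runs)" can be produced along the lines of the sketch below. This is only an illustration: it assumes the spread is the sample standard deviation across runs (the PR does not say whether it is a standard deviation or a standard error), the helper name `summarize_runs` is hypothetical, and the per-run values are made-up placeholders, not the actual results from these runs.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) for per-run accuracies."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical per-run accuracies in percent; NOT the real run results.
example_runs = [28.0, 29.0, 30.0, 28.0, 30.0]

mean, spread = summarize_runs(example_runs)
print(f"{mean:.1f} ± {spread:.2f} ({len(example_runs)} runs)")
```

If the intended spread is a standard error instead, dividing the standard deviation by the square root of the number of runs would give that figure.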

🛠️ Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality including new tasks)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactor / Code cleanup
  • Maintenance / Chore / Task
  • Other (please describe):

✅ How Has This Been Tested?

  • Unit tests
  • Integration tests
  • Manual testing

Test Results / Screenshots (if applicable):

📸 Screenshots / Demos

📋 Checklist

  • Code follows project style guidelines
  • Tests have been added/updated (if applicable)
  • Documentation has been updated (if applicable)
  • Linked relevant issue(s)
  • Self-reviewed my code

🙌 Additional Notes

Collaborator

@akshaykalkunte akshaykalkunte left a comment

LGTM

@shruthan
Collaborator Author

After some further filtering of samples for audio quality, the dataset now has 147 samples.
Scores are largely similar, except for Voxtral Small, which now scores 29.93.

3 participants