[FEAT] Add gpqa diamond #17

shruthan · 2025-09-24T06:41:40Z

📌 Description

Adds GPQA Diamond Audio.

Of the 198 samples, 155 are speakable and have been converted to speech for evaluation. The audio dataset is at ServiceNow-AI/gpqa_audio.

On a parallel text run, GPT 4o mini scores 39 (the reported score is 40.8, see https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/).

On audio:
GPT 4o mini: 28.9 +- 0.86 (5 runs)
Voxtral Small: 27.1
Phi 4 Multimodal Instruct: 22.58
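
For reference, a figure of the form "mean +- spread (N runs)" can be produced along the following lines. This is a minimal sketch, not the code used in this PR: the helper name `summarize_runs` and the per-run values are hypothetical placeholders, and it assumes the spread is the sample standard deviation (the PR does not say whether sample or population deviation was used).

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-run accuracy scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(scores), stdev(scores)


# Hypothetical per-run accuracies in percent; placeholders only,
# NOT the actual runs behind the numbers reported in this PR.
example_runs = [30.0, 31.0, 29.0, 30.5, 29.5]

avg, spread = summarize_runs(example_runs)
print(f"{avg:.1f} +- {spread:.2f} ({len(example_runs)} runs)")
```

Swapping `stdev` for `statistics.pstdev` gives the population standard deviation instead, which is slightly smaller for a handful of runs.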

🛠️ Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality including new tasks)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactor / Code cleanup
Maintenance / Chore / Task
Other (please describe):

✅ How Has This Been Tested?

Unit tests
Integration tests
Manual testing

Test Results / Screenshots (if applicable):

📸 Screenshots / Demos

📋 Checklist

Code follows project style guidelines
Tests have been added/updated (if applicable)
Documentation has been updated (if applicable)
Linked relevant issue(s)
Self-reviewed my code

🙌 Additional Notes

akshaykalkunte

LGTM

shruthan · 2025-09-24T21:22:14Z

After further filtering of samples for audio quality, the dataset now has 147 samples.
Scores are largely similar, except for Voxtral Small, which now scores 29.93.
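
As a rough guide to how much a single-run score can move from sampling alone on a set this size, the binomial standard error of an accuracy can be computed as below. This is a sketch under the assumption that each question is an independent correct/incorrect trial; the function name `accuracy_standard_error` is hypothetical and not part of this PR. The inputs 29.93 and 147 are the Voxtral Small score and sample count quoted above.

```python
import math


def accuracy_standard_error(accuracy_pct, n_samples):
    """Binomial standard error of an accuracy, in percentage points.

    Assumes each sample is an independent correct/incorrect trial.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_samples)


se = accuracy_standard_error(29.93, 147)
print(f"standard error: about {se:.1f} points")
```

Under that assumption the standard error is a few points, so differences of a similar size between single runs are hard to separate from noise without repeated runs.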

shruthan added 2 commits September 23, 2025 23:29

add gpqa diamond

fd34c10

Merge branch 'main' into scratch/gpqa

1ac3d58

akshaykalkunte approved these changes Sep 24, 2025
