@nhhoang96
Collaborator

📌 Description

  • Adds support for deep reasoning tasks, including MMAR and MMAU-PRO.
  • Reuses llm_judge_binary for MCQ evaluation on these two datasets.
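
To make the second bullet concrete, here is a minimal sketch of what binary judging of multiple-choice answers amounts to: each prediction receives a 0 or 1 verdict, and the score is the mean of those verdicts. The names judge_binary and binary_accuracy are hypothetical, and the string comparison is only a stand-in for the LLM judge call that llm_judge_binary would make; the actual prompt and parsing logic are not shown in this PR.

```python
def judge_binary(prediction: str, reference: str) -> int:
    """Stand-in for an LLM judge: 1 if the answers match after normalization, else 0."""
    return int(prediction.strip().lower() == reference.strip().lower())


def binary_accuracy(predictions, references):
    """Average the 0/1 verdicts over all items."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    verdicts = [judge_binary(p, r) for p, r in zip(predictions, references)]
    return sum(verdicts) / len(verdicts)


# Two of the three predictions match their reference after normalization.
score = binary_accuracy(["B", " a ", "C"], ["B", "A", "D"])
print(score)
```

Because every item contributes a plain 0 or 1, the same aggregation can serve both MMAR and MMAU-PRO without dataset-specific scoring code, which is presumably why the existing metric is reused.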

🔗 Related Issue(s)

🛠️ Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality including new tasks)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactor / Code cleanup
  • Maintenance / Chore / Task
  • Other (please describe):

✅ How Has This Been Tested?

  • Unit tests
  • Integration tests
  • Manual testing

Test Results / Screenshots (if applicable):

Sample run_config:

task_metric: 
  - ['mmar', 'llm_judge_binary']
  - ['mmau-pro', 'llm_judge_binary']
aggregate:
  - ['llm_judge_binary', ['mmau-pro']]
  - ['llm_judge_binary', ['mmar']]
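
As a rough illustration of how the two sections of this run_config relate, the sketch below checks that every dataset listed under aggregate also appears under task_metric with the same metric. It assumes, based only on the sample above, that task_metric entries are [task, metric] pairs and aggregate entries are [metric, [tasks]] pairs. The config is written as a Python dict to avoid a YAML dependency, and the helper find_unmatched_aggregates is hypothetical; it is not part of this repository.

```python
# Hypothetical consistency check for a run_config; not the project's own loader.
run_config = {
    "task_metric": [
        ["mmar", "llm_judge_binary"],
        ["mmau-pro", "llm_judge_binary"],
    ],
    "aggregate": [
        ["llm_judge_binary", ["mmau-pro"]],
        ["llm_judge_binary", ["mmar"]],
    ],
}


def find_unmatched_aggregates(config):
    """Return (metric, task) pairs in `aggregate` with no matching `task_metric` entry."""
    declared = {(task, metric) for task, metric in config["task_metric"]}
    unmatched = []
    for metric, tasks in config["aggregate"]:
        for task in tasks:
            if (task, metric) not in declared:
                unmatched.append((metric, task))
    return unmatched


print(find_unmatched_aggregates(run_config))  # prints []
```

An empty list means each aggregated dataset has a matching task and metric declaration, as is the case for the sample config.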

Sample Experimental Log:
2025-09-20_23-40-59_544502_default.log

📸 Screenshots / Demos

N/A

📋 Checklist

  • Code follows project style guidelines
  • Tests have been added/updated (if applicable)
  • Documentation has been updated (if applicable)
  • Linked relevant issue(s)
  • Self-reviewed my code

🙌 Additional Notes

N/A

@nhhoang96 self-assigned this on Sep 21, 2025
@nhhoang96 added the documentation and enhancement labels on Sep 21, 2025