
Conversation

@rootfs rootfs commented Sep 15, 2025

What type of PR is this?

Currently, the MMLU-Pro eval and result_to_config.py form a two-step process that is not well integrated. Additionally, the reasoning eval is not implemented yet.

This PR consolidates them into a single script that uses a config template, together with a more comprehensive model eval, to generate the router config.

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Release Notes: Yes/No


github-actions bot commented Sep 15, 2025

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 src

Owners: @rootfs, @Xunzhuo, @wangchen615
Files changed:

  • src/training/model_eval/config_template.yaml
  • src/training/model_eval/reasoning_eval_consolidated.py


🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

@rootfs rootfs marked this pull request as draft September 15, 2025 14:28

netlify bot commented Sep 15, 2025

Deploy Preview for vllm-semantic-router ready!

  • 🔨 Latest commit: 9c934b6
  • 🔍 Latest deploy log: https://app.netlify.com/projects/vllm-semantic-router/deploys/68c8253bdc100c00089ff73a
  • 😎 Deploy Preview: https://deploy-preview-138--vllm-semantic-router.netlify.app


@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces a comprehensive statistical reasoning evaluation framework that integrates model evaluation and configuration generation into a single consolidated script. The framework uses advanced statistical methods to determine optimal reasoning mode configurations for LLM routers based on empirical evidence.

  • Implements multi-criteria statistical analysis using McNemar's test, Fisher's exact test, Cohen's h effect size, and Bayesian inference
  • Consolidates the previously separate MMLU pro eval and result_to_config.py processes into one unified workflow
  • Generates production-ready YAML configurations with data-driven reasoning decisions for each category
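To illustrate the kind of statistical criteria involved, here is a minimal, self-contained sketch of two of them — Cohen's h effect size and an exact McNemar test on discordant pairs. The function names and the example figures are illustrative only and are not taken from the PR:

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact (binomial) two-sided McNemar test on discordant pairs.

    b: questions correct only with reasoning ON
    c: questions correct only with reasoning OFF
    Under H0 the discordant pairs split 50/50.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Example: accuracy 0.82 with reasoning vs. 0.70 without;
# reasoning flips 15 questions to correct but breaks 3.
h = cohens_h(0.82, 0.70)
p = mcnemar_exact_p(15, 3)
```

By the usual conventions, a small h (below roughly 0.2) or a non-significant McNemar p-value would argue against enabling reasoning mode for that category.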

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Files reviewed:

  • src/training/model_eval/reasoning_eval_consolidated.py — Main evaluation script implementing the statistical framework and configuration generation
  • src/training/model_eval/config_template.yaml — YAML template defining the router configuration structure with reasoning family mappings


Comment on lines 696 to 697
print(f"⚠️ Warning: Template file not found at {template_path}")
# exit with error

Copilot AI Sep 15, 2025


The comment indicates an exit should occur but no actual exit or return statement is present. This will cause the function to continue and return None, potentially causing errors downstream when the config is expected to be a dictionary.
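A minimal sketch of the fix the comment is asking for — fail fast when the template is missing rather than falling through and returning None. The helper name is hypothetical, not the PR's actual function:

```python
import os
import sys

def load_config_template(template_path: str) -> str:
    # Hypothetical helper illustrating the fix: exit with an error
    # instead of silently continuing and returning None.
    if not os.path.exists(template_path):
        print(f"⚠️ Warning: Template file not found at {template_path}")
        sys.exit(1)
    with open(template_path) as f:
        return f.read()  # the real script would parse this as YAML
```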


params = {
"model": model,
"messages": [{"role": "user", "content": prompt}],
"max_tokens": 6144, # Increased for reasoning mode support (allows full reasoning chains)

Copilot AI Sep 15, 2025


Magic number 6144 should be defined as a named constant at the module level for better maintainability and configurability.
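A sketch of the suggested refactor, using an illustrative constant and wrapper name (neither is from the PR):

```python
# Module-level constant replacing the inline literal; sized to allow
# full reasoning chains in reasoning mode.
MAX_COMPLETION_TOKENS = 6144

def build_request_params(model: str, prompt: str) -> dict:
    # Hypothetical wrapper showing the constant in use.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": MAX_COMPLETION_TOKENS,
    }
```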


Comment on lines 224 to 225
if "deepseek" in model_lower and ("v31" in model_lower):
# DeepSeek v3.1 reasoning via chat template kwargs

Copilot AI Sep 15, 2025


The version check 'v31' is fragile and may not match future versioning schemes (e.g., 'v3.1', 'v31.1'). Consider using a more robust pattern matching approach.

Suggested change:

  - if "deepseek" in model_lower and ("v31" in model_lower):
  -     # DeepSeek v3.1 reasoning via chat template kwargs
  + if "deepseek" in model_lower and re.search(r"v3(?:[._]?1(?:\.\d+)?)?", model_lower):
  +     # DeepSeek v3.x reasoning via chat template kwargs
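The suggested pattern can be sanity-checked against a few plausible model-name spellings. Note that because every group after "v3" is optional, it also matches plain "v3", which is consistent with the broadened "v3.x" comment:

```python
import re

DEEPSEEK_V3_PATTERN = re.compile(r"v3(?:[._]?1(?:\.\d+)?)?")

# Variants the original "v31" substring check would have missed:
for name in ("deepseek-v31", "deepseek-v3.1", "deepseek-v3_1", "deepseek-v3"):
    assert DEEPSEEK_V3_PATTERN.search(name.lower())

# Unrelated models do not match:
assert not DEEPSEEK_V3_PATTERN.search("llama-3.1-70b")
```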


Signed-off-by: Huamin Chen <[email protected]>