Adding tool success evaluator #43427
Thank you for your contribution @ashaabansoliman! We will review the pull request and get back to you soon.
Pull Request Overview
This PR adds a new ToolSuccessEvaluator to the Azure AI Evaluation SDK. The evaluator determines whether tool calls made by an AI agent succeeded or failed based on technical criteria, focusing on errors, exceptions, timeouts, and empty results rather than business logic correctness.
Key changes:
- Implements ToolSuccessEvaluator class with comprehensive error detection logic
- Adds prompty template with detailed evaluation framework and examples
- Includes sample usage code in evaluation examples
- Adds error target enum entry for the new evaluator
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
File | Description |
---|---|
_tool_success.py | Core evaluator implementation with tool call parsing and success/failure detection |
tool_success.prompty | Comprehensive prompt template with evaluation criteria and examples |
_exceptions.py | Adds TOOL_SUCCESS_EVALUATOR to ErrorTarget enum |
__init__.py | Exports the new ToolSuccessEvaluator class |
evaluation_samples_evaluate.py | Sample usage code demonstrating the evaluator |
evaluation_samples_evaluate_fdp.py | Additional sample usage for Azure AI Project URL format |
TOOL_CALLS is a list of tool calls that was produced by the AI agent's. It includes calls together with the result of every tool call. | ||
TOOL_DEFINITIONS is a list of definitions for the tools that was called. This definition can contain a description of functionality provided by the tool , the parameters that the tool accept and the expected return of the tool. This definition can contribute to the assessment of whether a tool call succeeded or failed. |
Copilot
AI
Oct 14, 2025
Corrected spacing around comma in 'tool , the' to 'tool, the'
TOOL_DEFINITIONS is a list of definitions for the tools that was called. This definition can contain a description of functionality provided by the tool , the parameters that the tool accept and the expected return of the tool. This definition can contribute to the assessment of whether a tool call succeeded or failed. | |
TOOL_DEFINITIONS is a list of definitions for the tools that was called. This definition can contain a description of functionality provided by the tool, the parameters that the tool accept and the expected return of the tool. This definition can contribute to the assessment of whether a tool call succeeded or failed. |
Copilot uses AI. Check for mistakes.
EXPECTED OUTPUT | ||
{ | ||
"explanation": "Although the returned value 7 is not the square root of 4, but this is a business mistake in the tool. The tool did not return a result indicating a technical error", | ||
"details": { | ||
"failed_tools": "", | ||
}, | ||
"success": True | ||
} |
Copilot
AI
Oct 14, 2025
Redundant conjunction 'but' - should be either 'Although...value 7 is not..., this is a business mistake' or 'The returned value 7 is not...but this is a business mistake'
EXPECTED OUTPUT | ||
{ | ||
"explanation": "The tool returned a semicolon separated list of names. Although the description in the definition says it should return comma-separated list , this formatting mistake is a business mistake of the tool , not a technical failure. The tool did not return an error", | ||
"details": { | ||
"failed_tools": "", | ||
}, | ||
"success": True | ||
} |
Copilot
AI
Oct 14, 2025
Corrected spacing around commas in 'comma-separated list , this' and 'tool , not' to 'comma-separated list, this' and 'tool, not'
EXPECTED OUTPUT | ||
{ | ||
"explanation": "The tool returned empty response , however , given the tool definition , it should never return empty response because there should be weather info at any given point in time. An empty response here is considered a technical failure. The conclusion is the get_weather_info failed", | ||
"details": { | ||
"failed_tools": "get_weather_info", | ||
}, | ||
"success": False | ||
} |
Copilot
AI
Oct 14, 2025
Corrected spacing around commas in 'response , however , given' and 'definition , it' to 'response, however, given' and 'definition, it'
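The three EXPECTED OUTPUT examples above share one shape: an `explanation` string, a `details.failed_tools` string, and a `success` flag. As quoted, they use Python-style `True`/`False` and a trailing comma, which a strict JSON parser would reject. The following is a minimal sketch of how a caller might read a strict-JSON version of that structure. The names `ToolSuccessResult` and `parse_tool_success_output` are hypothetical, and treating `failed_tools` as a comma-separated string is an assumption (the examples only show an empty string and a single name); none of this is taken from the PR's implementation.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ToolSuccessResult:
    """Hypothetical container mirroring the fields in the EXPECTED OUTPUT examples."""
    success: bool
    failed_tools: List[str]
    explanation: str


def parse_tool_success_output(raw: str) -> ToolSuccessResult:
    """Parse a strict-JSON rendering of the evaluator output (illustrative only)."""
    data = json.loads(raw)
    # Assumption: multiple failed tools would be joined with commas.
    failed = data.get("details", {}).get("failed_tools", "")
    names = [name.strip() for name in failed.split(",") if name.strip()]
    return ToolSuccessResult(bool(data["success"]), names, data.get("explanation", ""))


raw = (
    '{"explanation": "get_weather_info returned an empty response.", '
    '"details": {"failed_tools": "get_weather_info"}, "success": false}'
)
result = parse_tool_success_output(raw)
print(result.success, result.failed_tools)  # False ['get_weather_info']
```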
message="response, is a required inputs to the Tool Success evaluator.", | ||
internal_message="response, is a required inputs to the Tool Success evaluator.", |
Copilot
AI
Oct 14, 2025
Incorrect grammar: 'response, is a required inputs' should be 'response is a required input'
message="response, is a required inputs to the Tool Success evaluator.", | |
internal_message="response, is a required inputs to the Tool Success evaluator.", | |
message="response is a required input to the Tool Success evaluator.", | |
internal_message="response is a required input to the Tool Success evaluator.", |
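For context, the corrected message belongs to a required-input check. The surrounding code and the SDK's exception type are not shown in this excerpt, so the sketch below uses a plain `ValueError` and a hypothetical `validate_inputs` helper as stand-ins, only to illustrate where the message would surface.

```python
from typing import Any, Optional


def validate_inputs(response: Optional[Any]) -> Any:
    """Hypothetical stand-in for the evaluator's required-input check."""
    if response is None:
        # The real evaluator raises an SDK-specific exception; ValueError is an assumption here.
        raise ValueError("response is a required input to the Tool Success evaluator.")
    return response


try:
    validate_inputs(None)
except ValueError as err:
    message = str(err)
    print(message)
```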
tool_success_evaluator = ToolSuccessEvaluator(model_config=model_config) | ||
tool_success_evaluator( | ||
response=json.loads( """[{"createdAt": "2025-08-16T08:39:47Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_id557", "name": "get_date", "arguments": {}}]}, {"createdAt": "2025-08-16T08:39:49Z", "run_id": "run_id22", "tool_call_id": "call_id557", "role": "tool", "content": [{"type": "tool_result", "tool_result": {"date_and_time": "2019-09-07 23:59:59"}}]}, {"createdAt": "2025-08-16T08:39:51Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_Run1", "name": "get_spending_by_day", "arguments": {"start_date": "2019-10-01", "end_date": "2019-10-31"}}]}, {"createdAt": "2025-08-16T08:39:53Z", "run_id": "run_id22", "tool_call_id": "call_Run1", "role": "tool", "content": [{"type": "tool_result", "tool_result": {"spending": {}}}]}, {"createdAt": "2025-08-16T08:39:54Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "text", "text": "There are no spending records for October."}]}]"""), | ||
tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve of a spending line id from your spending records."}]""") |
Copilot
AI
Oct 14, 2025
Grammar error: 'Retrieve of a spending line id' should be 'Retrieve a spending line id' or 'Retrieve the category of a spending line id'
tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve of a spending line id from your spending records."}]""") | |
tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve the category of a spending line id from your spending records."}]""") |
tool_success_evaluator = ToolSuccessEvaluator(model_config=model_config) | ||
tool_success_evaluator( | ||
response=json.loads( """[{"createdAt": "2025-08-16T08:39:47Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_id557", "name": "get_date", "arguments": {}}]}, {"createdAt": "2025-08-16T08:39:49Z", "run_id": "run_id22", "tool_call_id": "call_id557", "role": "tool", "content": [{"type": "tool_result", "tool_result": {"date_and_time": "2019-09-07 23:59:59"}}]}, {"createdAt": "2025-08-16T08:39:51Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_Run1", "name": "get_spending_by_day", "arguments": {"start_date": "2019-10-01", "end_date": "2019-10-31"}}]}, {"createdAt": "2025-08-16T08:39:53Z", "run_id": "run_id22", "tool_call_id": "call_Run1", "role": "tool", "content": [{"type": "tool_result", "tool_result": {"spending": {}}}]}, {"createdAt": "2025-08-16T08:39:54Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "text", "text": "There are no spending records for October."}]}]"""), | ||
tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve of a spending line id from your spending records."}]""") |
Copilot
AI
Oct 14, 2025
Grammar error: 'Retrieve of a spending line id' should be 'Retrieve a spending line id' or 'Retrieve the category of a spending line id'
tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve of a spending line id from your spending records."}]""") | |
tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve the category of a spending line id from your spending records."}]""") |
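The sample `response` above is a list of messages in which each assistant `tool_call` is followed by a `tool` message carrying the matching `tool_call_id`. As a rough illustration of how calls and results line up in that format, the sketch below pairs them by id. The helper name `pair_tool_calls` is hypothetical and this is not the parsing logic from `_tool_success.py`; the message data is a trimmed copy of the sample.

```python
from typing import Any, Dict, List


def pair_tool_calls(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Pair each assistant tool call with the tool result sharing its tool_call_id."""
    results: Dict[str, Any] = {}
    for msg in messages:
        if msg.get("role") == "tool":
            for item in msg.get("content", []):
                if item.get("type") == "tool_result":
                    results[msg.get("tool_call_id")] = item.get("tool_result")

    paired = []
    for msg in messages:
        if msg.get("role") != "assistant":
            continue
        for item in msg.get("content", []):
            if item.get("type") == "tool_call":
                paired.append({
                    "name": item.get("name"),
                    "arguments": item.get("arguments"),
                    # None means no result message was found for this call.
                    "result": results.get(item.get("tool_call_id")),
                })
    return paired


messages = [
    {"role": "assistant", "content": [
        {"type": "tool_call", "tool_call_id": "call_id557", "name": "get_date", "arguments": {}}]},
    {"role": "tool", "tool_call_id": "call_id557", "content": [
        {"type": "tool_result", "tool_result": {"date_and_time": "2019-09-07 23:59:59"}}]},
    {"role": "assistant", "content": [
        {"type": "tool_call", "tool_call_id": "call_Run1", "name": "get_spending_by_day",
         "arguments": {"start_date": "2019-10-01", "end_date": "2019-10-31"}}]},
    {"role": "tool", "tool_call_id": "call_Run1", "content": [
        {"type": "tool_result", "tool_result": {"spending": {}}}]},
]
pairs = pair_tool_calls(messages)
for pair in pairs:
    print(pair["name"], "->", pair["result"])
```

In this sample, `get_spending_by_day` returns an empty `spending` object, which is the kind of empty result the evaluator has to judge against the tool definition.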
Description
Adding tool success evaluator after fixing the output structure, typos, and examples.