@ashaabansoliman ashaabansoliman commented Oct 14, 2025

Description

Adds the tool success evaluator after fixing the output structure, typos, and examples.
Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@ashaabansoliman ashaabansoliman requested a review from a team as a code owner October 14, 2025 21:06
@Copilot Copilot AI review requested due to automatic review settings October 14, 2025 21:06
@github-actions github-actions bot added the labels Community Contribution (Community members are working on the issue), customer-reported (Issues that are reported by GitHub users external to the Azure organization), and Evaluation (Issues related to the client library for Azure AI Evaluation) Oct 14, 2025
@github-actions

Thank you for your contribution @ashaabansoliman! We will review the pull request and get back to you soon.


@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds a new ToolSuccessEvaluator to the Azure AI Evaluation SDK. The evaluator determines whether tool calls made by an AI agent succeeded or failed based on technical criteria, focusing on errors, exceptions, timeouts, and empty results rather than business logic correctness.

Key changes:

  • Implements ToolSuccessEvaluator class with comprehensive error detection logic
  • Adds prompty template with detailed evaluation framework and examples
  • Includes sample usage code in evaluation examples
  • Adds error target enum entry for the new evaluator
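
The line the overview draws between technical failures and business-logic mistakes can be illustrated with a small rule-based sketch. This is not the PR's implementation: the evaluator described here relies on a prompty template, and the function name, the marker keys, and the `may_be_empty` flag below are all hypothetical.

```python
from typing import Any

# Hypothetical marker keys; the evaluator in this PR delegates this judgment to an LLM prompt.
ERROR_KEYS = {"error", "exception", "traceback"}


def is_technical_failure(tool_result: Any, may_be_empty: bool = True) -> bool:
    """Return True when a tool result looks like a technical failure."""
    if tool_result is None:
        return True
    if isinstance(tool_result, dict):
        # An error-like key at the top level is treated as a failure signal.
        if ERROR_KEYS & {str(key).lower() for key in tool_result}:
            return True
    if isinstance(tool_result, str) and "timed out" in tool_result.lower():
        return True
    # Empty results count as failures only when the tool should always return data.
    if not may_be_empty and tool_result in ("", [], {}):
        return True
    return False
```

Under these assumptions, a wrong but well-formed value (such as 7 for the square root of 4) is not flagged, while an empty result from a tool that should always return data is.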

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `_tool_success.py` | Core evaluator implementation with tool call parsing and success/failure detection |
| `tool_success.prompty` | Comprehensive prompt template with evaluation criteria and examples |
| `_exceptions.py` | Adds TOOL_SUCCESS_EVALUATOR to ErrorTarget enum |
| `__init__.py` | Exports the new ToolSuccessEvaluator class |
| `evaluation_samples_evaluate.py` | Sample usage code demonstrating the evaluator |
| `evaluation_samples_evaluate_fdp.py` | Additional sample usage for Azure AI Project URL format |



TOOL_CALLS is a list of tool calls that was produced by the AI agent's. It includes calls together with the result of every tool call.
TOOL_DEFINITIONS is a list of definitions for the tools that was called. This definition can contain a description of functionality provided by the tool , the parameters that the tool accept and the expected return of the tool. This definition can contribute to the assessment of whether a tool call succeeded or failed.

Copilot AI Oct 14, 2025

Corrected spacing around comma in 'tool , the' to 'tool, the'

Suggested change
- TOOL_DEFINITIONS is a list of definitions for the tools that was called. This definition can contain a description of functionality provided by the tool , the parameters that the tool accept and the expected return of the tool. This definition can contribute to the assessment of whether a tool call succeeded or failed.
+ TOOL_DEFINITIONS is a list of definitions for the tools that was called. This definition can contain a description of functionality provided by the tool, the parameters that the tool accept and the expected return of the tool. This definition can contribute to the assessment of whether a tool call succeeded or failed.
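
For readers unfamiliar with the shape of a tool definition, here is a minimal sketch of one entry. Only the `name`, `type`, and `description` keys appear in this PR's samples; the `parameters` block and the `summarize_definition` helper are assumptions added for illustration.

```python
from typing import Any, Dict

# Hypothetical definition; "parameters" is an assumed JSON-Schema-style block.
get_weather_info: Dict[str, Any] = {
    "name": "get_weather_info",
    "type": "function",
    "description": "Return current weather for a city. Always returns a non-empty result.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def summarize_definition(definition: Dict[str, Any]) -> str:
    """Render one definition as a single line of text."""
    # Joining a dict iterates over its keys, i.e. the parameter names.
    params = ", ".join(definition.get("parameters", {}).get("properties", {}))
    return f"{definition['name']}({params}): {definition.get('description', '')}"
```

A description that states what the tool must always return is the kind of detail that lets an empty response be judged a technical failure.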

Copilot uses AI. Check for mistakes.

Comment on lines +117 to +124
EXPECTED OUTPUT
{
"explanation": "Although the returned value 7 is not the square root of 4, but this is a business mistake in the tool. The tool did not return a result indicating a technical error",
"details": {
"failed_tools": "",
},
"success": True
}

Copilot AI Oct 14, 2025

Redundant conjunction 'but' - should be either 'Although...value 7 is not..., this is a business mistake' or 'The returned value 7 is not...but this is a business mistake'
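
A separate point about this expected output: it uses `True` and a trailing comma inside `details`, which is Python-literal style rather than strict JSON. The excerpt does not show how the evaluator parses model output, so the following is only a sketch, assuming a `json.loads`-based parser; `parse_result` is a hypothetical helper.

```python
import json

# Strict JSON: lowercase boolean, no trailing comma.
strict = (
    '{"explanation": "The tool did not return a technical error.", '
    '"details": {"failed_tools": ""}, "success": true}'
)
# Same structure written the way the prompt examples show it.
python_style = '{"explanation": "...", "details": {"failed_tools": "",}, "success": True}'


def parse_result(raw: str) -> dict:
    """Parse evaluator output, returning an empty dict when it is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {}
```

If the model imitates the prompt examples literally, a strict parser of this kind would reject the output, so the examples may be worth aligning with whatever parser the evaluator actually uses.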

Comment on lines +134 to +141
EXPECTED OUTPUT
{
"explanation": "The tool returned a semicolon separated list of names. Although the description in the definition says it should return comma-separated list , this formatting mistake is a business mistake of the tool , not a technical failure. The tool did not return an error",
"details": {
"failed_tools": "",
},
"success": True
}

Copilot AI Oct 14, 2025

Corrected spacing around commas in 'comma-separated list , this' and 'tool , not' to 'comma-separated list, this' and 'tool, not'

Comment on lines +172 to +179
EXPECTED OUTPUT
{
"explanation": "The tool returned empty response , however , given the tool definition , it should never return empty response because there should be weather info at any given point in time. An empty response here is considered a technical failure. The conclusion is the get_weather_info failed",
"details": {
"failed_tools": "get_weather_info",
},
"success": False
}

Copilot AI Oct 14, 2025

Corrected spacing around commas in 'response , however , given' and 'definition , it' to 'response, however, given' and 'definition, it'

Comment on lines +127 to +128
message="response, is a required inputs to the Tool Success evaluator.",
internal_message="response, is a required inputs to the Tool Success evaluator.",

Copilot AI Oct 14, 2025

Incorrect grammar: 'response, is a required inputs' should be 'response is a required input'

Suggested change
- message="response, is a required inputs to the Tool Success evaluator.",
- internal_message="response, is a required inputs to the Tool Success evaluator.",
+ message="response is a required input to the Tool Success evaluator.",
+ internal_message="response is a required input to the Tool Success evaluator.",
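
As a sketch of where such a message would surface, the check below raises when `response` is missing. `MissingInputError` and `require_response` are hypothetical stand-ins; the SDK's own exception type and validation path are not shown in this excerpt.

```python
from typing import Any, Optional


class MissingInputError(ValueError):
    """Hypothetical stand-in for the SDK's evaluation exception type."""


def require_response(response: Optional[Any]) -> Any:
    """Raise when the required `response` input is missing."""
    if response is None:
        raise MissingInputError(
            "response is a required input to the Tool Success evaluator."
        )
    return response
```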

tool_success_evaluator = ToolSuccessEvaluator(model_config=model_config)
tool_success_evaluator(
response=json.loads( """[{"createdAt": "2025-08-16T08:39:47Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_id557", "name": "get_date", "arguments": {}}]}, {"createdAt": "2025-08-16T08:39:49Z", "run_id": "run_id22", "tool_call_id": "call_id557", "role": "tool", "content": [{"type": "tool_result", "tool_result": {"date_and_time": "2019-09-07 23:59:59"}}]}, {"createdAt": "2025-08-16T08:39:51Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_Run1", "name": "get_spending_by_day", "arguments": {"start_date": "2019-10-01", "end_date": "2019-10-31"}}]}, {"createdAt": "2025-08-16T08:39:53Z", "run_id": "run_id22", "tool_call_id": "call_Run1", "role": "tool", "content": [{"type": "tool_result", "tool_result": {"spending": {}}}]}, {"createdAt": "2025-08-16T08:39:54Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "text", "text": "There are no spending records for October."}]}]"""),
tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve of a spending line id from your spending records."}]""")

Copilot AI Oct 14, 2025

Grammar error: 'Retrieve of a spending line id' should be 'Retrieve a spending line id' or 'Retrieve the category of a spending line id'

Suggested change
- tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve of a spending line id from your spending records."}]""")
+ tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve the category of a spending line id from your spending records."}]""")
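
To make the structure of the sample `response` easier to read, the sketch below pairs each tool call with its result by `tool_call_id`. It mirrors the message shape in the sample above (trimmed of timestamps and run IDs); `pair_tool_calls` is a hypothetical helper, not part of the SDK.

```python
from typing import Any, Dict, List, Tuple


def pair_tool_calls(messages: List[Dict[str, Any]]) -> List[Tuple[str, Dict[str, Any], Any]]:
    """Return (tool name, arguments, result) per tool call, matched by tool_call_id."""
    # First pass: index tool results by the call ID on the tool message.
    results: Dict[str, Any] = {}
    for msg in messages:
        if msg.get("role") == "tool":
            for item in msg.get("content", []):
                if item.get("type") == "tool_result":
                    results[msg.get("tool_call_id")] = item.get("tool_result")
    # Second pass: walk assistant messages and attach the matching result, if any.
    pairs = []
    for msg in messages:
        if msg.get("role") == "assistant":
            for item in msg.get("content", []):
                if item.get("type") == "tool_call":
                    call_id = item.get("tool_call_id")
                    pairs.append((item["name"], item.get("arguments", {}), results.get(call_id)))
    return pairs


messages = [
    {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_id557", "name": "get_date", "arguments": {}}]},
    {"role": "tool", "tool_call_id": "call_id557", "content": [{"type": "tool_result", "tool_result": {"date_and_time": "2019-09-07 23:59:59"}}]},
    {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_Run1", "name": "get_spending_by_day", "arguments": {"start_date": "2019-10-01", "end_date": "2019-10-31"}}]},
    {"role": "tool", "tool_call_id": "call_Run1", "content": [{"type": "tool_result", "tool_result": {"spending": {}}}]},
    {"role": "assistant", "content": [{"type": "text", "text": "There are no spending records for October."}]},
]
pairs = pair_tool_calls(messages)
```

Laid out this way, the sample calls `get_date` and `get_spending_by_day`, while the only definition supplied is `get_categories`.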

tool_success_evaluator = ToolSuccessEvaluator(model_config=model_config)
tool_success_evaluator(
response=json.loads( """[{"createdAt": "2025-08-16T08:39:47Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_id557", "name": "get_date", "arguments": {}}]}, {"createdAt": "2025-08-16T08:39:49Z", "run_id": "run_id22", "tool_call_id": "call_id557", "role": "tool", "content": [{"type": "tool_result", "tool_result": {"date_and_time": "2019-09-07 23:59:59"}}]}, {"createdAt": "2025-08-16T08:39:51Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_Run1", "name": "get_spending_by_day", "arguments": {"start_date": "2019-10-01", "end_date": "2019-10-31"}}]}, {"createdAt": "2025-08-16T08:39:53Z", "run_id": "run_id22", "tool_call_id": "call_Run1", "role": "tool", "content": [{"type": "tool_result", "tool_result": {"spending": {}}}]}, {"createdAt": "2025-08-16T08:39:54Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "text", "text": "There are no spending records for October."}]}]"""),
tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve of a spending line id from your spending records."}]""")

Copilot AI Oct 14, 2025

Grammar error: 'Retrieve of a spending line id' should be 'Retrieve a spending line id' or 'Retrieve the category of a spending line id'

Suggested change
- tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve of a spending line id from your spending records."}]""")
+ tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve the category of a spending line id from your spending records."}]""")
