Adding tool success evaluator #43427

ashaabansoliman · 2025-10-14T21:06:18Z

Description

Adding the tool success evaluator after fixing the output structure, typos, and examples.
Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

The pull request does not introduce [breaking changes]
CHANGELOG is updated for new features, bug fixes or other significant changes.
I have read the contribution guidelines.

General Guidelines and Best Practices

Title of the pull request is clear and informative.
There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

Pull request includes test coverage for the included changes.

github-actions · 2025-10-14T21:06:38Z

Thank you for your contribution @ashaabansoliman! We will review the pull request and get back to you soon.

Copilot

Pull Request Overview

This PR adds a new ToolSuccessEvaluator to the Azure AI Evaluation SDK. The evaluator determines whether tool calls made by an AI agent succeeded or failed based on technical criteria, focusing on errors, exceptions, timeouts, and empty results rather than business logic correctness.

Key changes:

Implements ToolSuccessEvaluator class with comprehensive error detection logic
Adds prompty template with detailed evaluation framework and examples
Includes sample usage code in evaluation examples
Adds error target enum entry for the new evaluator

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
`_tool_success.py`	Core evaluator implementation with tool call parsing and success/failure detection
`tool_success.prompty`	Comprehensive prompt template with evaluation criteria and examples
`_exceptions.py`	Adds TOOL_SUCCESS_EVALUATOR to ErrorTarget enum
`__init__.py`	Exports the new ToolSuccessEvaluator class
`evaluation_samples_evaluate.py`	Sample usage code demonstrating the evaluator
`evaluation_samples_evaluate_fdp.py`	Additional sample usage for Azure AI Project URL format

Copilot · 2025-10-14T21:07:33Z

...ation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_success/tool_success.prompty

+
+
+TOOL_CALLS is a list of tool calls that was produced by the AI agent's. It includes calls together with the result of every tool call.
+TOOL_DEFINITIONS is a list of definitions for the tools that was called. This definition can contain a description of functionality provided by the tool , the parameters that the tool accept and the expected return of the tool. This definition can contribute to the assessment of whether a tool call succeeded or failed. 


Corrected spacing around comma in 'tool , the' to 'tool, the'

Suggested change

-TOOL_DEFINITIONS is a list of definitions for the tools that was called. This definition can contain a description of functionality provided by the tool , the parameters that the tool accept and the expected return of the tool. This definition can contribute to the assessment of whether a tool call succeeded or failed.
+TOOL_DEFINITIONS is a list of definitions for the tools that was called. This definition can contain a description of functionality provided by the tool, the parameters that the tool accept and the expected return of the tool. This definition can contribute to the assessment of whether a tool call succeeded or failed.
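
To make the TOOL_CALLS description concrete, here is a minimal sketch of how each tool call could be matched with its result by `tool_call_id`. The message shape is taken from the sample payload in this PR; the helper `pair_tool_calls` is hypothetical and is not the evaluator's actual parsing code.

```python
import json


def pair_tool_calls(messages):
    """Pair each assistant tool call with its tool result, keyed by tool_call_id.

    Hypothetical helper for illustration only; not part of the SDK.
    """
    calls = {}
    for message in messages:
        role = message.get("role")
        if role == "assistant":
            # Assistant messages may carry one or more tool_call items.
            for item in message.get("content", []):
                if item.get("type") == "tool_call":
                    calls[item["tool_call_id"]] = {
                        "name": item["name"],
                        "arguments": item.get("arguments", {}),
                        "result": None,
                    }
        elif role == "tool":
            # Tool messages reference the originating call by tool_call_id.
            call = calls.get(message.get("tool_call_id"))
            if call is None:
                continue
            for item in message.get("content", []):
                if item.get("type") == "tool_result":
                    call["result"] = item.get("tool_result")
    return list(calls.values())


# Abbreviated version of the sample response used in this PR.
messages = json.loads("""[
  {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_id557", "name": "get_date", "arguments": {}}]},
  {"role": "tool", "tool_call_id": "call_id557", "content": [{"type": "tool_result", "tool_result": {"date_and_time": "2019-09-07 23:59:59"}}]},
  {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_Run1", "name": "get_spending_by_day", "arguments": {"start_date": "2019-10-01", "end_date": "2019-10-31"}}]},
  {"role": "tool", "tool_call_id": "call_Run1", "content": [{"type": "tool_result", "tool_result": {"spending": {}}}]},
  {"role": "assistant", "content": [{"type": "text", "text": "There are no spending records for October."}]}
]""")

pairs = pair_tool_calls(messages)
for pair in pairs:
    print(pair["name"], "->", pair["result"])
```

A call whose `result` stays `None` has no matching tool message in the conversation, which a caller may want to treat separately from an empty result such as `{"spending": {}}`.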

Copilot · 2025-10-14T21:07:33Z

...ation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_success/tool_success.prompty

+EXPECTED OUTPUT
+{
+     "explanation": "Although the returned value 7 is not the square root of 4, but this is a business mistake in the tool. The tool did not return a result indicating a technical error",
+     "details": {
+       "failed_tools": "",
+     },
+     "success": True
+} 


Redundant conjunction 'but' - should be either 'Although...value 7 is not..., this is a business mistake' or 'The returned value 7 is not...but this is a business mistake'

Copilot · 2025-10-14T21:07:34Z

...ation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_success/tool_success.prompty

+EXPECTED OUTPUT
+{
+     "explanation": "The tool returned a semicolon separated list of names. Although the description in the definition says it should return comma-separated list , this formatting mistake is a business mistake of the tool , not a technical failure. The tool did not return an error",
+     "details": {
+       "failed_tools": "",
+     },
+     "success": True
+} 


Corrected spacing around commas in 'comma-separated list , this' and 'tool , not' to 'comma-separated list, this' and 'tool, not'

Copilot · 2025-10-14T21:07:34Z

...ation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_success/tool_success.prompty

+EXPECTED OUTPUT
+{
+     "explanation": "The tool returned empty response , however , given the tool definition , it should never return empty response because there should be weather info at any given point in time. An empty response here is considered a technical failure. The conclusion is the get_weather_info failed",
+     "details": {
+       "failed_tools": "get_weather_info",
+     },
+     "success": False
+} 


Corrected spacing around commas in 'response , however , given' and 'definition , it' to 'response, however, given' and 'definition, it'
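
One detail of the EXPECTED OUTPUT examples quoted above: as written, they use Python-style literals (`True`, `False`) and a trailing comma after `"failed_tools"`, so a strict JSON parser would reject them. How the evaluator parses the model's output is not shown in this thread, so the following is only a sketch; `parse_expected_output` is a hypothetical helper that tries JSON first and falls back to a Python literal.

```python
import ast
import json

# Shortened copy of one EXPECTED OUTPUT example from the prompty diff.
expected_output = """{
     "explanation": "The tool did not return a result indicating a technical error",
     "details": {
       "failed_tools": "",
     },
     "success": True
}"""


def parse_expected_output(text):
    """Parse the output as JSON, falling back to a Python literal.

    Hypothetical helper; the SDK's real parsing is not shown in this PR thread.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Trailing commas and True/False are valid Python literals, not valid JSON.
        return ast.literal_eval(text)


result = parse_expected_output(expected_output)
print(result["success"], result["details"])
```

If the output is meant to be strict JSON, the examples would need `true`/`false` and no trailing comma instead of relying on a fallback like this one.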

Copilot · 2025-10-14T21:07:34Z

...valuation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_success/_tool_success.py

+                message="response, is a required inputs to the Tool Success evaluator.",
+                internal_message="response, is a required inputs to the Tool Success evaluator.",


Incorrect grammar: 'response, is a required inputs' should be 'response is a required input'

Suggested change

-message="response, is a required inputs to the Tool Success evaluator.",
-internal_message="response, is a required inputs to the Tool Success evaluator.",
+message="response is a required input to the Tool Success evaluator.",
+internal_message="response is a required input to the Tool Success evaluator.",
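
For context, a required-input check that produces this message could look roughly like the sketch below. The diff excerpt shows only the message strings, so the function name `validate_inputs` and the use of `ValueError` are assumptions made for illustration; the SDK raises its own exception type, which is not reproduced here.

```python
REQUIRED_MESSAGE = "response is a required input to the Tool Success evaluator."


def validate_inputs(eval_input):
    """Return eval_input unchanged, or raise if `response` is missing or None.

    Hypothetical stand-in: ValueError replaces the SDK's own exception class.
    """
    if eval_input.get("response") is None:
        raise ValueError(REQUIRED_MESSAGE)
    return eval_input


try:
    validate_inputs({"tool_definitions": []})
except ValueError as exc:
    print(f"Validation failed: {exc}")
```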

Copilot · 2025-10-14T21:07:35Z

sdk/evaluation/azure-ai-evaluation/samples/evaluation_samples_evaluate.py

+        tool_success_evaluator = ToolSuccessEvaluator(model_config=model_config)
+        tool_success_evaluator(
+            response=json.loads( """[{"createdAt": "2025-08-16T08:39:47Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_id557", "name": "get_date", "arguments": {}}]}, {"createdAt": "2025-08-16T08:39:49Z", "run_id": "run_id22", "tool_call_id": "call_id557", "role": "tool", "content": [{"type": "tool_result", "tool_result": {"date_and_time": "2019-09-07 23:59:59"}}]}, {"createdAt": "2025-08-16T08:39:51Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_Run1", "name": "get_spending_by_day", "arguments": {"start_date": "2019-10-01", "end_date": "2019-10-31"}}]}, {"createdAt": "2025-08-16T08:39:53Z", "run_id": "run_id22", "tool_call_id": "call_Run1", "role": "tool", "content": [{"type": "tool_result", "tool_result": {"spending": {}}}]}, {"createdAt": "2025-08-16T08:39:54Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "text", "text": "There are no spending records for October."}]}]"""),
+            tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve of a spending line id from your spending records."}]""")


Grammar error: 'Retrieve of a spending line id' should be 'Retrieve a spending line id' or 'Retrieve the category of a spending line id'

Suggested change

-tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve of a spending line id from your spending records."}]""")
+tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve the category of a spending line id from your spending records."}]""")

Copilot · 2025-10-14T21:07:35Z

sdk/evaluation/azure-ai-evaluation/samples/evaluation_samples_evaluate_fdp.py

+        tool_success_evaluator = ToolSuccessEvaluator(model_config=model_config)
+        tool_success_evaluator(
+            response=json.loads( """[{"createdAt": "2025-08-16T08:39:47Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_id557", "name": "get_date", "arguments": {}}]}, {"createdAt": "2025-08-16T08:39:49Z", "run_id": "run_id22", "tool_call_id": "call_id557", "role": "tool", "content": [{"type": "tool_result", "tool_result": {"date_and_time": "2019-09-07 23:59:59"}}]}, {"createdAt": "2025-08-16T08:39:51Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_Run1", "name": "get_spending_by_day", "arguments": {"start_date": "2019-10-01", "end_date": "2019-10-31"}}]}, {"createdAt": "2025-08-16T08:39:53Z", "run_id": "run_id22", "tool_call_id": "call_Run1", "role": "tool", "content": [{"type": "tool_result", "tool_result": {"spending": {}}}]}, {"createdAt": "2025-08-16T08:39:54Z", "run_id": "run_id22", "role": "assistant", "content": [{"type": "text", "text": "There are no spending records for October."}]}]"""),
+            tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve of a spending line id from your spending records."}]""")


Grammar error: 'Retrieve of a spending line id' should be 'Retrieve a spending line id' or 'Retrieve the category of a spending line id'

Suggested change

-tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve of a spending line id from your spending records."}]""")
+tool_definitions=json.loads( """[{"name": "get_categories", "type": "function", "description": "Retrieve the category of a spending line id from your spending records."}]""")

ashaabansoliman and others added 8 commits October 14, 2025 02:56

Adding tool success evaluator and its samples

837a89a

fixing spelling mistakes

7c9aba9

Merge branch 'Azure:main' into main

b683a34

fix one of the success samples in the prompt

9f7e3d2

updating the output structure to include success threshold and numeric result.

1d89436

Merge branch 'Azure:main' into main

3cb1d4f

delete a backupfile accidentally added to the commit

c986990

Merge branch 'main' of github.com:ashaabansoliman/azure-sdk-for-python-tool-success

4a560ed

ashaabansoliman requested a review from a team as a code owner October 14, 2025 21:06

Copilot AI review requested due to automatic review settings October 14, 2025 21:06

github-actions bot added the Community Contribution, customer-reported, and Evaluation labels Oct 14, 2025

Copilot AI reviewed Oct 14, 2025
