Bedrock Agent Evaluation is an evaluation framework for Amazon Bedrock agents that assesses tool use and chain-of-thought reasoning, with observability dashboards in Langfuse.
https://github.com/awslabs/agent-evaluation implements an LLM agent (evaluator) that will orchestrate conversations with your own agent (target) and evaluate the responses during the conversation.
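If your target is an Amazon Bedrock Agent, a quick way to confirm it responds before wiring it into the framework is a one-off invocation from the AWS CLI. This is only a sanity-check sketch; the agent ID, alias ID, session ID, and prompt below are placeholders for your own values:

```bash
# Invoke the target Bedrock Agent once to confirm it responds.
# Placeholder IDs -- substitute your own agent ID, alias ID, and session ID.
aws bedrock-agent-runtime invoke-agent \
  --agent-id YOUR_AGENT_ID \
  --agent-alias-id YOUR_AGENT_ALIAS_ID \
  --session-id test-session-1 \
  --input-text "What can you help me with?" \
  response.txt

# The agent's completion is written to the output file.
cat response.txt
```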
Our repository provides the following additional features:
- Test your own Bedrock Agent with custom questions
- Option to use LLM-as-a-judge without a ground truth reference
- Agent Goal metrics for chain-of-thought reasoning, plus task-specific metrics for RAG, Text2SQL, and custom tools
- Observability through Langfuse integration, including latency and cost information
- Dashboard comparison of agents built on multiple Bedrock LLMs
There are two ways to set up the framework:
- Clone this repo to a SageMaker notebook instance
- Clone this repo locally and set up AWS CLI credentials for your AWS account
Whichever option you choose, set up Langfuse for observability:
- Set up a Langfuse account using the cloud offering https://www.langfuse.com or the self-hosted option for AWS https://github.com/aws-samples/deploy-langfuse-on-ecs-with-fargate/tree/main/langfuse-v3
- Create an organization in Langfuse
- Create a project within your Langfuse organization
- Save your Langfuse project keys (Secret Key, Public Key, and Host) to use in the configuration file (see the illustrative snippet after this list)
- If you are using the self-hosted option and want to see model costs, you must create a model definition in Langfuse for the LLM used by your agent; instructions can be found at https://langfuse.com/docs/model-usage-and-cost#custom-model-definitions
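The exact variable names expected by the framework are defined by the configuration template in this repository (copied to config.env later in the setup). As a purely illustrative sketch, assuming Langfuse-style names, the saved keys end up looking something like this:

```bash
# Illustrative only -- the real variable names are defined in the
# repository's configuration template; substitute your own key values.
LANGFUSE_PUBLIC_KEY=pk-lf-your-public-key
LANGFUSE_SECRET_KEY=sk-lf-your-secret-key
LANGFUSE_HOST=https://cloud.langfuse.com   # or your self-hosted Langfuse URL
```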
To run on a SageMaker notebook instance:
- Create a SageMaker notebook instance in your AWS account
- Open a terminal and navigate to the SageMaker/ folder within the instance
  ```bash
  cd SageMaker/
  ```
- Clone this repository
  ```bash
  git clone https://github.com/aws-samples/amazon-bedrock-agent-evaluation-framework
  ```
- Navigate to the repository and install the necessary requirements
  ```bash
  cd amazon-bedrock-agent-evaluation-framework/
  pip3 install -r requirements.txt
  ```
To run locally:
- Clone this repository
  ```bash
  git clone https://github.com/aws-samples/amazon-bedrock-agent-evaluation-framework
  ```
- Navigate to the repository and install the necessary requirements
  ```bash
  cd amazon-bedrock-agent-evaluation-framework/
  pip3 install -r requirements.txt
  ```
- Set up the AWS CLI to access AWS account resources locally: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html
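Once the CLI is installed, standard AWS CLI commands can be used to configure credentials and confirm they resolve to the expected account:

```bash
# Configure credentials (prompts for access key, secret key, default region, output format)
aws configure

# Confirm the credentials resolve to the expected account and identity
aws sts get-caller-identity
```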
There are two ways to run evaluations:
- Bring your own agent to evaluate
- Create sample agents from this repository and run evaluations
To bring your own agent:
- Bring the existing agent you want to evaluate (RAG and Text2SQL evaluations are currently built in)
- Create a dataset file for evaluations, either manually or using the generator (refer to data_files/sample_data_file.json for the required format; an illustrative sketch follows these steps)
- Copy the template configuration file and fill in the necessary information
  ```bash
  cp config_tpl.env.tpl config.env
  ```
- Run driver.py to execute the evaluation job against your dataset
  ```bash
  python3 driver.py
  ```
- Check your Langfuse project console to see the evaluation results!
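The authoritative dataset schema is data_files/sample_data_file.json in this repository; the field names below are purely hypothetical. A minimal sketch of creating a dataset file by hand might look like:

```bash
# Hypothetical dataset file -- field names are illustrative only;
# match the actual schema shown in data_files/sample_data_file.json.
cat > data_files/my_eval_data.json <<'EOF'
[
  {
    "question": "How many customers placed an order in 2023?",
    "ground_truth": "1,284 customers placed at least one order in 2023."
  }
]
EOF
```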
To use the sample agents instead, follow the instructions in the Blog Sample Agents README. This is a guided way to run the evaluation framework on pre-created Bedrock Agents.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.