Bedrock Agent Evaluation is an evaluation framework for Amazon Bedrock agents that assesses tool use and chain-of-thought reasoning, with observability dashboards in Langfuse.
https://github.com/awslabs/agent-evaluation implements an LLM agent (evaluator) that will orchestrate conversations with your own agent (target) and evaluate the responses during the conversation.
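If your target is an Amazon Bedrock Agent, a quick way to confirm it responds before wiring it into the framework is a one-off invocation from the AWS CLI. This is only a sanity-check sketch; the agent ID, alias ID, session ID, and prompt below are placeholders for your own values:

```bash
# Invoke the target Bedrock Agent once to confirm it responds.
# Placeholder IDs -- substitute your own agent ID, alias ID, and session ID.
aws bedrock-agent-runtime invoke-agent \
  --agent-id YOUR_AGENT_ID \
  --agent-alias-id YOUR_AGENT_ALIAS_ID \
  --session-id test-session-1 \
  --input-text "What can you help me with?" \
  response.txt

# The agent's completion is written to the output file.
cat response.txt
```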
Our repository provides the following additional features:
- Test your own Bedrock Agent with custom questions
- Option to use LLM-as-a-judge without a ground truth reference
- Agent Goal metrics for chain-of-thought reasoning, plus task-specific metrics for RAG, Text2SQL, and custom tools
- Observability through Langfuse integration, including latency and cost information
- Dashboard comparison of agents built on multiple Bedrock LLMs
There are two ways to set up the framework:
- Clone this repo to a SageMaker notebook instance
- Clone this repo locally and set up AWS CLI credentials for your AWS account
Whichever option you choose, set up Langfuse for observability:
- Set up a Langfuse account using the cloud offering https://www.langfuse.com or the self-hosted option for AWS https://github.com/aws-samples/deploy-langfuse-on-ecs-with-fargate/tree/main/langfuse-v3
- Create an organization in Langfuse
- Create a project within your Langfuse organization
- Save your Langfuse project keys (Secret Key, Public Key, and Host) to use in the configuration file (see the illustrative snippet after this list)
- If you are using the self-hosted option and want to see model costs, you must create a model definition in Langfuse for the LLM used by your agent; instructions can be found at https://langfuse.com/docs/model-usage-and-cost#custom-model-definitions
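The exact variable names expected by the framework are defined by the configuration template in this repository (copied to config.env later in the setup). As a purely illustrative sketch, assuming Langfuse-style names, the saved keys end up looking something like this:

```bash
# Illustrative only -- the real variable names are defined in the
# repository's configuration template; substitute your own key values.
LANGFUSE_PUBLIC_KEY=pk-lf-your-public-key
LANGFUSE_SECRET_KEY=sk-lf-your-secret-key
LANGFUSE_HOST=https://cloud.langfuse.com   # or your self-hosted Langfuse URL
```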
To run on a SageMaker notebook instance:
- Create a SageMaker notebook instance in your AWS account
- Open a terminal and navigate to the SageMaker/ folder within the instance
  ```bash
  cd SageMaker/
  ```
- Clone this repository
  ```bash
  git clone https://github.com/aws-samples/amazon-bedrock-agent-evaluation-framework
  ```
- Navigate to the repository and install the necessary requirements
  ```bash
  cd amazon-bedrock-agent-evaluation-framework/
  pip3 install -r requirements.txt
  ```
To run locally:
- Clone this repository
  ```bash
  git clone https://github.com/aws-samples/amazon-bedrock-agent-evaluation-framework
  ```
- Navigate to the repository and install the necessary requirements
  ```bash
  cd amazon-bedrock-agent-evaluation-framework/
  pip3 install -r requirements.txt
  ```
- Set up the AWS CLI to access AWS account resources locally: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html
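Once the CLI is installed, standard AWS CLI commands can be used to configure credentials and confirm they resolve to the expected account:

```bash
# Configure credentials (prompts for access key, secret key, default region, output format)
aws configure

# Confirm the credentials resolve to the expected account and identity
aws sts get-caller-identity
```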
There are two ways to run evaluations:
- Bring your own agent to evaluate
- Create sample agents from this repository and run evaluations
To bring your own agent:
- Bring the existing agent you want to evaluate (RAG and Text2SQL evaluations are currently built in)
- Create a dataset file for evaluations, either manually or using the generator (refer to data_files/sample_data_file.json for the required format; an illustrative sketch follows these steps)
- Copy the template configuration file and fill in the necessary information
  ```bash
  cp config_tpl.env.tpl config.env
  ```
- Run driver.py to execute the evaluation job against your dataset
  ```bash
  python3 driver.py
  ```
- Check your Langfuse project console to see the evaluation results!
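The authoritative dataset schema is data_files/sample_data_file.json in this repository; the field names below are purely hypothetical. A minimal sketch of creating a dataset file by hand might look like:

```bash
# Hypothetical dataset file -- field names are illustrative only;
# match the actual schema shown in data_files/sample_data_file.json.
cat > data_files/my_eval_data.json <<'EOF'
[
  {
    "question": "How many customers placed an order in 2023?",
    "ground_truth": "1,284 customers placed at least one order in 2023."
  }
]
EOF
```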
To use the sample agents instead, follow the instructions in the Blog Sample Agents README. This is a guided way to run the evaluation framework on pre-created Bedrock Agents.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.