
Commit 98c9bad

lnhsingh, Copilot, and tanushree-sharma authored
Revert "Revert "Add composite scores"" 😬 (#581)
Reverts #580. MERGE ON TUESDAY.

Co-authored-by: Copilot <[email protected]>
Co-authored-by: Tanushree Sharma <[email protected]>
1 parent c58d6d6 commit 98c9bad

File tree

4 files changed: +235 -0 lines changed


β€Žsrc/docs.jsonβ€Ž

Lines changed: 1 addition & 0 deletions
@@ -842,6 +842,7 @@
  "pages": [
    "langsmith/code-evaluator",
    "langsmith/llm-as-judge",
+   "langsmith/composite-evaluators",
    "langsmith/summary",
    "langsmith/evaluate-pairwise"
  ]
Lines changed: 234 additions & 0 deletions
@@ -0,0 +1,234 @@
---
title: Composite evaluators
sidebarTitle: Composite evaluators
---

_Composite evaluators_ are a way to combine multiple evaluator scores into a single [score](/langsmith/evaluation-concepts#evaluator-outputs). This is useful when you want to evaluate multiple aspects of your application and roll the individual results up into one overall score.

## Create a composite evaluator using the UI

You can create composite evaluators on a [tracing project](/langsmith/observability-concepts#projects) (for [online evaluations](/langsmith/evaluation-concepts#online-evaluation)) or a [dataset](/langsmith/evaluation-concepts#datasets) (for [offline evaluations](/langsmith/evaluation-concepts#offline-evaluation)). With composite evaluators in the UI, you can compute a weighted average or weighted sum of multiple evaluator scores, with configurable weights.

<div style={{ textAlign: 'center' }}>
  <img
    className="block dark:hidden"
    src="/langsmith/images/create_composite_evaluator-light.png"
    alt="LangSmith UI showing the panel for configuring a composite evaluator."
  />

  <img
    className="hidden dark:block"
    src="/langsmith/images/create_composite_evaluator-dark.png"
    alt="LangSmith UI showing the panel for configuring a composite evaluator."
  />
</div>

### 1. Navigate to the tracing project or dataset

To start configuring a composite evaluator, navigate to the **Tracing Projects** or **Datasets & Experiments** tab and select a project or dataset.

- From within a tracing project: **+ New** > **Evaluator** > **Composite score**
- From within a dataset: **+ Evaluator** > **Composite score**

### 2. Configure the composite evaluator

1. Name your evaluator.
2. Select an aggregation method, either **Average** or **Sum** (see the worked example after this list).
   - **Average**: ∑(weight * score) / ∑(weight).
   - **Sum**: ∑(weight * score).
3. Add the feedback keys you want to include in the composite score.
4. Add the weights for the feedback keys. By default, the weights are equal for each feedback key. Adjust the weights to increase or decrease the importance of specific feedback keys in the final score.
5. Click **Create** to save the evaluator.

<Tip> If you need to adjust the weights for the composite scores, they can be updated after the evaluator is created. The resulting scores will be updated for all runs that have the evaluator configured. </Tip>
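To make the two aggregation methods concrete, here is a minimal sketch in plain Python (not part of the LangSmith API; the feedback keys, scores, and weights are hypothetical):

```python
# Hypothetical feedback scores (booleans coerced to 0/1) and weights.
scores = {"summary": 1, "tone": 0, "formatting": 1}
weights = {"summary": 2, "tone": 1, "formatting": 1}

# Sum: ∑(weight * score)
weighted_sum = sum(weights[k] * scores[k] for k in scores)   # 2*1 + 1*0 + 1*1 = 3

# Average: ∑(weight * score) / ∑(weight)
weighted_avg = weighted_sum / sum(weights.values())          # 3 / 4 = 0.75
```

When the weights sum to 1, the two methods give the same value; otherwise the average normalizes by the total weight while the sum does not.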
### 3. View composite evaluator results

Composite scores are attached to a run as **feedback**, similar to feedback from a single evaluator. How you can view them depends on where the evaluation was run:

**On a tracing project**:
- Composite scores appear as feedback on runs.
- [Filter for runs](/langsmith/filter-traces-in-application) with a composite score, or where the composite score meets a certain threshold.
- [Create a chart](/langsmith/dashboards#custom-dashboards) to visualize trends in the composite score over time.

**On a dataset**:
- View the composite scores in the experiments tab. You can also filter and sort experiments based on the average composite score of their runs.
- Click into an experiment to view the composite score for each run.

<Note> If any of the constituent evaluators are not configured on the run, the composite score will not be calculated for that run. </Note>

## Create composite feedback with the SDK

This guide describes setting up an evaluation that uses multiple evaluators and combines their scores with a custom aggregation function.

### 1. Configure evaluators on a dataset

Start by configuring your evaluators. In this example, the application generates a tweet from a blog introduction and uses three evaluators (summary, tone, and formatting) to assess the output.

If you already have your own dataset with evaluators configured, you can skip this step.
<Accordion title="Configure evaluators on a dataset.">

```python
# Import dependencies
import json
import os

from dotenv import load_dotenv
from langsmith import Client, wrappers
from openai import OpenAI
from pydantic import BaseModel

# Load environment variables from .env file
load_dotenv()

# Access environment variables
openai_api_key = os.getenv('OPENAI_API_KEY')
langsmith_api_key = os.getenv('LANGSMITH_API_KEY')
langsmith_project = os.getenv('LANGSMITH_PROJECT', 'default')

# Create the LangSmith client and a wrapped OpenAI client so model calls are traced
client = Client()
oai_client = wrappers.wrap_openai(OpenAI())

# Create a dataset. Only need to do this once.
examples = [
    {
        "inputs": {"blog_intro": "Today we're excited to announce the general availability of LangGraph Platform -- our purpose-built infrastructure and management layer for deploying and scaling long-running, stateful agents. Since our beta last June, nearly 400 companies have used LangGraph Platform to deploy their agents into production. Agent deployment is the next hard hurdle for shipping reliable agents, and LangGraph Platform dramatically lowers this barrier with: 1-click deployment to go live in minutes, 30 API endpoints for designing custom user experiences that fit any interaction pattern, Horizontal scaling to handle bursty, long-running traffic, A persistence layer to support memory, conversational history, and async collaboration with human-in-the-loop or multi-agent workflows, Native LangGraph Studio, the agent IDE, for easy debugging, visibility, and iteration"},
    },
    {
        "inputs": {"blog_intro": "Klarna has reshaped global commerce with its consumer-centric, AI-powered payment and shopping solutions. With over 85 million active users and 2.5 million daily transactions on its platform, Klarna is a fintech leader that simplifies shopping while empowering consumers with smarter, more flexible financial solutions. Klarna's flagship AI Assistant is revolutionizing the shopping and payments experience. Built on LangGraph and powered by LangSmith, the AI Assistant handles tasks ranging from customer payments, to refunds, to other payment escalations. With 2.5 million conversations to date, the AI Assistant is more than just a chatbot; it's a transformative agent that performs the work equivalent of 700 full-time staff, delivering results quickly and improving company efficiency."},
    },
]

dataset = client.create_dataset(dataset_name="Blog Intros")
client.create_examples(
    dataset_id=dataset.id,
    examples=examples,
)

# Define a target function. In this case, we're using a simple function that generates a tweet from a blog intro.
def generate_tweet(inputs: dict) -> dict:
    instructions = (
        "Given the blog introduction, please generate a catchy yet professional tweet that can be used to promote the blog post on social media. Summarize the key point of the blog post in the tweet. Use emojis in a tasteful manner."
    )
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": inputs["blog_intro"]},
    ]
    result = oai_client.responses.create(
        input=messages, model="gpt-5-nano"
    )
    return {"tweet": result.output_text}

# Define evaluators. In this case, we're using three evaluators: summary, formatting, and tone.
def summary(inputs: dict, outputs: dict) -> bool:
    """Judge whether the tweet is a good summary of the blog intro."""
    instructions = "Given the following text and summary, determine if the summary is a good summary of the text."

    class Response(BaseModel):
        summary: bool

    msg = f"Question: {inputs['blog_intro']}\nAnswer: {outputs['tweet']}"
    response = oai_client.responses.parse(
        model="gpt-5-nano",
        input=[{"role": "system", "content": instructions}, {"role": "user", "content": msg}],
        text_format=Response
    )

    parsed_response = json.loads(response.output_text)
    return parsed_response["summary"]

def formatting(inputs: dict, outputs: dict) -> bool:
    """Judge whether the tweet is formatted for easy human readability."""
    instructions = "Given the following text, determine if it is formatted well so that a human can easily read it. Pay particular attention to spacing and punctuation."

    class Response(BaseModel):
        formatting: bool

    msg = f"{outputs['tweet']}"
    response = oai_client.responses.parse(
        model="gpt-5-nano",
        input=[{"role": "system", "content": instructions}, {"role": "user", "content": msg}],
        text_format=Response
    )

    parsed_response = json.loads(response.output_text)
    return parsed_response["formatting"]

def tone(inputs: dict, outputs: dict) -> bool:
    """Judge whether the tweet's tone is informative, friendly, and engaging."""
    instructions = "Given the following text, determine if the tweet is informative, yet friendly and engaging."

    class Response(BaseModel):
        tone: bool

    msg = f"{outputs['tweet']}"
    response = oai_client.responses.parse(
        model="gpt-5-nano",
        input=[{"role": "system", "content": instructions}, {"role": "user", "content": msg}],
        text_format=Response
    )
    parsed_response = json.loads(response.output_text)
    return parsed_response["tone"]

# Call evaluate() with the dataset, target function, and evaluators.
results = client.evaluate(
    generate_tweet,
    data=dataset.name,
    evaluators=[summary, tone, formatting],
    experiment_prefix="gpt-5-nano",
)
```
</Accordion>
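Running the block above creates an experiment on the **Blog Intros** dataset with three feedback keys per run (`summary`, `tone`, and `formatting`). As an optional sanity check before computing a composite score, you can preview the results locally; this sketch assumes your version of the `langsmith` SDK exposes `ExperimentResults.to_pandas()` and that `pandas` is installed:

```python
# Optional: peek at the per-run feedback produced by the evaluators above.
# `results` is the object returned by client.evaluate() in the previous block.
df_preview = results.to_pandas()
print(df_preview.columns.tolist())
print(df_preview.head())
```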
### 2. Create composite feedback

Create composite feedback that aggregates the individual evaluator scores using your custom function. This example uses a weighted average of the individual evaluator scores.

<Accordion title="Create composite feedback.">
```python
from typing import Dict
import math

import pandas as pd

# Set weights for the individual evaluator scores
DEFAULT_WEIGHTS: Dict[str, float] = {
    "feedback.summary": 0.7,
    "feedback.tone": 0.2,
    "feedback.formatting": 0.1,
}
WEIGHTED_KEY = "weighted_summary"
DATASET_NAME = "Blog Intros"  # The dataset created in step 1

# Pull experiment results
EXPERIMENT_ID = list(client.list_projects(reference_dataset_name=DATASET_NAME, limit=1))[0].id
df = client.get_test_results(project_id=EXPERIMENT_ID)

# Skip rows with any missing metric
required = list(DEFAULT_WEIGHTS.keys())
missing_cols = [c for c in required if c not in df.columns]
if missing_cols:
    raise ValueError(f"Missing columns in DataFrame: {missing_cols}")

# Calculate the weighted score. This can be any aggregation function
def row_score(row: pd.Series) -> float:
    if row[required].isna().any():
        return float("nan")
    return float(sum(row[c] * DEFAULT_WEIGHTS[c] for c in required))

df[WEIGHTED_KEY] = df.apply(row_score, axis=1)

# Write feedback back to LangSmith
for _, row in df.iterrows():
    run_id = row["id"]
    score = row[WEIGHTED_KEY]
    if isinstance(score, (int, float)) and not math.isnan(score):
        client.create_feedback(
            run_id=run_id,
            key=WEIGHTED_KEY,
            score=float(score),
        )
```
</Accordion>
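The `row_score` function above is only one choice of aggregation. As an illustration, the sketch below (using the same feedback keys; the `all_checks_pass` key is hypothetical) is a stricter gating aggregation that scores a run 1.0 only when every constituent evaluator passed:

```python
import pandas as pd

REQUIRED_KEYS = ["feedback.summary", "feedback.tone", "feedback.formatting"]

def gate_score(row: pd.Series) -> float:
    """Composite score of 1.0 only if every constituent score is truthy."""
    if row[REQUIRED_KEYS].isna().any():
        return float("nan")  # Skip runs missing any constituent feedback
    return 1.0 if all(bool(row[k]) for k in REQUIRED_KEYS) else 0.0

# Apply it the same way as the weighted example, e.g.:
# df["all_checks_pass"] = df.apply(gate_score, axis=1)
# then write the scores back with client.create_feedback(...) as above.
```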
2 binary image files added (52.9 KB and 53.3 KB).

0 commit comments