docs: fix invalid links and imports (#1473)
shahules786 authored Oct 11, 2024
1 parent 734cec0 commit a4b1912
Showing 34 changed files with 160 additions and 481 deletions.
4 changes: 1 addition & 3 deletions docs/concepts/components/eval_sample.md
@@ -5,7 +5,6 @@ An evaluation sample is a single structured data instance that is used to asses
## SingleTurnSample
SingleTurnSample represents a single-turn interaction between a user and an LLM, together with the expected results for evaluation. It is suitable for evaluations that involve a single question and answer pair, possibly with additional context or reference information.

- [SingleTurnSample API Reference]()

### Example
The following example demonstrates how to create a `SingleTurnSample` instance for evaluating a single-turn interaction in a RAG-based application. In this scenario, a user asks a question, and the AI provides an answer. We’ll create a SingleTurnSample instance to represent this interaction, including any retrieved contexts, reference answers, and evaluation rubrics.
@@ -43,9 +42,8 @@ sample = SingleTurnSample(

## MultiTurnSample

- MultiTurnSample represents a multi-turn interaction between Human, AI and optionally a Tool and expected results for evaluation. It is suitable for representing conversational agents in more complex interactions for evaluation. In `MultiTurnSample`, the `user_input` attribute represents a sequence of messages that collectively form a multi-turn conversation between a human user and an AI system. These messages are instances of the classes [HumanMessage](), [AIMessage](), and [ToolMessage]()
+ MultiTurnSample represents a multi-turn interaction between a human, an AI, and optionally a tool, together with the expected results for evaluation. It is suitable for evaluating conversational agents in more complex interactions. In `MultiTurnSample`, the `user_input` attribute represents a sequence of messages that collectively form a multi-turn conversation between a human user and an AI system. These messages are instances of the classes `HumanMessage`, `AIMessage`, and `ToolMessage`.

- [MultTurnSample API Reference]()

### Example
The following example demonstrates how to create a `MultiTurnSample` instance for evaluating a multi-turn interaction. In this scenario, a user wants to know the current weather in New York City. The AI assistant will use a weather API tool to fetch the information and respond to the user.
2 changes: 0 additions & 2 deletions docs/concepts/components/prompt.md
@@ -44,8 +44,6 @@ class MyPrompt(PydanticPrompt[MyInput,MyInput]):

```

- [Prompt Object API Reference]()

## Guidelines for Creating Effective Prompts

When creating prompts in Ragas, consider the following guidelines to ensure that your prompts are effective and aligned with the task requirements:
2 changes: 1 addition & 1 deletion docs/concepts/index.md
@@ -28,7 +28,7 @@
Algorithms for synthesizing data to test [RAG](test_data_generation/index.md#retrieval-augmented-generation), [Agentic workflows](test_data_generation/index.md#agents-or-tool-use-cases)


- - :material-chart-box-outline:{ .lg .middle } [__Feedback Intelligence__](feedback.md)
+ - :material-chart-box-outline:{ .lg .middle } [__Feedback Intelligence__](feedback/index.md)

---

76 changes: 70 additions & 6 deletions docs/concepts/metrics/available_metrics/agents.md
@@ -2,14 +2,78 @@

Agentic or tool use workflows can be evaluated in multiple dimensions. Here are some of the metrics that can be used to evaluate the performance of agents or tools in a given task.


## Topic Adherence

AI systems deployed in real-world applications are expected to adhere to the domains of interest while interacting with users, but LLMs may sometimes answer general queries, ignoring this limitation. The topic adherence metric evaluates the ability of the AI to stay on predefined domains during the interactions. This metric is particularly important in conversational AI systems, where the AI is expected to provide assistance only for queries related to the predefined domains.

`TopicAdherenceScore` requires a predefined set of topics that the AI system is expected to adhere to, provided via `reference_topics`, along with the `user_input`. The metric can compute precision, recall, and F1 score for topic adherence, defined as

$$
\text{Precision} = {|\text{Queries that are answered and adhere to any of the reference topics}| \over |\text{Queries that are answered and adhere to any of the reference topics}| + |\text{Queries that are answered and do not adhere to any of the reference topics}|}
$$

$$
\text{Recall} = {|\text{Queries that are answered and adhere to any of the reference topics}| \over |\text{Queries that are answered and adhere to any of the reference topics}| + |\text{Queries that were refused but should have been answered}|}
$$

$$
\text{F1 Score} = {2 \times \text{Precision} \times \text{Recall} \over \text{Precision} + \text{Recall}}
$$

### Example

```python
from ragas.dataset_schema import SingleTurnSample, MultiTurnSample, EvaluationDataset
from ragas.messages import HumanMessage,AIMessage,ToolMessage,ToolCall
from ragas.metrics import TopicAdherenceScore


sample_input_4 = [
HumanMessage(content="Can you provide me with details about Einstein's theory of relativity?"),
AIMessage(content="Sure, let me retrieve the relevant information for you.", tool_calls=[
ToolCall(name="document_search", args={"query": "Einstein's theory of relativity"})
]),
ToolMessage(content="Found relevant documents: 1. Relativity: The Special and the General Theory, 2. General Theory of Relativity by A. Einstein."),
AIMessage(content="I found some documents on Einstein's theory of relativity. Which one would you like to know more about: 'Relativity: The Special and the General Theory' or 'General Theory of Relativity by A. Einstein'?"),
HumanMessage(content="Tell me about the 'General Theory of Relativity'."),
AIMessage(content="Got it! Let me fetch more details from 'General Theory of Relativity by A. Einstein'.", tool_calls=[
ToolCall(name="document_retrieve", args={"document": "General Theory of Relativity by A. Einstein"})
]),
ToolMessage(content="The document discusses how gravity affects the fabric of spacetime, describing the relationship between mass and spacetime curvature."),
AIMessage(content="The 'General Theory of Relativity' explains how gravity affects the fabric of spacetime and the relationship between mass and spacetime curvature. Would you like more details or a specific explanation?"),
HumanMessage(content="No, that's perfect. By the way, do you know any good recipes for a chocolate cake?"),
AIMessage(content="Sure! Let me find a simple and delicious recipe for a chocolate cake.", tool_calls=[
ToolCall(name="recipe_search", args={"query": "chocolate cake recipe"})
]),
ToolMessage(content="Here’s a popular recipe for a chocolate cake: Ingredients include flour, sugar, cocoa powder, eggs, milk, and butter. Instructions: Mix dry ingredients, add wet ingredients, and bake at 350°F for 30-35 minutes."),
AIMessage(content="I found a great recipe for chocolate cake! Would you like the full details, or is that summary enough?")
]


sample = MultiTurnSample(user_input=sample_input_4, reference_topics=["science"])
scorer = TopicAdherenceScore(mode="precision")
scorer.llm = openai_model  # a pre-configured evaluator LLM wrapper (e.g., a LangchainLLMWrapper around an OpenAI chat model)
await scorer.multi_turn_ascore(sample)
```


To change the mode to recall, set the `mode` parameter to `recall`.

```python
scorer = TopicAdherenceScore(mode="recall")
```



## Tool call Accuracy

- Tool call accuracy is a metric that can be used to evaluate the performance of the LLM in identifying and calling the required tools to complete a given task. This metric needs `user_input` and `reference_tool_calls` to evaluate the performance of the LLM in identifying and calling the required tools to complete a given task. The metric is computed by comparing the `reference_tool_calls` with the Tool calls made by the AI. The values range between 0 and 1, with higher values indicating better performance.
+ `ToolCallAccuracy` is a metric that can be used to evaluate the performance of the LLM in identifying and calling the tools required to complete a given task. It needs `user_input` and `reference_tool_calls`, and is computed by comparing the `reference_tool_calls` with the tool calls actually made by the AI. The values range between 0 and 1, with higher values indicating better performance.

```python
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage,AIMessage,ToolMessage,ToolCall
- from ragas.metrics._tool_call_accuracy import ToolCallAccuracy
+ from ragas.metrics import ToolCallAccuracy


sample = [
@@ -56,13 +120,13 @@ Agent goal accuracy is a metric that can be used to evaluate the performance of

### With reference

- Calculating agent goal accuracy with reference needs `user_input` and `reference` to evaluate the performance of the LLM in identifying and achieving the goals of the user. The annotated `reference` will be used as ideal outcome. The metric is computed by comparing the `reference` with the goal achieved by the end of workflow.
+ `AgentGoalAccuracyWithReference` needs `user_input` and `reference` to evaluate the performance of the LLM in identifying and achieving the goals of the user. The annotated `reference` is used as the ideal outcome. The metric is computed by comparing the `reference` with the goal achieved by the end of the workflow.


```python
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage,AIMessage,ToolMessage,ToolCall
- from ragas.metrics._agent_goal_accuracy import AgentGoalAccuracyWithReference
+ from ragas.metrics import AgentGoalAccuracyWithReference


sample = MultiTurnSample(user_input=[
@@ -89,15 +153,15 @@ await metric.multi_turn_ascore(sample)

### Without reference

- In without reference mode, the metric will evaluate the performance of the LLM in identifying and achieving the goals of the user without any reference. Here the desired outcome is inferred from the human interactions in the workflow.
+ `AgentGoalAccuracyWithoutReference` evaluates the performance of the LLM in identifying and achieving the goals of the user without any reference. Here the desired outcome is inferred from the human interactions in the workflow.


### Example

```python
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage,AIMessage,ToolMessage,ToolCall
- from ragas.metrics._agent_goal_accuracy import AgentGoalAccuracyWithoutReference
+ from ragas.metrics import AgentGoalAccuracyWithoutReference


sample = MultiTurnSample(user_input=[
@@ -1,6 +1,6 @@
## Response Relevancy

- The evaluation metric, Answer Relevancy, focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information and higher scores indicate better relevancy. This metric is computed using the `user_input`, the `retrived_contexts` and the `response`.
+ The `ResponseRelevancy` metric focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information, and higher scores indicate better relevancy. This metric is computed using the `user_input`, the `retrieved_contexts` and the `response`.

The Answer Relevancy is defined as the mean cosine similarity of the original `user_input` to a number of artificial questions, which were generated (reverse engineered) based on the `response`:

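For illustration, here is a minimal usage sketch following the same `SingleTurnSample` pattern used by the other metrics; `evaluator_llm` and `evaluator_embeddings` stand in for pre-configured LLM and embedding wrappers and are assumptions, not values defined above.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import ResponseRelevancy

sample = SingleTurnSample(
    user_input="When was the first Super Bowl?",
    response="The first Super Bowl was held on January 15, 1967.",
    retrieved_contexts=[
        "The First AFL-NFL World Championship Game was played on January 15, 1967, in Los Angeles."
    ],
)

scorer = ResponseRelevancy()
scorer.llm = evaluator_llm                # pre-configured evaluator LLM wrapper (assumed)
scorer.embeddings = evaluator_embeddings  # pre-configured embeddings wrapper (assumed)
await scorer.single_turn_ascore(sample)
```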
@@ -1,6 +1,6 @@
## Context Entities Recall

- This metric gives the measure of recall of the retrieved context, based on the number of entities present in both `reference` and `retrieved_contexts` relative to the number of entities present in the `reference` alone. Simply put, it is a measure of what fraction of entities are recalled from `reference`. This metric is useful in fact-based use cases like tourism help desk, historical QA, etc. This metric can help evaluate the retrieval mechanism for entities, based on comparison with entities present in `reference`, because in cases where entities matter, we need the `retrieved_contexts` which cover them.
+ The `ContextEntityRecall` metric gives a measure of recall of the retrieved context, based on the number of entities present in both `reference` and `retrieved_contexts` relative to the number of entities present in the `reference` alone. Simply put, it measures what fraction of the entities in `reference` are recalled. This metric is useful in fact-based use cases like tourism help desks, historical QA, etc., where entities matter and the `retrieved_contexts` need to cover them.

To compute this metric, we use two sets, $GE$ and $CE$, which are the set of entities present in `reference` and the set of entities present in `retrieved_contexts`, respectively. We then take the number of elements in the intersection of these sets and divide it by the number of elements in $GE$, as given by the formula:

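In code, the computation described above amounts to dividing the size of the intersection of the two entity sets by the size of the reference entity set. A small illustrative sketch with hand-written entity sets (in the actual metric the entities are extracted by the LLM):

```python
# Hand-written entity sets standing in for LLM-extracted entities (illustrative only)
reference_entities = {"Taj Mahal", "Agra", "1631", "Shah Jahan"}  # GE
context_entities = {"Taj Mahal", "Agra", "Shah Jahan"}            # CE

# |GE ∩ CE| / |GE|
context_entity_recall = len(reference_entities & context_entities) / len(reference_entities)
print(context_entity_recall)  # 0.75
```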
6 changes: 3 additions & 3 deletions docs/concepts/metrics/available_metrics/context_precision.md
@@ -17,7 +17,7 @@ The following metrics uses LLM to identify if a retrieved context is relevant or

### Context Precision without reference

- This metric is can be used when you have both retrieved contexts and also reference contexts associated with a `user_input`. To estimate if a retrieved contexts is relevant or not this method uses the LLM to compare each of the retrieved context or chunk present in `retrieved_contexts` with `response`.
+ The `LLMContextPrecisionWithoutReference` metric can be used when you have retrieved contexts associated with a `user_input` but no reference. To estimate whether a retrieved context is relevant or not, this method uses the LLM to compare each context or chunk present in `retrieved_contexts` with the `response`.

#### Example

@@ -39,7 +39,7 @@ await context_precision.single_turn_ascore(sample)

### Context Precision with reference

- This metric is can be used when you have both retrieved contexts and also reference answer associated with a `user_input`. To estimate if a retrieved contexts is relevant or not this method uses the LLM to compare each of the retrieved context or chunk present in `retrieved_contexts` with `reference`.
+ The `LLMContextPrecisionWithReference` metric can be used when you have both retrieved contexts and a reference answer associated with a `user_input`. To estimate whether a retrieved context is relevant or not, this method uses the LLM to compare each context or chunk present in `retrieved_contexts` with the `reference`.

#### Example

@@ -64,7 +64,7 @@ The following metrics uses traditional methods to identify if a retrieved contex

### Context Precision with reference contexts

- This metric is can be used when you have both retrieved contexts and also reference contexts associated with a `user_input`. To estimate if a retrieved contexts is relevant or not this method uses the LLM to compare each of the retrieved context or chunk present in `retrieved_contexts` with each ones present in `reference_contexts`.
+ The `NonLLMContextPrecisionWithReference` metric can be used when you have both retrieved contexts and reference contexts associated with a `user_input`. To estimate whether a retrieved context is relevant or not, this method uses non-LLM string comparison to match each context or chunk present in `retrieved_contexts` against those present in `reference_contexts`.

#### Example

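A sketch of what such an example could look like, using the same `SingleTurnSample` pattern as the other precision metrics; the contexts below are illustrative placeholders:

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import NonLLMContextPrecisionWithReference

context_precision = NonLLMContextPrecisionWithReference()

sample = SingleTurnSample(
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
    reference_contexts=[
        "Paris is the capital of France.",
        "The Eiffel Tower is one of the most visited monuments in Paris.",
    ],
)

# No LLM is needed; relevance is judged with non-LLM string comparison.
await context_precision.single_turn_ascore(sample)
```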
4 changes: 2 additions & 2 deletions docs/concepts/metrics/available_metrics/context_recall.md
@@ -7,7 +7,7 @@ In short, recall is about not missing anything important. Since it is about not

## LLM Based Context Recall

- Computed using `user_input`, `reference` and the `retrieved_contexts`, and the values range between 0 and 1, with higher values indicating better performance. This metric uses `reference` as a proxy to `reference_contexts` which also makes it easier to use as annotating reference contexts can be very time consuming. To estimate context recall from the `reference`, the reference is broken down into claims each claim in the `reference` answer is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all claims in the reference answer should be attributable to the retrieved context.
+ `LLMContextRecall` is computed using `user_input`, `reference` and the `retrieved_contexts`, and the values range between 0 and 1, with higher values indicating better performance. This metric uses `reference` as a proxy for `reference_contexts`, which also makes it easier to use, since annotating reference contexts can be very time consuming. To estimate context recall from the `reference`, the reference is broken down into claims, and each claim in the `reference` answer is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all claims in the reference answer should be attributable to the retrieved context.


The formula for calculating context recall is as follows:
@@ -36,7 +36,7 @@ await context_recall.single_turn_ascore(sample)

## Non LLM Based Context Recall

- Computed using `retrieved_contexts` and `reference_contexts`, and the values range between 0 and 1, with higher values indicating better performance. This metrics uses non llm string comparison metrics to identify if a retrieved context is relevant or not. You can use any non LLM based metrics as distance measure to identify if a retrieved context is relevant or not.
+ The `NonLLMContextRecall` metric is computed using `retrieved_contexts` and `reference_contexts`, and the values range between 0 and 1, with higher values indicating better performance. This metric uses non-LLM string comparison to identify whether a retrieved context is relevant or not. You can use any non-LLM based metric as a distance measure to identify if a retrieved context is relevant or not.

The formula for calculating context recall is as follows:

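For illustration, a minimal usage sketch for this metric, mirroring the LLM-based variant above but without an LLM; the contexts are placeholders:

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import NonLLMContextRecall

sample = SingleTurnSample(
    retrieved_contexts=["Paris is the capital of France."],
    reference_contexts=[
        "Paris is the capital of France.",
        "The Eiffel Tower is one of the most visited monuments in Paris.",
    ],
)

context_recall = NonLLMContextRecall()  # no LLM required; uses string comparison
await context_recall.single_turn_ascore(sample)
```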
@@ -1,6 +1,6 @@
## Factual Correctness

- Factual correctness is a metric that compares and evaluates the factual accuracy of the generated `response` with the `reference`. This metric is used to determine the extent to which the generated response aligns with the reference. The factual correctness score ranges from 0 to 1, with higher values indicating better performance. To measure the alignment between the response and the reference, the metric uses the LLM for first break down the response and reference into claims and then uses natural language inference to determine the factual overlap between the response and the reference. Factual overlap is quantified using precision, recall, and F1 score, which can be controlled using the `mode` parameter.
+ `FactualCorrectness` is a metric that compares and evaluates the factual accuracy of the generated `response` against the `reference`. It is used to determine the extent to which the generated response aligns with the reference. The factual correctness score ranges from 0 to 1, with higher values indicating better performance. To measure the alignment between the response and the reference, the metric first uses the LLM to break down the response and the reference into claims, and then uses natural language inference to determine the factual overlap between them. Factual overlap is quantified using precision, recall, and F1 score, which can be controlled using the `mode` parameter.

The formula for calculating True Positive (TP), False Positive (FP), and False Negative (FN) is as follows:

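For illustration, a minimal usage sketch following the same single-turn pattern as the other metrics; `evaluator_llm` stands in for a pre-configured evaluator LLM wrapper, and `mode="f1"` reflects the `mode` parameter described above:

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import FactualCorrectness

sample = SingleTurnSample(
    response="The Eiffel Tower was completed in 1889 and is located in Paris.",
    reference="The Eiffel Tower, completed in 1889, stands in Paris, France.",
)

scorer = FactualCorrectness(mode="f1")
scorer.llm = evaluator_llm  # pre-configured evaluator LLM wrapper (assumed)
await scorer.single_turn_ascore(sample)
```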
2 changes: 1 addition & 1 deletion docs/concepts/metrics/available_metrics/faithfulness.md
@@ -1,6 +1,6 @@
## Faithfulness

- This measures the factual consistency of the generated answer against the given context. It is calculated from answer and retrieved context. The answer is scaled to (0,1) range. Higher the better.
+ The `Faithfulness` metric measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context, and the score is scaled to the (0,1) range. Higher is better.

The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context. To calculate this, a set of claims from the generated answer is first identified. Then each of these claims is cross-checked with the given context to determine if it can be inferred from the context. The faithfulness score is given by:

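As a usage sketch, the metric follows the same single-turn pattern as the other metrics; `evaluator_llm` here stands in for a pre-configured evaluator LLM wrapper:

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import Faithfulness

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is a wrought-iron tower in Paris, France."],
)

scorer = Faithfulness()
scorer.llm = evaluator_llm  # pre-configured evaluator LLM wrapper (assumed)
await scorer.single_turn_ascore(sample)
```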