docs: fix invalid links and imports (#1473)
shahules786 authored Oct 11, 2024
1 parent 734cec0 commit a4b1912
Showing 34 changed files with 160 additions and 481 deletions.
4 changes: 1 addition & 3 deletions docs/concepts/components/eval_sample.md
@@ -5,7 +5,6 @@ An evaluation sample is a single structured data instance that is used to asses
## SingleTurnSample
SingleTurnSample represents a single-turn interaction between a user and an LLM, together with the expected results for evaluation. It is suitable for evaluations that involve a single question and answer pair, possibly with additional context or reference information.

- [SingleTurnSample API Reference]()

### Example
The following example demonstrates how to create a `SingleTurnSample` instance for evaluating a single-turn interaction in a RAG-based application. In this scenario, a user asks a question, and the AI provides an answer. We’ll create a SingleTurnSample instance to represent this interaction, including any retrieved contexts, reference answers, and evaluation rubrics.
@@ -43,9 +42,8 @@ sample = SingleTurnSample(

## MultiTurnSample

- MultiTurnSample represents a multi-turn interaction between Human, AI and optionally a Tool and expected results for evaluation. It is suitable for representing conversational agents in more complex interactions for evaluation. In `MultiTurnSample`, the `user_input` attribute represents a sequence of messages that collectively form a multi-turn conversation between a human user and an AI system. These messages are instances of the classes [HumanMessage](), [AIMessage](), and [ToolMessage]()
+ MultiTurnSample represents a multi-turn interaction between a human, an AI, and optionally a tool, together with the expected results for evaluation. It is suitable for evaluating conversational agents in more complex interactions. In `MultiTurnSample`, the `user_input` attribute represents a sequence of messages that collectively form a multi-turn conversation between a human user and an AI system. These messages are instances of the classes `HumanMessage`, `AIMessage`, and `ToolMessage`.

- [MultTurnSample API Reference]()

### Example
The following example demonstrates how to create a `MultiTurnSample` instance for evaluating a multi-turn interaction. In this scenario, a user wants to know the current weather in New York City. The AI assistant will use a weather API tool to fetch the information and respond to the user.
2 changes: 0 additions & 2 deletions docs/concepts/components/prompt.md
@@ -44,8 +44,6 @@ class MyPrompt(PydanticPrompt[MyInput,MyInput]):

```

- [Prompt Object API Reference]()

## Guidelines for Creating Effective Prompts

When creating prompts in Ragas, consider the following guidelines to ensure that your prompts are effective and aligned with the task requirements:
2 changes: 1 addition & 1 deletion docs/concepts/index.md
@@ -28,7 +28,7 @@
Algorithms for synthesizing data to test [RAG](test_data_generation/index.md#retrieval-augmented-generation), [Agentic workflows](test_data_generation/index.md#agents-or-tool-use-cases)


- - :material-chart-box-outline:{ .lg .middle } [__Feedback Intelligence__](feedback.md)
+ - :material-chart-box-outline:{ .lg .middle } [__Feedback Intelligence__](feedback/index.md)

---

76 changes: 70 additions & 6 deletions docs/concepts/metrics/available_metrics/agents.md
@@ -2,14 +2,78 @@

Agentic or tool use workflows can be evaluated in multiple dimensions. Here are some of the metrics that can be used to evaluate the performance of agents or tools in a given task.


## Topic Adherence

AI systems deployed in real-world applications are expected to adhere to the domains of interest while interacting with users, but LLMs may sometimes answer general queries, ignoring this limitation. The topic adherence metric evaluates the ability of the AI to stay on predefined domains during the interactions. This metric is particularly important in conversational AI systems, where the AI is expected to provide assistance only for queries related to the predefined domains.

`TopicAdherenceScore` requires a predefined set of topics that the AI system is expected to adhere to, provided via `reference_topics`, along with the `user_input`. The metric can compute precision, recall, and F1 score for topic adherence, defined as

$$
\text{Precision} = {|\text{Queries that are answered and adhere to any of the reference topics}| \over |\text{Queries that are answered and adhere to any of the reference topics}| + |\text{Queries that are answered and do not adhere to any of the reference topics}|}
$$

$$
\text{Recall} = {|\text{Queries that are answered and adhere to any of the reference topics}| \over |\text{Queries that are answered and adhere to any of the reference topics}| + |\text{Queries that were refused but should have been answered}|}
$$

$$
\text{F1 Score} = {2 \times \text{Precision} \times \text{Recall} \over \text{Precision} + \text{Recall}}
$$

### Example

```python
from ragas.dataset_schema import SingleTurnSample, MultiTurnSample, EvaluationDataset
from ragas.messages import HumanMessage,AIMessage,ToolMessage,ToolCall
from ragas.metrics import TopicAdherenceScore


sample_input_4 = [
HumanMessage(content="Can you provide me with details about Einstein's theory of relativity?"),
AIMessage(content="Sure, let me retrieve the relevant information for you.", tool_calls=[
ToolCall(name="document_search", args={"query": "Einstein's theory of relativity"})
]),
ToolMessage(content="Found relevant documents: 1. Relativity: The Special and the General Theory, 2. General Theory of Relativity by A. Einstein."),
AIMessage(content="I found some documents on Einstein's theory of relativity. Which one would you like to know more about: 'Relativity: The Special and the General Theory' or 'General Theory of Relativity by A. Einstein'?"),
HumanMessage(content="Tell me about the 'General Theory of Relativity'."),
AIMessage(content="Got it! Let me fetch more details from 'General Theory of Relativity by A. Einstein'.", tool_calls=[
ToolCall(name="document_retrieve", args={"document": "General Theory of Relativity by A. Einstein"})
]),
ToolMessage(content="The document discusses how gravity affects the fabric of spacetime, describing the relationship between mass and spacetime curvature."),
AIMessage(content="The 'General Theory of Relativity' explains how gravity affects the fabric of spacetime and the relationship between mass and spacetime curvature. Would you like more details or a specific explanation?"),
HumanMessage(content="No, that's perfect. By the way, do you know any good recipes for a chocolate cake?"),
AIMessage(content="Sure! Let me find a simple and delicious recipe for a chocolate cake.", tool_calls=[
ToolCall(name="recipe_search", args={"query": "chocolate cake recipe"})
]),
ToolMessage(content="Here’s a popular recipe for a chocolate cake: Ingredients include flour, sugar, cocoa powder, eggs, milk, and butter. Instructions: Mix dry ingredients, add wet ingredients, and bake at 350°F for 30-35 minutes."),
AIMessage(content="I found a great recipe for chocolate cake! Would you like the full details, or is that summary enough?")
]


sample = MultiTurnSample(user_input=sample_input_4, reference_topics=["science"])
scorer = TopicAdherenceScore(mode="precision")
scorer.llm = openai_model  # a pre-configured evaluator LLM wrapper (e.g., a LangchainLLMWrapper around an OpenAI chat model)
await scorer.multi_turn_ascore(sample)
```


To change the mode to recall, set the `mode` parameter to `recall`.

```python
scorer = TopicAdherenceScore(mode="recall")
```



## Tool call Accuracy

- Tool call accuracy is a metric that can be used to evaluate the performance of the LLM in identifying and calling the required tools to complete a given task. This metric needs `user_input` and `reference_tool_calls` to evaluate the performance of the LLM in identifying and calling the required tools to complete a given task. The metric is computed by comparing the `reference_tool_calls` with the Tool calls made by the AI. The values range between 0 and 1, with higher values indicating better performance.
+ `ToolCallAccuracy` is a metric that can be used to evaluate the performance of the LLM in identifying and calling the tools required to complete a given task. It needs `user_input` and `reference_tool_calls`, and is computed by comparing the `reference_tool_calls` with the tool calls actually made by the AI. The values range between 0 and 1, with higher values indicating better performance.

```python
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage,AIMessage,ToolMessage,ToolCall
- from ragas.metrics._tool_call_accuracy import ToolCallAccuracy
+ from ragas.metrics import ToolCallAccuracy


sample = [
@@ -56,13 +120,13 @@ Agent goal accuracy is a metric that can be used to evaluate the performance of

### With reference

- Calculating agent goal accuracy with reference needs `user_input` and `reference` to evaluate the performance of the LLM in identifying and achieving the goals of the user. The annotated `reference` will be used as ideal outcome. The metric is computed by comparing the `reference` with the goal achieved by the end of workflow.
+ `AgentGoalAccuracyWithReference` needs `user_input` and `reference` to evaluate the performance of the LLM in identifying and achieving the goals of the user. The annotated `reference` is used as the ideal outcome. The metric is computed by comparing the `reference` with the goal achieved by the end of the workflow.


```python
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage,AIMessage,ToolMessage,ToolCall
- from ragas.metrics._agent_goal_accuracy import AgentGoalAccuracyWithReference
+ from ragas.metrics import AgentGoalAccuracyWithReference


sample = MultiTurnSample(user_input=[
@@ -89,15 +153,15 @@ await metric.multi_turn_ascore(sample)

### Without reference

- In without reference mode, the metric will evaluate the performance of the LLM in identifying and achieving the goals of the user without any reference. Here the desired outcome is inferred from the human interactions in the workflow.
+ `AgentGoalAccuracyWithoutReference` evaluates the performance of the LLM in identifying and achieving the goals of the user without any reference. Here the desired outcome is inferred from the human interactions in the workflow.


### Example

```python
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage,AIMessage,ToolMessage,ToolCall
- from ragas.metrics._agent_goal_accuracy import AgentGoalAccuracyWithoutReference
+ from ragas.metrics import AgentGoalAccuracyWithoutReference


sample = MultiTurnSample(user_input=[
@@ -1,6 +1,6 @@
## Response Relevancy

- The evaluation metric, Answer Relevancy, focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information and higher scores indicate better relevancy. This metric is computed using the `user_input`, the `retrived_contexts` and the `response`.
+ The `ResponseRelevancy` metric focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information, and higher scores indicate better relevancy. This metric is computed using the `user_input`, the `retrieved_contexts` and the `response`.

The Answer Relevancy is defined as the mean cosine similarity of the original `user_input` to a number of artificial questions, which were generated (reverse engineered) based on the `response`:

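For illustration, here is a minimal usage sketch following the same `SingleTurnSample` pattern used by the other metrics; `evaluator_llm` and `evaluator_embeddings` stand in for pre-configured LLM and embedding wrappers and are assumptions, not values defined above.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import ResponseRelevancy

sample = SingleTurnSample(
    user_input="When was the first Super Bowl?",
    response="The first Super Bowl was held on January 15, 1967.",
    retrieved_contexts=[
        "The First AFL-NFL World Championship Game was played on January 15, 1967, in Los Angeles."
    ],
)

scorer = ResponseRelevancy()
scorer.llm = evaluator_llm                # pre-configured evaluator LLM wrapper (assumed)
scorer.embeddings = evaluator_embeddings  # pre-configured embeddings wrapper (assumed)
await scorer.single_turn_ascore(sample)
```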
@@ -1,6 +1,6 @@
## Context Entities Recall

- This metric gives the measure of recall of the retrieved context, based on the number of entities present in both `reference` and `retrieved_contexts` relative to the number of entities present in the `reference` alone. Simply put, it is a measure of what fraction of entities are recalled from `reference`. This metric is useful in fact-based use cases like tourism help desk, historical QA, etc. This metric can help evaluate the retrieval mechanism for entities, based on comparison with entities present in `reference`, because in cases where entities matter, we need the `retrieved_contexts` which cover them.
+ The `ContextEntityRecall` metric gives a measure of recall of the retrieved context, based on the number of entities present in both `reference` and `retrieved_contexts` relative to the number of entities present in the `reference` alone. Simply put, it measures what fraction of the entities in `reference` are recalled. This metric is useful in fact-based use cases like tourism help desks, historical QA, etc., where entities matter and the `retrieved_contexts` need to cover them.

To compute this metric, we use two sets, $GE$ and $CE$, which are the set of entities present in `reference` and the set of entities present in `retrieved_contexts`, respectively. We then take the number of elements in the intersection of these sets and divide it by the number of elements in $GE$, as given by the formula:

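In code, the computation described above amounts to dividing the size of the intersection of the two entity sets by the size of the reference entity set. A small illustrative sketch with hand-written entity sets (in the actual metric the entities are extracted by the LLM):

```python
# Hand-written entity sets standing in for LLM-extracted entities (illustrative only)
reference_entities = {"Taj Mahal", "Agra", "1631", "Shah Jahan"}  # GE
context_entities = {"Taj Mahal", "Agra", "Shah Jahan"}            # CE

# |GE ∩ CE| / |GE|
context_entity_recall = len(reference_entities & context_entities) / len(reference_entities)
print(context_entity_recall)  # 0.75
```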
6 changes: 3 additions & 3 deletions docs/concepts/metrics/available_metrics/context_precision.md
@@ -17,7 +17,7 @@ The following metrics uses LLM to identify if a retrieved context is relevant or

### Context Precision without reference

- This metric is can be used when you have both retrieved contexts and also reference contexts associated with a `user_input`. To estimate if a retrieved contexts is relevant or not this method uses the LLM to compare each of the retrieved context or chunk present in `retrieved_contexts` with `response`.
+ The `LLMContextPrecisionWithoutReference` metric can be used when you have retrieved contexts associated with a `user_input` but no reference. To estimate whether a retrieved context is relevant or not, this method uses the LLM to compare each context or chunk present in `retrieved_contexts` with the `response`.

#### Example

@@ -39,7 +39,7 @@ await context_precision.single_turn_ascore(sample)

### Context Precision with reference

- This metric is can be used when you have both retrieved contexts and also reference answer associated with a `user_input`. To estimate if a retrieved contexts is relevant or not this method uses the LLM to compare each of the retrieved context or chunk present in `retrieved_contexts` with `reference`.
+ The `LLMContextPrecisionWithReference` metric can be used when you have both retrieved contexts and a reference answer associated with a `user_input`. To estimate whether a retrieved context is relevant or not, this method uses the LLM to compare each context or chunk present in `retrieved_contexts` with the `reference`.

#### Example

@@ -64,7 +64,7 @@ The following metrics uses traditional methods to identify if a retrieved contex

### Context Precision with reference contexts

- This metric is can be used when you have both retrieved contexts and also reference contexts associated with a `user_input`. To estimate if a retrieved contexts is relevant or not this method uses the LLM to compare each of the retrieved context or chunk present in `retrieved_contexts` with each ones present in `reference_contexts`.
+ The `NonLLMContextPrecisionWithReference` metric can be used when you have both retrieved contexts and reference contexts associated with a `user_input`. To estimate whether a retrieved context is relevant or not, this method uses non-LLM string comparison to match each context or chunk present in `retrieved_contexts` against those present in `reference_contexts`.

#### Example

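A sketch of what such an example could look like, using the same `SingleTurnSample` pattern as the other precision metrics; the contexts below are illustrative placeholders:

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import NonLLMContextPrecisionWithReference

context_precision = NonLLMContextPrecisionWithReference()

sample = SingleTurnSample(
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
    reference_contexts=[
        "Paris is the capital of France.",
        "The Eiffel Tower is one of the most visited monuments in Paris.",
    ],
)

# No LLM is needed; relevance is judged with non-LLM string comparison.
await context_precision.single_turn_ascore(sample)
```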
4 changes: 2 additions & 2 deletions docs/concepts/metrics/available_metrics/context_recall.md
@@ -7,7 +7,7 @@ In short, recall is about not missing anything important. Since it is about not

## LLM Based Context Recall

- Computed using `user_input`, `reference` and the `retrieved_contexts`, and the values range between 0 and 1, with higher values indicating better performance. This metric uses `reference` as a proxy to `reference_contexts` which also makes it easier to use as annotating reference contexts can be very time consuming. To estimate context recall from the `reference`, the reference is broken down into claims each claim in the `reference` answer is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all claims in the reference answer should be attributable to the retrieved context.
+ `LLMContextRecall` is computed using `user_input`, `reference` and the `retrieved_contexts`, and the values range between 0 and 1, with higher values indicating better performance. This metric uses `reference` as a proxy for `reference_contexts`, which also makes it easier to use, since annotating reference contexts can be very time consuming. To estimate context recall from the `reference`, the reference is broken down into claims, and each claim in the `reference` answer is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all claims in the reference answer should be attributable to the retrieved context.


The formula for calculating context recall is as follows:
@@ -36,7 +36,7 @@ await context_recall.single_turn_ascore(sample)

## Non LLM Based Context Recall

- Computed using `retrieved_contexts` and `reference_contexts`, and the values range between 0 and 1, with higher values indicating better performance. This metrics uses non llm string comparison metrics to identify if a retrieved context is relevant or not. You can use any non LLM based metrics as distance measure to identify if a retrieved context is relevant or not.
+ The `NonLLMContextRecall` metric is computed using `retrieved_contexts` and `reference_contexts`, and the values range between 0 and 1, with higher values indicating better performance. This metric uses non-LLM string comparison to identify whether a retrieved context is relevant or not. You can use any non-LLM based metric as a distance measure to identify if a retrieved context is relevant or not.

The formula for calculating context recall is as follows:

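For illustration, a minimal usage sketch for this metric, mirroring the LLM-based variant above but without an LLM; the contexts are placeholders:

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import NonLLMContextRecall

sample = SingleTurnSample(
    retrieved_contexts=["Paris is the capital of France."],
    reference_contexts=[
        "Paris is the capital of France.",
        "The Eiffel Tower is one of the most visited monuments in Paris.",
    ],
)

context_recall = NonLLMContextRecall()  # no LLM required; uses string comparison
await context_recall.single_turn_ascore(sample)
```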
@@ -1,6 +1,6 @@
## Factual Correctness

- Factual correctness is a metric that compares and evaluates the factual accuracy of the generated `response` with the `reference`. This metric is used to determine the extent to which the generated response aligns with the reference. The factual correctness score ranges from 0 to 1, with higher values indicating better performance. To measure the alignment between the response and the reference, the metric uses the LLM for first break down the response and reference into claims and then uses natural language inference to determine the factual overlap between the response and the reference. Factual overlap is quantified using precision, recall, and F1 score, which can be controlled using the `mode` parameter.
+ `FactualCorrectness` is a metric that compares and evaluates the factual accuracy of the generated `response` against the `reference`. It is used to determine the extent to which the generated response aligns with the reference. The factual correctness score ranges from 0 to 1, with higher values indicating better performance. To measure the alignment between the response and the reference, the metric first uses the LLM to break down the response and the reference into claims, and then uses natural language inference to determine the factual overlap between them. Factual overlap is quantified using precision, recall, and F1 score, which can be controlled using the `mode` parameter.

The formula for calculating True Positive (TP), False Positive (FP), and False Negative (FN) is as follows:

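For illustration, a minimal usage sketch following the same single-turn pattern as the other metrics; `evaluator_llm` stands in for a pre-configured evaluator LLM wrapper, and `mode="f1"` reflects the `mode` parameter described above:

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import FactualCorrectness

sample = SingleTurnSample(
    response="The Eiffel Tower was completed in 1889 and is located in Paris.",
    reference="The Eiffel Tower, completed in 1889, stands in Paris, France.",
)

scorer = FactualCorrectness(mode="f1")
scorer.llm = evaluator_llm  # pre-configured evaluator LLM wrapper (assumed)
await scorer.single_turn_ascore(sample)
```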
2 changes: 1 addition & 1 deletion docs/concepts/metrics/available_metrics/faithfulness.md
@@ -1,6 +1,6 @@
## Faithfulness

- This measures the factual consistency of the generated answer against the given context. It is calculated from answer and retrieved context. The answer is scaled to (0,1) range. Higher the better.
+ The `Faithfulness` metric measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context, and the score is scaled to the (0,1) range. Higher is better.

The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context. To calculate this, a set of claims from the generated answer is first identified. Then each of these claims is cross-checked with the given context to determine if it can be inferred from the context. The faithfulness score is given by:

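As a usage sketch, the metric follows the same single-turn pattern as the other metrics; `evaluator_llm` here stands in for a pre-configured evaluator LLM wrapper:

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import Faithfulness

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is a wrought-iron tower in Paris, France."],
)

scorer = Faithfulness()
scorer.llm = evaluator_llm  # pre-configured evaluator LLM wrapper (assumed)
await scorer.single_turn_ascore(sample)
```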