
Commit

add code intent
shahules786 committed Oct 11, 2024
1 parent 2c5ca01 commit e7490c2
Showing 6 changed files with 30 additions and 30 deletions.
@@ -1,6 +1,6 @@
## Factual Correctness

Factual correctness is a metric that compares and evaluates the factual accuracy of the generated `response` with the `reference`. This metric is used to determine the extent to which the generated response aligns with the reference. The factual correctness score ranges from 0 to 1, with higher values indicating better performance. To measure the alignment between the response and the reference, the metric uses an LLM to first break down the response and reference into claims and then uses natural language inference to determine the factual overlap between them. Factual overlap is quantified using precision, recall, and F1 score, which can be controlled using the `mode` parameter.
`FactualCorrectness` is a metric that compares and evaluates the factual accuracy of the generated `response` with the `reference`. This metric is used to determine the extent to which the generated response aligns with the reference. The factual correctness score ranges from 0 to 1, with higher values indicating better performance. To measure the alignment between the response and the reference, the metric uses an LLM to first break down the response and reference into claims and then uses natural language inference to determine the factual overlap between them. Factual overlap is quantified using precision, recall, and F1 score, which can be controlled using the `mode` parameter.

The formula for calculating True Positive (TP), False Positive (FP), and False Negative (FN) is as follows:

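(The formula details are truncated in this diff.) In the usual formulation, TP counts response claims supported by the reference, FP counts response claims not supported, and FN counts reference claims missing from the response; precision, recall, and F1 follow from these. Below is a minimal, hedged usage sketch rather than the original example: the `evaluator_llm` wrapper and the `mode` value shown are assumptions.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import FactualCorrectness

sample = SingleTurnSample(
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft.",
)

# `mode` selects whether precision, recall, or F1 is reported.
scorer = FactualCorrectness(llm=evaluator_llm, mode="f1")
await scorer.single_turn_ascore(sample)
```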
2 changes: 1 addition & 1 deletion docs/concepts/metrics/available_metrics/general_purpose.md
@@ -4,7 +4,7 @@ General purpose evaluation metrics are used to evaluate any given task.

## Aspect Critic

Aspect Critic is an evaluation metric that can be used to evaluate responses based on predefined aspects written in free-form natural language. The output of aspect critiques is binary, indicating whether the submission aligns with the defined aspect or not.
`AspectCritic` is an evaluation metric that can be used to evaluate responses based on predefined aspects written in free-form natural language. The output of aspect critiques is binary, indicating whether the submission aligns with the defined aspect or not.

### Example

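The original example is truncated in this diff; the following is a minimal sketch of how `AspectCritic` is typically configured. The aspect name, its definition, and the `evaluator_llm` wrapper are illustrative assumptions.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import AspectCritic

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
)

# The aspect is defined in free-form natural language; the result is binary.
scorer = AspectCritic(
    name="maliciousness",
    definition="Is the submission intended to harm, deceive, or exploit users?",
    llm=evaluator_llm,
)
await scorer.single_turn_ascore(sample)
```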
28 changes: 14 additions & 14 deletions docs/concepts/metrics/available_metrics/semantic_similarity.md
@@ -1,4 +1,4 @@
## Semantic similarity
# Semantic similarity

Answer Semantic Similarity measures the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the `ground truth` and the `answer`, with values falling within the range of 0 to 1. A higher score signifies better alignment between the generated answer and the ground truth.

@@ -8,19 +8,19 @@ Measuring the semantic similarity between answers can offer valuable insights in
### Example

```python
from datasets import Dataset
from ragas.metrics import answer_similarity
from ragas import evaluate


data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[answer_similarity])
score.to_pandas()
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import SemanticSimilarity


sample = SingleTurnSample(
response="The Eiffel Tower is located in Paris.",
reference="The Eiffel Tower is located in Paris. I has a height of 1000ft."
)

scorer = SemanticSimilarity()
scorer.embeddings = embedding_model
await scorer.single_turn_ascore(sample)

```
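
The example above assumes an `embedding_model` is already defined. One way to provide it is sketched below; the wrapper and model names are assumptions, and any ragas-compatible embeddings object should work the same way.

```python
from langchain_openai import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper

# Wrap an embeddings client so the ragas metric can call it.
embedding_model = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
```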

### How It’s Calculated
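
The details of this section are truncated in the diff. In essence, both texts are embedded and their similarity is the cosine of the angle between the two vectors. The sketch below only illustrates the idea; it is not the library's internal code, and it assumes `embedding_model` exposes an `embed_query` method.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embed the response and the reference, then compare the vectors.
response_vec = embedding_model.embed_query("The Eiffel Tower is located in Paris.")
reference_vec = embedding_model.embed_query("The Eiffel Tower is located in Paris. It has a height of 1000ft.")
score = cosine_similarity(response_vec, reference_vec)
```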
8 changes: 4 additions & 4 deletions docs/concepts/metrics/available_metrics/sql.md
@@ -6,7 +6,7 @@ In these metrics the resulting SQL is compared after executing the SQL query on

### DataCompy Score

DataCompy is a Python library that compares two pandas DataFrames. It provides a simple interface for comparing two DataFrames and produces a detailed report of the differences. In this metric the `response` is executed on the database and the resulting data is compared with the expected data, i.e., the `reference`. To enable comparison, both `response` and `reference` should be in the form of comma-separated values (CSV), as shown in the example.
The `DataCompyScore` metric uses DataCompy, a Python library that compares two pandas DataFrames. It provides a simple interface for comparing two DataFrames and produces a detailed report of the differences. In this metric the `response` is executed on the database and the resulting data is compared with the expected data, i.e., the `reference`. To enable comparison, both `response` and `reference` should be in the form of comma-separated values (CSV), as shown in the example.

DataFrames can be compared across rows or columns. This can be configured using the `mode` parameter, as sketched below.
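
As a hedged illustration of switching the comparison mode; the parameter names and accepted values here are inferred from the surrounding prose, not verified against the library.

```python
from ragas.metrics import DataCompyScore

# Compare column-wise instead of the default row-wise comparison; the
# reported metric can likewise be a precision-, recall-, or F1-style score.
scorer = DataCompyScore(mode="column", metric="recall")
```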

@@ -24,7 +24,7 @@ By default, the mode is set to `row`, and metric is F1 score which is the harmon


```python
from ragas.metrics._datacompy_score import DataCompyScore
from ragas.metrics import DataCompyScore
from ragas.dataset_schema import SingleTurnSample

data1 = """acct_id,dollar_amt,name,float_fld,date_fld
@@ -60,10 +60,10 @@ Executing SQL queries on the database can be time-consuming and sometimes not fe

### SQL Query Semantic equivalence

SQL Query Semantic Equivalence is a metric that can be used to evaluate the equivalence of the `response` query with the `reference` query. The metric also needs the database schema to be used when comparing queries; this is provided in `reference_contexts`. This is a binary metric, with 1 indicating that the SQL queries are semantically equivalent and 0 indicating that they are not.
`LLMSqlEquivalenceWithReference` is a metric that can be used to evaluate the equivalence of the `response` query with the `reference` query. The metric also needs the database schema to be used when comparing queries; this is provided in `reference_contexts`. This is a binary metric, with 1 indicating that the SQL queries are semantically equivalent and 0 indicating that they are not.

```python
from ragas.metrics._sql_semantic_equivalence import LLMSqlEquivalenceWithReference
from ragas.metrics import LLMSqlEquivalenceWithReference
from ragas.dataset_schema import SingleTurnSample

sample = SingleTurnSample(
@@ -2,7 +2,7 @@

## Summarization Score

This metric gives a measure of how well the summary (`response`) captures the important information from the `retrieved_contexts`. The intuition behind this metric is that a good summary should contain all the important information present in the context (or text, so to speak).
The `SummarizationScore` metric gives a measure of how well the summary (`response`) captures the important information from the `retrieved_contexts`. The intuition behind this metric is that a good summary should contain all the important information present in the context (or text, so to speak).

We first extract a set of important keyphrases from the context. These keyphrases are then used to generate a set of questions. The answers to these questions are always `yes(1)` for the context. We then ask these questions to the summary and calculate the summarization score as the ratio of correctly answered questions to the total number of questions.

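The rest of this section is truncated in the diff. A minimal, hedged usage sketch follows; the sample texts and the `evaluator_llm` wrapper are illustrative assumptions.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import SummarizationScore

sample = SingleTurnSample(
    response="JPMorgan Chase is a major American bank headquartered in New York City.",
    retrieved_contexts=[
        "JPMorgan Chase & Co. is an American multinational financial services "
        "company headquartered in New York City."
    ],
)

scorer = SummarizationScore(llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
```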
18 changes: 9 additions & 9 deletions docs/concepts/metrics/available_metrics/traditional.md
@@ -2,7 +2,7 @@

## Non LLM String Similarity

The NonLLMStringSimilarity metric measures the similarity between the reference and the response using traditional string distance measures such as Levenshtein, Hamming, and Jaro. This metric is useful for evaluating the similarity of the `response` to the `reference` text without relying on large language models (LLMs). The metric returns a score between 0 and 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric.
The `NonLLMStringSimilarity` metric measures the similarity between the reference and the response using traditional string distance measures such as Levenshtein, Hamming, and Jaro. This metric is useful for evaluating the similarity of the `response` to the `reference` text without relying on large language models (LLMs). The metric returns a score between 0 and 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric.

### Example
```python
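# A minimal sketch standing in for the example that is truncated in this diff;
# import paths match the surrounding documentation.
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import NonLLMStringSimilarity

sample = SingleTurnSample(
    response="The Eiffel Tower is located in India.",
    reference="The Eiffel Tower is located in Paris.",
)

# The default distance measure applies here; the hunk below shows how to
# switch it (e.g. DistanceMeasure.HAMMING).
scorer = NonLLMStringSimilarity()
await scorer.single_turn_ascore(sample)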
@@ -29,12 +29,12 @@ scorer = NonLLMStringSimilarity(distance_measure=DistanceMeasure.HAMMING)

## BLEU Score

The [BLEU (Bilingual Evaluation Understudy)](https://en.wikipedia.org/wiki/BLEU) score is a metric used to evaluate the quality of the `response` by comparing it with the `reference`. It measures the similarity between the response and the reference based on n-gram precision and a brevity penalty. BLEU was originally designed to evaluate machine translation systems, though it is also used in other natural language processing tasks; because of that origin, it expects the response and reference to contain the same number of sentences, and the comparison is done at the sentence level. The BLEU score ranges from 0 to 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric.
The `BleuScore` metric is used to evaluate the quality of the `response` by comparing it with the `reference`. It measures the similarity between the response and the reference based on n-gram precision and a brevity penalty. BLEU was originally designed to evaluate machine translation systems, though it is also used in other natural language processing tasks; because of that origin, it expects the response and reference to contain the same number of sentences, and the comparison is done at the sentence level. The BLEU score ranges from 0 to 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric.

### Example
```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._bleu_score import BleuScore
from ragas.metrics import BleuScore

sample = SingleTurnSample(
response="The Eiffel Tower is located in India.",
@@ -54,11 +54,11 @@ scorer = BleuScore(weights=(0.25, 0.25, 0.25, 0.25))

## ROUGE Score

The [ROUGE (Recall-Oriented Understudy for Gisting Evaluation)](https://en.wikipedia.org/wiki/ROUGE_(metric)) score is a set of metrics used to evaluate the quality of natural language generations. It measures the overlap between the generated `response` and the `reference` text based on n-gram recall, precision, and F1 score. The ROUGE score ranges from 0 to 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric.
The `RougeScore` metric is used to evaluate the quality of natural language generations. It measures the overlap between the generated `response` and the `reference` text based on n-gram recall, precision, and F1 score. The ROUGE score ranges from 0 to 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._rogue_score import RougeScore
from ragas.metrics import RougeScore

sample = SingleTurnSample(
response="The Eiffel Tower is located in India.",
@@ -82,11 +82,11 @@ scorer = RougeScore(measure_type="recall")
```

## Exact Match
The ExactMatch metric checks if the response is exactly the same as the reference text. It is useful in scenarios where you need to ensure that the generated response matches the expected output word-for-word, for example when checking arguments in tool calls. The metric returns 1 if the response is an exact match with the reference, and 0 otherwise.
The `ExactMatch` metric checks if the response is exactly the same as the reference text. It is useful in scenarios where you need to ensure that the generated response matches the expected output word-for-word, for example when checking arguments in tool calls. The metric returns 1 if the response is an exact match with the reference, and 0 otherwise.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._string import ExactMatch
from ragas.metrics import ExactMatch

sample = SingleTurnSample(
response="India",
@@ -98,11 +98,11 @@ await scorer.single_turn_ascore(sample)
```

## String Presence
The StringPresence metric checks if the response contains the reference text. It is useful in scenarios where you need to ensure that the generated response contains certain keywords or phrases. The metric returns 1 if the response contains the reference, and 0 otherwise.
The `StringPresence` metric checks if the response contains the reference text. It is useful in scenarios where you need to ensure that the generated response contains certain keywords or phrases. The metric returns 1 if the response contains the reference, and 0 otherwise.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._string import StringPresence
from ragas.metrics import StringPresence

sample = SingleTurnSample(
response="The Eiffel Tower is located in India.",
