
Commit

add code intent
shahules786 committed Oct 11, 2024
1 parent 2c5ca01 commit e7490c2
Showing 6 changed files with 30 additions and 30 deletions.
@@ -1,6 +1,6 @@
## Factual Correctness

Factual correctness is a metric that compares and evaluates the factual accuracy of the generated `response` with the `reference`. This metric is used to determine the extent to which the generated response aligns with the reference. The factual correctness score ranges from 0 to 1, with higher values indicating better performance. To measure the alignment between the response and the reference, the metric uses an LLM to first break down the response and reference into claims and then uses natural language inference to determine the factual overlap between them. Factual overlap is quantified using precision, recall, and F1 score, which can be controlled using the `mode` parameter.
`FactualCorrectness` is a metric that compares and evaluates the factual accuracy of the generated `response` with the `reference`. This metric is used to determine the extent to which the generated response aligns with the reference. The factual correctness score ranges from 0 to 1, with higher values indicating better performance. To measure the alignment between the response and the reference, the metric uses an LLM to first break down the response and reference into claims and then uses natural language inference to determine the factual overlap between them. Factual overlap is quantified using precision, recall, and F1 score, which can be controlled using the `mode` parameter.

The formula for calculating True Positive (TP), False Positive (FP), and False Negative (FN) is as follows:

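(The formula details are truncated in this diff.) In the usual formulation, TP counts response claims supported by the reference, FP counts response claims not supported, and FN counts reference claims missing from the response; precision, recall, and F1 follow from these. Below is a minimal, hedged usage sketch rather than the original example: the `evaluator_llm` wrapper and the `mode` value shown are assumptions.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import FactualCorrectness

sample = SingleTurnSample(
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft.",
)

# `mode` selects whether precision, recall, or F1 is reported.
scorer = FactualCorrectness(llm=evaluator_llm, mode="f1")
await scorer.single_turn_ascore(sample)
```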
2 changes: 1 addition & 1 deletion docs/concepts/metrics/available_metrics/general_purpose.md
@@ -4,7 +4,7 @@ General purpose evaluation metrics are used to evaluate any given task.

## Aspect Critic

Aspect Critic is an evaluation metric that can be used to evaluate responses based on predefined aspects written in free-form natural language. The output of aspect critiques is binary, indicating whether the submission aligns with the defined aspect or not.
`AspectCritic` is an evaluation metric that can be used to evaluate responses based on predefined aspects written in free-form natural language. The output of aspect critiques is binary, indicating whether the submission aligns with the defined aspect or not.

### Example

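The original example is truncated in this diff; the following is a minimal sketch of how `AspectCritic` is typically configured. The aspect name, its definition, and the `evaluator_llm` wrapper are illustrative assumptions.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import AspectCritic

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
)

# The aspect is defined in free-form natural language; the result is binary.
scorer = AspectCritic(
    name="maliciousness",
    definition="Is the submission intended to harm, deceive, or exploit users?",
    llm=evaluator_llm,
)
await scorer.single_turn_ascore(sample)
```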
28 changes: 14 additions & 14 deletions docs/concepts/metrics/available_metrics/semantic_similarity.md
@@ -1,4 +1,4 @@
## Semantic similarity
# Semantic similarity

Answer Semantic Similarity measures the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the `ground truth` and the `answer`, with values falling within the range of 0 to 1. A higher score signifies better alignment between the generated answer and the ground truth.

@@ -8,19 +8,19 @@ Measuring the semantic similarity between answers can offer valuable insights in
### Example

```python
from datasets import Dataset
from ragas.metrics import answer_similarity
from ragas import evaluate


data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[answer_similarity])
score.to_pandas()
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import SemanticSimilarity


sample = SingleTurnSample(
response="The Eiffel Tower is located in Paris.",
reference="The Eiffel Tower is located in Paris. I has a height of 1000ft."
)

scorer = SemanticSimilarity()
scorer.embeddings = embedding_model
await scorer.single_turn_ascore(sample)

```
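
The example above assumes an `embedding_model` is already defined. One way to provide it is sketched below; the wrapper and model names are assumptions, and any ragas-compatible embeddings object should work the same way.

```python
from langchain_openai import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper

# Wrap an embeddings client so the ragas metric can call it.
embedding_model = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
```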

### How It’s Calculated
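
The details of this section are truncated in the diff. In essence, both texts are embedded and their similarity is the cosine of the angle between the two vectors. The sketch below only illustrates the idea; it is not the library's internal code, and it assumes `embedding_model` exposes an `embed_query` method.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embed the response and the reference, then compare the vectors.
response_vec = embedding_model.embed_query("The Eiffel Tower is located in Paris.")
reference_vec = embedding_model.embed_query("The Eiffel Tower is located in Paris. It has a height of 1000ft.")
score = cosine_similarity(response_vec, reference_vec)
```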
8 changes: 4 additions & 4 deletions docs/concepts/metrics/available_metrics/sql.md
@@ -6,7 +6,7 @@ In these metrics the resulting SQL is compared after executing the SQL query on

### DataCompy Score

DataCompy is a Python library that compares two pandas DataFrames. It provides a simple interface for comparing two DataFrames and produces a detailed report of the differences. In this metric the `response` is executed on the database and the resulting data is compared with the expected data, i.e., the `reference`. To enable comparison, both `response` and `reference` should be in the form of comma-separated values (CSV), as shown in the example.
The `DataCompyScore` metric uses DataCompy, a Python library that compares two pandas DataFrames. It provides a simple interface for comparing two DataFrames and produces a detailed report of the differences. In this metric the `response` is executed on the database and the resulting data is compared with the expected data, i.e., the `reference`. To enable comparison, both `response` and `reference` should be in the form of comma-separated values (CSV), as shown in the example.

DataFrames can be compared across rows or columns. This can be configured using the `mode` parameter, as sketched below.
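
As a hedged illustration of switching the comparison mode; the parameter names and accepted values here are inferred from the surrounding prose, not verified against the library.

```python
from ragas.metrics import DataCompyScore

# Compare column-wise instead of the default row-wise comparison; the
# reported metric can likewise be a precision-, recall-, or F1-style score.
scorer = DataCompyScore(mode="column", metric="recall")
```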

@@ -24,7 +24,7 @@ By default, the mode is set to `row`, and metric is F1 score which is the harmon


```python
from ragas.metrics._datacompy_score import DataCompyScore
from ragas.metrics import DataCompyScore
from ragas.dataset_schema import SingleTurnSample

data1 = """acct_id,dollar_amt,name,float_fld,date_fld
@@ -60,10 +60,10 @@ Executing SQL queries on the database can be time-consuming and sometimes not fe

### SQL Query Semantic equivalence

SQL Query Semantic Equivalence is a metric that can be used to evaluate the equivalence of the `response` query with the `reference` query. The metric also needs the database schema to be used when comparing queries; this is provided in `reference_contexts`. This is a binary metric, with 1 indicating that the SQL queries are semantically equivalent and 0 indicating that they are not.
`LLMSqlEquivalenceWithReference` is a metric that can be used to evaluate the equivalence of the `response` query with the `reference` query. The metric also needs the database schema to be used when comparing queries; this is provided in `reference_contexts`. This is a binary metric, with 1 indicating that the SQL queries are semantically equivalent and 0 indicating that they are not.

```python
from ragas.metrics._sql_semantic_equivalence import LLMSqlEquivalenceWithReference
from ragas.metrics import LLMSqlEquivalenceWithReference
from ragas.dataset_schema import SingleTurnSample

sample = SingleTurnSample(
@@ -2,7 +2,7 @@

## Summarization Score

This metric gives a measure of how well the summary (`response`) captures the important information from the `retrieved_contexts`. The intuition behind this metric is that a good summary should contain all the important information present in the context (or text, so to speak).
The `SummarizationScore` metric gives a measure of how well the summary (`response`) captures the important information from the `retrieved_contexts`. The intuition behind this metric is that a good summary should contain all the important information present in the context (or text, so to speak).

We first extract a set of important keyphrases from the context. These keyphrases are then used to generate a set of questions. The answers to these questions are always `yes(1)` for the context. We then ask these questions to the summary and calculate the summarization score as the ratio of correctly answered questions to the total number of questions.

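The rest of this section is truncated in the diff. A minimal, hedged usage sketch follows; the sample texts and the `evaluator_llm` wrapper are illustrative assumptions.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import SummarizationScore

sample = SingleTurnSample(
    response="JPMorgan Chase is a major American bank headquartered in New York City.",
    retrieved_contexts=[
        "JPMorgan Chase & Co. is an American multinational financial services "
        "company headquartered in New York City."
    ],
)

scorer = SummarizationScore(llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
```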
18 changes: 9 additions & 9 deletions docs/concepts/metrics/available_metrics/traditional.md
@@ -2,7 +2,7 @@

## Non LLM String Similarity

The NonLLMStringSimilarity metric measures the similarity between the reference and the response using traditional string distance measures such as Levenshtein, Hamming, and Jaro. This metric is useful for evaluating the similarity of the `response` to the `reference` text without relying on large language models (LLMs). The metric returns a score between 0 and 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric.
The `NonLLMStringSimilarity` metric measures the similarity between the reference and the response using traditional string distance measures such as Levenshtein, Hamming, and Jaro. This metric is useful for evaluating the similarity of the `response` to the `reference` text without relying on large language models (LLMs). The metric returns a score between 0 and 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric.

### Example
```python
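# A minimal sketch standing in for the example that is truncated in this diff;
# import paths match the surrounding documentation.
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import NonLLMStringSimilarity

sample = SingleTurnSample(
    response="The Eiffel Tower is located in India.",
    reference="The Eiffel Tower is located in Paris.",
)

# The default distance measure applies here; the hunk below shows how to
# switch it (e.g. DistanceMeasure.HAMMING).
scorer = NonLLMStringSimilarity()
await scorer.single_turn_ascore(sample)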
@@ -29,12 +29,12 @@ scorer = NonLLMStringSimilarity(distance_measure=DistanceMeasure.HAMMING)

## BLEU Score

The [BLEU (Bilingual Evaluation Understudy)](https://en.wikipedia.org/wiki/BLEU) score is a metric used to evaluate the quality of the `response` by comparing it with the `reference`. It measures the similarity between the response and the reference based on n-gram precision and a brevity penalty. BLEU was originally designed to evaluate machine translation systems, though it is also used in other natural language processing tasks; because of that origin, it expects the response and reference to contain the same number of sentences, and the comparison is done at the sentence level. The BLEU score ranges from 0 to 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric.
The `BleuScore` metric is used to evaluate the quality of the `response` by comparing it with the `reference`. It measures the similarity between the response and the reference based on n-gram precision and a brevity penalty. BLEU was originally designed to evaluate machine translation systems, though it is also used in other natural language processing tasks; because of that origin, it expects the response and reference to contain the same number of sentences, and the comparison is done at the sentence level. The BLEU score ranges from 0 to 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric.

### Example
```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._bleu_score import BleuScore
from ragas.metrics import BleuScore

sample = SingleTurnSample(
response="The Eiffel Tower is located in India.",
@@ -54,11 +54,11 @@ scorer = BleuScore(weights=(0.25, 0.25, 0.25, 0.25))

## ROUGE Score

The [ROUGE (Recall-Oriented Understudy for Gisting Evaluation)](https://en.wikipedia.org/wiki/ROUGE_(metric)) score is a set of metrics used to evaluate the quality of natural language generations. It measures the overlap between the generated `response` and the `reference` text based on n-gram recall, precision, and F1 score. The ROUGE score ranges from 0 to 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric.
The `RougeScore` metric is used to evaluate the quality of natural language generations. It measures the overlap between the generated `response` and the `reference` text based on n-gram recall, precision, and F1 score. The ROUGE score ranges from 0 to 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._rogue_score import RougeScore
from ragas.metrics import RougeScore

sample = SingleTurnSample(
response="The Eiffel Tower is located in India.",
@@ -82,11 +82,11 @@ scorer = RougeScore(measure_type="recall")
```

## Exact Match
The ExactMatch metric checks if the response is exactly the same as the reference text. It is useful in scenarios where you need to ensure that the generated response matches the expected output word-for-word, for example when checking arguments in tool calls. The metric returns 1 if the response is an exact match with the reference, and 0 otherwise.
The `ExactMatch` metric checks if the response is exactly the same as the reference text. It is useful in scenarios where you need to ensure that the generated response matches the expected output word-for-word, for example when checking arguments in tool calls. The metric returns 1 if the response is an exact match with the reference, and 0 otherwise.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._string import ExactMatch
from ragas.metrics import ExactMatch

sample = SingleTurnSample(
response="India",
@@ -98,11 +98,11 @@ await scorer.single_turn_ascore(sample)
```

## String Presence
The StringPresence metric checks if the response contains the reference text. It is useful in scenarios where you need to ensure that the generated response contains certain keywords or phrases. The metric returns 1 if the response contains the reference, and 0 otherwise.
The `StringPresence` metric checks if the response contains the reference text. It is useful in scenarios where you need to ensure that the generated response contains certain keywords or phrases. The metric returns 1 if the response contains the reference, and 0 otherwise.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._string import StringPresence
from ragas.metrics import StringPresence

sample = SingleTurnSample(
response="The Eiffel Tower is located in India.",
