1. Add a link to the RAGAS NV metrics prompts.
2. Add a note that the RAGAS NV metrics prompts are not tunable. The user can instead use the "Tunable RAG Evaluator" or add their own "Custom Evaluator".
3. Add a note on the recommended `max_tokens` for the Trajectory Evaluator.
## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/AIQToolkit/blob/develop/docs/source/resources/contributing.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.
Authors:
- Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah)
Approvers:
- David Gardner (https://github.com/dagardner-nv)
URL: #322
**`docs/source/workflows/evaluate.md`** (7 additions, 3 deletions)
````diff
@@ -115,7 +115,7 @@ llms:
     model_name: meta/llama-3.1-70b-instruct
     max_tokens: 8
 ```
-For these metrics, it is recommended to use 8 tokens for the judge LLM.
+For these metrics, it is recommended to use 8 tokens for the judge LLM. The judge LLM returns a floating point score between 0 and 1 for each metric where 1.0 indicates a perfect match between the expected output and the generated output.
 
 Evaluation is dependent on the judge LLM's ability to accurately evaluate the generated output and retrieved context. This is the leadership board for the judge LLM:
 ```
````
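For reference, a complete judge-LLM entry and its evaluator wiring for these metrics could look like the sketch below. The `model_name` and `max_tokens: 8` values come from the hunk above; the entry name, the `_type` values, and the `metric` field are illustrative assumptions rather than lines from the file.

```yaml
# Sketch of a judge LLM for the RAGAS NV metrics. Names and fields marked
# "assumed" are illustrative, not copied from evaluate.md.
llms:
  nim_rag_eval_llm:                          # assumed entry name
    _type: nim                               # assumed LLM provider type
    model_name: meta/llama-3.1-70b-instruct
    max_tokens: 8                            # recommended for these metrics

eval:
  evaluators:
    rag_accuracy:                            # assumed evaluator name
      _type: ragas                           # assumed evaluator type
      metric: AnswerAccuracy                 # assumed NV metric name
      llm_name: nim_rag_eval_llm             # references the judge LLM above
```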
````diff
@@ -126,6 +126,8 @@ Evaluation is dependent on the judge LLM's ability to accurately evaluate the ge
 ```
 For a complete list of up-to-date judge LLMs, refer to the [RAGAS NV metrics leadership board](https://github.com/explodinggradients/ragas/blob/main/ragas/src/ragas/metrics/_nv_metrics.py)
 
+For more information on the prompt used by the judge LLM, refer to the [RAGAS NV metrics](https://github.com/explodinggradients/ragas/blob/main/ragas/src/ragas/metrics/_nv_metrics.py). The prompt for these metrics is not configurable. If you need a custom prompt, you can use the [Tunable RAG Evaluator](../reference/evaluate.md#tunable-rag-evaluator) or implement your own evaluator using the [Custom Evaluator](../extend/custom-evaluator.md) documentation.
+
 ### Trajectory Evaluator
 This evaluator uses the intermediate steps generated by the workflow to evaluate the workflow trajectory. The evaluator configuration includes the evaluator type and any additional parameters required by the evaluator.
 
````
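Because the NV metric prompts are fixed, a workflow that needs a custom judge prompt would switch to one of the alternatives named above. A hypothetical sketch follows, assuming a `tunable_rag_evaluator` type with a prompt field; the actual type name and schema are defined in the Tunable RAG Evaluator and Custom Evaluator documentation linked in the diff.

```yaml
# Hypothetical configuration; the field names below are assumptions. Consult
# the linked Tunable RAG Evaluator documentation for the actual schema.
eval:
  evaluators:
    tuned_rag_eval:                      # assumed evaluator name
      _type: tunable_rag_evaluator       # assumed evaluator type
      llm_name: nim_rag_eval_llm         # judge LLM defined in the llms section
      judge_llm_prompt: |                # assumed field carrying the custom prompt
        Score the generated answer against the expected answer on a scale
        from 0 to 1 and briefly explain the score.
```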
````diff
@@ -138,9 +140,11 @@ eval:
       llm_name: nim_trajectory_eval_llm
 ```
 
-A judge LLM is used to evaluate the trajectory based on the tools available to the workflow.
+A judge LLM is used to evaluate the trajectory produced by the workflow, taking into account the tools available during execution. It returns a floating-point score between 0 and 1, where 1.0 indicates a perfect trajectory.
+
+It is recommended to set `max_tokens` to 1024 for the judge LLM to ensure sufficient context for evaluation.
 
-The judge LLM is configured in the `llms` section of the configuration file and is referenced by the `llm_name` key in the evaluator configuration.
+To configure the judge LLM, define it in the `llms` section of the configuration file, and reference it in the evaluator configuration using the `llm_name` key.
 
 ## Workflow Output
 The `aiq eval` command runs the workflow on all the entries in the `dataset`. The output of these runs is stored in a file named `workflow_output.json` under the `output_dir` specified in the configuration file.
````
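Putting the trajectory-evaluator guidance together, the judge LLM definition and its reference could look like the following sketch. The `llm_name` value and the 1024-token recommendation come from the diff above; the entry's `_type`, the model choice, and the evaluator type are assumptions.

```yaml
# Sketch of a trajectory-evaluator judge LLM; only llm_name and max_tokens are
# taken from the diff above, the rest is illustrative.
llms:
  nim_trajectory_eval_llm:
    _type: nim                                # assumed LLM provider type
    model_name: meta/llama-3.1-70b-instruct   # assumed judge model choice
    max_tokens: 1024                          # recommended for trajectory evaluation

eval:
  evaluators:
    trajectory_eval:                          # assumed evaluator name
      _type: trajectory                       # assumed evaluator type
      llm_name: nim_trajectory_eval_llm       # references the judge LLM above
```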