Add option to pass in multiple template text files for LLM-as-judge eval #99

Merged · 15 commits · Sep 11, 2024
small fixes for eval docs
rchan26 committed Sep 11, 2024

commit dc07252a3d0f60789c1e44c74d205309b8ccc6f8
4 changes: 4 additions & 0 deletions docs/evaluation.md
@@ -9,6 +9,8 @@ To perform an LLM-as-judge evaluation, we essentially treat this as just _anothe

Therefore, given a _completed_ experiment file (i.e., a jsonl file where each line is a json object containing the prompt and response from a model), we can create another experiment file where the prompts are generated using some judge evaluation template and the completed response file. We must specify the model that we want to use as the judge. We call this a _judge_ experiment file and we can use `prompto` again to run this experiment and obtain the judge evaluation responses.
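As a deliberately simplified sketch of this mechanic, the snippet below builds a judge experiment file by substituting each prompt/response pair from a completed experiment into a template string. The template text, file names, judge API/model and the extra fields (`id`, `model_name`) are hypothetical placeholders; in practice the template comes from the judge folder described below.

```python
import json
from string import Template

# Hypothetical judge template; real templates live in the judge folder described below.
judge_template = Template(
    "Rate the following response to the prompt on a scale of 1-10.\n"
    "Prompt: ${prompt}\nResponse: ${response}"
)

judge_lines = []
with open("completed_experiment.jsonl") as f:  # hypothetical completed experiment file
    for line in f:
        completed = json.loads(line)
        judge_lines.append(
            {
                "id": f"judge-{completed['id']}",  # illustrative field names
                "api": "openai",                   # the judge API you choose
                "model_name": "gpt-4o",            # hypothetical judge model
                "prompt": judge_template.substitute(
                    prompt=completed["prompt"], response=completed["response"]
                ),
            }
        )

# Write the judge experiment file, ready to be run with prompto like any other experiment.
with open("judge_experiment.jsonl", "w") as f:
    for d in judge_lines:
        f.write(json.dumps(d) + "\n")
```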

Also see the [Running LLM-as-judge experiment notebook](https://alan-turing-institute.github.io/prompto/examples/evaluation/running_llm_as_judge_experiment/) for a more detailed walkthrough of using the library to create and run judge evaluations.

### Judge folder

To run an LLM-as-judge evaluation, you must first create a _judge folder_ consisting of:
@@ -128,6 +130,8 @@ def my_scorer(prompt_dict: dict) -> dict:
return prompt_dict
```

Also see the [Running experiments with custom evaluations](https://alan-turing-institute.github.io/prompto/examples/evaluation/running_experiments_with_custom_evaluations/) notebook for a more detailed walkthrough of using the library with custom scoring functions.

### Using a scorer in `prompto`

In Python, to use a scorer when processing an experiment, pass a list of scoring functions to the `Experiment.process()` method. For instance, you can use the `match` and `includes` scorers as follows:
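(A minimal sketch only: the import paths `prompto.settings`, `prompto.experiment` and `prompto.scorer`, the keyword argument name `evaluation_funcs`, and the assumption that `process()` is a coroutine are inferred from this description, so check the prompto documentation for the exact names.)

```python
import asyncio

# Assumed import locations; verify against the prompto documentation.
from prompto.settings import Settings
from prompto.experiment import Experiment
from prompto.scorer import match, includes

settings = Settings(data_folder="data")  # "data" is a placeholder folder
experiment = Experiment("experiment.jsonl", settings=settings)

async def main():
    # Pass the scoring functions when processing the experiment
    # (the keyword argument name is an assumption).
    await experiment.process(evaluation_funcs=[match, includes])

asyncio.run(main())
```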
@@ -27,12 +27,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Environment setup\n",
"\n",
"In this experiment, we will use the Anthropic API, but feel free to edit the input file provided to use a different API and model.\n",
"\n",
"When using `prompto` to query models from the Anthropic API, lines in our experiment `.jsonl` files must have `\"api\": \"anthropic\"` in the prompt dict. \n",
"\n",
"## Environment variables\n",
"\n",
"For the [Anthropic API](https://alan-turing-institute.github.io/prompto/docs/anthropic/), there are two environment variables that could be set:\n",
"- `ANTHROPIC_API_KEY`: the API key for the Anthropic API\n",
"\n",