This is the companion repository for my bachelor's thesis. It contains most of the code used to generate the results (it does not include all of the code used for PDF parsing, but it does include all files required to run the benchmark).
To ensure a consistent development environment, it is recommended to use a Python virtual environment. Follow these steps:

- Install `virtualenv` if you haven't already:

  ```bash
  pip install virtualenv
  ```

- Create a virtual environment:

  ```bash
  virtualenv venv
  ```

- Activate the virtual environment:

  - On Windows:

    ```bash
    .\venv\Scripts\activate
    ```

  - On Unix or macOS:

    ```bash
    source venv/bin/activate
    ```

- Install project dependencies from `requirements.txt`:

  ```bash
  pip install -r requirements.txt
  ```

Now your Python virtual environment is set up.
This benchmark used GPT-4, GPT-3.5, Llama 2 13B, and Llama 2 70B. Experiments were conducted from November 9 to November 12, 2023, using the OpenAI and Replicate APIs.
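For reference, here is a minimal sketch of how such queries can be issued through the two providers' current Python clients. The model identifiers, parameter names, and prompt are illustrative assumptions and not necessarily the exact code used in the thesis; the hyperparameter values match the table below.

```python
# Illustrative sketch only: model slugs and parameters are assumptions,
# not the benchmark's exact code.
import openai
import replicate

prompt = "Answer the question based on the given context: ..."

# GPT-4 via the OpenAI API (temperature and max tokens from the table below)
client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
gpt_response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,
    max_tokens=50,
)
print(gpt_response.choices[0].message.content)

# Llama 2 70B via the Replicate API (returns an iterator of output tokens)
llama_output = replicate.run(
    "meta/llama-2-70b-chat",
    input={"prompt": prompt, "temperature": 0.2, "max_new_tokens": 50},
)
print("".join(llama_output))
```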
| Hyperparameter | Value |
|---|---|
| LLM Model Temperature | 0.2 |
| LLM Max Tokens | 50 |
| Text Chunk Size (Number of Characters) | 512 |
| Text Chunk Overlap (Number of Characters) | 64 |
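The chunking parameters correspond to a fixed-size sliding window over the document text. A minimal sketch of such a splitter is shown below; the function is a hypothetical helper, and the thesis code may differ in detail (e.g. by splitting on sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Illustrative helper matching the hyperparameter table above.
    """
    step = chunk_size - overlap  # each chunk starts 448 characters after the last
    return [
        text[i : i + chunk_size]
        for i in range(0, max(len(text) - overlap, 1), step)
    ]
```

With 512-character chunks and a 64-character overlap, consecutive chunks share their boundary text, which reduces the chance of splitting an answer across retrieval units.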
The results presented here are point estimates and may not be exactly reproducible due to the stochastic nature of Large Language Models (LLMs). This is especially true for commercial LLMs, whose internal workings are not fully transparent. Keep in mind that results may vary even with identical hyperparameters and settings.