This is the companion repository for my bachelor's thesis. It contains most of the code used to generate the results (it does not include all of the code used for PDF parsing, but it does include all files required to run the benchmark).
To ensure a consistent development environment, it is recommended to use a Python virtual environment. Follow these steps:

- Install `virtualenv` if you haven't already:

  ```bash
  pip install virtualenv
  ```

- Create a virtual environment:

  ```bash
  virtualenv venv
  ```

- Activate the virtual environment:

  - On Windows:

    ```bash
    .\venv\Scripts\activate
    ```

  - On Unix or macOS:

    ```bash
    source venv/bin/activate
    ```

- Install project dependencies from `requirements.txt`:

  ```bash
  pip install -r requirements.txt
  ```

Now your Python virtual environment is set up.
This benchmark used GPT-4, GPT-3.5, Llama 2 13B, and Llama 2 70B. Experiments were conducted from November 9 to November 12, 2023, using the OpenAI and Replicate APIs.
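For reference, here is a minimal sketch of how such queries can be issued through the two providers' current Python clients. The model identifiers, parameter names, and prompt are illustrative assumptions and not necessarily the exact code used in the thesis; the hyperparameter values match the table below.

```python
# Illustrative sketch only: model slugs and parameters are assumptions,
# not the benchmark's exact code.
import openai
import replicate

prompt = "Answer the question based on the given context: ..."

# GPT-4 via the OpenAI API (temperature and max tokens from the table below)
client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
gpt_response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,
    max_tokens=50,
)
print(gpt_response.choices[0].message.content)

# Llama 2 70B via the Replicate API (returns an iterator of output tokens)
llama_output = replicate.run(
    "meta/llama-2-70b-chat",
    input={"prompt": prompt, "temperature": 0.2, "max_new_tokens": 50},
)
print("".join(llama_output))
```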
| Hyperparameter | Value |
|---|---|
| LLM Model Temperature | 0.2 |
| LLM Max Tokens | 50 |
| Text Chunk Size (Number of Characters) | 512 |
| Text Chunk Overlap (Number of Characters) | 64 |
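The chunking parameters correspond to a fixed-size sliding window over the document text. A minimal sketch of such a splitter is shown below; the function is a hypothetical helper, and the thesis code may differ in detail (e.g. by splitting on sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Illustrative helper matching the hyperparameter table above.
    """
    step = chunk_size - overlap  # each chunk starts 448 characters after the last
    return [
        text[i : i + chunk_size]
        for i in range(0, max(len(text) - overlap, 1), step)
    ]
```

With 512-character chunks and a 64-character overlap, consecutive chunks share their boundary text, which reduces the chance of splitting an answer across retrieval units.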
The results presented here are point estimates and may not be exactly reproducible due to the stochastic nature of Large Language Models (LLMs). This is especially true for commercial LLMs, whose internal workings are not fully transparent. Keep in mind that results may vary even with identical hyperparameters and settings.