This module shows how we benchmark question answering over CSV data. There are several components: setting up the environment, creating and uploading a dataset of labeled questions and answers, and evaluating several question-answering methods against it.
To set up, you should install all required packages:
pip install -r requirements.txt
You then need to set a few environment variables. This module relies heavily on LangSmith, so you need to set its environment variables:
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_ENDPOINT=https://api.langchain.plus
export LANGCHAIN_API_KEY=...
This also uses OpenAI, so you need to set that environment variable:
export OPENAI_API_KEY=...
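As an optional sanity check before going further (this is not one of the module's scripts), you can confirm these variables are picked up by the LangSmith client; the sketch below assumes the langsmith Python package comes in via requirements.txt.

# Optional sanity check (not one of the module's scripts): confirm the
# environment variables above are actually picked up by the LangSmith client.
import os

from langsmith import Client

assert os.environ.get("LANGCHAIN_API_KEY"), "LANGCHAIN_API_KEY is not set"
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"

client = Client()  # reads LANGCHAIN_ENDPOINT / LANGCHAIN_API_KEY from the environment
datasets = list(client.list_datasets())  # fails fast on a bad key or endpoint
print(f"Connected to LangSmith; found {len(datasets)} dataset(s).")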
To create the dataset to evaluate against, we set up a simple Streamlit app that logged questions, answers, and feedback to LangSmith. We then annotated examples in LangSmith and added them to the dataset we were creating. For more details on how to do this generally, see this cookbook.
When doing this, you probably want to specify a project for all runs to be logged to:
export LANGCHAIN_PROJECT="Titanic CSV"
The streamlit_app.py file contains the exact code used to run the application. You can run it with:
streamlit run streamlit_app.py
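Purely as an illustration of that logging pattern (streamlit_app.py is the actual implementation and may differ), a minimal sketch might look like the following; the CSV filename, the pandas-agent choice, and the feedback key are assumptions, and Streamlit rerun/session-state handling is glossed over.

# Hypothetical sketch of logging questions, answers, and feedback to LangSmith.
# streamlit_app.py contains the real code; filenames and keys here are assumed.
import pandas as pd
import streamlit as st
from langchain.callbacks import collect_runs
from langchain.chat_models import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langsmith import Client

client = Client()
df = pd.read_csv("titanic.csv")  # assumed filename for the Titanic data
agent = create_pandas_dataframe_agent(ChatOpenAI(temperature=0), df)

question = st.text_input("Ask a question about the Titanic passengers")
if question:
    # collect_runs captures the traced run so feedback can be attached to it.
    with collect_runs() as cb:
        answer = agent.run(question)
        run_id = cb.traced_runs[0].id
    st.write(answer)
    good, bad = st.columns(2)
    if good.button("Correct"):
        client.create_feedback(run_id, key="user_score", score=1)
    if bad.button("Incorrect"):
        client.create_feedback(run_id, key="user_score", score=0)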
See data.csv for the data points we labeled.
In order to evaluate, we first upload our data to LangSmith, with the dataset name Titanic CSV. This is done in upload_data.py. You can run this with:
python upload_data.py
This allows us to track different evaluation runs against this dataset.
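upload_data.py holds the real upload logic; a minimal sketch of this kind of upload, assuming data.csv has question and answer columns (the column names and input/output keys below are assumptions), could look like this.

# Hypothetical sketch of uploading the labeled examples to LangSmith.
# upload_data.py is the real script; the "question"/"answer" column names are assumed.
import pandas as pd
from langsmith import Client

client = Client()
dataset = client.create_dataset(
    dataset_name="Titanic CSV",
    description="Labeled questions and answers about the Titanic passenger CSV.",
)
df = pd.read_csv("data.csv")
for _, row in df.iterrows():
    client.create_example(
        inputs={"input": row["question"]},
        outputs={"output": row["answer"]},
        dataset_id=dataset.id,
    )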
We then use a standard qa evaluator to evaluate whether the generated answers are correct or not.
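In code, this pairing of an agent with the qa evaluator can be sketched roughly as follows (the CSV filename and agent construction are assumptions; each of the scripts listed below presumably follows this general shape).

# Hypothetical sketch of the shared evaluation harness; each method script
# wires in its own agent or chain. The "qa" evaluator grades generated answers
# against the labeled answers in the dataset.
import pandas as pd
from langchain.chat_models import ChatOpenAI
from langchain.smith import RunEvalConfig, run_on_dataset
from langchain_experimental.agents import create_pandas_dataframe_agent
from langsmith import Client


def agent_factory():
    # Rebuild the agent for each example so runs do not share state.
    df = pd.read_csv("titanic.csv")  # assumed filename
    return create_pandas_dataframe_agent(ChatOpenAI(temperature=0), df)


run_on_dataset(
    client=Client(),
    dataset_name="Titanic CSV",
    llm_or_chain_factory=agent_factory,
    evaluation=RunEvalConfig(evaluators=["qa"]),
)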
We include scripts for evaluating a few different methods:
Pandas agent with GPT-3.5: run with python pandas_agent_gpt_35.py
Results:
Pandas agent with GPT-4: run with python pandas_agent_gpt_4.py
Results:
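The two pandas-agent scripts presumably differ mainly in which model is handed to the agent constructor; a hypothetical sketch of that one difference:

# Hypothetical: the GPT-4 script presumably swaps the model passed to the same
# pandas-agent constructor used for the GPT-3.5 run.
import pandas as pd
from langchain.chat_models import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent

df = pd.read_csv("titanic.csv")  # assumed filename
agent = create_pandas_dataframe_agent(ChatOpenAI(model="gpt-4", temperature=0), df)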
PandasAI: you need to install additional packages:
pip install beautifulsoup4 pandasai
Then run with python pandas_ai.py
Results (note that token tracking is off because this method does not use LangChain):
A custom agent equipped with a custom prompt and some custom tools (Python REPL and vectorstore): run with python custom_agent.py
Results:
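custom_agent.py defines the actual prompt and tools; purely to illustrate the ingredients named above (a Python REPL tool plus a vectorstore-backed lookup tool), a sketch under assumed names and a default prompt might look like this. It additionally assumes faiss-cpu is available for the vectorstore.

# Illustrative sketch only; custom_agent.py contains the real custom prompt and
# tools. Tool names, the vectorstore contents, and the filename are assumptions.
import pandas as pd
from langchain.agents import AgentType, initialize_agent
from langchain.agents.agent_toolkits import create_retriever_tool
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain_experimental.tools import PythonREPLTool

df = pd.read_csv("titanic.csv")  # assumed filename

# A small vectorstore over column names and sample values, so the agent can
# look up what is actually in the CSV before writing pandas code.
texts = [f"{col}: {df[col].dropna().unique()[:20].tolist()}" for col in df.columns]
vectorstore = FAISS.from_texts(texts, OpenAIEmbeddings())
retriever_tool = create_retriever_tool(
    vectorstore.as_retriever(),
    name="column_lookup",
    description="Look up valid column names and example values in the CSV.",
)

agent = initialize_agent(
    [PythonREPLTool(), retriever_tool],
    ChatOpenAI(model="gpt-4", temperature=0),
    agent=AgentType.OPENAI_FUNCTIONS,
)
print(agent.run("How many passengers survived?"))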