Experimental code to query tabular data with natural language queries. The LLM looks only at the content of the tabular data, not at the dataset description or any other metadata.
The code takes a natural language query as input, transforms it into a SQL query, executes it on the database, and then tries to translate the result back into natural language. The data are downloaded locally, so large files may take a moment to load.
The code is intentionally stateless, so each query is independent. This is a feature!
It relies on LangChain agents and OpenAI for this purpose.
Requires an OpenAI key to run (see code).
In the near future the underlying LLM will be replaced by other models. If you already use another large language model, the code requires no modification, as long as the model is supported by LangChain.
We plan to expand the functionalities of this tool.
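For orientation, the core query loop can be sketched as below. This is a minimal, hypothetical sketch assuming the `langchain-openai` and `langchain-experimental` packages; the helper names and the `data.tab` file are illustrative, not the repository's actual code.

```python
import pandas as pd


def load_table(path: str) -> pd.DataFrame:
    """Read the tabular file fully into memory (the whole file is loaded locally)."""
    # sep=None with the python engine sniffs the delimiter (CSV, TSV, ...)
    return pd.read_csv(path, sep=None, engine="python")


def build_agent(df: pd.DataFrame):
    """Create a pandas-dataframe agent; requires OPENAI_API_KEY in the environment."""
    from langchain_openai import ChatOpenAI
    from langchain_experimental.agents import create_pandas_dataframe_agent

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    # No chat memory is attached: each query is independent, by design.
    return create_pandas_dataframe_agent(llm, df, verbose=True,
                                         allow_dangerous_code=True)


# Usage (requires an OpenAI key and network access):
#   df = load_table("data.tab")          # hypothetical local file
#   agent = build_agent(df)
#   print(agent.invoke("what is this data about?")["output"])
```

The agent generates and runs pandas code against the dataframe, which is why answers to generic questions are often computed from the first few rows only.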
- switched to model gpt-4o-mini (not sure it is any better, though)
- moved from a SQL agent to a pandas dataframe agent
- added statistical analysis
- text-davinci-003 was deprecated on January 4th, 2024; we now use gpt-3.5-turbo
- can generate hallucinations
- can get stuck in a loop that leads to no answer or an error
- there is no control over long queries, so the code may return an error
- the generated query can be too large for the context window of the LLM
- if the data is too large, the tool may fail to launch (out of memory)
- the notifications appear at the bottom of the web page
- works only on tabular data
- fix some of the above issues
- add basic statistical analysis functionalities (done!)
- add basic plotting functionalities (works only locally)
- extend to other data formats
Deploy to ShinyApps.io
At present ShinyApps.io supports only Python 3.10, so make sure your local environment is based on this release. Hint: after installing Anaconda, you can create a suitable environment with
conda create -n shinyapp python=3.10
conda activate shinyapp
- these instructions assume the deployment is to the askdataverse ShinyApps.io domain. You should change it to your own ShinyApps.io domain
- adjust the variable `OPENAI_API_KEY` in the key.json file to your needs. If you do not deploy on ShinyApps.io, you can replace this file by properly setting an environment variable `OPENAI_API_KEY` and retrieving it via `os.environ['OPENAI_API_KEY']` in app.py
- create a free account at shinyapps.io. With this you will get your own domain to replace askdataverse
- install `rsconnect`
- install the ShinyApps.io application token
- run the "Install Certificates.command" for your Python version
- in your shell, type: `pip install -r requirements.txt`
- in your shell, type: `export CONNECT_REQUEST_TIMEOUT=36000`. You can use a larger number if `rsconnect` fails due to a timeout error. This is a shinyapps.io issue
- deploy with: `rsconnect deploy shiny ./ --name askdataverse --title askthedata key.json`
Note: you have to replace askdataverse with your ShinyApps.io domain.
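The key handling described above can be sketched in Python. This assumes key.json holds a JSON object with an `OPENAI_API_KEY` field; the exact schema of the repository's key.json is an assumption here.

```python
import json
import os


def get_openai_key(key_file: str = "key.json") -> str:
    """Prefer key.json (as deployed on ShinyApps.io); fall back to the environment."""
    if os.path.exists(key_file):
        with open(key_file) as f:
            return json.load(f)["OPENAI_API_KEY"]
    # No key file: expect OPENAI_API_KEY to be set in the environment instead
    return os.environ["OPENAI_API_KEY"]
```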
Move into the askthedata folder and build the container
docker build -t askthedata .
Then add the key.json at runtime with
docker run -d -p 8085:8085 \
-v $(pwd)/key.json:/app/key.json:ro \
--name askthedata askthedata
Connect to port 8085.
This script requires two arguments: fileId and siteUrl.
The parameter dataset_pid is collected but not yet used.
- for Harvard Dataverse, `siteUrl` always equals https://dataverse.harvard.edu and `fileId` is the internal identifier of the file in the Dataverse database
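The two parameters map to a direct download URL via the Dataverse Data Access API (`/api/access/datafile/{id}`). A small sketch of how the script might parse and combine them; the helper names are illustrative:

```python
from urllib.parse import urlparse, parse_qs


def parse_tool_params(app_url: str) -> dict:
    """Extract fileId and siteUrl from the URL the tool is launched with."""
    qs = parse_qs(urlparse(app_url).query)
    return {"fileId": qs["fileId"][0], "siteUrl": qs["siteUrl"][0]}


def datafile_url(site_url: str, file_id: str) -> str:
    """Build the download URL for the tabular file on the given Dataverse."""
    return f"{site_url.rstrip('/')}/api/access/datafile/{file_id}"
```

For instance, launching the tool with `?fileId=4862482&siteUrl=https://dataverse.harvard.edu` would resolve to https://dataverse.harvard.edu/api/access/datafile/4862482.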
This section assumes you are the local Dataverse admin. First, download askthedata.json
Optionally, edit toolUrl if you have installed AskTheData on your own server.
As explained in the Dataverse Guides, run the following command:
curl -X POST -H 'Content-type: application/json' http://localhost:8080/api/admin/externalTools --upload-file askthedata.json
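For reference, a Dataverse external-tool manifest for a file-level explore tool generally looks like the following. The field values here are illustrative, and the actual askthedata.json may differ; use the downloaded file, not this sketch.

```json
{
  "displayName": "AskTheData",
  "description": "Query tabular data with natural language",
  "scope": "file",
  "types": ["explore"],
  "toolUrl": "https://ai-services.dataverse.org/askthedata/",
  "contentType": "text/tab-separated-values",
  "toolParameters": {
    "queryParameters": [
      {"fileId": "{fileId}"},
      {"siteUrl": "{siteUrl}"}
    ]
  }
}
```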
Although some interesting queries may fail for a given dataset, in general the tool works pretty well. The more SQL-like your question, the higher the chance of getting an answer. Generic queries are computed on the first few rows of the dataset, so sometimes you need to force deeper queries like those in Example 3 below.
- Example 1: Data about smoking habits. https://ai-services.dataverse.org/askthedata/?fileId=4862482&siteUrl=https://dataverse.harvard.edu

  or, if run locally (replace 64504 with your port): http://localhost:64504/?fileId=4862482&siteUrl=https://dataverse.harvard.edu

- Example 2: Data about COVID-19 and hospitalization. https://ai-services.dataverse.org/askthedata/?fileId=4458512&siteUrl=https://dataverse.harvard.edu

- Example 3: Data about metadata. https://ai-services.dataverse.org/askthedata/?fileId=6570377&siteUrl=https://dataverse.harvard.edu
- what is this data about? (this is the default starting question)
- what's the most interesting thing about this data?
- surprise me with this data (this is a risky question, but helps to understand the limits of this tool)
and in particular, if you use the data from Example 1 in the above:
- do males smoke more than women?
- what is the average age of smokers by city?
- where do people smoke the most?
and if you use the data from Example 2 in the above:
- how many people died by covid?
- did females recover more than males?
- what is the average number of days between admission date and death? [no longer working on gpt-4o-mini because it converts dates to milliseconds!]
- calculate the severity by gender
and if you use the data from Example 3 in the above:
- "what is this data about?" may fail (answering "This data is about astrophysics"), as the system looks only at the first rows without further instruction. But forcing it to go deeper may work. Try this (if needed, twice):
- look at the first 100 rows and tell me what this data is about
- calculate the abundance of different disciplines
etc.; the limit is your creativity... and the LLM's ability to answer!
Enjoy!