defog-ai · rishsriv · Jan 29, 2025
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
@@ -23,8 +23,6 @@ jobs:
           pip install --upgrade pip setuptools
           pip install -r requirements_test.txt
           pip install pytest
-      - name: Download spaCy model
-        run: python -m spacy download en_core_web_sm
       - name: Run tests
         run: |
           pytest tests/test*.py
diff --git a/.gitignore b/.gitignore
@@ -23,3 +23,6 @@ eda_*.ipynb
 
 # wandb output (created when running upload_wandb.ipynb)
 wandb/
+
+# mac os specific
+.DS_Store
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -15,7 +15,6 @@ Firstly, clone the repository where we store our database data and schema. Insta
 git clone https://github.com/defog-ai/defog-data.git
 cd defog-data
 pip install -r requirements.txt
-python -m spacy download en_core_web_sm
 pip install -e .
 ```
 

diff --git a/README.md b/README.md
@@ -25,7 +25,6 @@ Firstly, clone the repository where we store our database data and schema. Insta
 git clone https://github.com/defog-ai/defog-data.git
 cd defog-data
 pip install -r requirements.txt
-python -m spacy download en_core_web_sm
 pip install -e .
 ```
 
@@ -106,8 +105,6 @@ If you have a private dataset that you do not want to make publicly available bu
 - Begin by creating a separate git repository for your private data, that has a `setup.py` file, similar to [defog-data](https://github.com/defog-ai/defog-data).
 - Create the metadata and data files, and import them into your database. This is to allow our evaluation framework to run the generated queries with some actual data. You can refer to `defog-data`'s [metadata objects](https://github.com/defog-ai/defog-data/blob/main/defog_data/metadata.py) for the schema, and [setup.sh](https://github.com/defog-ai/defog-data/blob/main/setup.sh) as an example on how import the data into your database. We do not prescribe any specific folder structure, and leave it to you to decide how you want to organize your data, so long as you can import it into your database easily.
 - To use our metadata pruning utilities, you would need to have the following defined:
-  - A way to load your embeddings. In our case, we call a function [load_embeddings](https://github.com/defog-ai/defog-data/blob/db8c3d4c4004144d2b3ff5a2701529f5545f520f/defog_data/supplementary.py#L85) from `defog-data`'s supplementary module to load a dictionary of database name to a tuple of the 2D embedding matrix (num examples x embedding dimension) and the associated text metadata for each row/example. If you would like to see how we generate this tuple, you may refer to [generate_embeddings](https://github.com/defog-ai/defog-data/blob/main/defog_data/supplementary.py#L11) in the `defog-data` repository.
-  - A way to load columns associated with various named entities. In our case, we call a dictionary [columns_ner](https://github.com/defog-ai/defog-data/blob/db8c3d4c4004144d2b3ff5a2701529f5545f520f/defog_data/supplementary.py#L106) of database name to a nested dictionary that maps each named entity type to a list of column metadata strings that are associated with that named entity type. You can refer to the raw data for an example of how we generate this dictionary.
   - A way to define joinable columns between tables. In our case, we call a dictionary [columns_join](https://github.com/defog-ai/defog-data/blob/db8c3d4c4004144d2b3ff5a2701529f5545f520f/defog_data/supplementary.py#L233) of database name to a nested dictionary of table tuples to column name tuples. You can refer to the raw data for an example of how we generate this dictionary.
 
 Once all of the 3 above steps have completed, you would need to

diff --git a/requirements.txt b/requirements.txt
@@ -15,7 +15,6 @@ pytest
 pyyaml
 sentence-transformers
 snowflake-connector-python
-spacy
 sqlalchemy
 tiktoken
 together

diff --git a/requirements_test.txt b/requirements_test.txt
@@ -3,11 +3,7 @@ numpy
 openai
 pandas
 psycopg2-binary
-# pysqlite3
-sentence_transformers
 snowflake-connector-python
-spacy==3.7.2
 sqlalchemy
 sqlglot
-torch
 tqdm