Merge pull request #718 from sujee/intro-example1

Intro example 1
IBM · Oct 21, 2024 · c90017a
2 parents 5942f42 + 4d070ca
commit c90017a
Showing 11 changed files with 10,992 additions and 0 deletions.
diff --git a/examples/notebooks/intro/.gitignore b/examples/notebooks/intro/.gitignore
@@ -0,0 +1,10 @@
+output*/
+
+## File system artifacts
+.directory
+.DS_Store
+
+
+## Python output
+__pycache__
+.ipynb_checkpoints/
diff --git a/examples/notebooks/intro/README.md b/examples/notebooks/intro/README.md
@@ -0,0 +1,36 @@
+# Data Prep Kit Introduction
+
+This example demonstrates several features of Data Prep Kit.
+
+## Running the code
+
+The code can be run in either of two ways:
+
+1.  Google Colab: very easy to run; no local setup needed.
+2.  Your local Python environment. A quick guide follows; instructions for the latest version are [here](../../../README.md#-getting-started).
+
+```bash
+conda create -n data-prep-kit -y python=3.11
+conda activate data-prep-kit
+
+# install the following in 'data-prep-kit' environment
+pip3 install data-prep-toolkit==0.2.1
+pip3 install data-prep-toolkit-transforms==0.2.1
+pip3 install data-prep-toolkit-transforms-ray==0.2.1
+pip3 install jupyterlab ipykernel ipywidgets
+
+## install custom kernel
+## Important: Use this kernel when running example notebooks!
+python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit"
+
+# start JupyterLab and run the notebooks from this instance
+jupyter lab
+```
+
+## Intro
+
+This notebook demonstrates processing PDFs through the following pipeline:
+
+`PDFs ---> text ---> chunks ---> exact dedupe ---> fuzzy dedupe ---> embeddings`
+
+[python version](dpk_intro_1_python.ipynb)  &nbsp;   |   &nbsp;  [ray version](dpk_intro_1_ray.ipynb)
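
Reviewer note: the stage order in the README's pipeline can be illustrated with a toy sketch that uses only the Python standard library. This is not the Data Prep Kit API. `chunk_text` and `exact_dedupe` are hypothetical helpers, the sample strings are made up, and the PDF-to-text, fuzzy dedupe, and embedding stages are left out. It only shows why chunking comes before exact dedupe: duplicates are detected per chunk, by hash.

```python
import hashlib


def chunk_text(text: str, max_words: int = 5) -> list[str]:
    """Split text into chunks of at most max_words words (hypothetical helper)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def exact_dedupe(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each chunk, comparing by SHA-256 digest."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


# Stand-ins for text already extracted from two PDFs (made-up sample data).
docs = [
    "alpha beta gamma delta epsilon zeta",
    "alpha beta gamma delta epsilon eta",
]

# Stage "chunks": flatten every document into word-bounded chunks.
chunks = [c for doc in docs for c in chunk_text(doc)]

# Stage "exact dedupe": drop chunks whose bytes are identical to an earlier one.
unique_chunks = exact_dedupe(chunks)
```

Here the two documents share their first five-word chunk, so four chunks go in and three come out. Near-duplicates that differ by even one character survive this stage, which is the gap the fuzzy dedupe step in the pipeline is meant to cover.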