Merge pull request #718 from sujee/intro-example1
Intro example 1
touma-I authored Oct 21, 2024
2 parents 5942f42 + 4d070ca commit c90017a
Showing 11 changed files with 10,992 additions and 0 deletions.
10 changes: 10 additions & 0 deletions examples/notebooks/intro/.gitignore
@@ -0,0 +1,10 @@
output*/

## File system artifacts
.directory
.DS_Store


## Python output
__pycache__
.ipynb_checkpoints/
36 changes: 36 additions & 0 deletions examples/notebooks/intro/README.md
@@ -0,0 +1,36 @@
# Data Prep Kit Introduction

This example demonstrates some of the features of Data Prep Kit.

## Running the code

The code can be run in either of two environments:

1. Google Colab: very easy to run; no local setup needed.
2. Your local Python environment. A quick guide follows; you can find instructions for the latest version [here](../../../README.md#-getting-started).

```bash
conda create -n data-prep-kit -y python=3.11
conda activate data-prep-kit

# install the following in 'data-prep-kit' environment
pip3 install data-prep-toolkit==0.2.1
pip3 install data-prep-toolkit-transforms==0.2.1
pip3 install data-prep-toolkit-transforms-ray==0.2.1
pip3 install jupyterlab ipykernel ipywidgets

## install custom kernel
## Important: Use this kernel when running example notebooks!
python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit"

# start JupyterLab and run the notebooks from this session
jupyter lab
```
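
After the install steps, it can help to confirm that the packages actually landed in the active environment. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` is hypothetical and not part of Data Prep Kit, and the package names are taken from the `pip3 install` lines above.

```python
from importlib import metadata

# Package names as used in the pip install commands above.
PACKAGES = [
    "data-prep-toolkit",
    "data-prep-toolkit-transforms",
    "data-prep-toolkit-transforms-ray",
]


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version, or None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Running this in a notebook cell that uses the `dataprepkit` kernel should report the pinned versions; a `NOT INSTALLED` entry suggests the notebook is running on a different kernel or environment.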

## Intro

This notebook demonstrates processing PDFs through the following pipeline:

`PDFs ---> text ---> chunks ---> exact dedupe ---> fuzzy dedupe ---> embeddings`

[python version](dpk_intro_1_python.ipynb)   |   [ray version](dpk_intro_1_ray.ipynb)
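
To give a feel for the middle stages of the pipeline (`chunks ---> exact dedupe ---> fuzzy dedupe`), here is a rough sketch in plain Python. It does not use the Data Prep Kit API and is not how the notebooks implement these steps: the function names are hypothetical, chunking is a naive fixed word count, exact dedupe compares SHA-256 hashes, and fuzzy dedupe uses a simple word-set Jaccard similarity as a stand-in for whatever technique the transforms use. PDF extraction and embeddings are omitted because they need external libraries and models.

```python
import hashlib


def chunk_text(text, max_words=5):
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def exact_dedupe(chunks):
    """Drop chunks whose content hash has been seen before, keeping first occurrences."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def fuzzy_dedupe(chunks, threshold=0.8):
    """Keep a chunk only if it is below the similarity threshold to every kept chunk."""
    kept = []
    for chunk in chunks:
        if all(jaccard(chunk, other) < threshold for other in kept):
            kept.append(chunk)
    return kept


if __name__ == "__main__":
    chunks = [
        "the quick brown fox jumps",
        "the quick brown fox jumps",       # exact duplicate
        "the quick brown fox jumps high",  # near duplicate
        "hello world",
    ]
    print(fuzzy_dedupe(exact_dedupe(chunks)))
```

The pairwise comparison in `fuzzy_dedupe` is quadratic in the number of chunks, which is fine for a toy example but would not scale to a real corpus.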