Merge pull request #810 from IBM/shahrokh_readmestuff
A few changes in the root README
touma-I authored Nov 19, 2024
2 parents 7e505c6 + 3563b0c commit 5f4c347
Showing 3 changed files with 32 additions and 11 deletions.
22 changes: 15 additions & 7 deletions README.md
@@ -122,7 +122,14 @@ Explore more examples [here](examples/notebooks).

### Run your first data prep pipeline

-Now that you have run a single transform, the next step is to explore how to put these transforms together to run a data prep pipeline for an end to end use case like fine tuning model or building a RAG application. This [notebook](examples/notebooks/fine%20tuning/code/sample-notebook.ipynb) gives an example of how to build an end to end data prep pipeline for fine tuning for code LLMs. You can also explore how to build a RAG pipeline [here](examples/notebooks/rag).
+Now that you have run a single transform, the next step is to explore how to put these transforms
+together to run a data prep pipeline for an end-to-end use case like fine-tuning a model or building
+a RAG application.
+This [notebook](examples/notebooks/fine%20tuning/code/sample-notebook.ipynb) gives an example of
+how to build an end-to-end data prep pipeline for fine-tuning code LLMs. Similarly, this
+[notebook](examples/notebooks/fine%20tuning/language/demo_with_launcher.ipynb) shows an end-to-end
+sample data pipeline for fine-tuning on language datasets.
+You can also explore how to build a RAG pipeline [here](examples/notebooks/rag).
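
To make this concrete, the sketch below shows the shape of a minimal pipeline in code. It reuses the Web2Parquet call documented later in this diff; the import path is an assumption for illustration, and the downstream steps are only sketched in comments.

```
# Minimal sketch of a data prep flow. The Web2Parquet call mirrors the
# usage shown in transforms/universal/web2parquet/README.md below; the
# import path is an assumption, not confirmed API.
from dpk_web2parquet.transform import Web2Parquet  # assumed module path

# Step 1: crawl a site and land the pages as parquet files in `folder`.
Web2Parquet(urls=['https://thealliance.ai/'],
            depth=2,
            downloads=10,
            folder='downloads').transform()

# Steps 2+ (hypothetical): each subsequent transform reads the previous
# step's parquet output and writes its own, so steps chain through
# folders until the data is ready for fine-tuning or RAG.
```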

### Current list of transforms
The matrix below shows the combination of modules and supported runtimes. All the modules can be accessed [here](transforms) and can be combined to form data processing pipelines, as shown in the [examples](examples) folder.
@@ -133,7 +140,8 @@ The matrix below shows the combination of modules and supported runtimes. Al
| **Data Ingestion** | | | | |
| [Code (from zip) to Parquet](transforms/code/code2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [PDF to Parquet](transforms/language/pdf2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
-| [HTML to Parquet](transforms/language/html2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | |
+| [HTML to Parquet](transforms/language/html2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
+| [Web to Parquet](transforms/universal/web2parquet/README.md) | :white_check_mark: | | | |
| **Universal (Code & Language)** | | | | |
| [Exact dedup filter](transforms/universal/ededup/ray/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [Fuzzy dedup filter](transforms/universal/fdedup/ray/README.md) | | :white_check_mark: | | :white_check_mark: |
@@ -223,11 +231,11 @@ If you use Data Prep Kit in your research, please cite our paper:
@misc{wood2024dataprepkitgettingdataready,
title={Data-Prep-Kit: getting your data ready for LLM application development},
author={David Wood and Boris Lublinsky and Alexy Roytman and Shivdeep Singh
-and Abdulhamid Adebayo and Revital Eres and Mohammad Nassar and Hima Patel
-and Yousaf Shah and Constantin Adam and Petros Zerfos and Nirmit Desai
-and Daiki Tsuzuku and Takuya Goto and Michele Dolfi and Saptha Surendran
-and Paramesvaran Selvam and Sungeun An and Yuan Chi Chang and Dhiraj Joshi
-and Hajar Emami-Gohari and Xuan-Hong Dang and Yan Koyfman and Shahrokh Daijavad},
+and Constantin Adam and Abdulhamid Adebayo and Sungeun An and Yuan Chi Chang
+and Xuan-Hong Dang and Nirmit Desai and Michele Dolfi and Hajar Emami-Gohari
+and Revital Eres and Takuya Goto and Dhiraj Joshi and Yan Koyfman
+and Mohammad Nassar and Hima Patel and Paramesvaran Selvam and Yousaf Shah
+and Saptha Surendran and Daiki Tsuzuku and Petros Zerfos and Shahrokh Daijavad},
year={2024},
eprint={2409.18164},
archivePrefix={arXiv},
7 changes: 6 additions & 1 deletion transforms/README-list.md
@@ -36,8 +36,13 @@ Note: This list includes the transforms that were part of the release starting w
* [tokenization](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/tokenization/python/README.md)
* [doc_id](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/doc_id/python/README.md)
+* [web2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/web2parquet/README.md)

## Release notes:


+### 0.2.2.dev3
+* web2parquet
### 0.2.2.dev2
* pdf2parquet now supports HTML, DOCX, PPTX, ... in addition to PDF



14 changes: 11 additions & 3 deletions transforms/universal/web2parquet/README.md
@@ -21,16 +21,24 @@ For configuring the crawl, users need to specify the following parameters:

The transform can be installed directly from PyPI and has a dependency on the data-prep-toolkit and the data-prep-connector.

+Set up the local environment to run Jupyter notebooks:
+```
+python -m venv venv
+source venv/bin/activate
+pip install jupyterlab
+```
+Install prerequisites:
+
```
pip install data-prep-connector
pip install 'data-prep-toolkit>=0.2.2.dev2'
-pip install data-prep-toolkit-transform[web2parquet]>=0.2.2.dev3
+pip install 'data-prep-toolkit-transforms[web2parquet]>=0.2.2.dev3'
```
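
As a quick sanity check that the installs resolved, the imports below should succeed in the same environment; the module names are assumptions about how the PyPI distributions map to import names.

```
# Hedged sanity check after the installs above; the import names are
# assumptions about the distributions' layouts, not confirmed API.
import data_processing  # from data-prep-toolkit (assumed import name)
import dpk_connector    # from data-prep-connector (assumed import name)

print("data prep kit packages import OK")
```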

If working from a fork of the git repo, do the following from the repo's root folder:

```
-cd transform/universal/web2parquet
+cd transforms/universal/web2parquet
make venv
source venv/bin/activate
pip install -r requirements.txt
@@ -49,4 +57,4 @@ Web2Parquet(urls=['https://thealliance.ai/'],
depth=2,
downloads=10,
folder='downloads').transform()
````
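
The crawl writes its output as parquet files under the configured folder. A minimal, hedged way to inspect the result, assuming pandas with the pyarrow engine (which can read a directory of parquet files):

```
import pandas as pd

# Load the parquet files web2parquet wrote into 'downloads'; reading a
# directory relies on the pyarrow engine (an assumption about the local
# environment).
df = pd.read_parquet('downloads')
print(df.head())
```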
