From 00045d6ce1fa373d63014f00fb50ac697c430414 Mon Sep 17 00:00:00 2001
From: SHAHROKH DAIJAVAD
Date: Fri, 15 Nov 2024 14:02:08 -0800
Subject: [PATCH 1/8] Update README.md

---
 README.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 2c1caa04e6..88bed2721c 100644
--- a/README.md
+++ b/README.md
@@ -133,7 +133,7 @@ The matrix below shows the the combination of modules and supported runtimes. Al
 | **Data Ingestion** | | | | |
 | [Code (from zip) to Parquet](transforms/code/code2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
 | [PDF to Parquet](transforms/language/pdf2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
-| [HTML to Parquet](transforms/language/html2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | |
+| [HTML to Parquet](transforms/language/html2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
 | **Universal (Code & Language)** | | | | |
 | [Exact dedup filter](transforms/universal/ededup/ray/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
 | [Fuzzy dedup filter](transforms/universal/fdedup/ray/README.md) | | :white_check_mark: | | :white_check_mark: |
@@ -223,11 +223,11 @@ If you use Data Prep Kit in your research, please cite our paper:
 @misc{wood2024dataprepkitgettingdataready,
 title={Data-Prep-Kit: getting your data ready for LLM application development},
 author={David Wood and Boris Lublinsky and Alexy Roytman and Shivdeep Singh
- and Abdulhamid Adebayo and Revital Eres and Mohammad Nassar and Hima Patel
- and Yousaf Shah and Constantin Adam and Petros Zerfos and Nirmit Desai
- and Daiki Tsuzuku and Takuya Goto and Michele Dolfi and Saptha Surendran
- and Paramesvaran Selvam and Sungeun An and Yuan Chi Chang and Dhiraj Joshi
- and Hajar Emami-Gohari and Xuan-Hong Dang and Yan Koyfman and Shahrokh Daijavad},
+ and Constantin Adam and Abdulhamid Adebayo and Sungeun An and Yuan Chi Chang
+ and Xuan-Hong Dang and Nirmit Desai and Michele Dolfi and Hajar Emami-Gohari
+ and Revital Eres and Takuya Goto and Dhiraj Joshi and Yan Koyfman
+ and Mohammad Nassar and Hima Patel and Paramesvaran Selvam and Yousaf Shah
+ and Saptha Surendran and Daiki Tsuzuku and Petros Zerfos and Shahrokh Daijavad},
 year={2024},
 eprint={2409.18164},
 archivePrefix={arXiv},

From 270186078b275d123ac1608fc047869bb38e6e5c Mon Sep 17 00:00:00 2001
From: SHAHROKH DAIJAVAD
Date: Mon, 18 Nov 2024 07:14:38 -0800
Subject: [PATCH 2/8] Update README.md

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 88bed2721c..6b82e7a6af 100644
--- a/README.md
+++ b/README.md
@@ -134,6 +134,7 @@ The matrix below shows the the combination of modules and supported runtimes. Al
 | [Code (from zip) to Parquet](transforms/code/code2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
 | [PDF to Parquet](transforms/language/pdf2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
 | [HTML to Parquet](transforms/language/html2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
+| [Web to Parquet](transforms/universal/web2parquet/README.md) | :white_check_mark: | | | |
 | **Universal (Code & Language)** | | | | |
 | [Exact dedup filter](transforms/universal/ededup/ray/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
 | [Fuzzy dedup filter](transforms/universal/fdedup/ray/README.md) | | :white_check_mark: | | :white_check_mark: |

From 521bcbe0107ae7cff26dddd14f5d8cc608f1de13 Mon Sep 17 00:00:00 2001
From: SHAHROKH DAIJAVAD
Date: Mon, 18 Nov 2024 08:02:57 -0800
Subject: [PATCH 3/8] Update README.md

---
 README.md | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 6b82e7a6af..716f3b0f22 100644
--- a/README.md
+++ b/README.md
@@ -122,7 +122,14 @@ Explore more examples [here](examples/notebooks).
 
 ### Run your first data prep pipeline
 
-Now that you have run a single transform, the next step is to explore how to put these transforms together to run a data prep pipeline for an end to end use case like fine tuning model or building a RAG application. This [notebook](examples/notebooks/fine%20tuning/code/sample-notebook.ipynb) gives an example of how to build an end to end data prep pipeline for fine tuning for code LLMs. You can also explore how to build a RAG pipeline [here](examples/notebooks/rag).
+Now that you have run a single transform, the next step is to explore how to put these transforms
+together to run a data prep pipeline for an end to end use case like fine tuning a model or building
+a RAG application.
+This [notebook](examples/notebooks/fine%20tuning/code/sample-notebook.ipynb) gives an example of
+how to build an end to end data prep pipeline for fine tuning for code LLMs. Similarly, this
+[notebook](examples/notebooks/fine%20tuning/language/demo_with_launcher.ipynb) is a fine tuning
+example of an end-to-end sample data pipeline designed for processing language datasets.
+You can also explore how to build a RAG pipeline [here](examples/notebooks/rag).
 
 ### Current list of transforms
 The matrix below shows the the combination of modules and supported runtimes. All the modules can be accessed [here](transforms) and can be combined to form data processing pipelines, as shown in the [examples](examples) folder.

From 37cd7a9cf5802c1266eb3c48ffa1f15adc72a3fc Mon Sep 17 00:00:00 2001
From: Shahrokh Daijavad
Date: Mon, 18 Nov 2024 11:25:03 -0800
Subject: [PATCH 4/8] Update README.md transform => transforms

---
 transforms/universal/web2parquet/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/transforms/universal/web2parquet/README.md b/transforms/universal/web2parquet/README.md
index 1841403a7e..98af08cc66 100644
--- a/transforms/universal/web2parquet/README.md
+++ b/transforms/universal/web2parquet/README.md
@@ -30,7 +30,7 @@ pip install data-prep-toolkit-transform[web2parquet]>=0.2.2.dev3
 If working from a fork in the git repo, from the root folder of the git repo, do the following:
 
 ```
-cd transform/universal/web2parquet
+cd transforms/universal/web2parquet
 make venv
 source venv/bin/activate
 pip install -r requirements.txt
@@ -49,4 +49,4 @@ Web2Parquet(urls= ['https://thealliance.ai/'],
 depth=2,
 downloads=10,
 folder='downloads').transform()
-````
\ No newline at end of file
+````

From abe6b7a6f9bc01adea9f65a6e05bddb30f336397 Mon Sep 17 00:00:00 2001
From: Shahrokh Daijavad
Date: Mon, 18 Nov 2024 11:37:52 -0800
Subject: [PATCH 5/8] Update README in web2parquet syntax issues

---
 transforms/universal/web2parquet/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/transforms/universal/web2parquet/README.md b/transforms/universal/web2parquet/README.md
index 98af08cc66..552b22a256 100644
--- a/transforms/universal/web2parquet/README.md
+++ b/transforms/universal/web2parquet/README.md
@@ -24,7 +24,7 @@ The transform can be installed directly from pypi and has a dependency on the da
 ```
 pip install data-prep-connector
 pip install data-prep-toolkit>=0.2.2.dev2
-pip install data-prep-toolkit-transform[web2parquet]>=0.2.2.dev3
+pip install 'data-prep-toolkit-transforms[web2parquet]>=0.2.2.dev3'
 ```
 
 If working from a fork in the git repo, from the root folder of the git repo, do the following:

From ab7475ad487500ad4be61ebf6328e6a90bc5fb6c Mon Sep 17 00:00:00 2001
From: Shahrokh Daijavad
Date: Mon, 18 Nov 2024 11:52:34 -0800
Subject: [PATCH 6/8] Update README.md for the web2parquet A few syntax changes

---
 transforms/universal/web2parquet/README.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/transforms/universal/web2parquet/README.md b/transforms/universal/web2parquet/README.md
index 552b22a256..1a8ecb4086 100644
--- a/transforms/universal/web2parquet/README.md
+++ b/transforms/universal/web2parquet/README.md
@@ -21,6 +21,14 @@ For configuring the crawl, users need to specify the following parameters:
 
 The transform can be installed directly from pypi and has a dependency on the data-prep-toolkit and the data-prep-connector
 
+Set up the local environment to run Jupyter notebook:
+```
+python -m venv venv
+source venv/bin/activate
+pip install jupyter lab
+```
+Install pre-requisites:
+
 ```
 pip install data-prep-connector
 pip install data-prep-toolkit>=0.2.2.dev2

From fca82b0ac7b64cf2cd1f24f6ec9eb250da3d2737 Mon Sep 17 00:00:00 2001
From: Shahrokh Daijavad
Date: Mon, 18 Nov 2024 12:01:55 -0800
Subject: [PATCH 7/8] Update README-list.md

---
 transforms/README-list.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/transforms/README-list.md b/transforms/README-list.md
index 3e70b6b62a..5567a55761 100644
--- a/transforms/README-list.md
+++ b/transforms/README-list.md
@@ -36,8 +36,13 @@ Note: This list includes the transforms that were part of the release starting w
 * [tokenization](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/tokenization/python/README.md)
 * [doc_id](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/doc_id/python/README.md)
 * [web2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/web2parquet/README.md)
+
+## Release notes:
 
-
+### 0.2.2.dev3
+* web2parquet
+### 0.2.2.dev2
+* pdf2parquet now supports HTML,DOCX,PPTX in addition to PDF
 
 
 

From 3563b0c7879c32e3769f8bc834f175a7f1ad4dca Mon Sep 17 00:00:00 2001
From: Shahrokh Daijavad
Date: Mon, 18 Nov 2024 12:02:54 -0800
Subject: [PATCH 8/8] Update README-list.md

---
 transforms/README-list.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/transforms/README-list.md b/transforms/README-list.md
index 5567a55761..8040dc7a98 100644
--- a/transforms/README-list.md
+++ b/transforms/README-list.md
@@ -42,7 +42,7 @@ Note: This list includes the transforms that were part of the release starting w
 ### 0.2.2.dev3
 * web2parquet
 ### 0.2.2.dev2
-* pdf2parquet now supports HTML,DOCX,PPTX in addition to PDF
+* pdf2parquet now supports HTML,DOCX,PPTX, ... in addition to PDF
 
 
 
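
Note on PATCH 5/8: the patch wraps the requirement specifier in single quotes. The commit message does not say why; a likely reason is that a POSIX shell reads an unquoted `>` as an output redirection, so the version constraint never reaches pip. The sketch below illustrates this with Python's standard `shlex` module, which only approximates shell tokenization and is not a real shell. `shell_tokens` is a hypothetical helper name introduced for illustration.

```python
import shlex


def shell_tokens(command: str) -> list[str]:
    """Approximate how a POSIX shell splits a command line into tokens."""
    # punctuation_chars=True makes shlex treat characters such as '>' as
    # separate operator tokens, similar to what a shell does.
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    return list(lexer)


# Unquoted form, as the install line read before PATCH 5/8 (package name as corrected there).
unquoted = shell_tokens(
    "pip install data-prep-toolkit-transforms[web2parquet]>=0.2.2.dev3"
)
# Quoted form, as introduced by PATCH 5/8.
quoted = shell_tokens(
    "pip install 'data-prep-toolkit-transforms[web2parquet]>=0.2.2.dev3'"
)

print(unquoted)
print(quoted)
```

In the unquoted case the `>` should come out as its own token, which a shell would treat as a redirect to a file named `=0.2.2.dev3`; in the quoted case the whole specifier stays one argument. Some shells may also try to expand the unquoted `[web2parquet]` as a glob pattern, which quoting avoids as well.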