From 00045d6ce1fa373d63014f00fb50ac697c430414 Mon Sep 17 00:00:00 2001
From: SHAHROKH DAIJAVAD <shahrokh@us.ibm.com>
Date: Fri, 15 Nov 2024 14:02:08 -0800
Subject: [PATCH 1/8] Update README.md

---
 README.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 2c1caa04e6..88bed2721c 100644
--- a/README.md
+++ b/README.md
@@ -133,7 +133,7 @@ The matrix below shows the the combination of modules and supported runtimes. Al
 | **Data Ingestion**                                                                   |                    |                    |                    |                    |
 | [Code (from zip) to Parquet](transforms/code/code2parquet/python/README.md)          | :white_check_mark: | :white_check_mark: |                    | :white_check_mark: |
 | [PDF to Parquet](transforms/language/pdf2parquet/python/README.md)                   | :white_check_mark: | :white_check_mark: |                    | :white_check_mark: |
-| [HTML to Parquet](transforms/language/html2parquet/python/README.md)                 | :white_check_mark: | :white_check_mark: |                    |                    |
+| [HTML to Parquet](transforms/language/html2parquet/python/README.md)                 | :white_check_mark: | :white_check_mark: |                    | :white_check_mark: |
 | **Universal (Code & Language)**                                                      |                    |                    |                    |                    | 
 | [Exact dedup filter](transforms/universal/ededup/ray/README.md)                      | :white_check_mark: | :white_check_mark: |                    | :white_check_mark: |
 | [Fuzzy dedup filter](transforms/universal/fdedup/ray/README.md)                      |                    | :white_check_mark: |                    | :white_check_mark: |
@@ -223,11 +223,11 @@ If you use Data Prep Kit in your research, please cite our paper:
 @misc{wood2024dataprepkitgettingdataready,
       title={Data-Prep-Kit: getting your data ready for LLM application development}, 
       author={David Wood and Boris Lublinsky and Alexy Roytman and Shivdeep Singh 
-      and Abdulhamid Adebayo and Revital Eres and Mohammad Nassar and Hima Patel 
-      and Yousaf Shah and Constantin Adam and Petros Zerfos and Nirmit Desai 
-      and Daiki Tsuzuku and Takuya Goto and Michele Dolfi and Saptha Surendran 
-      and Paramesvaran Selvam and Sungeun An and Yuan Chi Chang and Dhiraj Joshi 
-      and Hajar Emami-Gohari and Xuan-Hong Dang and Yan Koyfman and Shahrokh Daijavad},
+      and Constantin Adam and Abdulhamid Adebayo and Sungeun An and Yuan Chi Chang 
+      and Xuan-Hong Dang and Nirmit Desai and Michele Dolfi and Hajar Emami-Gohari 
+      and Revital Eres and Takuya Goto and Dhiraj Joshi and Yan Koyfman 
+      and Mohammad Nassar and Hima Patel and Paramesvaran Selvam and Yousaf Shah  
+      and Saptha Surendran and Daiki Tsuzuku and Petros Zerfos and Shahrokh Daijavad},
       year={2024},
       eprint={2409.18164},
       archivePrefix={arXiv},

From 270186078b275d123ac1608fc047869bb38e6e5c Mon Sep 17 00:00:00 2001
From: SHAHROKH DAIJAVAD <shahrokh@us.ibm.com>
Date: Mon, 18 Nov 2024 07:14:38 -0800
Subject: [PATCH 2/8] Update README.md

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 88bed2721c..6b82e7a6af 100644
--- a/README.md
+++ b/README.md
@@ -134,6 +134,7 @@ The matrix below shows the the combination of modules and supported runtimes. Al
 | [Code (from zip) to Parquet](transforms/code/code2parquet/python/README.md)          | :white_check_mark: | :white_check_mark: |                    | :white_check_mark: |
 | [PDF to Parquet](transforms/language/pdf2parquet/python/README.md)                   | :white_check_mark: | :white_check_mark: |                    | :white_check_mark: |
 | [HTML to Parquet](transforms/language/html2parquet/python/README.md)                 | :white_check_mark: | :white_check_mark: |                    | :white_check_mark: |
+| [Web to Parquet](transforms/universal/web2parquet/README.md)                         | :white_check_mark: |                    |                    |                    |
 | **Universal (Code & Language)**                                                      |                    |                    |                    |                    | 
 | [Exact dedup filter](transforms/universal/ededup/ray/README.md)                      | :white_check_mark: | :white_check_mark: |                    | :white_check_mark: |
 | [Fuzzy dedup filter](transforms/universal/fdedup/ray/README.md)                      |                    | :white_check_mark: |                    | :white_check_mark: |

From 521bcbe0107ae7cff26dddd14f5d8cc608f1de13 Mon Sep 17 00:00:00 2001
From: SHAHROKH DAIJAVAD <shahrokh@us.ibm.com>
Date: Mon, 18 Nov 2024 08:02:57 -0800
Subject: [PATCH 3/8] Update README.md

---
 README.md | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 6b82e7a6af..716f3b0f22 100644
--- a/README.md
+++ b/README.md
@@ -122,7 +122,14 @@ Explore more examples [here](examples/notebooks).
 
 ### Run your first data prep pipeline
 
-Now that you have run a single transform, the next step is to explore how to put these transforms together to run a data prep pipeline for an end to end use case like fine tuning model or building a RAG application. This [notebook](examples/notebooks/fine%20tuning/code/sample-notebook.ipynb) gives an example of how to build an end to end data prep pipeline for fine tuning for code LLMs. You can also explore how to build a RAG pipeline [here](examples/notebooks/rag).
+Now that you have run a single transform, the next step is to put these transforms together
+into a data prep pipeline for an end-to-end use case, such as fine-tuning a model or building
+a RAG application.
+This [notebook](examples/notebooks/fine%20tuning/code/sample-notebook.ipynb) shows how to build
+an end-to-end data prep pipeline for fine-tuning code LLMs. Similarly, this
+[notebook](examples/notebooks/fine%20tuning/language/demo_with_launcher.ipynb) shows an
+end-to-end sample pipeline for preparing language datasets for fine-tuning.
+You can also explore how to build a RAG pipeline [here](examples/notebooks/rag).
 
 ### Current list of transforms 
 The matrix below shows the the combination of modules and supported runtimes. All the modules can be accessed [here](transforms) and can be combined to form data processing pipelines, as shown in the [examples](examples) folder. 

From 37cd7a9cf5802c1266eb3c48ffa1f15adc72a3fc Mon Sep 17 00:00:00 2001
From: Shahrokh Daijavad <shahrokhDaijavad@users.noreply.github.com>
Date: Mon, 18 Nov 2024 11:25:03 -0800
Subject: [PATCH 4/8] Update README.md

transform => transforms
---
 transforms/universal/web2parquet/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/transforms/universal/web2parquet/README.md b/transforms/universal/web2parquet/README.md
index 1841403a7e..98af08cc66 100644
--- a/transforms/universal/web2parquet/README.md
+++ b/transforms/universal/web2parquet/README.md
@@ -30,7 +30,7 @@ pip install data-prep-toolkit-transform[web2parquet]>=0.2.2.dev3
 If working from a fork in the git repo, from the root folder of the git repo, do the following:
 
 ```
-cd transform/universal/web2parquet
+cd transforms/universal/web2parquet
 make venv
 source venv/bin/activate
 pip install -r requirements.txt
@@ -49,4 +49,4 @@ Web2Parquet(urls= ['https://thealliance.ai/'],
                     depth=2, 
                     downloads=10,
                     folder='downloads').transform()
-````
\ No newline at end of file
+````

From abe6b7a6f9bc01adea9f65a6e05bddb30f336397 Mon Sep 17 00:00:00 2001
From: Shahrokh Daijavad <shahrokhDaijavad@users.noreply.github.com>
Date: Mon, 18 Nov 2024 11:37:52 -0800
Subject: [PATCH 5/8] Update README in web2parquet

syntax issues
---
 transforms/universal/web2parquet/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/transforms/universal/web2parquet/README.md b/transforms/universal/web2parquet/README.md
index 98af08cc66..552b22a256 100644
--- a/transforms/universal/web2parquet/README.md
+++ b/transforms/universal/web2parquet/README.md
@@ -24,7 +24,7 @@ The transform can be installed directly from pypi and has a dependency on the da
 ```
 pip install data-prep-connector
 pip install data-prep-toolkit>=0.2.2.dev2
-pip install data-prep-toolkit-transform[web2parquet]>=0.2.2.dev3
+pip install 'data-prep-toolkit-transforms[web2parquet]>=0.2.2.dev3'
 ```
 
 If working from a fork in the git repo, from the root folder of the git repo, do the following:

From ab7475ad487500ad4be61ebf6328e6a90bc5fb6c Mon Sep 17 00:00:00 2001
From: Shahrokh Daijavad <shahrokhDaijavad@users.noreply.github.com>
Date: Mon, 18 Nov 2024 11:52:34 -0800
Subject: [PATCH 6/8] Update README.md for the web2parquet

A few syntax changes
---
 transforms/universal/web2parquet/README.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/transforms/universal/web2parquet/README.md b/transforms/universal/web2parquet/README.md
index 552b22a256..1a8ecb4086 100644
--- a/transforms/universal/web2parquet/README.md
+++ b/transforms/universal/web2parquet/README.md
@@ -21,6 +21,14 @@ For configuring the crawl, users need to specify the following parameters:
 
 The transform can be installed directly from pypi and has a dependency on the data-prep-toolkit and the data-prep-connector
 
+Set up a local environment to run Jupyter notebooks:
+```
+python -m venv venv
+source venv/bin/activate
+pip install jupyterlab
+```
+Install prerequisites:
+
 ```
 pip install data-prep-connector
 pip install data-prep-toolkit>=0.2.2.dev2

From fca82b0ac7b64cf2cd1f24f6ec9eb250da3d2737 Mon Sep 17 00:00:00 2001
From: Shahrokh Daijavad <shahrokhDaijavad@users.noreply.github.com>
Date: Mon, 18 Nov 2024 12:01:55 -0800
Subject: [PATCH 7/8] Update README-list.md

---
 transforms/README-list.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/transforms/README-list.md b/transforms/README-list.md
index 3e70b6b62a..5567a55761 100644
--- a/transforms/README-list.md
+++ b/transforms/README-list.md
@@ -36,8 +36,13 @@ Note: This list includes the transforms that were part of the release starting w
 	* [tokenization](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/tokenization/python/README.md)
 	* [doc_id](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/doc_id/python/README.md)
 	* [web2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/web2parquet/README.md)
+   
+## Release notes:
 
-	
+### 0.2.2.dev3 
+* web2parquet
+### 0.2.2.dev2
+* pdf2parquet now supports HTML,DOCX,PPTX in addition to PDF
 
 
 

From 3563b0c7879c32e3769f8bc834f175a7f1ad4dca Mon Sep 17 00:00:00 2001
From: Shahrokh Daijavad <shahrokhDaijavad@users.noreply.github.com>
Date: Mon, 18 Nov 2024 12:02:54 -0800
Subject: [PATCH 8/8] Update README-list.md

---
 transforms/README-list.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/transforms/README-list.md b/transforms/README-list.md
index 5567a55761..8040dc7a98 100644
--- a/transforms/README-list.md
+++ b/transforms/README-list.md
@@ -42,7 +42,7 @@ Note: This list includes the transforms that were part of the release starting w
 ### 0.2.2.dev3 
 * web2parquet
 ### 0.2.2.dev2
-* pdf2parquet now supports HTML,DOCX,PPTX in addition to PDF
+* pdf2parquet now supports HTML, DOCX, PPTX, ... in addition to PDF