Commit

merge with dev
Signed-off-by: Maroun Touma <[email protected]>
touma-I committed Dec 4, 2024
2 parents 38f7e25 + 6866f78 commit 7b2381d
Showing 63 changed files with 1,124 additions and 109 deletions.
2 changes: 1 addition & 1 deletion .make.versions
@@ -66,4 +66,4 @@ endif
 #
 # If you change the versions numbers, be sure to run "make set-versions" to
 # update version numbers across the transform (e.g., pyproject.toml).
-TRANSFORMS_PKG_VERSION=0.2.3.dev0
+TRANSFORMS_PKG_VERSION=0.2.3.dev1
1 change: 1 addition & 0 deletions data-processing-lib/python/requirements.txt
@@ -4,3 +4,4 @@
 argparse
 mmh3
 psutil
+polars>=1.9.0
Original file line number Diff line number Diff line change
@@ -11,6 +11,7 @@
 ################################################################################

 import hashlib
+import io
 import os
 import string
 import sys
@@ -144,8 +145,21 @@ def convert_binary_to_arrow(data: bytes, schema: pa.schema = None) -> pa.Table:
             table = pq.read_table(reader, schema=schema)
             return table
         except Exception as e:
-            logger.error(f"Failed to convert byte array to arrow table, exception {e}. Skipping it")
-            return None
+            logger.warning(f"Could not convert bytes to pyarrow: {e}")
+
+        # We have seen this exception before when using pyarrow, but polars does not throw it.
+        # "Nested data conversions not implemented for chunked array outputs"
+        # See issue 816 https://github.com/IBM/data-prep-kit/issues/816.
+        logger.info(f"Attempting read of pyarrow Table using polars")
+        try:
+            import polars
+
+            df = polars.read_parquet(io.BytesIO(data))
+            table = df.to_arrow()
+        except Exception as e:
+            logger.error(f"Could not convert bytes to pyarrow using polars: {e}. Skipping.")
+            table = None
+        return table

     @staticmethod
     def convert_arrow_to_binary(table: pa.Table) -> bytes:
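The hunk above replaces an early `return None` with a second attempt through polars. A minimal, library-free sketch of that try-then-fall-back shape might look like the following. `read_with_fallback`, `strict_reader`, and `lenient_reader` are hypothetical names used only for illustration, and the stand-in readers do not parse Parquet.

```python
import logging
from typing import Any, Callable, Optional, Sequence, Tuple

logger = logging.getLogger(__name__)

Reader = Callable[[bytes], Any]


def read_with_fallback(data: bytes, readers: Sequence[Tuple[str, Reader]]) -> Optional[Any]:
    """Try each (name, reader) pair in order and return the first result.

    Returns None if every reader raises, mirroring the "Skipping" behaviour
    in the commit. All names here are illustrative, not part of DPK.
    """
    for position, (name, reader) in enumerate(readers):
        try:
            return reader(data)
        except Exception as e:
            is_last = position == len(readers) - 1
            # Warn while a fallback is still available; log an error on the last failure.
            log = logger.error if is_last else logger.warning
            log(f"Could not convert bytes using {name}: {e}")
    return None


# Stand-in readers (assumptions): the first always fails, the second succeeds.
def strict_reader(data: bytes) -> str:
    raise ValueError("Nested data conversions not implemented for chunked array outputs")


def lenient_reader(data: bytes) -> str:
    return data.decode("utf-8")


result = read_with_fallback(b"abc", [("strict", strict_reader), ("lenient", lenient_reader)])
```

In the commit itself the two attempts are `pq.read_table(reader, schema=schema)` and `polars.read_parquet(io.BytesIO(data))` followed by `df.to_arrow()`; the sketch only mirrors the control flow.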
2 changes: 1 addition & 1 deletion kfp/kfp_support_lib/shared_workflow_support/pyproject.toml
@@ -14,7 +14,7 @@ authors = [
 dependencies = [
 "requests",
 "kubernetes",
-"data-prep-toolkit[ray]==0.2.3.dev0",
+"data-prep-toolkit[ray]>=0.2.3.dev0",
 ]

 [build-system]
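This hunk, like most of the ones that follow, relaxes an exact `==0.2.3.dev0` pin to a `>=0.2.3.dev0` lower bound. The sketch below illustrates the practical difference for versions of the form `N.N.N` or `N.N.N.devM` only. It is a simplified stand-in for PEP 440 ordering, not pip's actual resolver, and `parse_version` and `satisfies` are hypothetical helper names.

```python
import re
from typing import Tuple

# Accepts only dotted numeric releases with an optional ".devN" suffix (an assumption
# made to keep the sketch small; real version strings can carry more segments).
_VERSION_RE = re.compile(r"^(\d+(?:\.\d+)*)(?:\.dev(\d+))?$")


def parse_version(text: str) -> Tuple[Tuple[int, ...], float]:
    """Turn 'N.N.N' or 'N.N.N.devM' into a sortable key."""
    match = _VERSION_RE.match(text)
    if match is None:
        raise ValueError(f"unsupported version string: {text!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    # Trailing zeros do not change the release, so 1.0 and 1.0.0 compare equal.
    while len(release) > 1 and release[-1] == 0:
        release = release[:-1]
    # A dev release sorts before the final release with the same numbers.
    dev = float(match.group(2)) if match.group(2) is not None else float("inf")
    return release, dev


def satisfies(candidate: str, operator: str, pinned: str) -> bool:
    """Check a candidate version against a single '==' or '>=' requirement."""
    left, right = parse_version(candidate), parse_version(pinned)
    if operator == "==":
        return left == right
    if operator == ">=":
        return left >= right
    raise ValueError(f"unsupported operator: {operator!r}")


old_pin_accepts_dev1 = satisfies("0.2.3.dev1", "==", "0.2.3.dev0")
new_bound_accepts_dev1 = satisfies("0.2.3.dev1", ">=", "0.2.3.dev0")
```

Under these assumptions the old pin rejects `0.2.3.dev1`, the version that `.make.versions` moves to in this commit, while the new lower bound accepts it.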
34 changes: 29 additions & 5 deletions resources.md
@@ -1,3 +1,8 @@
# New Features & Enhancements

- Support for Docling 2.0 added to DPK in [pdf2parquet](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/pdf2parquet/python) transform. The new updates allow DPK users to ingest other type of documents, e.g. MS Word, MS Powerpoint, Images, Markdown, Asciidocs, etc.
- Released [Web2parquet](https://github.com/IBM/data-prep-kit/tree/dev/transforms/universal/web2parquet) transform for crawling the web.

# Data Prep Kit Resources

## 📄 Papers
@@ -7,24 +12,43 @@
3. [Scaling Granite Code Models to 128K Context](https://arxiv.org/abs/2407.13739)


## 🎤 Talks
## 🎤 External Events and Showcase

1. **"Building Successful LLM Apps: The Power of high quality data"** - [Video](https://www.youtube.com/watch?v=u_2uiZBBVIE) | [Slides](https://www.slideshare.net/slideshow/data_prep_techniques_challenges_methods-pdf-a190/271527890)
2. **"Hands on session for fine tuning LLMs"** - [Video](https://www.youtube.com/watch?v=VEHIA3E64DM)
3. **"Build your own data preparation module using data-prep-kit"** - [Video](https://www.youtube.com/watch?v=0WUMG6HIgMg)
4. **"Data Prep Kit: A Comprehensive Cloud-Native Toolkit for Scalable Data Preparation in GenAI App"** - [Video](https://www.youtube.com/watch?v=WJ147TGULwo) | [Slides](https://ossaidevjapan24.sched.com/event/1jKBm)
5. **"RAG with Data Prep Kit" Workshop** @ Mountain View, CA, USA ** - [info](https://github.com/sujee/data-prep-kit-examples/blob/main/events/2024-09-21__RAG-workshop-data-riders.md)
6. **Tech Educator summit** [IBM CSR Event](https://www.linkedin.com/posts/aanchalaggarwal_github-ibmdata-prep-kit-open-source-project-activity-7254062098295472128-OA_x?utm_source=share&utm_medium=member_desktop)
7. **Talk and Hands on session** at [MIT Bangalore](https://www.linkedin.com/posts/saptha-surendran-71a4a0ab_ibmresearch-dataprepkit-llms-activity-7261987741087801346-h0no?utm_source=share&utm_medium=member_desktop)
8. **PyData NYC 2024** - [90 mins Tutorial](https://nyc2024.pydata.org/cfp/talk/AWLTZP/)
9. **Open Source AI** [Demo Night](https://lu.ma/oss-ai?tk=A8BgIt)
10. [**Data Exchange Podcast with Ben Lorica**](https://thedataexchange.media/ibm-data-prep-kit/)
11. Unstructured Data Meetup - SF, NYC, Silicon Valley
12. IBM TechXchange Las Vegas
13. Open Source [**RAG Pipeline workshop**](https://www.linkedin.com/posts/sujeemaniyam_dataprepkit-workshop-llm-activity-7256176802383986688-2UKc?utm_source=share&utm_medium=member_desktop) with Data Prep Kit at TechEquity's AI Summit in Silicon Valley
14. **Data Science Dojo Meetup** - [video](https://datasciencedojo.com/tutorial/data-preparation-toolkit/)
15. [**DPK tutorial and hands on session at IIIT Delhi**](https://www.linkedin.com/posts/cai-iiitd-97a6a4232_datascience-datapipelines-machinelearning-activity-7263121565125349376-FG8E?utm_source=share&utm_medium=member_desktop)


## Example Code
Find example code in readme section of each tranform and some sample jupyter notebooks for getting started [**here**](examples/notebooks)

## Blogs / Tutorials

- [**IBM Developer Blog**](https://developer.ibm.com/blogs/awb-unleash-potential-llms-data-prep-kit/)
- [**Introductory Blog on DPK**](https://www.linkedin.com/pulse/unleashing-potential-large-language-models-through-data-aanchal-goyal-fgtff)
- [**DPK Header Cleanser Module Blog by external contributor**](https://www.linkedin.com/pulse/enhancing-data-quality-developing-header-cleansing-tool-kalathiya-i1ohc/?trackingId=6iAeBkBBRrOLijg3LTzIGA%3D%3D)


## Workshops
# Relevant online communities

- **2024-09-21: "RAG with Data Prep Kit" Workshop** @ Mountain View, CA, USA - [info](https://github.com/sujee/data-prep-kit-examples/blob/main/events/2024-09-21__RAG-workshop-data-riders.md)
- [**Data Prep Kit Discord Channel**](https://discord.com/channels/1276554812359442504/1303454647427661866)
- [**DPK is now listed in Github Awesome-LLM under LLM Data section**](https://github.com/Hannibal046/Awesome-LLM)
- [**DPK is now up for access via IBM Skills Build Download**](https://academic.ibm.com/a2mt/downloads/artificial_intelligence#/)
- [**DPK added to the Application Hub of “AI Sustainability Catalog”**](https://enterprise-neurosystem.github.io/Sustainability-Catalog/)

## Discord
## We Want Your Feedback!
Feel free to contribute to discussions or create a new one to share your [feedback](https://github.com/IBM/data-prep-kit/discussions)

- [**Data Prep Kit Discord Channel**](https://discord.com/channels/1276554812359442504/1286046139921207476)

2 changes: 1 addition & 1 deletion transforms/Makefile
@@ -107,7 +107,7 @@ build-pkg-dist:
 -rm -fr src
 mkdir src
 # Copy all the src folders recursively (not clear if they have subfolders)
-for x in $(shell find . | grep '[ray| python]/src$$') ; do \
+for x in $(shell find . | grep '[ray| python | spark]/src$$') ; do \
 echo $$x ; \
 if [ -d "$$x" ]; then \
 cp -r $$x/* src ; \
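One thing worth checking in this hunk: in grep's regular expression syntax, square brackets delimit a bracket expression that matches a single character from the set, not a choice between the words `ray`, `python`, and `spark`. The sketch below uses Python's `re` module, whose character-set syntax behaves the same way for this pattern, to compare what the committed pattern accepts with what a grouped alternation accepts. The sample paths and the alternation pattern are illustrative assumptions, not part of the commit.

```python
import re

# The committed pattern, after make expands "$$" to "$".
committed = re.compile(r"[ray| python | spark]/src$")
# The pattern before this commit, for comparison.
previous = re.compile(r"[ray| python]/src$")
# A hypothetical alternative that matches whole directory names.
alternation = re.compile(r"/(ray|python|spark)/src$")

# Illustrative paths (assumptions), not taken from the repository listing.
paths = [
    "./universal/doc_id/ray/src",
    "./universal/doc_id/python/src",
    "./universal/doc_id/spark/src",
    "./universal/doc_id/kfp/src",
]


def matching(pattern, candidates):
    """Return the candidates that the pattern matches anywhere in the string."""
    return [path for path in candidates if pattern.search(path)]


committed_hits = matching(committed, paths)
previous_hits = matching(previous, paths)
alternation_hits = matching(alternation, paths)
```

With these inputs, the committed character set also accepts the path ending in `kfp/src` because `p` is in the set, while the earlier set rejects `spark/src` because `k` is not. The alternation accepts only the three named directories.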
2 changes: 1 addition & 1 deletion transforms/code/code2parquet/python/requirements.txt
@@ -1,3 +1,3 @@
-data-prep-toolkit==0.2.3.dev0
+data-prep-toolkit>=0.2.3.dev0
 parameterized
 pandas
2 changes: 1 addition & 1 deletion transforms/code/code2parquet/ray/pyproject.toml
@@ -10,7 +10,7 @@ authors = [
 { name = "Boris Lublinsky", email = "[email protected]" },
 ]
 dependencies = [
-"data-prep-toolkit[ray]==0.2.3.dev0",
+"data-prep-toolkit[ray]>=0.2.3.dev0",
 "dpk-code2parquet-transform-python==0.2.3.dev0",
 "parameterized",
 "pandas",
2 changes: 1 addition & 1 deletion transforms/code/code_profiler/python/requirements.txt
@@ -1,4 +1,4 @@
-data-prep-toolkit==0.2.3.dev0
+data-prep-toolkit>=0.2.3.dev0
 parameterized
 pandas
 aiolimiter==1.1.0
2 changes: 1 addition & 1 deletion transforms/code/code_profiler/ray/pyproject.toml
@@ -10,7 +10,7 @@ authors = [
 ]
 dependencies = [
 "dpk-code-profiler-transform-python==0.2.3.dev0",
-"data-prep-toolkit[ray]==0.2.3.dev0",
+"data-prep-toolkit[ray]>=0.2.3.dev0",
 ]

 [build-system]
2 changes: 1 addition & 1 deletion transforms/code/code_quality/python/requirements.txt
@@ -1,3 +1,3 @@
-data-prep-toolkit==0.2.3.dev0
+data-prep-toolkit>=0.2.3.dev0
 bs4==0.0.2
 transformers==4.38.2
2 changes: 1 addition & 1 deletion transforms/code/code_quality/ray/pyproject.toml
@@ -10,7 +10,7 @@ authors = [
 ]
 dependencies = [
 "dpk-code-quality-transform-python==0.2.3.dev0",
-"data-prep-toolkit[ray]==0.2.3.dev0",
+"data-prep-toolkit[ray]>=0.2.3.dev0",
 ]

 [build-system]
2 changes: 1 addition & 1 deletion transforms/code/header_cleanser/python/requirements.txt
@@ -1,3 +1,3 @@
-data-prep-toolkit==0.2.3.dev0
+data-prep-toolkit>=0.2.3.dev0
 scancode-toolkit==32.1.0 ; platform_system != 'Darwin'

2 changes: 1 addition & 1 deletion transforms/code/header_cleanser/ray/pyproject.toml
@@ -10,7 +10,7 @@ authors = [
 ]
 dependencies = [
 "dpk-header-cleanser-transform-python==0.2.3.dev0",
-"data-prep-toolkit[ray]==0.2.3.dev0",
+"data-prep-toolkit[ray]>=0.2.3.dev0",
 "scancode-toolkit==32.1.0",
 ]

2 changes: 1 addition & 1 deletion transforms/code/license_select/python/requirements.txt
@@ -1 +1 @@
-data-prep-toolkit==0.2.3.dev0
+data-prep-toolkit>=0.2.3.dev0
2 changes: 1 addition & 1 deletion transforms/code/license_select/ray/pyproject.toml
@@ -11,7 +11,7 @@ authors = [
 ]
 dependencies = [
 "dpk-license-select-transform-python==0.2.3.dev0",
-"data-prep-toolkit[ray]==0.2.3.dev0",
+"data-prep-toolkit[ray]>=0.2.3.dev0",
 ]

 [build-system]
2 changes: 1 addition & 1 deletion transforms/code/malware/python/pyproject.toml
@@ -9,7 +9,7 @@ authors = [
 { name = "Takuya Goto", email = "[email protected]" },
 ]
 dependencies = [
-"data-prep-toolkit==0.2.3.dev0",
+"data-prep-toolkit>=0.2.3.dev0",
 "clamd==1.0.2",
 ]

2 changes: 1 addition & 1 deletion transforms/code/malware/ray/pyproject.toml
@@ -10,7 +10,7 @@ authors = [
 ]
 dependencies = [
 "dpk-malware-transform-python==0.2.3.dev0",
-"data-prep-toolkit[ray]==0.2.3.dev0",
+"data-prep-toolkit[ray]>=0.2.3.dev0",
 ]

 [build-system]
2 changes: 1 addition & 1 deletion transforms/code/proglang_select/python/requirements.txt
@@ -1 +1 @@
-data-prep-toolkit==0.2.3.dev0
+data-prep-toolkit>=0.2.3.dev0
2 changes: 1 addition & 1 deletion transforms/code/proglang_select/ray/pyproject.toml
@@ -10,7 +10,7 @@ authors = [
 ]
 dependencies = [
 "dpk-proglang-select-transform-python==0.2.3.dev0",
-"data-prep-toolkit[ray]==0.2.3.dev0",
+"data-prep-toolkit[ray]>=0.2.3.dev0",
 ]

 [build-system]
2 changes: 1 addition & 1 deletion transforms/code/repo_level_ordering/ray/pyproject.toml
@@ -11,7 +11,7 @@ authors = [
 { name = "Shanmukha Guttula", email = "[email protected]" },
 ]
 dependencies = [
-"data-prep-toolkit[ray]==0.2.3.dev0",
+"data-prep-toolkit[ray]>=0.2.3.dev0",
 "networkx==3.3",
 "colorlog==6.8.2",
 "func-timeout==4.3.5",
2 changes: 1 addition & 1 deletion transforms/language/doc_chunk/python/requirements.txt
@@ -1,4 +1,4 @@
-data-prep-toolkit==0.2.3.dev0
+data-prep-toolkit>=0.2.3.dev0
 docling-core==2.3.0
 pydantic>=2.0.0,<2.10.0
 llama-index-core>=0.11.22,<0.12.0
2 changes: 1 addition & 1 deletion transforms/language/doc_chunk/ray/pyproject.toml
@@ -12,7 +12,7 @@ authors = [
 ]
 dependencies = [
 "dpk-doc-chunk-transform-python==0.3.0",
-"data-prep-toolkit[ray]==0.2.3.dev0",
+"data-prep-toolkit[ray]>=0.2.3.dev0",
 ]

 [build-system]
2 changes: 1 addition & 1 deletion transforms/language/doc_quality/python/requirements.txt
@@ -1,2 +1,2 @@

-data-prep-toolkit==0.2.3.dev0
+data-prep-toolkit>=0.2.3.dev0
2 changes: 1 addition & 1 deletion transforms/language/doc_quality/ray/pyproject.toml
@@ -10,7 +10,7 @@ authors = [
 ]
 dependencies = [
 "dpk-doc_quality-transform-python==0.2.3.dev0",
-"data-prep-toolkit[ray]==0.2.3.dev0",
+"data-prep-toolkit[ray]>=0.2.3.dev0",
 ]

 [build-system]
2 changes: 1 addition & 1 deletion transforms/language/html2parquet/python/requirements.txt
@@ -1,2 +1,2 @@
-data-prep-toolkit==0.2.3.dev0
+data-prep-toolkit>=0.2.3.dev0
 trafilatura==1.12.0
2 changes: 1 addition & 1 deletion transforms/language/html2parquet/ray/requirements.txt
@@ -1,3 +1,3 @@
 dpk-html2parquet-transform-python==0.2.3.dev0
-data-prep-toolkit[ray]==0.2.3.dev0
+data-prep-toolkit[ray]>=0.2.3.dev0
 trafilatura==1.12.0
2 changes: 1 addition & 1 deletion transforms/language/lang_id/python/requirements.txt
@@ -1,4 +1,4 @@
-data-prep-toolkit==0.2.3.dev0
+data-prep-toolkit>=0.2.3.dev0
 fasttext==0.9.2
 langcodes==3.3.0
 huggingface-hub >= 0.21.4, <1.0.0
2 changes: 1 addition & 1 deletion transforms/language/lang_id/ray/pyproject.toml
@@ -10,7 +10,7 @@ authors = [
 ]
 dependencies = [
 "dpk-lang_id-transform-python==0.2.3.dev0",
-"data-prep-toolkit[ray]==0.2.3.dev0",
+"data-prep-toolkit[ray]>=0.2.3.dev0",
 ]

 [build-system]
2 changes: 1 addition & 1 deletion transforms/language/pdf2parquet/python/requirements.txt
@@ -1,4 +1,4 @@
-data-prep-toolkit==0.2.3.dev0
+data-prep-toolkit>=0.2.3.dev0
 docling-core==2.3.0
 docling-ibm-models==2.0.3
 deepsearch-glm==0.26.1
2 changes: 1 addition & 1 deletion transforms/language/pdf2parquet/ray/requirements.txt
@@ -1,5 +1,5 @@
 dpk-pdf2parquet-transform-python==0.3.0
-data-prep-toolkit[ray]==0.2.3.dev0
+data-prep-toolkit[ray]>=0.2.3.dev0
 # docling-core==1.7.2
 # docling-ibm-models==2.0.0
 # deepsearch-glm==0.22.0
2 changes: 1 addition & 1 deletion transforms/language/pii_redactor/python/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "dpk_pii_redactor_transform_python"
-version = "0.2.2.dev2"
+version = "0.2.3.dev0"
 requires-python = ">=3.10,<3.13"
 description = "PII redactor Transform for Python"
 license = {text = "Apache-2.0"}
2 changes: 1 addition & 1 deletion transforms/language/pii_redactor/python/requirements.txt
@@ -1,4 +1,4 @@
-data-prep-toolkit==0.2.3.dev0
+data-prep-toolkit>=0.2.3.dev0
 presidio-analyzer>=2.2.355
 presidio-anonymizer>=2.2.355
 flair>=0.14.0
2 changes: 1 addition & 1 deletion transforms/language/pii_redactor/ray/pyproject.toml
@@ -11,7 +11,7 @@ authors = [
 ]
 dependencies = [
 "dpk_pii_redactor_transform_python==0.2.3.dev0",
-"data-prep-toolkit[ray]==0.2.3.dev0",
+"data-prep-toolkit[ray]>=0.2.3.dev0",
 "presidio-analyzer>=2.2.355",
 "presidio-anonymizer>=2.2.355",
 "flair>=0.14.0",
2 changes: 1 addition & 1 deletion transforms/language/text_encoder/requirements.txt
@@ -1,2 +1,2 @@
-data-prep-toolkit==0.2.3.dev0
+data-prep-toolkit>=0.2.3.dev0
 sentence-transformers==3.0.1
2 changes: 1 addition & 1 deletion transforms/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "data_prep_toolkit_transforms"
-version = "1.0.1.dev0"
+version = "0.2.3.dev1"
 requires-python = ">=3.10,<3.13"
 keywords = ["transforms", "data preprocessing", "data preparation", "llm", "generative", "ai", "fine-tuning", "llmapps" ]
 description = "Data Preparation Toolkit Transforms using Ray"
2 changes: 1 addition & 1 deletion transforms/universal/doc_id/python/requirements.txt
@@ -1 +1 @@
-data-prep-toolkit==0.2.3.dev0
+data-prep-toolkit>=0.2.3.dev0
2 changes: 1 addition & 1 deletion transforms/universal/doc_id/ray/pyproject.toml
@@ -11,7 +11,7 @@ authors = [
 ]
 dependencies = [
 "dpk_doc_id_transform_python==0.2.3.dev0",
-"data-prep-toolkit[ray]==0.2.3.dev0",
+"data-prep-toolkit[ray]>=0.2.3.dev0",
 ]

 [build-system]
2 changes: 1 addition & 1 deletion transforms/universal/ededup/python/requirements.txt
@@ -1,3 +1,3 @@
-data-prep-toolkit==0.2.3.dev0
+data-prep-toolkit>=0.2.3.dev0
 mmh3>=4.1.0
 xxhash==3.4.1
2 changes: 1 addition & 1 deletion transforms/universal/ededup/ray/pyproject.toml
@@ -10,7 +10,7 @@ authors = [
 { name = "Boris Lublinsky", email = "[email protected]" },
 ]
 dependencies = [
-"data-prep-toolkit[ray]==0.2.3.dev0",
+"data-prep-toolkit[ray]>=0.2.3.dev0",
 "dpk_ededup_transform_python==0.2.3.dev0",
 "tqdm==4.66.3",
 ]