Build transforms wheel #493

Merged
merged 35 commits into from
Sep 24, 2024
Changes from 33 commits
ddedccc
pytest for most but faild for pdf2parquet and resize
touma-I Aug 8, 2024
134c2c2
header_cleanser available for non Darwin installation
touma-I Aug 9, 2024
e227b9d
Added make file with unit tests for all transforms
touma-I Aug 10, 2024
d155a24
Build and deploy pypi package using makefile
touma-I Aug 12, 2024
fe0d380
remove pandas and keep pandas=2.2.2
touma-I Aug 13, 2024
b1e2707
Merge branch 'dev' into build-transforms-wheel
touma-I Aug 21, 2024
d96477a
added PII and HTML2Parquet
touma-I Aug 21, 2024
3c57682
Merge branch 'dev' into build-transforms-wheel
touma-I Aug 28, 2024
35c7e60
Update with latest available transforms
touma-I Aug 28, 2024
d66df27
restructure things to be able to test and build independently
touma-I Aug 28, 2024
a4f7e0a
publish dev2 for python
touma-I Aug 30, 2024
0af03cc
finish testing and publish dev2 python and ray packages:
touma-I Aug 30, 2024
65f4ac4
merge with dev
touma-I Sep 8, 2024
703ebe0
try different dependencies in attempt to resolve conflicts
touma-I Sep 9, 2024
d54708a
added pii redactor
touma-I Sep 11, 2024
e1309e5
Merge branch 'dev' into build-transforms-wheel
touma-I Sep 11, 2024
109ea29
updated with latest release for pdf2parquet
touma-I Sep 11, 2024
a874358
fixes for dev3 release
touma-I Sep 11, 2024
440975d
simplify test-src by using exiting targets
touma-I Sep 12, 2024
10c9159
Merge branch 'dev' into build-transforms-wheel
touma-I Sep 23, 2024
db60963
renamed requirement files
touma-I Sep 23, 2024
9bb36c5
use - in transform library name
touma-I Sep 23, 2024
ee63628
update requirements.txt files as appropriate when setting versions
touma-I Sep 23, 2024
346b82e
Added readme file to test, build and publish package to pypi
touma-I Sep 23, 2024
071836e
fix typos and removed double quotes
touma-I Sep 23, 2024
07b827f
Apply version update to all transforms
touma-I Sep 23, 2024
0369842
generate workflow for packaging folder
touma-I Sep 23, 2024
aa297e0
fix ededup dummy version
daw3rd Sep 23, 2024
32578d5
fix filter/spark dummy version
daw3rd Sep 23, 2024
3b52ecf
updated requirements based on latest release for docling
touma-I Sep 23, 2024
eb499de
Merge branch 'build-transforms-wheel' of github.com:IBM/data-prep-kit…
touma-I Sep 23, 2024
8d69b71
fix missing steps
touma-I Sep 23, 2024
d27a1c2
-sUpdate makefile to build and publish wheels
touma-I Sep 23, 2024
e155d2c
update based on reviewers comments
touma-I Sep 24, 2024
33b8853
fix packaging test workflow paths
daw3rd Sep 24, 2024
60 changes: 60 additions & 0 deletions .github/workflows/test-packaging-python.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
#
# DO NOT EDIT THIS FILE: it is generated from test-transform.template, Edit there and run make to change these files
touma-I marked this conversation as resolved.
#
name: Test - transforms/packaging/python

on:
    workflow_dispatch:
    push:
        branches:
            - "dev"
            - "releases/**"
        tags:
            - "*"
        paths:
            - "transforms/packaging/python/**"
            - "data-processing-lib/**"
Member

If you don't include the transforms, you should probably not include the core library either.

Member

I pushed a fix for this.

            - "!transforms/packaging/python/**/kfp_ray/**" # This is/will be tested in separate workflow
            - "!data-processing-lib/**/test/**"
            - "!data-processing-lib/**/test-data/**"
            - "!**.md"
            - "!**/doc/**"
            - "!**/images/**"
            - "!**.gitignore"
    pull_request:
        branches:
            - "dev"
            - "releases/**"
        paths:
            - "transforms/packaging/python/**"
            - "data-processing-lib/**"
            - "!transforms/packaging/python/**/kfp_ray/**" # This is/will be tested in separate workflow
            - "!data-processing-lib/**/test/**"
            - "!data-processing-lib/**/test-data/**"
            - "!**.md"
            - "!**/doc/**"
            - "!**/images/**"
            - "!**.gitignore"

jobs:
    test-src:
        runs-on: ubuntu-22.04
        steps:
            - name: Checkout
              uses: actions/checkout@v4
            - name: Free up space in github runner
              # Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
              run: |
                  df -h
                  sudo rm -rf "/usr/local/share/boost"
                  sudo rm -rf "$AGENT_TOOLSDIRECTORY"
                  sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/local/.ghcup
                  sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
                  df -h
            - name: Test transform source in transforms/packaging/python
              run: |
                  if [ -e "transforms/packaging/python/Makefile" ]; then
                      make -C transforms/packaging/python DOCKER=docker test-src
                  else
                      echo "transforms/packaging/python/Makefile not found - source testing disabled for this transform."
                  fi
60 changes: 60 additions & 0 deletions .github/workflows/test-packaging-ray.yml
@@ -0,0 +1,60 @@
#
# DO NOT EDIT THIS FILE: it is generated from test-transform.template, Edit there and run make to change these files
#
name: Test - transforms/packaging/ray

on:
    workflow_dispatch:
    push:
        branches:
            - "dev"
            - "releases/**"
        tags:
            - "*"
        paths:
            - "transforms/packaging/ray/**"
            - "data-processing-lib/**"
Member

Same here: the library is not needed.

Member

I pushed a fix for this.

            - "!transforms/packaging/ray/**/kfp_ray/**" # This is/will be tested in separate workflow
            - "!data-processing-lib/**/test/**"
            - "!data-processing-lib/**/test-data/**"
            - "!**.md"
            - "!**/doc/**"
            - "!**/images/**"
            - "!**.gitignore"
    pull_request:
        branches:
            - "dev"
            - "releases/**"
        paths:
            - "transforms/packaging/ray/**"
            - "data-processing-lib/**"
            - "!transforms/packaging/ray/**/kfp_ray/**" # This is/will be tested in separate workflow
            - "!data-processing-lib/**/test/**"
            - "!data-processing-lib/**/test-data/**"
            - "!**.md"
            - "!**/doc/**"
            - "!**/images/**"
            - "!**.gitignore"

jobs:
    test-src:
        runs-on: ubuntu-22.04
        steps:
            - name: Checkout
              uses: actions/checkout@v4
            - name: Free up space in github runner
              # Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
              run: |
                  df -h
                  sudo rm -rf "/usr/local/share/boost"
                  sudo rm -rf "$AGENT_TOOLSDIRECTORY"
                  sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/local/.ghcup
                  sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
                  df -h
            - name: Test transform source in transforms/packaging/ray
              run: |
                  if [ -e "transforms/packaging/ray/Makefile" ]; then
                      make -C transforms/packaging/ray DOCKER=docker test-src
                  else
                      echo "transforms/packaging/ray/Makefile not found - source testing disabled for this transform."
                  fi
15 changes: 14 additions & 1 deletion .make.defaults
@@ -480,7 +480,8 @@ endif
if [ -e requirements.txt ]; then \
echo Installing requirements from requirements.txt; \
pip install $(PIP_INSTALL_EXTRA_ARGS) $$extra_url -r requirements.txt; \
elif [ -e pyproject.toml ]; then \
fi; \
if [ -e pyproject.toml ]; then \
Member

So you want to allow dependency installation from both `requirements.txt` and `pyproject.toml`. That can be confusing.
I'd add a WARNING message if both files exist, and state the installation order in the message: first from `requirements.txt`, and after that from `pyproject.toml`.

Collaborator Author

Thanks @roytman. The situation we ran into is that the pyproject.toml for the single package uses a requirements.txt that was compiled from the dependencies listed by all the transforms. I will also be changing every transform's pyproject.toml to use requirements.txt to list its dependencies. But I agree, I need to watch this one closely.

echo Installing from pyproject.toml; \
pip install $(PIP_INSTALL_EXTRA_ARGS) $$extra_url -e .; \
fi
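
The review suggestion in this hunk (warn when both files exist and state the order) could look roughly like the sketch below. It is only an illustration: `describe_install_plan` is a hypothetical helper name that does not exist in `.make.defaults`, and the message wording is an assumption. It only reports what would be installed, following the order used by the recipe in this hunk (`requirements.txt` first, then `pyproject.toml`).

```shell
# Hypothetical helper, not part of .make.defaults: report which dependency
# files exist in a directory and warn when both are present.
describe_install_plan() {
    plan_dir="${1:-.}"
    if [ -e "$plan_dir/requirements.txt" ] && [ -e "$plan_dir/pyproject.toml" ]; then
        # Both files exist: make the installation order explicit.
        echo "WARNING: both requirements.txt and pyproject.toml exist;" \
             "installing from requirements.txt first, then from pyproject.toml"
    elif [ -e "$plan_dir/requirements.txt" ]; then
        echo "Installing requirements from requirements.txt"
    elif [ -e "$plan_dir/pyproject.toml" ]; then
        echo "Installing from pyproject.toml"
    else
        echo "No requirements.txt or pyproject.toml found"
    fi
}
```

Calling `describe_install_plan transforms/packaging/python` before the `pip install` steps would surface the warning without changing what gets installed.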
@@ -587,6 +588,18 @@ MINIO_ADMIN_PWD= localminiosecretkey
> tt.toml; \
mv tt.toml pyproject.toml; \
fi
@if [ -e requirements.txt ]; then \
cat requirements.txt | sed \
-e 's/data-prep-toolkit-ray\([=><~][=]\).*/data-prep-toolkit-ray\1$(DPK_LIB_VERSION)/' \
Member

Can all these sed operations be replaced by a single macro? Lines 582-587 are almost identical to lines 593-599.

Collaborator Author

Yes, almost, but not exactly. I agree, though, and will take a deeper look in a follow-up PR.

-e 's/data-prep-toolkit-transforms\([=><~][=]\).*/data-prep-toolkit-transforms\1$(DPK_TRANSFORMS_VERSION)/' \
-e 's/data-prep-toolkit-spark\([=><~][=]\).*/data-prep-toolkit-spark\1$(DPK_LIB_VERSION)/' \
-e 's/data-prep-toolkit-kfp\([=><~][=]\).*/data-prep-toolkit-kfp\1$(DPK_LIB_KFP_VERSION)/' \
-e 's/data-prep-toolkit\([=><~][=]\).*/data-prep-toolkit\1$(DPK_LIB_VERSION)/' \
-e 's/ray\[default\]\([=><~][=]\).*/ray\[default\]\1$(RAY)/' \
-e 's/data-prep-toolkit-kfp-shared\(..\).*/data-prep-toolkit-kfp-shared\1$(DPK_LIB_KFP_VERSION)/' \
> tt.txt; \
mv tt.txt requirements.txt; \
fi
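
One possible shape for the consolidation raised in the review thread above is sketched here as a plain shell function rather than a make macro. `pin_requirement` is a hypothetical name, and the sketch uses shell pattern matching instead of `sed`, so it is not a drop-in replacement for the recipe. It keeps the two-character operator (`==`, `>=`, `<=`, `~=`) and swaps only the version, and it matches the package name only when the operator follows it directly, so `data-prep-toolkit` does not touch `data-prep-toolkit-ray`.

```shell
# Hypothetical helper: pin_requirement PACKAGE VERSION
# Reads requirement lines on stdin and rewrites "PACKAGE<op>anything" to
# "PACKAGE<op>VERSION", where <op> is one of ==, >=, <=, ~=.
pin_requirement() {
    pkg=$1
    version=$2
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            "$pkg"[=\<\>~]=*)
                rest=${line#"$pkg"}        # e.g. "==0.2.1.dev0"
                op=${rest%"${rest#??}"}    # first two characters, e.g. "=="
                printf '%s\n' "$pkg$op$version"
                ;;
            *)
                # Any other line passes through unchanged.
                printf '%s\n' "$line"
                ;;
        esac
    done
}
```

A recipe could then chain calls, for example `pin_requirement data-prep-toolkit-ray "$DPK_LIB_VERSION" < requirements.txt | pin_requirement data-prep-toolkit "$DPK_LIB_VERSION" > tt.txt`. The `data-prep-toolkit-kfp-shared` rule in the hunk matches any two characters after the name, so it would still need separate handling.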

# Build the distribution, usually in preparation for publishing using ith the .defaults.publish-dist target
.PHONY: .defaults.build-dist
5 changes: 4 additions & 1 deletion .make.versions
@@ -19,7 +19,7 @@ DPK_MINOR_VERSION=2
DPK_MICRO_VERSION=1
# The suffix is generally always set in the main/development branch and only nulled out when creating release branches.
# It can be manually incremented, for example, to allow publishing a new intermediate version wheel to pypi.
DPK_VERSION_SUFFIX=.dev0
DPK_VERSION_SUFFIX=.dev3

DPK_VERSION=$(DPK_MAJOR_VERSION).$(DPK_MINOR_VERSION).$(DPK_MICRO_VERSION)$(DPK_VERSION_SUFFIX)
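
To illustrate how the suffix affects the composed version, here is a small shell sketch that mirrors the `DPK_VERSION` expression above. `compose_version` is a hypothetical helper used only for illustration, and the major version `0` is inferred from the `0.2.1.dev3` strings in the `pyproject.toml` changes of this PR.

```shell
# Hypothetical helper mirroring how DPK_VERSION is assembled in .make.versions.
compose_version() {
    # $1=major  $2=minor  $3=micro  $4=optional suffix such as ".dev3"
    printf '%s.%s.%s%s\n' "$1" "$2" "$3" "${4:-}"
}

compose_version 0 2 1 .dev3   # development branch: prints 0.2.1.dev3
compose_version 0 2 1         # suffix nulled out for a release branch: prints 0.2.1
```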

@@ -103,6 +103,8 @@ PII_REDACTOR_PYTHON_VERSION=$(DPK_VERSION)

HTML2PARQUET_PYTHON_VERSION=$(DPK_VERSION)

DPK_TRANSFORMS_VERSION=$(DPK_VERSION)

################## ################## ################## ################## ################## ##################
# Begin versions that the repo depends on.

@@ -117,3 +119,4 @@ ifeq ($(KFPv2), 1)
else
WORKFLOW_SUPPORT_LIB=kfp_v1_workflow_support
endif

2 changes: 1 addition & 1 deletion data-processing-lib/python/pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "data_prep_toolkit"
version = "0.2.1.dev0"
version = "0.2.1.dev3"
requires-python = ">=3.10"
keywords = ["data", "data preprocessing", "data preparation", "llm", "generative", "ai", "fine-tuning", "llmapps" ]
description = "Data Preparation Toolkit Library"
4 changes: 2 additions & 2 deletions data-processing-lib/ray/pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "data_prep_toolkit_ray"
version = "0.2.1.dev0"
version = "0.2.1.dev3"
daw3rd marked this conversation as resolved.
keywords = ["data", "data preprocessing", "data preparation", "llm", "generative", "ai", "fine-tuning", "llmapps" ]
requires-python = ">=3.10"
description = "Data Preparation Toolkit Library for Ray"
@@ -11,7 +11,7 @@ authors = [
{ name = "Boris Lublinsky", email = "[email protected]" },
]
dependencies = [
"data-prep-toolkit==0.2.1.dev0",
"data-prep-toolkit>=0.2.1.dev3",
"ray[default]==2.24.0",
# These two are to fix security issues identified by quay.io
"fastapi>=0.110.2",
4 changes: 2 additions & 2 deletions data-processing-lib/spark/pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "data_prep_toolkit_spark"
version = "0.2.1.dev0"
version = "0.2.1.dev3"
keywords = ["data", "data preprocessing", "data preparation", "llm", "generative", "ai", "fine-tuning", "llmapps" ]
requires-python = ">=3.10"
description = "Data Preparation Toolkit Library for Spark"
@@ -11,7 +11,7 @@ authors = [
{ name = "Boris Lublinsky", email = "[email protected]" },
]
dependencies = [
"data-prep-toolkit==0.2.1.dev0",
"data-prep-toolkit==0.2.1.dev3",
"pyspark>=3.5.2",
"psutil>=6.0.0"
]
6 changes: 3 additions & 3 deletions examples/notebooks/rag/requirements.txt
@@ -1,6 +1,6 @@
## Data prep kit
data-prep-toolkit-transforms==0.2.1.dev1
data-prep-toolkit-transforms-ray==0.2.1.dev1
#data-prep-toolkit-transforms==0.2.1.dev1
Member

I see that these are comments, but should they use 0.2.1.dev3?

Collaborator Author

I am not sure this notebook will still work with dev3. There were changes to the transforms that broke the notebook. I think Sunjee has a new release he will be checking in soon.

#data-prep-toolkit-transforms-ray==0.2.1.dev1



@@ -53,4 +53,4 @@ ipython
ipywidgets
IProgress
chardet==5.2.0
charset-normalizer==3.3.2
charset-normalizer==3.3.2
4 changes: 2 additions & 2 deletions kfp/kfp_support_lib/kfp_v1_workflow_support/pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "data_prep_toolkit_kfp_v1"
version = "0.2.1.dev0"
version = "0.2.1.dev3"
requires-python = ">=3.10,<3.12"
description = "Data Preparation Kit Library. KFP support"
license = {text = "Apache-2.0"}
@@ -13,7 +13,7 @@ authors = [
]
dependencies = [
"kfp==1.8.22",
"data-prep-toolkit-kfp-shared==0.2.1.dev0",
"data-prep-toolkit-kfp-shared==0.2.1.dev3",
]

[build-system]
6 changes: 3 additions & 3 deletions kfp/kfp_support_lib/kfp_v2_workflow_support/pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "data_prep_toolkit_kfp_v2"
version = "0.2.1.dev0"
version = "0.2.1.dev3"
requires-python = ">=3.10,<3.12"
description = "Data Preparation Kit Library. KFP support"
license = {text = "Apache-2.0"}
@@ -12,9 +12,9 @@ authors = [
{ name = "Revital Eres", email = "[email protected]" },
]
dependencies = [
"kfp==2.7.0",
"kfp==2.8.0",
"kfp-kubernetes==1.2.0",
"data-prep-toolkit-kfp-shared==0.2.1.dev0",
"data-prep-toolkit-kfp-shared==0.2.1.dev3",
]

[build-system]
4 changes: 2 additions & 2 deletions kfp/kfp_support_lib/shared_workflow_support/pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "data_prep_toolkit_kfp_shared"
version = "0.2.1.dev0"
version = "0.2.1.dev3"
requires-python = ">=3.10,<3.12"
description = "Data Preparation Kit Library. KFP support"
license = {text = "Apache-2.0"}
@@ -14,7 +14,7 @@ authors = [
dependencies = [
"requests",
"kubernetes",
"data-prep-toolkit-ray==0.2.1.dev0",
"data-prep-toolkit-ray==0.2.1.dev3",
]

[build-system]
4 changes: 2 additions & 2 deletions transforms/code/code2parquet/python/pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "dpk_code2parquet_transform_python"
version = "0.2.1.dev0"
version = "0.2.1.dev3"
requires-python = ">=3.10"
description = "code2parquet Python Transform"
license = {text = "Apache-2.0"}
@@ -10,7 +10,7 @@ authors = [
{ name = "Boris Lublinsky", email = "[email protected]" },
]
dependencies = [
"data-prep-toolkit==0.2.1.dev0",
"data-prep-toolkit==0.2.1.dev3",
"parameterized",
"pandas",
]
6 changes: 3 additions & 3 deletions transforms/code/code2parquet/ray/pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "dpk_code2parquet_transform_ray"
version = "0.2.1.dev0"
version = "0.2.1.dev3"
requires-python = ">=3.10"
description = "code2parquet Ray Transform"
license = {text = "Apache-2.0"}
@@ -10,8 +10,8 @@ authors = [
{ name = "Boris Lublinsky", email = "[email protected]" },
]
dependencies = [
"data-prep-toolkit-ray==0.2.1.dev0",
"dpk-code2parquet-transform-python==0.2.1.dev0",
"data-prep-toolkit-ray==0.2.1.dev3",
"dpk-code2parquet-transform-python==0.2.1.dev3",
"parameterized",
"pandas",
]
4 changes: 2 additions & 2 deletions transforms/code/code_quality/python/pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "dpk_code_quality_transform_python"
version = "0.2.1.dev0"
version = "0.2.1.dev3"
requires-python = ">=3.10"
description = "Code Quality Python Transform"
license = {text = "Apache-2.0"}
@@ -9,7 +9,7 @@ authors = [
{ name = "Shivdeep Singh", email = "[email protected]" },
]
dependencies = [
"data-prep-toolkit==0.2.1.dev0",
"data-prep-toolkit==0.2.1.dev3",
"bs4==0.0.2",
"transformers==4.38.2",
]
6 changes: 3 additions & 3 deletions transforms/code/code_quality/ray/pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "dpk_code_quality_transform_ray"
version = "0.2.1.dev0"
version = "0.2.1.dev3"
requires-python = ">=3.10"
description = "Code Quality Ray Transform"
license = {text = "Apache-2.0"}
@@ -9,8 +9,8 @@ authors = [
{ name = "Shivdeep Singh", email = "[email protected]" },
]
dependencies = [
"dpk-code-quality-transform-python==0.2.1.dev0",
"data-prep-toolkit-ray==0.2.1.dev0",
"dpk-code-quality-transform-python==0.2.1.dev3",
"data-prep-toolkit-ray==0.2.1.dev3",
]

[build-system]
4 changes: 2 additions & 2 deletions transforms/code/header_cleanser/python/pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "dpk_header_cleanser_transform_python"
version = "0.2.1.dev0"
version = "0.2.1.dev3"
requires-python = ">=3.10"
description = "License and Copyright Removal Transform for Python"
license = {text = "Apache-2.0"}
@@ -9,7 +9,7 @@ authors = [
{ name = "Yash kalathiya", email = "[email protected]" },
]
dependencies = [
"data-prep-toolkit==0.2.1.dev0",
"data-prep-toolkit==0.2.1.dev3",
"scancode-toolkit==32.1.0",
]

6 changes: 3 additions & 3 deletions transforms/code/header_cleanser/ray/pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "dpk_header_cleanser_transform_ray"
version = "0.2.1.dev0"
version = "0.2.1.dev3"
requires-python = ">=3.10"
description = "License and copyright removal Transform for Ray"
license = {text = "Apache-2.0"}
@@ -9,8 +9,8 @@ authors = [
{ name = "Yash kalathiya", email = "[email protected]" },
]
dependencies = [
"dpk-header-cleanser-transform-python==0.2.1.dev0",
"data-prep-toolkit-ray==0.2.1.dev0",
"dpk-header-cleanser-transform-python==0.2.1.dev3",
"data-prep-toolkit-ray==0.2.1.dev3",
"scancode-toolkit==32.1.0",
]
