Build transforms wheel #493

Merged
merged 35 commits into from
Sep 24, 2024
Changes from 7 commits
Commits
ddedccc
pytest for most but faild for pdf2parquet and resize
touma-I Aug 8, 2024
134c2c2
header_cleanser available for non Darwin installation
touma-I Aug 9, 2024
e227b9d
Added make file with unit tests for all transforms
touma-I Aug 10, 2024
d155a24
Build and deploy pypi package using makefile
touma-I Aug 12, 2024
fe0d380
remove pandas and keep pandas=2.2.2
touma-I Aug 13, 2024
b1e2707
Merge branch 'dev' into build-transforms-wheel
touma-I Aug 21, 2024
d96477a
added PII and HTML2Parquet
touma-I Aug 21, 2024
3c57682
Merge branch 'dev' into build-transforms-wheel
touma-I Aug 28, 2024
35c7e60
Update with latest available transforms
touma-I Aug 28, 2024
d66df27
restructure things to be able to test and build independently
touma-I Aug 28, 2024
a4f7e0a
publish dev2 for python
touma-I Aug 30, 2024
0af03cc
finish testing and publish dev2 python and ray packages:
touma-I Aug 30, 2024
65f4ac4
merge with dev
touma-I Sep 8, 2024
703ebe0
try different dependencies in attempt to resolve conflicts
touma-I Sep 9, 2024
d54708a
added pii redactor
touma-I Sep 11, 2024
e1309e5
Merge branch 'dev' into build-transforms-wheel
touma-I Sep 11, 2024
109ea29
updated with latest release for pdf2parquet
touma-I Sep 11, 2024
a874358
fixes for dev3 release
touma-I Sep 11, 2024
440975d
simplify test-src by using exiting targets
touma-I Sep 12, 2024
10c9159
Merge branch 'dev' into build-transforms-wheel
touma-I Sep 23, 2024
db60963
renamed requirement files
touma-I Sep 23, 2024
9bb36c5
use - in transform library name
touma-I Sep 23, 2024
ee63628
update requirements.txt files as appropriate when setting versions
touma-I Sep 23, 2024
346b82e
Added readme file to test, build and publish package to pypi
touma-I Sep 23, 2024
071836e
fix typos and removed double quotes
touma-I Sep 23, 2024
07b827f
Apply version update to all transforms
touma-I Sep 23, 2024
0369842
generate workflow for packaging folder
touma-I Sep 23, 2024
aa297e0
fix ededup dummy version
daw3rd Sep 23, 2024
32578d5
fix filter/spark dummy version
daw3rd Sep 23, 2024
3b52ecf
updated requirements based on latest release for docling
touma-I Sep 23, 2024
eb499de
Merge branch 'build-transforms-wheel' of github.com:IBM/data-prep-kit…
touma-I Sep 23, 2024
8d69b71
fix missing steps
touma-I Sep 23, 2024
d27a1c2
-sUpdate makefile to build and publish wheels
touma-I Sep 23, 2024
e155d2c
update based on reviewers comments
touma-I Sep 24, 2024
33b8853
fix packaging test workflow paths
daw3rd Sep 24, 2024
5 changes: 5 additions & 0 deletions transforms/packaging/.gitignore
@@ -0,0 +1,5 @@
**/src
Member

Should we combine it (except src) with the .gitignore in the root directory?

Collaborator Author

That would work too. I think there may be situations in the future where we want to add the wheel for specific transforms to git so folks can download it from git. This is still under discussion, but yes, once we have consensus, we should move the .gitignore to a higher level as appropriate.

**/dist
**/*.egg-info
**/build

48 changes: 48 additions & 0 deletions transforms/packaging/.make.packaging
@@ -0,0 +1,48 @@

venv:
	$(MAKE) .defaults.create-venv

test:: setup test-src
	@# Help: Setup environment, load wheel from pyproject.toml and run unit tests for all transforms

clean:: .transforms.clean
	-rm -fr src

image:: .transforms.python-image

test-src::
	source venv/bin/activate; \
	for T in $(TRANSFORMS_NAMES); do \
		echo running unit test on: $$T ; \
		$(PYTEST) $(REPOROOT)/transforms/$$T/$(PACKAGING_RUN_TIME)/test; \
	done;
	@# Help: Run all unit tests from the same venv environment (should follow make venv)


setup: .transforms.setup venv
	$(MAKE) src
	source venv/bin/activate; \
	$(PYTHON) -m pip install .
	@# Help: Do any default transform setup before running make src


src:
	for T in $(TRANSFORMS_NAMES); do \
		echo copy src from $$T ; \
		cp -R $(REPOROOT)/transforms/$$T/$(PACKAGING_RUN_TIME)/src/ src/ ; \
		rm -fr *.egg-info ; \
		rm -fr dist ; \
		rm -fr build ; \
	done;
	@# Help: Setup src folder and remove old distribution


build:: build-dist

publish:: publish-dist

build-dist:: .defaults.build-dist

publish-dist:: .defaults.publish-dist
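
The `src` target above copies every transform's `src` folder into one local `src/` directory before the wheel is built. A minimal Python sketch of that merge step follows, assuming the `transforms/<name>/<runtime>/src` layout used by the Makefile. The function name `merge_sources` is hypothetical and is not part of the repository.

```python
import shutil
from pathlib import Path


def merge_sources(repo_root, transform_names, runtime, dest):
    """Copy each transform's src tree into one destination folder.

    Hypothetical helper mirroring the `src` make target; the layout
    transforms/<name>/<runtime>/src is taken from the Makefile.
    Returns the names whose sources were actually copied.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in transform_names:
        src_dir = Path(repo_root) / "transforms" / name / runtime / "src"
        if not src_dir.is_dir():
            # Skip transforms that have no src folder for this runtime.
            continue
        # dirs_exist_ok=True merges into dest instead of failing on reuse.
        shutil.copytree(src_dir, dest, dirs_exist_ok=True)
        copied.append(name)
    return copied
```

One reason to consider this form: `shutil.copytree` always copies the contents of the source directory, whereas `cp -R` treats a trailing slash on the source differently in GNU and BSD implementations.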


42 changes: 42 additions & 0 deletions transforms/packaging/python/Makefile
@@ -0,0 +1,42 @@
# Define the root of the local git clone so that the common rules
# know where they are running from.
REPOROOT=../../..
# Include a library of common .transform.* targets which most
# transforms should be able to reuse. However, feel free
# to override/redefine the rules below.

# $(REPOROOT)/.make.versions file contains the versions

include $(REPOROOT)/transforms/.make.transforms
include ../.make.packaging

PACKAGING_RUN_TIME=python
DPK_TRNASFORM_REV=0.2.1.dev1
daw3rd marked this conversation as resolved.

TRANSFORMS_NAMES = code/code_quality \
Member

We should build this list automatically by looking for the python directories in transforms//

code/code2parquet \
code/header_cleanser \
code/code_quality \
Collaborator

code2parquet, malware

Collaborator Author

The malware transform needs some additional work before it can be included in a pip install, and it was intentionally excluded from this initial release. It seems to require a Docker container and does not fit the current use case of running these transforms in a notebook.

code/proglang_select \
language/doc_chunk \
language/doc_quality \
language/lang_id \
language/pdf2parquet \
language/text_encoder \
language/pii_redactor \
universal/ededup \
universal/filter \
universal/resize \
universal/tokenization \
universal/html2parquet

test-with-pypi:
	$(MAKE) .defaults.create-venv
	source venv/bin/activate; \
	$(PYTHON) -m pip install data_prep_toolkit_transforms==$(DPK_TRNASFORM_REV)
	$(MAKE) test-src
	@# Help: Load wheel from pypi and run all unit tests (final step in verification after deploying to pypi)
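
One reviewer above suggests building `TRANSFORMS_NAMES` automatically rather than maintaining it by hand. A rough sketch of how that discovery could look in Python, assuming every packaged transform lives at `transforms/<category>/<name>/<runtime>/src`. `discover_transforms` and its `exclude` parameter are hypothetical names, not something this PR implements.

```python
from pathlib import Path


def discover_transforms(repo_root, runtime="python", exclude=()):
    """Return sorted '<category>/<name>' entries that have <runtime>/src.

    Hypothetical helper; the directory layout is an assumption based on
    the paths used by the packaging Makefile.
    """
    transforms_dir = Path(repo_root) / "transforms"
    names = set()
    for src_dir in transforms_dir.glob(f"*/*/{runtime}/src"):
        if not src_dir.is_dir():
            continue
        # src_dir is transforms/<category>/<name>/<runtime>/src
        name = src_dir.parent.parent.relative_to(transforms_dir).as_posix()
        if name not in exclude:
            names.add(name)
    return sorted(names)
```

The `exclude` argument leaves room for transforms that are deliberately kept out of the wheel, such as the malware transform discussed in the review, for example `discover_transforms("../../..", exclude={"code/malware"})`.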




33 changes: 33 additions & 0 deletions transforms/packaging/python/README.md
@@ -0,0 +1,33 @@
# DPK Python Transforms

## Installation

The [transforms](https://github.com/IBM/data-prep-kit/blob/dev/transforms/README.md) are delivered as a standard Python library available on PyPI and can be installed using pip:

`python -m pip install data-prep-toolkit-transforms`

Installing the Python transforms will also install `data-prep-toolkit`.
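
After installing, one way to confirm that the distribution is visible to the interpreter is to query its metadata. The sketch below uses the standard library's `importlib.metadata`. The helper name `installed_version` is hypothetical, and the distribution names are the ones mentioned in this README.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as given in the install instructions.
    for dist in ("data-prep-toolkit-transforms", "data-prep-toolkit"):
        print(dist, installed_version(dist) or "not installed")
```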

## List of transforms in the current package

* code
    * [code2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/code2parquet/python/README.md)
    * header_cleanser (not available on macOS)
    * code_quality
    * proglang_select
Collaborator

malware

Collaborator Author

The malware transform needs some additional work before it can be included in a pip install, and it was intentionally excluded from this initial release. It seems to require a Docker container and does not fit the current use case of running these transforms in a notebook.

* language
    * doc_chunk
    * doc_quality
    * lang_id
    * pdf2parquet
    * text_encoder
* universal
    * ededup
    * filter
    * resize
    * tokenization
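
The note that `header_cleanser` is not available on macOS corresponds to the environment marker `platform_system != 'Darwin'` placed on `scancode-toolkit` in the `pyproject.toml` added by this PR. A hedged sketch of how calling code could mirror that marker at runtime follows. `header_cleanser_supported` is a hypothetical helper, not an API of the package.

```python
import platform


def header_cleanser_supported(system=None):
    """Mirror the marker `platform_system != 'Darwin'` at runtime.

    `system` defaults to platform.system(); passing it explicitly makes
    the check easy to exercise on any machine.
    """
    if system is None:
        system = platform.system()
    return system != "Darwin"
```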





63 changes: 63 additions & 0 deletions transforms/packaging/python/pyproject.toml
@@ -0,0 +1,63 @@
[project]
name = "data_prep_toolkit_transforms"
version = "0.2.1.dev1"
daw3rd marked this conversation as resolved.
requires-python = ">=3.10,<3.12"
keywords = ["transforms", "data preprocessing", "data preparation", "llm", "generative", "ai", "fine-tuning", "llmapps" ]
description = "Data Preparation Toolkit Transforms"
license = {text = "Apache-2.0"}
readme = {file = "README.md", content-type = "text/markdown"}
authors = [
{ name = "Maroun Touma", email = "[email protected]" },
]

dependencies = [
"data-prep-toolkit==0.2.1.dev1",
daw3rd marked this conversation as resolved.
"argparse",
"boto3==1.34.69",
"bs4==0.0.2",
"clamd==1.0.2",
"docling[ocr]==1.1.2",
"duckdb==0.10.1",
"fasttext==0.9.2",
"filetype >=1.2.0, <2.0.0",
"huggingface-hub >= 0.21.4, <1.0.0",
"langcodes==3.3.0",
"mmh3==4.1.0",
"numpy==1.26.4",
"pandas",
"parameterized",
"pyarrow==16.1.0",
"python-dateutil>=2.8.2",
"pytz>=2020.1",
"quackling==0.1.0",
"scancode-toolkit==32.1.0 ; platform_system != 'Darwin'",
"sentence-transformers==3.0.1",
"transformers==4.38.2",
"tzdata>=2022.7",
"xxhash==3.4.1",
]

[build-system]
requires = ["setuptools>=68.0.0", "wheel", "setuptools_scm[toml]>=7.1.0"]
build-backend = "setuptools.build_meta"


[options]
package_dir = ["src"]

[options.packages.find]
where = ["src/"]

[tool.pytest.ini_options]
# Currently we use low coverage since we have to run tests separately (see makefile)
#addopts = "--cov --cov-report term-missing --cov-fail-under 25"
markers = ["unit: unit tests", "integration: integration tests"]

[tool.coverage.run]
include = ["src/*"]






53 changes: 53 additions & 0 deletions transforms/packaging/ray/Makefile
@@ -0,0 +1,53 @@
# Define the root of the local git clone so that the common rules
# know where they are running from.
REPOROOT=../../..
# Include a library of common .transform.* targets which most
# transforms should be able to reuse. However, feel free
# to override/redefine the rules below.

# $(REPOROOT)/.make.versions file contains the versions

include $(REPOROOT)/transforms/.make.transforms
include ../.make.packaging

PACKAGING_RUN_TIME=ray
DPK_TRNASFORM_REV=0.2.1.dev0

## Ray Transforms: `find . -name src | grep ray/src`
TRANSFORMS_NAMES = code/proglang_select \
code/header_cleanser \
code/code_quality \
code/repo_level_ordering \
code/code2parquet \
language/doc_quality \
language/doc_chunk \
language/lang_id \
language/text_encoder \
language/pdf2parquet \
language/pii_redactor \
universal/fdedup \
universal/tokenization \
universal/ededup \
universal/profiler \
universal/doc_id \
universal/filter \
universal/resize

test-with-local-python:
	$(MAKE) clean
	$(MAKE) .defaults.create-venv
	source venv/bin/activate; \
	cd ../python; \
	$(PYTHON) -m pip install . ; \
	cd ../ray; \
	$(PYTHON) -m pip install . ; \
	$(PYTHON) -m pip install data_prep_toolkit_transforms_ray==$(DPK_TRNASFORM_REV)
	$(MAKE) test-src

test-with-pypi:
	$(MAKE) clean
	$(MAKE) .defaults.create-venv
	source venv/bin/activate; \
	$(PYTHON) -m pip install data_prep_toolkit_transforms_ray==$(DPK_TRNASFORM_REV)
	$(MAKE) test-src

37 changes: 37 additions & 0 deletions transforms/packaging/ray/README.md
@@ -0,0 +1,37 @@
# DPK Ray Transforms

## Installation

The [transforms](https://github.com/IBM/data-prep-kit/blob/dev/transforms/README.md) are delivered as a standard Python library available on PyPI and can be installed using pip:

`python -m pip install data-prep-toolkit-transforms-ray`

Installing the Ray transforms will also install `data_prep_toolkit_transforms` and `data-prep-toolkit-ray`.

## List of Ray transforms available in the current package

* code
    * [code2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/code2parquet/ray/README.md)
    * proglang_select
    * header_cleanser (not available on macOS)
    * code_quality
    * repo_level_ordering
Collaborator

malware

Collaborator Author

The malware transform needs some additional work before it can be included in a pip install, and it was intentionally excluded from this initial release. It seems to require a Docker container and does not fit the current use case of running these transforms in a notebook.

* language
    * doc_quality
    * doc_chunk
    * lang_id
    * text_encoder
    * pdf2parquet
* universal
    * fdedup
    * tokenization
    * ededup
    * profiler
    * doc_id
    * filter
    * resize
Collaborator

profiler,

Collaborator Author

@blublinsky what is meant by this comment?

Collaborator

missing transform






53 changes: 53 additions & 0 deletions transforms/packaging/ray/pyproject.toml
@@ -0,0 +1,53 @@
[project]
name = "data_prep_toolkit_transforms_ray"
version = "0.2.1.dev1"
Member

dev0

requires-python = ">=3.10,<3.12"
keywords = ["transforms", "data preprocessing", "data preparation", "llm", "generative", "ai", "fine-tuning", "llmapps" ]
description = "Data Preparation Toolkit Transforms using Ray"
license = {text = "Apache-2.0"}
readme = {file = "README.md", content-type = "text/markdown"}
authors = [
{ name = "Maroun Touma", email = "[email protected]" },
]

dependencies = [
#"data_prep_toolkit_transforms==0.2.1.dev1",
"data-prep-toolkit-ray==0.2.1.dev0",
"scancode-toolkit==32.1.0 ; platform_system != 'Darwin'",
"parameterized",
"tqdm==4.66.3",
"mmh3==4.1.0",
"xxhash==3.4.1",
"tqdm==4.66.3",
"scipy==1.12.0",
"networkx==3.3",
"colorlog==6.8.2",
"func-timeout==4.3.5",
"pandas==2.2.2",
"emerge-viz==2.0.0",
]
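
Pinned dependency lists like the one above are easy to get out of sync; for instance, `tqdm==4.66.3` appears twice. The sketch below flags repeated package names in a list of requirement strings. It uses a simplified regular expression rather than a full PEP 508 parser, and `duplicate_requirements` is a hypothetical helper.

```python
import re
from collections import Counter

# Leading package name of a requirement string (simplified, not full PEP 508).
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def duplicate_requirements(requirements):
    """Return normalized package names that occur more than once."""
    counts = Counter()
    for req in requirements:
        match = _NAME.match(req)
        if match:
            # PEP 503 style normalization: lowercase, runs of -_. become '-'.
            counts[re.sub(r"[-_.]+", "-", match.group(1)).lower()] += 1
    return sorted(name for name, n in counts.items() if n > 1)
```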

[build-system]
requires = ["setuptools>=68.0.0", "wheel", "setuptools_scm[toml]>=7.1.0"]
build-backend = "setuptools.build_meta"


[options]
package_dir = ["src"]

[options.packages.find]
where = ["src/"]

[tool.pytest.ini_options]
# Currently we use low coverage since we have to run tests separately (see makefile)
#addopts = "--cov --cov-report term-missing --cov-fail-under 25"
markers = ["unit: unit tests", "integration: integration tests"]

[tool.coverage.run]
include = ["src/*"]
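
The `dev0` review note above points at a mismatch: the Ray Makefile sets `DPK_TRNASFORM_REV=0.2.1.dev0` while this `pyproject.toml` declares version `0.2.1.dev1`. Below is a sketch of a check that compares the two values using plain regular expressions on the file text. The function names are hypothetical, and the variable name is copied exactly as spelled in the Makefile.

```python
import re


def makefile_revision(makefile_text, variable="DPK_TRNASFORM_REV"):
    """Extract `<variable>=<value>` from Makefile text, or None."""
    pattern = rf"^{re.escape(variable)}\s*=\s*(\S+)"
    match = re.search(pattern, makefile_text, re.M)
    return match.group(1) if match else None


def pyproject_version(toml_text):
    """Extract the first `version = "..."` entry from pyproject text, or None."""
    match = re.search(r'^version\s*=\s*"([^"]+)"', toml_text, re.M)
    return match.group(1) if match else None


def versions_match(makefile_text, toml_text):
    """True only when both values are found and identical."""
    rev = makefile_revision(makefile_text)
    return rev is not None and rev == pyproject_version(toml_text)
```

A check like this could run before `publish-dist` so that the version installed by `test-with-pypi` is the one that was just built.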





