Skip to content

Commit

Permalink
Merge branch 'dev' into transform-pipeline
Browse files Browse the repository at this point in the history
  • Loading branch information
daw3rd committed Sep 19, 2024
2 parents 9369482 + e4dac0b commit d97a076
Show file tree
Hide file tree
Showing 33 changed files with 3,145 additions and 4 deletions.
26 changes: 26 additions & 0 deletions .github/workflows/Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@

# Directories in the transforms/universal directory for which we want to generate test workflows
UNIVERSAL_TRANSFORMS=doc_id ededup fdedup filter html2parquet noop profiler resize tokenization
# Directories in the transforms/code directory for which we want to generate test workflows
CODE_TRANSFORMS=code2parquet code_quality header_cleanser malware proglang_select repo_level_ordering
# Directories in the transforms/language directory for which we want to generate test workflows
LANG_TRANSFORMS=doc_chunk doc_quality lang_id pdf2parquet pii_redactor text_encoder


transform-tests:
$(MAKE) TRANSFORM_SUBDIR=universal .transform-tests
$(MAKE) TRANSFORM_SUBDIR=language .transform-tests
$(MAKE) TRANSFORM_SUBDIR=code .transform-tests

# Expects
# TRANSFORM_SUBDIR transforms subdirectory (such as universal)
.transform-tests:
@for i in $$(find ../../transforms/$(TRANSFORM_SUBDIR) -depth 1 -type d); do \
dir=$$(basename $$i); \
yml=test-$(TRANSFORM_SUBDIR)-$$dir.yml; \
echo Generating $$yml; \
cat test-transform.template | sed -e "s?@TARGET_TRANSFORM_DIR@?transforms/$${TRANSFORM_SUBDIR}/$$dir?g" > $$yml; \
done



73 changes: 73 additions & 0 deletions .github/workflows/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
# Workflow Management

Here we have the start of a system to automatically generated github workflows (currently only for transforms).
In general, the design is to use templates and `make` to generate/update the workflows.

#### Goals
1. Run only tests for a given transform when only the transform changes.
Includes python, ray, spark and kfp_ray as available.
2. When the core dpk lib components files changes, test all transforms
3. When the shared kfp components changes, test a randomly selected transform test
(We would like to avoid running all transform kfp tests in one PR)
4. Extra credit: If .md or other non-code changes are made, run no tests.

#### Assumptions
1. All transforms will have test workflows. A transform can disable its tests locally
(temporarily?) by renaming its Makefile. For example,
`cp transforms/universal/noop/Makefile transforms/universal/noop/Makefile.disabled`.

## DPK libraries (`data-processing-lib` directory)
The DPK libraries, in data-processing-lib/{python,ray,spark}, are tested
via the fixed
[test-lib.yml](test-lib.yml)
file and is triggered when any code files in that tree change.

The transforms test workflows also depend on this directory tree and so
changes made here will trigger transform tests.

## Transforms (`transforms` directory tree)
We define a unique test workflow for each transform, based on a common
template [test-transform.template](test-transform.template).
The [Makefile](Makefile) is used to (re)generate all workflows a necessary.
By design, workflows for a given transform should run when

* anything of substance effecting operation is modified in the transform's directory tree.
* anything in the core libraries in this repo (e.g., data-processing/lib) assuming the transform depends on these.

Note that the kfp tests (in kfp_ray/Makefile workflow-test) for a given transform are
**not** currently being run when the transform's tests are run.
Currently these are run randomly via the [test-kfp.yml](test-kfp.yml).
We expect to fix this is in the future.

When a new transform is added to the repository,

1. Run `make` in this directory to create the new test .yml for all transforms found in transforms/{universal,code,language} directories
1. commit and push the change to your branch with the new transform.

Something like the following:
```
git clone ....
...
git checkout -b new-branch
make # Creates new test*.yml workflows
git commit -a -s -m "update workflows"
git push --set-upstream origin new-branch
```

## KFP (`kfp` directory tree)

Like DPK core libs, kfp tests are defined in
[test-kfp.yml](test-kfp.yml) and run whenever changes are made in
the `kfp` directory tree. Tests currently include

1. test kfp on randomly selected transform.

Eventually we would like to enable the transform-specific kfp test
when only the transform code is modified or maybe when only
the `kfp_ray` directory contents is modified.

## Miscellaneous
[test-misc.yml](test-misc.yml) defines some repo consistency tests including

1. Make sure `set-versions` make target can be run recursively throughout the repo
2. Makes sure there is a test workflow for each transform in the repo.
Original file line number Diff line number Diff line change
Expand Up @@ -6,10 +6,20 @@ on:
branches:
- "dev"
- "releases/**"
paths:
- "data-processing-lib/**"
- "!**.md"
- "!**/doc/**"
- "!**/.gitignore"
pull_request:
branches:
- "dev"
- "releases/**"
paths:
- "data-processing-lib/**"
- "!**.md"
- "!**/doc/**"
- "!**/.gitignore"
jobs:
build-python-lib:
runs-on: ubuntu-22.04
Expand Down
122 changes: 122 additions & 0 deletions .github/workflows/test-code-code2parquet.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@
#
# DO NOT EDIT THIS FILE: it is generated from test-transform.template, Edit there and run make to change these files
#
name: Test - transforms/code/code2parquet

on:
workflow_dispatch:
push:
branches:
- "dev"
- "releases/**"
tags:
- "*"
paths:
- "transforms/code/code2parquet/**"
- "data-processing-lib/**"
- "!transforms/code/code2parquet/**/kfp_ray/**" # This is/will be tested in separate workflow
- "!data-processing-lib/**/test/**"
- "!data-processing-lib/**/test-data/**"
- "!**.md"
- "!**/doc/**"
- "!**/images/**"
- "!**.gitignore"
pull_request:
branches:
- "dev"
- "releases/**"
paths:
- "transforms/code/code2parquet/**"
- "data-processing-lib/**"
- "!transforms/code/code2parquet/**/kfp_ray/**" # This is/will be tested in separate workflow
- "!data-processing-lib/**/test/**"
- "!data-processing-lib/**/test-data/**"
- "!**.md"
- "!**/doc/**"
- "!**/images/**"
- "!**.gitignore"

jobs:
check_if_push_image:
# check whether the Docker images should be pushed to the remote repository
# The images are pushed if it is a merge to dev branch or a new tag is created.
# The latter being part of the release process.
# The images tag is derived from the value of the DOCKER_IMAGE_VERSION variable set in the .make.versions file.
runs-on: ubuntu-22.04
outputs:
publish_images: ${{ steps.version.outputs.publish_images }}
steps:
- id: version
run: |
publish_images='false'
if [[ ${GITHUB_REF} == refs/heads/dev && ${GITHUB_EVENT_NAME} != 'pull_request' && ${GITHUB_REPOSITORY} == IBM/data-prep-kit ]] ;
then
publish_images='true'
fi
if [[ ${GITHUB_REF} == refs/tags/* && ${GITHUB_REPOSITORY} == IBM/data-prep-kit ]] ;
then
publish_images='true'
fi
echo "publish_images=$publish_images" >> "$GITHUB_OUTPUT"
test-src:
runs-on: ubuntu-22.04
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Free up space in github runner
# Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
run: |
df -h
sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/local/.ghcup
sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
df -h
- name: Test transform source in transforms/code/code2parquet
run: |
if [ -e "transforms/code/code2parquet/Makefile" ]; then
make -C transforms/code/code2parquet DOCKER=docker test-src
else
echo "transforms/code/code2parquet/Makefile not found - source testing disabled for this transform."
fi
test-image:
needs: [check_if_push_image]
runs-on: ubuntu-22.04
timeout-minutes: 120
env:
DOCKER_REGISTRY_USER: ${{ secrets.DOCKER_REGISTRY_USER }}
DOCKER_REGISTRY_KEY: ${{ secrets.DOCKER_REGISTRY_KEY }}
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Free up space in github runner
# Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
run: |
df -h
sudo rm -rf /opt/ghc
sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/lib/jvm /usr/local/.ghcup
sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
df -h
- name: Test transform image in transforms/code/code2parquet
run: |
if [ -e "transforms/code/code2parquet/Makefile" ]; then
make -C data-processing-lib/spark DOCKER=docker image
make -C transforms/code/code2parquet DOCKER=docker test-image
else
echo "transforms/code/code2parquet/Makefile not found - testing disabled for this transform."
fi
- name: Print space
# Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
run: |
df -h
docker images
- name: Publish images
if: needs.check_if_push_image.outputs.publish_images == 'true'
run: |
if [ -e "transforms/code/code2parquet/Makefile" ]; then
make -C transforms/code/code2parquet publish
else
echo "transforms/code/code2parquet/Makefile not found - publishing disabled for this transform."
fi
122 changes: 122 additions & 0 deletions .github/workflows/test-code-code_quality.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@
#
# DO NOT EDIT THIS FILE: it is generated from test-transform.template, Edit there and run make to change these files
#
name: Test - transforms/code/code_quality

on:
workflow_dispatch:
push:
branches:
- "dev"
- "releases/**"
tags:
- "*"
paths:
- "transforms/code/code_quality/**"
- "data-processing-lib/**"
- "!transforms/code/code_quality/**/kfp_ray/**" # This is/will be tested in separate workflow
- "!data-processing-lib/**/test/**"
- "!data-processing-lib/**/test-data/**"
- "!**.md"
- "!**/doc/**"
- "!**/images/**"
- "!**.gitignore"
pull_request:
branches:
- "dev"
- "releases/**"
paths:
- "transforms/code/code_quality/**"
- "data-processing-lib/**"
- "!transforms/code/code_quality/**/kfp_ray/**" # This is/will be tested in separate workflow
- "!data-processing-lib/**/test/**"
- "!data-processing-lib/**/test-data/**"
- "!**.md"
- "!**/doc/**"
- "!**/images/**"
- "!**.gitignore"

jobs:
check_if_push_image:
# check whether the Docker images should be pushed to the remote repository
# The images are pushed if it is a merge to dev branch or a new tag is created.
# The latter being part of the release process.
# The images tag is derived from the value of the DOCKER_IMAGE_VERSION variable set in the .make.versions file.
runs-on: ubuntu-22.04
outputs:
publish_images: ${{ steps.version.outputs.publish_images }}
steps:
- id: version
run: |
publish_images='false'
if [[ ${GITHUB_REF} == refs/heads/dev && ${GITHUB_EVENT_NAME} != 'pull_request' && ${GITHUB_REPOSITORY} == IBM/data-prep-kit ]] ;
then
publish_images='true'
fi
if [[ ${GITHUB_REF} == refs/tags/* && ${GITHUB_REPOSITORY} == IBM/data-prep-kit ]] ;
then
publish_images='true'
fi
echo "publish_images=$publish_images" >> "$GITHUB_OUTPUT"
test-src:
runs-on: ubuntu-22.04
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Free up space in github runner
# Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
run: |
df -h
sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/local/.ghcup
sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
df -h
- name: Test transform source in transforms/code/code_quality
run: |
if [ -e "transforms/code/code_quality/Makefile" ]; then
make -C transforms/code/code_quality DOCKER=docker test-src
else
echo "transforms/code/code_quality/Makefile not found - source testing disabled for this transform."
fi
test-image:
needs: [check_if_push_image]
runs-on: ubuntu-22.04
timeout-minutes: 120
env:
DOCKER_REGISTRY_USER: ${{ secrets.DOCKER_REGISTRY_USER }}
DOCKER_REGISTRY_KEY: ${{ secrets.DOCKER_REGISTRY_KEY }}
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Free up space in github runner
# Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
run: |
df -h
sudo rm -rf /opt/ghc
sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/lib/jvm /usr/local/.ghcup
sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
df -h
- name: Test transform image in transforms/code/code_quality
run: |
if [ -e "transforms/code/code_quality/Makefile" ]; then
make -C data-processing-lib/spark DOCKER=docker image
make -C transforms/code/code_quality DOCKER=docker test-image
else
echo "transforms/code/code_quality/Makefile not found - testing disabled for this transform."
fi
- name: Print space
# Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
run: |
df -h
docker images
- name: Publish images
if: needs.check_if_push_image.outputs.publish_images == 'true'
run: |
if [ -e "transforms/code/code_quality/Makefile" ]; then
make -C transforms/code/code_quality publish
else
echo "transforms/code/code_quality/Makefile not found - publishing disabled for this transform."
fi
Loading

0 comments on commit d97a076

Please sign in to comment.