Merge pull request #797 from IBM/crawler-transform
Crawler transform
Showing 20 changed files with 954 additions and 14 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,133 @@ | ||
# | ||
# DO NOT EDIT THIS FILE: it is generated from test-transform.template, Edit there and run make to change these files | ||
# | ||
name: Test - transforms/universal/web2parquet | ||
|
||
on: | ||
workflow_dispatch: | ||
push: | ||
branches: | ||
- "dev" | ||
- "releases/**" | ||
tags: | ||
- "*" | ||
paths: | ||
- ".make.*" | ||
- "transforms/.make.transforms" | ||
- "transforms/universal/web2parquet/**" | ||
- "data-processing-lib/**" | ||
- "!transforms/universal/web2parquet/**/kfp_ray/**" # This is/will be tested in separate workflow | ||
- "!data-processing-lib/**/test/**" | ||
- "!data-processing-lib/**/test-data/**" | ||
- "!**.md" | ||
- "!**/doc/**" | ||
- "!**/images/**" | ||
- "!**.gitignore" | ||
pull_request: | ||
branches: | ||
- "dev" | ||
- "releases/**" | ||
paths: | ||
- ".make.*" | ||
- "transforms/.make.transforms" | ||
- "transforms/universal/web2parquet/**" | ||
- "data-processing-lib/**" | ||
- "!transforms/universal/web2parquet/**/kfp_ray/**" # This is/will be tested in separate workflow | ||
- "!data-processing-lib/**/test/**" | ||
- "!data-processing-lib/**/test-data/**" | ||
- "!**.md" | ||
- "!**/doc/**" | ||
- "!**/images/**" | ||
- "!**.gitignore" | ||
|
||
# Taken from https://stackoverflow.com/questions/66335225/how-to-cancel-previous-runs-in-the-pr-when-you-push-new-commitsupdate-the-curre | ||
concurrency: | ||
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }} | ||
cancel-in-progress: true | ||
|
||
jobs: | ||
check_if_push_image: | ||
# check whether the Docker images should be pushed to the remote repository | ||
# The images are pushed if it is a merge to dev branch or a new tag is created. | ||
# The latter being part of the release process. | ||
# The images tag is derived from the value of the DOCKER_IMAGE_VERSION variable set in the .make.versions file. | ||
runs-on: ubuntu-22.04 | ||
outputs: | ||
publish_images: ${{ steps.version.outputs.publish_images }} | ||
steps: | ||
- id: version | ||
run: | | ||
publish_images='false' | ||
if [[ ${GITHUB_REF} == refs/heads/dev && ${GITHUB_EVENT_NAME} != 'pull_request' && ${GITHUB_REPOSITORY} == IBM/data-prep-kit ]] ; | ||
then | ||
publish_images='true' | ||
fi | ||
if [[ ${GITHUB_REF} == refs/tags/* && ${GITHUB_REPOSITORY} == IBM/data-prep-kit ]] ; | ||
then | ||
publish_images='true' | ||
fi | ||
echo "publish_images=$publish_images" >> "$GITHUB_OUTPUT" | ||
test-src: | ||
runs-on: ubuntu-22.04 | ||
steps: | ||
- name: Checkout | ||
uses: actions/checkout@v4 | ||
- name: Free up space in github runner | ||
# Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173 | ||
run: | | ||
df -h | ||
sudo rm -rf "/usr/local/share/boost" | ||
sudo rm -rf "$AGENT_TOOLSDIRECTORY" | ||
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/local/.ghcup | ||
sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true | ||
df -h | ||
- name: Test transform source in transforms/universal/web2parquet | ||
run: | | ||
if [ -e "transforms/universal/web2parquet/Makefile" ]; then | ||
make -C transforms/universal/web2parquet DOCKER=docker test-src | ||
else | ||
echo "transforms/universal/web2parquet/Makefile not found - source testing disabled for this transform." | ||
fi | ||
test-image: | ||
needs: [check_if_push_image] | ||
runs-on: ubuntu-22.04 | ||
timeout-minutes: 120 | ||
env: | ||
DOCKER_REGISTRY_USER: ${{ secrets.DOCKER_REGISTRY_USER }} | ||
DOCKER_REGISTRY_KEY: ${{ secrets.DOCKER_REGISTRY_KEY }} | ||
steps: | ||
- name: Checkout | ||
uses: actions/checkout@v4 | ||
- name: Free up space in github runner | ||
# Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173 | ||
run: | | ||
df -h | ||
sudo rm -rf /opt/ghc | ||
sudo rm -rf "/usr/local/share/boost" | ||
sudo rm -rf "$AGENT_TOOLSDIRECTORY" | ||
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/lib/jvm /usr/local/.ghcup | ||
sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true | ||
df -h | ||
- name: Test transform image in transforms/universal/web2parquet | ||
run: | | ||
if [ -e "transforms/universal/web2parquet/Makefile" ]; then | ||
if [ -d "transforms/universal/web2parquet/spark" ]; then | ||
make -C data-processing-lib/spark DOCKER=docker image | ||
fi | ||
make -C transforms/universal/web2parquet DOCKER=docker test-image | ||
else | ||
echo "transforms/universal/web2parquet/Makefile not found - testing disabled for this transform." | ||
fi | ||
- name: Print space | ||
# Report the remaining disk space and list the images that were built | ||
run: | | ||
df -h | ||
docker images | ||
- name: Publish images | ||
if: needs.check_if_push_image.outputs.publish_images == 'true' | ||
run: | | ||
if [ -e "transforms/universal/web2parquet/Makefile" ]; then | ||
make -C transforms/universal/web2parquet publish | ||
else | ||
echo "transforms/universal/web2parquet/Makefile not found - publishing disabled for this transform." | ||
fi |
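The publish decision made by the `check_if_push_image` job can be restated as a small predicate. The following is a minimal sketch, not part of the commit: the function name `should_publish_images` and its `upstream` parameter are hypothetical, and the three inputs stand in for `GITHUB_REF`, `GITHUB_EVENT_NAME`, and `GITHUB_REPOSITORY`.

```python
def should_publish_images(ref: str, event_name: str, repository: str,
                          upstream: str = "IBM/data-prep-kit") -> bool:
    """Restate the two shell checks from the check_if_push_image job."""
    # Both checks require the run to happen in the upstream repository,
    # so forks never publish.
    if repository != upstream:
        return False
    # First check: the ref is the dev branch and the event is not a pull request.
    if ref == "refs/heads/dev" and event_name != "pull_request":
        return True
    # Second check: any tag, which is part of the release process.
    return ref.startswith("refs/tags/")
```

Under this reading, a run on a `releases/**` branch still builds and tests the image but does not publish it, because neither check matches that ref.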
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,89 @@ | ||
# Define the root of the local git clone so that the common rules | ||
# know where they are running from. | ||
|
||
# Include a library of common .transform.* targets which most | ||
# transforms should be able to reuse. However, feel free | ||
# to override/redefine the rules below. | ||
include $(REPOROOT)/transforms/.make.transforms | ||
|
||
###################################################################### | ||
## Default setting for TRANSFORM_RUNTIME uses folder name-- Old layout | ||
TRANSFORM_PYTHON_RUNTIME_SRC_FILE=-m dpk_$(TRANSFORM_NAME).transform | ||
TRANSFORM_RAY_RUNTIME_SRC_FILE=-m dpk_$(TRANSFORM_NAME).ray.transform | ||
TRANSFORM_SPARK_RUNTIME_SRC_FILE=-m dpk_$(TRANSFORM_NAME).spark.transform | ||
|
||
venv:: .defaults.create-venv | ||
source venv/bin/activate && $(PIP) install -e $(REPOROOT)/data-processing-lib[ray,spark] | ||
source venv/bin/activate && $(PIP) install -e $(REPOROOT)/data-connector-lib | ||
if [ -e requirements.txt ]; then \ | ||
source venv/bin/activate && $(PIP) install -r requirements.txt; \ | ||
fi; | ||
|
||
|
||
test:: .transforms.test-src test-image | ||
|
||
clean:: .transforms.clean | ||
|
||
## We need to think how we want to do this going forward | ||
set-versions:: | ||
|
||
## We need to think how we want to do this going forward | ||
build:: | ||
|
||
image:: | ||
@if [ -e Dockerfile ]; then \ | ||
$(MAKE) image-default ; \ | ||
else \ | ||
echo "Skipping image for $(shell pwd) since no Dockerfile is present"; \ | ||
fi | ||
|
||
publish:: | ||
@if [ -e Dockerfile ]; then \ | ||
$(MAKE) publish-default ; \ | ||
else \ | ||
echo "Skipping publish for $(shell pwd) since no Dockerfile is present"; \ | ||
fi | ||
|
||
publish-image:: | ||
@if [ -e Dockerfile ]; then \ | ||
$(MAKE) publish-image-default ; \ | ||
else \ | ||
echo "Skipping publish-image for $(shell pwd) since no Dockerfile is present"; \ | ||
fi | ||
|
||
test-image:: | ||
@if [ -e Dockerfile ]; then \ | ||
$(MAKE) test-image-default ; \ | ||
else \ | ||
echo "Skipping test-image for $(shell pwd) since no Dockerfile is present"; \ | ||
fi | ||
|
||
test-src:: .transforms.test-src | ||
|
||
setup:: .transforms.setup | ||
|
||
publish-default:: publish-image | ||
|
||
publish-image-default:: .defaults.publish-image | ||
|
||
test-image-default:: image .transforms.test-image-help .defaults.test-image-pytest .transforms.clean | ||
|
||
build-lib-wheel: | ||
make -C $(REPOROOT)/data-processing-lib build-pkg-dist | ||
|
||
image-default:: build-lib-wheel | ||
@$(eval LIB_WHEEL_FILE := $(shell find $(REPOROOT)/data-processing-lib/dist/*.whl)) | ||
rm -fr dist && mv $(REPOROOT)/data-processing-lib/dist . | ||
$(eval WHEEL_FILE_NAME := $(shell basename $(LIB_WHEEL_FILE))) | ||
$(DOCKER) build -t $(DOCKER_IMAGE_NAME) $(DOCKER_BUILD_EXTRA_ARGS) \ | ||
--platform $(DOCKER_PLATFORM) \ | ||
--build-arg EXTRA_INDEX_URL=$(EXTRA_INDEX_URL) \ | ||
--build-arg BASE_IMAGE=$(RAY_BASE_IMAGE) \ | ||
--build-arg BUILD_DATE=$(shell date -u +'%Y-%m-%dT%H:%M:%SZ') \ | ||
--build-arg WHEEL_FILE_NAME=$(WHEEL_FILE_NAME) \ | ||
--build-arg TRANSFORM_NAME=$(TRANSFORM_NAME) \ | ||
--build-arg GIT_COMMIT=$(shell git log -1 --format=%h) . | ||
$(DOCKER) tag $(DOCKER_LOCAL_IMAGE) $(DOCKER_REMOTE_IMAGE) | ||
rm -fr dist | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
REPOROOT=../../.. | ||
# Use make help, to see the available rules | ||
include $(REPOROOT)/transforms/.make.cicd.targets | ||
|
||
# | ||
# This is intended to be included across the Makefiles provided within | ||
# a given transform's directory tree, so must use compatible syntax. | ||
# | ||
################################################################################ | ||
# This defines the name of the transform and is used to match against | ||
# expected files and is used to define the transform's image name. | ||
TRANSFORM_NAME=$(shell basename `pwd`) | ||
|
||
################################################################################ | ||
# This defines the transforms' version number as would be used | ||
# when publishing the wheel. In general, only the micro version | ||
# number should be advanced relative to the DPK_VERSION. | ||
# | ||
# If you change the versions numbers, be sure to run "make set-versions" to | ||
# update version numbers across the transform (e.g., pyproject.toml). | ||
#TRANSFORM_VERSION=$(DPK_VERSION) | ||
|
||
|
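The naming convention that ties the two Makefiles together can be illustrated as follows. This is a sketch under assumptions: it presumes the transform's Makefile is run from the transform directory, and the helper names `transform_name` and `runtime_modules` are hypothetical and exist only for illustration.

```python
from pathlib import PurePosixPath


def transform_name(transform_dir: str) -> str:
    # Equivalent of: TRANSFORM_NAME=$(shell basename `pwd`)
    return PurePosixPath(transform_dir).name


def runtime_modules(name: str) -> dict[str, str]:
    # The three "-m dpk_<name>..." module arguments that the common
    # targets file assigns as runtime defaults.
    return {
        "python": f"-m dpk_{name}.transform",
        "ray": f"-m dpk_{name}.ray.transform",
        "spark": f"-m dpk_{name}.spark.transform",
    }


if __name__ == "__main__":
    name = transform_name("/repo/transforms/universal/web2parquet")
    print(name, runtime_modules(name))
```

Because the name is taken from the directory, renaming the transform folder would also change the module that each runtime is expected to provide.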
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
# Web Crawler to Parquet | ||
|
||
This transform crawls the web and downloads files in real-time. | ||
|
||
This first release of the transform accepts only the following 4 parameters. Later releases will extend the functionality to allow the user to specify additional constraints such as mime-type, domain-focus, etc. | ||
|
||
|
||
## Parameters | ||
|
||
For configuring the crawl, users need to specify the following parameters: | ||
|
||
| parameter:type | Description | | ||
| --- | --- | | ||
| urls:list | list of seed URLs (e.g., ['https://thealliance.ai'] or ['https://www.apache.org/projects','https://www.apache.org/foundation']). The list can include any number of valid URLs that are not configured to block web crawlers | ||
| depth:int | controls the crawling depth | ||
| downloads:int | number of downloads that are stored to the download folder. Since the crawler operations happen asynchronously, the process can retrieve any subset of that size from the visited URLs (e.g., any 10 when downloads=10), so consecutive runs can result in different files being downloaded | ||
| folder:str | folder where downloaded files are stored. If the folder is not empty, new files are added or replace the existing ones with the same URLs | | ||
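The effect of asynchronous crawling on the `downloads` limit can be seen in a small simulation. This is only a sketch and does not use the transform's actual crawler: `fetch` and `crawl` are hypothetical helpers, the URLs are placeholders, and random sleeps stand in for network latency.

```python
import asyncio
import random


async def fetch(url: str) -> str:
    # Simulated network latency; no real request is made.
    await asyncio.sleep(random.uniform(0.0, 0.05))
    return url


async def crawl(urls: list[str], downloads: int) -> list[str]:
    """Return the first `downloads` URLs whose simulated fetch completes."""
    tasks = [asyncio.create_task(fetch(u)) for u in urls]
    stored: list[str] = []
    for finished in asyncio.as_completed(tasks):
        stored.append(await finished)
        if len(stored) == downloads:
            break
    # Cancel the fetches that are no longer needed and wait for them to settle.
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return stored


if __name__ == "__main__":
    pages = [f"https://example.org/page{i}" for i in range(8)]
    print(asyncio.run(crawl(pages, downloads=3)))
```

Running the script several times typically prints different sets of three URLs, which mirrors why consecutive runs of the transform can store different files.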
|
||
|
||
## Install the transform | ||
|
||
The transform can be installed directly from PyPI and depends on data-prep-toolkit and data-prep-connector. | ||
|
||
``` | ||
pip install data-prep-connector | ||
pip install "data-prep-toolkit>=0.2.2.dev2" | ||
pip install "data-prep-toolkit-transform[web2parquet]>=0.2.2.dev3" | ||
``` | ||
|
||
If working from a fork or clone of the git repo, run the following from the repo's root folder: | ||
|
||
``` | ||
cd transforms/universal/web2parquet | ||
make venv | ||
source venv/bin/activate | ||
pip install -r requirements.txt | ||
``` | ||
|
||
## Invoking the transform from a notebook | ||
|
||
To invoke the transform from a notebook, users must enable nested asyncio event loops ( https://pypi.org/project/nest-asyncio/ ), import the transform class, and call the `transform()` function as shown in the example below: | ||
|
||
|
||
``` | ||
import nest_asyncio | ||
nest_asyncio.apply() | ||
from dpk_web2parquet.transform import Web2Parquet | ||
Web2Parquet(urls= ['https://thealliance.ai/'], | ||
depth=2, | ||
downloads=10, | ||
folder='downloads').transform() | ||
``` |
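The need for `nest_asyncio` in a notebook can be illustrated with the standard library alone. Notebook kernels already run an event loop, and plain `asyncio` refuses to start a second one inside it. The sketch below assumes, without confirmation from this commit, that the crawler starts its own event loop; `inner` and `notebook_cell` are hypothetical stand-ins.

```python
import asyncio


async def inner() -> str:
    return "done"


async def notebook_cell() -> str:
    """Stand-in for code executed while an event loop is already running."""
    coro = inner()
    try:
        # Starting a second loop from inside a running one is rejected.
        return asyncio.run(coro)
    except RuntimeError as err:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {err}"


if __name__ == "__main__":
    print(asyncio.run(notebook_cell()))
```

Calling `nest_asyncio.apply()` first, as in the notebook example above, patches `asyncio` so that such nested calls are allowed.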