Refactor doc_id with its own dpk_ module name #860

Draft: wants to merge 27 commits into base branch dev

Conversation

@touma-I (Collaborator) commented Dec 5, 2024

Why are these changes needed?

This is the first in a series of restructuring changes made so that each transform is built as its own module (e.g. dpk_doc_id) with a ray submodule (dpk_doc_id.ray).

Removed the python and ray folders and refactored Dockerfile.python and Dockerfile.ray
Removed pyproject.toml and the per-runtime Makefiles
Moved the python code under dpk_doc_id
Moved the ray code under dpk_doc_id/ray
Changed import statements to include the module name (see the sketch after this list)
Replaced the recursive Makefile with targets from .make.cicd.targets
Adapted kfp_ray/Makefile and other make targets
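
For illustration only, the import change looks roughly like the sketch below. The module paths dpk_doc_id and dpk_doc_id.ray come from this PR; the class and old module names in the example are assumptions, not taken from the final code.

# Hypothetical sketch of the new-style imports (names assumed for illustration only).
# Before this PR the transform code lived under a python/ (or ray/) folder and was
# imported without a package prefix; after this PR it is imported from its own
# dpk_ module, with the ray runtime as a submodule:
from dpk_doc_id.transform import DocIDTransformConfiguration          # python transform
from dpk_doc_id.ray.transform import DocIDRayTransformConfiguration   # ray runtime submodule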

Related issue number (if any).

#774

Signed-off-by: Maroun Touma <[email protected]>
@cmadam self-requested a review on December 6, 2024 19:16
@cmadam (Collaborator) left a comment:

Please update the documentation with the commands needed to build the transform, test it, run it locally, etc. As the build/test/run methodology has changed, all the corresponding sections in the documentation should be updated accordingly. Even better, since all the transforms can be built, tested, and run with the same commands, this could be provided as a separate generic document, with each transform linking to it.

Please also fix the local run examples for ray and spark, as they error out.

Collaborator:

Please update the documentation with the commands needed to build the transform, test it, run it locally, etc. Also, now that we have a unified testing infrastructure for python, ray, and spark, does the testing run trigger tests for all the runtimes, or only for a particular runtime?

@touma-I (Author):

Thanks @cmadam. Yes, we do need to fix the documentation. Let me know if you are willing and have the bandwidth to help with this. We will open a new issue/PR to address the documentation for all the new transforms.

Collaborator:

The target run-cli-spark-sample doesn't work:

$ make run-cli-spark-sample
. . .
python3.11 -m dpk_doc_id.spark.transform \
                --run_locally True --data_local_config "{ 'input_folder' : '../test-data/input', 'output_folder' : '../output'}"  \
                --doc_id_int True
usage: transform.py [-h] [--doc_id_doc_column DOC_ID_DOC_COLUMN] [--doc_id_hash_column DOC_ID_HASH_COLUMN] [--doc_id_int_column DOC_ID_INT_COLUMN] [--data_s3_cred DATA_S3_CRED]
                    [--data_s3_config DATA_S3_CONFIG] [--data_local_config DATA_LOCAL_CONFIG] [--data_max_files DATA_MAX_FILES] [--data_checkpointing DATA_CHECKPOINTING]
                    [--data_files_to_checkpoint DATA_FILES_TO_CHECKPOINT] [--data_data_sets DATA_DATA_SETS] [--data_files_to_use DATA_FILES_TO_USE] [--data_num_samples DATA_NUM_SAMPLES]
                    [--runtime_parallelization RUNTIME_PARALLELIZATION] [--runtime_pipeline_id RUNTIME_PIPELINE_ID] [--runtime_job_id RUNTIME_JOB_ID] [--runtime_code_location RUNTIME_CODE_LOCATION]
transform.py: error: unrecognized arguments: --run_locally True
make: *** [Makefile:20: run-cli-spark-sample] Error 2
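
For reference, a possible workaround (an assumption based on the usage message above, not a verified fix) is to invoke the spark runtime without the --run_locally flag, since the spark argument parser does not list it:

python3.11 -m dpk_doc_id.spark.transform \
                --data_local_config "{ 'input_folder' : '../test-data/input', 'output_folder' : '../output'}"  \
                --doc_id_int True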

@touma-I (Author):

Thanks @cmadam. Is this something that you can help fix? If not, don't worry, I will tackle it in the next iteration.

Collaborator:

The target run-cli-ray-sample doesn't work:

$ make run-cli-ray-sample
. . .
python3.11 -m dpk_doc_id.ray.transform \
                --run_locally True --data_local_config "{ 'input_folder' : '../test-data/input', 'output_folder' : '../output'}"  \
                --doc_id_int True
17:42:02 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': None, 'int_column': 'True', 'start_id': 0}
17:42:02 INFO - pipeline id pipeline_id
17:42:02 INFO - code location None
17:42:02 INFO - number of workers 1 worker options {'num_cpus': 0.8, 'max_restarts': -1}
17:42:02 INFO - actor creation delay 0
17:42:02 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}
17:42:02 INFO - data factory data_ is using local data access: input_folder - ../test-data/input output_folder - ../output
17:42:02 INFO - data factory data_ max_files -1, n_sample -1
17:42:02 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
17:42:02 INFO - Running locally
2024-12-06 17:42:05,111	INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
(orchestrate pid=165088) 17:42:07 INFO - orchestrator started at 2024-12-06 17:42:07
(orchestrate pid=165088) 17:42:07 ERROR - No input files to process - exiting
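
For reference, the ray runtime itself starts here; the failure is that no .parquet files were found under the configured input folder. A hypothetical adjustment (an assumption about how relative paths resolve after the folder restructuring, not a verified fix) is to point data_local_config at the test data relative to the transform's top-level folder:

python3.11 -m dpk_doc_id.ray.transform \
                --run_locally True --data_local_config "{ 'input_folder' : 'test-data/input', 'output_folder' : 'output'}"  \
                --doc_id_int True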

touma-I and others added 3 commits December 10, 2024 11:05
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@shahrokhDaijavad (Member):

@touma-I Fixed the README for this.

touma-I and others added 3 commits December 10, 2024 22:29
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>