Refactor doc_id with its own dpk_ module name #860

Draft: wants to merge 27 commits into base branch dev

Conversation

@touma-I (Collaborator) commented Dec 5, 2024

Why are these changes needed?

This is the first in a series of restructuring changes made so that each transform is built as its own module (e.g. dpk_doc_id) with a ray submodule (dpk_doc_id.ray).

Removed the python and ray folders and refactored Dockerfile.python and Dockerfile.ray
Removed pyproject.toml and the per-runtime Makefiles
Moved the python code under dpk_doc_id
Moved the ray code under dpk_doc_id/ray
Changed import statements to include the module name (see the sketch after this list)
Replaced the recursive Makefile with targets from .make.cicd.targets
Adapted kfp_ray/Makefile and other make targets
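
For illustration only, the import change looks roughly like the sketch below. The module paths dpk_doc_id and dpk_doc_id.ray come from this PR; the class and old module names in the example are assumptions, not taken from the final code.

# Hypothetical sketch of the new-style imports (names assumed for illustration only).
# Before this PR the transform code lived under a python/ (or ray/) folder and was
# imported without a package prefix; after this PR it is imported from its own
# dpk_ module, with the ray runtime as a submodule:
from dpk_doc_id.transform import DocIDTransformConfiguration          # python transform
from dpk_doc_id.ray.transform import DocIDRayTransformConfiguration   # ray runtime submodule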

Related issue number (if any).

#774

Signed-off-by: Maroun Touma <[email protected]>
@cmadam self-requested a review on December 6, 2024 19:16
@cmadam (Collaborator) left a comment:

Please update the documentation with the commands needed to build the transform, test it, run it locally, etc. As the build/test/run methodology has changed, all the corresponding sections in the documentation should be updated accordingly. Even better, since all the transforms can be built, tested, and run with the same commands, this could be provided as a separate generic document, with each transform linking to it.

Please also fix the local run examples for ray and spark, as they error out.

Collaborator:

Please update the documentation with the commands needed to build the transform, test it, run it locally, etc. Also, now that we have a unified testing infrastructure for python, ray, and spark, does the testing run trigger tests for all the runtimes, or only for a particular runtime?

@touma-I (Author):

Thanks @cmadam. Yes, we do need to fix the documentation. Let me know if you are willing and have the bandwidth to help with this. We will open a new issue/PR to address the documentation for all the new transforms.

Collaborator:

The target run-cli-spark-sample doesn't work:

$ make run-cli-spark-sample
. . .
python3.11 -m dpk_doc_id.spark.transform \
                --run_locally True --data_local_config "{ 'input_folder' : '../test-data/input', 'output_folder' : '../output'}"  \
                --doc_id_int True
usage: transform.py [-h] [--doc_id_doc_column DOC_ID_DOC_COLUMN] [--doc_id_hash_column DOC_ID_HASH_COLUMN] [--doc_id_int_column DOC_ID_INT_COLUMN] [--data_s3_cred DATA_S3_CRED]
                    [--data_s3_config DATA_S3_CONFIG] [--data_local_config DATA_LOCAL_CONFIG] [--data_max_files DATA_MAX_FILES] [--data_checkpointing DATA_CHECKPOINTING]
                    [--data_files_to_checkpoint DATA_FILES_TO_CHECKPOINT] [--data_data_sets DATA_DATA_SETS] [--data_files_to_use DATA_FILES_TO_USE] [--data_num_samples DATA_NUM_SAMPLES]
                    [--runtime_parallelization RUNTIME_PARALLELIZATION] [--runtime_pipeline_id RUNTIME_PIPELINE_ID] [--runtime_job_id RUNTIME_JOB_ID] [--runtime_code_location RUNTIME_CODE_LOCATION]
transform.py: error: unrecognized arguments: --run_locally True
make: *** [Makefile:20: run-cli-spark-sample] Error 2
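
For reference, a possible workaround (an assumption based on the usage message above, not a verified fix) is to invoke the spark runtime without the --run_locally flag, since the spark argument parser does not list it:

python3.11 -m dpk_doc_id.spark.transform \
                --data_local_config "{ 'input_folder' : '../test-data/input', 'output_folder' : '../output'}"  \
                --doc_id_int True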

@touma-I (Author):

Thanks @cmadam. Is this something that you can help fix? If not, don't worry, I will tackle it in the next iteration.

Collaborator:

The target run-cli-ray-sample doesn't work:

$ make run-cli-ray-sample
. . .
python3.11 -m dpk_doc_id.ray.transform \
                --run_locally True --data_local_config "{ 'input_folder' : '../test-data/input', 'output_folder' : '../output'}"  \
                --doc_id_int True
17:42:02 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': None, 'int_column': 'True', 'start_id': 0}
17:42:02 INFO - pipeline id pipeline_id
17:42:02 INFO - code location None
17:42:02 INFO - number of workers 1 worker options {'num_cpus': 0.8, 'max_restarts': -1}
17:42:02 INFO - actor creation delay 0
17:42:02 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}
17:42:02 INFO - data factory data_ is using local data access: input_folder - ../test-data/input output_folder - ../output
17:42:02 INFO - data factory data_ max_files -1, n_sample -1
17:42:02 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
17:42:02 INFO - Running locally
2024-12-06 17:42:05,111	INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
(orchestrate pid=165088) 17:42:07 INFO - orchestrator started at 2024-12-06 17:42:07
(orchestrate pid=165088) 17:42:07 ERROR - No input files to process - exiting
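
For reference, the ray runtime itself starts here; the failure is that no .parquet files were found under the configured input folder. A hypothetical adjustment (an assumption about how relative paths resolve after the folder restructuring, not a verified fix) is to point data_local_config at the test data relative to the transform's top-level folder:

python3.11 -m dpk_doc_id.ray.transform \
                --run_locally True --data_local_config "{ 'input_folder' : 'test-data/input', 'output_folder' : 'output'}"  \
                --doc_id_int True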

touma-I and others added 3 commits December 10, 2024 11:05
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@shahrokhDaijavad (Member):

@touma-I Fixed the README for this.

touma-I and others added 3 commits December 10, 2024 22:29
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>