-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor doc_id with its own dpk_ module name #860
base: dev
Are you sure you want to change the base?
Conversation
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the documentation with the commands needed to build the transform, test it, run it locally, etc. As the build/testing/running methodology has changed, all the corresponding sections in the documentation should be changed accordingly. Even better, as all the transforms can be built/tested/ran using the same commands, this could be provided as a separate document, with each transform linking to that generic documentation.
Please also fix the local run examples for ray and spark, as they error out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the documentation with the commands needed to build the transform, test it, run it locally, etc. Also, as now we have a unified testing infrastructure for python, ray, and spark does the testing run trigger tests for all the runtimes, or only for a particular runtime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cmadam . Yes, we do need to fix the documentation. Let me know if you are willing/have bandwidth to help with this. We will have a new Issue/PR to address the documentation for all the new transforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The target run-cli-spark-sample
don't work:
$ make run-cli-spark-sample
. . .
python3.11 -m dpk_doc_id.spark.transform \
--run_locally True --data_local_config "{ 'input_folder' : '../test-data/input', 'output_folder' : '../output'}" \
--doc_id_int True
usage: transform.py [-h] [--doc_id_doc_column DOC_ID_DOC_COLUMN] [--doc_id_hash_column DOC_ID_HASH_COLUMN] [--doc_id_int_column DOC_ID_INT_COLUMN] [--data_s3_cred DATA_S3_CRED]
[--data_s3_config DATA_S3_CONFIG] [--data_local_config DATA_LOCAL_CONFIG] [--data_max_files DATA_MAX_FILES] [--data_checkpointing DATA_CHECKPOINTING]
[--data_files_to_checkpoint DATA_FILES_TO_CHECKPOINT] [--data_data_sets DATA_DATA_SETS] [--data_files_to_use DATA_FILES_TO_USE] [--data_num_samples DATA_NUM_SAMPLES]
[--runtime_parallelization RUNTIME_PARALLELIZATION] [--runtime_pipeline_id RUNTIME_PIPELINE_ID] [--runtime_job_id RUNTIME_JOB_ID] [--runtime_code_location RUNTIME_CODE_LOCATION]
transform.py: error: unrecognized arguments: --run_locally True
make: *** [Makefile:20: run-cli-spark-sample] Error 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cmadam . Is this something that you can help fix? If not, don't worry, I will tackle it in a next iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The target run-cli-ray-sample
doesn't work:
$ make run-cli-ray-sample
. . .
python3.11 -m dpk_doc_id.ray.transform \
--run_locally True --data_local_config "{ 'input_folder' : '../test-data/input', 'output_folder' : '../output'}" \
--doc_id_int True
17:42:02 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': None, 'int_column': 'True', 'start_id': 0}
17:42:02 INFO - pipeline id pipeline_id
17:42:02 INFO - code location None
17:42:02 INFO - number of workers 1 worker options {'num_cpus': 0.8, 'max_restarts': -1}
17:42:02 INFO - actor creation delay 0
17:42:02 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}
17:42:02 INFO - data factory data_ is using local data access: input_folder - ../test-data/input output_folder - ../output
17:42:02 INFO - data factory data_ max_files -1, n_sample -1
17:42:02 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
17:42:02 INFO - Running locally
2024-12-06 17:42:05,111 INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
(orchestrate pid=165088) 17:42:07 INFO - orchestrator started at 2024-12-06 17:42:07
(orchestrate pid=165088) 17:42:07 ERROR - No input files to process - exiting
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@touma-I Fixed the README for this. |
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Why are these changes needed?
his is a first of a series of restructuring changes that are done to have each transform built as its own module (e.g. dpk_doc_id) with a ray submodule (dpk_doc_id.ray ).
Removed python and ray folders and refactor Dockerfile.python and Dockerfile.ray
remove pyproject.toml and Makefiles
move python code under dpk_doc_id
move ray code under dpk_text_encoder/ray
change import statement to include module name
replace recursive Makefile and use targets from .make.cicd.targets
adapt kfp_ray/Makefile and other make target
Related issue number (if any).
#774