Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.
This project wraps the Fuzzy Dedup transform with a Ray runtime.
Fuzzy Dedup configuration and command line options are the same as for the base Python transform.
When running the transform with the Ray launcher (i.e., TransformLauncher), the set of Ray launcher options is available in addition to the transform options defined here.
To run the samples, use the following make target to create a virtual environment:
make venv
Subsequently, the main orchestration program can be run with:
source venv/bin/activate
cd src
python fdedup_transform_ray.py
Alternatively, the transforms included in fuzzy dedup can be launched independently:
source venv/bin/activate
cd src
python signature_calc_local_ray.py
python cluster_analysis_local_ray.py
python get_duplicate_list_local_ray.py
python data_cleaning_local_ray.py
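The four commands above can also be chained from a small driver. The following is only a sketch, not part of the project: the `run_stages` helper and its `runner` parameter are hypothetical, and it assumes the scripts should run in the order listed above and that each one signals failure through a non-zero exit code.

```python
import subprocess
import sys

# Stage scripts as listed above. Assumption: the listed order is the
# order in which the stages should run.
STAGES = [
    "signature_calc_local_ray.py",
    "cluster_analysis_local_ray.py",
    "get_duplicate_list_local_ray.py",
    "data_cleaning_local_ray.py",
]


def run_stages(stages=STAGES, runner=subprocess.run):
    """Run each stage script in turn and stop at the first failure.

    Returns the list of stages that completed successfully.
    `runner` is injectable so the loop can be exercised without Ray.
    """
    completed = []
    for script in stages:
        # Use the current interpreter so the active venv is respected.
        result = runner([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} failed with exit code {result.returncode}")
            break
        completed.append(script)
    return completed
```

Calling `run_stages()` from the `src` directory with the virtual environment active would run the stages one after another; stopping at the first failure avoids running a later stage against missing or partial intermediate results.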
After running the transforms, list the output directory to see the results:
ls output
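For a slightly more detailed view than a plain listing, the output folder can be walked from Python. This is a minimal sketch: `summarize_output` is a hypothetical helper, and it assumes only that results land in a folder named `output` relative to the current directory, as in the command above. It makes no assumption about which files the transform writes.

```python
from pathlib import Path


def summarize_output(folder="output"):
    """Return (relative path, size in bytes) for every file under `folder`."""
    root = Path(folder)
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )
```

Printing each pair, for example with `for name, size in summarize_output(): print(name, size)`, makes it easy to spot empty or missing result files after a run.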
To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.
This is a sample notebook that shows how to invoke the Ray fuzzy dedup transform.
For testing fuzzy deduplication in a Ray runtime, use the following make targets. To launch integration tests for all the component transforms of fuzzy dedup (signature calculation, cluster analysis, get duplicate list and data cleaning), use:
make test-src
To test the creation of the Docker image for fuzzy dedup transform and the capability to run a local program inside that image, use:
make test-image