add documentation
blublinsky committed Nov 27, 2024
1 parent 01f79b8 commit 52b5a0d
Showing 2 changed files with 46 additions and 1 deletion.
42 changes: 42 additions & 0 deletions data-processing-lib/doc/pipelined_transform.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
# Pipelined transform

Typical DPK usage is the sequential invocation of individual transforms, each processing all of the input data and
creating the output data. Such execution is convenient because it produces all of the intermediate data, which can
be useful, especially during debugging.

That said, this approach creates a lot of intermediate data and executes a lot of reads and writes, which can
significantly slow down processing, especially in the case of large data sets.

To overcome this drawback, DPK introduces a new type of transform: the pipeline transform. A pipeline transform
(somewhat similar to an [sklearn pipeline](https://scikit-learn.org/1.5/modules/generated/sklearn.pipeline.Pipeline.html))
is both a transform, meaning it transforms one file at a time, and a pipeline, meaning that the file is transformed by
a set of individual transforms, passing data between them as a byte array in memory.
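To illustrate the idea, the in-memory chaining described above can be sketched as follows. This is a hypothetical sketch, not the actual DPK classes: the transform names and the `transform_binary`/`run_pipeline` helpers here are made up for illustration.

```python
# Hypothetical sketch of in-memory pipelining (not the actual DPK API):
# each transform consumes a byte array and produces a byte array, and the
# pipeline threads one file's data through all of them without touching storage.
class UpperCaseTransform:
    def transform_binary(self, data: bytes) -> bytes:
        return data.upper()

class ReverseTransform:
    def transform_binary(self, data: bytes) -> bytes:
        return data[::-1]

def run_pipeline(transforms, data: bytes) -> bytes:
    # Pass the output of each transform to the next one in memory.
    for transform in transforms:
        data = transform.transform_binary(data)
    return data

result = run_pipeline([UpperCaseTransform(), ReverseTransform()], b"hello")
# result == b"OLLEH"
```

The point is that no intermediate files are written: only the final result of the chain is persisted.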

## Creating a pipeline transform

Creating a pipeline transform requires creating a runtime-specific transform runtime configuration
leveraging [PipelineTransformConfiguration](../python/src/data_processing/transform/pipeline_transform_configuration.py).
Examples of such configurations can be found here:

* [Python](../../transforms/universal/noop/python/src/noop_pipeline_transform_python.py)
* [Ray](../../transforms/universal/noop/ray/src/noop_pipeline_transform_ray.py)
* [Spark](../../transforms/universal/noop/spark/src/noop_pipeline_transform_spark.py)

These are very simple examples, each using a pipeline containing a single transform.

More complex examples, defining a pipeline of two transforms (Resize and NOOP), can be found in
[Python](../python/src/data_processing/test_support/transform/pipeline_transform.py) and
[Ray](../ray/src/data_processing_ray/test_support/transform/pipeline_transform.py).

***Note*** a limitation of the pipeline transform is that all participating transforms have to be different;
the same transform can not be included twice.
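This uniqueness constraint could be enforced with a small check like the following. This is a sketch under the assumption that transforms are distinguished by their class; it is not actual DPK code.

```python
# Hypothetical validation (an assumption for illustration, not DPK code):
# reject a pipeline in which the same transform class appears twice.
def validate_pipeline(transforms) -> None:
    names = [type(t).__name__ for t in transforms]
    duplicates = {name for name in names if names.count(name) > 1}
    if duplicates:
        raise ValueError(f"pipeline transforms must all be different, found duplicates: {sorted(duplicates)}")

class ResizeTransform: ...
class NoopTransform: ...

validate_pipeline([ResizeTransform(), NoopTransform()])   # ok
# validate_pipeline([NoopTransform(), NoopTransform()])   # would raise ValueError
```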

## Running pipeline transform

Similar to `ordinary` transforms, pipeline transforms can be invoked using a launcher, but the parameters
in this case have to include the parameters for all participating transforms. The base class
[AbstractPipelineTransform](../python/src/data_processing/transform/pipeline_transform.py) will initialize
all participating transforms based on these parameters.

***Note*** as per the DPK convention, parameters for every transform are prefixed by the transform name, which means
that a given transform will always get the appropriate parameters.
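The prefixing convention can be sketched as follows. The parameter names and the `params_for` helper are made up for illustration; this is an assumption about how routing could work, not the DPK implementation.

```python
# Hypothetical helper illustrating the prefix convention (not DPK code):
# a transform named "noop" receives only parameters whose keys start with
# "noop_", with the prefix stripped off.
def params_for(transform_name: str, all_params: dict) -> dict:
    prefix = transform_name + "_"
    return {
        key[len(prefix):]: value
        for key, value in all_params.items()
        if key.startswith(prefix)
    }

# Launcher parameters for a two-transform pipeline (names are assumptions).
launcher_params = {"resize_max_mbytes": 1, "noop_sleep_sec": 0}
# Each participating transform gets only its own parameters.
assert params_for("resize", launcher_params) == {"max_mbytes": 1}
assert params_for("noop", launcher_params) == {"sleep_sec": 0}
```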
5 changes: 4 additions & 1 deletion data-processing-lib/doc/transforms.md
Original file line number Diff line number Diff line change
@@ -9,9 +9,12 @@ There are currently two types of transforms defined in DPK:

* [AbstractBinaryTransform](../python/src/data_processing/transform/binary_transform.py) which is a base
class for all data transforms. Data transforms convert a file of data producing zero or more data files
and metadata. A specific class of the binary transform is
and metadata. Specific classes of the binary transform are
[AbstractTableTransform](../python/src/data_processing/transform/table_transform.py) that consumes and produces
data files containing [pyarrow tables](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html)
and [AbstractPipelineTransform](../python/src/data_processing/transform/pipeline_transform.py) that creates
pipelined execution of one or more transforms. For more information on pipelined transforms refer to
[this document](pipelined_transform.md).
* [AbstractFolderTransform](../python/src/data_processing/transform/folder_transform.py) which is a base
class consuming a folder (that can contain an arbitrary set of files that need to be processed together)
and producing zero or more data files and metadata.
