
[Task]: When running a Hop pipeline on a Spark Standalone cluster, the pipeline generates temporary files instead of the exact output file specified in the configuration. #4865

Open
Raja10D opened this issue Feb 5, 2025 · 10 comments

Raja10D commented Feb 5, 2025

What needs to happen?

  1. Configure and run the Hop pipeline on a Spark Standalone cluster.

  2. Check the output directory for the generated files.

Expected Result: The pipeline should generate the exact output file specified in the configuration (expected output file name: output).

Actual Result: The pipeline generates temporary files instead of the exact output file (file generated: output_6656ced7-7a39-4014-9903-c388b8b84852_1.xls).

This pipeline is nothing more than a simple Excel input connected to an Excel writer.
My Excel writer configuration in the pipeline:

[screenshot: Excel Writer transform configuration]

Is there anything I need to change in the configuration, or something else? Please guide me.
The pipeline produces the correct output file with the Hop local run configuration.

Issue Priority

Priority: 2

Issue Component

Component: Pipelines, Component: Actions

hansva commented Feb 5, 2025

I would have to check, but I think this is the intended behavior when running a pipeline via Beam.
There is no way to guarantee that only one file will be written when processing large volumes of data using a distributed engine. To avoid having multiple tasks writing to the same file (which wouldn't work for an Excel file), we add unique identifiers.
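
For background, this matches how Beam's file-based sinks behave in general: a distributed write produces one file per shard, with generated identifiers in the names, and forcing a single shard funnels the whole write through one worker. A minimal Beam Java sketch of the concept, using TextIO rather than Hop's Excel writer purely for illustration:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class ShardedWriteDemo {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(Create.of("row1", "row2", "row3"))
        // Default sharding: the runner picks the shard count, so the output
        // is a set of files like output-00000-of-00003.txt, not "output.txt".
        .apply(TextIO.write().to("output").withSuffix(".txt"));
        // Forcing exactly one file is possible (.withNumShards(1)), but it
        // removes the parallelism of the write.

    p.run().waitUntilFinish();
  }
}
```

The UUID-style suffixes in Hop's Beam Excel output serve the same purpose as Beam's shard numbering: they keep concurrent writers from colliding on a single file name.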

Raja10D commented Feb 5, 2025

> I would have to check, but I think this is the intended behavior when running a pipeline via Beam.
> There is no way to guarantee that only one file will be written when processing large volumes of data using a distributed engine. To avoid having multiple tasks writing to the same file (which wouldn't work for an Excel file), we add unique identifiers.

Issue Description:
(Running on Spark)
When building a Hop pipeline with a single Excel input feeding a single Excel writer, the pipeline produces a single output file as expected. (The output filename differs from the configured one: if we set result as the name in the configuration, it generates a file named result_0_88cccfee-c808-47f9-922d-e8f443630f78_1.xlsv, but the content inside the file is correct.)

When using Hop with two Excel files as input to a single Excel writer, two separate output files are generated. (The content in both files is correct, but I need them combined into one.)
I will insert the pipeline images:

[screenshots: pipeline with one Excel input; pipeline with two Excel inputs]

Is there a way to overcome this issue and ensure that the pipeline produces a single output file when multiple Excel files are used as input? Our primary focus is on working with Excel files.

bamaer commented Feb 5, 2025

why on earth would you do this in Spark?

Raja10D commented Feb 5, 2025

> why on earth would you do this in Spark?

Are there any alternative methods or recommended best practices for executing Hop pipelines in a real-time production environment, especially for handling Excel files effectively?

bamaer commented Feb 5, 2025

the local pipeline run configuration is perfect for this.

Raja10D commented Feb 6, 2025

> the local pipeline run configuration is perfect for this.

Thank you for the suggestion to use the local pipeline run configuration. However, we need to run Hop pipelines on a Spark cluster due to our production environment requirements. Given this context, is it correct to say that running Hop pipelines with large Excel files on a Spark cluster is not workable, and if so, are there any recommended approaches or best practices to overcome this limitation?

bamaer commented Feb 6, 2025

"large" in the context of an Excel file is not what is considered "large" in the context of a Spark cluster. Reading/writing Excel files on a distributed Spark cluster sounds like a square peg, round hole problem to me, but it shouldn't be impossible. You'll need to check your beam + spark configuration to tweak your pipeline.

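In case it helps to see where those knobs live: Hop's Beam Spark pipeline run configuration (master URL and so on) ultimately maps onto Beam's SparkPipelineOptions. A hedged Java sketch of the underlying Beam options, assuming a standalone master on the default port of the local machine:

```java
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SparkRunnerConfigDemo {
  public static void main(String[] args) {
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);

    options.setRunner(SparkRunner.class);
    // spark://localhost:7077 is an assumption: the default standalone master
    // URL when master and worker both run on the same machine.
    options.setSparkMaster("spark://localhost:7077");

    Pipeline p = Pipeline.create(options);
    // ... the pipeline itself goes here; Hop generates this part from the
    // .hpl file when a Beam Spark run configuration is selected.
    p.run().waitUntilFinish();
  }
}
```

The cluster wiring lives in these options, while the file/shard behavior discussed above is controlled at the level of the individual write transform.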
Raja10D commented Feb 6, 2025

> "large" in the context of an Excel file is not what is considered "large" in the context of a Spark cluster. Reading/writing Excel files on a distributed Spark cluster sounds like a square peg, round hole problem to me, but it shouldn't be impossible. You'll need to check your beam + spark configuration to tweak your pipeline.

For additional context, I am using Hop 2.8.0 to create pipelines, Apache Beam 2.50, Spark 3.4.4, and Java 11. The Spark Standalone cluster is set up with the master on my local PC, and the worker is also on the same PC.

Raja10D commented Feb 11, 2025

> "large" in the context of an Excel file is not what is considered "large" in the context of a Spark cluster. Reading/writing Excel files on a distributed Spark cluster sounds like a square peg, round hole problem to me, but it shouldn't be impossible. You'll need to check your beam + spark configuration to tweak your pipeline.

Got it. I was able to run my Hop pipelines on the Spark cluster.
One small doubt: some Hop transforms, such as CSV file input, do not support the Spark engine. Why is that?

bamaer commented Feb 11, 2025

The CSV file input transform is optimized for performance. It lets you read local files by splitting them into multiple parts and reading those parts in parallel. This doesn't work in combination with the distributed processing in e.g. a Spark cluster.
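
To illustrate the technique being described (a simplified sketch, not Hop's actual implementation): the transform splits one file into byte ranges and reads those ranges on parallel threads, which only works when every reader can seek into the same local file — something the distributed workers of a Spark cluster don't share:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelLocalRead {
  // Each worker opens its own handle and seeks to its byte range,
  // which presumes a local, seekable file.
  static byte[] readChunk(String path, long offset, int length) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
      raf.seek(offset);
      byte[] buf = new byte[length];
      raf.readFully(buf);
      return buf;
    }
  }

  public static void main(String[] args) throws Exception {
    String path = "input.csv"; // hypothetical local file
    int parts = 4;             // number of parallel readers
    long size = new File(path).length();
    // Naive equal-sized split; a real CSV splitter also aligns chunk
    // boundaries to line breaks so no row is cut in half.
    long chunk = (size + parts - 1) / parts;

    ExecutorService pool = Executors.newFixedThreadPool(parts);
    List<Future<byte[]>> chunks = new ArrayList<>();
    for (int i = 0; i < parts; i++) {
      long offset = i * chunk;
      int length = (int) Math.max(0, Math.min(chunk, size - offset));
      chunks.add(pool.submit(() -> readChunk(path, offset, length)));
    }
    for (Future<byte[]> f : chunks) f.get(); // real code would parse rows here
    pool.shutdown();
    // On a Spark cluster the executors run on different machines with no
    // shared local filesystem, so this seek-into-one-file strategy can't be
    // distributed as-is — hence the transform's local-engine-only support.
  }
}
```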
