
[Task]: When running a Hop pipeline on a Spark Standalone cluster, the pipeline generates temporary files instead of the exact output file specified in the configuration. #4865

Open
Raja10D opened this issue Feb 5, 2025 · 10 comments

Raja10D commented Feb 5, 2025

What needs to happen?

  1. Configure and run the Hop pipeline on a Spark Standalone cluster.

  2. Check the output directory for the generated files.

Expected Result: The pipeline should generate the exact output file specified in the configuration (expected output file name: output).

Actual Result: The pipeline generates temporary files instead of the exact output file (file generated: output_6656ced7-7a39-4014-9903-c388b8b84852_1.xls).

This pipeline is nothing more than a simple Excel input connected to an Excel writer.
My Excel writer configuration in the pipeline:

[screenshot: Excel Writer transform configuration]

Is there anything I need to change in the configuration, or something else? Please guide me.
The pipeline produces the correct output file with the Hop local run configuration.

Issue Priority

Priority: 2

Issue Component

Component: Pipelines, Component: Actions

hansva commented Feb 5, 2025

I would have to check, but I think this is the intended behavior when running a pipeline via Beam.
There is no way to guarantee that only one file will be written when processing large volumes of data using a distributed engine. To avoid having multiple tasks writing to the same file (which wouldn't work for an Excel file), we add unique identifiers.
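
For background, this matches how Beam's file-based sinks behave in general: a distributed write produces one file per shard, with generated identifiers in the names, and forcing a single shard funnels the whole write through one worker. A minimal Beam Java sketch of the concept, using TextIO rather than Hop's Excel writer purely for illustration:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class ShardedWriteDemo {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(Create.of("row1", "row2", "row3"))
        // Default sharding: the runner picks the shard count, so the output
        // is a set of files like output-00000-of-00003.txt, not "output.txt".
        .apply(TextIO.write().to("output").withSuffix(".txt"));
        // Forcing exactly one file is possible (.withNumShards(1)), but it
        // removes the parallelism of the write.

    p.run().waitUntilFinish();
  }
}
```

The UUID-style suffixes in Hop's Beam Excel output serve the same purpose as Beam's shard numbering: they keep concurrent writers from colliding on a single file name.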

Raja10D commented Feb 5, 2025

> I would have to check, but I think this is the intended behavior when running a pipeline via Beam.
> There is no way to guarantee that only one file will be written when processing large volumes of data using a distributed engine. To avoid having multiple tasks writing to the same file (which wouldn't work for an Excel file), we add unique identifiers.

Issue Description:
(Running on Spark)
When building a Hop pipeline with a single Excel input feeding a single Excel writer, the pipeline produces a single output file as expected. (The output filename differs from the configured one: if we set result as the name in the configuration, it generates a file named result_0_88cccfee-c808-47f9-922d-e8f443630f78_1.xlsv, but the content inside the file is correct.)

When using Hop with two Excel files as input to a single Excel writer, two separate output files are generated. (The content in both files is correct, but I need them combined into one.)
I will insert the pipeline images:

[screenshots: pipeline with one Excel input; pipeline with two Excel inputs]

Is there a way to overcome this issue and ensure that the pipeline produces a single output file when multiple Excel files are used as input? Our primary focus is on working with Excel files.

bamaer commented Feb 5, 2025

why on earth would you do this in Spark?

Raja10D commented Feb 5, 2025

> why on earth would you do this in Spark?

Are there any alternative methods or recommended best practices for executing Hop pipelines in a real-time production environment, especially for handling Excel files effectively?

bamaer commented Feb 5, 2025

the local pipeline run configuration is perfect for this.

Raja10D commented Feb 6, 2025

> the local pipeline run configuration is perfect for this.

Thank you for the suggestion to use the local pipeline run configuration. However, we need to run Hop pipelines on a Spark cluster due to our production environment requirements. Given this context, is it correct to say that running Hop pipelines with large Excel files on a Spark cluster is not workable, and if so, are there any recommended approaches or best practices to overcome this limitation?

bamaer commented Feb 6, 2025

"large" in the context of an Excel file is not what is considered "large" in the context of a Spark cluster. Reading/writing Excel files on a distributed Spark cluster sounds like a square peg, round hole problem to me, but it shouldn't be impossible. You'll need to check your beam + spark configuration to tweak your pipeline.

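In case it helps to see where those knobs live: Hop's Beam Spark pipeline run configuration (master URL and so on) ultimately maps onto Beam's SparkPipelineOptions. A hedged Java sketch of the underlying Beam options, assuming a standalone master on the default port of the local machine:

```java
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SparkRunnerConfigDemo {
  public static void main(String[] args) {
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);

    options.setRunner(SparkRunner.class);
    // spark://localhost:7077 is an assumption: the default standalone master
    // URL when master and worker both run on the same machine.
    options.setSparkMaster("spark://localhost:7077");

    Pipeline p = Pipeline.create(options);
    // ... the pipeline itself goes here; Hop generates this part from the
    // .hpl file when a Beam Spark run configuration is selected.
    p.run().waitUntilFinish();
  }
}
```

The cluster wiring lives in these options, while the file/shard behavior discussed above is controlled at the level of the individual write transform.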
Raja10D commented Feb 6, 2025

> "large" in the context of an Excel file is not what is considered "large" in the context of a Spark cluster. Reading/writing Excel files on a distributed Spark cluster sounds like a square peg, round hole problem to me, but it shouldn't be impossible. You'll need to check your beam + spark configuration to tweak your pipeline.

For additional context, I am using Hop 2.8.0 to create pipelines, Apache Beam 2.50, Spark 3.4.4, and Java 11. The Spark Standalone cluster is set up with the master on my local PC, and the worker is also on the same PC.

Raja10D commented Feb 11, 2025

> "large" in the context of an Excel file is not what is considered "large" in the context of a Spark cluster. Reading/writing Excel files on a distributed Spark cluster sounds like a square peg, round hole problem to me, but it shouldn't be impossible. You'll need to check your beam + spark configuration to tweak your pipeline.

Got it. I was able to run my Hop pipelines on the Spark cluster.
One small doubt: some Hop transforms, such as CSV file input, do not support the Spark engine. Why is that?

bamaer commented Feb 11, 2025

The CSV file input transform is optimized for performance. It lets you read local files by splitting them into multiple parts and reading those parts in parallel. This doesn't work in combination with the distributed processing in e.g. a Spark cluster.
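
To illustrate the technique being described (a simplified sketch, not Hop's actual implementation): the transform splits one file into byte ranges and reads those ranges on parallel threads, which only works when every reader can seek into the same local file — something the distributed workers of a Spark cluster don't share:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelLocalRead {
  // Each worker opens its own handle and seeks to its byte range,
  // which presumes a local, seekable file.
  static byte[] readChunk(String path, long offset, int length) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
      raf.seek(offset);
      byte[] buf = new byte[length];
      raf.readFully(buf);
      return buf;
    }
  }

  public static void main(String[] args) throws Exception {
    String path = "input.csv"; // hypothetical local file
    int parts = 4;             // number of parallel readers
    long size = new File(path).length();
    // Naive equal-sized split; a real CSV splitter also aligns chunk
    // boundaries to line breaks so no row is cut in half.
    long chunk = (size + parts - 1) / parts;

    ExecutorService pool = Executors.newFixedThreadPool(parts);
    List<Future<byte[]>> chunks = new ArrayList<>();
    for (int i = 0; i < parts; i++) {
      long offset = i * chunk;
      int length = (int) Math.max(0, Math.min(chunk, size - offset));
      chunks.add(pool.submit(() -> readChunk(path, offset, length)));
    }
    for (Future<byte[]> f : chunks) f.get(); // real code would parse rows here
    pool.shutdown();
    // On a Spark cluster the executors run on different machines with no
    // shared local filesystem, so this seek-into-one-file strategy can't be
    // distributed as-is — hence the transform's local-engine-only support.
  }
}
```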
