[Task]: When running a Hop pipeline on a Spark Standalone cluster, the pipeline generates temporary files instead of the exact output file specified in the configuration. #4865
Comments
I would have to check, but I think this is the intended behavior when running a pipeline via Beam.
Issue Description: When using Hop with two Excel files as input to a single Excel writer, the pipeline produces two separate output files. Is there a way to overcome this issue and ensure that the pipeline produces a single output file when multiple Excel files are used as input? Our primary focus is on working with Excel files.
why on earth would you do this in Spark?
Are there any alternative methods or recommended best practices for executing Hop pipelines in a real-time production environment, especially for handling Excel files effectively?
the local pipeline run configuration is perfect for this.
Thank you for the suggestion to use the local pipeline run configuration. However, we need to run Hop pipelines on a Spark cluster due to our production environment requirements. Given this context, is it correct to say that running Hop pipelines with large Excel files on a Spark cluster is not workable, and if so, are there any recommended approaches or best practices to overcome this limitation?
"large" in the context of an Excel file is not what is considered "large" in the context of a Spark cluster. Reading/writing Excel files on a distributed Spark cluster sounds like a square peg, round hole problem to me, but it shouldn't be impossible. You'll need to check your beam + spark configuration to tweak your pipeline. |
For additional context, I am using Hop 2.8.0 to create pipelines, Apache Beam 2.50, Spark 3.4.4, and Java 11. The Spark Standalone cluster is set up with the master on my local PC, and the worker is also on the same PC. |
Got it. I was able to run my Hop pipelines on a Spark cluster.
The CSV file input transform is optimized for performance. It lets you read local files by splitting them into multiple parts and reading those parts in parallel. This doesn't work in combination with the distributed processing in e.g. a Spark cluster. |
What needs to happen?
Configure and run the Hop pipeline on a Spark Standalone cluster.
Check the output directory for the generated files.
Expected Result: The pipeline should generate the exact output file as specified in the configuration (name of output file: output).
Actual Result: The pipeline generates temporary files instead of the exact output file (the file generated: output_6656ced7-7a39-4014-9903-c388b8b84852_1.xls).
This pipeline is nothing but a simple Excel input and Excel writer.
My Excel writer configuration in the pipeline:
Is there anything I need to change in the configuration, or something else?
Please guide me.
The pipeline produces the correct output file on the Hop local runner.
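The observed file name fits the pattern a Beam-style file sink uses for its shards: a prefix, a unique id, and a shard number. The unique id keeps shards written concurrently by different workers from colliding. A small sketch of that naming scheme (a hypothetical helper mirroring the observed pattern, not Beam's actual code):

```python
import uuid


def shard_file_name(prefix: str, shard: int, suffix: str) -> str:
    """Build a sharded file name: <prefix>_<unique id>_<shard><suffix>.

    The unique id prevents collisions between shards written in
    parallel by different workers; this mirrors the file name seen in
    the report (output_6656ced7-..._1.xls), not Beam's exact logic.
    """
    return "{}_{}_{}{}".format(prefix, uuid.uuid4(), shard, suffix)
```

This is why a distributed run yields `output_<uuid>_1.xls` rather than the plain `output` configured in the transform.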
Issue Priority
Priority: 2
Issue Component
Component: Pipelines, Component: Actions