Connection to an external Spark cluster #1

Open
wetneb opened this issue Aug 4, 2021 · 7 comments
Labels
enhancement New feature or request

Comments


wetneb commented Aug 4, 2021

When running OpenRefine with the Spark runner, we always spin a local Spark instance for it. We should instead make it possible to configure OpenRefine to connect to an existing Spark cluster for this.

Proposed solution

Introduce the relevant configuration variables for the Spark runner to make this possible.

wetneb added the enhancement (New feature or request) label on Aug 4, 2021

wetneb commented Dec 23, 2021

This is now configurable. As things stand it is not easy to use, though: we also need to publish a .jar file containing all of OpenRefine's application code (and its dependencies) so that it can be loaded into the remote Spark cluster.
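As a rough illustration (not the actual runner code), this is roughly what connecting to a remote master looks like on the Spark side: the application jar has to be shipped to the executors, typically via spark.jars / SparkConf.setJars, which is why a published fat jar is needed. The master URL and jar path below are hypothetical.

```java
// Illustrative sketch only: connecting to a remote Spark master requires
// shipping the application's classes to the executors via spark.jars.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class RemoteClusterSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("OpenRefine")
                // hypothetical URL of an existing cluster's master
                .setMaster("spark://spark-master.example.org:7077")
                // hypothetical fat jar bundling OpenRefine's application code
                // and its dependencies, so executors can load its classes
                .setJars(new String[] { "/opt/openrefine/openrefine-spark-all.jar" });
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // workflow operations would run against the remote cluster here
        }
    }
}
```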


fcomte commented Jan 3, 2022

I saw that we can start a Spark driver connected to a Mesos engine.
Is it possible for OpenRefine to start a Spark driver connected to a Kubernetes cluster instead of Mesos?

I am very interested in connecting OpenRefine and Spark in a cloud-native way.
I distribute these tools on a Kubernetes platform. One instance of this project is here: onyxia.
Users can start OpenRefine from it.


wetneb commented Jan 3, 2022

Hi @fcomte,

I have not looked into this, but it looks like it should be feasible! Just to check, is your goal to run the OpenRefine web server in Kubernetes itself, or just to have OpenRefine connect to a Spark cluster running in Kubernetes?

For the latter, it seems doable, although there might be some minor things to tweak on OpenRefine's side. The existing parameter refine.runner.sparkMasterURI can already be used to supply a k8s:// address (as documented here: https://spark.apache.org/docs/latest/running-on-kubernetes.html#cluster-mode), and we could add further options to pass other settings to Spark (such as spark.executor.instances and spark.kubernetes.container.image).
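To sketch what that would boil down to on the Spark side (illustrative only, with hypothetical values; the runner does not expose all of these options yet):

```java
// Illustrative sketch: the Spark configuration a k8s:// master URI plus a few
// extra options would translate to. All values below are hypothetical.
import org.apache.spark.SparkConf;

public class KubernetesConfSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("OpenRefine")
                // what refine.runner.sparkMasterURI would carry
                .setMaster("k8s://https://kubernetes.example.org:6443")
                // extra Spark settings that would need to be exposed as well
                .set("spark.executor.instances", "4")
                .set("spark.kubernetes.container.image", "example/openrefine-spark:latest");
        System.out.println(conf.toDebugString());
    }
}
```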

That being said, running OpenRefine with Spark should mostly be useful when deploying OpenRefine workflows in a production environment, rather than when using it interactively through OpenRefine's UI. The latter is possible of course, but it is likely to be less responsive than with the default runner (especially when connecting to a remote cluster).

I am really keen to understand your needs better in any case.


fcomte commented Jan 3, 2022

I am running OpenRefine inside a container (Kubernetes), and it's working.
But this new feature that enables OpenRefine to run jobs inside a Spark cluster has great potential for my users.

Of course, in my context the Spark cluster should use a Kubernetes master.
It would be perfect if I could start a runner that enables Spark cluster mode (with k8s).

I have no concrete use case yet: I manage a data platform, I think this feature would empower users, and I am very interested in testing it myself.

How would the data be handled in that case?
Would it be possible to keep the data in an external S3 bucket?


wetneb commented Jan 5, 2022

At the moment, it is possible to import a file from an external source like S3, but it creates a copy in the workspace (which is stored locally). That is not so useful as things stand, especially when working with an external Spark cluster, because it means data gets streamed from OpenRefine to the worker nodes and back at every operation.
In the future I would like to make it possible to disable this initial copying and let people work off the original copy directly. If that original copy is readily accessible to the workers, I would expect much better performance.
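For illustration (a sketch with a hypothetical bucket and endpoint, not something OpenRefine does today), working off the original copy would mean letting the Spark workers read straight from S3:

```java
// Illustrative sketch: Spark workers reading the original file directly from
// S3 (via the s3a connector) instead of streaming a local workspace copy.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class S3DirectReadSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("OpenRefine-S3-sketch")
                .getOrCreate();
        // credentials/endpoint would normally come from the environment or
        // the Hadoop configuration; this endpoint is hypothetical
        spark.sparkContext().hadoopConfiguration()
                .set("fs.s3a.endpoint", "s3.example.org");
        Dataset<Row> rows = spark.read()
                .option("header", "true")
                .csv("s3a://example-bucket/dataset.csv"); // hypothetical path
        System.out.println(rows.count());
    }
}
```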


fcomte commented Jan 5, 2022

Yes, it would be important not to copy the whole file into the workspace when using Spark.

wetneb transferred this issue from OpenRefine/OpenRefine on May 17, 2023

wetneb commented May 17, 2023

Another approach that also looks promising to me is to make it possible to have the workspace itself accessed via HDFS.

Having the possibility of executing a workflow directly from a file is appealing on paper, but it is likely to be quite slow as soon as the original file is not splittable. Most import formats that OpenRefine supports are not splittable (even CSV with the default settings) so you'd be limited to running the workflow on a single node, which defeats the purpose of Spark.
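To make the splittability point concrete (an illustrative sketch with hypothetical paths, using compression as the example of a non-splittable input): the number of partitions Spark creates is what determines how many workers can process the import in parallel.

```java
// Illustrative sketch: splittability shows up as the number of input
// partitions, i.e. how much of the cluster can work on the file in parallel.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SplittabilitySketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("splittability").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // a plain text file can be split into many partitions
            System.out.println(sc.textFile("/data/large.csv").getNumPartitions());
            // a gzip-compressed file is not splittable: it yields a single
            // partition, so only one worker can read it
            System.out.println(sc.textFile("/data/large.csv.gz").getNumPartitions());
        }
    }
}
```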
