
Add support for Apache Zeppelin module (or alternative web-based notebook) #23

Open · nchammas opened this issue Oct 16, 2015 · 12 comments

@nchammas
Owner

Would it be useful for users if Flintrock offered support for Apache Zeppelin as an optional module?

I haven't used Zeppelin, but the idea is that Flintrock would install it on the cluster during launch and the user would be able to just point their browser somewhere and get coding. The user works locally and the cluster is just a remote execution environment.

Alternatives to adding a Zeppelin module include:

  • Adding a Spark Notebook module.
  • Performing additional configuration to make running IPython against a remote Spark shell easy.
  • Doing nothing.

I'm not familiar with any of these things (well, apart from doing nothing), so I'll have to play around with them to understand the differences. I like the fact that Zeppelin appears well integrated with Spark and supports coding in multiple languages, not just Scala or just Python.

If you have input on which way to go and why, I'd love to hear it. What does your workflow look like when you are throwing together a quick experiment or prototype and you need a cluster?

@aimran

aimran commented Apr 18, 2016

Hey @nchammas,

Just watched your talk at the Spark Summit and the first thing I googled was flintrock + apache zeppelin.

From the data science side of things, being able to launch a Spark cluster with a Zeppelin notebook server running would be huge and would remove a big pain point. AWS EMR allows it at the moment, but who wants to pay EMR rates for a notebook server that you can't halt/restart?

Is there any way for flintrock to accept extensions that would allow users to add additional facilities like Zeppelin?

Many thanks again for the brilliant effort
Asif

@nchammas
Owner Author

@aimran - We recently merged in a change to open up the Spark master port (#103) so that things like Zeppelin can connect from your local machine to the Spark master on the cluster. Does that cover your use case?
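In your local Zeppelin configuration, I'd guess that means setting something like this (an untested sketch; the address is a placeholder):

export MASTER=spark://<cluster-master-address>:7077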

I'm not familiar with Zeppelin. Is the notebook server a persistent service that would live on the cluster?

Is there any way for flintrock to accept extensions that would allow users to add additional facilities like Zeppelin?

Flintrock currently does not have a system for extensions, but that is something I would be interested in. I don't think it will happen anytime soon though.

What you can do today to set up Zeppelin in an automated fashion is create a script that does all the setup you want and then run it on the master (if that's the only place it's needed).

For example:

flintrock copy-file --master-only my-cluster setup-zeppelin.sh /tmp/
flintrock run-command --master-only my-cluster 'chmod u+x /tmp/setup-zeppelin.sh'
flintrock run-command --master-only my-cluster '/tmp/setup-zeppelin.sh'
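And setup-zeppelin.sh itself could be a short script along these lines (just a sketch; the download URL, version, and SPARK_HOME path are guesses you'd want to verify):

#!/usr/bin/env bash
# Rough sketch: download Zeppelin, point it at the Spark installation
# flintrock set up, and start the daemon. The URL, version, and paths
# below are assumptions, not tested values.
set -e

ZEPPELIN_VERSION="0.5.6-incubating"
curl -sSLO "https://archive.apache.org/dist/incubator/zeppelin/${ZEPPELIN_VERSION}/zeppelin-${ZEPPELIN_VERSION}-bin-all.tgz"
tar xzf "zeppelin-${ZEPPELIN_VERSION}-bin-all.tgz"
cd "zeppelin-${ZEPPELIN_VERSION}-bin-all"

echo "export SPARK_HOME=$HOME/spark" >> conf/zeppelin-env.sh

./bin/zeppelin-daemon.sh start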

Does that get you a workable solution for now?

@aimran

aimran commented Apr 20, 2016

@nchammas

Thanks for the suggestions.

It's definitely helpful to be able to connect from a local machine to the cluster, but with a notebook server the idea would be to have it co-located with your cluster to reduce the round trips involved.

The copy-file command will also come in handy. However, with some tweaks(!), we could leverage flintrock's full power. To illustrate what I mean, consider the following env setup file necessary for a Zeppelin deployment:

export ZEPPELIN_PORT=<%= @server_port %>
export ZEPPELIN_WEBSOCKET_PORT=<%= @web_socket_port %>
export ZEPPELIN_CONF_DIR=/etc/zeppelin/conf
export ZEPPELIN_LOG_DIR=/var/log/zeppelin
export ZEPPELIN_PID_DIR=/var/run/zeppelin
export ZEPPELIN_NOTEBOOK_DIR=/var/lib/zeppelin/notebook
export MASTER=<%= @spark_master_url %>
export SPARK_HOME=/usr/lib/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf

Note the use of templates. Some of these parameters are already available to flintrock when you set up the original Spark cluster. In fact, we could even use the same config.yml file to keep configuration in one place, so long as we provide separate sections.

Therefore, we would need a way for flintrock to fill in the templated parameters as best it could (Jinja2, perhaps). Afterwards, I could still use the copy-file/run-command solution you prescribed.
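For illustration, the rendering step itself could be tiny. A sketch with Jinja2 (using its {{ }} placeholders instead of the ERB-style ones above, and with a made-up master URL):

from jinja2 import Template

# Values flintrock would already know after launch; this URL is made up.
params = {"spark_master_url": "spark://203.0.113.10:7077"}

template = Template("export MASTER={{ spark_master_url }}")
print(template.render(**params))  # -> export MASTER=spark://203.0.113.10:7077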

Something along the lines of:

flintrock copy-file --master-only my-cluster setup-zeppelin.sh /tmp/ --with-config my-cluster

There may be similar use cases for other deployments too. Let me know what you think.

@nchammas
Owner Author

Let me step back here for a moment and clarify the original intent of this issue, since I didn't lay it out in enough detail in the initial posting.

At some point I think it may be valuable to add Zeppelin (or some other notebook) as a first-class Flintrock service, if that makes sense. (Again, I am not too familiar with these web-based notebooks and how they are deployed, but I know people find them useful.)

A "first-class Flintrock service" means Zeppelin gets its own section under services in config.yaml and matching command-line options. (On the Flintrock "backend", it will also get any templated files that it needs, like Spark and HDFS.) As a user, you just say that you want Zeppelin installed on the cluster and Flintrock takes care of the rest.

This is a feature that doesn't exist today. It may be added in the future, after Flintrock is a little more mature. Until then, an intermediate solution that lets people automate the setup of Zeppelin (or anything, really) on their Flintrock cluster is to use Flintrock's copy-file and run-command commands. That is what I was offering in my comment from yesterday.

Now returning to your most recent comments, are you saying that you cannot set up Zeppelin in an automated fashion using copy-file and run-command as they are today?

As for adding a --with-config option to the copy-file command, I don't understand the use case for that. I think copy-file should be a simple command that just does what it says; performing configuration is outside its scope. If you want to edit existing files in an automated fashion, you can overwrite them with copy-file or modify them with run-command.

@aimran

aimran commented Apr 22, 2016

@nchammas

Apologies. I think I just got ahead of myself with the templates/extensions talk. I definitely see them being useful in the future and totally understand you implementing them as flintrock matures.

Otherwise, my reason for suggesting --with-config was to automagically fill in the params in zeppelin.conf, where you have parameters like this:

export MASTER=<%= @spark_master_url %>

Clearly, after launch, flintrock knows the spark_master_url for a given named cluster.

In any case, I had a chance to dig more into the tool. For now, copy-file will have to do.

Thanks so much again

@nchammas
Owner Author

Otherwise, my reason for suggesting --with-config was to automagically fill in the params in zeppelin.conf, where you have parameters like this:

export MASTER=<%= @spark_master_url %>

Ah yes, to automate that today you'd have to parse the YAML output of flintrock describe to get the master address, or perhaps you could define MASTER based on one of the existing environment variables set for Spark itself inside spark-env.sh.
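For the first approach, something like this might do it (an untested sketch; the exact shape of the describe output may differ):

# Pull the master address out of `flintrock describe`, assuming its
# YAML output contains a line like "master: <address>".
MASTER_HOST="$(flintrock describe my-cluster | awk '/master:/ {print $2; exit}')"
export MASTER="spark://${MASTER_HOST}:7077"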

@serialx
Contributor

serialx commented Apr 25, 2016

Nicholas, are you also interested in Jupyter notebook (IPython notebook) support? We are heavily using Jupyter notebook internally for data analysis on Spark. If you are interested in adding Jupyter notebook support to flintrock, I'd like to work on it.

I've already had success installing Jupyter notebook with flintrock's run-command. It's relatively simple to do: just install Anaconda and you get a working Jupyter notebook without even disturbing the Python already installed on the host. This should work on any modern Linux distribution as well.

A few lines here and there would make this work. It would also be light on maintenance. :)
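To sketch it out (the installer URL and paths here are from memory, so treat them as assumptions):

# Rough sketch: install Anaconda on the master without touching the
# system Python. Anaconda ships with Jupyter, so after this
# $HOME/anaconda/bin/jupyter is ready to run.
flintrock run-command --master-only my-cluster '
  wget -q https://repo.continuum.io/archive/Anaconda3-4.0.0-Linux-x86_64.sh -O /tmp/anaconda.sh
  bash /tmp/anaconda.sh -b -p $HOME/anaconda
'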

@nchammas
Owner Author

I'm interested in Jupyter support, but as with Apache Zeppelin, I'm not too familiar with the setup. I've only used Jupyter as a local client tool, so I'm not sure about other deployment and usage scenarios that it supports.

Would Jupyter be installed as a running service on the cluster? So you connect and disconnect multiple times from your local machine, and Jupyter preserves your Spark session until the cluster is restarted. Is that roughly how it would work?

@serialx
Contributor

serialx commented May 2, 2016

I think the usage pattern would be to create a command like this:

flintrock jupyter
# Opens a Jupyter notebook on the master at port 8754. Maybe open a browser
# automatically for the user, pointing at http://[master_ip]:8754

When the user terminates the process with Ctrl+C or the like, the Jupyter notebook would be shut down. This is how people use Jupyter.

@ameent

ameent commented May 17, 2016

Just dropping a note here that my team would love to see Jupyter support 👍

@AlexIoannides

We would be very happy with Zeppelin (or Jupyter) support.

@mrocklin

I would also be happy to see Jupyter support.

@serialx can you share your current solution?
