Add support for Apache Zeppelin module (or alternative web-based notebook) #23
Hey @nchammas, just watched your talk at the Spark Summit, and the first thing I googled was flintrock + apache zeppelin. From the data science side of things, being able to launch a Spark cluster with a Zeppelin notebook server running would be huge and would remove a big pain point. AWS EMR allows it at the moment, but who wants to pay EMR rates for a notebook server that you can't halt/restart? Is there any way for flintrock to accept extensions that would allow users to add additional facilities like this? Many thanks again for the brilliant effort.
@aimran - We recently merged in a change to open up the Spark master port (#103) so that things like Zeppelin can connect from your local machine to the Spark master on the cluster. Does that cover your use case? I'm not familiar with Zeppelin. Is the notebook server a persistent service that would live on the cluster?
Flintrock currently does not have a system for extensions, but that is something I would be interested in. I don't think it will happen anytime soon, though. What you can do today to set up Zeppelin in an automated fashion is to create a script that does all the setup you want and then run it on the master (if that's the only place it's needed). For example:

```sh
flintrock copy-file --master-only my-cluster setup-zeppelin.sh /tmp/
flintrock run-command --master-only my-cluster 'chmod u+x /tmp/setup-zeppelin.sh'
flintrock run-command --master-only my-cluster '/tmp/setup-zeppelin.sh'
```

Does that get you a workable solution for now?
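For reference, a minimal sketch of what `setup-zeppelin.sh` might contain; the Zeppelin version, download URL, and Spark install path are assumptions for illustration, not anything Flintrock provides:

```sh
#!/usr/bin/env bash
# Hypothetical setup-zeppelin.sh: download Zeppelin, point it at the local
# Spark install, and start the notebook daemon.
set -e
ZEPPELIN_VERSION=0.5.6
curl -O "https://archive.apache.org/dist/incubator/zeppelin/${ZEPPELIN_VERSION}-incubating/zeppelin-${ZEPPELIN_VERSION}-incubating-bin-all.tgz"
tar xzf "zeppelin-${ZEPPELIN_VERSION}-incubating-bin-all.tgz"
cd "zeppelin-${ZEPPELIN_VERSION}-incubating-bin-all"
# Assumption: Flintrock installs Spark under ~/spark on the master.
echo "export SPARK_HOME=${HOME}/spark" >> conf/zeppelin-env.sh
./bin/zeppelin-daemon.sh start
```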
Thanks for the suggestions. It's definitely helpful to be able to connect from the local machine to the cluster, but with a notebook server the idea would be to have it co-located with your cluster to reduce the round trips involved. For reference, a Zeppelin environment file looks something like this:

```sh
export ZEPPELIN_PORT=<%= @server_port %>
export ZEPPELIN_WEBSOCKET_PORT=<%= @web_socket_port %>
export ZEPPELIN_CONF_DIR=/etc/zeppelin/conf
export ZEPPELIN_LOG_DIR=/var/log/zeppelin
export ZEPPELIN_PID_DIR=/var/run/zeppelin
export ZEPPELIN_NOTEBOOK_DIR=/var/lib/zeppelin/notebook
export MASTER=<%= @spark_master_url %>
export SPARK_HOME=/usr/lib/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
```

Note the use of templates. Some of these parameters are already available to flintrock when you set up the original Spark cluster. In fact, we could even use the same config.yml file to keep configuration in one place, so long as we provide separate sections. Therefore, we would need a way for flintrock to fill in the templated parameters as best it could (jinja2, perhaps). Afterwards, I could still use the copy-file/run-command solution you prescribed. Something along the lines of:
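(For illustration, a sketch of that workflow, filling the ERB-style placeholders with `sed`; the template filename, master URL, and target paths are assumptions:)

```sh
# Render the templated zeppelin-env.sh locally, then push it to the master.
# The master URL would come from Flintrock itself (see `flintrock describe`
# later in this thread); the placeholder value here is illustrative.
MASTER_URL="spark://<master-hostname>:7077"
sed "s|<%= @spark_master_url %>|${MASTER_URL}|" zeppelin-env.sh.template > zeppelin-env.sh
flintrock copy-file --master-only my-cluster zeppelin-env.sh /tmp/
flintrock run-command --master-only my-cluster 'sudo mv /tmp/zeppelin-env.sh /etc/zeppelin/conf/'
```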
There may be similar use cases for other deployments too. Let me know what you think.
Let me step back here for a moment and clarify the original intent of this issue, since I didn't lay it out in enough detail in the initial posting.

At some point I think it may be valuable to add Zeppelin (or some other notebook) as a first-class Flintrock service, if that makes sense. (Again, I am not too familiar with these web-based notebooks and how they are deployed, but I know people find them useful.) A "first-class Flintrock service" means Zeppelin gets its own section under `services` in the config file, and Flintrock takes care of installing and configuring it at launch time. This is a feature that doesn't exist today. It may be added in the future, after Flintrock is a little more mature.

Until then, an intermediate solution that lets people automate the setup of Zeppelin (or anything, really) on their Flintrock cluster is to use Flintrock's `copy-file` and `run-command` commands.

Now returning to your most recent comments, are you saying that you cannot set up Zeppelin in an automated fashion using `copy-file` and `run-command`? As for adding a templating mechanism, I'd consider that as part of any future extension system rather than something to tackle now.
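For illustration, such a first-class service section might look like this in the config file; the `zeppelin` key and its fields are hypothetical, not options Flintrock actually supports:

```yaml
# Hypothetical config sketch: neither the zeppelin key nor its fields
# exist in Flintrock today. The spark section mirrors the existing format.
services:
  spark:
    version: 1.6.0
  zeppelin:
    version: 0.5.6
    port: 8890
```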
Apologies. I think I just jumped ahead of myself with the templates/extension talk. I definitely see it being useful in the future and totally understand you implementing it as flintrock matures. Otherwise, my reason for using a template was really for lines like this one:

```sh
export MASTER=<%= @spark_master_url %>
```

Clearly, after launch, flintrock knows the spark_master_url for a given named cluster; the question is how to get it out in a scripted way. In any case, I had a chance to dig more into the tool. For now, the copy-file/run-command approach should tide me over. Thanks so much again.
Ah yes, to automate that today you'd have to parse the YAML output of `flintrock describe`.
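A quick sketch of that parsing, assuming the master's hostname appears under a `master:` key in the `describe` output (the exact key and layout may vary by Flintrock version):

```sh
# Pull the master hostname out of the describe output and build the
# Spark master URL; the 'master:' key is an assumption about the format.
MASTER_HOST=$(flintrock describe my-cluster | awk '/master:/ {print $2; exit}')
MASTER_URL="spark://${MASTER_HOST}:7077"
echo "${MASTER_URL}"
```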
Nicholas, are you also interested in Jupyter notebook (IPython notebook) support? We use Jupyter heavily internally for data analysis on Spark. If you are interested in adding Jupyter notebook support to flintrock, I'd like to work on it. I've already had success installing Jupyter notebook using `run-command`. A few lines here and there would make this work. Would also be maintenance-light. :)
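A minimal sketch of that kind of install, assuming pip is available (or installable via yum) on the cluster's Amazon Linux image; the package names are assumptions:

```sh
# Install Jupyter on the master node over Flintrock's run-command.
flintrock run-command --master-only my-cluster 'sudo yum install -y python27-pip'
flintrock run-command --master-only my-cluster 'sudo pip install jupyter'
```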
I'm interested in Jupyter support, but as with Apache Zeppelin, I'm not too familiar with the setup. I've only used Jupyter as a local client tool, so I'm not sure about other deployment and usage scenarios that it supports. Would Jupyter be installed as a running service on the cluster? So you connect and disconnect multiple times from your local machine, and Jupyter preserves your Spark session until the cluster is restarted; is that roughly how it would work?
I think the usage pattern would be creating a command like this:
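(For illustration, one common pattern, not necessarily the exact command intended here, launches Jupyter as the PySpark driver on the master; the `spark/bin/pyspark` path is an assumption about Flintrock's install layout:)

```sh
# Hypothetical: run Jupyter as the PySpark driver on the cluster master.
# PYSPARK_DRIVER_PYTHON/_OPTS are standard Spark mechanisms; the install
# path is assumed. Ctrl+C on this command shuts the notebook down.
flintrock run-command --master-only my-cluster '
  PYSPARK_DRIVER_PYTHON=jupyter \
  PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --ip=0.0.0.0" \
  spark/bin/pyspark
'
```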
When the user terminates the process using Ctrl+C and the like, the Jupyter notebook would be shut down. This is how people use Jupyter.
Just dropping a note here that my team would love to see Jupyter support 👍
We would be very happy with Zeppelin (or Jupyter) support.
I would also be happy to see Jupyter support. @serialx can you share your current solution?
Would it be useful for users if Flintrock offered support for Apache Zeppelin as an optional module?
I haven't used Zeppelin, but the idea is that Flintrock would install it on the cluster during launch and the user would be able to just point their browser somewhere and get coding. The user works locally and the cluster is just a remote execution environment.
Alternatives to adding a Zeppelin module include:
I'm not familiar with any of these things (well, apart from doing nothing), so I'll have to play around with them to understand the differences. I like the fact that Zeppelin appears well integrated with Spark and supports coding in multiple languages, not just Scala or just Python.
If you have input on which way to go and why, I'd love to hear it. What does your workflow look like when you are throwing together a quick experiment or prototype and you need a cluster?