
Add option to download Hadoop from a custom URL #71

Closed
nchammas opened this issue Dec 15, 2015 · 9 comments
Labels
good first issue Good issues for new contributors to tackle

Comments

@nchammas
Owner

As a follow-up to the discussion in #66, perhaps we should add an option to let users download Hadoop from a custom URL.

  • Command-line option: --hdfs-download-source

  • Config file option:

    modules:
      hdfs:
        version: 2.7.1
        download-source: "http://www.apache.org/dyn/closer.lua/hadoop/common/hadoop-{v}/hadoop-{v}.tar.gz?as_json"

{v} will be replaced by the HDFS version passed in to Flintrock, either via the config file or via the command line.

The .../dyn/closer.lua/... value will be Flintrock's internal default, which the user can replace with a specific Apache mirror, or some other source. The only requirements are that the package be downloadable from the cluster without authentication, and that the URL contain the {v} template somewhere.
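The templating described above could be sketched roughly like this (the function and constant names here are illustrative only, not Flintrock's actual code):

```python
# Hypothetical sketch of resolving the download-source template.
# Names are illustrative; Flintrock's real implementation may differ.
DEFAULT_DOWNLOAD_SOURCE = (
    "http://www.apache.org/dyn/closer.lua/hadoop/common/"
    "hadoop-{v}/hadoop-{v}.tar.gz?as_json"
)


def resolve_download_url(template: str, version: str) -> str:
    """Substitute the HDFS version into the download-source template."""
    # Enforce the stated requirement that the URL contain {v} somewhere.
    if "{v}" not in template:
        raise ValueError(
            "download-source must contain the {v} version placeholder: "
            + template
        )
    return template.replace("{v}", version)


# Example: resolving the default template for HDFS 2.7.1.
url = resolve_download_url(DEFAULT_DOWNLOAD_SOURCE, "2.7.1")
# -> "http://www.apache.org/dyn/closer.lua/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz?as_json"
```

A custom mirror would work the same way: the user supplies their own URL containing `{v}`, and Flintrock fills in the configured version.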

I would like to deprecate this option as soon as we have a more robust way of downloading Hadoop quickly and reliably, since slow, unreliable downloads are the main motivation for adding this option in the first place.

Related to #66, #69, #84.

@nchammas
Owner Author

@ericmjonas - To continue our discussion from #84 about an Apache mirror blacklist here, what mirrors have you noticed take forever to serve up Hadoop?

The mirrors I currently have on my list are:

  • 104.45.233.178
  • 74.206.97.82

@ericmjonas
Contributor

I have to be honest, I haven't actually cataloged them yet -- I normally get so frustrated that I abort and immediately go to hacking download_hadoopy.py. I'm sorry :(


@nchammas
Owner Author

OK, so we can say that at least for you, the download source option described here would work really well. Right? 😀

Perhaps we should get that working first, and then later revisit the idea of a mirror blacklist, or some kind of speculative download-retry mechanism for when a download seems like it's gonna take forever.

What do you say?

@ericmjonas
Contributor

That sounds great!

@nchammas
Owner Author

I agree with you that the out-of-box experience is critical. For the record, I'm hesitant to dive into these other approaches because:

  • Speculative download-retry: This would be the "ultimate" solution, but it seems like it would cost a lot in terms of complexity. We'd need a way to monitor the download rate, kill it based on some heuristic, and then try a new download. Seems like a lot of work to me for a relatively limited feature.
  • Mirror blacklist: I'm more open to this, but we have this initial problem of finding and documenting the bad mirrors. I think a prerequisite to making this useful is to implement the work described in #27 (Implement a more lightweight display of launch/start/etc. progress) so that it's easy for users to identify stragglers during launch and see if it's an Apache mirror that is slowing things down.
  • Torrent-style download: This is something I looked into a couple of months ago but couldn't get to work. It's another potential "correct" solution. Basically, have Flintrock use multiple mirrors to download Hadoop, getting each piece of the file from a different mirror. We need some library to make this easy though; this would not be something I would want to build for Flintrock.
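The torrent-style idea above amounts to splitting the file into byte ranges and fetching each range from a different mirror. A rough sketch of just the planning step (this is a hypothetical illustration, not Flintrock code; actually fetching the pieces would use HTTP Range requests and then concatenate the chunks in order):

```python
# Illustrative sketch of planning a multi-mirror download: divide a file
# of known size into fixed-size byte ranges and assign each range to a
# mirror round-robin. Each (mirror, start, end) tuple describes one
# inclusive byte range that could be fetched with an HTTP Range header.
def plan_ranged_download(file_size, mirrors, chunk_size=64 * 1024 * 1024):
    plan = []
    start = 0
    i = 0
    while start < file_size:
        end = min(start + chunk_size, file_size) - 1  # inclusive end byte
        plan.append((mirrors[i % len(mirrors)], start, end))
        start = end + 1
        i += 1
    return plan
```

For example, a 150-byte file split into 100-byte chunks across two mirrors yields two ranges: bytes 0-99 from the first mirror and bytes 100-149 from the second. The hard parts this sketch skips -- retries, slow-mirror detection, and verifying the reassembled file -- are exactly why a ready-made library would be preferable.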

@nchammas
Owner Author

@ericmjonas - I'm currently focused on getting #77 wrapped up and then tackling #16 for 0.4.0.

If you want to take a crack at a PR for this in the meantime, be my guest. I've updated the issue description with my latest thoughts on how this would work.

@BenFradet
Contributor

@nchammas @ericmjonas Do you guys mind if I take this?

@nchammas
Owner Author

Fine by me @BenFradet. Have you run into the issues described here btw? Do you agree with the proposed solution?

@BenFradet
Contributor

Nope, I haven't run into those issues with Flintrock per se, but I do have my own mirror, which I have found more reliable than Apache's in the past.
