
Parse robots.txt from originating URLs #442

Merged: 3 commits merged into jupyter:master on Apr 15, 2015

Conversation

@rgbkrk (Member) commented Apr 15, 2015

This totally doesn't seem to be working on http://nbviewer.ipython.org/url/jdj.mit.edu/~stevenj/IJulia%20Preview.ipynb

:-/

Posting this early to help with PyCon sprinters that want to touch the same stuff.

@rgbkrk (Member, Author) commented Apr 15, 2015

The robots.txt parser seems to work correctly when I use it directly with read():

>>> from urllib import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://jdj.mit.edu/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://jdj.mit.edu/~stevenj/IJulia%20Preview.ipynb")
False
>>> rp.can_fetch("*", "http://jdj.mit.edu/~stevenj/IJulia%20Preview.ipynb")
True

@rgbkrk (Member, Author) commented Apr 15, 2015

Ah, parse doesn't take the whole file. It takes a list of lines.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://test.com")
# parse() expects an iterable of lines, not the raw file contents
rp.parse('User-Agent: *\nDisallow: /'.splitlines())
rp.can_fetch('hello', "/")  # -> False

Nicholas Buihner and others added 2 commits April 15, 2015 17:07
@rgbkrk (Member, Author) commented Apr 15, 2015

With the changes from @melignus we're now properly parsing robots.txt while still using Tornado's async fetch of the resource. This will help with #405.
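
Roughly, the flow is: fetch the origin's robots.txt with Tornado's non-blocking HTTP client, hand the lines to the stdlib parser, and ask it whether the notebook URL may be fetched. A minimal sketch, assuming Tornado's gen.coroutine style; the coroutine name and the "allow on fetch failure" fallback are illustrative, not the exact nbviewer code:

# Sketch only: fetch robots.txt asynchronously, then parse it with the
# stdlib parser. robots_allows() is a hypothetical helper name.
from urllib import robotparser

from tornado import gen
from tornado.httpclient import AsyncHTTPClient

@gen.coroutine
def robots_allows(url, robots_url, user_agent="*"):
    client = AsyncHTTPClient()
    try:
        response = yield client.fetch(robots_url)
    except Exception:
        # No robots.txt (or the fetch failed): assume fetching is allowed.
        raise gen.Return(True)
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    # parse() wants a list of lines, not the raw body
    rp.parse(response.body.decode("utf-8", "replace").splitlines())
    raise gen.Return(rp.can_fetch(user_agent, url))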

@rgbkrk changed the title from "[WIP] Parse robots.txt from originating URLs" to "Parse robots.txt from originating URLs" on Apr 15, 2015
@rgbkrk added a commit that referenced this pull request on Apr 15, 2015: "Parse robots.txt from originating URLs"
@rgbkrk merged commit 66319cc into jupyter:master on Apr 15, 2015
@rgbkrk deleted the robotstxt branch on April 15, 2015 at 23:29
@bollwyvl (Contributor) commented

Awesome stuff.

As we think about the url/s handlers as the baseline for other providers (#428), what can we learn from this work? Is a secret gist also not public? Google docs links?

Also, based on this, can we tell whether it is okay to scrape the page and look for ipynbs and directories?

@bollwyvl (Contributor) commented

Also, does this need any form of configuration?

@melignus commented

No configuration. As it stands now, it only adds a courtesy noindex,nofollow in the viewer when the origin's robots.txt has those settings, to honor the origin's wishes.

I would think that if the robots.txt specifically states nofollow,noindex, it would be polite not to scrape the page for more notebooks outside the specified path?
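
For illustration, that courtesy check amounts to something like the following (a hypothetical helper, not the actual nbviewer handler code):

# Sketch only: mirror the origin's robots.txt restriction as a
# noindex,nofollow directive on the rendered page.
def robots_meta_for(rp, url, user_agent="*"):
    """Return the content for a <meta name="robots"> tag, or None."""
    if not rp.can_fetch(user_agent, url):
        return "noindex, nofollow"
    return None

When that returns a value, the page template would emit <meta name="robots" content="noindex, nofollow"> (and the same string could go out as an X-Robots-Tag header), so crawlers treat the rendered notebook the way the origin asked to be treated.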

@rgbkrk (Member, Author) commented Apr 16, 2015

A secret gist is also not public. This one only handled the URL case (not a special GitHub provider). I haven't looked into Google docs.

In general, we're not currently scraping the page to look for other notebooks and directories. That makes me the most comfortable operationally and for the sake of development resources.

> Also, does this need any form of configuration?

If it comes up later for alternative deployments of nbviewer, we can address it. I'm only worried about nbviewer.ipython.org being a fair citizen server of the interwebs for the moment. Propagating noindex (and in turn our own robots.txt) seemed like a decent start.

It might have been better if we had made the default noindex instead of both noindex and nofollow though.
