
Parse robots.txt from originating URLs #442

Merged: 3 commits merged into jupyter:master on Apr 15, 2015

Conversation

@rgbkrk (Member) commented Apr 15, 2015

This totally doesn't seem to be working on http://nbviewer.ipython.org/url/jdj.mit.edu/~stevenj/IJulia%20Preview.ipynb

:-/

Posting this early to help with PyCon sprinters that want to touch the same stuff.

@rgbkrk (Member, Author) commented Apr 15, 2015

The robots.txt parser seems to work correctly when I use it directly with read():

>>> from urllib import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://jdj.mit.edu/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://jdj.mit.edu/~stevenj/IJulia%20Preview.ipynb")
False
>>> rp.can_fetch("*", "http://jdj.mit.edu/~stevenj/IJulia%20Preview.ipynb")
True

@rgbkrk (Member, Author) commented Apr 15, 2015

Ah, parse doesn't take the whole file. It takes a list of lines.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://test.com")
# parse() expects an iterable of lines, not the raw file contents
rp.parse('User-Agent: *\nDisallow: /'.splitlines())
rp.can_fetch('hello', "/")  # -> False

Nicholas Buihner and others added 2 commits April 15, 2015 17:07
@rgbkrk (Member, Author) commented Apr 15, 2015

With the changes from @melignus we're now properly parsing robots.txt while still using Tornado's async fetch of the resource. This will help with #405.
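
Roughly, the flow is: fetch the origin's robots.txt with Tornado's non-blocking HTTP client, hand the lines to the stdlib parser, and ask it whether the notebook URL may be fetched. A minimal sketch, assuming Tornado's gen.coroutine style; the coroutine name and the "allow on fetch failure" fallback are illustrative, not the exact nbviewer code:

# Sketch only: fetch robots.txt asynchronously, then parse it with the
# stdlib parser. robots_allows() is a hypothetical helper name.
from urllib import robotparser

from tornado import gen
from tornado.httpclient import AsyncHTTPClient

@gen.coroutine
def robots_allows(url, robots_url, user_agent="*"):
    client = AsyncHTTPClient()
    try:
        response = yield client.fetch(robots_url)
    except Exception:
        # No robots.txt (or the fetch failed): assume fetching is allowed.
        raise gen.Return(True)
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    # parse() wants a list of lines, not the raw body
    rp.parse(response.body.decode("utf-8", "replace").splitlines())
    raise gen.Return(rp.can_fetch(user_agent, url))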

@rgbkrk changed the title from "[WIP] Parse robots.txt from originating URLs" to "Parse robots.txt from originating URLs" on Apr 15, 2015
@rgbkrk added a commit that referenced this pull request on Apr 15, 2015: "Parse robots.txt from originating URLs"
@rgbkrk merged commit 66319cc into jupyter:master on Apr 15, 2015
@rgbkrk deleted the robotstxt branch on April 15, 2015 at 23:29
@bollwyvl (Contributor) commented

Awesome stuff.

As we think about the url/s handlers as the baseline for other providers (#428), what can we learn from this work? Is a secret gist also not public? Google docs links?

Also, based on this, can we tell whether it is okay to scrape the page and look for ipynbs and directories?

@bollwyvl (Contributor) commented

Also, does this need any form of configuration?

@melignus commented

No configuration. As it stands now, it only adds a courtesy noindex,nofollow in the viewer when the origin's robots.txt has those settings, to honor the origin's wishes.

I would think that if the robots.txt specifically states nofollow,noindex, it would be polite not to scrape the page for more notebooks outside the specified path?
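
For illustration, that courtesy check amounts to something like the following (a hypothetical helper, not the actual nbviewer handler code):

# Sketch only: mirror the origin's robots.txt restriction as a
# noindex,nofollow directive on the rendered page.
def robots_meta_for(rp, url, user_agent="*"):
    """Return the content for a <meta name="robots"> tag, or None."""
    if not rp.can_fetch(user_agent, url):
        return "noindex, nofollow"
    return None

When that returns a value, the page template would emit <meta name="robots" content="noindex, nofollow"> (and the same string could go out as an X-Robots-Tag header), so crawlers treat the rendered notebook the way the origin asked to be treated.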

@rgbkrk (Member, Author) commented Apr 16, 2015

A secret gist is also not public. This one only handled the URL case (not a special GitHub provider). I haven't looked into Google docs.

In general, we're not currently scraping the page to look for other notebooks and directories. That makes me the most comfortable operationally and for the sake of development resources.

> Also, does this need any form of configuration?

If it comes up later for alternative deployments of nbviewer, we can address it. I'm only worried about nbviewer.ipython.org being a fair citizen server of the interwebs for the moment. Propagating noindex (and in turn our own robots.txt) seemed like a decent start.

It might have been better if we had made the default noindex instead of both noindex and nofollow though.
