Parse robots.txt from originating URLs #442
Conversation
This totally doesn't seem to be working on http://nbviewer.ipython.org/url/jdj.mit.edu/~stevenj/IJulia%20Preview.ipynb :-/
The robots.txt parser seemed to work right when I was using it directly.
Ah, parse() doesn't take the whole file. It takes a list of lines.
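For anyone hitting the same gotcha, here is a minimal sketch (not nbviewer's actual code) using the Python 3 standard-library parser (urllib.robotparser; robotparser in Python 2). The example rules and URLs are made up for illustration:

```python
from urllib import robotparser

# Example robots.txt contents (illustrative only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
# parse() expects a sequence of lines, not the raw file contents:
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("*", "http://example.com/private/nb.ipynb"))  # False
print(rp.can_fetch("*", "http://example.com/public/nb.ipynb"))   # True
```

Passing the raw string instead of splitlines() silently misparses the rules, which matches the symptom above.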
fixed the robots parsing and added the nofollow meta tag
Parse robots.txt from originating URLs
Awesome stuff. As we think about the url/urls handlers as the baseline for other providers (#428), what can we learn from this work? Is a secret gist also not public? Google Docs links? Also, based on this, can we tell whether it is okay to scrape the page and look for .ipynb files and directories?
Also, does this need any form of configuration?
No configuration. As it stands now, it only adds a courtesy noindex,nofollow in the viewer when the same settings are set in the origin's robots.txt, to honor the origin's wishes. I would think that if the robots.txt specifically states nofollow,noindex, it would be polite not to scrape the page for more notebooks outside the specific path?
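As a rough illustration of that courtesy behavior (a sketch only; the helper name and exact tag here are assumptions, not nbviewer's actual implementation), the viewer could check the origin's robots.txt and mirror a disallow as a robots meta tag in the rendered page:

```python
from urllib import robotparser
from urllib.parse import urljoin, urlsplit

def robots_meta_for(origin_url, user_agent="*"):
    # Hypothetical helper, not from nbviewer: return a courtesy
    # noindex,nofollow meta tag when the origin's robots.txt disallows
    # fetching origin_url, and an empty string otherwise.
    root = "{0.scheme}://{0.netloc}/".format(urlsplit(origin_url))
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(root, "robots.txt"))
    rp.read()  # fetch and parse the origin's robots.txt
    if not rp.can_fetch(user_agent, origin_url):
        return '<meta name="robots" content="noindex,nofollow">'
    return ""
```

In a real handler this lookup would presumably be cached and fetch failures ignored, since a missing robots.txt should not block rendering.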
A secret gist is also not public. This one only handled the URL case (not a special GitHub provider). I haven't looked into Google Docs. In general, we're not currently scraping the page to look for other notebooks and directories; that is what makes me most comfortable, both operationally and in terms of development resources.
If it comes up later for alternative deployments of nbviewer, we can address it. I'm only worried about nbviewer.ipython.org being a fair citizen server of the interwebs for the moment. Propagating noindex (and in turn our own robots.txt) seemed like a decent start. It might have been better if we had made the default …
Posting this early to help with PyCon sprinters that want to touch the same stuff.