add rel="canonical" to snapshots so search engines know what they *should* index. #33

Open
phonedude opened this issue Aug 10, 2017 · 18 comments

@phonedude

Forbidding indexing via robots.txt is half a solution. Using rel="canonical" to give search engines (SEs) a suggestion is the other half: it will prevent indexing of the snapshot and tell the SE what it should index instead. It also aligns with industry-standard practice; see the Wikipedia example at: http://ws-dl.blogspot.com/2017/08/2017-08-07-relcanonical-does-not-mean.html

@annevk
Member

annevk commented Aug 10, 2017

I don't think Wikipedia is doing the right thing.

@hvdsomp

hvdsomp commented Aug 10, 2017

Your comment suggests that your thinking is authoritative without justification. Can't wait for actual justification.

@phonedude
Author

phonedude commented Aug 11, 2017

While not definitive, there is some pretty strong evidence that Google worked with wikia.com (and thus transitively mediawiki/Wikipedia) on early rel="canonical" implementations. I guess it's possible wikia.com coordinated on one aspect of rel="canonical" and then went rogue on another aspect, but that seems unlikely.

whatwg/html#2899 (comment)

@annevk
Member

annevk commented Aug 11, 2017

There's nothing in the definition of canonical that suggests that the canonical version of a dated resource is its maintained variant.

@domenic
Member

domenic commented Aug 11, 2017

The issue may be that the definition of canonical is wrong then. The authors of the dated resources we are examining would rather have search engines index the maintained variant. canonical accomplishes this. If the definition of canonical does not support that usage, we should fix its definition. Can you help suggest new text?

@annevk
Member

annevk commented Aug 11, 2017

Is that the dominant pattern though? What about folks using it per the definition? If you want to change anything here you'd first have to do some kind of analysis of the landscape.

@phonedude
Author

phonedude commented Aug 11, 2017

here's another example:

inside:
https://www.w3.org/TR/2017/REC-shacl-20170720/
and
https://www.w3.org/TR/2017/PR-shacl-20170608/
etc.

there's:

<link rel="canonical" href="https://www.w3.org/TR/shacl/">
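
To check what a dated snapshot declares, something like the following could be used. It is only a minimal sketch using Python's standard `html.parser`; the class name `CanonicalLinkFinder`, the helper `find_canonical`, and the stand-in markup in `SNAPSHOT_HTML` are illustrative assumptions, not taken from the W3C pages themselves (only the `<link>` element is quoted from above).

```python
from html.parser import HTMLParser


class CanonicalLinkFinder(HTMLParser):
    """Collects href values of <link> elements whose rel includes "canonical"."""

    def __init__(self):
        super().__init__()
        self.canonical_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel is a space-separated list of tokens, compared case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "canonical" in rel_tokens and href:
            self.canonical_hrefs.append(href)


def find_canonical(html_text):
    """Return every canonical href declared in the given HTML text."""
    finder = CanonicalLinkFinder()
    finder.feed(html_text)
    return finder.canonical_hrefs


# Stand-in markup (hypothetical); only the <link> line mirrors the thread.
SNAPSHOT_HTML = """<!DOCTYPE html>
<html><head>
<title>Dated snapshot (stand-in markup)</title>
<link rel="canonical" href="https://www.w3.org/TR/shacl/">
</head><body></body></html>"""

print(find_canonical(SNAPSHOT_HTML))
```

Run against the text of a dated page, a non-empty result pointing at the undated URL would indicate the pattern described in this comment.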

@domenic
Member

domenic commented Aug 11, 2017

> Is that the dominant pattern though? What about folks using it per the definition? If you want to change anything here you'd first have to do some kind of analysis of the landscape.

I would be extraordinarily surprised if it wasn't the dominant pattern, given the many many articles on SEO explaining how to use it in that fashion, and its widely-known benefits for popular search engines.

That said, we can probably do some HTTP archive analysis if you think that's necessary...

@domenic
Member

domenic commented May 1, 2018

@annevk any further thoughts on this? Especially given upcoming review drafts, I'd really like to direct search engines to the Living Standard, if they encounter any incoming links to snapshots or review drafts.

@annevk
Member

annevk commented May 2, 2018

@domenic I still think the Living Standard is not the canonical representation of a snapshot. And I also think that if we adjust robots.txt to include review-drafts/ (as I'm planning to at least) it won't be a problem.
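
For comparison, the robots.txt approach could look roughly like the sketch below, checked with Python's standard `urllib.robotparser`. The rule text, the `ExampleBot` user agent, the `spec.example.org` host, and the dated path are all assumptions for illustration; only the `review-drafts/` directory name comes from the comment above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; only the directory name is from the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /review-drafts/
"""


def build_parser(robots_text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Placeholder URLs on a made-up host.
SNAPSHOT_URL = "https://spec.example.org/review-drafts/2018-05/"
LIVING_URL = "https://spec.example.org/"

print(parser.can_fetch("ExampleBot", SNAPSHOT_URL))
print(parser.can_fetch("ExampleBot", LIVING_URL))
```

Under these rules a compliant crawler is told not to fetch the review draft at all, so it never sees any `<link>` element inside it; that is the trade-off debated in the rest of the thread.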

@phonedude
Author

just a reminder, the options are between:

  1. follow ~10 years of established practice by Google, mediawiki/Wikipedia/wikia, W3C, etc.

  2. create a new method

@phonedude
Author

phonedude commented May 2, 2018

I'll try another pass.

Perhaps it's the overloaded word "canonical" that is the problem. Let's replace all instances of "canonical" with "9f3fda2fef6dda85970e12ce9a9b8cbe", the md5 hash of "canonical":

$ echo -n "canonical" | md5
9f3fda2fef6dda85970e12ce9a9b8cbe

there are browser extensions to replace strings with other strings so you never have to see them, so for us all the W3C, Wikipedia, etc. pages now say things like:

<link rel="9f3fda2fef6dda85970e12ce9a9b8cbe" href="https://www.w3.org/TR/shacl/">
<link rel="9f3fda2fef6dda85970e12ce9a9b8cbe" href="https://en.wikipedia.org/wiki/DJ_Shadow"/>

etc.

Now decide whether the interactions between Google and these pages produce the desired semantics (i.e., dated variants hinting "don't index me, index my undated friend here").

If they do, then rel="9f3fda2fef6dda85970e12ce9a9b8cbe" is the rel type you should use.

@domenic
Member

domenic commented May 2, 2018

@annevk As noted previously, I don't think "canonical representation" is a useful definition for rel=canonical. The useful definition (i.e. the one used by implementers) is "what should I put in my search engine index when I see this page."

In the short term, I'd like to implement rel=canonical in our review drafts, without you blocking me. In the longer term, I'd welcome your help in changing the definition of rel=canonical to match implementations.

As for rel=canonical vs. robots.txt, I think it's better to have a crawler be able to follow incoming links and go to the right place, than to block crawlers entirely.

@annevk
Member

annevk commented May 2, 2018

Per https://webmasters.googleblog.com/2013/04/5-common-mistakes-with-relcanonical.html the pages have to be more or less the same. So whenever we do a major refactoring we'd be abusing it, no? Is there some URL that backs up your point of view?

@domenic
Member

domenic commented May 2, 2018

Sure, the first link in that blog post takes us to https://support.google.com/webmasters/answer/139066?visit_id=1-636608746222517583-624958929&rd=1 which has more discussion.

@annevk
Member

annevk commented May 2, 2018

Yeah, and all that talks about is duplicate content, not content under version control.

@domenic
Member

domenic commented May 2, 2018

It makes the effects pretty clear:

> Google uses the canonical pages on your site as the gold standard of your site's content, as far as evaluating content and quality, and the Google Search result usually points to the canonical page, unless one of the duplicates is explicitly better suited to a user's query

> Why should I choose a canonical URL? [...] To specify which URL that you want people to see in search results. To consolidate link signals for similar (emphasis mine) or duplicate pages

@annevk
Member

annevk commented May 2, 2018

I still think it would be better to avoid indexing it at all. "Similar" is not defined and if it turns out to be false at some point in the future we might end up with a weird alternate URL for a standard if it had gotten linked a ton for some reason.
