add rel="canonical" to snapshots so search engines know what they should index. #33

phonedude · 2017-08-10T16:29:03Z

forbidding indexing via robots.txt is half a solution. using rel="canonical" to provide a suggestion to search engines (SEs) is the other half: it will prevent indexing of the snapshot and tell the SE what it should index instead. it also aligns with industry standard practice, see the wikipedia example at: http://ws-dl.blogspot.com/2017/08/2017-08-07-relcanonical-does-not-mean.html
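
For illustration, a minimal sketch of what the proposal amounts to for a single snapshot page. The function names and the example URL are hypothetical, not part of any existing build tooling, and the sketch assumes the snapshot contains a plain `<head>` tag.

```python
import html


def canonical_link(living_standard_url: str) -> str:
    """Build a <link rel="canonical"> element pointing at the maintained document."""
    href = html.escape(living_standard_url, quote=True)
    return f'<link rel="canonical" href="{href}">'


def add_canonical(snapshot_html: str, living_standard_url: str) -> str:
    """Insert the link right after the opening <head> tag, if one is present."""
    marker = "<head>"
    index = snapshot_html.find(marker)
    if index == -1:
        # No plain <head> tag found; leave the snapshot untouched.
        return snapshot_html
    insert_at = index + len(marker)
    return (
        snapshot_html[:insert_at]
        + canonical_link(living_standard_url)
        + snapshot_html[insert_at:]
    )


if __name__ == "__main__":
    # Hypothetical snapshot markup and URL, for demonstration only.
    snapshot = "<html><head><title>Snapshot</title></head><body></body></html>"
    print(add_canonical(snapshot, "https://example.org/spec/"))
```

The snapshot stays reachable for anyone following an old link; only the indexing hint changes.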

annevk · 2017-08-10T17:11:25Z

I don't think Wikipedia is doing the right thing.

hvdsomp · 2017-08-10T22:54:44Z

Your comment suggests that your thinking is authoritative without justification. Can't wait for actual justification.

phonedude · 2017-08-11T03:20:12Z

While not definitive, there is some pretty strong evidence that Google worked with wikia.com (and thus transitively mediawiki/Wikipedia) on early rel="canonical" implementations. I guess it's possible wikia.com coordinated on one aspect of rel="canonical" and then went rogue on another aspect, but that seems unlikely.

whatwg/html#2899 (comment)

annevk · 2017-08-11T05:47:57Z

There's nothing in the definition of canonical that suggests that the canonical version of a dated resource is its maintained variant.

domenic · 2017-08-11T06:01:54Z

The issue may be that the definition of canonical is wrong then. The authors of the dated resources we are examining would rather have search engines index the maintained variant. canonical accomplishes this. If the definition of canonical does not support that usage, we should fix its definition. Can you help suggest new text?

annevk · 2017-08-11T06:55:04Z

Is that the dominant pattern though? What about folks using it per the definition? If you want to change anything here you'd first have to do some kind of analysis of the landscape.

phonedude · 2017-08-11T13:43:58Z

here's another example:

inside:
https://www.w3.org/TR/2017/REC-shacl-20170720/
and
https://www.w3.org/TR/2017/PR-shacl-20170608/
etc.

there's:

domenic · 2017-08-11T19:07:47Z

> Is that the dominant pattern though? What about folks using it per the definition? If you want to change anything here you'd first have to do some kind of analysis of the landscape.

I would be extraordinarily surprised if it wasn't the dominant pattern, given the many many articles on SEO explaining how to use it in that fashion, and its widely-known benefits for popular search engines.

That said, we can probably do some HTTP archive analysis if you think that's necessary...

domenic · 2018-05-01T19:14:05Z

@annevk any further thoughts on this? Especially given upcoming review drafts, I'd really like to direct search engines to the Living Standard, if they encounter any incoming links to snapshots or review drafts.

annevk · 2018-05-02T06:49:23Z

@domenic I still think the Living Standard is not the canonical representation of a snapshot. And I also think that if we adjust robots.txt to include review-drafts/ (as I'm planning to at least) it won't be a problem.
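
For illustration, a sketch of the robots.txt approach described above, checked with Python's standard `urllib.robotparser`. The rules and the `/commit-snapshots/` and `/review-drafts/` paths are assumptions based on this thread and issue #32, not a copy of any deployed file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real file may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /commit-snapshots/
Disallow: /review-drafts/
"""


def is_crawlable(path: str, user_agent: str = "*") -> bool:
    """Return True if the rules above let the given user agent fetch the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/", "/review-drafts/2018-05/", "/commit-snapshots/abc/"):
        print(path, is_crawlable(path))
```

With rules like these a well-behaved crawler never fetches the dated copies, so it also never sees any hint inside them about which URL to index instead.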

phonedude · 2018-05-02T14:10:48Z

just a reminder, the options are between:

1. follow ~10 years of established practice by Google, mediawiki/Wikipedia/wikia, W3C, etc.
2. create a new method

phonedude · 2018-05-02T14:43:10Z

I'll try another pass.

Perhaps it's the overloaded word "canonical" that is the problem. Let's replace all instances of "canonical" with "9f3fda2fef6dda85970e12ce9a9b8cbe", the md5 hash of "canonical":

$ echo -n "canonical" | md5
9f3fda2fef6dda85970e12ce9a9b8cbe

there are browser extensions to replace strings with other strings so you never have to see them, so for us all the W3C, Wikipedia, etc. pages now say things like:

etc.

Now decide if the interactions between Google and these pages produce the desired semantics (i.e., dated variants hinting "don't index me, index my undated friend here")

If they do, then rel="9f3fda2fef6dda85970e12ce9a9b8cbe" is the rel type you should use.

domenic · 2018-05-02T15:05:51Z

@annevk As noted previously, I don't think "canonical representation" is a useful definition for rel=canonical. The useful definition (i.e. the one used by implementers) is "what should I put in my search engine index when I see this page."

In the short term, I'd like to implement rel=canonical in our review drafts, without you blocking me. In the longer term, I'd welcome your help in changing the definition of rel=canonical to match implementations.

As for rel=canonical vs. robots.txt, I think it's better to have a crawler be able to follow incoming links and go to the right place, than to block crawlers entirely.
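
For illustration, a simplified sketch of the "what should I put in my index when I see this page" reading of rel=canonical. This is not how any particular search engine works; the class name, function name, and URLs are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link> whose rel tokens include "canonical"."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel is a space-separated token list; compare case-insensitively.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attributes.get("href"):
            self.canonical = attributes["href"]


def url_to_index(page_url: str, page_html: str) -> str:
    """Return the canonical URL the page declares, or the page's own URL otherwise."""
    finder = CanonicalFinder()
    finder.feed(page_html)
    if finder.canonical:
        # Resolve relative hrefs against the URL the page was fetched from.
        return urljoin(page_url, finder.canonical)
    return page_url


if __name__ == "__main__":
    # Hypothetical review-draft page pointing at a hypothetical maintained URL.
    draft = '<html><head><link rel="canonical" href="/spec/"></head></html>'
    print(url_to_index("https://example.org/review-drafts/2018-05/", draft))
```

A crawler arriving through an old incoming link can still read the draft and then record the maintained URL, which is not possible when robots.txt blocks the fetch outright.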

annevk · 2018-05-02T16:15:29Z

Per https://webmasters.googleblog.com/2013/04/5-common-mistakes-with-relcanonical.html the pages have to be more or less the same. So whenever we do a major refactoring we'd be abusing it, no? Is there some URL that backs up your point of view?

domenic · 2018-05-02T16:18:14Z

Sure, the first link in that blog post takes us to https://support.google.com/webmasters/answer/139066?visit_id=1-636608746222517583-624958929&rd=1 which has more discussion.

annevk · 2018-05-02T16:23:50Z

Yeah, and all that talks about is duplicate content, not content under version control.

domenic · 2018-05-02T16:34:54Z

It makes the effects pretty clear:

> Google uses the canonical pages on your site as the gold standard of your site's content, as far as evaluating content and quality, and the Google Search result usually points to the canonical page, unless one of the duplicates is explicitly better suited to a user's query

> Why should I choose a canonical URL? [...] To specify which URL that you want people to see in search results. To consolidate link signals for similar (emphasis mine) or duplicate pages

annevk · 2018-05-02T16:37:49Z

I still think it would be better to avoid indexing it at all. "Similar" is not defined and if it turns out to be false at some point in the future we might end up with a weird alternate URL for a standard if it had gotten linked a ton for some reason.

phonedude mentioned this issue Aug 10, 2017

Add robots.txt to forbid indexing commit-snapshots #32

Closed
