add rel="canonical" to snapshots so search engines know what they *should* index. #33
I don't think Wikipedia is doing the right thing.
Your comment suggests that your thinking is authoritative without justification. Can't wait for actual justification.
While not definitive, there is some pretty strong evidence that Google worked with wikia.com (and thus transitively mediawiki/Wikipedia) on early rel="canonical" implementations. I guess it's possible wikia.com coordinated on one aspect of rel="canonical" and then went rogue on another aspect, but that seems unlikely.
There's nothing in the definition of canonical that suggests that the canonical version of a dated resource is its maintained variant.
The issue may be that the definition of canonical is wrong then. The authors of the dated resources we are examining would rather have search engines index the maintained variant. canonical accomplishes this. If the definition of canonical does not support that usage, we should fix its definition. Can you help suggest new text?
Is that the dominant pattern though? What about folks using it per the definition? If you want to change anything here you'd first have to do some kind of analysis of the landscape.
here's another example: inside: there's: <link rel="canonical" href="https://www.w3.org/TR/shacl/">
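For readers who want to check which URL a given page declares, here is a minimal sketch of pulling the rel="canonical" target out of a document with Python's standard html.parser. The names CanonicalFinder and find_canonical are hypothetical helpers introduced for this illustration, and the sample markup is just the link element quoted above wrapped in a bare page; it is not taken from any actual W3C snapshot.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link> whose rel includes "canonical"."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Only <link> elements matter, and only the first match is kept.
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel is a space-separated, case-insensitive token list.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]


def find_canonical(html):
    """Return the declared canonical URL, or None if the page has none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Assumed sample page: the <link> element quoted in the comment above.
snapshot_html = (
    '<html><head>'
    '<link rel="canonical" href="https://www.w3.org/TR/shacl/">'
    '</head><body></body></html>'
)
print(find_canonical(snapshot_html))
```

Running the same helper over a dated and an undated copy of a document is one quick way to see whether both point at the undated URL.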
I would be extraordinarily surprised if it wasn't the dominant pattern, given the many many articles on SEO explaining how to use it in that fashion, and its widely-known benefits for popular search engines. That said, we can probably do some HTTP archive analysis if you think that's necessary...
@annevk any further thoughts on this? Especially given upcoming review drafts, I'd really like to direct search engines to the Living Standard, if they encounter any incoming links to snapshots or review drafts.
@domenic I still think the Living Standard is not the canonical representation of a snapshot. And I also think that if we adjust robots.txt to include
just a reminder, the options are between:
I'll try another pass. Perhaps it's the overloaded word "canonical" that is the problem. Let's replace all instances of "canonical" with "9f3fda2fef6dda85970e12ce9a9b8cbe", the md5 hash of "canonical": $ echo -n "canonical" | md5 There are browser extensions to replace strings with other strings so you never have to see them, so for us all the W3C, Wikipedia, etc. pages now say things like: <link rel="9f3fda2fef6dda85970e12ce9a9b8cbe" href="https://www.w3.org/TR/shacl/"> etc. Now decide whether the interactions between Google and these pages produce the desired semantics (i.e., dated variants hinting "don't index me, index my undated friend here"). If they do, then rel="9f3fda2fef6dda85970e12ce9a9b8cbe" is the rel type you should use.
@annevk As noted previously, I don't think "canonical representation" is a useful definition for rel=canonical. The useful definition (i.e. the one used by implementers) is "what should I put in my search engine index when I see this page." In the short term, I'd like to implement rel=canonical in our review drafts, without you blocking me. In the longer term, I'd welcome your help in changing the definition of rel=canonical to match implementations. As for rel=canonical vs. robots.txt, I think it's better to have a crawler be able to follow incoming links and go to the right place, than to block crawlers entirely.
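To make the robots.txt side of that comparison concrete, here is a small sketch using Python's standard urllib.robotparser. The rules, the /snapshots/ path, and the example.org URLs are all assumptions invented for illustration; they are not the project's actual robots.txt or URL layout.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep every crawler out of a snapshot directory.
robots_txt = """\
User-agent: *
Disallow: /snapshots/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical URLs for a dated snapshot and the maintained document.
snapshot_url = "https://example.org/snapshots/2018-01/"
living_url = "https://example.org/"

# A crawler that honours these rules skips the snapshot entirely,
# while the maintained document stays fetchable.
snapshot_allowed = parser.can_fetch("*", snapshot_url)
living_allowed = parser.can_fetch("*", living_url)
print(snapshot_allowed, living_allowed)
```

Under these assumed rules a compliant crawler arriving via an incoming link to the snapshot stops at the Disallow line, which is the "block crawlers entirely" outcome the comment contrasts with following the link and being pointed elsewhere.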
Per https://webmasters.googleblog.com/2013/04/5-common-mistakes-with-relcanonical.html the pages have to be more or less the same. So whenever we do a major refactoring we'd be abusing it, no? Is there some URL that backs up your point of view?
Sure, the first link in that blog post takes us to https://support.google.com/webmasters/answer/139066?visit_id=1-636608746222517583-624958929&rd=1 which has more discussion.
Yeah, and all that talks about is duplicate content, not content under version control.
It makes the effects pretty clear:
I still think it would be better to avoid indexing it at all. "Similar" is not defined and if it turns out to be false at some point in the future we might end up with a weird alternate URL for a standard if it had gotten linked a ton for some reason.
Forbidding indexing by robots.txt is half a solution. Using rel="canonical" to provide a suggestion to search engines is the other half: it will prevent indexing of the snapshot and inform the search engine what it should index. It also aligns with industry standard practice; see the Wikipedia example at: http://ws-dl.blogspot.com/2017/08/2017-08-07-relcanonical-does-not-mean.html