Canonical Link Element


English captions:

00:00:07.380,00:00:12.080
Hi everybody. Welcome back to another video. We're doing this thing where when we speak at a conference

00:00:12.190,00:00:17.470
and we talk about something substantial, not just questions and answers, we talk through our presentation later

00:00:17.560,00:00:21.530
and put it up so people can follow along, watch the slides, and hopefully learn a little bit.

00:00:21.730,00:00:28.560
So today I wanted to talk about the canonical link element. And that's something that Google, Yahoo!, and Microsoft

00:00:28.720,00:00:36.930
all announced at SMX West that they will support in the future. So, the date that we had this announcement was

00:00:37.040,00:00:45.210
February 12, 2009, and the funny thing about it is that Charles Darwin was born exactly 200 years ago that day.

00:00:45.320,00:00:50.710
So I started out with a slide where I made a corny joke and I said, whether you think the web was intelligently

00:00:50.910,00:00:57.710
designed by Tim Berners-Lee, or whether you think the web needs to evolve, either way this is an open standard which

00:00:57.820,00:01:05.240
helps people improve the web. And so we sort of said, what is a big problem that faces people today,

00:01:05.080,00:01:11.050
webmasters, SEOs, site owners on the web? And it's pretty clear that duplicate content is one of the things that

00:01:11.130,00:01:18.630
people care about the most. So what is duplicate content? Well, I've got a slide here where I show I think eight

00:01:18.720,00:01:26.300
different URLs. You know, every single one of these URLs could return completely different content. In practice, we

00:01:26.380,00:01:34.810
as humans whenever we look at www.example.com or just regular example.com or /index or home.asp, we think of it as

00:01:34.910,00:01:41.440
the same page. And in practice, it usually is the same page. So technically it doesn't have to be, but almost always

00:01:41.530,00:01:46.400
web servers will return the same content for like these eight different versions of the URL.

00:01:46.500,00:01:52.860
So, that can cause a lot of problems in search engines if rather than having your backlinks all go to one page,

00:01:53.010,00:01:59.500
instead it's split between a www and a non-www version. And it's a really big headache. How do people solve this?

00:01:59.620,00:02:05.610
How do people fix this? Well, it turns out, and I'll dwell on this slide for just a few minutes, there are a lot

00:02:05.740,00:02:11.260
of ways to fix it. So, some people have joked that this canonical link element is kind of like, you know,

00:02:11.350,00:02:17.440
Spackle that fixes over the appearance of all the cracks in the wall. And the fact is there are a lot of

00:02:17.570,00:02:23.620
ways that you can fix things first and foremost, from the beginning, upstream where you don't need to fix it downstream

00:02:23.720,00:02:29.040
later on. There was a really funny quote by Jill Whalen at the conference where she said,

00:02:29.160,00:02:31.490
"Developers keep SEOs in business."

00:02:31.830,00:02:36.850
Right? And so whether you're a developer or an SEO there are some best practices that can make things a little bit

00:02:36.960,00:02:41.650
easier for your system so that you don't have to worry about this issue of duplicate content at all.

00:02:41.940,00:02:48.480
So, one is to try to make sure that your URLs are standardized, Microsoft sometimes calls them normalized,

00:02:48.640,00:02:55.660
in essence there's only one way to get to the content. If your content management system always generates consistent

00:02:55.750,00:03:00.620
URLs, and they're completely uniform, and you don't have to worry about having eight different versions in the

00:03:00.750,00:03:05.460
first place, that just saves you a lot of trouble. You don't have to worry about the issue coming up at all.
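
As a rough illustration of the standardized-URL idea above, here is a minimal Python sketch that maps several variants of a homepage URL onto one form. The preferred host, the bare host, and the set of default document names are assumptions made up for the example; a real content management system would have its own rules.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration only: the site prefers the www host and
# treats these file names as aliases for the directory itself.
PREFERRED_HOST = "www.example.com"
BARE_HOST = "example.com"
DEFAULT_DOCUMENTS = {"index", "index.html", "index.htm", "home.asp"}


def normalize(url):
    """Return one standardized form for the common variants of a URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = PREFERRED_HOST
    path = parts.path or "/"
    # Split off the last path segment and drop it if it is a default document.
    head, _, last = path.rpartition("/")
    if last.lower() in DEFAULT_DOCUMENTS:
        path = head + "/"
    # The fragment is dropped; the query string is kept unchanged.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


variants = [
    "http://example.com",
    "http://www.example.com/",
    "http://www.example.com/index.html",
    "http://example.com/Home.asp",
]
print({normalize(u) for u in variants})
```

If every link the site generates passes through one function like this, the eight-versions problem does not arise in the first place.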

00:03:05.610,00:03:11.750
So one way to do that is to fix your content management system or your software so that you only generate these URLs

00:03:11.860,00:03:19.310
in a very consistent way. Another thing to do is to think about your site. Suppose you have www.example.com and

00:03:19.380,00:03:26.170
non-www, just plain old example.com. Well if you link to www sometimes and non-www sometimes, it's natural that

00:03:26.290,00:03:33.790
search engines might get a little bit confused. So linking consistently, saying okay, my homepage is going

00:03:33.870,00:03:40.290
to be www.example.com/. Nothing else, that's it. And then making sure that all of your internal linking is consistent,

00:03:40.460,00:03:45.570
that alone can make a really big difference, so that you don't end up with two, three, four copies of each page.

00:03:45.760,00:03:56.280
If you do have, you know, home.asp or index.html, you can rewrite such that all those other URLs are 301 redirects

00:03:56.420,00:04:02.170
to a single URL. So, it's great if you can fix it at the beginning, it's great if you can link consistently so the

00:04:02.320,00:04:08.910
issue never comes up, but if duplicate URLs do occur, then you can use a 301, a permanent redirect as we refer to it,

00:04:09.070,00:04:15.130
to sort of standardize and glom together all of those URLs. And search engines will follow that 301 redirect,

00:04:15.290,00:04:21.740
and typically group them all together. Google also does a couple of extra things that some search engines don't do.
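
The 301 approach described above can be sketched as a small WSGI application in Python. This is only an illustration under assumed names: the canonical host and the table of duplicate paths are hypothetical, query strings are ignored, and a production site would more likely do this in its web server configuration.

```python
# Assumptions for illustration only.
CANONICAL_HOST = "www.example.com"
DUPLICATE_PATHS = {"/index.html": "/", "/home.asp": "/"}


def app(environ, start_response):
    """Send a 301 for known duplicate URLs, otherwise serve the page."""
    host = environ.get("HTTP_HOST", CANONICAL_HOST)
    path = environ.get("PATH_INFO", "/")
    # Look the path up case-insensitively; unknown paths are left as they are.
    target_path = DUPLICATE_PATHS.get(path.lower(), path)
    if host != CANONICAL_HOST or target_path != path:
        location = f"http://{CANONICAL_HOST}{target_path}"
        start_response("301 Moved Permanently", [("Location", location)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"canonical page\n"]
```

The app can be served locally with `wsgiref.simple_server.make_server("", 8000, app)` from the standard library to watch the redirects happen.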

00:04:21.900,00:04:28.440
So, in our Webmaster Tools, our webmaster console, which is totally free, doesn't cost anything at all,

00:04:28.600,00:04:35.900
you can specify, for example my site is mattcutts.com, you can specify if you prefer www.mattcutts.com or non-www,

00:04:36.080,00:04:42.250
so just mattcutts.com. That's a very easy setting, and that solves a lot of duplicate content issues right there.

00:04:42.600,00:04:48.160
And a little-known fact, not everybody realizes this, is that whenever you submit your URLs in what

00:04:48.290,00:04:53.760
we call a Sitemap, which is another standard that's supported by many major search engines, and it's a very simple

00:04:53.900,00:04:59.940
file, it can be as simple as a list of URLs, we take that list of URLs that you submit, and we say to ourselves,

00:05:00.080,00:05:06.140
oh, if we see a URL in that list, and then we see another version of it that's not in the list, we will prefer

00:05:06.280,00:05:12.350
URLs in the list that you gave us. So we sort of use it to break ties whenever you submit URLs from a Sitemap.
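
To make the tie-breaking idea concrete, here is a toy Python sketch. It is not how Google actually implements this; it only assumes a plain-text Sitemap with one URL per line and a list of URLs already known to be duplicates of each other, and the function names are invented for the example.

```python
def parse_text_sitemap(text):
    """Read a plain-text Sitemap: one URL per line, blank lines ignored."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def pick_preferred(duplicates, sitemap_urls):
    """Prefer a duplicate that appears in the Sitemap; otherwise keep the first."""
    for url in duplicates:
        if url in sitemap_urls:
            return url
    return duplicates[0]


sitemap = parse_text_sitemap(
    "http://www.example.com/\n"
    "http://www.example.com/tents\n"
)
print(pick_preferred(["http://example.com/", "http://www.example.com/"], sitemap))
```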

00:05:12.540,00:05:17.320
So there's at least a couple ways that you can give Google hints that try to help out with duplicate content.

00:05:18.370,00:05:26.530
But, that said, there will probably always be duplicate content issues that you can't fix. So, just to run through

00:05:26.640,00:05:32.980
a few example ones. Sometimes, you can't generate a permanent or 301 redirect. For example, at my old school account,

00:05:33.160,00:05:39.550
cs.unc.edu, I don't run the web server there. So I'd have to open a ticket or drop an email to the people that

00:05:39.700,00:05:45.790
administer that system and say hey, can you add a 301 redirect from this page to that page. A lot of free hosts,

00:05:45.990,00:05:52.350
you might not be able to generate a 301 redirect. And you can't help how people link to you. So for example,

00:05:52.530,00:05:58.920
you know, even if you link consistently to just the www version of your website, some other people might link to

00:05:59.050,00:06:03.320
the non-www version. And you can't really control that at all.

00:06:03.580,00:06:12.860
Uppercase versus lowercase paths. Microsoft IIS will support showing pages whether you link to home.asp capitalized

00:06:13.030,00:06:19.680
or lowercase, and sometimes even mixed case. And so if people link to different versions that are uppercase and

00:06:19.820,00:06:25.930
lowercase mixed, that can cause some issues. Session IDs are another really big factor. So I have seen,

00:06:26.100,00:06:33.890
at least in some search engines, a site with a one-page privacy policy. And that privacy policy was indexed

00:06:33.980,00:06:40.880
three thousand times, each time with a different session ID, because the privacy policy was slightly different each time.

00:06:41.320,00:06:47.100
So, you know, session IDs in general if you can avoid them are great. But sometimes you as the

00:06:47.200,00:06:51.740
search engine optimizer or the person who is responsible for the site can't get rid of them entirely.

00:06:52.320,00:06:58.660
Tracking codes, you know, if you're buying ads. Analytics, you know the UTM parameter, landing pages where they

00:06:58.810,00:07:04.210
have to be different landing pages for different ads, those are the sort of things that you sometimes can't get rid of.

00:07:04.540,00:07:10.200
And if you run an e-commerce site, suppose you have different products. You might have sort by descending price

00:07:10.310,00:07:16.190
or sort by ascending price, and sometimes you need to have different facets, different views of your data, and

00:07:16.360,00:07:20.490
conceptually it's really the same thing, it's just a different way to slice and dice it.

00:07:21.230,00:07:28.430
Finally, there's breadcrumbs. So breadcrumbs are how did I get to this page? Am I coming to this red tent example

00:07:28.560,00:07:34.260
via tents, or am I coming to it via colors, or did I come to it because I was interested in accessories?

00:07:34.580,00:07:41.110
How did I land on this page? Even Google's own webmaster help documentation sometimes has a CTX parameter that says

00:07:41.230,00:07:50.540
here's how we got to this page. And that day, it was kind of funny, the Queen had just launched a new website:

00:07:50.660,00:07:59.200
royal.gov.uk. And so I wish the Queen the best, I want her to live long, and I wish the British monarchy the best,

00:07:59.480,00:08:07.180
however, someone at the Telegraph, telegraph.co.uk, had done an SEO audit of this site, and they had found

00:08:07.300,00:08:15.790
duplicate content issues. So you can see right here, just slash, royal.gov.uk/Home.aspx, and then at the very bottom

00:08:15.920,00:08:22.990
I almost made a ransom note style where I mixed uppercase and lowercase. And the royal website returned the same page

00:08:23.100,00:08:29.380
for all three of those URLs. So that was just a very simple example to illustrate that anybody can have these

00:08:29.550,00:08:30.940
sorts of issues.

00:08:31.180,00:08:36.840
So what's the answer? Let's, you know, I've buried the lead enough, how do people solve this particular problem?

00:08:37.040,00:08:41.870
Well, assuming you can't solve it any other way, and absolutely I encourage you to try to fix it upstream,

00:08:42.020,00:08:47.140
to try to link consistently. This is not something that you should just say, oh, now all my problems are solved,

00:08:47.260,00:08:52.670
I don't have to worry about anything else. But, if you can't solve your problems in other ways, there's a very

00:08:52.840,00:08:59.830
simple element, link element, where you can say my canonical, and that's a long word that means you know, my preferred,

00:09:00.180,00:09:07.090
or the primary, or the clean, the pretty version of the URL that I want to use, is not this ugly URL with a tracking

00:09:07.210,00:09:14.280
code or a session ID, it's this pretty URL right over here. And all you have to do is in the head element of this

00:09:14.380,00:09:20.830
document say you know what, even though this has a weird session ID, the pretty version, the canonical version of

00:09:20.920,00:09:28.290
this URL, is over here. And that's literally all it is. It's a very simple open standard. It's one simple element

00:09:28.450,00:09:30.910
that you add to the head of your document.
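
For a concrete picture of that one element, here is a minimal Python sketch that takes an "ugly" URL and produces the link element for its clean version. The list of parameters to strip is an assumption for the example; each site has to decide for itself which parameters do not change the content.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption for illustration: these query parameters never change the content.
NOISE_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign", "sort"}


def canonical_link(url):
    """Build the link element pointing at the clean version of `url`."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NOISE_PARAMS
    ]
    clean = urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
    # Escape the URL so that characters such as & are valid inside the attribute.
    return f'<link rel="canonical" href="{escape(clean, quote=True)}">'


print(canonical_link("http://www.example.com/tents/red?sessionid=abc123&utm_source=ad"))
```

The returned string is what would go inside the head of the page that was requested with the session ID or tracking code.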

00:09:31.770,00:09:38.480
Some interesting little tidbits. This is the director's cut so you get a little bit of extra info. Is this a tag?

00:09:38.780,00:09:45.190
Well, it's kind of, the technical name I believe is "element." But we're all friends here, nobody's going to abuse

00:09:45.310,00:09:51.920
you or you know make fun of you if you call it a canonical link tag versus a canonical link element. People often

00:09:52.050,00:09:59.550
speak about meta tags, right? And so meta tags are things that go in the head of the document as well. And so, if

00:09:59.670,00:10:05.530
a meta tag has a value that is a hyperlink, I think the most correct thing is not for it to be meta, but for it to

00:10:05.690,00:10:12.410
be called "link." And so that's why you see link rel="canonical" href= and the value. So now you know the official

00:10:12.510,00:10:17.080
name, but nobody's going to care if you just call it the canonical link tag.

00:10:18.230,00:10:24.250
One thing that's kind of interesting about this tag, let's just talk about a few high-order bits.

00:10:25.080,00:10:31.440
We don't promise we're going to abide by this 100%. Right? You know, if we see a webmaster and they've accidentally

00:10:31.540,00:10:36.900
shot themselves in the foot, you know maybe they've created an infinite loop, and it's very easy to create an

00:10:37.010,00:10:42.690
infinite loop, we reserve the right to do what we think is best. At least at Google, we are going to treat this as

00:10:42.790,00:10:49.490
a very strong hint. So unless we see some weird corner case or something where you're probably hurting your own site,

00:10:49.650,00:10:56.790
we probably would expect to respect this tag. So I think that in most cases, it will work quite well. But we do have

00:10:56.930,00:11:02.190
to reserve the final, sort of bottom-line ability to say no, we don't think this is what's best for the users.

00:11:03.130,00:11:09.490
Again, if you can fix it yourself upstream, that's much better. So look at all the other alternatives, the other

00:11:09.570,00:11:14.340
choices before you use this tag. Don't just say, oh, I can just slap everything with a canonical link tag and

00:11:14.470,00:11:16.060
boom, I'm done.

00:11:17.160,00:11:24.570
If you're a regular user, just like a mom-and-pop and you use WordPress or you use some shopping cart software,

00:11:24.880,00:11:30.500
it's probably best not to just roll up your sleeves and go digging into it and trying to fix it all yourself,

00:11:30.600,00:11:36.140
at least not quite yet. Wait a little while, because I think plugins will come out, people are talking about hey,

00:11:36.230,00:11:41.590
is WordPress able to add this to the core software, so maybe you don't even need a plugin? So if you're just a regular

00:11:41.720,00:11:47.820
user and you wait a few months, things should be fine. You know it's a brand-new element, so there's time for you

00:11:47.910,00:11:54.830
to sit down and cautiously deliberate and say okay, what kinds of duplicate content do I have, how can I fix it?

00:11:55.030,00:12:00.270
Take a little bit of time. Don't just jump right in and start, oh I'm going to point everywhere, I'm going to do everything.

00:12:00.410,00:12:05.230
There's enough time where this will be supported so you can plan ahead a little bit.

00:12:05.950,00:12:11.770
And as always, if we see people abusing it, we do reserve the right to change how we treat the tag, or to

00:12:11.940,00:12:20.020
not respect the tag. There is a nice way that we try to prevent abuse. We allow things within the same domain,

00:12:20.230,00:12:27.750
but we don't allow things to cross domains. So with 301s, there's always been this notion of can I hijack a site by

00:12:27.880,00:12:34.710
doing weird 301s, and can I steal the reputation of some other site? And at least right now, this element is not

00:12:34.820,00:12:41.170
really subject to that because you can only use it within the same domain. Now a natural question right after that,

00:12:41.290,00:12:45.660
is well, what about subdomains? Can I, you know, do things across different hostnames?

00:12:45.750,00:12:51.110
And the answer is yes, you can. So, I was talking to Tony Hsieh from Zappos, and they were talking about duplicate

00:12:51.230,00:12:56.660
content. And they have a server called zeta.zappos.com, which is sort of their staging software and might be the

00:12:56.750,00:13:03.470
next version. And they were saying, well, can I send my canonicalness, can I splat it from zeta.zappos.com to

00:13:03.620,00:13:07.750
www.zappos.com? And the answer is yes, you absolutely can.

00:13:08.380,00:13:15.920
Can you use it from https and send that to http? Totally, works great for that. It's on the same domain, so it's

00:13:16.040,00:13:19.350
no problem at all, at least within Google to use it for that purpose.

00:13:19.960,00:13:25.890
And then what's the difference between this and a 301 or a permanent redirect? There's really not that much,

00:13:26.000,00:13:33.610
other than this is restricted to one domain. So 301s can cross domains; this is all within the same domain.

00:13:33.830,00:13:40.440
In fact, whenever I think about it, the mental model that I have is that this is essentially like a little mini

00:13:40.560,00:13:48.260
301 redirect that you can generate with this link element. So, you know, if you think about how Google handles 301s,

00:13:48.430,00:13:52.960
that's probably a pretty good guess of how we'll handle this particular element.

00:13:54.290,00:14:00.940
So, a few more questions, since you've got the time, you're watching the video. Do the pages have to be identical?

00:14:01.140,00:14:08.270
Bit for bit identical? No, they do not. Think again about this case where you have a catalog page and you can sort

00:14:08.370,00:14:14.800
by increasing price or decreasing price, those are conceptually pretty close to the same page. So if you want to say

00:14:14.990,00:14:21.740
map this to the same URL, and don't worry about the sort by parameter, you're more than welcome to do that.

00:14:22.110,00:14:26.250
They should be similar. You know, if we see, this is the only thing I can think of where there could be abuse,

00:14:26.380,00:14:31.160
is if you've got a cartoon page over here, and you've got something that's completely irrelevant to cartoons over

00:14:31.280,00:14:36.160
here and you try to combine them together. And you're not really gaining any advantage because you had PageRank on

00:14:36.280,00:14:41.880
this page and on that page. So it really doesn't make sense to combine them, but we do recommend that you use them

00:14:42.030,00:14:45.500
for similar pages. They don't have to be identical, but they should be similar.

00:14:46.400,00:14:55.390
A few sort of niggly bits. How about relative URLs versus absolute URLs? The answer to that is you can use either one.

00:14:55.670,00:15:02.010
We recommend absolute URLs. And there's a very simple reason. When you have relative URLs, you can move a URL and

00:15:02.100,00:15:10.650
everything stays the same relative to that URL. So essentially, you know the homepage can say /images or images.

00:15:10.770,00:15:17.150
And that will move it relative to that particular page. But it's better to have an absolute URL because this is

00:15:17.280,00:15:22.930
a powerful tool, and you really want to say this URL goes to exactly this URL. So you want to specify that.

00:15:23.120,00:15:27.920
Whereas if it's relative, if you mess it up here, then you might mess it up somewhere else as well.
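
The risk with relative URLs can be shown with Python's standard `urljoin`, which resolves an href against the URL of the page that contains it. The page URLs below are made up for the example.

```python
from urllib.parse import urljoin


def resolve_canonical(page_url, href):
    """Resolve a canonical href the way a relative link is resolved."""
    return urljoin(page_url, href)


relative_href = "red.html"
absolute_href = "http://www.example.com/tents/red.html"

# The same relative href points at different targets depending on the page.
print(resolve_canonical("http://www.example.com/tents/red.html?sid=1", relative_href))
print(resolve_canonical("http://www.example.com/sale/red.html?sid=1", relative_href))

# An absolute href always names exactly one URL.
print(resolve_canonical("http://www.example.com/sale/red.html?sid=1", absolute_href))
```

If the same template is served under both `/tents/` and `/sale/`, the relative form quietly produces two different canonical targets, while the absolute form produces one.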

00:15:28.850,00:15:35.440
Can you follow a chain of canonical tags, or canonical elements, just like you can follow a chain of 301 redirects?

00:15:35.820,00:15:41.750
Yes, but again I don't recommend that, because if you have a big site and you have a big chain of 301 redirects,

00:15:41.880,00:15:47.850
it's easy for something to break. So, it's similar, something can break and you don't intend to have the consequences

00:15:47.990,00:15:55.330
that you wanted to, so what I would recommend is absolute URLs, and going from the old URL to the new URL, one hop

00:15:55.500,00:16:01.840
and that's all you do. It's just simpler that way, and you know you want to play it safe. You don't want to

00:16:01.960,00:16:07.180
accidentally shoot yourself in the foot. So what are some ways you can shoot yourself in the foot? Well, what if

00:16:07.290,00:16:13.530
you say my canonical is over here, and that's a 404 page? Right, the page might not exist. What if you had an

00:16:13.620,00:16:18.750
infinite loop? This is canonical. No, this is canonical. And we've all seen those happen, you know, what is the

00:16:18.830,00:16:22.930
Civil War? Look up the War Between the States. What is the War Between the States? Look up the Civil War.

00:16:23.080,00:16:28.100
You know, and now you have to put the dictionary down and your head hurts. So try to avoid infinite loops.

00:16:28.250,00:16:34.480
What if I point to a URL that hasn't been crawled? You know, we'll try to crawl that URL, but that corner case,

00:16:34.630,00:16:41.970
what if I told in the webmaster console, oh yeah, everything should be www.example.com, but then you specify your

00:16:42.070,00:16:48.170
canonicals as non-www, or without the www. So you can do all these sorts of things to almost shoot yourself in the

00:16:48.280,00:16:54.140
foot, and the answer is we will try to handle all of these corner cases in a reasonable way. The slide has some

00:16:54.260,00:17:00.700
Ghostbusters because there's the old saying, "Don't cross the streams," right? So think about this, take some time,

00:17:00.810,00:17:06.090
don't just throw canonical tags on willy-nilly on your site, you know, try to plan it out a little bit so that you

00:17:06.210,00:17:08.080
don't run into these corner cases.
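
One way to plan ahead is to audit the canonical declarations before publishing them. The Python sketch below is a hypothetical checker, not anything a search engine runs: it assumes a dictionary mapping each page URL to the canonical URL that page declares, and it reports loops and chains longer than one hop.

```python
def audit(start, declared):
    """Follow canonical declarations from `start`.

    Returns a (status, path) pair where status is "ok", "chain", or "loop".
    A page that declares itself as canonical ends the walk.
    """
    path = [start]
    while path[-1] in declared and declared[path[-1]] != path[-1]:
        nxt = declared[path[-1]]
        if nxt in path:
            return "loop", path + [nxt]
        path.append(nxt)
    hops = len(path) - 1
    return ("ok" if hops <= 1 else "chain"), path


declared = {
    "/x?sid=1": "/x",   # one hop to a self-referencing page: fine
    "/x": "/x",
    "/old": "/mid",     # two hops: a chain worth flattening
    "/mid": "/new",
    "/a": "/b",         # each page names the other: a loop
    "/b": "/a",
}
for url in ("/x?sid=1", "/old", "/a"):
    print(url, audit(url, declared))
```

A fuller audit would also fetch each final target and confirm that it does not return a 404.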

00:17:09.050,00:17:14.240
So we're getting towards the end of the presentation. I just really wanted to send a shout out to Joachim, who is

00:17:14.360,00:17:19.640
the Google engineer who really did all the implementation, all the heavy lifting on this. Made sure that it worked

00:17:19.790,00:17:25.480
very nicely within a 301, and thought about all the corner cases. So, for example, someone said, well what if

00:17:25.590,00:17:31.520
I have a canonical, and I point to myself? Does that work? Yep, that works fine. What if I have a canonical and my

00:17:31.670,00:17:38.490
href is empty? Well, it turns out that parses as an error, which turns out to point to itself. So all this stuff

00:17:38.610,00:17:44.700
still works because Joachim did a really good design, but again, try to make sure that it's all absolute URLs and

00:17:44.830,00:17:51.240
everything's specified well. Also, I'd love to send a shout out to Greg Grothaus. It turns out when you dig into this,

00:17:51.390,00:17:58.960
a lot of people have proposed similar ideas. I saw at least one post out on the general web after we'd started

00:17:59.070,00:18:05.330
exploring this that said, hey, why don't you do this kind of a proposal? But Greg was really one of the people who

00:18:05.440,00:18:11.890
sparked the discussion at Google, who really pushed for it and had a great idea, and so I sort of think of him as

00:18:11.990,00:18:17.580
at least within Google, he really got the ball rolling and really sparked the wave of work on this, so I really

00:18:17.750,00:18:23.730
appreciate that. And of course all the people, you know, from Maile and Wysz and Adam and Riona who have worked on

00:18:23.840,00:18:29.970
the messaging and reached out to different people. At Yahoo!, Priyank, and a ton of people at Microsoft,

00:18:30.130,00:18:35.480
Nathan Buggia and a bunch of other people as well. My hope is that lots of search engines will support this.

00:18:35.630,00:18:41.680
So, Yahoo! and Microsoft have announced that they will support it, let's keep our fingers crossed for Ask, I'd love

00:18:41.760,00:18:49.140
for them to join in as well. Wikia, so Artur at Wikia had emailed us and sort of asked about doing canonical tags

00:18:49.260,00:18:54.110
anyway. And so it was really great that they could test it out while we were trying it out ourselves.

00:18:54.340,00:18:59.050
And then a ton of webmasters who always give us this sort of feedback on what they'd like to see.

00:18:59.350,00:19:06.430
On this last slide, I just list a bunch of resources, so Google, Yahoo!, and Microsoft all did blog posts about it.

00:19:06.580,00:19:13.110
There's an official Help Center documentation page. And, what we saw was, as people would come and have duplicate

00:19:13.250,00:19:19.620
content questions, Joost had come and sort of asked about an interesting corner case, we just said, hey, you know

00:19:19.740,00:19:23.800
what? We've got this thing coming out that might help with this. And so it was a very nice way to just do a sort

00:19:23.910,00:19:30.550
of very quiet beta test and see how well it worked. So, Joost happened to email just a few days before we were

00:19:30.670,00:19:35.210
ready to announce support, and so we gave him a heads-up about the possibility of this, and he turned around

00:19:35.340,00:19:41.530
plugins not just for WordPress, but also for Magento, which is an e-commerce shopping software, and Drupal, which

00:19:41.660,00:19:46.510
is another open-source content management system, which I think the White House just rolled out using Drupal.

00:19:46.660,00:19:53.960
So really appreciate the work that he's done as well. And in general, you know, be careful, be cautious, plan out

00:19:54.090,00:19:58.880
how you want to use this tag. But we don't intend to make any money off of it, we think it's just good for the web.

00:19:59.020,00:20:06.220
It'll lead to less duplicate content. It's an open standard, so any search engine that crawls the web can use this

00:20:06.310,00:20:09.510
information to help, you know, make the web more relevant and increase the relevancy of their search results.

00:20:09.650,00:20:14.410
And now you know as much as the audience knows when they attended SMX West.

00:20:14.550,00:20:17.930
Thanks very much for listening, and talk to you soon.


Attachments (2)

  • Cm9onOGTgeM-en.txt - on Jun 29, 2009 5:07 PM by Michael Wyszomierski (version 1)
    26k Download
  • Cm9onOGTgeM-es.txt - on Jun 29, 2009 5:07 PM by Michael Wyszomierski (version 1)
    25k Download