English captions: 00:00:07.380,00:00:12.080 Hi everybody. Welcome back to another video. We're doing this thing where when we speak at a conference 00:00:12.190,00:00:17.470 and we talk about something substantial, not just questions and answers, we talk through our presentation later 00:00:17.560,00:00:21.530 and put it up so people can follow along, watch the slides, and hopefully learn a little bit. 00:00:21.730,00:00:28.560 So today I wanted to talk about the canonical link element. And that's something that Google, Yahoo!, and Microsoft 00:00:28.720,00:00:36.930 all announced that they will support in the future at SMX West. So, the date that we had this announcement was 00:00:37.040,00:00:45.210 February 12, 2009, and the funny thing about it is that Charles Darwin was born exactly 200 years ago that day. 00:00:45.320,00:00:50.710 So I started out with a slide where I made a corny joke and I said, whether you think the web was intelligently 00:00:50.910,00:00:57.710 designed by Tim Berners-Lee, or whether you think the web needs to evolve, either way this is an open standard which 00:00:57.820,00:01:05.240 helps people improve the web. And so we sort of said, what is a big problem that faces people today, 00:01:05.080,00:01:11.050 webmasters, SEOs, site owners on the web? And it's pretty clear that duplicate content is one of the things that 00:01:11.130,00:01:18.630 people care about the most. So what is duplicate content? Well, I've got a slide here where I show I think eight 00:01:18.720,00:01:26.300 different URLs, you know every single one of these URLs could return completely different content. In practice, we 00:01:26.380,00:01:34.810 as humans whenever we look at www.example.com or just regular example.com or /index or home.asp, we think of it as 00:01:34.910,00:01:41.440 the same page. And in practice, it usually is the same page. 
So technically it doesn't have to be, but almost always 00:01:41.530,00:01:46.400 web servers will return the same content for like these eight different versions of the URL. 00:01:46.500,00:01:52.860 So, that can cause a lot of problems in search engines if rather than having your backlinks all go to one page, 00:01:53.010,00:01:59.500 instead it's split between a www and a non-www version. And it's a really big headache. How do people solve this? 00:01:59.620,00:02:05.610 How do people fix this? Well, it turns out, and I'll dwell on this slide for just a few minutes, there are a lot 00:02:05.740,00:02:11.260 of ways to fix it. So, some people have joked that this canonical link element is kind of like, you know, 00:02:11.350,00:02:17.440 Spackle that fixes over the appearance of all the cracks in the wall. And the fact is there are a lot of 00:02:17.570,00:02:23.620 ways that you can fix things first and foremost, from the beginning, upstream where you don't need to fix it downstream 00:02:23.720,00:02:29.040 later on. There was a really funny quote by Jill Whalen at the conference where she said, 00:02:29.160,00:02:31.490 "Developers keep SEOs in business." 00:02:31.830,00:02:36.850 Right? And so whether you're a developer or an SEO there are some best practices that can make things a little bit 00:02:36.960,00:02:41.650 easier for your system so that you don't have to worry about this issue of duplicate content at all. 00:02:41.940,00:02:48.480 So, one is to try to make sure that your URLs are standardized, Microsoft sometimes calls them normalized, 00:02:48.640,00:02:55.660 in essence there's only one way to get to the content. If your content management system always generates consistent 00:02:55.750,00:03:00.620 URLs, and they're completely uniform, and you don't have to worry about having eight different versions in the 00:03:00.750,00:03:05.460 first place, that just saves you a lot of trouble. You don't have to worry about the issue coming up at all. 
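To make the idea of standardized (or normalized) URLs concrete, here is a minimal sketch in Python of mapping a few duplicate variants onto one form. The preferred host, the bare host, and the list of default-document names are assumptions chosen for illustration; they are not taken from the slide, and a real content management system would apply its own rules.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical settings for illustration only (not from the talk).
PREFERRED_HOST = "www.example.com"
BARE_HOST = "example.com"
DEFAULT_DOCUMENTS = {"index.html", "index.htm", "index.asp", "home.asp"}


def standardize(url: str) -> str:
    """Map common duplicate variants of a URL onto one standardized form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Treat the bare domain and the www host as the same site.
    if host == BARE_HOST:
        host = PREFERRED_HOST
    path = parts.path or "/"
    # Collapse default-document names onto the directory URL.
    head, _, tail = path.rpartition("/")
    if tail.lower() in DEFAULT_DOCUMENTS:
        path = head + "/"
    # Keep the query string, drop any fragment.
    return urlunsplit((parts.scheme, host, path, parts.query, ""))


if __name__ == "__main__":
    for variant in ("http://example.com",
                    "http://www.example.com/home.asp",
                    "http://EXAMPLE.com/index.html"):
        print(variant, "->", standardize(variant))
```

If the software that generates links always passes them through one function like this, the other variants never get published in the first place.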
00:03:05.610,00:03:11.750 So one way to do that is to fix your content management system or your software so that you only generate these URLs 00:03:11.860,00:03:19.310 in a very consistent way. Another thing to do is to think about your site. Suppose you have www.example.com and 00:03:19.380,00:03:26.170 non-www, just plain old example.com. Well if you link to www sometimes and non-www sometimes, it's natural that 00:03:26.290,00:03:33.790 search engines might get a little bit confused. So linking consistently, saying okay, my homepage is going 00:03:33.870,00:03:40.290 to be www.example.com/. Nothing else, that's it. And then making sure that all of your internal linking is consistent, 00:03:40.460,00:03:45.570 that alone can make a really big difference, so that you don't end up with two, three, four copies of each page. 00:03:45.760,00:03:56.280 If you do have, you know, home.asp or index.html, you can rewrite such that all those other URLs are 301 redirects 00:03:56.420,00:04:02.170 to a single URL. So, it's great if you can fix it at the beginning, it's great if you can link consistently so the 00:04:02.320,00:04:08.910 issue never comes up, but if duplicate URLs do occur, then you can use a 301, a permanent redirect as we refer to it, 00:04:09.070,00:04:15.130 to sort of standardize and glom together all of those URLs. And search engines will follow that 301 redirect, 00:04:15.290,00:04:21.740 and typically group them all together. Google also does a couple of extra things that some search engines don't do. 00:04:21.900,00:04:28.440 So, in our Webmaster Tools, our webmaster console, which is totally free, doesn't cost anything at all, 00:04:28.600,00:04:35.900 you can specify, for example my site is mattcutts.com, you can specify if you prefer www.mattcutts.com or non-www, 00:04:36.080,00:04:42.250 so just mattcutts.com. That's a very easy setting, and that solves a lot of duplicate content issues right there. 
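As a rough illustration of the 301 approach, here is a small sketch of a WSGI application in Python that permanently redirects duplicate hosts and paths to a single URL. The host name, the path table, and the use of plain http are hypothetical choices for the example, and the sketch ignores query strings; most sites would configure this in the web server rather than in application code.

```python
# Hypothetical settings for illustration only (not from the talk).
PREFERRED_HOST = "www.example.com"
DUPLICATE_PATHS = {"/home.asp": "/", "/index.html": "/"}


def app(environ, start_response):
    """Answer duplicate URLs with a 301 to the one preferred URL."""
    host = environ.get("HTTP_HOST", "").lower()
    path = environ.get("PATH_INFO", "/") or "/"
    # Look the path up case-insensitively so /Home.asp is caught too.
    target_path = DUPLICATE_PATHS.get(path.lower(), path)
    if host != PREFERRED_HOST or target_path != path:
        location = f"http://{PREFERRED_HOST}{target_path}"
        start_response("301 Moved Permanently", [("Location", location)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"home page"]
```

The redirect goes straight to the final URL in one hop, so a request for example.com/home.asp is not bounced through www.example.com/home.asp first.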
00:04:42.600,00:04:48.160 And a little-known fact, not everybody realizes this, is that whenever you submit your URLs in what 00:04:48.290,00:04:53.760 we call a Sitemap, which is another standard that's supported by many major search engines, and it's a very simple 00:04:53.900,00:04:59.940 file, it can be as simple as a list of URLs, we take that list of URLs that you submit, and we say to ourselves, 00:05:00.080,00:05:06.140 oh, if we see a URL in that list, and then we see another version of it that's not in the list, we will prefer 00:05:06.280,00:05:12.350 URLs in the list that you gave us. So we sort of use it to break ties whenever you submit URLs from a Sitemap. 00:05:12.540,00:05:17.320 So there's at least a couple ways that you can give Google hints that try to help out with duplicate content. 00:05:18.370,00:05:26.530 But, that said, there will probably always be duplicate content issues that you can't fix. So, just to run through 00:05:26.640,00:05:32.980 a few example ones. Sometimes, you can't generate a permanent or 301 redirect. For example, at my old school account, 00:05:33.160,00:05:39.550 cs.unc.edu, I don't run the web server there. So I'd have to open a ticket or drop an email to the people that 00:05:39.700,00:05:45.790 administer that system and say hey, can you add a 301 redirect from this page to that page. A lot of free hosts, 00:05:45.990,00:05:52.350 you might not be able to generate a 301 redirect. And you can't help how people link to you. So for example, 00:05:52.530,00:05:58.920 you know, even if you link consistently to just the www version of your website, some other people might link to 00:05:59.050,00:06:03.320 the non-www version. And you can't really control that at all. 00:06:03.580,00:06:12.860 Uppercase versus lowercase paths. Microsoft IIS will support showing pages whether you link to home.asp capitalized 00:06:13.030,00:06:19.680 or lowercase, and sometimes even mixed case. 
And so if people link to different versions that are uppercase and 00:06:19.820,00:06:25.930 lowercase mixed, that can cause some issues. Session IDs are another really big factor. So I have seen, 00:06:26.100,00:06:33.890 at least in some search engines, a site with a one-page privacy policy. And that privacy policy was indexed 00:06:33.980,00:06:40.880 three thousand times, each time with a different session ID, because the privacy policy was slightly different each time. 00:06:41.320,00:06:47.100 So, you know, session IDs in general if you can avoid them are great. But sometimes you as the 00:06:47.200,00:06:51.740 search engine optimizer or the person who is responsible for the site can't get rid of them entirely. 00:06:52.320,00:06:58.660 Tracking codes, you know, if you're buying ads. Analytics, you know the UTM parameter, landing pages where they 00:06:58.810,00:07:04.210 have to be different landing pages for different ads, those are the sort of things that you sometimes can't get rid of. 00:07:04.540,00:07:10.200 And if you run an e-commerce site, suppose you have different products. You might have sort by descending price 00:07:10.310,00:07:16.190 or sort by ascending price, and sometimes you need to have different facets, different views of your data, and 00:07:16.360,00:07:20.490 conceptually it's really the same thing, it's just a different way to slice and dice it. 00:07:21.230,00:07:28.430 Finally, there's breadcrumbs. So breadcrumbs are how did I get to this page? Am I coming to this red tent example 00:07:28.560,00:07:34.260 via tents, or am I coming to it via colors, or did I come to it because I was interested in accessories? 00:07:34.580,00:07:41.110 How did I land on this page? Even Google's own webmaster help documentation sometimes has a CTX parameter that says 00:07:41.230,00:07:50.540 here's how we got to this page. And that day, it was kind of funny, the Queen had just launched a new website: 00:07:50.660,00:07:59.200 royal.gov.uk. 
And so I wish the Queen the best, I want her to live long, and I wish the British monarchy the best, 00:07:59.480,00:08:07.180 however, someone at the Telegraph, telegraph.co.uk, had done an SEO audit of this site, and they had found 00:08:07.300,00:08:15.790 duplicate content issues. So you can see right here, just slash, royal.gov.uk/Home.aspx, and then at the very bottom 00:08:15.920,00:08:22.990 I almost made a ransom note style where I mixed uppercase and lowercase. And the royal website returned the same page 00:08:23.100,00:08:29.380 for all three of those URLs. So that was just a very simple example to illustrate that anybody can have these 00:08:29.550,00:08:30.940 sorts of issues. 00:08:31.180,00:08:36.840 So what's the answer? Let's, you know, I've buried the lead enough, how do people solve this particular problem? 00:08:37.040,00:08:41.870 Well, assuming you can't solve it any other way, and absolutely I encourage you to try to fix it upstream, 00:08:42.020,00:08:47.140 to try to link consistently. This is not something that you should just say, oh, now all my problems are solved, 00:08:47.260,00:08:52.670 I don't have to worry about anything else. But, if you can't solve your problems in other ways, there's a very 00:08:52.840,00:08:59.830 simple element, link element, where you can say my canonical, and that's a long word that means you know, my preferred, 00:09:00.180,00:09:07.090 or the primary, or the clean, the pretty version of the URL that I want to use, is not this ugly URL with a tracking 00:09:07.210,00:09:14.280 code or a session ID, it's this pretty URL right over here. And all you have to do is in the head element of this 00:09:14.380,00:09:20.830 document say you know what, even though this has a weird session ID, the pretty version, the canonical version of 00:09:20.920,00:09:28.290 this URL, is over here. And that's literally all it is. It's a very simple open standard. 
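For readers following along, here is a small sketch in Python of how a page template might produce that element: it strips the parameters that do not change the content and emits a link element for the head of the document. The parameter names (sessionid, sort, ctx, and anything starting with utm_) and the example URL are assumptions for illustration; each site has to decide for itself which parameters are safe to ignore.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names for illustration only.
IGNORED_PARAMS = {"sessionid", "sort", "ctx"}
IGNORED_PREFIXES = ("utm_",)


def canonical_url(url: str) -> str:
    """Return the URL without session, tracking, sort, or breadcrumb parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
        and not key.lower().startswith(IGNORED_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_element(url: str) -> str:
    """Build the element to place inside the head of the page served at `url`."""
    href = escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}">'


if __name__ == "__main__":
    ugly = "http://www.example.com/tents/red?sessionid=abc123&utm_source=ad"
    print(canonical_link_element(ugly))
```

Every variant of the page, whatever session ID or tracking code it was requested with, then carries the same absolute URL in its canonical link element.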
It's one simple element 00:09:28.450,00:09:30.910 that you add to the head of your document. 00:09:31.770,00:09:38.480 Some interesting little tidbits. This is the director's cut so you get a little bit of extra info. Is this a tag? 00:09:38.780,00:09:45.190 Well, it's kind of, the technical name I believe is "element." But we're all friends here, nobody's going to abuse 00:09:45.310,00:09:51.920 you or you know make fun of you if you call it a canonical link tag versus a canonical link element. People often 00:09:52.050,00:09:59.550 speak about meta tags, right? And so meta tags are things that go in the head of the document as well. And so, if 00:09:59.670,00:10:05.530 a meta tag has a value that is a hyperlink, I think the most correct thing is not for it to be meta, but for it to 00:10:05.690,00:10:12.410 be called "link." And so that's why you see link rel="canonical" href= and the value. So now you know the official 00:10:12.510,00:10:17.080 name, but nobody's going to care if you just call it the canonical link tag. 00:10:18.230,00:10:24.250 One thing that's kind of interesting about this tag, let's just talk about a few high-order bits. 00:10:25.080,00:10:31.440 We don't promise we're going to abide by this 100%. Right? You know, if we see a webmaster and they've accidentally 00:10:31.540,00:10:36.900 shot themselves in the foot, you know maybe they've created an infinite loop, and it's very easy to create an 00:10:37.010,00:10:42.690 infinite loop, we reserve the right to do what we think is best. At least at Google, we are going to treat this as 00:10:42.790,00:10:49.490 a very strong hint. So unless we see some weird corner case or something where you're probably hurting your own site, 00:10:49.650,00:10:56.790 we probably would expect to respect this tag. So I think that in most cases, it will work quite well. 
But we do have 00:10:56.930,00:11:02.190 to reserve the final, sort of bottom-line ability to say no, we don't think this is what's best for the users. 00:11:03.130,00:11:09.490 Again, if you can fix it yourself upstream, that's much better. So look at all the other alternatives, the other 00:11:09.570,00:11:14.340 choices before you use this tag. Don't just say, oh, I can just slap everything with a canonical link tag and 00:11:14.470,00:11:16.060 boom, I'm done. 00:11:17.160,00:11:24.570 If you're a regular user, just like a mom-and-pop and you use WordPress or you use some shopping cart software, 00:11:24.880,00:11:30.500 it's probably best not to just roll up your sleeves and go digging into it and trying to fix it all yourself, 00:11:30.600,00:11:36.140 at least not quite yet. Wait a little while, because I think plugins will come out, people are talking about hey, 00:11:36.230,00:11:41.590 is WordPress able to add this to the core software, so maybe you don't even need a plugin? So if you're just a regular 00:11:41.720,00:11:47.820 user and you wait a few months, things should be fine. You know it's a brand-new element, so there's time for you 00:11:47.910,00:11:54.830 to sit down and cautiously deliberate and say okay, what kinds of duplicate content do I have, how can I fix it? 00:11:55.030,00:12:00.270 Take a little bit of time. Don't just jump right in and start, oh I'm going to point everywhere, I'm going to do everything. 00:12:00.410,00:12:05.230 There's enough time where this will be supported so you can plan ahead a little bit. 00:12:05.950,00:12:11.770 And as always, if we see people abusing it, we do reserve the right to change how we treat the tag, or to 00:12:11.940,00:12:20.020 not respect the tag. There is a nice way that we try to prevent abuse. We allow things within the same domain, 00:12:20.230,00:12:27.750 but we don't allow things to cross domains. 
So with 301s, there's always been this notion of can I hijack a site by 00:12:27.880,00:12:34.710 doing weird 301s, and can I steal the reputation of some other site? And at least right now, this element is not 00:12:34.820,00:12:41.170 really subject to that because you can only use it within the same domain. Now a natural question right after that, 00:12:41.290,00:12:45.660 is well, what about subdomains? Can I, you know, do things across different hostnames? 00:12:45.750,00:12:51.110 And the answer is yes, you can. So, I was talking to Tony Hsieh from Zappos, and they were talking about duplicate 00:12:51.230,00:12:56.660 content. And they have a server called zeta.zappos.com, which is sort of their staging software and might be the 00:12:56.750,00:13:03.470 next version. And they were saying, well, can I send my canonicalness, can I splat it from zeta.zappos.com to 00:13:03.620,00:13:07.750 www.zappos.com? And the answer is yes, you absolutely can. 00:13:08.380,00:13:15.920 Can you use it from https and send that to http? Totally, works great for that. It's on the same domain, so it's 00:13:16.040,00:13:19.350 no problem at all, at least within Google to use it for that purpose. 00:13:19.960,00:13:25.890 And then what's the difference between this and a 301 or a permanent redirect? There's really not that much, 00:13:26.000,00:13:33.610 other than this is restricted to one domain. So 301s can cross domains; this is all within the same domain. 00:13:33.830,00:13:40.440 In fact, whenever I think about it, the mental model that I have is that this is essentially like a little mini 00:13:40.560,00:13:48.260 301 redirect that you can generate with this link element. So, you know, if you think about how Google handles 301s, 00:13:48.430,00:13:52.960 that's probably a pretty good guess of how we'll handle this particular element. 00:13:54.290,00:14:00.940 So, a few more questions, since you've got the time, you're watching the video. Do the pages have to be identical? 
00:14:01.140,00:14:08.270 Bit for bit identical? No, they do not. Think again about this case where you have a catalog page and you can sort 00:14:08.370,00:14:14.800 by increasing price or decreasing price, those are conceptually pretty close to the same page. So if you want to say 00:14:14.990,00:14:21.740 map this to the same URL, and don't worry about the sort by parameter, you're more than welcome to do that. 00:14:22.110,00:14:26.250 They should be similar. You know, if we see, this is the only thing I can think of where there could be abuse, 00:14:26.380,00:14:31.160 is if you've got a cartoon page over here, and you've got something that's completely irrelevant to cartoons over 00:14:31.280,00:14:36.160 here and you try to combine them together. And you're not really gaining any advantage because you had PageRank on 00:14:36.280,00:14:41.880 this page and on that page. So it really doesn't make sense to combine them, but we do recommend that you use them 00:14:42.030,00:14:45.500 for similar pages. They don't have to be identical, but they should be similar. 00:14:46.400,00:14:55.390 A few sort of niggly bits. How about relative URLs versus absolute URLs? The answer to that is you can use either one. 00:14:55.670,00:15:02.010 We recommend absolute URLs. And there's a very simple reason. When you have relative URLs, you can move a URL and 00:15:02.100,00:15:10.650 everything stays the same relative to that URL. So essentially, you know the homepage can say /images or images. 00:15:10.770,00:15:17.150 And that will move it relative to that particular page. But it's better to have an absolute URL because this is 00:15:17.280,00:15:22.930 a powerful tool, and you really want to say this URL goes to exactly this URL. So you want to specify that. 00:15:23.120,00:15:27.920 Whereas if it's relative, if you mess it up here, then you might mess it up somewhere else as well. 
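To see why a relative href is riskier than an absolute one, here is a short Python sketch that resolves the same href against two different page URLs, the way a standard URL resolver would. The example paths are made up, and this only shows ordinary relative-URL resolution with urljoin; it is not a description of how any search engine processes the element.

```python
from urllib.parse import urljoin


def resolve_canonical(page_url: str, href: str) -> str:
    """Resolve a canonical href against the URL of the page that contains it."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    # Hypothetical duplicate pages that reuse the same template.
    pages = (
        "http://www.example.com/tents/item?sessionid=1",
        "http://www.example.com/colors/item?sessionid=2",
    )
    for page in pages:
        # The same relative href lands somewhere different on each page...
        print(resolve_canonical(page, "red-tent"))
        # ...while an absolute href always names exactly one URL.
        print(resolve_canonical(page, "http://www.example.com/tents/red-tent"))
```

With the relative href, the two pages end up naming two different canonical URLs, which is the kind of mistake that quietly repeats across a whole site.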
00:15:28.850,00:15:35.440 Can you follow a chain of canonical tags, or canonical elements, just like you can follow a chain of 301 redirects? 00:15:35.820,00:15:41.750 Yes, but again I don't recommend that, because if you have a big site and you have a big chain of 301 redirects, 00:15:41.880,00:15:47.850 it's easy for something to break. So, it's similar, something can break and you don't intend to have the consequences 00:15:47.990,00:15:55.330 that you wanted to, so what I would recommend is absolute URLs, and going from the old URL to the new URL, one hop 00:15:55.500,00:16:01.840 and that's all you do. It's just simpler that way, and you know you want to play it safe. You don't want to 00:16:01.960,00:16:07.180 accidentally shoot yourself in the foot. So what are some ways you can shoot yourself in the foot? Well, what if 00:16:07.290,00:16:13.530 you say my canonical is over here, and that's a 404 page? Right, the page might not exist. What if you had an 00:16:13.620,00:16:18.750 infinite loop? This is canonical. No, this is canonical. And we've all seen those happen, you know, what is the 00:16:18.830,00:16:22.930 Civil War? Look up the War Between the States. What is the War Between the States? Look up the Civil War. 00:16:23.080,00:16:28.100 You know, and now you have to put the dictionary down and your head hurts. So try to avoid infinite loops. 00:16:28.250,00:16:34.480 What if I point to a URL that hasn't been crawled? You know, we'll try to crawl that URL, but that corner case, 00:16:34.630,00:16:41.970 what if I told in the webmaster console, oh yeah, everything should be www.example.com, but then you specify your 00:16:42.070,00:16:48.170 canonicals as non-www, or without the www. So you can do all these sorts of things to almost shoot yourself in the 00:16:48.280,00:16:54.140 foot, and the answer is we will try to handle all of these corner cases in a reasonable way. 
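If you want to audit your own site for chains and loops before a crawler finds them, a check along these lines might help. This is a sketch in Python under stated assumptions: `canonical_of` is a hypothetical dictionary you have built yourself, mapping each page URL to the canonical URL that page declares, and the hop limit is an arbitrary choice. It is not how Google or any other search engine resolves canonicals.

```python
def resolve_chain(start, canonical_of, max_hops=5):
    """Follow declared canonicals from `start`; raise ValueError on a loop or long chain.

    `canonical_of` maps a page URL to the canonical URL declared on that page.
    A page with no entry, an empty value, or a self-reference ends the chain.
    """
    visited = {start}
    current = start
    hops = 0
    while True:
        target = canonical_of.get(current)
        if not target or target == current:
            return current
        if target in visited:
            raise ValueError(f"canonical loop detected at {target}")
        hops += 1
        if hops > max_hops:
            raise ValueError(f"more than {max_hops} hops starting from {start}")
        visited.add(target)
        current = target


if __name__ == "__main__":
    # Hypothetical audit data for illustration only.
    declared = {
        "http://www.example.com/a?sessionid=1": "http://www.example.com/a",
        "http://www.example.com/a": "http://www.example.com/a",
    }
    print(resolve_chain("http://www.example.com/a?sessionid=1", declared))
```

Any start URL that takes more than one hop to resolve is a candidate for pointing directly at the final URL instead.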
The slide has some 00:16:54.260,00:17:00.700 Ghostbusters because there's the old saying, "Don't cross the streams," right? So think about this, take some time, 00:17:00.810,00:17:06.090 don't just throw canonical tags on willy-nilly on your site, you know, try to plan it out a little bit so that you 00:17:06.210,00:17:08.080 don't run into these corner cases. 00:17:09.050,00:17:14.240 So we're getting towards the end of the presentation. I just really wanted to send a shout out to Joachim, who is 00:17:14.360,00:17:19.640 the Google engineer who really did all the implementation, all the heavy lifting on this. Made sure that it worked 00:17:19.790,00:17:25.480 very nicely within a 301, and thought about all the corner cases. So, for example, someone said, well what if 00:17:25.590,00:17:31.520 I have a canonical, and I point to myself? Does that work? Yep, that works fine. What if I have a canonical and my 00:17:31.670,00:17:38.490 href is empty? Well, it turns out that parses as an error, which turns out to point to itself. So all this stuff 00:17:38.610,00:17:44.700 still works because Joachim did a really good design, but again, try to make sure that it's all absolute URLs and 00:17:44.830,00:17:51.240 everything's specified well. Also, I'd love to send a shout out to Greg Grothaus. It turns out when you dig into this, 00:17:51.390,00:17:58.960 a lot of people have proposed similar ideas. I saw at least one post out on the general web after we'd started 00:17:59.070,00:18:05.330 exploring this that said, hey, why don't you do this kind of a proposal? But Greg was really one of the people who 00:18:05.440,00:18:11.890 sparked the discussion at Google, who really pushed for it and had a great idea, and so I sort of think of him as 00:18:11.990,00:18:17.580 at least within Google, he really got the ball rolling and really sparked the wave of work on this, so I really 00:18:17.750,00:18:23.730 appreciate that. 
And of course all the people, you know, from Maile and Wysz and Adam and Riona who have worked on 00:18:23.840,00:18:29.970 the messaging and reached out to different people. At Yahoo!, Priyank, and a ton of people at Microsoft, 00:18:30.130,00:18:35.480 Nathan Buggia and a bunch of other people as well. My hope is that lots of search engines will support this. 00:18:35.630,00:18:41.680 So, Yahoo! and Microsoft have announced that they will support it, let's keep our fingers crossed for Ask, I'd love 00:18:41.760,00:18:49.140 for them to join in as well. Wikia, so Artur at Wikia had emailed us and sort of asked about doing canonical tags 00:18:49.260,00:18:54.110 anyway. And so it was really great that they could test it out while we were trying it out ourselves. 00:18:54.340,00:18:59.050 And then a ton of webmasters who always give us this sort of feedback on what they'd like to see. 00:18:59.350,00:19:06.430 On this last slide, I just list a bunch of resources, so Google, Yahoo!, and Microsoft all did blog posts about it. 00:19:06.580,00:19:13.110 There's an official Help Center documentation page. And, what we saw was, as people would come and have duplicate 00:19:13.250,00:19:19.620 content questions, Joost had come and sort of asked about an interesting corner case, we just said, hey, you know 00:19:19.740,00:19:23.800 what? We've got this thing coming out that might help with this. And so it was a very nice way to just do a sort 00:19:23.910,00:19:30.550 of very quiet beta test and see how well it worked. 
So, Joost happened to email just a few days before we were 00:19:30.670,00:19:35.210 ready to announce support, and so we gave him a heads-up about the possibility of this, and he turned around 00:19:35.340,00:19:41.530 plugins not just for WordPress, but also for Magento, which is an e-commerce shopping software, and Drupal, which 00:19:41.660,00:19:46.510 is another open-source content management system, which I think the White House just rolled out using Drupal. 00:19:46.660,00:19:53.960 So really appreciate the work that he's done as well. And in general, you know, be careful, be cautious, plan out 00:19:54.090,00:19:58.880 how you want to use this tag. But we don't intend to make any money off of it, we think it's just good for the web. 00:19:59.020,00:20:06.220 It'll lead to less duplicate content. It's an open standard, so any search engine that crawls the web can use this 00:20:06.310,00:20:09.510 information to help, you know, make the web more relevant and increase the relevancy of their search results. 00:20:09.650,00:20:14.410 And now you know as much as the audience knows when they attended SMX West. 00:20:14.550,00:20:17.930 Thanks very much for listening, and talk to you soon.