Mistyped URL? Duplicate content? .htaccess and mod_rewrite to the rescue

Posted by Chris

It’s been a while since I’ve posted, so I thought I’d jot down a couple of ways Apache and mod_rewrite can save your life. Not literally of course, unless your website’s been linked to your life-support system by a crazed psychopath – but it should make your readers’ lives easier. Isn’t that what we’re supposed to be doing here?

Duplicate Content, Google and You

To Google, as to other search engines, yourdomain.com and www.yourdomain.com are different URLs. On most sites both addresses serve exactly the same page, which means Google sees two separate pages with identical content. This is far from optimal.

While Google takes into account that many people’s sites are set up this way and doesn’t penalize for duplicate content in these cases, it’s still not ideal because people will, at some point, link to your site. If half of them are linking to yourdomain.com and half are linking to www.yourdomain.com, the link popularity will be watered down and shared between these two “versions” of your homepage (don’t forget that to Google, these are two separate pages that happen to have the same content). It’s far better to consolidate this into one canonical version of the page which will receive all of that precious linkjuice. Here’s how it’s done.

Firstly, it helps if your site uses the Canonical URL element. Within the <head> of your document, place the following:

<link rel="canonical" href="http://www.yourdomain.com/" />

When the page is accessed via yourdomain.com, this tells Google that the preferred URL for this content is www.yourdomain.com. Google treats the element as a strong hint rather than a directive. You can add it by hand if your site is static (X)HTML; if you’re using a blog or CMS platform, there is likely to be a plugin that adds it for you, as there is for WordPress.

But wait! There’s more!

“But wait!” I hear you cry, “What if somebody’s already come to my site via yourdomain.com? What if they bookmark the wrong address, or link to it, or spontaneously combust?”

I’m glad you asked. If you’re hosting your own blog or website on a server running Apache with mod_rewrite enabled, you can do a little behind-the-scenes jiggerypokery to help this consolidation go more smoothly. First, find your domain’s top-level .htaccess file. This insanely useful beastie lives in your web root and is hidden by default in most FTP clients, so if this is the first time you’ve tried to edit it, you may need to force your FTP client to show dotfiles (in my favourite client, the excellent FileZilla, choose “Force showing hidden files” from the Server menu). Then add the following lines:

RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_HOST} ^yourdomain\.com [NC]
RewriteRule (.*) http://www.yourdomain.com/$1 [R=301,L]

The first line switches on mod_rewrite’s rewriting engine. The second line sets the base URL for per-directory rewrites – we won’t be using it here, but it’s best practice to include it anyway. Here’s where it gets interesting: the third line uses a regular expression to match every request whose Host header starts with yourdomain.com, and the fourth line redirects those requests to the same path on www.yourdomain.com with an HTTP 301 (Moved Permanently) response. If your visitor is human, the browser follows the redirect and the address bar now includes the www, so the right address gets bookmarked. If the visitor is a search engine, the 301 tells it that the content formerly found at yourdomain.com has permanently moved to www.yourdomain.com, and Google will update its index to reflect this.
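
If you would rather not hard-code the domain name, the same redirect can be written generically. This is only a sketch, and it assumes the site has no other subdomains served from the same web root, because it puts "www." in front of whatever host was requested:

```apache
RewriteEngine On
# Skip requests that carry no Host header at all
RewriteCond %{HTTP_HOST} !^$
# Only act when the requested host does not already start with "www."
RewriteCond %{HTTP_HOST} !^www\. [NC]
# Redirect to the same path on the www host
RewriteRule ^ http://www.%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
```

Because it prefixes every non-www host, this version would turn a mistyped ww.yourdomain.com into www.ww.yourdomain.com, so the explicit rule above is the safer choice if you also want to catch typos.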

Mistyped URLs

We’ve all done it at some point. It’s too easy to type ww.somedomain.com or wwww.somedomain.com, which will usually lead to a Not Found error, a sigh of annoyance and precious seconds spent correcting the typo. Either that, or your visitors won’t notice their mistake and will simply assume that your website has ceased to exist. Far from ideal, so let’s dip into .htaccess again and see what we can do to improve things.

RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_HOST} ^yourdomain\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^ww\.yourdomain\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^wwww\.yourdomain\.com$ [NC]
RewriteRule (.*) http://www.yourdomain.com/$1 [R=301,L]

Here we’ve expanded the original rewrite rule to take care of the most common mistyped addresses. NB: for this to work, you need to have wildcard DNS enabled on your domain – your host should be able to enable this for you.
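
Listing every typo by hand gets tedious. A more compact variant, sketched here on the assumption that www.yourdomain.com is the only hostname this web root should ever answer to, redirects any host that is not the canonical one:

```apache
RewriteEngine On
RewriteBase /
# Ignore requests that carry no Host header at all
RewriteCond %{HTTP_HOST} !^$
# Match any host that is not exactly the canonical one
RewriteCond %{HTTP_HOST} !^www\.yourdomain\.com$ [NC]
RewriteRule (.*) http://www.yourdomain.com/$1 [R=301,L]
```

This catches ww., wwww. and anything else wildcard DNS sends your way, but it would also redirect a legitimate subdomain such as blog.yourdomain.com if it shares the same .htaccess file, so stick with the explicit list in that case.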

Image hotlinking

Hotlinking (where someone links to one of your images from their content, usually without your permission) is annoying. It adds to the load on your server, adds to your bandwidth bill and generally messes with your digital life. Not any more though, because the humble .htaccess file can help you with this one too. Going back to our file from earlier, we would need to add these rules:

RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?yourdomain\.com/.*$ [NC]
RewriteRule \.(gif|jpe?g|png|bmp)$ - [F]

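The above construction catches requests for files with the extensions gif, jpg, jpeg, png or bmp whose Referer header is non-empty and points to any domain other than your own, and answers them with an HTTP 403 (Forbidden) response. Requests with no referrer at all are let through, since some browsers and proxies strip the header.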
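
If your own pages are also served over HTTPS, or you want to let a friendly site embed your images, the conditions can be extended. The following is a sketch only; partner.example is a placeholder domain, not something from the setup above:

```apache
RewriteCond %{HTTP_REFERER} !^$
# Allow your own site over both http and https
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?yourdomain\.com/ [NC]
# partner.example is a placeholder for a site allowed to embed your images
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?partner\.example/ [NC]
# NC also blocks upper-case extensions such as .JPG
RewriteRule \.(gif|jpe?g|png|bmp)$ - [F,NC]
```

Conditions are ANDed by default, so the rule only fires when the referrer is non-empty and matches none of the allowed sites. As before, these lines go after the RewriteEngine On line already in your file.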

A Little More Evil

Of course, you might want to do a little more than that. The following is a slightly more proactive method used by some website owners. First, create an appropriate image. Save it on your server with a made-up extension (don’t forget that your rewrite rule won’t let it be served if it’s gif, jpg/jpeg, png or bmp). The following example assumes your image is saved as /images/noleeching.fu.

RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?yourdomain\.com/.*$ [NC]
RewriteRule \.(gif|jpe?g|png|bmp)$ /images/noleeching.fu [L]
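
An alternative to the made-up extension is to keep a normal image extension and explicitly exempt the replacement file from the rule. This is a sketch that assumes the image is saved as /images/noleeching.png, a hypothetical filename:

```apache
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?yourdomain\.com/ [NC]
# Never rewrite a request for the replacement image itself
RewriteCond %{REQUEST_URI} !^/images/noleeching\.png$
RewriteRule \.(gif|jpe?g|png|bmp)$ /images/noleeching.png [NC,L]
```

The extra condition stops the rule from matching its own target, and the real extension means Apache can send the usual image content type with the file.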



7 Responses to Mistyped URL? Duplicate content? .htaccess and mod_rewrite to the rescue

  1. Hi, merely a short comment to swing in and say thx for the insights in this article. I somehow came across your weblog while searching for exercise related things in Bing… guess I more or less lost my focus! Anyways I’ll be returning in the near future to check out your posts down the road. All the best!

  2. I used the hotlinking button provided in my hosting service Control Panel (cPanel) which creates a .htaccess file. I expect most hosting services make it easier to code the .htaccess file like this because they will sweep up all your domains and addon domains to avoid the image block or change. The result was basically as shown in this blog, but when I tested on a browser which had had its cache cleared, the substitute image showed for .jpg and .gif but a .png image was not changed, it showed in its original form. The rewrite rule was
    RewriteRule .*\.(jpg|jpeg|gif|png|bmp)$ http://www.my-domain.com/images/xxxxxx.fu [R,NC]
    which is substantially the same as in the blog above
    RewriteRule \.(gif|jpe?g|png|bmp)$ /images/noleeching.fu [L] [F]
    both include png as file extensions to be disallowed so I don’t know why png was not affected. Note minor difference with a * at the beginning and different letters at the end.

  3. Fascinating. I can’t see any reason for the png to be handled any differently – as you say, it appears in both rules. I’ll do some testing when I have time.

  4. Sorry, false alarm. I tried with a different png image and that one was substituted but again the original one wasn’t. In my next test the page rendered slowly for one of the images but they were all substituted. I had cleared the cache between tests. Perhaps .htaccess or my server finds it difficult to deal with several substitutions of images on the same page as I was testing one jpg, one gif and two png images from urls on one of my other domains.

  5. Top tip on redirecting the mistyped www dots to your preferred version of the site – not seen that mentioned before. Going to start adding that one into all my sites.

    With regards to the canonical url tag, that’s not strictly classed as definitive url as far as I’m aware. Last time I checked, Matt Cutts described it as a “very strong signal to Google” to use that url for the page in question. I’m pretty sure Google does some checking in the background, further site analysis and doesn’t just use the url because it’s specified in a meta tag.

    Good blog btw, going to start following it.
