Mistyped URL? Duplicate content? .htaccess and mod_rewrite to the rescue
It’s been a while since I’ve posted, so I thought I’d jot down a couple of ways Apache and mod_rewrite can save your life. Not literally of course, unless your website’s been linked to your life-support system by a crazed psychopath – but it should make your readers’ lives easier. Isn’t that what we’re supposed to be doing here?
Duplicate Content, Google and You
Google, like other search engines, cannot tell whether the URLs yourdomain.com and www.yourdomain.com are the same site. Usually they are, which means Google sees those addresses, which actually lead to the same page, as two separate pages with identical content. This is far from optimal.
While Google takes into account that many people’s sites are set up this way and doesn’t penalize for duplicate content in these cases, it’s still not ideal because people will, at some point, link to your site. If half of them link to yourdomain.com and half link to www.yourdomain.com, the link popularity will be watered down and shared between these two “versions” of your homepage (don’t forget that to Google, these are two separate pages that happen to have the same content). It’s far better to consolidate this into one canonical version of the page, which will receive all of that precious link juice. Here’s how it’s done.
Firstly, it helps if your site uses the canonical URL element. Within the <head> of your document, place the following:
<link rel="canonical" href="http://www.yourdomain.com/" />
When the page is accessed via yourdomain.com, this will tell Google that the definitive URL for this content is found at www.yourdomain.com. You can do this by hand if your site is static (X)HTML. If you’re using a blog or CMS platform, there is likely to be a plugin that does it for you, as there is for WordPress.
But wait! There’s more!
“But wait!” I hear you cry, “What if somebody’s already come to my site via yourdomain.com? What if they bookmark the wrong address, or link to it, or spontaneously combust?”
I’m glad you asked. If you’re hosting your own blog or website on a server running Apache with mod_rewrite enabled, you can do a little behind-the-scenes jiggerypokery to help this consolidation go more smoothly. First, find your domain’s top-level .htaccess file. This insanely useful beastie lives in your web root and is invisible by default in most FTP clients, so if this is the first time you’ve tried to edit it, you may need to force your FTP client to show dotfiles (in my favourite client, the excellent FileZilla, toggle “Force showing hidden files” in the Server menu). Then add the following rules:
```apache
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_HOST} ^yourdomain\.com [NC]
RewriteRule (.*) http://www.yourdomain.com/$1 [R=301,L]
```
The first line switches on mod_rewrite’s rewrite engine for this directory. The second line sets the base URL for per-directory rewrites; we won’t be using it here, but it’s best practice to include it anyway. Here’s where it gets interesting: the third line uses a regular expression to match every request whose Host header starts with yourdomain.com, and the fourth line redirects those requests to the same path on www.yourdomain.com with an HTTP 301 (Moved Permanently) response. If your visitor is human, this will change the URL in their address bar to include the www, so the right address can be bookmarked. If the visitor is a search engine, it tells the search engine that the content formerly found at yourdomain.com has permanently moved to www.yourdomain.com. Google should then credit links to either address to the www version.
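If you’re not certain that mod_rewrite is enabled on your host, one defensive option is to wrap the rules in an <IfModule> block, so that Apache skips them instead of answering every request with a 500 error. A minimal sketch, using the same placeholder domain as above:

```apache
# Only apply the rules when mod_rewrite is actually loaded
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteBase /
    RewriteCond %{HTTP_HOST} ^yourdomain\.com [NC]
    RewriteRule (.*) http://www.yourdomain.com/$1 [R=301,L]
</IfModule>
```

The trade-off is that the redirect silently does nothing if the module is missing, so it’s worth checking that it actually fires.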
Mistyped URLs
We’ve all done it at some point. It’s too easy to type ww.somedomain.com or wwww.somedomain.com, which will usually lead to a “not found” error, a sigh of annoyance and precious seconds spent correcting the typo. Either that, or your visitors won’t notice their mistake and will simply assume that your website has ceased to exist. Far from ideal, so let’s dip into .htaccess again and see what we can do to improve things.
```apache
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_HOST} ^yourdomain\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^ww\.yourdomain\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^wwww\.yourdomain\.com$ [NC]
RewriteRule (.*) http://www.yourdomain.com/$1 [R=301,L]
```
Here we’ve expanded the original rewrite rule to take care of the most common mistyped addresses. NB: for this to work, you need to have wildcard DNS enabled on your domain (your host should be able to enable this for you), and the server must be set up to answer requests for those hostnames.
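Rather than listing each typo by hand, one alternative is to redirect any host that is not exactly www.yourdomain.com. This is a sketch of a different approach, not what the rules above do, and it assumes no other subdomains are served from the same web root, since they would be redirected too:

```apache
RewriteEngine On
RewriteBase /
# Leave requests with no Host header alone (old HTTP/1.0 clients)
RewriteCond %{HTTP_HOST} !^$
# Anything that is not exactly the canonical host gets redirected
RewriteCond %{HTTP_HOST} !^www\.yourdomain\.com$ [NC]
RewriteRule (.*) http://www.yourdomain.com/$1 [R=301,L]
```

The two conditions are ANDed together, so the redirect only fires when a Host header is present and differs from the canonical one.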
Image hotlinking
Hotlinking (where someone links to one of your images from their content, usually without your permission) is annoying. It adds to the load on your server, adds to your bandwidth bill and generally messes with your digital life. Not any more though, because the humble .htaccess file can help you with this one too. Going back to our file from earlier, we would need to add these rules:
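```apache
# Let requests with an empty referer through
RewriteCond %{HTTP_REFERER} !^$
# Let requests referred by your own pages through
RewriteCond %{HTTP_REFERER} !^http://(www\.)?yourdomain\.com/.*$ [NC]
# Everything else asking for an image gets a 403
RewriteRule \.(gif|jpe?g|png|bmp)$ - [F]
```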
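The rules above catch requests for files with the extensions gif, jpg, jpeg, png or bmp whose Referer header points anywhere other than a page on your own domain, and serve an HTTP 403 (Forbidden) response. Requests with an empty referer are let through, because some browsers and proxies don’t send one.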
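If your site is also served over https, or there is another site you are happy to let embed your images, the conditions can be extended. A sketch, where friendlysite.example is a made-up placeholder for the site you want to allow:

```apache
RewriteCond %{HTTP_REFERER} !^$
# Your own site, over http or https
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?yourdomain\.com/ [NC]
# A hypothetical partner site that is allowed to embed your images
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?friendlysite\.example/ [NC]
# NC also catches upper-case extensions such as .JPG
RewriteRule \.(gif|jpe?g|png|bmp)$ - [F,NC]
```

Each extra allowed site is one more negated condition, since an image is only blocked when every condition in the chain matches.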
A Little More Evil
Of course, you might want to do a little more than that. The following is a slightly more proactive method used by some website owners. First, create an appropriate image. Save this image on your server with a made-up extension (don’t forget that your rewrite rule won’t let it be served if it’s gif, jpg/jpeg, png or bmp). The following example assumes your image is saved as /images/noleeching.fu.
```apache
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?yourdomain\.com/.*$ [NC]
# Serve the replacement image in place of any hotlinked one
RewriteRule \.(gif|jpe?g|png|bmp)$ /images/noleeching.fu [L]
```
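Because .fu is a made-up extension, Apache has no content type on file for it, and some browsers may refuse to display the result. One way round that is mod_mime’s AddType directive. This sketch assumes noleeching.fu is really a PNG, so adjust the type to match whatever format you actually saved:

```apache
# Assumption: noleeching.fu is a PNG file saved under a made-up extension
AddType image/png .fu
```

This goes in the same .htaccess file and needs mod_mime to be available, which it is on most Apache installs.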