
Alden Bates' Weblog

Feigning normality since 1973

Yahoo's search bot and your web site

Filed in: Website Management.

A year or so ago I moved the NZDWFC page from http://www.tetrap.com/drwho/nzdwfc/ to nzdwfc.tetrap.com. To ease the transition, I placed a permanent (301) redirect from the old URL to the new one. Anyone going to the old URL gets bounced to the new URL without having to do anything.
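As a rough illustration of what that redirect does, here is a small Python sketch of the mapping. It is a model, not the server's actual configuration: the function name `redirect_target` is made up for this example, and it assumes that whatever follows /drwho/nzdwfc/ is carried over unchanged onto the new subdomain, which the post does not spell out.

```python
OLD_PREFIX = "/drwho/nzdwfc/"
NEW_BASE = "http://nzdwfc.tetrap.com/"


def redirect_target(path):
    """Return (status, location) for a path under the old prefix, else None.

    Assumption: the remainder of the path is kept as-is on the new host.
    """
    if not path.startswith(OLD_PREFIX):
        return None  # not part of the moved site, so no redirect
    rest = path[len(OLD_PREFIX):]
    return 301, NEW_BASE + rest
```

For a hypothetical file such as /drwho/nzdwfc/news.html, `redirect_target` returns `(301, "http://nzdwfc.tetrap.com/news.html")`; a path outside the old directory returns `None`.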

Yahoo uses a spider called "Yahoo Slurp" to crawl the web looking for pages to add to the search index. Slurp hits http://www.tetrap.com/drwho/nzdwfc/ and gets redirected to nzdwfc.tetrap.com like everyone else. Unfortunately Slurp has a bug in it, and adds the page to Yahoo's search index under the old URL.

Most of the NZDWFC page is indexed in Yahoo under the old URL, even pages I've added since the move (there are, in fact, only 5 pages in the Yahoo index for the nzdwfc.tetrap.com subdomain). This means that whenever the NZDWFC page comes up in a search and the user clicks on it, my server has to redirect them to the new page.

Last month the main domain www.tetrap.com got 4633 hits which resulted in redirects, a good number of those caused by people coming from Yahoo search results. Obviously I want to reduce this number so my web server has less work to do - the trouble is how to tell Yahoo Slurp not to index the old URL without breaking the redirection for users who surf in.
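A figure like that can be pulled out of the server's access log. The sketch below is one possible way to do it, under assumptions the post does not confirm: that the log is in Apache's "combined" format, and that a referer containing "yahoo." is a good enough sign the visitor came from a Yahoo search. The names `LOG_LINE` and `count_redirects` are invented for the example.

```python
import re

# Assumption: Apache "combined" log format, i.e. lines shaped like
# host - - [date] "GET /path HTTP/1.1" 301 245 "referer" "user-agent"
LOG_LINE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_redirects(lines, referer_hint="yahoo."):
    """Return (all 301 responses, 301 responses whose referer contains referer_hint)."""
    total = 0
    from_hint = 0
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None or match.group("status") != "301":
            continue  # unparseable line, or not a permanent redirect
        total += 1
        if referer_hint in match.group("referer").lower():
            from_hint += 1
    return total, from_hint
```

Passing an open log file (or any iterable of lines) to `count_redirects` gives the total number of 301s and how many of them arrived with a Yahoo referer.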

To do this I use Apache's rewrite engine like so:

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} help\.yahoo\.com [NC]
RewriteCond %{REQUEST_URI} ^/drwho/nzdwfc/.*$ [NC]
RewriteRule ^.* - [G,L]

This all goes in the .htaccess file which sits in my root directory. The lines work as follows:

  1. Turns the rewrite engine on. Kinda essential.
  2. Checks the user-agent of the bot for the string "help.yahoo.com". Slurp uses this in its user agent.
  3. Matches any file they request in the /drwho/nzdwfc/ directory or below.
  4. Tells Apache to send back a 410 response. 410 means "it's gone, matey, and it ain't coming back". Additionally the L indicates to Apache not to process any more Rewrite stuff because we're finished.
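To make the logic of those four lines concrete, here is a small Python model of the decision they encode. It is only a sketch of the matching, not of how Apache works internally: the function name `slurp_gets_410` is hypothetical. It relies on the fact that consecutive RewriteCond lines are ANDed by default, and that [NC] makes the match case-insensitive.

```python
import re

# Mirrors: RewriteCond %{HTTP_USER_AGENT} help\.yahoo\.com [NC]
UA_PATTERN = re.compile(r"help\.yahoo\.com", re.IGNORECASE)
# Mirrors: RewriteCond %{REQUEST_URI} ^/drwho/nzdwfc/.*$ [NC]
URI_PATTERN = re.compile(r"^/drwho/nzdwfc/.*$", re.IGNORECASE)


def slurp_gets_410(user_agent, request_uri):
    """True when both conditions match, i.e. when the rule would answer 410 Gone."""
    ua_matches = UA_PATTERN.search(user_agent) is not None
    uri_matches = URI_PATTERN.search(request_uri) is not None
    # Both conditions must hold; otherwise the request falls through
    # to whatever else the server does (here, the 301 redirect).
    return ua_matches and uri_matches
```

A request only gets the 410 when the user agent mentions help.yahoo.com and the path is under /drwho/nzdwfc/. An ordinary browser asking for the same path, or Slurp asking for some other part of the site, returns `False`.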

So put them together, and the server tells Slurp that the file it is requesting is gone, but lets anyone else through to hit the redirection. There are still a lot of redirections happening, but hopefully Yahoo will gradually drop the old URLs in favour of the new ones, and the redirections will decrease.

That's the theory, anyway. I'll update this weblog with the results in a few months' time, hopefully...

The Apache rewrite engine is a great and powerful thing, but also a dangerous thing.

Posted September 10, 2004 9:18 PM
