Banned Chinese robot

Searching for information on the Internet relies on robots that harvest web pages. But not all robots respect the etiquette: tonight, for the first time, I banned a robot from my blogs.

Robots scan web pages in order to index their content. This allows human users to submit a query and _try to_ find the content they are looking for. Every search engine does this hard job of visiting all the pages on the web, following links and reading content. By tradition, robots are expected to harvest each web site slowly so as not to overwhelm it.
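To make "harvesting slowly" concrete, here is a minimal sketch of how a well-behaved robot could pace itself per host. The class name, the 30-second delay and the example URLs are assumptions chosen for illustration; they do not come from any particular search engine.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlsplit


class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_delay = min_delay  # seconds between hits on one host (assumed value)
        self.clock = clock          # injectable so the pacing can be tested without sleeping
        self._next_allowed: Dict[str, float] = {}

    def wait_time(self, url: str) -> float:
        """Return how long to sleep before fetching `url`, and reserve that slot."""
        host = urlsplit(url).netloc
        now = self.clock()
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + self.min_delay
        return start - now
```

A crawler loop would call `time.sleep(scheduler.wait_time(url))` before each fetch, so that two pages of the same site are never requested less than `min_delay` seconds apart, while other sites are not held back.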

Following these rules provides regulation directly at the source (the robot), but in some cases a web site can install a filter to protect part of its content or to keep a specific robot from digging through it. Normally the first job of a robot is to check whether or not it is allowed to scan the site’s pages, but some do not play this game well.
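The check a robot is supposed to make first can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the `allowed` helper and the example URLs below are hypothetical; they only show what a filter against one robot might look like and how a compliant robot would read it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one robot entirely,
# and keep every other robot away from /private/ only.
ROBOTS_TXT = """\
User-agent: QihooBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A polite robot would call something like `allowed("QihooBot", url)` before every request and skip the page when the answer is `False`. Nothing forces it to do so, which is exactly the weakness of this kind of filter.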

This is the bad behaviour of qihoo, a Chinese robot and search engine. It visits my blogs on a regular basis but scans pages too quickly (one page every 3 to 5 seconds). Since I don’t write specific content for Chinese readers at this time, and in order to limit the qihoo robot’s scans, I installed a robots.txt filter some weeks ago, but the QihooBot robot does not take it into account. Since I am not the only one complaining about this specific behaviour, I decided tonight to escalate the filtering with an Apache mod_rewrite rule.
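A scan rate like this can be estimated from the server's access log. The sketch below assumes Apache's combined log format; the function name and the regular expression are illustrative choices, not something taken from my actual setup.

```python
import re
from datetime import datetime
from statistics import mean
from typing import Iterable, Optional

# Assumes the combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def mean_interval(lines: Iterable[str], agent_token: str) -> Optional[float]:
    """Average number of seconds between consecutive hits from one robot."""
    stamps = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match and agent_token.lower() in match.group("agent").lower():
            stamps.append(
                datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            )
    stamps.sort()
    gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return mean(gaps) if gaps else None
```

Calling `mean_interval(open("access.log"), "QihooBot")` (the file name is an assumption) returns the average gap in seconds, or `None` when the robot appears fewer than twice.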

 

# Serve a static notice page instead of any .html page requested by QihooBot
RewriteCond  %{HTTP_USER_AGENT}  ^.*QihooBot.*
RewriteRule  ^.*html$       /norobots.html  [L]
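Before relying on such a rule, the two patterns can be tried out away from the server. The following is only an approximation: it uses Python's `re` module rather than Apache's own regex engine, and the `rewritten_path` helper and sample paths are hypothetical.

```python
import re

UA_PATTERN = re.compile(r"^.*QihooBot.*")  # pattern from the RewriteCond
PATH_PATTERN = re.compile(r"^.*html$")     # pattern from the RewriteRule


def rewritten_path(user_agent: str, path: str) -> str:
    """Mimic the rule: the path must match, then the condition must hold."""
    if PATH_PATTERN.search(path) and UA_PATTERN.search(user_agent):
        return "/norobots.html"
    return path
```

With these patterns, `rewritten_path("QihooBot", "/2007/post.html")` gives the notice page, while other user agents and non-HTML paths pass through unchanged. The match is case-sensitive, as it is in mod_rewrite unless the `[NC]` flag is added, so a robot announcing itself as `qihoobot` would not be caught.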

 

 

 

I am not used to doing such things. A lot of robots scan the Internet all day long and we don’t always know what for, and I do think some of these robots are preparing the next Google or Exalead. But this time it was enough. I hope my filter list will not have to grow in the future.

See also: “Crawl politely or don’t crawl at all”.




Alexandre Chauvin-Hameau
ach@meta-x.org
Work(Preferred): +33 426 903 783
Cell: +33 609 573 932
130 Rue Duguesclin
Lyon, 69006 France