Blocking the 80legs shitbot from a website

Here’s the scenario: a website gets slow, or the server gets slow, or you simply notice a ton of random connections to a site that normally isn’t that busy. You look in the logs and find thousands of lines like the following:

37.204.237.178 - - [17/Feb/2014:10:26:17 -0500] "GET /blog/hello-world/ HTTP/1.1" 200 31235 "-" "Mozilla/5.0 (compatible; 008/0.85; http://www.80legs.com/webcrawler.html) Gecko/2008032620"
188.187.209.252 - - [17/Feb/2014:10:26:16 -0500] "GET /blog/hello-world/ HTTP/1.1" 200 31235 "-" "Mozilla/5.0 (compatible; 008/0.85; http://www.80legs.com/webcrawler.html) Gecko/2008032620"

What is this crap, you may ask? Well, it’s a big smelly piece of shit known as 80legs that masquerades as a legit web crawler. They use some type of distributed system where thousands of random computers each make only a few requests, but they all do it concurrently, so sites serving dynamically generated content (blogs, PHP apps, shopping carts, etc.) tend to overload when hit with hundreds of connections per second.
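If you want a quick read on how bad it is before blocking anything, a rough sketch like the one below will do. It assumes the combined log format shown above, with the client IP in the first field; legs_summary is just a name made up for this example.

```shell
#!/bin/sh
# Count 80legs requests and the distinct client IPs behind them.
# Assumes the client address is the first space-separated field of each line.
legs_summary() {
    awk 'tolower($0) ~ /80legs/ { n++; if (!seen[$1]++) u++ }
         END { printf "requests=%d distinct_ips=%d\n", n, u }' "$1"
}

# Usage (path is the one assumed later in this post):
#   legs_summary /var/log/httpd/access_log
```

A small request count spread over a large number of distinct IPs is the pattern described above: no single address looks abusive on its own.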

They claim to honor robots.txt, but I’ve found that most of the time they don’t. If you do want to attempt that exercise in futility, here is what they say to add to your robots.txt:

User-agent: 008
Disallow: /

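A better solution is to block their bot by user agent string in your server config or an .htaccess file, using either of the following. The first uses Apache 2.2-style access control, so on Apache 2.4 it needs mod_access_compat loaded; the second needs mod_rewrite: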

SetEnvIfNoCase ^User-Agent$ .*(80legs) ShitBot
Deny from env=ShitBot

or

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} 80legs [NC]
RewriteRule ^ - [F]
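To check that either rule is actually biting, you can fake the bot’s user agent with curl and look at the status code. This is only a sketch: ua_status is a made-up helper name, and the URL in the usage line is a placeholder for your own site.

```shell
#!/bin/sh
# Print the HTTP status code a URL returns when requested with the
# 80legs user agent string copied from the log lines above.
ua_status() {
    ua='Mozilla/5.0 (compatible; 008/0.85; http://www.80legs.com/webcrawler.html) Gecko/2008032620'
    # -s: quiet, -o /dev/null: discard the body, -w: print only the status code
    curl -s -o /dev/null -w '%{http_code}' -A "$ua" "$1"
}

# Usage (placeholder URL):
#   ua_status http://www.example.com/blog/hello-world/
```

Both rule sets answer with 403 Forbidden when they match, so a 403 here means the block is working and a 200 means the bot is still getting through.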

Finally, if you want to keep the requests out during an active 80legs attack, here’s a Perl script that blocks the offending IPs with iptables. You cat your existing access log into it first, then pipe the live log into it. It is currently IPv4-specific, as I have never seen them crawl using IPv6, so I haven’t spent the time to make it support both protocols yet. Here’s how to get the blocking going:

  1. Copy the following script and paste into a file:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    # IPs that already have a DROP rule, so we never add a duplicate.
    my %blockedhash;
    
    # Read the existing INPUT chain. Each rule line looks roughly like:
    #   DROP  all  --  1.2.3.4  0.0.0.0/0
    foreach my $rule (`/sbin/iptables -n --list INPUT`) {
     next unless $rule =~ /^DROP\s+\S+\s+\S+\s+(\d+\.\d+\.\d+\.\d+)\s/;
     print "Already blocked $1\n";
     $blockedhash{$1} = 1;
    }
    
    # Scan access-log lines from STDIN. Only lines with an empty ("-")
    # referrer match; $5 is the user agent, including its opening quote.
    while (<STDIN>) {
     if ( /^(\d+)\.(\d+)\.(\d+)\.(\d+) - - \[.* "-"(.*)"$/ ) {
      my $ip = "$1.$2.$3.$4";
      my $useragent = $5;
      # 80legs plus a pile of other user agents I block.
      # Trim this list if it catches visitors you want to keep.
      if ($useragent =~ /(80legs|googlebawt|looksmart|avast|x11; u| ru;|ru\)|cock|suck|zyborg|webdav|crawler|fast-webcrawl|H010818|Mozilla\/4.75|FunWebProducts-MyWay)/i) {
       if ( ! $blockedhash{$ip} ) {
        # $ip is digits and dots only, so it is safe to interpolate.
        system("/sbin/iptables -A INPUT -s $ip -d 0/0 -j DROP");
        print "Blocking: $ip ($useragent)\n";
        $blockedhash{$ip} = 1;
       }
      }
     }
    }
    
  2. Run chmod 700 on the file to make it executable.
  3. Assuming your access log is named /var/log/httpd/access_log and your Perl script is named 80legs.pl, you’d run these commands as root (iptables needs it):
    1. cat /var/log/httpd/access_log | ./80legs.pl
    2. tail -f /var/log/httpd/access_log | ./80legs.pl
  4. The first command blocks every 80legs IP seen up to that point using Linux’s iptables; the second watches all further incoming requests and blocks any new matches. The script prints its activity as it goes, so you’ll know it’s working. When you re-run it against the active log, the output should look something like this:

Already blocked 217.73.83.58
Already blocked 217.73.85.213
Already blocked 217.74.34.5
Already blocked 217.77.215.117
Already blocked 217.77.215.174
Already blocked 217.79.27.175
Already blocked 217.84.75.176
Already blocked 218.111.187.81
Already blocked 220.233.131.62
Blocking: 178.20.41.9 ( "Mozilla/5.0 (compatible; 008/0.85; http://www.80legs.com/webcrawler.html) Gecko/2008032620)
Blocking: 89.31.37.223 ( "Mozilla/5.0 (compatible; 008/0.85; http://www.80legs.com/webcrawler.html) Gecko/2008032620)
Blocking: 95.72.250.125 ( "Mozilla/5.0 (compatible; 008/0.85; http://www.80legs.com/webcrawler.html) Gecko/2008032620)
Blocking: 46.200.84.180 ( "Mozilla/5.0 (compatible; 008/0.85; http://www.80legs.com/webcrawler.html) Gecko/2008032620)
Blocking: 188.123.248.86 ( "Mozilla/5.0 (compatible; 008/0.85; http://www.80legs.com/webcrawler.html) Gecko/2008032620)
Blocking: 178.168.45.6 ( "Mozilla/5.0 (compatible; 008/0.85; http://www.80legs.com/webcrawler.html) Gecko/2008032620)
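
If you’d rather see what would get blocked before touching the firewall, here is a dry-run sketch in plain shell. It is not the Perl script: it only looks for 80legs (not the longer user agent list), it ignores the referrer, and legs_preview is a made-up name. It prints the same iptables command the script would run, once per distinct IPv4 address, without executing anything.

```shell
#!/bin/sh
# Print (do not run) one iptables DROP command per distinct IPv4 client
# whose log line mentions 80legs. $1 is the access log to read.
legs_preview() {
    awk 'tolower($0) ~ /80legs/ &&
         $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ &&
         !seen[$1]++ {
             print "/sbin/iptables -A INPUT -s " $1 " -d 0/0 -j DROP"
         }' "$1"
}

# Usage:
#   legs_preview /var/log/httpd/access_log
```

Skim the output for anything you recognize, such as your own address or a monitoring host, before letting the real script loose.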
