[Koha-devel] Web crawlers hammered our Koha multi server

Rick Welykochy rick at praxis.com.au
Wed Jan 13 07:48:03 CET 2010


Hi all

We are running several instances of Koha on the one box using
Linux Vserver.  The other night the server was brought to its
knees and MySQL ran out of free connections. Further investigation
found over 80 instances of perl + Apache running OPAC search queries,
with many attendant instances of Zebra spawned to serve them.

When our daily backup kicked in on top of that, all hades broke loose.
We were DoS'd at one point and had to remotely reboot the machine.

How did the web crawlers find our obscure site? Probably due to a URL
containing a search being posted to a web site.

After some thought, two of us came up with a simple solution.

The Problem: the OPAC search and advanced search are accessible
to the public from the Koha OPAC home page. Consequently, an overzealous
web crawler indexing the site via the opac-search.pl script can
degrade the performance of the Koha system. In the extreme, an under-
resourced system can suffer a DoS when the number of searches
exceeds its capacity.

The Solution: modify the opac-search.pl script in the following manner:

(A) Only allow queries made with the POST method; if GET is used,
     return a simple page saying "No search result found".

(B) Exception: do allow GET queries, but only if the HTTP_REFERER
     matches the SERVER_NAME. This lets searches keep working via
     links within the web site.

Here is the small code segment added to opac-search.pl, immediately after
the BEGIN block:


my $referer = $ENV{HTTP_REFERER}   || '';
my $server  = $ENV{SERVER_NAME}    || '';
my $method  = $ENV{REQUEST_METHOD} || '';

# Deny unless the query was POSTed or arrived via a link on this
# server; \Q...\E keeps dots in the hostname literal, and the
# defaults above avoid undef warnings when no Referer is sent.
if ($referer !~ /\Q$server\E/ && $method ne "POST")
{
    print "Content-type: text/html\n\n";
    print "<h1>Search Results</h1>Nothing found.\n";
    exit;
}
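
To verify the guard from the outside, a quick LWP check can help.
The following is a minimal sketch; the host opac.example.org is a
hypothetical placeholder, and the path assumes the standard Koha
OPAC URL layout.

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $url = 'http://opac.example.org/cgi-bin/koha/opac-search.pl?q=test';

# A bare GET with no Referer, as a crawler would send it:
# expect the "Nothing found" stub.
my $res = $ua->get($url);
print "no referer:   ",
    $res->decoded_content =~ /Nothing found/ ? "blocked" : "allowed", "\n";

# The same GET with a matching Referer: expect real results.
$res = $ua->get($url, Referer => 'http://opac.example.org/');
print "with referer: ",
    $res->decoded_content =~ /Nothing found/ ? "blocked" : "allowed", "\n";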


CAVEAT: This solution no longer allows one to paste an "opac-search.pl"
   link into the browser and have it work as it used to. But that
   behaviour was the cause of the problem in the first place. A better
   solution is to require a user to log in to the OPAC before allowing
   a search.
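
In Koha this could be as simple as flipping the authnotrequired
flag in the get_template_and_user() call near the top of
opac-search.pl. A sketch, assuming the call as it appears in stock
Koha (the exact template name and query variable may differ by
version):

# Require an OPAC login before any search runs. The stock script
# passes authnotrequired => 1; setting it to 0 forces authentication.
my ( $template, $borrowernumber, $cookie ) = get_template_and_user({
    template_name   => "opac-results.tmpl",   # as in stock opac-search.pl
    query           => $cgi,
    type            => "opac",
    authnotrequired => 0,
});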

Addendum: also install a robots.txt file at the following location
in the Koha source tree:

    opac/htdocs/robots.txt

The robots.txt file should contain the following, which denies all
indexing engines access to the site. You can learn more about
robots.txt on the web, and configure it to allow some indexing if
you wish.

-----------------------------
User-agent: *
Disallow: /
-----------------------------
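
For example, a less drastic variant lets crawlers index the static
OPAC pages but keeps them away from the search script (the path
shown assumes the standard Koha OPAC URL layout):

-----------------------------
User-agent: *
Disallow: /cgi-bin/koha/opac-search.pl
-----------------------------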

I plan to submit a bug report regarding this situation, but first
want to open it up for discussion here.


cheers
rickw


-- 
_________________________________
Rick Welykochy || Praxis Services

If you have any trouble sounding condescending, find a Unix user to show
you how it's done.
      -- Scott Adams


