[Koha-devel] Web crawlers hammered our Koha multi server

Chris Cormack chris at bigballofwax.co.nz
Wed Jan 13 08:21:37 CET 2010


Hi Rick

This should of course be an option. There are many libraries that would
like their catalogue indexed by search engines, and if the server has
the capacity to do it, it should be allowed.
So whatever changes are made to opac-search.pl should be under the
control of a system preference.
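A minimal sketch of what that gating could look like in opac-search.pl,
assuming a hypothetical system preference named OPACBlockExternalSearches
(the name is illustrative, not an existing preference);
C4::Context->preference() is the usual way Koha code reads a syspref:

```perl
use C4::Context;

# Hypothetical syspref: only apply the referer/method check when the
# library has opted in to blocking crawler-style GET searches.
if ( C4::Context->preference('OPACBlockExternalSearches') ) {
    my $referer = $ENV{HTTP_REFERER} || '';
    if (   $referer !~ /\Q$ENV{SERVER_NAME}\E/
        && $ENV{REQUEST_METHOD} ne 'POST' )
    {
        print "Content-type: text/html\n\n";
        print "<h1>Search Results</h1>Nothing found.\n";
        exit;
    }
}
```

Libraries that want their catalogue indexed simply leave the preference
off and the script behaves as it does today.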

Also, this solution will stop anyone from ever sending a link to a
search result to someone else.

Chris

2010/1/13 Rick Welykochy <rick at praxis.com.au>:
> Hi all
>
> We are running several instances of Koha on the one box using
> Linux Vserver.  The other night the server was brought to its
> knees and mysql ran out of free connections. Further investigation
> found over 80 instances of perl + Apache running OPAC search queries.
> There were many attendant instances of Zebra spawned as well, seeing
> to these searches.
>
> Once our daily backup kicked in at the same time, all hades broke loose.
> We were DoS'd at one stage and had to remote reboot the thing.
>
> How did the web crawlers find our obscure site? Probably due to a URL
> containing a search being posted to a web site.
>
> After some thought, two of us came up with a simple solution.
>
> The Problem: the OPAC search and OPAC advanced searches are accessible
> by the public from the Koha OPAC home page. Consequently, an overzealous
> web crawler indexing the site using the opac-search.pl script can
> impact the performance of the Koha system. In the extreme, an under-
> resourced system can experience a DoS when the number of searches
> exceeds the capacity of the system.
>
> The Solution: modify the opac-search.pl script in the following manner:
>
> (A) Only allow queries using the POST method; otherwise if GET is used
>     return a simple page with "No search result found".
>
> (B) Exception: do allow GET queries but only if the HTTP_REFERER
>     matches the SERVER_NAME. This allows all the searches to work
>     via web site links.
>
> Here is the small code segment added to opac-search.pl, immediately after
> the BEGIN block:
>
>
> # Guard against a missing Referer header and regex metacharacters
> my $referer = $ENV{HTTP_REFERER} || '';
> if ($referer !~ /\Q$ENV{SERVER_NAME}\E/ && $ENV{REQUEST_METHOD} ne "POST")
> {
>         print "Content-type: text/html\n\n";
>         print "<h1>Search Results</h1>Nothing found.\n";
>         exit;
> }
>
>
> CAVEAT: This solution does not allow one to paste an "opac-search.pl"
>   link into the browser and have it work as previously expected. But
>   this was the cause of the problem in the first place. A better solution
>   is to require a user to login to the OPAC before allowing a search.
>
> Addendum: also install a robots.txt file at the following location
> in the Koha source tree:
>
>    opac/htdocs/robots.txt
>
> The robots.txt file should contain the following contents, which deny all
> access to indexing engines. You can learn more about robots.txt on the
> web, and configure it to allow some indexing if you wish.
>
> -----------------------------
> User-agent: *
> Disallow: /
> -----------------------------
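> As an example of allowing partial indexing, a robots.txt that blocks
> crawlers from the search script only, while leaving the rest of the
> OPAC crawlable, might look like the following (the path matches a
> standard Koha install; adjust it for yours):
>
> -----------------------------
> User-agent: *
> Disallow: /cgi-bin/koha/opac-search.pl
> -----------------------------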
>
> I plan to submit a bug report regarding this situation, but first open it up
> for discussion here.
>
>
> cheers
> rickw
>
>
> --
> _________________________________
> Rick Welykochy || Praxis Services
>
> If you have any trouble sounding condescending, find a Unix user to show
> you how it's done.
>      -- Scott Adams
> _______________________________________________
> Koha-devel mailing list
> Koha-devel at lists.koha.org
> http://lists.koha.org/mailman/listinfo/koha-devel
>


