[Koha-devel] Web crawlers hammered our Koha multi server

Rick Welykochy rick at praxis.com.au
Wed Jan 13 08:49:08 CET 2010


Chris Cormack wrote:

> This of course should be an option. There are many libraries who would
> like their catalogue indexed by search engines and if the server has
> the capacity to do it, it should be allowed.
> So whatever changes made to opac-search.pl should be under the control
> of a systempreference.

Good idea. I can add that to the code.


> Also this solution will stop anyone ever sending a link of a search
> result to someone else.

There is a quite different solution to this problem. It addresses the
web crawler problem plus the problem of "form spammers" that fill in
every field they find in an HTML form.

The real problem lies in the nature of bots of any species that find
a form to fill in and then hit your website with every possible value
selected, one by one.

This is the behaviour I saw in our Apache logs. EVERY possibility for the
advanced search was being requested and presumably indexed. IMHO, this is
a heinous action and should be disallowed. And in one case, a search engine
was firing MULTIPLE bots at us from different servers simultaneously, which
is another heinous action punishable by banishment from the Interweb.

Here is an alternative to be discussed. And it does allow people to
share links. It involves a change to the template, CSS and code.

ref: http://www.geekwisdom.com/dyn/antispam_hidden_form_field


1. In the template, add a field called, perhaps, URL, which a bot
    will be tempted to fill in.

    <input class="cgirequired" type="text" name="URL" />


2. In the CSS, hide the field. The class "cgirequired" is actually not
    required by the CGI :)

    .cgirequired { display:none; visibility:hidden; }


3. In the Perl code, if the field "URL" is not empty, deny the search.


This solution allows all GET and POST initiated by a human user to
proceed. URLs can be shared. But a bot that fills in the hidden "URL"
field with something will be given the turf toute suite.
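
A minimal sketch of the check in step 3, to make the idea concrete. It is
not taken from opac-search.pl: the helper name looks_like_bot and the plain
hashref of form parameters are assumptions made for illustration, and the
real script would read the value from its own CGI query object instead.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Koha: returns 1 when the hidden
# honeypot field (named "URL" in the template) carries any value.
sub looks_like_bot {
    my ($params) = @_;            # hashref of submitted form parameters
    my $trap = $params->{URL};
    return 0 unless defined $trap;   # shared links carry no URL field at all
    $trap =~ s/^\s+|\s+$//g;         # ignore stray whitespace
    return length($trap) ? 1 : 0;
}

# A human never sees the field, so it arrives empty or not at all.
my %human = ( q => 'perl programming', URL => '' );
# A form-filling bot stuffs every field it finds.
my %bot   = ( q => 'perl programming', URL => 'http://spam.example/' );

for my $req ( \%human, \%bot ) {
    if ( looks_like_bot($req) ) {
        print "403 Forbidden: search denied\n";
    }
    else {
        print "200 OK: running search for '$req->{q}'\n";
    }
}
```

A missing "URL" parameter is treated the same as an empty one, so a search
link pasted into an email still works. Whether the check runs at all could
be put under a systempreference, as Chris suggested.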


cheers
rick


-- 
_________________________________
Rick Welykochy || Praxis Services

A computer is a state machine. Threads are for people who can't program state machines.
       -- Alan Cox


