[Koha-bugs] [Bug 4042] New: Public OPAC search can fall prey to web crawlers
bugzilla-daemon at kohaorg.ec2.liblime.com
Thu Jan 14 00:23:44 CET 2010
http://bugs.koha.org/cgi-bin/bugzilla3/show_bug.cgi?id=4042
Summary: Public OPAC search can fall prey to web crawlers
Product: Koha
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P5
Component: OPAC
AssignedTo: jmf at liblime.com
ReportedBy: rick at praxis.com.au
Estimated Hours: 0.0
Change sponsored?: ---
The OPAC search and OPAC advanced search are accessible to the public
from the Koha OPAC home page. Consequently, an overzealous web crawler
indexing the site through the opac-search.pl script can degrade the
performance of the Koha system. In the extreme, an under-resourced
system can suffer a denial of service (DoS) when the number of searches
exceeds the capacity of the system.
Proposed Solution: modify the opac-search.pl script in the following manner:
(A) Only allow queries using the POST method; if GET is used, return a
simple page stating that no search results were found.
(B) Exception: do allow GET queries, but only if the HTTP_REFERER
matches the SERVER_NAME. This lets searches launched from links within
the web site continue to work.
(C) Make this behavior optional by adding a new flag to the system
preferences (see the sketch after the code segment below).
Here is the small code segment added to opac-search.pl, immediately after
the BEGIN block:
my $referer = $ENV{HTTP_REFERER} || '';   # may be absent, e.g. for a pasted URL
if ( $referer !~ /\Q$ENV{SERVER_NAME}\E/ && $ENV{REQUEST_METHOD} ne "POST" )
{
    # Direct GET from outside the site (e.g. a web crawler): return a stub page and stop.
    print "Content-type: text/html\n\n";
    print "<h1>Search Results</h1>Nothing found.\n";
    exit;
}
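For point (C), here is a minimal sketch of how the guard could be wrapped in a
system preference. The preference name BlockExternalOPACSearches is purely
illustrative and does not exist yet; only the C4::Context->preference() call
itself is existing Koha API:

use C4::Context;

# BlockExternalOPACSearches is a hypothetical preference name; apply the
# guard only when the librarian has switched it on in the system preferences.
if ( C4::Context->preference('BlockExternalOPACSearches') ) {
    my $referer = $ENV{HTTP_REFERER} || '';
    if ( $referer !~ /\Q$ENV{SERVER_NAME}\E/ && $ENV{REQUEST_METHOD} ne "POST" ) {
        print "Content-type: text/html\n\n";
        print "<h1>Search Results</h1>Nothing found.\n";
        exit;
    }
}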
CAVEAT: This solution does not allow one to paste an "opac-search.pl"
link into the browser and have it work as previously expected. But
that behavior is the cause of the problem in the first place. A better
solution would be to require a user to log in to the OPAC before
allowing a search.
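As a rough illustration of that alternative (a sketch only, not a tested
change): opac-search.pl builds its page via C4::Auth::get_template_and_user,
and the authnotrequired flag passed there controls whether an OPAC login is
demanded. The template name and variable names below are assumptions and may
differ from the actual script:

use C4::Auth;

# Passing authnotrequired => 0 (instead of 1) forces an OPAC login
# before the search is executed.
my ( $template, $borrowernumber, $cookie ) = get_template_and_user(
    {
        template_name   => "opac-results.tmpl",   # assumed template name
        query           => $cgi,                   # assumed CGI object variable
        type            => "opac",
        authnotrequired => 0,
    }
);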
Addendum: also install a robots.txt file at the following location
in the Koha source tree to stop web crawlers from using the OPAC search:
opac/htdocs/robots.txt
The robots.txt file should contain the following contents, which deny all
access to indexing engines. You can learn more about robots.txt on the
web, and configure it to allow some indexing if you wish (see the example
after the listing below).
-----------------------------
User-agent: *
Disallow: /
-----------------------------
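For example, a less restrictive variant (assuming the OPAC search script is
served under the standard /cgi-bin/koha/ path) would block crawlers from the
search script only, while leaving the rest of the OPAC indexable:
-----------------------------
User-agent: *
Disallow: /cgi-bin/koha/opac-search.pl
-----------------------------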