[Koha-bugs] [Bug 4042] New: Public OPAC search can fall prey to web crawlers
bugzilla-daemon at kohaorg.ec2.liblime.com
Thu Jan 14 00:23:44 CET 2010
http://bugs.koha.org/cgi-bin/bugzilla3/show_bug.cgi?id=4042
Summary: Public OPAC search can fall prey to web crawlers
Product: Koha
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P5
Component: OPAC
AssignedTo: jmf at liblime.com
ReportedBy: rick at praxis.com.au
Estimated Hours: 0.0
Change sponsored?: ---
The OPAC search and OPAC advanced search are accessible to the public
from the Koha OPAC home page. Consequently, an overzealous web crawler
indexing the site through the opac-search.pl script can degrade the
performance of the Koha system. In the extreme, an under-resourced
system can suffer a denial of service (DoS) when the number of searches
exceeds the capacity of the system.
Proposed Solution: modify the opac-search.pl script in the following manner:
(A) Only allow queries using the POST method; if GET is used, return a
simple page stating that no search results were found.
(B) Exception: do allow GET queries, but only if the HTTP_REFERER
matches the SERVER_NAME. This lets searches launched from links within
the web site continue to work.
(C) Make this behavior optional by adding a new flag to the system
preferences (see the sketch after the code segment below).
Here is the small code segment added to opac-search.pl, immediately after
the BEGIN block:
my $referer = $ENV{HTTP_REFERER} || '';   # may be absent, e.g. for a pasted URL
if ( $referer !~ /\Q$ENV{SERVER_NAME}\E/ && $ENV{REQUEST_METHOD} ne "POST" )
{
    # Direct GET from outside the site (e.g. a web crawler): return a stub page and stop.
    print "Content-type: text/html\n\n";
    print "<h1>Search Results</h1>Nothing found.\n";
    exit;
}
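For point (C), here is a minimal sketch of how the guard could be wrapped in a
system preference. The preference name BlockExternalOPACSearches is purely
illustrative and does not exist yet; only the C4::Context->preference() call
itself is existing Koha API:

use C4::Context;

# BlockExternalOPACSearches is a hypothetical preference name; apply the
# guard only when the librarian has switched it on in the system preferences.
if ( C4::Context->preference('BlockExternalOPACSearches') ) {
    my $referer = $ENV{HTTP_REFERER} || '';
    if ( $referer !~ /\Q$ENV{SERVER_NAME}\E/ && $ENV{REQUEST_METHOD} ne "POST" ) {
        print "Content-type: text/html\n\n";
        print "<h1>Search Results</h1>Nothing found.\n";
        exit;
    }
}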
CAVEAT: This solution does not allow one to paste an "opac-search.pl"
link into the browser and have it work as previously expected. But
that behavior is the cause of the problem in the first place. A better
solution would be to require a user to log in to the OPAC before
allowing a search.
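As a rough illustration of that alternative (a sketch only, not a tested
change): opac-search.pl builds its page via C4::Auth::get_template_and_user,
and the authnotrequired flag passed there controls whether an OPAC login is
demanded. The template name and variable names below are assumptions and may
differ from the actual script:

use C4::Auth;

# Passing authnotrequired => 0 (instead of 1) forces an OPAC login
# before the search is executed.
my ( $template, $borrowernumber, $cookie ) = get_template_and_user(
    {
        template_name   => "opac-results.tmpl",   # assumed template name
        query           => $cgi,                   # assumed CGI object variable
        type            => "opac",
        authnotrequired => 0,
    }
);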
Addendum: also install a robots.txt file at the following location
in the Koha source tree to stop web crawlers from using the OPAC search:
opac/htdocs/robots.txt
The robots.txt file should contain the following contents, which deny all
access to indexing engines. You can learn more about robots.txt on the
web, and configure it to allow some indexing if you wish (see the example
after the listing below).
-----------------------------
User-agent: *
Disallow: /
-----------------------------
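For example, a less restrictive variant (assuming the OPAC search script is
served under the standard /cgi-bin/koha/ path) would block crawlers from the
search script only, while leaving the rest of the OPAC indexable:
-----------------------------
User-agent: *
Disallow: /cgi-bin/koha/opac-search.pl
-----------------------------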