[Koha-bugs] [Bug 4042] Public OPAC search can fall prey to web crawlers

bugzilla-daemon at bugs.koha-community.org bugzilla-daemon at bugs.koha-community.org
Thu Jan 23 13:44:00 CET 2020


https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=4042

Barry Cannon <bc at interleaf.ie> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bc at interleaf.ie

--- Comment #12 from Barry Cannon <bc at interleaf.ie> ---
Reading over this bug, it occurs to me that some of the steps we have taken to
mitigate this problem might help others. I will post high-level information
here in case it helps some people along.


The first step we took (after adding robots.txt) was to add a small piece of
code to the OPACHeader syspref that appended a “hidden” a href tag pointing to
a script (sneakysnare.pl) on the site. The link was effectively visible only to
bots, and once it was followed a page with a lot of useless text was served. In
the background, though, the script grabbed the source IP address and pushed it
into a deny rule on the firewall. This worked well for bots that blindly
followed every link on the page, but shortly afterwards we noticed that not all
bots follow all links.

Our next step was to check the user agent of incoming traffic, since a lot of
the traffic causing issues came from a recognisable set of user-agent strings.
We configured Apache with a CustomLog of “time,IP,METHOD,URI” (plus the user
agent, which has to be logged for this to work) and set a script to run
regularly, parsing this file for known “bad” user-agent strings. We were then
able to add the matching IPs to the firewall to be dropped.
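
Something along these lines (the LogFormat nickname, file paths and bot names
are examples only; the useful pattern list is whatever shows up in your own
logs):

    # Apache: CSV-ish log of time, IP, method, URI and user agent
    LogFormat "%t,%a,%m,%U,\"%{User-Agent}i\"" botwatch
    CustomLog /var/log/apache2/botwatch.log botwatch

    #!/usr/bin/perl
    # Sketch of the regular parse job: print the IPs whose user agent
    # matches a known-bad pattern, for the firewall job to pick up.
    use strict;
    use warnings;

    my @bad = ( qr/MJ12bot/i, qr/SemrushBot/i, qr/python-requests/i );

    my %seen;
    open my $log, '<', '/var/log/apache2/botwatch.log' or die $!;
    while ( my $line = <$log> ) {
        chomp $line;
        # Limit the split so commas inside the user agent stay intact.
        my ( $time, $ip, $method, $uri, $ua ) = split /,/, $line, 5;
        next unless defined $ua;
        $seen{$ip} = 1 if grep { $ua =~ $_ } @bad;
    }
    close $log;
    print "$_\n" for sort keys %seen;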

Our current setup expands on the above: all our servers use csf/lfd for the
local firewall and intrusion detection, and we also use ipset extensively for
other, non-Koha related services. Csf can be configured to offload its deny
chains to ipset, which keeps iptables lean and lowers the resource strain on
the server. Csf can also be configured to use an Include file to deny hosts.
Building on the sneakysnare script and the user-agent Apache log, we created a
small job to bring all this together and manage a csf include file. The job
checks for newly reported IP addresses and adds each new IP to the deny set.
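
In csf terms that looks roughly like this (everything after the csf.conf
setting is our own naming; adjust to suit):

    # /etc/csf/csf.conf - offload the deny lists to ipset
    LF_IPSET = "1"

    # /etc/csf/csf.deny - csf supports Include lines, so the job only
    # ever touches our own file
    Include /etc/csf/opac.deny

    #!/usr/bin/perl
    # Sketch of the glue job: merge newly trapped IPs into the include
    # file and reload csf so they land in the ipset deny set.
    use strict;
    use warnings;

    my $spool   = '/var/spool/opac-trap/deny.list';
    my $include = '/etc/csf/opac.deny';

    my %known;
    if ( open my $in, '<', $include ) {
        while ( my $ip = <$in> ) { chomp $ip; $known{$ip} = 1 }
        close $in;
    }

    my $added = 0;
    if ( open my $in, '<', $spool ) {
        open my $out, '>>', $include or die $!;
        while ( my $ip = <$in> ) {
            chomp $ip;
            next if !$ip or $known{$ip}++;
            print {$out} "$ip\n";
            $added++;
        }
        close $out;
        close $in;
    }

    system( '/usr/sbin/csf', '-r' ) if $added;    # reload firewall rules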

In some cases we have observed the server being slowly scraped. This insidious
scraping is harder to detect immediately and often slows the server or hogs
resources over a longer period. Quite often the source of these connections is
a particular geographical region. If this happens often enough, we can employ
geoblocking. Csf can be configured to use the MaxMind GeoIP database for
lookups. Using the configuration file, we specify the country codes we want to
block; for example, to block all Irish and British traffic (not that we
would!) we enter “IE,GB” in the config file. Once the daemon is restarted, the
GeoIP database is referenced and all known CIDR blocks for those countries are
loaded into ipset’s deny set. Csf can also be configured to “deny all,
except”: in this setup, placing “IE” in the config file would allow traffic
only from Ireland and deny all other traffic.
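
The relevant csf.conf settings, assuming the MaxMind licence key is already
configured for the database downloads:

    # Deny all traffic from these country codes (loaded into ipset)
    CC_DENY = "IE,GB"

    # Or the inverse stance: allow only these countries, deny the rest
    CC_ALLOW_FILTER = "IE"

Restart lfd after changing these so the CIDR sets are rebuilt.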

There are pros and cons to all of the above, and consideration should be given
before implementation.

Third-party services are also very useful: moving traffic through CDN
providers (and using their security services) will greatly reduce bots, DDoS
attacks and other hassle.

Other helpful methods include putting a reverse proxy in front of the OPAC and
mitigating at that level.
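
For example, an nginx front end can throttle per-IP request rates before
anything reaches Koha. A hypothetical sketch (the zone name, rate and backend
address are placeholders; limit_req_zone belongs in the http context):

    limit_req_zone $binary_remote_addr zone=opac:10m rate=2r/s;

    server {
        listen 80;
        location / {
            limit_req zone=opac burst=20 nodelay;
            proxy_pass http://127.0.0.1:8080;
        }
    }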

-- 
You are receiving this mail because:
You are watching all bug changes.
You are the QA Contact for the bug.

