[Koha-devel] Single character searches are slow when queryfuzzy and/or querystemming are enabled.

Mon Mar 6 20:43:55 CET 2017

I've got an issue that we've seen periodically at ByWater; I want to file a
bug, but I don't have a clear idea of how to replicate the issue, so I'm
trying to solicit information from the community on how to narrow the scope
of the problem and/or measure the performance problems that I'm seeing.

The problem is that keyword searches (as you'd see from an unmodified
masthead search on the OPAC, or the "search the catalog" search on the
staff client) are markedly slow on some sites when

a) The search term contains single letter words like "a" in "Once Upon A
Time" and
b) One or both of the QueryFuzzy or QueryStemming system preferences is
enabled.

We first ran across this issue when we moved a number of our Koha instances
to slower drives, and found that some (but not all) would time out when
searching for "Once upon a time", "A swiftly tilting planet", or
"Hitchhiker's Guide to the Galaxy" (where the 's' after the apostrophe is
counted as a single word).

I turned on 'request' logging in Zebra; here's the query that timed out:

find @attrset Bib-1 @or @or @or @or @or @or @attr 1=36 @attr 4=1 @attr 6=3
@attr 9=32 @attr 2=102 "once upon a time" @attr 1=4 @attr 4=1 @attr 6=3
@attr 9=28 @attr 2=102 "once upon a time" @attr 1=36 @attr 4=1 @attr 9=26
@attr 2=102 "once upon a time" @attr 1=4 @attr 4=6 @attr 9=24 @attr 2=102
"once upon a time" @attr 4=6 @attr 5=103 @attr 9=16 @attr 2=102 "once upon
a time" @attr 4=6 @attr 5=1 @attr 9=14 @attr 2=102 "onc? upon? a time? "
@attr 4=6 @attr 9=14 @attr 2=102 "once upon a time"
Sent searchRequest.

I ran the PQF in zebra:

Z> base biblios
Z> find @attrset Bib-1 @or @or @or @or @or @or @attr 1=36 @attr 4=1 @attr
6=3 @attr 9=32 @attr 2=102 "once upon a time" @attr 1=4 @attr 4=1 @attr 6=3
@attr 9=28 @attr 2=102 "once upon a time" @attr 1=36 @attr 4=1 @attr 9=26
@attr 2=102 "once upon a time" @attr 1=4 @attr 4=6 @attr 9=24 @attr 2=102
"once upon a time" @attr 4=6 @attr 5=103 @attr 9=16 @attr 2=102 "once upon
a time" @attr 4=6 @attr 5=1 @attr 9=14 @attr 2=102 "onc? upon? a time? "
@attr 4=6 @attr 9=14 @attr 2=102 "once upon a time"
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 119, setno 1
SearchResult-1: term=once upon a time cnt=3, term=once upon a time cnt=5,
term=once cnt=94, term=upon cnt=63, term=a cnt=4876, term=time cnt=515,
term=once cnt=168, term=upon cnt=125, term=a cnt=19801, term=time cnt=1507,
term=once cnt=9830, term=upon cnt=960, term=a cnt=82237, term=time
cnt=8524, term=onc cnt=1915, term=upon cnt=914, term=a cnt=82237, term=time
cnt=7709, term=once cnt=1302, term=upon cnt=914, term=a cnt=68283,
term=time cnt=5442
records returned: 0
Elapsed: 98.163583

The elapsed time is just under 100 seconds; I think Koha times out after 60.
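
To make the SearchResult-1 line above easier to read, here is a rough Python
sketch that parses the term/cnt pairs and totals them per term. The function
names (parse_term_counts, totals_by_term) are made up for illustration; the
input string is copied from the session above. If the groups are reported in
the same order as the operands of the query (an assumption; the output does
not say so), the position of each "term=a" entry hints at which sub-query
produced it.

```python
import re
from collections import defaultdict

# SearchResult-1 text copied from the yaz-client session above.
SEARCH_RESULT = (
    "term=once upon a time cnt=3, term=once upon a time cnt=5, "
    "term=once cnt=94, term=upon cnt=63, term=a cnt=4876, term=time cnt=515, "
    "term=once cnt=168, term=upon cnt=125, term=a cnt=19801, term=time cnt=1507, "
    "term=once cnt=9830, term=upon cnt=960, term=a cnt=82237, term=time cnt=8524, "
    "term=onc cnt=1915, term=upon cnt=914, term=a cnt=82237, term=time cnt=7709, "
    "term=once cnt=1302, term=upon cnt=914, term=a cnt=68283, term=time cnt=5442"
)

PAIR = re.compile(r"term=(.+?) cnt=(\d+)")


def parse_term_counts(text):
    """Return (term, count) pairs in the order they appear in the output."""
    # Collapse line wraps so "term=time\ncnt=8524" still matches.
    normalized = " ".join(text.split())
    return [(term, int(cnt)) for term, cnt in PAIR.findall(normalized)]


def totals_by_term(pairs):
    """Sum the reported counts for each distinct term."""
    totals = defaultdict(int)
    for term, cnt in pairs:
        totals[term] += cnt
    return dict(totals)


if __name__ == "__main__":
    pairs = parse_term_counts(SEARCH_RESULT)
    # Print the terms with the largest total counts first.
    for term, total in sorted(totals_by_term(pairs).items(),
                              key=lambda item: item[1], reverse=True):
        print(f"{term!r}: {total}")
```

On this output the single-letter term "a" dominates every other term by a
wide margin, which is at least consistent with the slowness being tied to
one-letter words.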

We looked at the disk I/O and found that the disk where the searches were
occurring was getting hammered, so we migrated the sites that were having
issues onto a server that had a much faster drive, and this did at least
solve the timeout issue.

The issue keeps on popping up, however -- if not in actual timeouts, then
at least in complaints of slowness in search.

My hypothesis is that either QueryFuzzy or QueryStemming is expanding the
one-letter words, i.e. a search for "a" is returning results for either all
words that start with "a" or all words containing "a", and that all of these
results are written to disk before any further filtering is done.
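
As a toy illustration of that hypothesis only (this is not how Zebra works
internally, and the vocabulary below is invented), the following Python
sketch counts how many index terms a right-truncated search would match for
each word of the query. A one-letter prefix matches far more of the
vocabulary than a longer word does.

```python
def expand_prefix(vocabulary, prefix):
    """Return every term in the vocabulary that starts with the given prefix."""
    return sorted(term for term in vocabulary if term.startswith(prefix))


# Invented vocabulary, standing in for the distinct words in an index.
VOCABULARY = {
    "a", "about", "after", "again", "all", "and",
    "once", "planet", "time", "timeless", "upon",
}

if __name__ == "__main__":
    for word in ("once", "upon", "a", "time"):
        matches = expand_prefix(VOCABULARY, word)
        print(f"{word!r} expands to {len(matches)} term(s): {matches}")
```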

When we were seeing timeout issues, I wasn't clear on which sites were
likely to have issues -- sites with a very small number of bibs didn't have
issues, but the problem wasn't determined solely by collection size. So
here's what I'd like to know, for the purposes of further testing, and/or to
be able to replicate the issue:

1) Is there anyone who knows all about PQF who can tell me why the query
above would run so slowly?
2) Is there part of the PQF that seems to be behaving particularly badly?
Which parts of the query are returning "term=a cnt=82237", and can we avoid
having that called twice?
3) How do I go about constructing a data set that illustrates this problem?
4) What's the best way to measure the performance problems and/or define
what the problem is when I file a bug?
5) Are there any workarounds, given that certain sites really want
QueryStemming and QueryFuzzy enabled?
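
One possible way to approach questions 2 and 4 is to time each operand of
the top-level @or tree separately. The Python sketch below splits the PQF
from the log into its seven sub-queries so that each can be pasted at the
Z> prompt and its "Elapsed:" figure compared. The helper name
split_or_operands is made up for illustration, and the parser is an
assumption-laden sketch: it only understands @attrset, @or and @attr, which
is all this particular query uses.

```python
import re

# A token is either a double-quoted term or a run of non-whitespace characters.
TOKEN = re.compile(r'"[^"]*"|\S+')

# The query from the log above, with the line wraps removed.
QUERY = (
    'find @attrset Bib-1 @or @or @or @or @or @or '
    '@attr 1=36 @attr 4=1 @attr 6=3 @attr 9=32 @attr 2=102 "once upon a time" '
    '@attr 1=4 @attr 4=1 @attr 6=3 @attr 9=28 @attr 2=102 "once upon a time" '
    '@attr 1=36 @attr 4=1 @attr 9=26 @attr 2=102 "once upon a time" '
    '@attr 1=4 @attr 4=6 @attr 9=24 @attr 2=102 "once upon a time" '
    '@attr 4=6 @attr 5=103 @attr 9=16 @attr 2=102 "once upon a time" '
    '@attr 4=6 @attr 5=1 @attr 9=14 @attr 2=102 "onc? upon? a time? " '
    '@attr 4=6 @attr 9=14 @attr 2=102 "once upon a time"'
)


def split_or_operands(pqf):
    """Split a PQF query built from nested @or nodes into its leaf sub-queries."""
    tokens = TOKEN.findall(pqf)
    if tokens and tokens[0] == "find":       # drop the yaz-client command word
        tokens = tokens[1:]
    prefix = []
    if tokens and tokens[0] == "@attrset":   # keep "@attrset Bib-1" for each leaf
        prefix, tokens = tokens[:2], tokens[2:]

    operands = []
    pos = 0

    def parse():
        nonlocal pos
        if tokens[pos] == "@or":             # @or takes exactly two operands
            pos += 1
            parse()
            parse()
            return
        start = pos
        while tokens[pos] == "@attr":        # each @attr is followed by one type=value
            pos += 2
        pos += 1                             # the search term itself
        operands.append(" ".join(prefix + tokens[start:pos]))

    parse()
    return operands


if __name__ == "__main__":
    for number, operand in enumerate(split_or_operands(QUERY), start=1):
        print(f"# sub-query {number}")
        print(f"find {operand}")
```

Running each printed sub-query on its own against the same database should
show whether one operand accounts for most of the elapsed time, which would
give a bug report something concrete to point at.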

Thanks,

--Barton