[Koha-devel] Single character searches are slow when queryfuzzy and/or querystemming are enabled.

Fridolin SOMERS fridolin.somers at biblibre.com
Tue May 9 11:49:04 CEST 2017


Yep,

We had this issue with QueryStemming.
We solved it by lowering truncmax in zebra-biblios-dom.cfg to 10000 
(the default is 1000000000).
Of course, results are not very good when there are more than 10k results, 
but it really is faster.
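
For reference, the change amounts to a single line. This is only a sketch: 
it assumes the stock Koha file sets truncmax as a plain "key: value" entry, 
and the location of zebra-biblios-dom.cfg depends on the installation. As 
far as the Zebra documentation describes it, truncmax limits how many terms 
a truncated search may expand to. Zebra presumably needs a restart before 
the new value takes effect.

```
# zebra-biblios-dom.cfg
# Lowered from 1000000000 to the value mentioned in this message.
truncmax: 10000
```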

Regards,

On 06/03/2017 at 20:43, Barton Chittenden wrote:
> I've got an issue that we've seen periodically at ByWater; I want to file a
> bug, but I don't have a clear idea of how to replicate the issue, so I'm
> trying to solicit information from the community on how to narrow the scope
> of the problem and/or measure the performance problems that I'm seeing.
>
> The problem is that keyword searches (as you'd see from an un-modified
> masthead search on the opac, or the "search the catalog" search on the
> staff client) are markedly slow on some sites when
>
> a) The search term contains single letter words like "a" in "Once Upon A
> Time" and
> b) One or both of the QueryFuzzy or QueryStemming system preferences is
> enabled.
>
> We first ran across this issue when we moved a number of our Koha instances to
> slower drives, and found that some (but not all) would time out when
> searching for "Once upon a time", "A swiftly tilting planet", "Hitchiker's
> Guide to the Galaxy" (where the 's' after the apostrophe is counted as a
> single word).
>
> I turned on 'request' logging in zebra, here's the query that timed out:
>
> find @attrset Bib-1 @or @or @or @or @or @or @attr 1=36 @attr 4=1 @attr 6=3
> @attr 9=32 @attr 2=102 "once upon a time" @attr 1=4 @attr 4=1 @attr 6=3
> @attr 9=28 @attr 2=102 "once upon a time" @attr 1=36 @attr 4=1 @attr 9=26
> @attr 2=102 "once upon a time" @attr 1=4 @attr 4=6 @attr 9=24 @attr 2=102
> "once upon a time" @attr 4=6 @attr 5=103 @attr 9=16 @attr 2=102 "once upon
> a time" @attr 4=6 @attr 5=1 @attr 9=14 @attr 2=102 "onc? upon? a time? "
> @attr 4=6 @attr 9=14 @attr 2=102 "once upon a time"
> Sent searchRequest.
>
>
> I ran the PQF in zebra:
>
> Z> base biblios
> Z> find @attrset Bib-1 @or @or @or @or @or @or @attr 1=36 @attr 4=1 @attr
> 6=3 @attr 9=32 @attr 2=102 "once upon a time" @attr 1=4 @attr 4=1 @attr 6=3
> @attr 9=28 @attr 2=102 "once upon a time" @attr 1=36 @attr 4=1 @attr 9=26
> @attr 2=102 "once upon a time" @attr 1=4 @attr 4=6 @attr 9=24 @attr 2=102
> "once upon a time" @attr 4=6 @attr 5=103 @attr 9=16 @attr 2=102 "once upon
> a time" @attr 4=6 @attr 5=1 @attr 9=14 @attr 2=102 "onc? upon? a time? "
> @attr 4=6 @attr 9=14 @attr 2=102 "once upon a time"
> Sent searchRequest.
> Received SearchResponse.
> Search was a success.
> Number of hits: 119, setno 1
> SearchResult-1: term=once upon a time cnt=3, term=once upon a time cnt=5,
> term=once cnt=94, term=upon cnt=63, term=a cnt=4876, term=time cnt=515,
> term=once cnt=168, term=upon cnt=125, term=a cnt=19801, term=time cnt=1507,
> term=once cnt=9830, term=upon cnt=960, term=a cnt=82237, term=time
> cnt=8524, term=onc cnt=1915, term=upon cnt=914, term=a cnt=82237, term=time
> cnt=7709, term=once cnt=1302, term=upon cnt=914, term=a cnt=68283,
> term=time cnt=5442
> records returned: 0
> Elapsed: 98.163583
>
>
> The elapsed time is just under 100 seconds; I think Koha times out after 60.
>
> We looked at the disk I/O and found that the disk where the searches were
> occurring was getting hammered, so we migrated the sites that were having
> issues onto a server that had a much faster drive, and this did, at least,
> solve the time-out issue.
>
> The issue keeps on popping up, however -- if not in actual timeouts, then
> at least in complaints of slowness in search.
>
> My hypothesis is that either QueryFuzzy or QueryStemming is expanding the
> one letter words, i.e. a search for "a" is returning results for either all
> words that start with "a" or all words containing "a", and that all of these
> results are written to disk before any further filtering is done.
>
> When we were seeing timeout issues, I'm not clear on which sites were
> likely to have issues -- sites with a very small number of bibs didn't have
> issues, but the problem wasn't determined solely by collection size. So here's what
> I'd like to know, for the purposes of further testing, and/or to be able to
> replicate the issue:
>
> 1) Is there anyone who knows all about PQF who can tell me why the query
> above would run so slowly?
> 2) Is there part of the PQF that seems to be behaving particularly badly?
> Which parts of the query are returning "term=a cnt=82237", and can we avoid
> having that called twice?
> 3) How do I go about constructing a data set that illustrates this problem?
> 4) What's the best way to measure the performance problems and/or define
> what the problem is when I file a bug?
> 5) Are there any work-arounds, given that certain sites really want
> QueryStemming and QueryFuzzy enabled?
>
> Thanks,
>
> --Barton
>
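
On question 2 in the quoted message, one rough way to see which terms 
dominate is to total the cnt values from the SearchResult-1 line. The 
sketch below does that for the figures quoted above. The function names are 
made up for illustration, and treating the summed counts as a proxy for the 
work Zebra does is an assumption, not a measurement of disk I/O.

```python
import re
from collections import defaultdict

# SearchResult-1 line copied from the yaz-client session quoted above,
# with the mail line wrapping removed.
SEARCH_RESULT = (
    "term=once upon a time cnt=3, term=once upon a time cnt=5, "
    "term=once cnt=94, term=upon cnt=63, term=a cnt=4876, term=time cnt=515, "
    "term=once cnt=168, term=upon cnt=125, term=a cnt=19801, term=time cnt=1507, "
    "term=once cnt=9830, term=upon cnt=960, term=a cnt=82237, term=time cnt=8524, "
    "term=onc cnt=1915, term=upon cnt=914, term=a cnt=82237, term=time cnt=7709, "
    "term=once cnt=1302, term=upon cnt=914, term=a cnt=68283, term=time cnt=5442"
)


def parse_term_counts(line):
    """Return (term, count) pairs in the order they appear in the line."""
    return [(term, int(cnt)) for term, cnt in re.findall(r"term=(.+?) cnt=(\d+)", line)]


def totals_by_term(pairs):
    """Sum the reported counts for each distinct term."""
    totals = defaultdict(int)
    for term, cnt in pairs:
        totals[term] += cnt
    return dict(totals)


pairs = parse_term_counts(SEARCH_RESULT)
totals = totals_by_term(pairs)
grand_total = sum(totals.values())

# Print terms from the largest to the smallest share of the reported counts.
for term, cnt in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{term!r:>20} {cnt:>8} {cnt / grand_total:6.1%}")
```

On the figures in this thread, the term "a" accounts for 257434 of the 
297424 reported hits, which is at least consistent with the hypothesis that 
the single-letter word is where most of the time goes.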

-- 
Fridolin SOMERS
Biblibre - Pôles support et système
fridolin.somers at biblibre.com

