[Koha-devel] Default Elasticsearch behaviour wrong with Chinese: can't find complete title

Nicolas Legrand nicolas.legrand at bulac.fr
Tue Apr 10 11:04:20 CEST 2018


Removing QueryAutoTruncate did the trick; everything looks relevant now!
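
To make the effect of QueryAutoTruncate concrete, here is a minimal sketch of how the query body changes with and without the trailing star. It is not Koha's actual query builder (which is written in Perl); `build_query_string` and `build_body` are hypothetical helpers, and the rule "append a star to every unquoted term" is an assumption based on the behaviour described in this thread.

```python
import json


def build_query_string(user_input: str, auto_truncate: bool) -> str:
    """Return the string placed in query_string.query.

    Hypothetical helper: it only mimics the visible effect of appending
    "*" to unquoted search terms, as described in this thread.
    """
    if not auto_truncate:
        return user_input
    # Leave a quoted phrase alone, as in the quoted-search example.
    if user_input.startswith('"') and user_input.endswith('"'):
        return user_input
    # Assumption: a wildcard is appended to every whitespace-separated term.
    return " ".join(term + "*" for term in user_input.split())


def build_body(user_input: str, auto_truncate: bool) -> str:
    """Serialise a request body shaped like the curl payloads in this thread."""
    body = {
        "from": 0,
        "size": 0,
        "query": {
            "query_string": {
                "query": build_query_string(user_input, auto_truncate)
            }
        },
    }
    # ensure_ascii=False keeps the Chinese characters readable in the output.
    return json.dumps(body, ensure_ascii=False)


print(build_body("中國翻譯", auto_truncate=True))
print(build_body("中國翻譯", auto_truncate=False))
```

With `auto_truncate=False` the search string reaches Elasticsearch exactly as typed, which is the situation after the change described above.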

2018-04-05 11:09 GMT+02:00 Nicolas Legrand <nicolas.legrand at bulac.fr>:

> For what it's worth, we also work with Latin-script languages and find the
> results more relevant without a star, or at least with the queries built by
> 17.05 :).
>
> 2018-04-04 13:10 GMT+02:00 Nick Clemens <nick at bywatersolutions.com>:
>
>> Interesting. Yes, the star was added to support auto_truncation and is
>> enabled by default. For languages using Latin scripts we need the star;
>> otherwise a search for "cat" will not return results containing "cats".
>>
>> I am not sure what the path to correcting this is - I think you should
>> file a bug report with this info and we can take a deeper look into how we
>> are building our searches and what we can do.
>>
>> On Tue, Apr 3, 2018 at 10:22 AM Nicolas Legrand <nicolas.legrand at bulac.fr>
>> wrote:
>>
>>> Good day devs,
>>>
>>> Nick spotted this one during the last Marseille Hackfest. We ran some tests
>>> with our catalogue on master and found out how to reproduce it, how to break
>>> it and how to fix it, though the inner mechanics remain a mystery and we
>>> are not quite sure what the default behaviour should be.
>>>
>>> We ran our tests with 中國翻譯 (Chinese Translators Journal), which contains two
>>> words that are very common in our catalogue: China and translation.
>>>
>>> First, the default Koha behaviour is to add a "*" at the end of the
>>> searched word, which leads to 0 results. It produces a query looking like
>>> this one:
>>>
>>> $ curl "http://localhost:9200/koha_robin_biblios/_search?pretty" \
>>>     -d '{"from": 0, "size": 0, "query": {"query_string": {"query": "中國翻譯*"}}}'
>>> {
>>>   "took" : 1,
>>>   "timed_out" : false,
>>>   "_shards" : {
>>>     "total" : 5,
>>>     "successful" : 5,
>>>     "skipped" : 0,
>>>     "failed" : 0
>>>   },
>>>   "hits" : {
>>>     "total" : 0,
>>>     "max_score" : 0.0,
>>>     "hits" : [ ]
>>>   }
>>> }
>>>
>>> If we quote 中國翻譯 in Koha, it yields one answer, the right one. It
>>> produces a query looking like this one:
>>>
>>> $ curl "http://bouse02.prive.bulac.fr:9200/koha_robin_biblios/_search?pretty" \
>>>     -d '{"from": 0, "size": 0, "query": {"query_string": {"query": "\"中國翻譯\""}}}'
>>> {
>>>   "took" : 5,
>>>   "timed_out" : false,
>>>   "_shards" : {
>>>     "total" : 5,
>>>     "successful" : 5,
>>>     "skipped" : 0,
>>>     "failed" : 0
>>>   },
>>>   "hits" : {
>>>     "total" : 1,
>>>     "max_score" : 0.0,
>>>     "hits" : [ ]
>>>   }
>>> }
>>>
>>> Note that if I write an Elasticsearch query without quotes or star, it
>>> yields too many results (9626), and the “right” result isn't among the
>>> first ten:
>>>
>>> $ curl "http://bouse02.prive.bulac.fr:9200/koha_robin_biblios/_search?pretty" \
>>>     -d '{"from": 0, "size": 0, "query": {"query_string": {"query": "中國翻譯"}}}'
>>> {
>>>   "took" : 16,
>>>   "timed_out" : false,
>>>   "_shards" : {
>>>     "total" : 5,
>>>     "successful" : 5,
>>>     "skipped" : 0,
>>>     "failed" : 0
>>>   },
>>>   "hits" : {
>>>     "total" : 9626,
>>>     "max_score" : 0.0,
>>>     "hits" : [ ]
>>>   }
>>> }
>>>
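
One possible reading of the three hit counts above, offered as a toy model rather than a diagnosis. It assumes the title is indexed as more than one token (the split into 中國 and 翻譯 below is a guess, since the thread does not say which analyzer is in use), that a wildcard term is compared against whole indexed tokens without being analysed first, and that unquoted terms are combined with OR. The three-record `INDEX` is invented purely for illustration.

```python
from typing import Dict, List

# Hypothetical mini-index: record id -> tokens produced at index time.
INDEX: Dict[int, List[str]] = {
    1: ["中國", "翻譯"],  # the journal being searched for
    2: ["中國", "歷史"],  # some other record mentioning China
    3: ["翻譯", "理論"],  # some other record mentioning translation
}


def wildcard_search(prefix: str) -> List[int]:
    """Match records having at least one token that starts with the prefix."""
    return [doc for doc, tokens in INDEX.items()
            if any(token.startswith(prefix) for token in tokens)]


def phrase_search(query_tokens: List[str]) -> List[int]:
    """Match records containing the query tokens consecutively and in order."""
    n = len(query_tokens)
    return [doc for doc, tokens in INDEX.items()
            if any(tokens[i:i + n] == query_tokens
                   for i in range(len(tokens) - n + 1))]


def any_term_search(query_tokens: List[str]) -> List[int]:
    """Match records containing at least one of the query tokens (OR)."""
    return [doc for doc, tokens in INDEX.items()
            if any(token in tokens for token in query_tokens)]


print(wildcard_search("中國翻譯"))          # [] : no single token starts with the full title
print(phrase_search(["中國", "翻譯"]))      # [1] : only the journal has both tokens in sequence
print(any_term_search(["中國", "翻譯"]))    # [1, 2, 3] : every record with either word
```

Under these assumptions the starred search finds nothing because no single indexed token begins with the whole title, the quoted search finds only the journal, and the bare search returns every record containing either word.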
>>>
>>> I'm not sure what the right behaviour should be. We felt that adding quotes
>>> made our results much more relevant, whatever the language.
>>> What is certain is that adding a star to the search by default doesn't help
>>> us. We didn't have this problem with Elasticsearch while playing with it in
>>> 17.05. For us, it is a regression. I have attached the MARC of our test record.
>>>
>>> What do you think about it?
>>>
>>> Best regards,
>>>
>>> --
>>>
>>> *Nicolas Legrand*
>>> Administration technique et développements du système de gestion de la
>>> bibliothèque
>>>
>>> [image: Logo BULAC]
>>>
>>> Bibliothèque universitaire
>>> des langues et civilisations
>>>
>>> 65 rue des Grands Moulins
>>> F-75013 PARIS
>>> T +33 1 81 69 *18 22*
>>> www.bulac.fr
>>> _______________________________________________
>>> Koha-devel mailing list
>>> Koha-devel at lists.koha-community.org
>>> http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel
>>> website : http://www.koha-community.org/
>>> git : http://git.koha-community.org/
>>> bugs : http://bugs.koha-community.org/
>>
>> --
>> Nick Clemens
>> Sonic Screwdriver (Development Support)
>> ByWater Solutions
>> IRC: kidclamp
>>


