[Koha-devel] Default Elasticsearch behaviour wrong with Chinese: can't find complete title

Nicolas Legrand nicolas.legrand at bulac.fr
Thu Apr 5 11:09:56 CEST 2018


For what it's worth, we also search in Latin-script languages and find the
results more relevant without the star, or at least with the queries that
17.05 produced :).
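
To make the comparison concrete, here is a minimal Python sketch that builds
the three query_string bodies from the quoted message below (trailing star,
quoted phrase, bare terms). The function name build_query and its mode
parameter are only an illustration, not Koha's actual query builder, and the
sketch does not escape quotes embedded in the search terms:

```python
import json


def build_query(terms: str, mode: str = "plain") -> dict:
    """Build a count-only Elasticsearch query_string body.

    mode is a hypothetical switch: "truncate" appends a star, "phrase"
    wraps the terms in double quotes, "plain" sends them unchanged.
    """
    if mode == "truncate":
        query = terms + "*"
    elif mode == "phrase":
        query = '"' + terms + '"'
    elif mode == "plain":
        query = terms
    else:
        raise ValueError(f"unknown mode: {mode}")
    # size 0 returns only the hit count, as in the curl examples.
    return {"from": 0, "size": 0, "query": {"query_string": {"query": query}}}


if __name__ == "__main__":
    for mode in ("truncate", "phrase", "plain"):
        # ensure_ascii=False keeps the Chinese characters readable.
        print(mode, json.dumps(build_query("中國翻譯", mode), ensure_ascii=False))
```

Each printed body can be passed to curl with -d, which makes it easy to
compare hit counts for the same terms under the three modes.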

2018-04-04 13:10 GMT+02:00 Nick Clemens <nick at bywatersolutions.com>:

> Interesting. Yes, the star was added to support auto_truncation, which is
> enabled by default. For languages using Latin scripts we need the star;
> otherwise a search for "cat" will not return results containing "cats".
>
> I am not sure what the path to correcting this is - I think you should
> file a bug report with this info and we can take a deeper look into how we
> are building our searches and what we can do.
>
> On Tue, Apr 3, 2018 at 10:22 AM Nicolas Legrand <nicolas.legrand at bulac.fr>
> wrote:
>
>> Good day devs,
>>
>> Nick spotted this one during the last Marseille Hackfest. We ran some tests
>> with our catalogue on master and found out how to reproduce it, how to break
>> it and how to fix it, though the inner mechanics remain a mystery and we
>> are not quite sure what the default behaviour should be.
>>
>> We tested with 中國翻譯 (Chinese Translators Journal), whose title contains
>> two words that are very common in our catalogue: China and translation.
>>
>> First, the default Koha behaviour is to add a "*" at the end of the
>> searched word, which leads to 0 results. It produces a query like
>> this one:
>>
>> $ curl "http://localhost:9200/koha_robin_biblios/_search?pretty" \
>>   -d '{"from": 0, "size": 0,"query":{"query_string":{"query": "中國翻譯*"}}}'
>> {
>>   "took" : 1,
>>   "timed_out" : false,
>>   "_shards" : {
>>     "total" : 5,
>>     "successful" : 5,
>>     "skipped" : 0,
>>     "failed" : 0
>>   },
>>   "hits" : {
>>     "total" : 0,
>>     "max_score" : 0.0,
>>     "hits" : [ ]
>>   }
>> }
>>
>> If we quote 中國翻譯 in Koha, it yields one answer, the right one. It
>> produces a query looking like this one:
>>
>> $ curl "http://bouse02.prive.bulac.fr:9200/koha_robin_biblios/_search?pretty" \
>>   -d '{"from": 0, "size": 0,"query":{"query_string":{"query": "\"中國翻譯\""}}}'
>> {
>>   "took" : 5,
>>   "timed_out" : false,
>>   "_shards" : {
>>     "total" : 5,
>>     "successful" : 5,
>>     "skipped" : 0,
>>     "failed" : 0
>>   },
>>   "hits" : {
>>     "total" : 1,
>>     "max_score" : 0.0,
>>     "hits" : [ ]
>>   }
>> }
>>
>> Note that if I write an Elasticsearch query without quotes or star, it
>> yields too many results (9626), and the “right” one isn't among the first
>> ten:
>>
>> $ curl "http://bouse02.prive.bulac.fr:9200/koha_robin_biblios/_search?pretty" \
>>   -d '{"from": 0, "size": 0,"query":{"query_string":{"query": "中國翻譯"}}}'
>> {
>>   "took" : 16,
>>   "timed_out" : false,
>>   "_shards" : {
>>     "total" : 5,
>>     "successful" : 5,
>>     "skipped" : 0,
>>     "failed" : 0
>>   },
>>   "hits" : {
>>     "total" : 9626,
>>     "max_score" : 0.0,
>>     "hits" : [ ]
>>   }
>> }
>>
>>
>> I'm not sure what the right behaviour should be. We felt that adding quotes
>> made our results much more relevant, whatever the language. What is certain
>> is that adding a star to the search by default doesn't help us. We didn't
>> have this problem with Elasticsearch while playing with it in 17.05, so for
>> us it is a regression. I attach the MARC of our test record.
>>
>> What do you think about it?
>>
>> Best regards,
>>
>> --
>>
>> *Nicolas Legrand*
>> Administration technique et développements du système de gestion de la
>> bibliothèque
>>
>> Bibliothèque universitaire
>> des langues et civilisations
>>
>> 65 rue des Grands Moulins
>> F-75013 PARIS
>> T +33 1 81 69 *18 22*
>> www.bulac.fr
>
> --
> Nick Clemens
> Sonic Screwdriver (Development Support)
> ByWater Solutions
> IRC: kidclamp
>
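
On the mechanics, one possible explanation, which I have not verified against
our mappings: if the analyzer splits 中國翻譯 into shorter tokens at index
time, and the wildcard term is not analyzed at search time, then no indexed
token starts with the full four-character string. The toy model below only
illustrates that hypothesis. The token split and the helper names are
assumptions, and it does not call Elasticsearch:

```python
def matches_prefix(indexed_tokens, prefix):
    """True if any single indexed token starts with the unanalyzed prefix."""
    return any(token.startswith(prefix) for token in indexed_tokens)


def matches_phrase(indexed_tokens, query_tokens):
    """True if query_tokens appear consecutively, in order, in indexed_tokens."""
    n = len(query_tokens)
    return any(
        indexed_tokens[i:i + n] == query_tokens
        for i in range(len(indexed_tokens) - n + 1)
    )


# Assumed tokenization of the title; a real analyzer may split differently
# (for example, one token per character).
title_tokens = ["中國", "翻譯"]

# Star query: no single token has the whole string as a prefix.
print(matches_prefix(title_tokens, "中國翻譯"))        # False
# Quoted query: the two tokens are adjacent and in order.
print(matches_phrase(title_tokens, ["中國", "翻譯"]))  # True
# Latin script: the prefix "cat" matches the token "cats".
print(matches_prefix(["cat", "cats"], "cat"))          # True
```

If this is what happens, the star helps only when the whole search string is
a prefix of a single indexed token, which would explain why it works for
"cat" but not for our Chinese title.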
