[Koha-devel] Default Elasticsearch behaviour wrong with Chinese: can't find complete title

Nicolas Legrand nicolas.legrand at bulac.fr
Tue Apr 3 16:21:43 CEST 2018


Good day devs,

Nick spotted this one during the last Marseille Hackfest. We ran some tests
with our catalogue on master and found out how to reproduce it, how to break
it and how to fix it, though the inner mechanics remain a mystery and we
are not quite sure what the default behaviour should be.

We ran our tests with 中國翻譯 (Chinese Translators Journal), which contains two
words that are very common in our catalogue: China and translation.

First, the default Koha behaviour is to add a "*" at the end of the
searched word, which leads to 0 results. It produces a query like
this one:

$ curl "http://localhost:9200/koha_robin_biblios/_search?pretty" -d \
'{"from": 0, "size": 0,"query":{"query_string":{"query": "中國翻譯*"}}}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : 0.0,
    "hits" : [ ]
  }
}
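
One way to look into the mechanics would be to ask Elasticsearch how it tokenizes the title, using the `_analyze` API. What follows is only a minimal sketch under assumptions: the field name `title` is hypothetical (use a field that exists in your mapping), and the host and index are the ones from the query above. If the analyzer splits 中國翻譯 into several tokens, no single indexed term would start with the whole string, which might explain why the prefix search finds nothing. This is unconfirmed.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"   # host taken from the query above
INDEX = "koha_robin_biblios"       # index taken from the query above


def analyze_body(text, field=None):
    """Build the JSON body for a request to <index>/_analyze."""
    body = {"text": text}
    if field is not None:
        # Hypothetical field name: pass one that exists in your mapping.
        # Without it, the index's default analyzer is used.
        body["field"] = field
    return body


def analyze(text, field=None):
    """Return the tokens Elasticsearch produces for `text`."""
    req = urllib.request.Request(
        f"{ES_URL}/{INDEX}/_analyze",
        data=json.dumps(analyze_body(text, field)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]
```

Calling `analyze("中國翻譯", field="title")` against a running node lists the tokens, which can then be compared with the string sent in the wildcard query.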

If we quote 中國翻譯 in Koha, we get one result, the right one. It produces
a query like this one:

$ curl "http://bouse02.prive.bulac.fr:9200/koha_robin_biblios/_search?pretty" -d \
'{"from": 0, "size": 0,"query":{"query_string":{"query": "\"中國翻譯\""}}}'
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.0,
    "hits" : [ ]
  }
}

Note that if I write an Elasticsearch query without quotes or star, it
yields too many results (9626), and the “right” result isn't among the first
ten:

$ curl "http://bouse02.prive.bulac.fr:9200/koha_robin_biblios/_search?pretty" -d \
'{"from": 0, "size": 0,"query":{"query_string":{"query": "中國翻譯"}}}'
{
  "took" : 16,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 9626,
    "max_score" : 0.0,
    "hits" : [ ]
  }
}
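
The three queries above can be compared side by side with a small script. This is a rough sketch, not what Koha does internally: the helper names `query_variants`, `count_body` and `count_hits` are made up for illustration, and the base URL and index have to be passed in by the caller. It assumes `hits.total` is a plain number, as in the responses shown here (newer Elasticsearch versions return an object instead).

```python
import json
import urllib.request


def query_variants(term):
    """Return the three query_string queries compared in this message."""
    return {
        "wildcard": f"{term}*",   # star appended, as in the default Koha search
        "quoted": f'"{term}"',    # phrase search
        "bare": term,             # neither star nor quotes
    }


def count_body(query):
    """Body asking only for the hit count (size 0 returns no documents)."""
    return {"from": 0, "size": 0, "query": {"query_string": {"query": query}}}


def count_hits(base_url, index, query):
    """Send the body to <base_url>/<index>/_search and return hits.total."""
    req = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(count_body(query)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["hits"]["total"]
```

Looping over `query_variants("中國翻譯").items()` and calling `count_hits` for each one gives the three totals in a single run. The sketch does not escape quotes or other reserved characters inside the term.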


I'm not sure what the right behaviour should be. We felt that adding quotes
made our results much more relevant, whatever the language.
What is certain is that adding a star to the search by default doesn't help
us. We didn't have this problem with Elasticsearch when we were playing with
it in 17.05. For us, it is a regression. I have attached the MARCXML of our
test record.

What do you think about it?

Best regards,

-- 

*Nicolas Legrand*
Technical administration and development of the library management
system

Bibliothèque universitaire
des langues et civilisations

65 rue des Grands Moulins
F-75013 PARIS
T +33 1 81 69 *18 22*
www.bulac.fr
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bib-431919.marcxml
Type: application/octet-stream
Size: 11952 bytes
Desc: not available
URL: <http://lists.koha-community.org/pipermail/koha-devel/attachments/20180403/17b4b8fa/attachment-0001.obj>

