[Koha-devel] Problematic Zebra Charmaps Equivalences

Wed Nov 18 23:33:05 CET 2015

Yikes, that’s not good. It would be great if you could investigate further, let us know how it goes, and let Indexdata know as well.

David Cook

Systems Librarian

Prosentient Systems

72/330 Wattle St, Ultimo, NSW 2007

From: koha-devel-bounces at lists.koha-community.org [mailto:koha-devel-bounces at lists.koha-community.org] On Behalf Of Marcel de Rooy
Sent: Wednesday, 18 November 2015 6:13 PM
To: 'Koha-devel' <koha-devel at lists.koha-community.org>
Subject: Re: [Koha-devel] Problematic Zebra Charmaps Equivalences

I recently "downgraded" from ICU back to CHR in order to overcome Zebra segmentation faults on a complete reindex.

I still need to investigate further, but I have the impression that some Chinese characters made zebraidx crash.

  _____  

From: koha-devel-bounces at lists.koha-community.org [koha-devel-bounces at lists.koha-community.org] on behalf of David Cook [dcook at prosentient.com.au]
Sent: Wednesday, 18 November 2015 1:25
To: 'Koha-devel'
Subject: [Koha-devel] Problematic Zebra Charmaps Equivalences

Hi all:

Yet another Zebra email from this guy.

I don’t know how many of you are using CHR vs ICU, but CHR is the default for installs, so I’m guessing that it’s quite a few. 

Well, there are some issues with how we use the equivalent directive. Hopefully the UTF-8 won’t be stripped out of this message, although I’m guessing it might…

Here are all instances of the directive in word-phrase-utf.chr:

# Characters to be considered equivalent for sorting purposes

equivalent aáàãåâăąȧǎȁȃ

equivalent ӕä(ae)

equivalent ā(aa)

equivalent iíìîịĩĭįǐȉȋ

equivalent ï(ie)

equivalent ī(ii)

equivalent uúùûũŭųűǔȕȗ

equivalent ü(ue)

equivalent ū(uu)

equivalent eéèêẽĕęėěȅȇ

equivalent ëē(ee)

equivalent oóòõôŏǫȯőǒȍȏ

equivalent Œœöø(oe)

equivalent ō(oo)

Firstly, that comment is wrong. “equivalent” isn’t just for sorting purposes; it applies to searching as well. Indexdata have confirmed that the documentation is wrong on this point.

So “ie” and ï (if you can’t see this character, it’s an i with a diaeresis) are equivalent. That means searches for “siemon” will get results for “siemon” and “sïmon”.

Now, there is also a “map” directive:

map ï                                     i

This means that “sïmon” is the same as “simon”. Now, “map” affects both indexing and searching. If you have “sïmon” in a record, you can see that it is actually stored as “simon” in Zebra, if you do a search for it and use “format xml” and “elements zebra::index”. 

So your search for “siemon” will really get results for “siemon” and “simon”. 
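
The interaction described above can be sketched as a toy model in Python. This is not how Zebra is implemented; it only mimics the observable behaviour described in this email, using the “map ï i” and “equivalent ï(ie)” rules quoted above. The function names (index_form, query_forms, search) and the sample records are made up for illustration.

```python
# Toy model of CHR "map" and "equivalent"; not Zebra's real implementation.

# From the charmap: "map ï i" (applied at both index and search time).
MAP = {"ï": "i"}

# From the charmap: "equivalent ï(ie)", stored as (single character, sequence).
EQUIVALENT = [("ï", "ie")]


def index_form(term):
    """Apply the map rules character by character."""
    return "".join(MAP.get(ch, ch) for ch in term)


def query_forms(term):
    """Expand a query through the equivalences, then map each variant."""
    variants = {term}
    for single, sequence in EQUIVALENT:
        variants.add(term.replace(sequence, single))
        variants.add(term.replace(single, sequence))
    return {index_form(v) for v in variants}


def search(query, records):
    """Return the records whose indexed form matches any form of the query."""
    wanted = query_forms(query)
    return [r for r in records if index_form(r) in wanted]


# Hypothetical single-word "records", for illustration only.
records = ["siemon", "sïmon", "simon", "salmon"]
print(search("siemon", records))
```

In this model the query “siemon” expands to “sïmon” through the equivalence, and “sïmon” then collapses to “simon” through the map, so the plain “simon” record is returned as well.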

This really isn’t ideal. However, you can see why you’d want equivalences. In Scandinavian languages, I think “å” and “aa” are roughly equivalent. They’re spelled differently but they’re the same sound. So if you search for “Gaard”, you might want hits for “Gård” as well. 

But you might not want “career” to be equivalent to “carer”, as they’re two different words. The same goes for “choose” and “chose”, “sloop” and “slop”, “reef” and “ref”, etc.

--

Unfortunately, I don’t really know what the solution is. For one client, I’ve disabled the equivalent directive wherever it creates an equivalence between a two-letter combination and a single letter, as they only have records in English and it would just cause them headaches.
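
One way to apply that kind of change consistently is a small script rather than hand-editing. The following is only a sketch under assumptions: it treats any “equivalent” line containing a parenthesised sequence such as (ae) as a multi-letter equivalence and comments it out with “#”, the comment marker already used in the file quoted above. The function name disable_multichar_equivalents is made up, and the script works on text in memory; reading and writing the actual charmap file is left out.

```python
def disable_multichar_equivalents(chr_text):
    """Comment out 'equivalent' lines that contain a parenthesised sequence.

    Lines such as 'equivalent ï(ie)' are prefixed with '# '; every other
    line, including single-character equivalences, is left unchanged.
    """
    out = []
    for line in chr_text.splitlines(keepends=True):
        parts = line.split()
        if parts and parts[0] == "equivalent" and "(" in line:
            out.append("# " + line)
        else:
            out.append(line)
    return "".join(out)


# Short sample built from lines quoted in this email.
sample = "equivalent iíì\nequivalent ï(ie)\nmap ï i\n"
print(disable_multichar_equivalents(sample))
```

Single-character equivalences are kept, so accent-insensitive matching is unaffected; only the two-letter expansions are switched off.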

I can see this being useful for multilingual records… although I think many people with multilingual records use ICU. I don’t know ICU well enough to know how it manages characters that English speakers would think of as accents or ligatures. I know you can provide your own normalization with ICU, but I think it does a fair amount on its own as well…

I think some of the difficulties are mentioned at http://userguide.icu-project.org/collation/icu-string-search-service, which also mentions the Danish å/aa example. I don’t know how ICU would know how to handle particular languages, but that page seems to indicate you can provide a locale to deal with it.

Of course, that doesn’t necessarily solve things. If you have multilingual records with multilingual users, how do you choose your rules? Sure, you might be able to specify a locale at search time (note you can’t do this with Zebra), but what rules did you specify at index time? 

As anyone who has watched this video (https://www.youtube.com/watch?v=0j74jcxSunY) would know, internationalis(z)ing code has many challenges…

--

Anyway, the reason for this email is mostly just to make you all aware of this issue, and how “equivalent” and “map” work in the Charmap files when using CHR indexing.

Oh, also, if you look at “default.idx”, you’ll see that “sort s” references “charmap sort-string-utf.chr”, but I don’t think sort-string-utf.chr actually exists anywhere…
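
A quick way to check for dangling references like this is to list every “charmap” line in an .idx file and compare the names against the files that are actually present. This is a rough sketch, not a Zebra tool: it assumes each reference is a line of the form “charmap <filename>”, it does not model how Zebra resolves paths, and the function name missing_charmaps and the sample fragment are made up (only the “sort s” / “charmap sort-string-utf.chr” pair comes from this email).

```python
def missing_charmaps(idx_text, available_files):
    """Return charmap filenames referenced in idx_text but not available.

    idx_text is the content of an .idx file; available_files is a set of
    filenames known to exist. Each missing name is reported once, in order.
    """
    missing = []
    for line in idx_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "charmap":
            name = parts[1]
            if name not in available_files and name not in missing:
                missing.append(name)
    return missing


# Illustrative fragment, not the full default.idx.
sample_idx = (
    "index w\n"
    "charmap word-phrase-utf.chr\n"
    "sort s\n"
    "charmap sort-string-utf.chr\n"
)
print(missing_charmaps(sample_idx, {"word-phrase-utf.chr"}))
```

In practice available_files could be filled from a directory listing of wherever the charmap files are installed.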

David Cook

Systems Librarian

Prosentient Systems

72/330 Wattle St, Ultimo, NSW 2007
