[Koha-devel] Problematic Zebra Charmaps Equivalences

Tue Nov 24 09:07:31 CET 2015

Hi,

Thanks a lot for this complex study.

sort-string-utf.chr does exist, in etc/zebradb/lang_defs/xx.
It depends on the language chosen at install time because it contains
the prefixes that are ignored for sorting, for example:
   map (^The\s)    @
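
To illustrate what such a rule does to the sort order, here is a rough Python sketch. It assumes that mapping to @ simply drops the matched prefix from the sort key; the name sort_key and the sample titles are made up for the example, and the lowercasing is my own addition. It is not how Zebra itself is implemented.

```python
import re

# Same pattern as in the charmap line: a leading "The" followed by whitespace.
LEADING_ARTICLE = re.compile(r"^The\s")

def sort_key(title):
    """Hypothetical sort key: drop the leading article, then lowercase."""
    return LEADING_ARTICLE.sub("", title).lower()

titles = ["The Zebra", "Apple", "The Hobbit"]
print(sorted(titles, key=sort_key))  # ['Apple', 'The Hobbit', 'The Zebra']
```

A French install would need different articles (Le, La, Les...), which is why the file lives under the language directory.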

CHR indexing is indeed not perfect.
We use ICU instead, and for a French catalog it seems to work well.
One advantage is that you can fully customize tokenization through the
xxxx-icu.xml config files.

Regards,

On 18/11/2015 01:25, David Cook wrote:
> Hi all:
>
>
>
> Yet another Zebra email from this guy.
>
>
>
> I don’t know how many of you are using CHR vs ICU, but CHR is the default for installs, so I’m guessing that it’s quite a few.
>
>
>
> Well, there are some issues with how we use the equivalent directive. Hopefully the UTF-8 won’t be stripped out of this message, although I’m guessing it might…
>
>
>
> Here’s all instances of the directive in word-phrase-utf.chr:
>
>
>
> # Characters to be considered equivalent for sorting purposes
> equivalent aáàãåâăąȧǎȁȃ
> equivalent ӕä(ae)
> equivalent ā(aa)
> equivalent iíìîịĩĭįǐȉȋ
> equivalent ï(ie)
> equivalent ī(ii)
> equivalent uúùûũŭųűǔȕȗ
> equivalent ü(ue)
> equivalent ū(uu)
> equivalent eéèêẽĕęėěȅȇ
> equivalent ëē(ee)
> equivalent oóòõôŏǫȯőǒȍȏ
> equivalent Œœöø(oe)
> equivalent ō(oo)
>
>
>
> Firstly, that comment is wrong. “equivalent” isn’t just for sorting purposes. It’s for searching purposes too. Index Data have confirmed that the documentation is wrong to describe it as a sorting-only directive.
>
>
>
> So “ie” and “ï” (an i with a diaeresis, in case the character doesn’t display for you) are equivalent. That means a search for “siemon” will get results for both “siemon” and “sïmon”.
>
>
>
> Now, there is also a “map” directive:
>
>
>
> map ï                                     i
>
>
>
> This means that “sïmon” is treated the same as “simon”. Now, “map” affects both indexing and searching. If you have “sïmon” in a record, you can see that it is actually stored as “simon” in Zebra by searching for it with “format xml” and “elements zebra::index”.
>
>
>
> So, because “sïmon” is indexed as “simon”, your search for “siemon” will really get results for “siemon” and “simon”.
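
The interaction described above can be mimicked with a small toy model in Python. It is only a sketch under assumptions: the two tables are reduced to the single ï example, the names normalize, expand and search are invented here, and Zebra's real matching code certainly works differently internally.

```python
I_DIAERESIS = "\u00ef"  # the character ï, written as an escape to survive mail encoding

MAP = {I_DIAERESIS: "i"}            # stands in for: map ï i
EQUIVALENT = [(I_DIAERESIS, "ie")]  # stands in for: equivalent ï(ie)

def normalize(term):
    """Apply the 'map' table to every character (used at index and search time)."""
    return "".join(MAP.get(ch, ch) for ch in term.lower())

def expand(term):
    """Return the term plus the variants produced by swapping equivalent forms."""
    variants = {term.lower()}
    for left, right in EQUIVALENT:
        for variant in list(variants):
            variants.add(variant.replace(left, right))
            variants.add(variant.replace(right, left))
    return variants

def search(query, records):
    """Match records whose normalized form equals any normalized query variant."""
    wanted = {normalize(variant) for variant in expand(query)}
    return [record for record in records if normalize(record) in wanted]

records = ["Siemon", "S\u00efmon", "Simon", "Salmon"]
print(search("siemon", records))  # matches Siemon, Sïmon and Simon
print(search("simon", records))   # matches Sïmon and Simon only
```

In this model the unwanted "Simon" hit comes purely from the combination of the two directives: "equivalent" turns "ie" into "ï", and "map" then turns "ï" into "i".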
>
>
>
> This really isn’t ideal. However, you can see why you’d want equivalences. In Scandinavian languages, I think “å” and “aa” are roughly equivalent. They’re spelled differently but they’re the same sound. So if you search for “Gaard”, you might want hits for “Gård” as well.
>
>
>
> But you might not want “career” to be equivalent to “carer” as they’re two different words. Or “choose” to be equivalent to “chose”, “sloop” to “slop”, “reef” to “ref”, etc.
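
To get a feel for how many English words are affected, one could scan a word list for pairs that collapse to the same form. The sketch below is hypothetical: the COLLAPSE table only covers the digraphs behind the examples above (plus "ie"), it assumes each one ends up as its single base letter, and the function names are invented.

```python
from collections import defaultdict

# Assumed net effect of "equivalent" plus "map" for a few digraphs.
COLLAPSE = {"ee": "e", "oo": "o", "ie": "i"}

def collapsed_key(word):
    """Replace each two-letter combination by its one-letter counterpart."""
    key = word.lower()
    for digraph, letter in COLLAPSE.items():
        key = key.replace(digraph, letter)
    return key

def find_collisions(words):
    """Group words by collapsed key and keep only groups with several members."""
    groups = defaultdict(set)
    for word in words:
        groups[collapsed_key(word)].add(word.lower())
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}

words = ["career", "carer", "choose", "chose", "sloop", "slop", "reef", "ref", "zebra"]
for key, group in find_collisions(words).items():
    print(key, group)
```

Running it over the distinct words of an actual catalogue would show which equivalences cause real false hits for that collection.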
>
>
>
> --
>
>
>
> Unfortunately, I don’t really know what the solution is. For one client, I’ve disabled every “equivalent” directive that makes a two-letter combination equivalent to a single letter, as they only have records in English, and it would just cause them headaches.
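
For anyone wanting to do the same, here is a rough sketch of how those lines could be commented out automatically. It assumes that multi-letter forms are always written in parentheses, as in the listing above, and that "#" starts a comment in the .chr file; the function name is invented, and the result should be reviewed by hand (and Zebra reindexed) before relying on it.

```python
import re

# An "equivalent" line containing a parenthesised group of two or more characters.
MULTI_CHAR_EQUIVALENT = re.compile(r"^\s*equivalent\s+.*\([^()]{2,}\)")

def disable_multichar_equivalents(chr_text):
    """Comment out 'equivalent' lines that involve a multi-letter combination."""
    output = []
    for line in chr_text.splitlines():
        if MULTI_CHAR_EQUIVALENT.match(line):
            output.append("# " + line)
        else:
            output.append(line)
    return "\n".join(output)

sample = "equivalent a\u00e1\u00e0\nequivalent \u00ef(ie)\nmap \u00ef i"
print(disable_multichar_equivalents(sample))
```

Single-letter groups such as the "a" variants are left alone, so accented and unaccented spellings still match each other.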
>
>
>
> I can see this being useful for multilingual records… although I think many people with multilingual records use ICU. I don’t know ICU well enough to know how it manages characters that English speakers would think of as accents or ligatures. I know you can provide your own normalization with ICU, but I think it does a fair amount on its own as well…
>
>
>
> I think some of the difficulties are mentioned here: http://userguide.icu-project.org/collation/icu-string-search-service. It also mentions the Danish å/aa example. I don’t know how ICU would know how to handle particular languages… that webpage seems to indicate you can provide a locale to deal with it.
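
As a point of comparison, here is what plain Unicode decomposition does to these characters, using Python's standard unicodedata module. This is not ICU and not what Zebra runs; it is only a sketch showing that generic accent stripping gives "Gard", never "Gaard", so the å/aa case has to come from a language-specific rule. The function name strip_marks is invented.

```python
import unicodedata

def strip_marks(text):
    """Decompose to NFD and drop the combining marks (accents, rings, diaereses)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_marks("G\u00e5rd"))   # Gard
print(strip_marks("s\u00efmon"))  # simon
```

Letters such as ø have no canonical decomposition at all, so they pass through this function unchanged and would also need an explicit rule.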
>
>
>
> Of course, that doesn’t necessarily solve things. If you have multilingual records with multilingual users, how do you choose your rules? Sure, you might be able to specify a locale at search time (note you can’t do this with Zebra), but what rules did you specify at index time?
>
>
>
> As anyone who has watched this video (https://www.youtube.com/watch?v=0j74jcxSunY) would know, internationalis(z)ing code has many challenges…
>
>
>
> --
>
>
>
> Anyway, the reason for this email is mostly just to make you all aware of this issue, and how “equivalent” and “map” work in the Charmap files when using CHR indexing.
>
>
>
> Oh, also, if you look at “default.idx”, you’ll see that “sort s” references “charmap sort-string-utf.chr”, but I don’t think sort-string-utf.chr actually exists anywhere…
>
>
>
> David Cook
>
> Systems Librarian
>
> Prosentient Systems
>
> 72/330 Wattle St, Ultimo, NSW 2007

-- 
Fridolin SOMERS
Biblibre - Pôles support et système
fridolin.somers at biblibre.com