<html dir="ltr">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

<style>@font-face {

        font-family: Cambria Math;

}

@font-face {

        font-family: Calibri;

}

@page WordSection1 {margin: 72.0pt 72.0pt 72.0pt 72.0pt; }

P.MsoNormal {

        FONT-SIZE: 11pt; FONT-FAMILY: "Calibri",sans-serif; MARGIN: 0cm 0cm 0pt

}

LI.MsoNormal {

        FONT-SIZE: 11pt; FONT-FAMILY: "Calibri",sans-serif; MARGIN: 0cm 0cm 0pt

}

DIV.MsoNormal {

        FONT-SIZE: 11pt; FONT-FAMILY: "Calibri",sans-serif; MARGIN: 0cm 0cm 0pt

}

A:link {

        TEXT-DECORATION: underline; COLOR: #0563c1

}

SPAN.MsoHyperlink {

        TEXT-DECORATION: underline; COLOR: #0563c1

}

A:visited {

        TEXT-DECORATION: underline; COLOR: #954f72

}

SPAN.MsoHyperlinkFollowed {

        TEXT-DECORATION: underline; COLOR: #954f72

}

SPAN.EmailStyle17 {

        FONT-FAMILY: "Calibri",sans-serif; COLOR: windowtext

}

.MsoChpDefault {

        FONT-FAMILY: "Calibri",sans-serif

}

</style><style id="owaParaStyle">P {

        MARGIN-BOTTOM: 0px; MARGIN-TOP: 0px

}

</style>

</head>

<body lang="EN-AU" link="#0563c1" vlink="#954f72" fPStyle="1" ocsi="0">

<div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;">

<p>I recently "downgraded" from ICU back to CHR to work around Zebra segmentation faults during a complete reindex.</p>

<p>I still need to investigate further, but my impression is that some Chinese characters made zebraidx crash.</p>

<p> </p>

<div style="FONT-SIZE: 16px; FONT-FAMILY: Times New Roman; COLOR: #000000">

<hr tabindex="-1">

<div id="divRpF150837" style="DIRECTION: ltr"><font color="#000000" size="2" face="Tahoma"><b>From:</b> koha-devel-bounces@lists.koha-community.org [koha-devel-bounces@lists.koha-community.org] on behalf of David Cook [dcook@prosentient.com.au]<br>

<b>Sent:</b> Wednesday 18 November 2015 1:25<br>

<b>To:</b> 'Koha-devel'<br>

<b>Subject:</b> [Koha-devel] Problematic Zebra Charmaps Equivalences<br>

</font><br>

</div>


<div>

<div class="WordSection1">

<p class="MsoNormal">Hi all:</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal">Yet another Zebra email from this guy.</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal">I don’t know how many of you are using CHR vs ICU, but CHR is the default for installs, so I’m guessing that it’s quite a few.

</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal">Well, there are some issues with how we use the equivalent directive. Hopefully the UTF-8 won’t be stripped out of this message, although I’m guessing it might…</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal">Here are all the instances of the directive in word-phrase-utf.chr:</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal"># Characters to be considered equivalent for sorting purposes</p>

<p class="MsoNormal">equivalent aáàãåâăąȧǎȁȃ</p>

<p class="MsoNormal">equivalent ӕä(ae)</p>

<p class="MsoNormal">equivalent ā(aa)</p>

<p class="MsoNormal">equivalent iíìîịĩĭįǐȉȋ</p>

<p class="MsoNormal">equivalent ï(ie)</p>

<p class="MsoNormal">equivalent ī(ii)</p>

<p class="MsoNormal">equivalent uúùûũŭųűǔȕȗ</p>

<p class="MsoNormal">equivalent ü(ue)</p>

<p class="MsoNormal">equivalent ū(uu)</p>

<p class="MsoNormal">equivalent eéèêẽĕęėěȅȇ</p>

<p class="MsoNormal">equivalent ëē(ee)</p>

<p class="MsoNormal">equivalent oóòõôŏǫȯőǒȍȏ</p>

<p class="MsoNormal">equivalent Œœöø(oe)</p>

<p class="MsoNormal">equivalent ō(oo)</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal">Firstly, that comment is wrong. “equivalent” isn’t just for sorting purposes. It’s for searching purposes. Index Data have confirmed that the documentation is wrong on this point.</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal">So “ie” and ï (if you can’t see this character, it’s the one written as &amp;iuml; in HTML) are equivalent. That means searches for “siemon” will get results for “siemon” and “sïmon”.

</p>
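<p class="MsoNormal">As a rough illustration of that widening, here is a toy Python model of a single “equivalent” set. It is only a sketch of the behaviour described above, not Zebra’s actual implementation, and expand_query is a hypothetical helper name.</p>

<pre>
```python
# Toy model of one CHR "equivalent" set; NOT Zebra's real implementation.
# expand_query is a hypothetical helper written for this illustration.

def expand_query(term, equivalents):
    """Return every spelling of term reachable by swapping members of one set."""
    variants = {term}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for member in equivalents:
            for other in equivalents:
                if member == other:
                    continue
                start = current.find(member)
                while start != -1:
                    # Swap this occurrence of member for the other spelling.
                    candidate = current[:start] + other + current[start + len(member):]
                    if candidate not in variants:
                        variants.add(candidate)
                        frontier.append(candidate)
                    start = current.find(member, start + 1)
    return variants

# "equivalent ï(ie)" from word-phrase-utf.chr, written as a Python list
IE_SET = ["ï", "ie"]

print(sorted(expand_query("siemon", IE_SET)))  # ['siemon', 'sïmon']
```
</pre>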

<p class="MsoNormal"> </p>

<p class="MsoNormal">Now, there is also a “map” directive:</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal">map ï                                     i</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal">This means that “sïmon” is treated the same as “simon”. Note that “map” affects both indexing and searching. If you have “sïmon” in a record, you can see that it is actually stored as “simon” in Zebra by searching for it with “format xml” and “elements zebra::index” set. </p>

<p class="MsoNormal"> </p>

<p class="MsoNormal">So, because “ï” is both equivalent to “ie” and mapped to “i”, your search for “siemon” will really get results for “siemon” and “simon”.

</p>
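<p class="MsoNormal">To make the interaction concrete, here is a small Python sketch that combines the two directives. The order of operations (swap equivalents first, then apply the map) is an assumption chosen only to reproduce the behaviour described above; it is not taken from the Zebra source, and all function names are hypothetical.</p>

<pre>
```python
# Toy model of "map" plus "equivalent". The processing order is an assumption
# made for illustration; it is not taken from Zebra itself.

CHAR_MAP = {"ï": "i"}        # map ï i
EQUIVALENT = ("ï", "ie")     # equivalent ï(ie)

def apply_map(text):
    """Normalise text the way "map" does: replace each mapped character."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)

def query_variants(term):
    """Spellings of the query after swapping 'ie' for 'ï', then applying the map."""
    single, multi = EQUIVALENT
    raw = {term, term.replace(multi, single)}
    return {apply_map(variant) for variant in raw}

def search(term, records):
    """Return the records whose normalised form equals a normalised query variant."""
    wanted = query_variants(term)
    return [record for record in records if apply_map(record) in wanted]

records = ["siemon", "sïmon", "simon", "salmon"]
print(search("siemon", records))  # ['siemon', 'sïmon', 'simon']
```
</pre>

<p class="MsoNormal">In this model the record “simon” is returned even though the query never contained a plain “i” on its own, which is the surprising part.</p>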

<p class="MsoNormal"> </p>

<p class="MsoNormal">This really isn’t ideal. However, you can see why you’d want equivalences. In Scandinavian languages, I think “å” and “aa” are roughly equivalent. They’re spelled differently but they’re the same sound. So if you search for “Gaard”, you

 might want hits for “Gård” as well. </p>

<p class="MsoNormal"> </p>

<p class="MsoNormal">But you might not want “career” to be equivalent to “carer”, as they’re two different words. The same goes for “choose” and “chose”, “sloop” and “slop”, “reef” and “ref”, etc.</p>
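<p class="MsoNormal">One way to get a feel for how many English words are affected is to run a word list through a collision check like the sketch below. It assumes that ē and ō are mapped to plain e and o in the charmap, in the same way ï is mapped to i above; that assumption, and the helper names, are mine rather than something confirmed in the file.</p>

<pre>
```python
# Toy collision check. Assumption: ē and ō are mapped to plain e and o in the
# charmap, the same way the message shows ï being mapped to i.
from collections import defaultdict

# Two-letter groups from "equivalent ëē(ee)" and "equivalent ō(oo)",
# paired with the letter they would collapse to under that assumption.
COLLAPSE = {"ee": "e", "oo": "o"}

def search_key(word):
    """Collapse each doubled vowel so equivalent spellings share one key."""
    for pair, single in COLLAPSE.items():
        word = word.replace(pair, single)
    return word

def collisions(words):
    """Group words that would become indistinguishable at search time."""
    groups = defaultdict(list)
    for word in words:
        groups[search_key(word)].append(word)
    return [group for group in groups.values() if len(group) != 1]

words = ["career", "carer", "choose", "chose", "sloop", "slop", "reef", "ref", "zebra"]
for group in collisions(words):
    print(group)
```
</pre>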

<p class="MsoNormal"> </p>

<p class="MsoNormal">--</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal">Unfortunately, I don’t really know what the solution is. For one client, I’ve disabled the equivalent directive wherever it makes a two-letter combination equivalent to a single letter, as they only have records in English, and it’ll just cause them headaches.</p>
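<p class="MsoNormal">For anyone wanting to try the same thing, here is a minimal Python sketch that comments out those directives. It assumes the multi-letter side is always written in parentheses, as in the listing above, and that a leading “#” marks a comment, as it does for the existing comment line in that file. The function name is hypothetical, and you would want to run it on a copy of the charmap rather than the live file.</p>

<pre>
```python
import re

# Matches an "equivalent" directive containing a parenthesised group of two or
# more letters, e.g. "equivalent ï(ie)".
MULTI_LETTER = re.compile(r"^\s*equivalent\s+.*\(\w\w+\)")

def disable_multi_letter_equivalents(chr_text):
    """Comment out every "equivalent" line that pairs letters with a multi-letter group."""
    out = []
    for line in chr_text.splitlines():
        if MULTI_LETTER.match(line):
            out.append("# " + line)
        else:
            out.append(line)
    return "\n".join(out)

sample = "equivalent iíìî\nequivalent ï(ie)\nequivalent ī(ii)\nmap ï i"
print(disable_multi_letter_equivalents(sample))
```
</pre>

<p class="MsoNormal">Single-letter sets such as “equivalent iíìî” are left alone, so accent-insensitive searching in this sketch is unchanged.</p>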

<p class="MsoNormal"> </p>

<p class="MsoNormal">I can see this being useful for multilingual records… although I think many people with multilingual records use ICU. I don’t know ICU well enough to know how it manages characters that English speakers would think of as accents or ligatures.

 I know you can provide your own normalization with ICU, but I think it does a fair amount on its own as well…</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal">I think some of the difficulties are mentioned here: <a href="http://userguide.icu-project.org/collation/icu-string-search-service" target="_blank">

http://userguide.icu-project.org/collation/icu-string-search-service</a>. It also mentions the Danish å/aa example. I don’t know how ICU would know how to handle particular languages… that webpage seems to indicate you can provide a locale to deal with it.

  </p>

<p class="MsoNormal"> </p>

<p class="MsoNormal">Of course, that doesn’t necessarily solve things. If you have multilingual records with multilingual users, how do you choose your rules? Sure, you might be able to specify a locale at search time (note you can’t do this with Zebra), but

 what rules did you specify at index time? </p>

<p class="MsoNormal"> </p>

<p class="MsoNormal">As anyone who has watched this video (<a href="https://www.youtube.com/watch?v=0j74jcxSunY" target="_blank">https://www.youtube.com/watch?v=0j74jcxSunY</a>) would know, internationalis(z)ing code has many challenges…</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal">--</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal">Anyway, the reason for this email is mostly just to make you all aware of this issue, and how “equivalent” and “map” work in the Charmap files when using CHR indexing.</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal">Oh, also, if you look at “default.idx”, you’ll see that “sort s” references “charmap sort-string-utf.chr”, but I don’t think sort-string-utf.chr actually exists anywhere…</p>
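<p class="MsoNormal">A quick way to check for dangling references like that one is sketched below in Python. It assumes each reference appears on its own line as “charmap” followed by a file name, and that you know which directories Zebra searches for charmaps; search_dirs is a hypothetical parameter you would fill in for your own install.</p>

<pre>
```python
import os
import re

# Matches a line such as "charmap word-phrase-utf.chr" and captures the file name.
CHARMAP_LINE = re.compile(r"^\s*charmap\s+(\S+)")

def missing_charmaps(idx_text, search_dirs):
    """List charmap files named in an .idx file that exist in none of search_dirs."""
    missing = []
    for line in idx_text.splitlines():
        match = CHARMAP_LINE.match(line)
        if not match:
            continue
        name = match.group(1)
        found = any(os.path.isfile(os.path.join(d, name)) for d in search_dirs)
        if not found and name not in missing:
            missing.append(name)
    return missing
```
</pre>

<p class="MsoNormal">Calling missing_charmaps with the contents of “default.idx” and your charmap directories should then list “sort-string-utf.chr” if the file really is absent.</p>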

<p class="MsoNormal"> </p>

<p class="MsoNormal"><span>David Cook</span></p>

<p class="MsoNormal"><span>Systems Librarian</span></p>

<p class="MsoNormal"><span>Prosentient Systems</span></p>

<p class="MsoNormal"><span>72/330 Wattle St, Ultimo, NSW 2007</span></p>

<p class="MsoNormal"> </p>

</div>

</div>

</div>

</div>

</body>

</html>