[Koha-bugs] [Bug 26390] Add transliteration of Ž in ICU chains

bugzilla-daemon at bugs.koha-community.org bugzilla-daemon at bugs.koha-community.org
Tue Sep 22 02:13:54 CEST 2020


https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=26390

--- Comment #10 from David Cook <dcook at prosentient.com.au> ---
(In reply to Katrin Fischer from comment #8)
> You have to look at the full example in the links I posted. 3 lines:
> 
>     <transform rule="NFD"/>
>     <transform rule="[:Nonspacing Mark:] Remove"/>
>     <transform rule="NFC"/>
> 
> So yes, but then it uses that form to remove the diacritics:
> https://www.compart.com/en/unicode/category/Mn

Ahhh right. I should've been more thorough.
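
For anyone following along, here is a minimal sketch of what that three-step
chain does, written with Python's standard unicodedata module rather than
Zebra/ICU itself. The function name strip_nonspacing_marks is hypothetical,
and I'm assuming that Python's "Mn" category corresponds to ICU's
[:Nonspacing Mark:] set:

```python
import unicodedata


def strip_nonspacing_marks(text: str) -> str:
    """Mimic the NFD -> remove Nonspacing Mark -> NFC chain (illustrative only)."""
    # Step 1 (NFD): decompose, e.g. "Ž" becomes "Z" + U+030C COMBINING CARON.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop every code point in general category Mn (Nonspacing Mark).
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Step 3 (NFC): recompose whatever is left.
    return unicodedata.normalize("NFC", stripped)


print(strip_nonspacing_marks("Žižek"))  # Zizek
```

Note that this only helps for characters that actually decompose into a base
letter plus a combining mark, which Ž does.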

I was thinking recently about how Zebra ICU has been seen as inferior to
Elasticsearch ICU on the listserv. 

Looking at
ftp://ftp.software.ibm.com/software/globalization/icu/3.6/icu-3_6-userguide.pdf,
it looks like ICU actually originated in Java (ICU4J) and was later ported to
C++ and C (ICU4C). 

According to
https://wiki.koha-community.org/wiki/Record_Indexing_and_Retrieval_Options_for_Koha,
the Zebra use of libicu is inferior to Lucene ICU, which uses ICU4J. No
evidence is given for the claim, but it seems believable (especially
considering the global prominence of Solr and Elasticsearch).

Looking at https://lucene.apache.org/core/4_4_0/analyzers-icu/index.html, it
seems that some writing systems (Thai, Chinese, etc.) can be segmented using
dictionary-based algorithms. That explains a lot. I know a bit of Chinese,
and I've wondered how indexers could handle such a context-dependent
language...
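
To illustrate the general idea of dictionary-based segmentation, here is a toy
sketch using greedy forward maximum matching. This is not the algorithm ICU or
Lucene actually uses, just the simplest technique of that family; the function
name, the word list, and the max_len parameter are all made up for the example:

```python
def segment(text, dictionary, max_len=4):
    """Greedy longest-match segmentation against a word list (toy example)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical mini-dictionary: "library", "books", "management", "system".
words = {"图书馆", "图书", "管理", "系统"}
print(segment("图书馆管理系统", words))  # ['图书馆', '管理', '系统']
```

Without a dictionary, an indexer has no whitespace to split on and would have
to fall back to single characters or n-grams.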
