[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

bugzilla-daemon at bugs.koha-community.org
Fri Dec 11 01:42:33 CET 2015


http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #20 from Yuval Hager <yhager at yhager.com> ---
> I suspect that will make the output of Text::Unaccent and
> Text::Unaccent::PurePerl the same. 
>

Not really; it stays the same garbled mess.

> unac_debug($Text::Unaccent::DEBUG_HIGH);
> 
> That will also tell you what Text::Unaccent is doing (or probably not doing).

I tested on one string:

unac.c:13708: unac_data0[7] & unac_positions[0][8]: 0x05e7 => untouched
unac.c:13708: unac_data0[24] & unac_positions[0][25]: 0x05b8 => untouched
unac.c:13708: unac_data0[30] & unac_positions[0][31]: 0x05de => untouched
unac.c:13708: unac_data0[24] & unac_positions[0][25]: 0x05b8 => untouched
unac.c:13708: unac_data0[5] & unac_positions[0][6]: 0x05e5 => untouched
Text::Unaccent           - קָמָץ => קָ×ָץ
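
To make the debug lines above easier to read, here is a small sketch that lists
the code points of the same test string with their Unicode general categories.
It is Python rather than Perl, purely as an illustration, and the helper name
describe() is made up for this example; it says nothing about what unac.c does
internally.

```python
import unicodedata

# The test string from the log above, written with escapes so the example
# does not depend on the encoding of this message.
word = "\u05e7\u05b8\u05de\u05b8\u05e5"


def describe(text):
    """Return (code point, general category, name) for each character."""
    return [
        (f"0x{ord(ch):04x}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


for row in describe(word):
    print(*row)
```

The two 0x05b8 entries are in category Mn (nonspacing mark); the other three
are letters (Lo). Those Mn characters are what an approach based on stripping
nonspacing marks would remove.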


> Note that nothing seems to happen with the (Japanese?) ideograms that Galen
> tested. I wonder if accents are even a thing with CJK languages...

I am definitely not an authoritative source, but I know a tiny bit of Japanese.
The characters above are Kanji, which to the best of my knowledge do not
have diacritics. BUT Japanese has two more scripts, Hiragana and Katakana,
both of which use diacritics that CANNOT be removed without changing the sound
(and potentially the meaning).
For example, in the word Hiragana, the first syllable is ひ (Hi, pronounced Hee).
The same syllable with two ticks is び, which sounds like Bee. A circle makes
it ぴ, which sounds like Pee. Testing those three:

Text::Unaccent           - ひびぴ => ã²ã²ã²
Text::Unaccent::PurePerl - ひびぴ => ひひひ
Strip NonspacingMark     - ひびぴ => ひひひ

So we've changed 'Hee Bee Pee' to 'Hee Hee Hee'. The result is the same (for
the same syllables) in Katakana:

Text::Unaccent           - ヒビピ => ããã
Text::Unaccent::PurePerl - ヒビピ => ヒヒヒ
Strip NonspacingMark     - ヒビピ => ヒヒヒ
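
The 'Strip NonspacingMark' rows can be reproduced with a short sketch. I am
assuming here that the approach amounts to canonical decomposition followed by
dropping every character in category Mn; the code is Python rather than Perl,
and strip_nonspacing_marks() is a name invented for this example.

```python
import unicodedata


def strip_nonspacing_marks(text):
    """Decompose (NFD), drop all nonspacing marks (Mn), then recompose (NFC)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


# Hiragana hi/bi/pi and Katakana hi/bi/pi, written with escapes.
hiragana = "\u3072\u3073\u3074"
katakana = "\u30d2\u30d3\u30d4"

print(strip_nonspacing_marks(hiragana))
print(strip_nonspacing_marks(katakana))
```

The voiced and semi-voiced kana decompose into the base kana plus U+3099 or
U+309A, both of which are nonspacing marks, which is why all three syllables
collapse into the first one.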

So diacritics, at least in those two scripts, should not be removed, to the
best of my knowledge.
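
One possible way to act on that, sketched in Python rather than Perl: strip
nonspacing marks as before, but leave the two kana sound marks (U+3099 and
U+309A) alone. The function name and the exemption set are assumptions made
for this example, not something any of the modules tested above provide, and
whether other scripts need similar exemptions is not addressed here.

```python
import unicodedata

# Combining voiced (dakuten) and semi-voiced (handakuten) sound marks.
KANA_SOUND_MARKS = {"\u3099", "\u309a"}


def strip_marks_except_kana(text):
    """Drop nonspacing marks after NFD, except the kana sound marks."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch
        for ch in decomposed
        if unicodedata.category(ch) != "Mn" or ch in KANA_SOUND_MARKS
    )
    # Recompose so that a base kana plus its sound mark becomes one character again.
    return unicodedata.normalize("NFC", kept)


print(strip_marks_except_kana("\u3072\u3073\u3074"))  # hiragana hi/bi/pi
print(strip_marks_except_kana("caf\u00e9"))
```

With this, the kana strings come back unchanged while a Latin accent such as
the one in "café" is still removed.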
