[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

Tue Dec 8 01:59:52 CET 2015

http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #10 from David Cook <dcook at prosentient.com.au> ---
So I think I've tracked down the C code behind Text::Unaccent:

https://github.com/gitpan/Text-Unaccent/blob/master/unac.c

The only reference I see to "damma" is in the U+FE70...U+FEFF code point range,
which appears to list isolated forms, and that is not what we're dealing with in
these examples.

While I haven't reviewed the code extensively, it looks like the tables used
for Text::Unaccent are lacking...

If you replace the following line in Galen's script:

use Text::Unaccent qw//;

with

use Text::Unaccent qw/unac_debug/;
unac_debug($Text::Unaccent::DEBUG_HIGH);

You'll get more details of how Text::Unaccent is working (or not working, as it
were).

Here's the output I get for the Arabic:

unac.c:13708: unac_data0[5] & unac_positions[0][6]: 0x0645 => untouched
unac.c:13708: unac_data0[15] & unac_positions[0][16]: 0x064f => untouched
unac.c:13708: unac_data34[15] & unac_positions[34][16]: 0x062f => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[17] & unac_positions[34][18]: 0x0631 => untouched
unac.c:13708: unac_data0[16] & unac_positions[0][17]: 0x0650 => untouched
unac.c:13708: unac_data0[17] & unac_positions[0][18]: 0x0651 => untouched
unac.c:13708: unac_data34[19] & unac_positions[34][20]: 0x0633 => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[9] & unac_positions[34][10]: 0x0629 => untouched
Text::Unaccent           - مُدَرِّسَة => Ù�Ù�Ø¯Ù�Ø±Ù�Ù�Ø³Ù�Ø©
Strip NonspacingMark     - مُدَرِّسَة => مدرسة

Here's the output I get for the Greek:
unac.c:13708: unac_data21[6] & unac_positions[21][7]: 0x0386 => 0x0391
unac.c:13708: unac_data22[12] & unac_positions[22][13]: 0x03ac => 0x03b1
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[8] & unac_positions[21][9]: 0x0388 => 0x0395
unac.c:13708: unac_data22[13] & unac_positions[22][14]: 0x03ad => 0x03b5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[9] & unac_positions[21][10]: 0x0389 => 0x0397
unac.c:13708: unac_data22[14] & unac_positions[22][15]: 0x03ae => 0x03b7
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[10] & unac_positions[21][11]: 0x038a => 0x0399
unac.c:13708: unac_data22[15] & unac_positions[22][16]: 0x03af => 0x03b9
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[12] & unac_positions[21][13]: 0x038c => 0x039f
unac.c:13708: unac_data23[12] & unac_positions[23][13]: 0x03cc => 0x03bf
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[14] & unac_positions[21][15]: 0x038e => 0x03a5
unac.c:13708: unac_data23[13] & unac_positions[23][14]: 0x03cd => 0x03c5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[15] & unac_positions[21][16]: 0x038f => 0x03a9
unac.c:13708: unac_data23[14] & unac_positions[23][15]: 0x03ce => 0x03c9
Text::Unaccent           - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Î�Î± Î�Îµ Î�Î· Î�Î¹ Î�Î¿
Î¥Ï� Î©Ï�
Strip NonspacingMark     - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω

Interestingly, we can see a reference to the Greek capital letter alpha with the
tonos diacritic:

* 0386 GREEK CAPITAL LETTER ALPHA WITH TONOS
*       0391 GREEK CAPITAL LETTER ALPHA

Indeed, in the output, we can see that 0x0386 was changed to 0x0391, although
admittedly I don't know exactly how. It looks like a binary operation that uses
a bitmask to produce a certain value. We don't need to know 100% how that
mechanism works right now, just that it works as described above.
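
For what it's worth, the debug lines themselves hint at the bitmask. In every
Greek line quoted above, the second index (the N in unac_positions[block][N]) is
one more than the low five bits of the code point, which is what you'd expect if
the tables were split into blocks of 32 code points. Here's a quick check in
Python (this is not the unac.c code; the 0x1F mask and the names are my own
assumptions, inferred only from this output):

```python
# Code point -> second index printed in the debug lines above,
# i.e. the N in "unac_positions[block][N]".
observed = {
    0x0386: 7,
    0x03AC: 13,
    0x0020: 1,
    0x0388: 9,
    0x03CC: 13,
    0x038F: 16,
    0x03CE: 15,
}

BLOCK_MASK = 0x1F  # assumption: 32 code points per block


def position_in_block(code_point):
    """Return the low five bits of the code point."""
    return code_point & BLOCK_MASK


# Every printed index should be the masked value plus one.
for code_point, printed in observed.items():
    assert position_in_block(code_point) + 1 == printed
```

If the asserts pass, the mask at least fits the output. It says nothing about
how the block number (21, 22, 23...) is chosen.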

--

So in the Arabic example... everything was "untouched" and yet the output is
garbled. That's certainly an encoding issue... 

Indeed, look at the following:

dcook at koha:~/experiments> echo "مُدَرِّسَة" | iconv -f latin1 -t utf-8
Ù�Ù�Ø¯Ù�Ø±Ù�Ù�Ø³Ù�Ø©

That is the same output as Text::Unaccent:

Text::Unaccent           - مُدَرِّسَة => Ù�Ù�Ø¯Ù�Ø±Ù�Ù�Ø³Ù�Ø©

So somewhere along the line that UTF-8 string is getting double-encoded.

Check this out:

dcook at koha:~/experiments> echo "مُدَرِّسَة" | iconv -f latin1 -t utf-8 | iconv -f utf-8 -t latin1
مُدَرِّسَة
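
The same round trip can be reproduced without iconv. This is a small Python
sketch (Python rather than Perl, purely as an illustration; the variable names
are mine) that builds the word from the code points in the debug output,
misreads its UTF-8 bytes as Latin-1, encodes the result again, and then undoes
the damage:

```python
# Code points of the Arabic test word, in the order the debug output lists them.
code_points = [0x0645, 0x064F, 0x062F, 0x064E, 0x0631,
               0x0650, 0x0651, 0x0633, 0x064E, 0x0629]
word = "".join(chr(cp) for cp in code_points)

utf8_bytes = word.encode("utf-8")          # the correct encoding
mojibake = utf8_bytes.decode("latin-1")    # each byte misread as one Latin-1 character
double_encoded = mojibake.encode("utf-8")  # encoded a second time

# Reverse the two steps, like the second iconv call above.
recovered = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")

print(len(utf8_bytes), len(double_encoded), recovered == word)
```

Each of the ten characters takes two bytes in UTF-8, and every one of those
bytes is above 0x7F, so the double-encoded form is 40 bytes instead of 20. The
first character of the misread string is "Ù", which matches the garbled output.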

I think the double-encoding is down to us using "binmode STDOUT, ':utf8';"
(which tells Perl to output UTF-8 encoded bytes instead of Latin-1, or whatever
other single-byte encoding it normally uses) and "use utf8" (which tells Perl
that the source code uses UTF-8)...

Removing those gets us the following:

Text::Unaccent           - été => ete

Strip NonspacingMark     - été => A▒tA▒
Text::Unaccent           - umlaüt => umlaut
Wide character in print at unaccent.pl line 47.
Strip NonspacingMark     - umlaÃ¼t => umlaA1⁄4t
Text::Unaccent           - עברית => עברית

Strip NonspacingMark     - עברית => עב▒ י▒a
Text::Unaccent           - חוֹלָם => חוֹלָם
Strip NonspacingMark     - חוֹלָם => חוO1לO ם
Text::Unaccent           - 北京市 => 北京市

Strip NonspacingMark     - 北京市 => a▒▒ao▒a ▒
Text::Unaccent           - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω

Strip NonspacingMark     - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => I▒I▒ I▒I▒ I▒I▒ I▒I  I▒I▒
I▒I▒ I▒I▒
Text::Unaccent           - مُدَرِّسَة => مُدَرِّسَة

Strip NonspacingMark     - مُدَرِّسَة => U▒U▒▒ U▒رU▒U▒▒3U▒ة

At a glance, Text::Unaccent looks like it works for French, German, and
Greek... but doesn't touch Hebrew, Chinese, or Arabic.
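
The garbled "Strip NonspacingMark" lines above fit the same explanation from
the other direction. Galen's script isn't quoted in this comment, so this is
only a guess at what it does: a Python sketch (not Perl) that assumes the
approach is "compatibility-decompose the string, then drop every character in
category Mn". The function name is made up.

```python
import unicodedata


def strip_nonspacing_marks(text):
    """Decompose text (NFKD) and drop every nonspacing mark (category Mn)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Properly decoded text: the accents are removed.
decoded = "\u00e9t\u00e9"  # "été" as two accented characters and a "t"
print(strip_nonspacing_marks(decoded))

# The same word with its UTF-8 bytes treated as Latin-1 characters, which is
# roughly what happens to a string literal once "use utf8" is removed.
misread = decoded.encode("utf-8").decode("latin-1")
print(strip_nonspacing_marks(misread))
```

Without "use utf8", é arrives as the two characters Ã and ©. Ã decomposes to A
plus a combining tilde, the tilde is stripped, and you're left with "A©", which
a UTF-8 terminal would show as "A▒". The same thing turns ü into "A1⁄4", and the
fraction slash in there would account for the "Wide character in print"
warning.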

Here's that output again with the debugging:

unac.c:13708: unac_data3[10] & unac_positions[3][10]: 0x00e9 => 0x0065
unac.c:13708: unac_data0[20] & unac_positions[0][21]: 0x0074 => untouched
unac.c:13708: unac_data3[10] & unac_positions[3][10]: 0x00e9 => 0x0065
Text::Unaccent           - été => ete

Strip NonspacingMark     - été => A▒tA▒
unac.c:13708: unac_data0[21] & unac_positions[0][22]: 0x0075 => untouched
unac.c:13708: unac_data0[13] & unac_positions[0][14]: 0x006d => untouched
unac.c:13708: unac_data0[12] & unac_positions[0][13]: 0x006c => untouched
unac.c:13708: unac_data0[1] & unac_positions[0][2]: 0x0061 => untouched
unac.c:13708: unac_data3[29] & unac_positions[3][29]: 0x00fc => 0x0075
unac.c:13708: unac_data0[20] & unac_positions[0][21]: 0x0074 => untouched
Text::Unaccent           - umlaüt => umlaut
Wide character in print at unaccent.pl line 47.
Strip NonspacingMark     - umlaÃ¼t => umlaA1⁄4t
unac.c:13708: unac_data0[2] & unac_positions[0][3]: 0x05e2 => untouched
unac.c:13708: unac_data0[17] & unac_positions[0][18]: 0x05d1 => untouched
unac.c:13708: unac_data0[8] & unac_positions[0][9]: 0x05e8 => untouched
unac.c:13708: unac_data0[25] & unac_positions[0][26]: 0x05d9 => untouched
unac.c:13708: unac_data0[10] & unac_positions[0][11]: 0x05ea => untouched
Text::Unaccent           - עברית => עברית

Strip NonspacingMark     - עברית => עב▒ י▒a
unac.c:13708: unac_data0[23] & unac_positions[0][24]: 0x05d7 => untouched
unac.c:13708: unac_data0[21] & unac_positions[0][22]: 0x05d5 => untouched
unac.c:13708: unac_data0[25] & unac_positions[0][26]: 0x05b9 => untouched
unac.c:13708: unac_data0[28] & unac_positions[0][29]: 0x05dc => untouched
unac.c:13708: unac_data0[24] & unac_positions[0][25]: 0x05b8 => untouched
unac.c:13708: unac_data0[29] & unac_positions[0][30]: 0x05dd => untouched
Text::Unaccent           - חוֹלָם => חוֹלָם
Strip NonspacingMark     - חוֹלָם => חוO1לO ם
unac.c:13708: unac_data0[23] & unac_positions[0][24]: 0x5317 => untouched
unac.c:13708: unac_data0[12] & unac_positions[0][13]: 0x4eac => untouched
unac.c:13708: unac_data0[2] & unac_positions[0][3]: 0x5e02 => untouched
Text::Unaccent           - 北京市 => 北京市

Strip NonspacingMark     - 北京市 => a▒▒ao▒a ▒
unac.c:13708: unac_data21[6] & unac_positions[21][7]: 0x0386 => 0x0391
unac.c:13708: unac_data22[12] & unac_positions[22][13]: 0x03ac => 0x03b1
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[8] & unac_positions[21][9]: 0x0388 => 0x0395
unac.c:13708: unac_data22[13] & unac_positions[22][14]: 0x03ad => 0x03b5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[9] & unac_positions[21][10]: 0x0389 => 0x0397
unac.c:13708: unac_data22[14] & unac_positions[22][15]: 0x03ae => 0x03b7
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[10] & unac_positions[21][11]: 0x038a => 0x0399
unac.c:13708: unac_data22[15] & unac_positions[22][16]: 0x03af => 0x03b9
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[12] & unac_positions[21][13]: 0x038c => 0x039f
unac.c:13708: unac_data23[12] & unac_positions[23][13]: 0x03cc => 0x03bf
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[14] & unac_positions[21][15]: 0x038e => 0x03a5
unac.c:13708: unac_data23[13] & unac_positions[23][14]: 0x03cd => 0x03c5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[15] & unac_positions[21][16]: 0x038f => 0x03a9
unac.c:13708: unac_data23[14] & unac_positions[23][15]: 0x03ce => 0x03c9
Text::Unaccent           - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω

Strip NonspacingMark     - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => I▒I▒ I▒I▒ I▒I▒ I▒I  I▒I▒
I▒I▒ I▒I▒
unac.c:13708: unac_data0[5] & unac_positions[0][6]: 0x0645 => untouched
unac.c:13708: unac_data0[15] & unac_positions[0][16]: 0x064f => untouched
unac.c:13708: unac_data34[15] & unac_positions[34][16]: 0x062f => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[17] & unac_positions[34][18]: 0x0631 => untouched
unac.c:13708: unac_data0[16] & unac_positions[0][17]: 0x0650 => untouched
unac.c:13708: unac_data0[17] & unac_positions[0][18]: 0x0651 => untouched
unac.c:13708: unac_data34[19] & unac_positions[34][20]: 0x0633 => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[9] & unac_positions[34][10]: 0x0629 => untouched
Text::Unaccent           - مُدَرِّسَة => مُدَرِّسَة

Strip NonspacingMark     - مُدَرِّسَة => U▒U▒▒ U▒رU▒U▒▒3U▒ة
