[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent
bugzilla-daemon at bugs.koha-community.org
Tue Dec 8 01:59:52 CET 2015
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759
--- Comment #10 from David Cook <dcook at prosentient.com.au> ---
So I think I've tracked down the C code behind Text::Unaccent:
https://github.com/gitpan/Text-Unaccent/blob/master/unac.c
The only reference I see to "damma" is in the U+FE70...U+FEFF code point range
which appears to list isolated forms which is not what we're dealing with in
these examples.
While I haven't reviewed the code extensively, it looks like the tables used
for Text::Unaccent are lacking...
If you replace the following line in Galen's script:
use Text::Unaccent qw//;
with
use Text::Unaccent qw/unac_debug/;
unac_debug($Text::Unaccent::DEBUG_HIGH);
You'll get more details of how Text::Unaccent is working (or not working as it
were).
Here's the output I get for the Arabic:
unac.c:13708: unac_data0[5] & unac_positions[0][6]: 0x0645 => untouched
unac.c:13708: unac_data0[15] & unac_positions[0][16]: 0x064f => untouched
unac.c:13708: unac_data34[15] & unac_positions[34][16]: 0x062f => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[17] & unac_positions[34][18]: 0x0631 => untouched
unac.c:13708: unac_data0[16] & unac_positions[0][17]: 0x0650 => untouched
unac.c:13708: unac_data0[17] & unac_positions[0][18]: 0x0651 => untouched
unac.c:13708: unac_data34[19] & unac_positions[34][20]: 0x0633 => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[9] & unac_positions[34][10]: 0x0629 => untouched
Text::Unaccent - مُدَرِّسَة => Ù�Ù�دÙ�رÙ�Ù�سÙ�Ø©
Strip NonspacingMark - مُدَرِّسَة => مدرسة
Here's the output I get for the Greek:
unac.c:13708: unac_data21[6] & unac_positions[21][7]: 0x0386 => 0x0391
unac.c:13708: unac_data22[12] & unac_positions[22][13]: 0x03ac => 0x03b1
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[8] & unac_positions[21][9]: 0x0388 => 0x0395
unac.c:13708: unac_data22[13] & unac_positions[22][14]: 0x03ad => 0x03b5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[9] & unac_positions[21][10]: 0x0389 => 0x0397
unac.c:13708: unac_data22[14] & unac_positions[22][15]: 0x03ae => 0x03b7
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[10] & unac_positions[21][11]: 0x038a => 0x0399
unac.c:13708: unac_data22[15] & unac_positions[22][16]: 0x03af => 0x03b9
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[12] & unac_positions[21][13]: 0x038c => 0x039f
unac.c:13708: unac_data23[12] & unac_positions[23][13]: 0x03cc => 0x03bf
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[14] & unac_positions[21][15]: 0x038e => 0x03a5
unac.c:13708: unac_data23[13] & unac_positions[23][14]: 0x03cd => 0x03c5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[15] & unac_positions[21][16]: 0x038f => 0x03a9
unac.c:13708: unac_data23[14] & unac_positions[23][15]: 0x03ce => 0x03c9
Text::Unaccent - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Î�α Î�ε Î�η Î�ι Î�ο
Υ� Ω�
Strip NonspacingMark - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω
Interestingly we can see a reference to the Greek capital letter alpha with the
tonos diacritic:
* 0386 GREEK CAPITAL LETTER ALPHA WITH TONOS
* 0391 GREEK CAPITAL LETTER ALPHA
Indeed, in the output, we can see that 0x0386 was changed to 0x0391... although
admittedly I don't know exactly how. It looks like a binary operation that uses
a bitmask to produce a certain value... we don't need to know 100% how that
mechanism is working right now... just that it works as described above.
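For what it's worth, the debug lines do hint at the mechanism: in the lines
above, the second index is always (code point & 0x1F) + 1. For example,
0x0386 & 0x1F = 6 and the log shows unac_positions[21][7]. That pattern is
consistent with a lookup table split into 32-entry blocks, where the high bits
pick a block and the low five bits pick a slot inside it. Here is a minimal
Python sketch of that idea. The names (BLOCK_SHIFT, TOY_BLOCKS, toy_unaccent)
and the three-entry table are made up for illustration and are not taken from
unac.c:

```python
# Assumption: 32 code points per block, inferred from the debug output only.
BLOCK_SHIFT = 5
BLOCK_MASK = (1 << BLOCK_SHIFT) - 1  # 0x1F

# Hypothetical toy table: block number -> {slot in block: replacement}.
# The three mappings are the ones visible in the debug output.
TOY_BLOCKS = {
    0x0386 >> BLOCK_SHIFT: {0x0386 & BLOCK_MASK: 0x0391},
    0x03AC >> BLOCK_SHIFT: {0x03AC & BLOCK_MASK: 0x03B1},
    0x00E9 >> BLOCK_SHIFT: {0x00E9 & BLOCK_MASK: 0x0065},
}


def split_code_point(cp):
    """Return (block number, slot within block) for a code point."""
    return cp >> BLOCK_SHIFT, cp & BLOCK_MASK


def toy_unaccent(cp):
    """Look up a replacement; code points with no entry are left untouched."""
    block, slot = split_code_point(cp)
    return TOY_BLOCKS.get(block, {}).get(slot, cp)


if __name__ == "__main__":
    for cp in (0x0386, 0x00E9, 0x0645):
        print(hex(cp), "=>", hex(toy_unaccent(cp)))
```

In this toy version 0x0645 comes back unchanged simply because the table has no
entry for it, which is the same shape as the "untouched" lines in the log.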
--
So in the Arabic example... everything was "untouched" and yet the output is
garbled. That's certainly an encoding issue...
Indeed, look at the following:
dcook at koha:~/experiments> echo "مُدَرِّسَة" | iconv -f latin1 -t utf-8
��د�ر��س�ة
That is the same output as Text::Unaccent:
Text::Unaccent - مُدَرِّسَة => Ù�Ù�دÙ�رÙ�Ù�سÙ�Ø©
So somewhere along the line that UTF-8 string is getting double-encoded.
Check this out:
dcook at koha:~/experiments> echo "مُدَرِّسَة" | iconv -f latin1 -t utf-8 | iconv -f utf-8 -t latin1
مُدَرِّسَة
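The same round trip can be reproduced outside the shell. A minimal Python
sketch, assuming the garbling is UTF-8 bytes being read as Latin-1 and then
encoded to UTF-8 a second time (the code points are the ones from the debug
output above; the variable names are mine):

```python
# The Arabic test word, spelled out as the code points seen in the debug log.
original = "\u0645\u064f\u062f\u064e\u0631\u0650\u0651\u0633\u064e\u0629"

# Step 1: the correct UTF-8 encoding (2 bytes per character here).
utf8_bytes = original.encode("utf-8")

# Step 2: something treats those bytes as Latin-1 characters...
mojibake = utf8_bytes.decode("latin-1")

# Step 3: ...and encodes them as UTF-8 again. This is the double encoding.
double_encoded = mojibake.encode("utf-8")

# Undoing it mirrors "iconv -f utf-8 -t latin1": decode once, map each
# character back to a single byte, then decode the real UTF-8.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")

if __name__ == "__main__":
    print(len(utf8_bytes), len(double_encoded))
    print(repaired == original)
```

Every byte of the original UTF-8 is above 0x7F, so each one turns into two
bytes in the second pass and the string doubles in size.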
I think the double-encoding is down to us using "binmode STDOUT, ':utf8';"
(which tells Perl to output UTF-8 encoded bytes instead of Latin-1, or whatever
other single-byte encoding it normally uses) and "use utf8" (which tells Perl
that the source code is in UTF-8)...
Removing those gets us the following:
Text::Unaccent - été => ete
Strip NonspacingMark - été => A▒tA▒
Text::Unaccent - umlaüt => umlaut
Wide character in print at unaccent.pl line 47.
Strip NonspacingMark - umlaüt => umlaA1⁄4t
Text::Unaccent - עברית => עברית
Strip NonspacingMark - עברית => עב▒ י▒a
Text::Unaccent - חוֹלָם => חוֹלָם
Strip NonspacingMark - חוֹלָם => חוO1לO ם
Text::Unaccent - 北京市 => 北京市
Strip NonspacingMark - 北京市 => a▒▒ao▒a ▒
Text::Unaccent - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω
Strip NonspacingMark - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => I▒I▒ I▒I▒ I▒I▒ I▒I I▒I▒
I▒I▒ I▒I▒
Text::Unaccent - مُدَرِّسَة => مُدَرِّسَة
Strip NonspacingMark - مُدَرِّسَة => U▒U▒▒ U▒رU▒U▒▒3U▒ة
At a glance, Text::Unaccent looks like it works for French, German, and
Greek... but doesn't touch Hebrew, Chinese, or Arabic.
Here's that output again with the debugging:
unac.c:13708: unac_data3[10] & unac_positions[3][10]: 0x00e9 => 0x0065
unac.c:13708: unac_data0[20] & unac_positions[0][21]: 0x0074 => untouched
unac.c:13708: unac_data3[10] & unac_positions[3][10]: 0x00e9 => 0x0065
Text::Unaccent - été => ete
Strip NonspacingMark - été => A▒tA▒
unac.c:13708: unac_data0[21] & unac_positions[0][22]: 0x0075 => untouched
unac.c:13708: unac_data0[13] & unac_positions[0][14]: 0x006d => untouched
unac.c:13708: unac_data0[12] & unac_positions[0][13]: 0x006c => untouched
unac.c:13708: unac_data0[1] & unac_positions[0][2]: 0x0061 => untouched
unac.c:13708: unac_data3[29] & unac_positions[3][29]: 0x00fc => 0x0075
unac.c:13708: unac_data0[20] & unac_positions[0][21]: 0x0074 => untouched
Text::Unaccent - umlaüt => umlaut
Wide character in print at unaccent.pl line 47.
Strip NonspacingMark - umlaüt => umlaA1⁄4t
unac.c:13708: unac_data0[2] & unac_positions[0][3]: 0x05e2 => untouched
unac.c:13708: unac_data0[17] & unac_positions[0][18]: 0x05d1 => untouched
unac.c:13708: unac_data0[8] & unac_positions[0][9]: 0x05e8 => untouched
unac.c:13708: unac_data0[25] & unac_positions[0][26]: 0x05d9 => untouched
unac.c:13708: unac_data0[10] & unac_positions[0][11]: 0x05ea => untouched
Text::Unaccent - עברית => עברית
Strip NonspacingMark - עברית => עב▒ י▒a
unac.c:13708: unac_data0[23] & unac_positions[0][24]: 0x05d7 => untouched
unac.c:13708: unac_data0[21] & unac_positions[0][22]: 0x05d5 => untouched
unac.c:13708: unac_data0[25] & unac_positions[0][26]: 0x05b9 => untouched
unac.c:13708: unac_data0[28] & unac_positions[0][29]: 0x05dc => untouched
unac.c:13708: unac_data0[24] & unac_positions[0][25]: 0x05b8 => untouched
unac.c:13708: unac_data0[29] & unac_positions[0][30]: 0x05dd => untouched
Text::Unaccent - חוֹלָם => חוֹלָם
Strip NonspacingMark - חוֹלָם => חוO1לO ם
unac.c:13708: unac_data0[23] & unac_positions[0][24]: 0x5317 => untouched
unac.c:13708: unac_data0[12] & unac_positions[0][13]: 0x4eac => untouched
unac.c:13708: unac_data0[2] & unac_positions[0][3]: 0x5e02 => untouched
Text::Unaccent - 北京市 => 北京市
Strip NonspacingMark - 北京市 => a▒▒ao▒a ▒
unac.c:13708: unac_data21[6] & unac_positions[21][7]: 0x0386 => 0x0391
unac.c:13708: unac_data22[12] & unac_positions[22][13]: 0x03ac => 0x03b1
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[8] & unac_positions[21][9]: 0x0388 => 0x0395
unac.c:13708: unac_data22[13] & unac_positions[22][14]: 0x03ad => 0x03b5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[9] & unac_positions[21][10]: 0x0389 => 0x0397
unac.c:13708: unac_data22[14] & unac_positions[22][15]: 0x03ae => 0x03b7
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[10] & unac_positions[21][11]: 0x038a => 0x0399
unac.c:13708: unac_data22[15] & unac_positions[22][16]: 0x03af => 0x03b9
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[12] & unac_positions[21][13]: 0x038c => 0x039f
unac.c:13708: unac_data23[12] & unac_positions[23][13]: 0x03cc => 0x03bf
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[14] & unac_positions[21][15]: 0x038e => 0x03a5
unac.c:13708: unac_data23[13] & unac_positions[23][14]: 0x03cd => 0x03c5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[15] & unac_positions[21][16]: 0x038f => 0x03a9
unac.c:13708: unac_data23[14] & unac_positions[23][15]: 0x03ce => 0x03c9
Text::Unaccent - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω
Strip NonspacingMark - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => I▒I▒ I▒I▒ I▒I▒ I▒I I▒I▒
I▒I▒ I▒I▒
unac.c:13708: unac_data0[5] & unac_positions[0][6]: 0x0645 => untouched
unac.c:13708: unac_data0[15] & unac_positions[0][16]: 0x064f => untouched
unac.c:13708: unac_data34[15] & unac_positions[34][16]: 0x062f => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[17] & unac_positions[34][18]: 0x0631 => untouched
unac.c:13708: unac_data0[16] & unac_positions[0][17]: 0x0650 => untouched
unac.c:13708: unac_data0[17] & unac_positions[0][18]: 0x0651 => untouched
unac.c:13708: unac_data34[19] & unac_positions[34][20]: 0x0633 => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[9] & unac_positions[34][10]: 0x0629 => untouched
Text::Unaccent - مُدَرِّسَة => مُدَرِّسَة
Strip NonspacingMark - مُدَرِّسَة => U▒U▒▒ U▒رU▒U▒▒3U▒ة
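For comparison, the "Strip NonspacingMark" results can be approximated with
nothing but Unicode data. This is a Python sketch of the general technique
(decompose, then drop characters in general category Mn), not Galen's script,
which isn't reproduced in this comment; strip_nonspacing_marks is a name I made
up:

```python
import unicodedata


def strip_nonspacing_marks(text):
    """Decompose to NFD, drop nonspacing marks (Mn), recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed if unicodedata.category(ch) != "Mn"]
    return unicodedata.normalize("NFC", "".join(kept))


if __name__ == "__main__":
    # French sample from the output above.
    print(strip_nonspacing_marks("\u00e9t\u00e9"))  # ete
    # Arabic sample, written as the code points from the debug log.
    arabic = "\u0645\u064f\u062f\u064e\u0631\u0650\u0651\u0633\u064e\u0629"
    print(strip_nonspacing_marks(arabic))
```

Because this works on decoded characters rather than bytes, the input has to be
decoded before the call and encoded exactly once on output, otherwise the same
kind of garbling shown above comes back.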