[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

Tue Dec 8 03:50:44 CET 2015

http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #11 from David Cook <dcook at prosentient.com.au> ---
Analyzing what "use utf8" does and it's... interesting.

#use utf8;
#binmode STDOUT, ':utf8';
say "Hex = ".unpack("H*",$_);

Hex = d985d98fd8afd98ed8b1d990d991d8b3d98ed8a9
Text::Unaccent           - مُدَرِّسَة => مُدَرِّسَة

echo "مُدَرِّسَة" | xxd -p
d985d98fd8afd98ed8b1d990d991d8b3d98ed8a90a

[That last 0a byte is just an LF character (i.e. \n)]

use utf8;
#binmode STDOUT, ':utf8';
say "Hex = ".unpack("H*",$_);

Hex = 454f2f4e315051334e29
Text::Unaccent           - مُدَرِّسَة => Ù�Ù�Ø¯Ù�Ø±Ù�Ù�Ø³Ù�Ø©

#use utf8;
binmode STDOUT, ':utf8';
say "Hex = ".unpack("H*",$_);

Hex = d985d98fd8afd98ed8b1d990d991d8b3d98ed8a9
Text::Unaccent           - Ù�Ù�Ø¯Ù�Ø±Ù�Ù�Ø³Ù�Ø© => Ù�Ù�Ø¯Ù�Ø±Ù�Ù�Ø³Ù�Ø©

use utf8;
binmode STDOUT, ':utf8';
say "Hex = ".unpack("H*",$_);

Hex = 454f2f4e315051334e29
Text::Unaccent           - مُدَرِّسَة => Ù�Ù�Ø¯Ù�Ø±Ù�Ù�Ø³Ù�Ø©
--

I have no idea what 454f2f4e315051334e29 is... it's not UTF-8 or Latin1. In
fact, if you try to read it as either... you'll just get the ASCII text
EO/N1PQ3N).

Ahh, I was missing this error message: Character in 'H' format wrapped in
unpack at unaccent.pl line 46.

Here's some more info using Devel::Peek::Dump():
PV = 0x1ba6b20
"\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251"\0
[UTF8 "\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629}"]

Indeed, if we look back at our UTF-8 table:
http://www.utf8-chartable.de/unicode-utf8-table.pl?start=1536

0645 is the code point for ARABIC LETTER MEEM which would be encoded as d9 85.
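
To make the "d9 85" claim concrete, here is a minimal sketch of the two-byte
UTF-8 arithmetic for U+0645, cross-checked against Encode. The variable names
are made up for illustration and are not from unaccent.pl.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $cp = 0x645;                       # ARABIC LETTER MEEM

# Two-byte UTF-8 form: 110xxxxx 10xxxxxx (covers U+0080..U+07FF)
my $lead  = 0xC0 | ($cp >> 6);        # top 5 bits of the code point
my $trail = 0x80 | ($cp & 0x3F);      # low 6 bits of the code point
my $manual = sprintf("%02x%02x", $lead, $trail);

# Same answer from Encode, for comparison
my $encoded = unpack("H*", encode("UTF-8", chr($cp)));

print "manual  = $manual\n";          # d985
print "encoded = $encoded\n";         # d985
```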

454f2f4e315051334e29 is clearly a butchering of the internal string of Unicode
code points
"\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629}" where
only the low byte of each code point is shown (0x645 becomes 45, 0x64f becomes
4f, and so on).
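
As a rough sketch of that truncation (not the code from unaccent.pl), the
following masks each code point down to its low byte by hand, and compares it
with the hex you get if you encode to UTF-8 bytes before calling unpack.
$low_bytes and $utf8_hex are names I've invented here.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629}";

# What is left once each code point is wrapped to a single byte
my $low_bytes = join "", map { sprintf "%02x", ord($_) & 0xFF } split //, $chars;

# Encoding to UTF-8 bytes first gives the expected hex dump
my $utf8_hex = unpack("H*", encode("UTF-8", $chars));

print "low bytes = $low_bytes\n";   # 454f2f4e315051334e29
print "utf8 hex  = $utf8_hex\n";    # d985d98fd8afd98ed8b1d990d991d8b3d98ed8a9
```

So unpack("H*", ...) only gives a meaningful dump when it is handed bytes
rather than characters.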

--

Ahh... I think I might have figured it out.

When you use "use utf8":

$_ = PV = 0xf25f60
"\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251"\0
[UTF8 "\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629}"]
Text::Unaccent::unac_string('UTF-8', $_) = PV = 0x2a0a0c0
"\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251"\0

If you print out the content of "Text::Unaccent::unac_string('UTF-8', $_)" on
its own, you'll get مُدَرِّسَة.

However, if you mix $_ and $unaccented in a single concatenated string, you're
going to wind up with a correct $_ but a double-encoded $unaccented.

If you look at the concatenated string, you'll get a PV of:

PV = 0x29028c0 "Text::Unaccent -
\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251
-
\303\231\302\205\303\231\302\217\303\230\302\257\303\231\302\216\303\230\302\261\303\231\302\220\303\231\302\221\303\230\302\263\303\231\302\216\303\230\302\251
\n"\0 [UTF8 "Text::Unaccent -
\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629} -
\x{d9}\x{85}\x{d9}\x{8f}\x{d8}\x{af}\x{d9}\x{8e}\x{d8}\x{b1}\x{d9}\x{90}\x{d9}\x{91}\x{d8}\x{b3}\x{d9}\x{8e}\x{d8}\x{a9}
\n"]

So in that UTF8 section you have $_ represented by Unicode code points, while
the UTF-8 encoded bytes of $unaccented have been transformed into a string of
code points, one code point per byte (\x{d9}, \x{85}, ...).

If you wanted to concatenate them both in the string, you'd first have to run
"$unaccented = decode('UTF-8', $unaccented)". Then your concatenated string
would internally look like: 

PV = 0x27812a0 "Text::Unaccent -
\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251
-
\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251
\n"\0 [UTF8 "Text::Unaccent -
\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629} -
\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629} \n"]

And that would be correct:

Text::Unaccent - مُدَرِّسَة - مُدَرِّسَة
Strip NonspacingMark     - مُدَرِّسَة => مدرسة

I mean... the output still doesn't do us much good, but that explains the
mangling.
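
Here's a minimal sketch of the mangling and the decode() fix without
Text::Unaccent in the picture. I'm assuming encode('UTF-8', ...) is a fair
stand-in for what unac_string() hands back (UTF-8 bytes with no UTF8 flag);
the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $chars = "\x{645}\x{64f}\x{62f}";     # character string (UTF8 flag on)
my $bytes = encode("UTF-8", $chars);      # stand-in for unac_string() output: raw bytes

# Mixing the two: Perl upgrades $bytes as if each byte were a Latin-1 character
my $mixed = "$chars - $bytes";

# Decoding first keeps everything as real characters
my $fixed = "$chars - " . decode("UTF-8", $bytes);

printf "mixed: %d chars, hex %s\n", length($mixed), unpack("H*", encode("UTF-8", $mixed));
printf "fixed: %d chars, hex %s\n", length($fixed), unpack("H*", encode("UTF-8", $fixed));
```

The mixed string comes out at 12 characters and its hex contains c399c285
(the double-encoded d9 85), while the fixed one is 9 characters.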

While we gave Text::Unaccent a Perl string with the UTF8 flag set, it took that
string through to some C code using an XS interface, did a few things (depending
on the scenario), and then passed back a Perl string without the UTF8 flag set.
Perl then treats those bytes as Latin-1 characters when they are combined with a
flagged string.

If we do a utf8::upgrade($unaccented) earlier, it still creates a string with
incorrect code points, since upgrade only changes the internal representation
and keeps each byte as its own character...
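
A small sketch of the difference, using just the two bytes for MEEM (variable
names are mine): utf8::upgrade() leaves you with U+00D9 U+0085, whereas
utf8::decode() (or Encode's decode) actually interprets the bytes as UTF-8.

```perl
use strict;
use warnings;

my $bytes = "\xd9\x85";            # UTF-8 bytes for U+0645, UTF8 flag off

my $upgraded = $bytes;
utf8::upgrade($upgraded);          # flag on, but still two characters

my $decoded = $bytes;
utf8::decode($decoded);            # in-place decode: one character

my $up_cps  = join " ", map { sprintf "U+%04X", ord($_) } split //, $upgraded;
my $dec_cps = join " ", map { sprintf "U+%04X", ord($_) } split //, $decoded;

print "upgraded: $up_cps\n";       # U+00D9 U+0085
print "decoded:  $dec_cps\n";      # U+0645
```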
