[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent
bugzilla-daemon at bugs.koha-community.org
Tue Dec 8 03:50:44 CET 2015
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759
--- Comment #11 from David Cook <dcook at prosentient.com.au> ---
I've been analyzing what "use utf8" does and it's... interesting.
#use utf8;
#binmode STDOUT, ':utf8';
say "Hex = ".unpack("H*",$_);
Hex = d985d98fd8afd98ed8b1d990d991d8b3d98ed8a9
Text::Unaccent - مُدَرِّسَة => مُدَرِّسَة
echo "مُدَرِّسَة" | xxd -p
d985d98fd8afd98ed8b1d990d991d8b3d98ed8a90a
[That last 0a byte is just an LF character (i.e. \n)]
use utf8;
#binmode STDOUT, ':utf8';
say "Hex = ".unpack("H*",$_);
Hex = 454f2f4e315051334e29
Text::Unaccent - مُدَرِّسَة => Ù�Ù�دÙ�رÙ�Ù�سÙ�Ø©
#use utf8;
binmode STDOUT, ':utf8';
say "Hex = ".unpack("H*",$_);
Hex = d985d98fd8afd98ed8b1d990d991d8b3d98ed8a9
Text::Unaccent - ��د�ر��س�ة => ��د�ر��س�ة
use utf8;
binmode STDOUT, ':utf8';
say "Hex = ".unpack("H*",$_);
Hex = 454f2f4e315051334e29
Text::Unaccent - مُدَرِّسَة => Ù�Ù�دÙ�رÙ�Ù�سÙ�Ø©
--
I have no idea what 454f2f4e315051334e29 is... it's not UTF-8 or Latin1. In
fact, if you try to read it as either... you'll just read it as EO/N1PQ3N).
Ahh, I was missing this warning: Character in 'H' format wrapped in
unpack at unaccent.pl line 46.
Here's some more info using Devel::Peek::Dump():
PV = 0x1ba6b20
"\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251"\0
[UTF8 "\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629}"]
Indeed, if we look back at our UTF-8 table:
http://www.utf8-chartable.de/unicode-utf8-table.pl?start=1536
0645 is the code point for ARABIC LETTER MEEM which would be encoded as d9 85.
454f2f4e315051334e29 is clearly a butchering of the internal string of Unicode
code points
"\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629}" where
only the low byte of each code point is being shown.
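A minimal sketch of that wrap-around, assuming the string holds the ten code
points from the Devel::Peek dump. Masking with 0xFF is a stand-in for what
unpack("H*") appears to do with wide characters, and $chars, $low_hex and
$utf8_hex are illustrative names, not taken from unaccent.pl:

```perl
use strict;
use warnings;
use Encode qw(encode);

# The ten code points reported by Devel::Peek (a character string, not bytes).
my $chars = "\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629}";

# Keep only the low 8 bits of each code point, as the "wrapped" warning suggests.
my $low_hex = join '', map { sprintf('%02x', ord($_) & 0xFF) } split //, $chars;

# Encode to UTF-8 bytes first, then hex-dump those bytes.
my $utf8_hex = unpack('H*', encode('UTF-8', $chars));

print "low bytes:   $low_hex\n";
print "UTF-8 bytes: $utf8_hex\n";
```

The first line should come out as 454f2f4e315051334e29 and the second as
d985d98fd8afd98ed8b1d990d991d8b3d98ed8a9, matching the two hex dumps above.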
--
Ahh... I think I might have figured it out.
When you use "use utf8":
$_ = PV = 0xf25f60
"\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251"\0
[UTF8 "\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629}"]
Text::Unaccent::unac_string('UTF-8', $_) = PV = 0x2a0a0c0
"\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251"\0
If you print out the content of "Text::Unaccent::unac_string('UTF-8', $_)" on
its own, you'll get مُدَرِّسَة.
However, if you mix $_ and $unaccented in a single concatenated string, you're
going to wind up with a correct $_ but a double-encoded $unaccented.
If you look at the concatenated string, you'll get a PV of:
PV = 0x29028c0 "Text::Unaccent -
\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251
-
\303\231\302\205\303\231\302\217\303\230\302\257\303\231\302\216\303\230\302\261\303\231\302\220\303\231\302\221\303\230\302\263\303\231\302\216\303\230\302\251
\n"\0 [UTF8 "Text::Unaccent -
\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629} -
\x{d9}\x{85}\x{d9}\x{8f}\x{d8}\x{af}\x{d9}\x{8e}\x{d8}\x{b1}\x{d9}\x{90}\x{d9}\x{91}\x{d8}\x{b3}\x{d9}\x{8e}\x{d8}\x{a9}
\n"]
So in that UTF8 section you have $_ represented by Unicode code points, while
each UTF-8 encoded byte of $unaccented has been turned into a code point of its
own (\x{d9}, \x{85}, and so on).
If you wanted to concatenate them both in the string, you'd first have to run
"$unaccented = decode('UTF-8', $unaccented)". Then your concatenated string
would internally look like:
PV = 0x27812a0 "Text::Unaccent -
\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251
-
\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251
\n"\0 [UTF8 "Text::Unaccent -
\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629} -
\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629} \n"]
And that would be correct:
Text::Unaccent - مُدَرِّسَة - مُدَرِّسَة
Strip NonspacingMark - مُدَرِّسَة => مدرسة
I mean... the output still doesn't do us much good, but that explains the
mangling.
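The mangling can be reproduced without Text::Unaccent at all. This is only a
sketch: $bytes is an assumed stand-in for what unac_string hands back (UTF-8
bytes with no UTF8 flag), and $mixed and $fixed are illustrative names:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Character string, like $_ under "use utf8" (UTF8 flag on).
my $chars = "\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629}";

# Assumed stand-in for the unac_string return value: UTF-8 bytes, flag off.
my $bytes = encode('UTF-8', $chars);

# Mixing the two: each byte of $bytes is treated as one Latin-1 character.
my $mixed = "$chars - $bytes";

# Decoding first turns the 20 bytes back into 10 characters.
my $fixed = "$chars - " . decode('UTF-8', $bytes);

printf "mixed: %d chars, fixed: %d chars\n", length($mixed), length($fixed);

# The first mangled character is U+00D9, which goes out as the two bytes c3 99.
printf "first mangled char as UTF-8: %s\n",
    unpack('H*', encode('UTF-8', substr($mixed, 13, 1)));
```

The c3 99 here is the same \303\231 pair that starts the double-encoded part of
the PV dump above.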
While we gave Text::Unaccent a Perl string with the UTF8 flag set, it took that
string through to some C code using an XS interface, did a few things (depending
on the scenario), and then passed back a Perl string of UTF-8 bytes without the
UTF8 flag set, which seems to confuse Perl.
If we do a utf8::upgrade($unaccented) earlier, it still creates a string with
incorrect code points, since upgrading treats each byte as a Latin-1 character
rather than decoding the UTF-8.
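A short sketch of the difference between upgrading and decoding, using just the
two bytes for ARABIC LETTER MEEM. The variable names are illustrative and
nothing here calls Text::Unaccent:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xd9\x85";      # UTF-8 bytes for U+0645, UTF8 flag off

my $upgraded = $bytes;
utf8::upgrade($upgraded);    # changes the internal storage, not the characters

my $decoded = decode('UTF-8', $bytes);   # interprets the bytes as UTF-8

printf "upgraded: %s\n",
    join ' ', map { sprintf('U+%04X', ord($_)) } split //, $upgraded;
printf "decoded:  %s\n",
    join ' ', map { sprintf('U+%04X', ord($_)) } split //, $decoded;
```

The upgraded copy still holds two characters, U+00D9 and U+0085, while the
decoded copy holds the single character U+0645.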