[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

Tue Dec 8 04:59:25 CET 2015

http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #13 from David Cook <dcook at prosentient.com.au> ---
(In reply to Galen Charlton from comment #5)
> Some conclusions:
> 
> [1] Text::Unaccent mangles non-Latin characters outright; that's enough
> reason to get rid of it.

As I pointed out in my overly long comments, it doesn't appear that
Text::Unaccent is actually mangling non-Latin characters. 

Rather, in your example, it looks like Perl doesn't correctly handle the
concatenated string composed of one string with a UTF8 flag set and one string
without a UTF8 flag set. 

It looks like Perl tries to do a utf8::upgrade() on the string without the UTF8
flag set (ie the one returned from Text::Unaccent's C code), and instead of
reading it as an octet string and correctly translating into a UTF8 string of
corresponding Unicode code points, it reads each octet in as a code point,
which creates a completely different string for display purposes even though
the underlying octets are the same. 

When given the octets d9 and 85 (ie the UTF-8 encoding of the Arabic letter
Meem), Perl creates a "UTF8 string" with the two code points "\x{d9}\x{85}"
when it should create a "UTF8 string" with the single code point "\x{645}".
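
To make the difference concrete, here is a minimal sketch using only core
modules. The variable names are mine, and the byte string is a stand-in for
what Text::Unaccent's C code hands back; it is not taken from Koha itself.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two octets that encode U+0645 (ARABIC LETTER MEEM) in UTF-8.
my $octets = "\xd9\x85";

# utf8::upgrade() only changes the internal representation: each octet is
# treated as a Latin-1 code point, so the string still has two characters.
my $upgraded = $octets;
utf8::upgrade($upgraded);

# Encode::decode() interprets the octets as UTF-8 and yields one character.
my $decoded = decode('UTF-8', $octets);

printf "upgraded: length %d, code points %s\n",
    length($upgraded),
    join(' ', map { sprintf 'U+%04X', ord($_) } split //, $upgraded);
printf "decoded:  length %d, code points %s\n",
    length($decoded),
    join(' ', map { sprintf 'U+%04X', ord($_) } split //, $decoded);

# Prints:
# upgraded: length 2, code points U+00D9 U+0085
# decoded:  length 1, code points U+0645
```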

This only appears to be a problem when you put the Text::Unaccent string in the
same string as a Perl string with a UTF8 flag. If you were to print them in
two separate statements, they'd display correctly in the terminal. Or you
could use Encode::decode("UTF-8",$unaccented) to create a Perl string with a
UTF8 flag with the proper code point "\x{645}".
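
A rough sketch of the concatenation case, again with core modules only. Here
$unaccented is a hard-coded stand-in for Text::Unaccent's return value and
$label is a hypothetical flagged string, so treat this as an illustration
rather than a reproduction of the original test script.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for the value returned by Text::Unaccent: UTF-8 octets, no UTF8 flag.
my $unaccented = "\xd9\x85";

# A character string with the UTF8 flag set (hypothetical label text):
# RIGHTWARDS ARROW followed by a space.
my $label = "\x{2192} ";

# Concatenation upgrades the byte string, so the octets become the two
# characters U+00D9 and U+0085 instead of one Meem.
my $mixed = $label . $unaccented;

# Decoding first keeps the Meem as the single code point U+0645.
my $fixed = $label . decode('UTF-8', $unaccented);

binmode(STDOUT, ':encoding(UTF-8)');
print "mixed: $mixed\n";   # arrow, space, then two mangled characters
print "fixed: $fixed\n";   # arrow, space, then Meem
```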

> [2] Both Text::Unaccent::PurePerl and stripping NonspacingMark characters
> are better -- they strip accents from Latin scripts, and don't mangle
> non-Latin.  Removing NonspacingMark characters is more aggressive; I think
> we need input from Arabic, Hebrew, and Greek users as to whether that is
> acceptable -- or, alternatively, if we need a system preference, or need to
> bite the bullet and package Text::Unaccent::PurePerl.

I suspect that Text::Unaccent and Text::Unaccent::PurePerl are mostly the same,
but that Text::Unaccent::PurePerl doesn't lose the UTF8 flag on the input
string. We could avoid Text::Unaccent::PurePerl if we simply call
Encode::decode("UTF-8",$unaccented) on Text::Unaccent's output to translate
the internal byte string into an internal UTF8 string. While it might not be
required that we do that, doing so would probably prevent future buggy
behaviour from occurring.

That said, Text::Unaccent and Text::Unaccent::PurePerl don't necessarily look
good enough. They miss diacritics in Arabic at least, although I think we
definitely need input from Arabic, Hebrew, and CJK users regarding how
stripping NonspacingMark affects those strings. My guess is that it's fine to
strip the diacritics out of Arabic, but there are people much more qualified
than me to answer that question on the listserv. 
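
For anyone wanting to try the NonspacingMark approach on their own strings,
this is roughly what I understand it to mean: decompose, drop everything in
the Mn category, recompose. The helper name strip_nonspacing_marks is made up
for this example and the sample strings are mine; the actual patch may do it
differently.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper: decompose, drop nonspacing marks (\p{Mn}), recompose.
sub strip_nonspacing_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return NFC($decomposed);
}

my $latin  = "caf\x{e9}";        # "cafe" with an acute accent on the e
my $arabic = "\x{645}\x{64e}";   # Meem followed by Fatha (a vowel mark)

my $latin_stripped  = strip_nonspacing_marks($latin);
my $arabic_stripped = strip_nonspacing_marks($arabic);

binmode(STDOUT, ':encoding(UTF-8)');
print "$latin_stripped\n";    # the accent is gone, leaving "cafe"
print "$arabic_stripped\n";   # the Fatha is gone, leaving a bare Meem
```

Note that this removes the Arabic vowel mark as well as the Latin accent,
which is exactly the behaviour that needs feedback from native readers.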

Greek actually looks OK with Text::Unaccent if the encoding is handled. We can
see that a bit more clearly with the following lines:

use Text::Unaccent qw/unac_debug/;
unac_debug($Text::Unaccent::DEBUG_HIGH);
