[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

Sat Dec 5 18:27:14 CET 2015

http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #5 from Galen Charlton <gmcharlt at gmail.com> ---
I wrote a little test program to compare the options:

___BEGIN___
#!/usr/bin/perl

use Modern::Perl;
use Text::Unaccent qw//;
use Text::Unaccent::PurePerl qw//;
use utf8;
use Unicode::Normalize;

binmode STDOUT, ':utf8';
my @str = (
    'été',
    'umlaüt',
    'עברית',
    'חוֹלָם',
    '北京市',
    'Άά Έέ Ήή Ίί Όό Ύύ Ώώ',
    'مُدَرِّسَة'
);

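# Decompose to NFKD so that accents become separate combining characters,
# then drop every nonspacing mark (Unicode general category Mn).
sub unaccent {
    my $str = NFKD(shift);
    $str =~ s/\p{NonspacingMark}//g;
    return $str;
}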

foreach (@str) {
    if ($_ eq 'مُدَرِّسَة') {
        # special case to avoid locking my terminal session (!)
        print "Text::Unaccent           - $_ => *refusing to let Text::Unaccent do this*\n";
    } else {
        print "Text::Unaccent           - $_ => " . Text::Unaccent::unac_string('utf-8', $_) . "\n";
    }
    print "Text::Unaccent::PurePerl - $_ => " . Text::Unaccent::PurePerl::unac_string($_) . "\n";
    print "Strip NonspacingMark     - $_ => " . unaccent($_) . "\n";
}
___END___

Here's its output:

Text::Unaccent           - été => ete
Text::Unaccent::PurePerl - été => ete
Strip NonspacingMark     - été => ete
Text::Unaccent           - umlaüt => umlaut
Text::Unaccent::PurePerl - umlaüt => umlaut
Strip NonspacingMark     - umlaüt => umlaut
Text::Unaccent           - עברית => ×¢××¨××ª
Text::Unaccent::PurePerl - עברית => עברית
Strip NonspacingMark     - עברית => עברית
Text::Unaccent           - חוֹלָם => ××Ö¹×Ö¸×
Text::Unaccent::PurePerl - חוֹלָם => חוֹלָם
Strip NonspacingMark     - חוֹלָם => חולם
Text::Unaccent           - 北京市 => åäº¬å¸
Text::Unaccent::PurePerl - 北京市 => 北京市
Strip NonspacingMark     - 北京市 => 北京市
Text::Unaccent           - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => ÎÎ± ÎÎµ ÎÎ· ÎÎ¹ ÎÎ¿ Î¥Ï
 Î©Ï
Text::Unaccent::PurePerl - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω
Strip NonspacingMark     - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω
Text::Unaccent           - مُدَرِّسَة => *refusing to let Text::Unaccent do this*
Text::Unaccent::PurePerl - مُدَرِّسَة => مُدَرِّسَة
Strip NonspacingMark     - مُدَرِّسَة => مدرسة

Some conclusions:

[1] Text::Unaccent mangles non-Latin characters outright; that's enough reason
to get rid of it.
[2] Both Text::Unaccent::PurePerl and stripping NonspacingMark characters are
better -- they strip accents from Latin scripts, and don't mangle non-Latin.
Removing NonspacingMark characters is more aggressive: as the output above
shows, it also removes the Hebrew and Arabic vowel marks that
Text::Unaccent::PurePerl leaves in place. I think we need input from Arabic,
Hebrew, and Greek users as to whether that is acceptable -- or, alternatively,
whether we need a system preference, or need to bite the bullet and package
Text::Unaccent::PurePerl.
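
To make the "more aggressive" point concrete, here is a minimal sketch that
lists exactly which characters the NFKD-and-strip approach would remove from a
string. The removed_marks() helper is hypothetical, written only for
illustration; it is not part of Koha or of the test program above. It relies
only on Unicode::Normalize and the core charnames module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames ();                    # loads charnames::viacode
use Unicode::Normalize qw(NFKD);

binmode STDOUT, ':utf8';

# Hypothetical helper (an assumption for illustration only): return one
# "U+XXXX NAME" entry for each nonspacing mark that stripping
# \p{NonspacingMark} after NFKD would remove from the given string.
sub removed_marks {
    my $decomposed = NFKD(shift);
    my @marks;
    while ($decomposed =~ /(\p{NonspacingMark})/g) {
        push @marks,
            sprintf('U+%04X %s', ord($1), charnames::viacode(ord($1)));
    }
    return @marks;
}

# The two sample words are written with escapes so the script does not
# depend on the encoding of its own source file.
my @samples = (
    "\x{E9}t\x{E9}",                                # French "ete" with acute accents
    "\x{5D7}\x{5D5}\x{5B9}\x{5DC}\x{5B8}\x{5DD}",   # Hebrew word with vowel points
);

foreach my $word (@samples) {
    print "$word:\n";
    print "    $_\n" foreach removed_marks($word);
}
```

For the French sample this reports COMBINING ACUTE ACCENT twice; for the Hebrew
sample it reports HEBREW POINT HOLAM and HEBREW POINT QAMATS, which are the
vowel points that native speakers would need to tell us are safe to drop.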
