[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent
bugzilla-daemon at bugs.koha-community.org
Sat Dec 5 18:27:14 CET 2015
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759
--- Comment #5 from Galen Charlton <gmcharlt at gmail.com> ---
I wrote a little test program to compare the options:
___BEGIN___
#!/usr/bin/perl
use Modern::Perl;
use Text::Unaccent qw//;
use Text::Unaccent::PurePerl qw//;
use utf8;
use Unicode::Normalize;
binmode STDOUT, ':utf8';
my @str = (
'été',
'umlaüt',
'עברית',
'חוֹלָם',
'北京市',
'Άά Έέ Ήή Ίί Όό Ύύ Ώώ',
'مُدَرِّسَة'
);
sub unaccent {
    my $str = NFKD(shift);
    $str =~ s/\p{NonspacingMark}//g;
    return $str;
}
foreach (@str) {
    if ($_ eq 'مُدَرِّسَة') {
        # special case to avoid locking my terminal session (!)
        print "Text::Unaccent - $_ => *refusing to let Text::Unaccent do this*\n";
    } else {
        print "Text::Unaccent - $_ => " .
            Text::Unaccent::unac_string('utf-8', $_) . "\n";
    }
    print "Text::Unaccent::PurePerl - $_ => " .
        Text::Unaccent::PurePerl::unac_string($_) . "\n";
    print "Strip NonspacingMark - $_ => " . unaccent($_) . "\n";
}
___END___
Here's its output:
Text::Unaccent - été => ete
Text::Unaccent::PurePerl - été => ete
Strip NonspacingMark - été => ete
Text::Unaccent - umlaüt => umlaut
Text::Unaccent::PurePerl - umlaüt => umlaut
Strip NonspacingMark - umlaüt => umlaut
Text::Unaccent - עברית => ×¢×ר×ת
Text::Unaccent::PurePerl - עברית => עברית
Strip NonspacingMark - עברית => עברית
Text::Unaccent - חוֹלָם => ××Ö¹×Ö¸×
Text::Unaccent::PurePerl - חוֹלָם => חוֹלָם
Strip NonspacingMark - חוֹלָם => חולם
Text::Unaccent - 北京市 => å京å¸
Text::Unaccent::PurePerl - 北京市 => 北京市
Strip NonspacingMark - 北京市 => 北京市
Text::Unaccent - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Îα Îε Îη Îι Îο Î¥Ï
ΩÏ
Text::Unaccent::PurePerl - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω
Strip NonspacingMark - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω
Text::Unaccent - مُدَرِّسَة => *refusing to let Text::Unaccent do this*
Text::Unaccent::PurePerl - مُدَرِّسَة => مُدَرِّسَة
Strip NonspacingMark - مُدَرِّسَة => مدرسة
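To see exactly which characters the "Strip NonspacingMark" rows lose, a small
sketch along these lines could list them by name. The helper name
removed_marks is hypothetical and not part of the test program above; it
assumes only the core charnames and Unicode::Normalize modules.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames ();
use Unicode::Normalize;

binmode STDOUT, ':utf8';

# Hypothetical helper: return the Unicode names of the nonspacing marks
# that the NFKD-and-strip approach would remove from a string.
sub removed_marks {
    my $str = NFKD(shift);
    return map { charnames::viacode(ord $_) } $str =~ /(\p{NonspacingMark})/g;
}

# Samples are written as escapes so the script does not rely on "use utf8":
# the French and pointed Hebrew strings from the test program.
for my $s ("\x{e9}t\x{e9}", "\x{5d7}\x{5d5}\x{5b9}\x{5dc}\x{5b8}\x{5dd}") {
    print "$s => ", join(', ', removed_marks($s)), "\n";
}
```

For the Latin sample the marks only exist after decomposition; for the Hebrew
sample they are already separate characters in the input, which is why
stripping them changes the text.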
Some conclusions:
[1] Text::Unaccent mangles non-Latin characters outright; that's enough reason
to get rid of it.
[2] Both Text::Unaccent::PurePerl and stripping NonspacingMark characters are
better -- they strip accents from Latin scripts, and don't mangle non-Latin.
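Removing NonspacingMark characters is more aggressive: as the output above
shows, it also strips Hebrew vowel points and Arabic diacritics, which
Text::Unaccent::PurePerl leaves alone. I think we need input from Arabic,
Hebrew, and Greek users as to whether that is acceptable -- or, alternatively,
whether we need a system preference, or need to bite the bullet and package
Text::Unaccent::PurePerl.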
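If stripping every nonspacing mark turns out to be too aggressive, one
possible middle ground is to remove marks only when they follow a Latin base
letter. This is only a sketch of that idea, not something tested against Koha:
the name unaccent_latin_only is hypothetical, and unlike either option compared
above it would leave Greek accents in place as well. Note also that NFKD
followed by NFC still applies compatibility normalization to the whole string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize;

binmode STDOUT, ':utf8';

# Hypothetical variant: strip nonspacing marks only after Latin base letters,
# then recompose whatever is left (e.g. Greek accents, Hebrew points).
sub unaccent_latin_only {
    my $str = NFKD(shift);
    # \p{Script=Latin} matches base letters only; combining marks have
    # Script=Inherited, so marks on non-Latin letters are not touched.
    $str =~ s/(\p{Script=Latin})\p{NonspacingMark}+/$1/g;
    return NFC($str);
}

# Samples as escapes: "umlaüt" and the pointed Hebrew string from the test.
for my $s ("umla\x{fc}t", "\x{5d7}\x{5d5}\x{5b9}\x{5dc}\x{5b8}\x{5dd}") {
    print "$s => ", unaccent_latin_only($s), "\n";
}
```

Whether Greek should be treated like Latin here is exactly the kind of question
the native-script users would need to answer.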