[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

bugzilla-daemon at bugs.koha-community.org
Tue Dec 8 04:49:49 CET 2015


http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #12 from David Cook <dcook at prosentient.com.au> ---
More interesting things... 

You can still have a Perl string with a UTF8 flag set, even when you're not
using "use utf8"...

My example:

my $arabic = "\x{0645}";
PV = 0x190e950 "\331\205"\0 [UTF8 "\x{645}"]
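
The PV lines in this comment look like Devel::Peek output. Assuming that is how they were produced, a minimal sketch for reproducing the dump above might be:

```perl
use strict;
use warnings;
use Devel::Peek;   # core module; Dump() writes its report to STDERR

# Note: no "use utf8" here, matching the example above.
my $arabic = "\x{0645}";

# Prints the SV internals, including the PV line and the UTF8 flag.
Dump($arabic);
```

The addresses in the PV line will differ from run to run; the octal bytes and the [UTF8 ...] part are what matter.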


Interestingly, if I don't use "use utf8", and use a UTF8 encoded character in
my source code, I get a string without a UTF8 flag:

my $arabic_text = "م";
PV = 0x1ea92b0 "\331\205"\0

I imagine use of the \x{} construct must do the equivalent of a
utf8::upgrade, since a code point above 0xFF (like U+0645) can't be stored as
a single byte...
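
To check the flag without reading a full dump, something like the following should work. This is only a sketch: "\xD9\x85" is my stand-in for the literal character, since without "use utf8" Perl sees the source as those two bytes.

```perl
use strict;
use warnings;

# Code point above 0xFF: Perl has to store this as a character string.
my $arabic = "\x{0645}";

# Assumed stand-in for the UTF-8 literal read without "use utf8":
# two separate bytes, 0xD9 and 0x85.
my $arabic_text = "\xD9\x85";

# utf8::is_utf8 is available even without "use utf8".
printf "arabic:      length=%d flag=%d\n",
    length($arabic), utf8::is_utf8($arabic) ? 1 : 0;
printf "arabic_text: length=%d flag=%d\n",
    length($arabic_text), utf8::is_utf8($arabic_text) ? 1 : 0;
```

The first string is one character with the flag on; the second is two bytes with the flag off.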

--

In any case, if I put $arabic and $arabic_text into the same string, I get the
following:

my $arabic_result = "Arabic = $arabic_text = $arabic";
say $arabic_result;

Arabic = Ù� = م 
PV = 0x29feda0 "Arabic = \303\231\302\205 = \331\205"\0 [UTF8 "Arabic = \x{d9}\x{85} = \x{645}"]
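
A rough sketch of what seems to be happening: when the unflagged bytes are joined with the flagged string, each byte is treated as a Latin-1 character and upgraded on its own. Again "\xD9\x85" is my assumed stand-in for the literal, and the variable names $mixed and $octets are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $arabic      = "\x{0645}";   # flagged character string
my $arabic_text = "\xD9\x85";   # unflagged bytes (assumed stand-in)

my $mixed = "Arabic = $arabic_text = $arabic";

# The two bytes became two separate characters, U+00D9 and U+0085.
my @middle = map { sprintf "U+%04X", ord($_) } split //, substr($mixed, 9, 2);
print "@middle\n";

# Encoding the result for output gives C3 99 C2 85 where D9 85 was wanted.
my $octets = encode("UTF-8", $mixed);
print join(" ", map { sprintf "%02X", ord($_) } split //, $octets), "\n";
```

That matches the \303\231\302\205 in the dump above, which is the double-encoded form.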


However, if I first run "$arabic_text = decode("UTF-8", $arabic_text);", which
according to http://perldoc.perl.org/Encode.html means $characters =
decode('UTF-8', $octets), then I get the following:

Arabic = م = م
PV = 0x15efe50 "Arabic = \331\205 = \331\205"\0 [UTF8 "Arabic = \x{645} = \x{645}"]
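
Put together, the decode approach might look like this. It is a sketch under the same assumption as before ("\xD9\x85" standing in for the literal); the binmode line is my addition so the characters are encoded once on the way out.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $arabic = "\x{0645}";

# Assumed stand-in for the UTF-8 literal read without "use utf8".
my $raw         = "\xD9\x85";
my $arabic_text = decode("UTF-8", $raw);   # octets -> characters

my $result = "Arabic = $arabic_text = $arabic";

# Encode characters to UTF-8 once, at the output boundary.
binmode(STDOUT, ":encoding(UTF-8)");
print "$result\n";
```

Both variables now hold the same single character, so nothing gets upgraded byte by byte.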

Alternatively, I could have done "$arabic = encode("UTF-8",$arabic);", which
would yield this result:

Arabic = م = م
PV = 0x832210 "Arabic = \331\205 = \331\205"\0
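
The encode route, sketched the same way. Here everything stays as bytes, so the result has no UTF8 flag and should be printed without an encoding layer on the filehandle.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Characters -> octets, so both pieces are plain byte strings.
my $arabic      = encode("UTF-8", "\x{0645}");
my $arabic_text = "\xD9\x85";   # assumed stand-in for the UTF-8 literal

my $result = "Arabic = $arabic_text = $arabic";

printf "length=%d flag=%d\n", length($result), utf8::is_utf8($result) ? 1 : 0;

# Already UTF-8 bytes: print as-is.
print "$result\n";
```

Which route is better probably depends on whether the rest of the code expects characters or bytes; mixing the two is what causes the garbled output.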

This explains the UTF8 flag a bit:
http://perldoc.perl.org/Encode.html#The-UTF8-flag

So yeah... that's cool... who knew that was a thing, eh?
