[Koha-devel] Re: [Zebralist] utf-8, probable solution

Wed Feb 15 16:17:08 CET 2006

Paul,

this confirms our impressions at Index Data. Somehow, while PHP has 
managed to approach Unicode in a way that mostly 'just works' (probably 
by not doing more than necessary to it), Perl seems to have all kinds of 
internal logic which has the effect of making Unicode really, really 
complicated and unintuitive. We had two guys spending a week or so each 
trying to make heads or tails of the UTF-8 tutorial, and still we felt 
at the end like we were fudging around the problem rather than really 
solving it well.

I'm *not* fond of Perl's approach to Unicode.

--Sebastian

Paul POULAIN wrote:

> Thanks to Heikki Levanto, Tümer Garip & Mike Rylander: you pointed out 3 
> things that were useless alone, but very useful when combined.
>
> I think I have the solution to our problem. It's not a Zebra, 
> HTML::Template or MARC::Record problem, it's a Perl one!
>
> Let me explain:
> I followed my utf-8 string through my Perl code until it was printed, 
> and it was always utf-8 (\x9c...).
> But in Firefox, it was iso8859-1.
>
> Heikki told me that the first 256 characters are shared by Unicode and 
> iso8859-1. So I told myself: OK, Paul, add a "true utf-8 character" 
> to your string. I chose \x{263a} (the smiley, because I'm always 
> optimistic & that is what is used in perluniintro).
>
> Surprise ... now my é was a utf-8 é in Firefox!!!!
> Conclusion: Perl looked at my string before sending it and, as it 
> found it was not "true utf-8", did something to convert it to 
> iso8859-1.
>
> I also had a brand new message in my log :
> >            Wide character in print at ...
>
> Mike R.'s and Tümer G.'s suggestions made me investigate the perldoc 
> on Unicode, and here it is:
>
>>        A user of Perl does not normally need to know nor care how
>>        Perl happens to encode its internal strings, but it becomes
>>        relevant when outputting Unicode strings to a stream without
>>        a PerlIO layer -- one with the "default" encoding.  In such a
>>        case, the raw bytes used internally (the native character set
>>        or UTF-8, as appropriate for each string) will be used, and a
>>        "Wide character" warning will be issued if those strings
>>        contain a character beyond 0x00FF.
>>
>>        For example,
>>
>>              perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
>>
>>        produces a fairly useless mixture of native bytes and UTF-8,
>>        as well as a warning:
>>
>>             Wide character in print at ...
>>
>>        To output UTF-8, use the ":utf8" output layer.  Prepending
>>
>>              binmode(STDOUT, ":utf8");
>>
>>        to this sample program ensures that the output is completely
>>        UTF-8, and removes the program's warning.
>
> GOTCHA! I have added binmode(STDOUT, ":utf8"), and now, even without 
> the smiley, my éà... are correctly shown.
>
> I still have to investigate MySQL utf-8, but it seems that
> > set NAMES=utf8
> is useless.
>
> Thanks everybody for helping me. I'll continue this thread on 
> koha-devel only, as zebra & perl4lib are probably not interested.
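
The behaviour Paul describes can be reproduced in a few lines. What
follows is only a minimal sketch, not anything taken from Koha: it
assumes an in-memory filehandle behaves like STDOUT with no layer, and
printed_bytes is a hypothetical helper name made up for the
illustration. It shows the bytes Perl actually writes for a string,
with and without the ":utf8" layer.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Koha or any module): print $str to an
# in-memory handle, optionally with a PerlIO layer such as ':utf8', and
# return the bytes that were written as space-separated hex.
sub printed_bytes {
    my ($str, $layer) = @_;
    my $buf = '';
    open my $fh, '>', \$buf or die "cannot open in-memory handle: $!";
    binmode $fh, $layer if defined $layer;
    {
        no warnings 'utf8';    # silence "Wide character in print"
        print {$fh} $str;
    }
    close $fh or die "cannot close in-memory handle: $!";
    return join ' ', map { sprintf '%02x', ord($_) } split //, $buf;
}

# No layer, only characters below 0x100: written as a single byte.
print printed_bytes("\x{e9}"), "\n";             # e9

# No layer, but a wide character is present: the internal UTF-8 is written.
print printed_bytes("\x{e9}\x{263a}"), "\n";     # c3 a9 e2 98 ba

# With the ':utf8' layer the same e-acute is written as UTF-8 on its own.
print printed_bytes("\x{e9}", ':utf8'), "\n";    # c3 a9
```

The first two calls correspond to the é that showed up as iso8859-1 and
to the smiley experiment; the third corresponds to adding
binmode(STDOUT, ":utf8").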

-- 
Sebastian Hammer, Index Data
quinn at indexdata.com   www.indexdata.com
Ph: (603) 209-6853