[Koha-devel] UTF-8 problems : a summary and some solutions

Henri-Damien LAURENT laurenthdl at alinto.com
Tue Aug 22 10:14:42 CEST 2006


Hi,
I have looked hard for a solution to our UTF-8 problems.
It seems we CAN reach one.

What is the problem ?
------------------------------
On my right:
values from XML records, encoded in UTF-8.
Displaying them correctly is only a matter of making Perl aware that
it is handling UTF-8 values, that is:
- either use perl -CS (or, equivalently, perl -CIOE) instead of plain perl
on the first line of the script. That switches STDIN, STDOUT and STDERR
to UTF-8.
- or add binmode STDOUT, ":encoding(UTF-8)" or binmode STDOUT, ":utf8"
(Be warned that those two layers ARE NOT equivalent: UTF-8 is the strict
encoding, utf8 is Perl's more permissive one. See
http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8 I would suggest that
we always use utf8 rather than UTF-8.)

Simple, clear, efficient (remember though to use libxml2, as Joshua
stated in a previous mail
http://www.nntp.perl.org/group/perl.perl4lib/2369).
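
As a minimal sketch of the binmode option: the in-memory filehandle and the sample word below are only there for illustration (a real script would call binmode on STDOUT). A UTF-8 output layer turns Perl characters into UTF-8 bytes on the way out:

```perl
use strict;
use warnings;

# A character string: "\x{e8}" is the single character e-grave (U+00E8).
my $text = "Biblioth\x{e8}que";

# Write it through a :utf8 layer into an in-memory buffer, which stands
# in for STDOUT after binmode STDOUT, ":utf8".
my $buffer = '';
open(my $out, '>:utf8', \$buffer) or die "cannot open in-memory handle: $!";
print {$out} $text;
close($out);

# The e-grave takes two bytes once encoded, so the byte count is one
# higher than the character count.
printf "%d characters, %d bytes written\n", length($text), length($buffer);
```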

On my left:
values from MySQL, encoded in UTF-8 and stored as such in the MySQL database.
When issuing ("SET NAMES 'utf8'") on the database just after the
connection, everything goes... WELL !!!!

Bibliothèque  
¥ · £ · € · $ · ¢ · ₡ · ₢ · ₣ · ₤ · ₥ · ₦ · ₧ · ₨ · ₩ · ₪ · ₫ · ₭ · ₮ ·


So what is the problem ????

The problem is that the left side goes well PROVIDED that you DO NOT set
binmode STDOUT,":utf8" and Perl believes it is processing ISO-8859-1 data.
If you launch test_mix.pl on your system without perl -CS: XML data will
be badly displayed BUT MySQL data well displayed !!!

<datafield tag="200" ind1=" " ind2=" ">
<subfield code="a">(L') �ole des p�es</subfield>
<subfield code="b">R</subfield>
<subfield code="f">Bazin, Herv�/subfield>
</datafield>
<datafield tag="210" ind1=" " ind2=" ">
<subfield code="c">�itions du seuil</subfield>
</datafield>

If you launch test_mix.pl on your system with perl -CS: XML data will
be well displayed BUT MySQL data badly (encoded TWICE) !!!

Bibliothèque &nbsp;
¥ · £ · ⬠· $ · ¢ · ⡠· ⢠· ⣠· ⤠· ⥠· ⦠· ⧠·
⨠· ⩠· ⪠· ⫠· ⭠· ⮠·


OK. We've got a problem. Where does it come from?
It comes from DBD::mysql and DBI, which hand UTF-8 data to Perl
without telling it.
In Perl, a string holding UTF-8 data carries a flag that says "OK, I am
UTF-8". If this flag is not set and the output layer is UTF-8, Perl
treats each byte as an ISO-8859-1 character and will "magically" but
unfortunately re-encode the values.

See http://www.mail-archive.com/dbi-dev@perl.org/msg04319.html and
following...
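
The double encoding can be reproduced without a database. In this minimal sketch, $from_db is a hypothetical stand-in for a value fetched through DBD::mysql (UTF-8 bytes, flag not set), and the explicit encode calls imitate what a UTF-8 output layer does on print:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical stand-in for a fetched value: the UTF-8 bytes of
# e-grave (U+00E8), held in a scalar whose UTF8 flag is not set.
my $from_db = encode('UTF-8', "\x{e8}");        # 2 bytes: C3 A8

# Without the flag, each byte is taken as an ISO-8859-1 character and
# encoded a second time.
my $printed_raw = encode('UTF-8', $from_db);    # 4 bytes: C3 83 C2 A8

# Decoding first yields one character, which is then encoded only once.
my $printed_ok = encode('UTF-8', decode('UTF-8', $from_db));   # C3 A8

printf "raw: %s\n", join ' ', map { sprintf '%02X', ord } split //, $printed_raw;
printf "ok:  %s\n", join ' ', map { sprintf '%02X', ord } split //, $printed_ok;
```

The four-byte result is what shows up as two wrong characters in place of one accented letter.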

Solutions.
-------------
You can have a look at
http://www.simplicidade.org/notes/archives/2005/12/utf8_and_dbdmys.html
There are three solutions:
- find EVERY place in the code where we receive data from MySQL and set
the flag on

if (! utf8::is_utf8($message)) {
  utf8::decode($message);
}

Is there a volunteer ? ;)

- patch the DBD::mysql code so that it makes Perl UTF-8 aware, and use the
{"mysql_enable_utf8" => 1} option when connecting to the database.
There is a patch here (version 2.90008) http://lists.mysql.com/perl/3563
and here http://rt.cpan.org/Public/Bug/Display.html?id=17829 (version
3.000006). It seems that the DBD::mysql team will
incorporate this patch in the next DBD::mysql version.

- Use UTF8DBI.pm (http://dysphoria.net/2006/02/05/utf-8-a-go-go/) for
all database connections rather than DBD::mysql.
Advantage: if we integrate this package tightly into C4 and Koha, we
may improve UTF-8 compliance and stability.
Disadvantages: the module is not supported by anyone, is not official,
and would require quite HEAVY changes in our code.
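
To give an idea of what the first solution means in practice, here is a minimal sketch. decode_row is a hypothetical helper (it is not part of Koha or DBI), and the literal array stands in for a row returned by a fetch call:

```perl
use strict;
use warnings;

# Hypothetical helper: mark every defined field of a fetched row as
# UTF-8 text, in place, using the same test as the snippet above.
sub decode_row {
    my ($row) = @_;
    for my $value (@{$row}) {
        next unless defined $value;    # NULL columns stay undef
        # utf8::decode leaves the value unchanged if it is not valid UTF-8.
        utf8::decode($value) unless utf8::is_utf8($value);
    }
    return $row;
}

# Stand-in for a row as DBD::mysql would return it: UTF-8 bytes, no flag.
my $row = decode_row([ "Biblioth\xC3\xA8que", undef, "Bazin" ]);
printf "%d characters\n", length $row->[0];
```

A call like this would be needed after every fetch, which is why this solution needs a volunteer.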

You may have guessed which solution seems to me the most reasonable one.
The only problem being...
Are we finished with UTF-8 problems?

I HOPE so.
But we can reasonably expect that some hidden ones remain.
Why?
- http://www.mhonarc.org/archive/html/perl-unicode/2006-07/msg00004.html
and http://www.mhonarc.org/archive/html/perl-unicode/2006-07/msg00008.html
seem to point out some CGI input problems with UTF-8. (In fact, I
DID NOT test data input with binmode UTF-8 set or with the DBD::mysql hack.)

So I will go on testing.

If you need test files, you can get them at
http://hdlaurent.paulpoulain.com/test_mix.pl
http://hdlaurent.paulpoulain.com/test_mix_fixed.pl
http://hdlaurent.paulpoulain.com/testsrecord.xml

-- 
Henri-Damien LAURENT