[Koha-devel] UTF-8 problems : a summary and some solutions

Wed Aug 23 15:44:20 CEST 2006

Joshua Ferraro wrote:
> Hi Henri-Damien,
>
> The problems you're describing are because your MARC data is not
> valid. You may have UTF-8 encoded data in your MARC, but unless
> the Leader 09 is set to 'a', the MARC::* suite of tools has no
> way of knowing it's UTF-8 and will interpret it as MARC-8 (properly,
> since there are only two valid encodings in MARC21). MARC::*
> has no knowledge of any other encoding, but in the latest SourceForge
> version there is an optional UNIMARC flag you can pass to it that
> will avoid all character set conversions (but this needs to be tested,
> as the author didn't have direct access to UNIMARC).
>
> You cannot simply switch from MARC-8 to UTF-8 without doing some
> heavy transformations ... for instance, in MARC-8, combining
> characters are of the form <combining character> <base> whereas in
> UTF-8, they are of the form <base> <combining character>.
>
> If you are converting a database that has invalid MARC-8 data (for
> instance, if it has latin1 data) you'll want to use the 
> ignore_errors flag:
>
> MARC::Charset->ignore_errors(1);
>
> to avoid losing entire subfields because of a bad character.
>
> I have confirmed that a UTF-8 Koha is already possible, without
> changing CGI or DBI. So long as MySQL and Apache have been set
> up properly, UTF-8 data passes unharmed between the DBI and
> CGI levels. You can view an example of a pure UTF-8 Koha here:
> http://wipoopac.liblime.com
>
> MARC::* is difficult to install properly, and it's also hard
> to make sure you have valid MARC records, but once you do these
> steps, the process works very smoothly. If we are to claim
> MARC21 compliance, MARC::* is a must (unless you want to write
> a new suite for MARC handling in perl or use a non-perl
> solution).
>   
Thanks for your reply.
Maybe I should use a private message to report these thoughts.

I agree with claiming MARC compliance, as long as it covers ANY MARC
flavor, be it UNIMARC or another.

But I do not agree that a pure UTF-8 Koha is already possible;
unfortunately, it is not.
I reported a display error and showed how you can reproduce these errors
with a simple script.
Both DBI and CGI are buggy in their UTF-8 management, even though it is
true they do not harm UTF-8 data. But if Perl is to cope with UTF-8 data,
it has to be aware of it and encode things properly.
Maybe it works for you because you do not have UTF-8 data *both* in your
Zebra records *and* in your framework or other MySQL data.
But for us it IS a blocking problem and we HAVE to cope with it.
As long as we work only with MySQL or only with Zebra, there are no
problems, as I said.
But we do not.
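
To make the mixing problem concrete, here is a minimal sketch using only
the core Encode module. The variables $from_db and $from_zebra are
hypothetical stand-ins for a value handed back as raw UTF-8 bytes and a
value that has already been decoded to characters; no real DBI or Zebra
call is made, so this illustrates general Perl behaviour rather than the
exact Koha code path.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The same character, "é", arriving from two hypothetical sources.
my $from_db    = "\xC3\xA9";                    # raw UTF-8 bytes, not decoded
my $from_zebra = decode('UTF-8', "\xC3\xA9");   # one character, U+00E9

# Naive concatenation: Perl treats each undecoded byte as a Latin-1
# character, so the "database" half is encoded a second time on output.
my $naive = encode('UTF-8', $from_db . $from_zebra);

# Decoding at the boundary first keeps both halves consistent.
my $fixed = encode('UTF-8', decode('UTF-8', $from_db) . $from_zebra);

print "naive: ", unpack('H*', $naive), "\n";    # c383c2a9c3a9 (mojibake + é)
print "fixed: ", unpack('H*', $fixed), "\n";    # c3a9c3a9     (é + é)
```

Each source is harmless on its own; the corruption appears only once an
undecoded string and a decoded string end up in the same output.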

And we must be aware of that.
But you gave me at least two things to think about:
1) The LEADER HAS TO be properly handled (in MARChtml2xml in rel 3_0, it
is not even generated)
2) MARC::Charset->ignore_errors can be used, but it is not the best
solution since some data could be lost without notice.
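
The two points above can be sketched as follows, assuming the MARC21
leader semantics described in the quoted message (position 09 set to 'a'
for UTF-8). The helper names leader_says_utf8, mark_leader_utf8 and
decode_or_report are hypothetical, the sample leader is made up, and the
code uses plain string operations and the core Encode module rather than
the MARC::* API. The strict decode only checks UTF-8 validity; it does
not convert MARC-8.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# True if the 24-character leader has 'a' at position 09 (0-based).
sub leader_says_utf8 {
    my ($leader) = @_;
    return length($leader) == 24 && substr($leader, 9, 1) eq 'a';
}

# Return a copy of the leader with position 09 set to 'a'.
sub mark_leader_utf8 {
    my ($leader) = @_;
    substr($leader, 9, 1, 'a');
    return $leader;
}

# Decode strictly: return the decoded text, or undef plus a warning when
# the bytes are not valid UTF-8, so that nothing is dropped silently.
sub decode_or_report {
    my ($bytes, $label) = @_;
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    warn "$label: not valid UTF-8, needs review\n" unless defined $text;
    return $text;
}

my $leader = '00000nam  2200000 a 4500';    # made-up leader, position 09 blank
print leader_says_utf8($leader) ? "UTF-8\n" : "MARC-8\n";
$leader = mark_leader_utf8($leader);
print leader_says_utf8($leader) ? "UTF-8\n" : "MARC-8\n";

decode_or_report("caf\xE9", 'subfield a');  # latin1 byte: warns, returns undef
```

Reporting the bad subfield this way leaves a trace for every record that
needs attention, which is the part that ignoring errors does not give us.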

Many thanks anyway.
-- 
Henri-Damien LAURENT