[Koha-devel] UTF-8 problems : a summary and some solutions

Joshua Ferraro jmf at liblime.com
Wed Aug 23 13:14:36 CEST 2006


Hi Henri-Damien,

On Wed, Aug 23, 2006 at 10:35:18AM +0200, Henri-Damien LAURENT wrote:
> Just following this entry.
> I am still working on utf-8 compliance...
> And I am now wondering if the use of MARC::File::XML and MARC::Charset
> is a good solution.
> 
> Indeed, while trying to add a new biblio using the solution I explained
> in the previous message, I ran into two problems.
> First: CGI is not UTF-8 "aware". That is to say, if you give CGI
> UTF-8 text as input, and Perl is told to use UTF-8 as input data, CGI
> will not set the UTF-8 flag on the data it provides, so Perl will
> double-encode it!
> Solution: don't use UTF-8 on standard input; then it works. Maybe
> another patch would help, but I only found the one I mentioned in the
> previous message. Or should we decode every entry to set the right
> flags?
> 
> Second: when trying to input true UTF-8 data, namely:
> title : mémé est la plus forte.....  ∮ E⋅da = Q,  n → ∞, ∑ f(i) = ∏
> g(i), ∀x∈ℝ: ⌈x⌉ = −⌊−x⌋, α ∧ ¬β = ¬(¬α ∨ β),
> publisher : test(i), ∀x∈ℝ: ⌈x
> and publication year : (i), ∀x∈ℝ: ⌈x
> 
> to addbiblio.pl, MARChtml2xml works and I get:
>  xml : <?xml version="1.0" encoding="UTF-8"?>
>  <collection
>   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>   xsi:schemaLocation="http://www.loc.gov/MARC21/slim
> http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"
>   xmlns="http://www.loc.gov/MARC21/slim">
>     <datafield tag="101" ind1="" ind2="">
>       <subfield code="a">spa</subfield>
>     </datafield>
>     <datafield tag="200" ind1="" ind2="">
>       <subfield code="a">m\xc3\xa9m\xc3\xa9 est la plus forte..... 
> \xe2\x88\xae E\xe2\x8b\x85da = Q,  n \xe2\x86\x92 \xe2\x88\x9e,
> \xe2\x88\x91 f(i) = \xe2\x88\x8f g(i),
> \xe2\x88\x80x\xe2\x88\x88\xe2\x84\x9d: \xe2\x8c\x88x\xe2\x8c\x89 =
> \xe2\x88\x92\xe2\x8c\x8a\xe2\x88\x92x\xe2\x8c\x8b, \xce\xb1 \xe2\x88\xa7
> \xc2\xac\xce\xb2 = \xc2\xac(\xc2\xac\xce\xb1 \xe2\x88\xa8
> \xce\xb2),</subfield>
>     </datafield>
>     <datafield tag="210" ind1="" ind2="">
>      <subfield code="c">test(i), \xe2\x88\x80x\xe2\x88\x88\xe2\x84\x9d:
> \xe2\x8c\x88x</subfield>
>      <subfield code="d">(i), \xe2\x88\x80x\xe2\x88\x88\xe2\x84\x9d:
> \xe2\x8c\x88x</subfield>
>     </datafield>
> </collection>
> But adding the new biblio fails.
> I get these errors:
> 
> no mapping found at position 37 in m\xc3\xa9m\xc3\xa9 est la plus
> forte..... \xe2\x88\xae E\xe2\x8b\x85da = Q, n \xe2\x86\x92
> \xe2\x88\x9e, \xe2\x88\x91 f(i) = \xe2\x88\x8f g(i),
> \xe2\x88\x80x\xe2\x88\x88\xe2\x84\x9d: \xe2\x8c\x88x\xe2\x8c\x89 =
> \xe2\x88\x92\xe2\x8c\x8a\xe2\x88\x92x\xe2\x8c\x8b, \xce\xb1 \xe2\x88\xa7
> \xc2\xac\xce\xb2 = \xc2\xac(\xc2\xac\xce\xb1 \xe2\x88\xa8 \xce\xb2),
> g0=ASCII_DEFAULT g1=EXTENDED_LATIN at
> /usr/lib/perl5/site_perl/5.8.7/MARC/Charset.pm line 188.
> 
> no mapping found at position 11 in test(i),
> \xe2\x88\x80x\xe2\x88\x88\xe2\x84\x9d: \xe2\x8c\x88x g0=ASCII_DEFAULT
> g1=EXTENDED_LATIN at /usr/lib/perl5/site_perl/5.8.7/MARC/Charset.pm line
> 188.
> 
> no mapping found at position 7 in (i),
> \xe2\x88\x80x\xe2\x88\x88\xe2\x84\x9d: \xe2\x8c\x88x g0=ASCII_DEFAULT
> g1=EXTENDED_LATIN at /usr/lib/perl5/site_perl/5.8.7/MARC/Charset.pm line
> 188.
> 
> 
> Then the add fails.
> So I think this comes from MARC::File::XML, which uses MARC::Charset
> to extract the data from the XML; MARC::Charset tries to decode the
> UTF-8 data as if it were ASCII/MARC-8, which it is not, and so
> corrupts the data.
> So my question is: do we really need to use MARC::File::XML as such,
> or can we hack it so that the data is treated as UTF-8?
> 
> Should we hack MARC::Charset?
> Should we hack CGI, or create a new package that marks UTF-8 data as
> UTF-8 and converts non-UTF-8 data to UTF-8?

The problems you're describing arise because your MARC data is not
valid. You may have UTF-8 encoded data in your MARC, but unless
Leader/09 is set to 'a', the MARC::* suite of tools has no way of
knowing it's UTF-8 and will interpret it as MARC-8 (properly so,
since those are the only two valid encodings in MARC21). MARC::*
has no knowledge of any other encoding, but the latest sourceforge
version has an optional UNIMARC flag you can pass that skips all
character set conversions (this still needs testing, as the author
didn't have direct access to UNIMARC data).
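The Leader/09 check itself is simple enough to sketch in plain Perl
(illustrative only; with MARC::Record you would read the string from
$record->leader rather than hard-coding it):

```perl
use strict;
use warnings;

# Leader/09 is the character coding scheme in MARC21:
# ' ' (blank) = MARC-8, 'a' = UCS/Unicode (UTF-8).
sub leader_says_utf8 {
    my ($leader) = @_;
    return substr($leader, 9, 1) eq 'a';
}

# Hard-coded 24-byte leaders for illustration; with MARC::Record
# you would pass $record->leader instead.
my $utf8_leader  = '00714cam a2200205 a 4500';
my $marc8_leader = '00714cam  2200205 a 4500';

print leader_says_utf8($utf8_leader)  ? "UTF-8\n" : "MARC-8\n";   # prints "UTF-8"
print leader_says_utf8($marc8_leader) ? "UTF-8\n" : "MARC-8\n";   # prints "MARC-8"
```

This is exactly the byte MARC::* consults before deciding whether to
run a MARC-8 to UTF-8 conversion on the record's fields.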

You cannot simply switch from MARC-8 to UTF-8 without doing some
heavy transformations ... for instance, in MARC-8, combining
characters are of the form <combining character> <base>, whereas in
UTF-8 they are of the form <base> <combining character>.
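The ordering difference can be seen with the core Unicode::Normalize
module: NFD decomposition yields the UTF-8 ordering, base first. This
sketches the ordering only; a real MARC-8 conversion also has to remap
each combining mark to its MARC-8 code point:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Decompose a precomposed e-acute into base + combining mark.
my $precomposed = "\x{00E9}";          # e-acute as a single code point
my $decomposed  = NFD($precomposed);   # "e" followed by U+0301

my ($base, $mark) = split //, $decomposed;
printf "UTF-8 order : base U+%04X then mark U+%04X\n", ord($base), ord($mark);

# MARC-8 stores the pair the other way around: mark first, then base.
my $marc8_order = $mark . $base;
printf "MARC-8 order: mark U+%04X then base U+%04X\n",
    ord(substr $marc8_order, 0, 1), ord(substr $marc8_order, 1, 1);
```

So a converter has to decompose, reorder each base/mark pair, and
remap code points; byte-for-byte copying is never enough.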

If you are converting a database that contains invalid MARC-8 data
(for instance, latin1 data), you'll want to use the
ignore_errors flag:

MARC::Charset->ignore_errors(1);

to avoid losing entire subfields because of a bad character.
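To see why one bad byte can cost a whole subfield, here is the same
failure mode sketched with the core Encode module rather than
MARC::Charset (an analogy, not MARC::Charset's actual code path): a
strict decode throws away the entire string at the first unmappable
byte, while a lenient one substitutes U+FFFD and keeps the rest,
which is the trade-off ignore_errors(1) buys you.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# "caf\xE9" is latin1; the lone \xE9 byte is not valid UTF-8.
my $strict_input  = "caf\xE9";
my $lenient_input = "caf\xE9";

# Strict decode: one bad byte and the entire string is lost.
my $strict = eval { decode('UTF-8', $strict_input, FB_CROAK) };
print defined $strict ? "strict: $strict\n" : "strict decode died\n";

# Lenient decode (the default): the bad byte becomes U+FFFD,
# but "caf" survives -- analogous to ignore_errors(1).
my $lenient = decode('UTF-8', $lenient_input);
print "lenient decode kept ", length($lenient), " characters\n";
```

The lenient result is lossy (you get a replacement character, not the
original latin1 letter), so it's a salvage tool for dirty databases,
not a substitute for fixing the records.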

I have confirmed that a UTF-8 Koha is already possible, without
changing CGI or DBI. So long as MySQL and Apache have been set
up properly, UTF-8 data passes unharmed between the DBI and
CGI levels. You can view an example of a pure UTF-8 Koha here:
http://wipoopac.liblime.com
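For reference, "set up properly" might look roughly like this. This is
a sketch only; the exact directive names depend on your Apache and
MySQL versions (MySQL 4.1 or later assumed):

```
# Apache httpd.conf: serve pages as UTF-8 by default
AddDefaultCharset UTF-8

# MySQL my.cnf: store and exchange UTF-8
[mysqld]
character-set-server = utf8

[client]
default-character-set = utf8
```

On the DBI side, issuing $dbh->do("SET NAMES 'utf8'") right after
connecting makes the client/server conversation match the table
encoding, with no changes to DBI itself.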

MARC::* is difficult to install properly, and it's also hard
to make sure you have valid MARC records, but once both are done
the process works very smoothly. If we are to claim
MARC21 compliance, MARC::* is a must (unless you want to write
a new suite for MARC handling in Perl, or use a non-Perl
solution).

Cheers,

-- 
Joshua Ferraro                       SUPPORT FOR OPEN-SOURCE SOFTWARE
President, Technology       migration, training, maintenance, support
LibLime                                Featuring Koha Open-Source ILS
jmf at liblime.com |Full Demos at http://liblime.com/koha |1(888)KohaILS




