[Koha-devel] UTF-8 problems : a summary and some solutions

Wed Aug 23 10:35:18 CEST 2006

Following up on my previous message:
I am still working on UTF-8 compliance, and I am now wondering whether
using MARC::File::XML and MARC::Charset is a good solution.

While trying to add a new biblio using the solution I explained in the
previous message, I ran into two problems.
First: CGI is not UTF-8 "aware". If you give CGI UTF-8 text as input and
perl is told to treat input data as UTF-8, CGI does not set the utf-8
flag on the data it returns, so perl double-encodes it!
Solution: do not use UTF-8 on standard input, and it will be OK. Maybe
another patch would help, but I only found the one I mentioned in the
previous message. Or should we decode every entry so that it gets the
right flag?
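
To illustrate the double encoding described above, here is a minimal sketch that uses only the core Encode module. The byte string stands in for what CGI would hand back (an assumption for the example; no CGI object is involved), and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes as CGI would return them: the UTF-8 encoding of "mémé",
# with no utf-8 flag set on the scalar.
my $raw = "m\xc3\xa9m\xc3\xa9";

# Encoding a byte string that is already UTF-8 treats every byte as a
# Latin-1 character, so each accented letter gets encoded a second time.
my $double = encode('UTF-8', $raw);

# Decoding first gives a character string with the utf-8 flag set;
# encoding that string on output gives back the original bytes.
my $chars = decode('UTF-8', $raw);
my $clean = encode('UTF-8', $chars);

printf "raw bytes:      %d\n", length $raw;
printf "double-encoded: %d\n", length $double;
printf "characters:     %d\n", length $chars;
printf "round trip ok:  %s\n", ($clean eq $raw ? 'yes' : 'no');
```

The point of the sketch is only that decoding once at the input boundary avoids the second encoding; where that decoding should live in Koha is still an open question.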

Second: when I submit real UTF-8 data, namely:
title : mémé est la plus forte.....  ∮ E⋅da = Q,  n → ∞, ∑ f(i) = ∏
g(i), ∀x∈ℝ: ⌈x⌉ = −⌊−x⌋, α ∧ ¬β = ¬(¬α ∨ β),
publisher : test(i), ∀x∈ℝ: ⌈x
and publication year : (i), ∀x∈ℝ: ⌈x

to addbiblio.pl, MARChtml2xml works and I get :
 xml : <?xml version="1.0" encoding="UTF-8"?>
 <collection
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.loc.gov/MARC21/slim
http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"
  xmlns="http://www.loc.gov/MARC21/slim">
    <datafield tag="101" ind1="" ind2="">
      <subfield code="a">spa</subfield>
    </datafield>
    <datafield tag="200" ind1="" ind2="">
      <subfield code="a">m\xc3\xa9m\xc3\xa9 est la plus forte..... 
\xe2\x88\xae E\xe2\x8b\x85da = Q,  n \xe2\x86\x92 \xe2\x88\x9e,
\xe2\x88\x91 f(i) = \xe2\x88\x8f g(i),
\xe2\x88\x80x\xe2\x88\x88\xe2\x84\x9d: \xe2\x8c\x88x\xe2\x8c\x89 =
\xe2\x88\x92\xe2\x8c\x8a\xe2\x88\x92x\xe2\x8c\x8b, \xce\xb1 \xe2\x88\xa7
\xc2\xac\xce\xb2 = \xc2\xac(\xc2\xac\xce\xb1 \xe2\x88\xa8
\xce\xb2),</subfield>
    </datafield>
    <datafield tag="210" ind1="" ind2="">
     <subfield code="c">test(i), \xe2\x88\x80x\xe2\x88\x88\xe2\x84\x9d:
\xe2\x8c\x88x</subfield>
     <subfield code="d">(i), \xe2\x88\x80x\xe2\x88\x88\xe2\x84\x9d:
\xe2\x8c\x88x</subfield>
    </datafield>
</collection>
But adding the new biblio fails with these errors:

[Tue Aug 22 12:03:35 2006] no
mapping found at position 37 in m\xc3\xa9m\xc3\xa9 est la plus
forte..... \xe2\x88\xae E\xe2\x8b\x85da = Q, n \xe2\x86\x92
\xe2\x88\x9e, \xe2\x88\x91 f(i) = \xe2\x88\x8f g(i),
\xe2\x88\x80x\xe2\x88\x88\xe2\x84\x9d: \xe2\x8c\x88x\xe2\x8c\x89 =
\xe2\x88\x92\xe2\x8c\x8a\xe2\x88\x92x\xe2\x8c\x8b, \xce\xb1 \xe2\x88\xa7
\xc2\xac\xce\xb2 = \xc2\xac(\xc2\xac\xce\xb1 \xe2\x88\xa8 \xce\xb2),
g0=ASCII_DEFAULT g1=EXTENDED_LATIN at
/usr/lib/perl5/site_perl/5.8.7/MARC/Charset.pm line 188.

[Tue Aug 22 12:03:35 2006] no
mapping found at position 11 in test(i),
\xe2\x88\x80x\xe2\x88\x88\xe2\x84\x9d: \xe2\x8c\x88x g0=ASCII_DEFAULT
g1=EXTENDED_LATIN at /usr/lib/perl5/site_perl/5.8.7/MARC/Charset.pm line
188.

[Tue Aug 22 12:03:35 2006] no
mapping found at position 7 in (i),
\xe2\x88\x80x\xe2\x88\x88\xe2\x84\x9d: \xe2\x8c\x88x g0=ASCII_DEFAULT
g1=EXTENDED_LATIN at /usr/lib/perl5/site_perl/5.8.7/MARC/Charset.pm line
188.

Then the add fails.
I think this comes from MARC::File::XML, which uses MARC::Charset to
read the data from the XML. MARC::Charset tries to decode the UTF-8 data
as if it were MARC-8 (g0=ASCII_DEFAULT, g1=EXTENDED_LATIN in the errors
above), which it is not, and so it spoils the data.
So my question is: do we really need to use MARC::File::XML as it is, or
can we hack it so that the data is taken as UTF-8?

Should we hack MARC::Charset?
Should we hack CGI, or create a new package that marks UTF-8 data as
UTF-8 and converts non-UTF-8 data to UTF-8?
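
As a rough idea of what such a package could do, here is a minimal sketch using only the core Encode module. The helper name to_characters and the %form hash are hypothetical (they are not part of Koha or CGI), and falling back to ISO-8859-1 for non-UTF-8 input is an assumption made for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns a character string for a raw form value.
# Valid UTF-8 is decoded as UTF-8; anything else is assumed to be
# ISO-8859-1 (an assumption made for this sketch).
sub to_characters {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return defined $chars ? $chars : decode('ISO-8859-1', $bytes);
}

# Stand-in for raw form values as they might arrive from the browser.
my %form = (
    title     => "m\xc3\xa9m\xc3\xa9",    # UTF-8 bytes
    publisher => "m\xe9m\xe9",            # Latin-1 bytes
);

# Decode every value once, at the input boundary.
my %decoded = map { $_ => to_characters($form{$_}) } keys %form;

printf "%s: %d characters\n", $_, length $decoded{$_} for sort keys %decoded;
```

Both values end up as the same four-character string, which is the behaviour I would want from such a package. Whether guessing the fallback encoding is acceptable for Koha is a separate question.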

-- 
Henri-Damien LAURENT