[Koha-devel] M:F:USMARC & UTF8 problems

Tumer Garip tgarip at neu.edu.tr
Mon Jul 3 03:08:15 CEST 2006


Hi all,
It has been months since devel-week and we are still struggling with UTF-8
problems.
Anyone following the IRC may have noticed that, with ZOOM now working and
everything looking normal and UTF-8, we still had a display problem whenever
accented characters appeared in the search term.
This occurred when the new version of the MARC::Record package (version 2)
was being used. This package is regarded as a must for our MARC-8 to UTF-8
conversions and also, hopefully, for handling UNIMARC correctly.

Well, after all this time and effort, I can report that the handling of UTF-8
within this package is buggy.
For a reason I do not understand, lines 170-174 of USMARC.pm say:
        # if utf8 then we encode the string as utf8
        if ( $marc->encoding() eq 'UTF-8' ) {
            $tagdata = marc_to_utf8( $tagdata );
        }
This simply means that if you have a UTF-8 MARC record, it is re-decoded as
UTF-8. That call goes through Encode::decode and turns everything back into
bytes; in simple terms, it mangles any accented UTF-8 characters you may have
on your template.
I do not know of any use case for this conversion, and since everything in
KOHA_head is UTF-8, it definitely does more harm than good.
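To make the failure mode concrete, here is a small sketch in Python (not the original Perl, and not the MARC::Record code itself) of why decoding text that has already been decoded corrupts it, which is what happens when marc_to_utf8() is applied to a record that is already UTF-8:

```python
# A sketch of the double-decode problem: text that is already decoded
# must not be decoded again.
s = "café"                       # text that has already been decoded
raw = s.encode("utf-8")          # the actual bytes: b'caf\xc3\xa9'

ok = raw.decode("utf-8")         # decode exactly once: correct
# A second decode (simulated here by reinterpreting the UTF-8 bytes
# under a single-byte encoding) produces mojibake:
bad = raw.decode("latin-1")

print(ok)    # café
print(bad)   # cafÃ© -- the accented character is mangled
```

The exact mechanics differ in Perl, but the shape of the bug is the same: the decode step is applied to data that is not in the state the decoder expects.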

I have corrected the error in this package, and a copy has been sent to
Joshua to raise the question with the package maintainer.
In the meantime I have added another conversion capability,
MARC-8 <-> UTF-8, to this package, so one can call:

    MARC::File::USMARC::decode($marc, \&somefunction, $encoding, $normalize);

where $encoding can be "UTF-8" or "MARC-8", and the record will be converted
to that encoding as it is decoded, rather than the current method of
converting to XML and back to MARC in order to get this.
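The idea of transcoding during decode, rather than via an XML round-trip, can be sketched as follows. This is an illustrative Python sketch, not the MARC::Record API: transcode_field is a hypothetical name, and Latin-1 stands in for MARC-8, which has no standard Python codec.

```python
# Hypothetical sketch: convert field data to the requested target
# encoding in the same pass that decodes the record, instead of
# serializing the whole record to XML and parsing it back.
def transcode_field(raw: bytes, source: str, target: str) -> bytes:
    text = raw.decode(source)    # decode once, from the record's encoding
    return text.encode(target)   # re-encode directly to the target

utf8_field = "café".encode("utf-8")
print(transcode_field(utf8_field, "utf-8", "latin-1"))  # b'caf\xe9'
```

The point is that each field is decoded and re-encoded exactly once, in place, which avoids both the XML detour and any risk of a second decode.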

The fourth argument (1 or 0) concerns the more complicated issue of composed
(precomposed) versus decomposed character representations. I will not go into
the details here, but I suggest you pass 1 for this argument when converting
to UTF-8. This ensures you get composed characters, which are more browser-
and XML-friendly. (More information on this subject can be found in
Unicode::Normalize and in the links in that documentation.)
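The composed/decomposed distinction is standard Unicode normalization (NFC vs. NFD), and can be shown in a few lines of Python (again as a language-neutral illustration; the Perl equivalent lives in Unicode::Normalize):

```python
import unicodedata

# "é" in decomposed (NFD) form: a base letter plus a combining accent,
# i.e. two code points.
decomposed = "e\u0301"           # 'e' + COMBINING ACUTE ACCENT

# The composed (NFC) form is a single precomposed code point, which is
# what browsers and XML tooling handle most reliably.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))           # 2 code points
print(len(composed))             # 1 code point: U+00E9
print(composed == "\u00e9")      # True
```

Both forms render the same glyph, but string comparison, indexing, and some XML parsers treat them differently, which is why normalizing to the composed form on output is the safer default.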

This correction solved another problem as well: the issue of having to use
LibXML (and a specific version of it) seems to have gone away. Your SAX
filter can stay PurePerl and your XML still won't break (see the article by
Joshua: http://www.mail-archive.com/address@hidden/msg01006.html). However,
extensive testing is needed to confirm this. One important note, though:
whatever parser you use, you must have PurePerl defined as one of the parsers
in your ParserDetails.ini, otherwise it will not work.
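For illustration, an XML::SAX ParserDetails.ini that lists PurePerl alongside LibXML might look roughly like this (a sketch only -- the exact sections and feature lines depend on your installation, and the file is normally maintained by the modules' installers rather than by hand):

    [XML::SAX::PurePerl]
    http://xml.org/sax/features/namespaces = 1

    [XML::LibXML::SAX::Parser]
    http://xml.org/sax/features/namespaces = 1

The key point from above is simply that an [XML::SAX::PurePerl] section must be present.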

I hope this problem is solved once and for all, so that we can continue
developing more ZOOM functionality in KOHA.

Cheers,
Tumer Garip


-------------- next part --------------
A non-text attachment was scrubbed...
Name: USMARC.pm
Type: application/octet-stream
Size: 11546 bytes
Desc: not available
URL: </pipermail/koha-devel/attachments/20060703/c83f4eaa/attachment-0002.obj>

