[Koha-devel] Re: UTF8

Fri Apr 7 20:17:47 CEST 2006

(copied to koha-devel)
On Fri, Apr 07, 2006 at 07:38:02PM +0300, Tümer Garip wrote:
> Hi again,
> While waiting for Owen's directions for CVS, time is passing by and I
> think we have to sort a few things out. Regarding this UTF8 I think a
> little bit more work should be done. Unfortunately I do not have a Linux
> system to test things, but they should not be that much different than
> on the Windows platform. Is there anyone willing to try my scripts as I
> direct him/her, probably by email or IRC?
Of course, I'd be happy to do this. Let me know of a time to be on
IRC and we'll do it.

> Up until now the only reason KOHA managed both Unimarc and MARC21 is
> because of the existence of char_decode script in biblio.pm and unless
> somebody comes up with a ready perl module doing the Unimarc charset
> conversion to UTF8 everything will get delayed.
In fact, the spanking new CVS version of MARC::File::XML does this
nicely (you'll also need the new MARC::Charset and new MARC::Record, not
available on CPAN unfortunately). You can install it by first checking 
out a copy of the CVS: 

$ cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/marcpm co -P marc-record
$ cd marc-record
$ perl Makefile.PL
$ make
$ su -c 'make install'
$ cd ..

$ cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/marcpm co -P marc-charset
$ cd marc-charset
$ perl Makefile.PL
$ make
$ su -c 'make install'
$ cd ..

$ cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/marcpm co -P marc-xml
$ cd marc-xml
$ perl Makefile.PL
$ make
$ su -c 'make install'
$ cd ..

(or the equivalent on a Windows system)

Check the POD documentation for MARC::File::XML for how to handle UNIMARC
conversions to UTF-8 (or check a message to koha-zebra from Mike Rylander
with MARC::File::XML in the subject)

> This script was reading UNIMARC and converting it to iso8859 so why it
> cannot convert to UTF8 I do not understand. The only reason I can think
> of is that the script contains  s/\xc1\x61/à/gm; for unimarc. If
> somebody replaces this 'à' character with its UTF8 hexcode then it should
> work. Because biblio.pm itself is an iso8859 file this character is read as
> iso8859-1. What I did (but not sure whether it will work with Linux) is
> to save the biblio.pm as a UTF8 file. This way I did not have to write
> the hexcode for the characters so they are actually UTF8 themselves.
> 
> Can somebody try this with bulkmarcimport.pl and see what happens? It
> works for me. If it does work then I'll give detailed info about it because
> MARC8 coming from mysql is a little bit more tricky.
I think that, moving forward, we want to use MARC::Charset for conversions,
adding on to its functionality rather than using our internal char_decode
(which, in fact, I've removed from HEAD).
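
For anyone following the encoding point in the quoted message, here is a
minimal sketch in Python (not Koha's actual char_decode, and not how
MARC::Charset works internally) of why the encoding of the source file
matters. It assumes, per the quoted regex, that the UNIMARC bytes \xc1\x61
stand for "à". The helper name decode_pair and the sample record are
hypothetical. The same two input bytes are replaced either by the single
ISO-8859-1 byte for "à" or by its two-byte UTF-8 form, and only the latter
is valid UTF-8 output.

```python
# Hypothetical illustration only; names and sample data are assumptions.
# Per the regex quoted in this thread, UNIMARC data writes "à" as the
# two bytes 0xC1 0x61 (accent byte followed by the base letter "a").
UNIMARC_A_GRAVE = b"\xc1\x61"


def decode_pair(raw: bytes, replacement: bytes) -> bytes:
    """Replace the two-byte UNIMARC sequence with the given bytes."""
    return raw.replace(UNIMARC_A_GRAVE, replacement)


record = b"voil\xc1\x61"  # made-up sample field content

# What the substitution emits when the literal in the source file is
# stored as ISO-8859-1: a single byte, 0xE0.
latin1_out = decode_pair(record, "\u00e0".encode("latin-1"))

# What it emits when the literal is stored as UTF-8: two bytes, 0xC3 0xA0.
utf8_out = decode_pair(record, "\u00e0".encode("utf-8"))


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(latin1_out, is_valid_utf8(latin1_out))
    print(utf8_out, is_valid_utf8(utf8_out))
```

This is consistent with the observation that saving biblio.pm as UTF-8
changes the result: the replacement literal itself becomes UTF-8 bytes.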

What do others think?

Cheers,

-- 
Joshua Ferraro               VENDOR SERVICES FOR OPEN-SOURCE SOFTWARE
President, Technology       migration, training, maintenance, support
LibLime                                Featuring Koha Open-Source ILS
jmf at liblime.com |Full Demos at http://liblime.com/koha |1(888)KohaILS