[Koha-zebra] Re: Unimarc, marc21, Unicode, and MARC::File::XML

Adam Dickmeiss adam at indexdata.dk
Tue Mar 21 22:54:37 CET 2006


Tümer Garip wrote:
> I thought I explained it but here it is again:
> 
> I do not think which method you use is relevant here, but just try
> this:
> 
> In the release version ZEBRA test/usmarc folder change the zebra.cfg to
> read
> recordType: grs.xml
> in the tabs folder change marc21.abs to read record.abs 
> Use zebraidx to create the database with the single XML record I sent to
> you.
> Start the zebrasrv at the required port.
> Use yaz-client
> f @attr 1=1016 book
> format xml
> show
> 
> I see the xml record header saying
> <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
> 
> Further down you'll see utf-8 characters of correct hex as
> \XC5\X9F
> 
> Now stop the server.
> Add the line encoding:utf-8 to your zebra.cfg
> Restart the server.
> Do the same search and you get
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> 
> Conclusion:
> The database does keep the data in UTF-8 as expected.
> The server does not know the character set of the database or of the
> XML record that was parsed in, and unless it is specifically set to
> UTF-8 in zebra.cfg the server goes ahead and changes the header, or in
> fact produces a header itself, saying iso-8859-1 while giving out utf-8
> characters.

Correct. I was unable to reproduce this fault because my XML test 
record could be represented in ISO-8859-1. Your sample can NOT.

Conversion from UTF-8 to ISO-8859-1 fails in Zebra. In this case 
Zebra keeps the data as is, but unfortunately alters the header anyway. 
That's the mistake. Better behavior would probably be for Zebra not to 
return the data at all, but to return a surrogate diagnostic for the record.
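
As an illustration of why the conversion fails for this record (a Python
sketch, not Zebra code; the helper name convert_record is made up for the
example): the bytes C5 9F seen in the yaz-client output are the UTF-8 form
of U+015F, which has no ISO-8859-1 code point. The sketch returns a
diagnostic instead of mislabeled data, in the spirit of the surrogate
diagnostic mentioned above.

```python
def convert_record(utf8_bytes, target="iso-8859-1"):
    """Hypothetical helper: convert a UTF-8 record to `target`.

    Returns (converted_bytes, None) on success, or (None, diagnostic)
    when a character cannot be represented in the target charset.
    """
    text = utf8_bytes.decode("utf-8")
    try:
        return text.encode(target), None
    except UnicodeEncodeError as exc:
        bad = text[exc.start]  # first character that failed to encode
        return None, "cannot represent U+%04X in %s" % (ord(bad), target)


# A record that fits in Latin-1 (here U+00E9) converts without trouble.
print(convert_record("caf\u00e9".encode("utf-8")))
# The bytes C5 9F (U+015F) from the sample record do not.
print(convert_record(b"\xc5\x9f"))
```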

As you say, Zebra can be forced to use utf-8 in the retrieval phase via 
the configuration. You can also specify utf-8 via the Z39.50 protocol 
(charset utf-8 in yaz-client), and you should be able to achieve the 
same with ZOOM-Perl.
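
On the client side, a mislabeled record of this kind can often be detected
heuristically. The Python sketch below is only an assumption about how one
might check it (looks_mislabeled is a hypothetical helper, not part of YAZ,
Zebra, or ZOOM-Perl): it flags a record whose XML declaration names
something other than UTF-8 while the body contains non-ASCII bytes that
form valid UTF-8.

```python
import re

# Matches the encoding named in an XML declaration (double quotes only).
DECL_RE = re.compile(rb'<\?xml[^>]*encoding="([^"]+)"')


def looks_mislabeled(raw):
    """Hypothetical check: True if `raw` declares a non-UTF-8 encoding
    but its non-ASCII bytes are valid UTF-8 (a heuristic, not a proof)."""
    m = DECL_RE.match(raw)
    if m is None:
        return False  # no declaration to contradict
    declared = m.group(1).decode("ascii", "replace").lower()
    if declared in ("utf-8", "utf8"):
        return False
    if all(b < 0x80 for b in raw):
        return False  # pure ASCII is valid in both encodings
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8, so the label may be right
    return True


header = b'<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>'
print(looks_mislabeled(header + b"<r>\xc5\x9f</r>"))  # UTF-8 bytes, Latin-1 label
print(looks_mislabeled(header + b"<r>\xe9</r>"))      # genuine Latin-1 byte
```

The check is deliberately conservative: genuine Latin-1 text containing
accented characters is almost never valid UTF-8, so false positives should
be rare, but this is not guaranteed.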

For Zebra 1.3 we kept Latin-1 as the default character set because a 
number of installations use it. For Zebra 1.4 the default is UTF-8, so 
there should not be a problem in this case.

/ Adam

> 
> I did not ask for any help on this, thanks. Just clearing up some
> issues with Paul's problem.
> Tumer
> -----Original Message-----
> From: Adam Dickmeiss [mailto:adam at indexdata.dk] 
> Sent: Tuesday, March 21, 2006 9:58 PM
> To: Tümer Garip
> Cc: koha-zebra at nongnu.org
> Subject: Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and
> MARC::File::XML
> 
> 
> Tümer Garip wrote:
> 
>>Hi Adam,
>>You seem a bit offended; that was not my intention. Frustration 
>>sometimes makes me use harsh words, and translating them into English 
>>may make them too harsh.
>>
>>I do not need to send you any config+examples because I tested this
>>with your default config files. I am attaching an xml record in utf-8
> 
> If you're to receive help from me you need to tell me which zebra.cfg
> you're using. And show me the record + the way you indexed it (zebraidx 
> update ?)
> 
>>Briefly, I had the default configuration files and built the Zebra 
>>database with XML records. When I noticed the problem I used 
>>yaz-client to see what was going on. In my log I could see that the 
>>data going into Zebra was encoded as utf-8, while yaz-client was 
>>returning XML with headers saying iso-8859-1, even though I could 
>>actually see the utf-8 characters, as they show as hex in yaz-client.
> 
> I also need to know what you see, and what you'd expect to see.
> 
> / Adam
> 
> 
>>I have retried this procedure just now and the result is the same. Just 
>>by adding encoding:UTF-8 to zebra.cfg and restarting the server you get 
>>the correct header and correct data. Please note that the server has to 
>>be restarted but the Zebra database does not have to be rebuilt.
>>
>>Thanks
>>Tumer
>>
>>-----Original Message-----
>>From: Adam Dickmeiss [mailto:adam at indexdata.dk]
>>Sent: Tuesday, March 21, 2006 9:00 PM
>>To: Tümer Garip
>>Cc: paul.poulain at free.fr; koha-zebra at nongnu.org
>>Subject: Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and
>>MARC::File::XML
>>
>>
>>Tümer Garip wrote:
>>
>>
>>>Hi,
>>>
>>>This problem, if I understood it correctly, has nothing to do with
>>>mysql or perl; it has to do with ZEBRA, unless it has to do with 
>>>UNIMARC, which I am not familiar with. As you know (Paul), I have a 
>>>utf-8 version working.
>>>
>>>I had the same problem with records coming from zebra and found out
>>>that it does not do what it is supposed to do unless you explicitly 
>>>set it to utf-8. You have to explicitly put "encoding utf-8" in all 
>>>your zebra config files, especially zebra.cfg and your .abs file. 
>>>Contrary to the documentation, which says that the zebra character 
>>>encoding is automatically set by the xml encoding, it is NOT.
>>
>>I can't reproduce this (bug). Care to share a config+example that
>>illustrates this (inserts an XML record from Perl in UTF-8)?
>>
>>
>>
>>>Perl sends xml to zebra with encoding utf-8 in the header and utf-8
>>>data in it. Zebra saves all the data in utf-8 but returns xml 
>>>saying encoding iso8859-1 in the header with utf-8 characters in the 
>>>data. No module can correct this as it is stupid.
>>
>>Just need to know when the stupidity starts:-)
>>
>>/ Adam
>>
>>
>>
>>>I corrected the problem by adding encoding:UTF-8 in zebra.cfg,
>>>record.abs, sort-string.chr
>>>
>>>Hope it solves yours,
>>>
>>>Tumer
>>>
>>>
>>>
>>>_______________________________________________
>>>Koha-zebra mailing list
>>>Koha-zebra at nongnu.org
>>>http://lists.nongnu.org/mailman/listinfo/koha-zebra
>>>





