[Koha-zebra] Re: Unicode, XML,Zebra,Windows

Thu Mar 23 00:00:52 CET 2006

Hi Adam,
Well, I am pleased you managed to reproduce this bug.
Here are a few things to consider on the way to 1.4.

The Zebra documentation says:
"Generally, the files are simple ASCII files, which can be maintained
using any text editor. "
And it also says:
"encoding encodingname
This directive specifies character encoding for external records. For
records such as XML that specifies encoding within the file via a header
this directive is ignored. If neither this directive is given, nor an
encoding is set within external records, ISO-8859-1 encoding is assumed.
"

In fact this is not the case. As you have seen, the XML file I sent you
had a header saying UTF-8 and also had UTF-8 characters in it. In the
Windows environment that was a genuine UTF-8 document. Zebra not only
failed to detect that but also ignored the header saying UTF-8.
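
For anyone who wants to confirm that a record really is what its header
claims, here is a minimal Python sketch. It is not part of Zebra or YAZ,
and the function names are made up for illustration. It reads the
encoding named in the XML declaration and checks whether the bytes
decode as UTF-8.

```python
import re

# Hypothetical helpers for illustration; not part of Zebra or YAZ.
_DECL = re.compile(rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z0-9._-]+)[\"']")


def declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None."""
    # Skip a UTF-8 byte order mark if the editor wrote one.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    match = _DECL.match(data)
    return match.group(1).decode("ascii").lower() if match else None


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A tiny stand-in record containing the bytes C5 9F.
sample = b'<?xml version="1.0" encoding="UTF-8"?><r>\xc5\x9f</r>'
print(declared_encoding(sample), is_valid_utf8(sample))
```

If both checks agree on UTF-8 for the file being indexed, the record
itself is consistent.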

We have a similar problem with .chr files as well. In the Windows
environment Notepad is the simplest text editor. Write a sort.chr file
with Notepad containing some UTF-8 characters, put "encoding utf-8" in
it, and zebraidx gives a syntax error. To overcome that I had to produce
a sort.chr file on a colleague's Unix machine and use that. When I look
at that file in Notepad it is unreadable, so it is very difficult to
maintain.
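
One possible cause worth ruling out (an assumption, not something
confirmed in this thread): Notepad puts a three-byte byte order mark
(EF BB BF) at the start of files it saves as UTF-8, and a parser that
expects the first line to begin with plain ASCII may stumble on it. A
small Python sketch to strip it follows; the function names are
hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_file(path: str) -> bool:
    """Rewrite the file without its BOM; return True if one was removed."""
    with open(path, "rb") as f:
        data = f.read()
    cleaned = strip_utf8_bom(data)
    if cleaned != data:
        with open(path, "wb") as f:
            f.write(cleaned)
        return True
    return False
```

Calling strip_bom_in_file("sort.chr") on a copy of the Notepad-written
file and running zebraidx again would show whether the mark is the
culprit.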

I think what is happening with this UTF-8 thing is that Windows and Unix
use different conventions for marking a file as Unicode. Since you build
a binary for Windows as well, I think Zebra should stop checking
characters the Unix way and rely on things like XML headers or the
encoding directives in configuration files (which, by the way, it says
it does, but as you have seen it does not).

I should also say that I am testing 1.4 and it is much more efficient in
terms of CPU and memory usage, but this problem remains.
Well, we can't have it all, or can we?

Regards,
Tumer

-----Original Message-----
From: Adam Dickmeiss [mailto:adam at indexdata.dk] 
Sent: Tuesday, March 21, 2006 11:55 PM
To: Tümer Garip
Cc: koha-zebra at nongnu.org
Subject: Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and
MARC::File::XML

Tümer Garip wrote:
> I thought I explained it but here it is again:
> 
> I do not think which method you use is relevant here, but just try
> this:
> 
> In the release version ZEBRA test/usmarc folder change the zebra.cfg 
> to read
> recordType: grs.xml
> in the tabs folder change marc21.abs to read record.abs
> Use zebraidx to create the database with the single XML record I sent
> to you.
> Start the zebrasrv at the required port.
> Use yaz-client
> f @attr 1=1016 book
> format xml
> show
> 
> I see the xml record header saying
> <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
> 
> Further down you'll see utf-8 characters of correct hex as \XC5\X9F
> 
> Now stop  the server.
> Add line encoding:utf-8 to your zebra.cfg
> Restart the server
> Do the same search and you get
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> 
> Conclusion:
> The database does keep the data in UTF-8 as expected.
> Server does not know about database character set or the xml record 
> that was parsed in, and unless specifically set to UTF-8 in zebra.cfg 
> the server goes ahead and changes the header, or in fact it produces
> a header itself saying iso-8859-1 while giving out utf-8 characters.

Correct. I was unable to reproduce this fault because my XML test 
record was able to be represented in UNICODE/UTF-8. Your sample is NOT.

Conversion from UTF-8 to ISO-8859-1 fails in Zebra. And in this case, 
Zebra keeps data as is, but unfortunately alters the header anyway. 
That's the mistake. Better behavior would probably be for Zebra to not 
return the data at all, but return a surrogate diagnostic for the
record.
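
To illustrate why such a conversion fails, here is a small sketch in
plain Python (not Zebra code; the helper name is made up). The bytes
C5 9F from the example are the UTF-8 form of U+015F, a character that
has no place in ISO-8859-1.

```python
raw = b"\xc5\x9f"            # the bytes seen in the yaz-client output
text = raw.decode("utf-8")   # a single character, U+015F


def fits_latin1(s: str) -> bool:
    """Return True if every character of s exists in ISO-8859-1."""
    try:
        s.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


print(hex(ord(text)), fits_latin1(text))  # prints: 0x15f False
```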

As you say, Zebra can be forced to use utf-8 in retrieval phase in the 
configuration. You can also specify utf-8 via the Z39.50 protocol .. 
(charset utf-8 in yaz-client).. and you should be able to achieve the 
same with ZOOM-Perl.

For Zebra 1.3 we kept Latin-1 as default character set because of a 
number of installations using that. For Zebra 1.4 default is UTF-8, so
there should not be a problem with that - in this case.

/ Adam

> 
> I did not ask for any help on this, thanks. Just clearing up some issues
> with Paul's problem. Tumer
> -----Original Message-----
> From: Adam Dickmeiss [mailto:adam at indexdata.dk] 
> Sent: Tuesday, March 21, 2006 9:58 PM
> To: Tümer Garip
> Cc: koha-zebra at nongnu.org
> Subject: Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and
> MARC::File::XML
> 
> 
> Tümer Garip wrote:
> 
>>Hi Adam,
>>You seem a bit offended; that was not my intention. Frustration
>>sometimes makes me use harsh words, and translating them into English 
>>may make them too harsh.
>>
>>I do not need to send you any config+examples because I tested this with
>>your default config files. I am attaching an xml record in utf-8
> 
> If you're to receive help from me you need to tell me which 
> zebra.cfg you're using. And show me the record + the way you indexed it 
> (zebraidx update ?)
> 
>>Briefly, I had default configuration files and built zebra with xml
>>records. When I noticed the problem I used yaz-client to see what was 
>>going on. In my log I could see that the data going into zebra had 
>>encoding utf-8, while yaz-client was returning xml with headers saying 
>>iso-8859-1, even though I could actually see the utf-8 characters as 
>>they show as hex in yaz-client.
> 
> I also need to know what you see, and what you'd expect to see.
> 
> / Adam
> 
> 
>>I have retried this procedure just now and it seems the same. Just by
>>adding encoding:UTF-8 to zebra.cfg and restarting the server you get 
>>correct heading and correct data. Please note that server has to be 
>>restarted but zebradb does not have to be rebuilt.
>>
>>Thanks
>>Tumer
>>
>>-----Original Message-----
>>From: Adam Dickmeiss [mailto:adam at indexdata.dk]
>>Sent: Tuesday, March 21, 2006 9:00 PM
>>To: Tümer Garip
>>Cc: paul.poulain at free.fr; koha-zebra at nongnu.org
>>Subject: Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and 
>>MARC::File::XML
>>
>>
>>Tümer Garip wrote:
>>
>>
>>>Hi,
>>>
>>>This problem, if I understood it correctly, has nothing to do with 
>>>mysql or perl; it has to do with ZEBRA, unless it is to do with UNIMARC,
>>>which I am not familiar with. As you know (Paul) I have a utf-8 
>>>version working.
>>>
>>>I had the same problem from records coming from zebra and found out 
>>>that it is not doing what it is supposed to do unless you explicitly 
>>>set it to utf-8. You have to explicitly put "encoding utf-8" in all 
>>>your zebra config files especially the zebra.cfg and your .abs . 
>>>Otherwise, although the documentation says that the zebra character 
>>>code is automatically set by the xml encoding, it is NOT.
>>
>>I can't reproduce this (bug). Care to share a config+example that 
>>illustrates this (Inserts an XML record from Perl in UTF-8) ?
>>
>>
>>
>>>Perl sends xml to zebra with encoding utf-8 in the header and utf-8 
>>>data in it. Zebra saves all the data in utf-8 but returns an xml 
>>>saying encoding iso8859-1 in the header and utf-8 characters in data.
>>>No module can correct this as it is stupid.
>>
>>Just need to know when the stupidity starts:-)
>>
>>/ Adam
>>
>>
>>
>>>I corrected the problem by adding encoding:UTF-8 in zebra.cfg, 
>>>record.abs, sort-string.chr
>>>
>>>Hope it solves yours,
>>>
>>>Tumer
>>>
>>>
>>>
>>>_______________________________________________
>>>Koha-zebra mailing list
>>>Koha-zebra at nongnu.org 
>>>http://lists.nongnu.org/mailman/listinfo/koha-zebra
>>>