[Koha-zebra] RE: [Koha-devel] Building zebradb

Sebastian Hammer quinn at indexdata.com
Mon Mar 13 00:22:29 CET 2006


Tümer Garip wrote:

>Hi again Sebastian,
>
>You are a gem. You are absolutely right about the <collection> wrapper.
>The MARC::File::XML module produces MARCXML with this wrapper. Having
>removed this wrapper, I can now get ISO2709 out of Zebra with no problem
>at all.
>  
>
Glad you got that sorted.

>I'll play with version 1.4 during this week.
>
>And here is something more for those using Windows platform.
>I used to report that sorting does not work. Well, it does work. The
>problem was that I have UTF-8 characters in my sort.chr table. Windows
>Notepad puts a hidden character at the beginning of a file if it
>contains UTF-8 characters. Zebra does not like that and gives a syntax
>error. So I used another editor to produce my sort.chr, and everything
>is now OK.
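The hidden character Notepad writes at the start of a UTF-8 file is the byte-order mark (BOM), bytes EF BB BF. A quick way to detect and strip it with standard Unix tools -- the file contents below are only a stand-in for a real sort.chr:

```shell
# Create a stand-in sort.chr that starts with a UTF-8 BOM
# (octal 357 273 277 = hex EF BB BF), the way Notepad writes it.
printf '\357\273\277lowercase a-z\n' > sort.chr

# If the first three bytes are a BOM, strip them in place.
if [ "$(head -c 3 sort.chr)" = "$(printf '\357\273\277')" ]; then
    tail -c +4 sort.chr > sort.chr.tmp && mv sort.chr.tmp sort.chr
fi
```

After stripping, Zebra parses the file normally; any editor that can save UTF-8 without a BOM avoids the problem entirely.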
>  
>
:-)

--Seb

>I have to get back and start using 1.4 now
>
>Thanks
>Tumer
>
>-----Original Message-----
>From: Sebastian Hammer [mailto:quinn at indexdata.com] 
>Sent: Sunday, March 12, 2006 5:03 PM
>To: Tümer Garip
>Cc: koha-zebra at nongnu.org; Adam Dickmeiss
>Subject: Re: [Koha-zebra] RE: [Koha-devel] Building zebradb
>
>
>Tümer Garip wrote:
>
>  
>
>>Hi,
>>
>>To clear some issues with Sebastian:
>>
>>>>Again, I'd be keen to know which Zebra version you're running?
>>
>>1- I am using Zebra version 1.3.34
>
>It might be a good idea to take 1.4 for a spin.. there have been some
>changes to the ISAM system. It should otherwise do everything the
>current server does, and more.
>
>
>>>>Why not ask for the records in ISO2709 from Zebra if that's what you
>>>>want to work with (when records are loaded in MARCXML, it can spit out
>>>>either 2709 or XML)? Or is it just that you want to have them in
>>>>MARC::Record? At any rate, ISO2709 is a more compact exchange format, so
>>>>it seems to make sense to use it.
>
>>2- Definitely correct and theoretically yes, but I keep saying that
>>when you feed zebra with MARCXML you CANNOT get back MARC records. All
>>you get back is the leader and lots of blank space. I tried it with
>>YAZ-client as well. I have reported this as a bug on the indexdata list
>>but never got an answer.
>
>I wouldn't be too surprised if this was caused by the <collection> 
>wrapper.. Zebra 1.4 can be configured to look for data at a certain 
>level of the DOM tree -- 1.3 assumed the root element is the root 
>element of the record.
>
>  
>
>>>>I get update times around .5-.7 seconds,
>>
>>3- .4 or .5 is what we get as well. Stupidly enough, write the file to
>>disk, use zebraidx, and you get .04-.09 and even faster times. What
>>magic zebraidx uses that ZOOM does not know, I don't know.
>
>The magic is simply updating multiple records at the same time, in one
>go -- something that is not possible through ZOOM today.
>
>  
>
>>>>Now *that* is nasty. Do you have *any* way of consistently recreating
>>>>the problem? Even if you don't, I'm sure Adam and the Zebra crew would
>>>>like to see some stack traces of those crashes!!
>>
>>4- The problem is intermittent. I'll try to provide details when it
>>reoccurs. What it says is "Fatal error: inconsistent register"; from
>>then on you have to throw the Zebra database away and rebuild
>>everything. You cannot fall back to the working state. The shadow
>>system was supposed to handle this: if something goes wrong when you
>>commit, it should not mess up the whole database but just refuse the
>>commit and let you discard the last update/commit operation. But you
>>cannot do that. So unless I have missed something, shadow files are
>>useless.
>
>Well, they are useless when the problem is caused by a bug in Zebra, as
>seems to be the case here. Clearly, the error is not detected until it's
>too late. It would be very helpful if we could discover some sequence of
>updates that reliably recreated this..
>
>But I'd see this as another good reason for taking 1.4 for a spin. As I
>say, there have been notable changes made to the indexing subsystem --
>it may be that this is a problem that has been fixed or eliminated by
>some other change.
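For reference, shadow registers are enabled in zebra.cfg with directives along these lines (the directory names and sizes here are only examples):

```
# zebra.cfg -- example shadow setup
register: register:4G
shadow: shadow:4G
```

With shadow enabled, `zebraidx update` writes to the shadow area and the live register only changes when you run `zebraidx commit`, so a crash before the commit should leave the old register intact -- which is exactly the guarantee that appears to be broken here.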
>
>  
>
>>Here is some more to think about zebra:
>>1- The MARCXML we feed into zebra is a
>><collection><record></record></collection> package. When you get it
>>back it is still like that, as it is supposed to be. But if you feed
>>zebra with ISO2709 marc records and ask zebra for xml records back, you
>>get back a <record></record> package with no <collection> wrapping
>>around it. Although this is not a problem currently, I still do not
>>like zebra doing that. Its MARC to XML conversion should follow
>>standards rather than create its own.
>
>Check the standard. The <collection> wrapper is optional. For my part, I
>never use the <collection> wrapper when I deal with single records. I
>don't know if Zebra does the right thing with it, but it seems to work
>for you otherwise..
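Concretely, in MARCXML (per the MARC21slim schema) both of the following forms are valid; the wrapper is only required when several records travel in one document. The field contents below are made up purely for illustration:

```xml
<!-- A single record: no <collection> wrapper needed -->
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <datafield tag="245" ind1="0" ind2="0">
    <subfield code="a">An example title</subfield>
  </datafield>
</record>

<!-- Several records: one shared <collection> wrapper -->
<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>...</record>
  <record>...</record>
</collection>
```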
>
>--Seb
>
>  
>
>>Thanks for your quick response
>>Regards,
>>
>>Tumer
>>
>>-----Original Message-----
>>From: Joshua Ferraro []
>>Sent: Sunday, March 12, 2006 7:06 AM
>>To: koha-devel at nongnu.org; tgarip at neu.edu.tr
>>Subject: [Koha-devel] Building zebradb
>>
>>
>>Tumer,
>>
>>Sebastian's been good enough to respond to your post (I forwarded this
>>to the koha-zebra list). If you get a chance, could you join koha-zebra
>>(if you're not already on it) and follow up -- I've a feeling it could
>>prove to be a very productive thread.
>>
>>Cheers,
>>
>>Joshua
>>
>>----- Forwarded message from Sebastian Hammer <quinn at indexdata.com>
>>-----
>>
>>
>>Hi Joshua,
>>
>>Thanks for this feedback, it's very interesting. Clearly some of the
>>issues you describe (i.e. a lack of stability around updates) indicate
>>software problems, but there are also some interesting ideas for
>>possible refinements or new developments which I think would be really
>>useful to get into the general development plans for the software...
>>there are more folks at ID involved in Zebra development that I'd like
>>to get into these thoughts... I don't know if a wiki or just a larger
>>zebra-dev list is in order, but it's something to think about.
>>
>>Joshua Ferraro wrote:
>>
>>>----- Forwarded message from Tümer Garip <tgarip at neu.edu.tr> -----
>>>
>>>
>>>Hi,
>>>We have now put the zebra into production level systems. So here is
>>>some experience to share.
>>>
>>>Building the zebra database from single records is a veeeeery looong
>>>process. (100K records 150k items)
>>>
>>Yes, that confirms my expectations. We could think about building some
>>kind of buffering in for first-time updating, or else the logic has to
>>be in the application, as you've seen... the situation is particularly
>>grim if shadow-indexing is enabled during indexing and every record is
>>committed, since this causes a sync of the disks, which could take up to
>>a whole second.
>>
>>Also, I'm not sure which version of Zebra you're using? I've been doing
>>some performance-testing of Zebra for the Internet Archive, and noted
>>quite a difference between 1.3 and 1.4 (the CVS version), which is
>>really where all the development happens.
>>
>>>Best method we found:
>>>
>>>1- Change zebra.cfg file to include
>>>
>>>iso2709.recordType:grs.marcxml.collection
>>>recordType:grs.xml.collection
>>>
>>>2- Write (or hack export.pl) to export all the marc records as one big
>>>chunk to the correct directory with an extension .iso2709, and system
>>>call "zebraidx -g iso2709 -d <dbnamehere> update records -n".
>>>
>>>This ensures that zebra knows it is reading marc records rather than
>>>xml and builds 100K+ records at zooming speed. Your zoom module always
>>>uses the grs.xml filter, while you can at any time update or reindex
>>>any big chunk of the database as long as you have marc records.
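Sketched as a shell script, the bulk-load approach looks roughly like this -- the paths, database name, and stand-in record files are examples, not Koha's actual export.pl output:

```shell
set -e

# Stand-ins for records exported by (a hacked) export.pl; real ones
# would be ISO2709 blobs, one per biblio.
mkdir -p exported records
printf 'rec1' > exported/a.marc
printf 'rec2' > exported/b.marc

# 1- Concatenate everything into one big chunk with the .iso2709
#    extension that zebra.cfg maps to grs.marcxml.collection.
cat exported/*.marc > records/all.iso2709

# 2- Bulk-index the chunk; -g picks the iso2709 record group and -n
#    skips the shadow system for maximum speed.
if command -v zebraidx >/dev/null 2>&1; then
    zebraidx -g iso2709 -d mydb update records -n \
        || echo "zebraidx run failed (needs a real zebra.cfg)"
else
    echo "zebraidx not installed; skipping the indexing step"
fi
```

The speed win comes from zebraidx seeing all the records in one pass instead of one commit per record.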
>>
>>Good strategy, I think.. but of course it's weird and awkward to have
>>to use two different formats, especially when they both have
>>limitations. We really must look into handling ISO2709 from ZOOM.
>>
>>Mind you, version 1.4 should be able to read multiple
>>collection-wrapped MARCXML records in one file, but only (AFAIK) in
>>conjunction with the new XSLT-based index rules. I *would* like to try
>>to develop a good way to work with bibliographic data in that
>>framework.
>>
>>>3-We are still using the old API, so we read the xml and use
>>>MARC::Record->new_from_xml( $xmldata ). A note here that we did not
>>>have to upgrade MARC::Record or MARC::Charset at all. Any marc created
>>>within KOHA is UTF8 and any marc imported into KOHA (old
>>>marc_subfield_tables) was correctly decoded to utf8 with char_decode
>>>of biblio.
>>
>>Why not ask for the records in ISO2709 from Zebra if that's what you
>>want to work with (when records are loaded in MARCXML, it can spit out 
>>either 2709 or XML)? Or is it just that you want to have them in 
>>MARC::Record? At any rate, ISO2709 is a more compact exchange format,
>>so it seems to make sense to use it.
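For what it's worth, switching between the two formats is easy to try from yaz-client; the session below is only a sketch (host, port, database, and query are hypothetical), written to a command file so it can be replayed:

```shell
# yaz-client command file: fetch the same hit first as ISO2709
# (usmarc) and then as MARCXML. Host/port/db and query are examples.
cat > fetch-commands.txt <<'EOF'
open tcp:localhost:9999/mydb
find @attr 1=4 example
format usmarc
show 1
format xml
show 1
EOF
```

Replay it with `yaz-client < fetch-commands.txt` against a running zebrasrv; `format usmarc` asks for ISO2709 records, `format xml` for the XML rendering.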
>>
>> 
>>
>>    
>>
>>>4- We modified circ2.pm and the items table to have an item onloan
>>>field and mapped it to marc holdings data. Now our opac search does
>>>not call mysql except for the branchname.
>>>
>>>5- Average updates per day is about 2000 (circulation+cataloger). I
>>>can say that the speed of the zoom search, which slows down during a
>>>commit operation, is acceptable considering the speed gain we have on
>>>the search.
>>>
>>Again, I'd be keen to know which Zebra version you're running?
>>
>>Because the Internet Archive will be doing similar point-updates for
>>records (only a *lot* more often than 2000 times per day), I have been
>>looking a lot at the update speed for these small changes that only
>>affect a single term in a record (like a circulation code).. In my test
>>database of 150K records, 10K average size, I get update times around
>>.5-.7 seconds, which just seems intuitively slower than it should have
>>to be. I'm going to nudge the coders to see if we can possibly do this
>>better.
>>
>>>6- Zebra behaves very well with searches but is very temperamental
>>>with updates. A queue of updates sometimes crashes the zebraserver.
>>>When the database crashes we cannot save anything even though we are
>>>using shadow files. I'll be reporting on this issue once we can
>>>isolate the problems.
>>>
>>Now *that* is nasty. Do you have *any* way of consistently recreating
>>the problem? Even if you don't, I'm sure Adam and the Zebra crew would 
>>like to see some stack traces of those crashes!!
>>
>>--Sebastian
>>
>> 
>>
>>    
>>
>>>Regards,
>>>Tumer
>>>
>>>
>>>
>>>_______________________________________________
>>>Koha-devel mailing list
>>>Koha-devel at nongnu.org
>>>http://lists.nongnu.org/mailman/listinfo/koha-devel
>>>
>>>----- End forwarded message -----

-- 
Sebastian Hammer, Index Data
quinn at indexdata.com   www.indexdata.com
Ph: (603) 209-6853
