[Koha-zebra] Re: Import Speed
Sebastian Hammer
quinn at indexdata.com
Thu Mar 2 17:05:44 CET 2006
Joshua,
Importing records one at a time when first building a database, or when
doing a batch update that is a substantial percentage of the size of the
database, is not a good idea. The software has no way to optimize the
layout of the index files, so for each record update, things get
shuffled around, resulting in very sluggish update performance and a
less-than-ideal layout inside the index files.
It would be highly advisable to do at least the initial import from the
command-line. I think it would make a lot of sense if this could be done
well from the protocol, but AFAIK, the extended service interface at the
moment only allows you to insert one record at a time.
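For reference, a single-record update through extended services looks
roughly like this from yaz-client (an untested sketch: the port, database
name and record id are placeholders, the update syntax is per a recent
yaz-client, and zebra.cfg has to permit updates, e.g. with
"perm.anonymous: rw" and the shadow system enabled):

% yaz-client localhost:9999/kohadb
Z> update insert rec001 record.xml
Z> quit

Every such round trip touches the index files, which is exactly why
thousands of them for an initial load hurt so much.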
>Can we just process the raw MARC? Why did we choose the '.xml'
>storage method in Zebra and is it a good choice? Would '.sgml' or
>'.marc' be a better choice (because we could batch import directly
>instead of '.xml's one-at-a-time)? Could we somehow use '.marc' for
>the import and then switch to '.xml'?
>
That's a good question. You use .xml because extended services only work
with XML. It *may* be possible to ingest records from the command-line
as grs.marcxml (which reads MARC records and renders them internally as
MARCXML), then do subsequent updates as XML, doing the conversion on the
client side. I say *may* because I haven't tried it, but I think it'd be
worth a shot, and the experiment is easy to set up:
1: Start with a sample of MARC records
2: Build the initial index like so:
% zebraidx init
% zebraidx -f 10 -n -t grs.marcxml update recordfile
  (the -n flag disables the shadow system for this update)
This should run pleasantly fast compared to what you see now.
3: Try to update some records as MARCXML.
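If you want to test step 3 over the protocol, the client-side conversion
could be done with yaz-marcdump before pushing the record through the
same extended-services update as above (again a sketch I haven't run;
the -o marcxml output option assumes a reasonably recent YAZ, and
test.mrc / test.xml / rec001 are placeholders):

% yaz-marcdump -o marcxml test.mrc > test.xml
% yaz-client localhost:9999/kohadb
Z> update insert rec001 test.xml

If the record then comes back from a search, the grs.marcxml bulk load
and XML updates over the protocol coexist happily.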
--Seb
>Any suggestions on how to handle the connection in a more efficient
>way?
>
>Cheers,
>
--
Sebastian Hammer, Index Data
quinn at indexdata.com www.indexdata.com
Ph: (603) 209-6853