[Koha-zebra] Re: Import Speed
Sebastian Hammer
quinn at indexdata.com
Thu Mar 2 17:05:44 CET 2006
Joshua,
Importing records one at a time when first building a database, or when
doing a batch update that is a substantial percentage of the size of the
database, is not a good idea. The software has no way to optimize the
layout of the index files, so for each record update, things get
shuffled around, resulting in very sluggish update performance and a
less-than-ideal layout inside the index files.
It would be highly advisable to do at least the initial import from the
command-line. I think it would make a lot of sense if this could be done
well from the protocol, but AFAIK, the extended service interface at the
moment only allows you to insert one record at a time.
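For reference, a single-record update through extended services looks
roughly like this from yaz-client (an untested sketch: the port, database
name and record id are placeholders, the update syntax is per a recent
yaz-client, and zebra.cfg has to permit updates, e.g. with
"perm.anonymous: rw" and the shadow system enabled):

% yaz-client localhost:9999/kohadb
Z> update insert rec001 record.xml
Z> quit

Every such round trip touches the index files, which is exactly why
thousands of them for an initial load hurt so much.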
>Can we just process the raw MARC? Why did we choose the '.xml'
>storage method in Zebra and is it a good choice? Would '.sgml' or
>'.marc' be a better choice (because we could batch import directly
>instead of '.xml's one-at-a-time)? Could we somehow use '.marc' for
>the import and then switch to '.xml'?
>
That's a good question. You use .xml because extended services only work
with XML. It *may* be possible to ingest records from the command-line
as grs.marcxml (which reads MARC records and renders them internally as
MARCXML), then do subsequent updates as XML, doing the conversion on the
client side. I say *may* because I haven't tried it, but I think it'd be
worth a shot, and the experiment is easy to set up:
1: Start with a sample of MARC records
2: Build the initial index like so:
% zebraidx init
% zebraidx -f 10 -n -t grs.marcxml update recordfile
  (the -n flag disables the shadow system for this update)
This should run pleasantly fast compared to what you see now.
3: Try to update some records as MARCXML.
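If you want to test step 3 over the protocol, the client-side conversion
could be done with yaz-marcdump before pushing the record through the
same extended-services update as above (again a sketch I haven't run;
the -o marcxml output option assumes a reasonably recent YAZ, and
test.mrc / test.xml / rec001 are placeholders):

% yaz-marcdump -o marcxml test.mrc > test.xml
% yaz-client localhost:9999/kohadb
Z> update insert rec001 test.xml

If the record then comes back from a search, the grs.marcxml bulk load
and XML updates over the protocol coexist happily.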
--Seb
>Any suggestions on how to handle the connection in a more efficient
>way?
>
>Cheers,
>
--
Sebastian Hammer, Index Data
quinn at indexdata.com www.indexdata.com
Ph: (603) 209-6853