[Koha-zebra] Re: Import Speed

Sebastian Hammer quinn at indexdata.com
Thu Mar 2 20:07:04 CET 2006


Mike Taylor wrote:

>>Date: Thu, 02 Mar 2006 11:05:44 -0500
>>From: Sebastian Hammer <quinn at indexdata.com>
>>
>>Importing records one at a time when first building a database, or
>>when doing a batch update that is a substantial percentage of the
>>size of the database, is not a good idea. The software has no way to
>>optimize the layout of the index files, so for each record update,
>>things get shuffled around, resulting in very sluggish update
>>performance and a less-than-ideal layout inside the index files.
>>    
>>
>
>Sure, but ...
>
>  
>
>>It would be highly advisable to do at least the initial import from
>>the command-line. I think it would make a lot of sense if this could
>>be done well from the protocol, but AFAIK, the extended service
>>interface at the moment only allows you to insert one record at a
>>time.
>>    
>>
>
>But -- ??  What magic does the command-line import have access to that
>ZOOM update doesn't?  Clearly it's using some kind of in-memory
>caching to hugely reduce the frequency of disk-writes, but why
>shouldn't that also be used when doing a ZOOM update?  Isn't that (part
>of) the purpose of delaying the "commit" call?  If not, then we need
>to add $conn->option("updateCacheSize" => 100*1024*1024)
>  
>
'commit' has nearly nothing to do with it.

When you insert a new record, it is scanned, and keys are extracted from 
the record according to the indexing rules found in the .abs file. These 
keys are written to the disk, sorted, and then merged into the index. If 
you update a thousand records at the same time, in a single operation, 
then all of those keys are extracted and written into the indexes in a 
single pass, which is much more efficient than doing it in a thousand 
passes. When you do a first-time update of a thousand records, things 
are even more efficient because the merging can be dropped entirely -- 
keys are written to the indexes just about as fast as the disk can eat 
them with minimum seeks.
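
To make that concrete, here's a toy model in Perl -- not Zebra's actual
code, just an illustration of the shape of the batch path: extract keys
from every record first, sort the whole set once, then write it into
the index in one sequential pass.

    use strict;
    use warnings;

    # Toy "records": each is just a list of index keys. In Zebra the
    # keys would come from scanning the record against the .abs rules.
    my @records = (
        [ qw(hamlet shakespeare tragedy) ],
        [ qw(ulysses joyce novel) ],
    );

    # 1. Extract keys from ALL records up front.
    my @keys;
    for my $recno ( 0 .. $#records ) {
        push @keys, map { [ $_, $recno ] } @{ $records[$recno] };
    }

    # 2. Sort the complete key set once.
    @keys = sort { $a->[0] cmp $b->[0] } @keys;

    # 3. Merge into the index in a single sequential pass. On a
    #    first-time load there is nothing to merge against, so this
    #    degenerates into pure sequential writing.
    my %index;
    push @{ $index{ $_->[0] } }, $_->[1] for @keys;

Updating one record at a time repeats steps 1-3 per record: a thousand
sorts and a thousand merge passes instead of one of each.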

Doing multiple single-record updates between commits doesn't help here.
Keys are still extracted from the records and merged into the indexes
once for each individual update operation. The shadow files merely
record and defer the physical 'write' operations so they can be
executed later, independently of the 'read' operations. Update a
thousand records one at a time between commits and you just end up with
some horribly complex shadow files, and the system will probably have
to work pretty hard to do the commit step.
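
In ZOOM-Perl terms, the pattern being discussed looks roughly like the
sketch below (host and database name are placeholders). Each
send("update") costs a full extract-and-merge cycle on the server, no
matter how many of them you queue up before the commit:

    use strict;
    use warnings;
    use ZOOM;

    my @marc_records = ();    # fill with MARC data
    my $conn = ZOOM::Connection->new("localhost:9999/mydb");

    for my $record (@marc_records) {
        my $p = $conn->package();
        $p->option(action => "recordInsert");
        $p->option(record => $record);
        $p->send("update");   # server extracts + merges keys right here
        $p->destroy();
    }

    # The commit only makes the deferred shadow-file writes permanent;
    # the per-record merge passes above have already happened.
    my $p = $conn->package();
    $p->send("commit");
    $p->destroy();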

It would be much better if we had a new stage between the updating of 
records and the commit... something to allow us to transfer a large 
number of records (preferably more than one per operation to cut down on 
the round-trip traffic), THEN index them, THEN commit the changes. Then 
we'd be able to do remote updates as efficiently as we can do them locally.

Something like

1. Update, update, update, update.....
2. Index
3. Commit

Etc.
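
From the client side -- and this is purely hypothetical, since no
'index' operation exists in extended services today; it's exactly the
missing stage -- it might look like this, continuing the sketch above:

    # HYPOTHETICAL -- send("index") is not a real extended-services
    # operation; it is the proposed stage between update and commit.
    my $p = $conn->package();
    for my $record (@marc_records) {
        $p->option(action => "recordInsert");
        $p->option(record => $record);
        $p->send("update");    # 1. transfer only, no indexing yet
    }
    $p->send("index");         # 2. extract, sort, merge in one pass
    $p->send("commit");        # 3. make the shadow writes permanent
    $p->destroy();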

The problem, as far as I can tell, is that you can only transfer records 
one at a time in the present extended services system, and they're 
extracted and indexed one at a time. This is a fine way to update, 
well, one record at a time, but it's just about the worst way possible 
to update 1000 records at a time.

It shouldn't be hard to implement either -- all the hard work has 
already been done, the logic just needs to be implemented differently.

Mike, you can speak to Adam about this over lunch if you get a chance.
It is possible that I misrepresent what happens -- but this reflects my
understanding.

--Seb

> _/|_	 ___________________________________________________________________
>/o ) \/  Mike Taylor  <mike at miketaylor.org.uk>  http://www.miketaylor.org.uk
>)_v__/\  "If I write in C++ I probably don't use even 10% of the language,
>	 and in fact the other 90% I don't think I understand" -- Brian
>	 W. Kernighan.
>
>
>  
>

-- 
Sebastian Hammer, Index Data
quinn at indexdata.com   www.indexdata.com
Ph: (603) 209-6853





