[Koha-zebra] Import Speed

Joshua Ferraro jmf at liblime.com
Thu Mar 2 16:44:22 CET 2006


Hi all,

So we've finally got records importing into Zebra, but there are some
serious performance issues, some related to MARC::Charset and some
(I suspect) to Perl-ZOOM/Zebra.

So ... let me lay out what we're doing and then show some benchmarks
I've done before I get to my questions.

CONNECTION OBJECT

All of our connections to Zebra are handled in our Context module.
My tests indicate that it maintains the same connection throughout
all the operations. The connection maintainer checks the connection
like this:

if (defined($context->{"Zconn"})) {
    $Zconn = $context->{"Zconn"};
    # probe the cached connection with a throwaway search
    $rs = $Zconn->search_pqf('@attr 1=4 mineral');

    # if the probe failed, rebuild the connection
    if ($Zconn->errcode() != 0) {
        $context->{"Zconn"} = &new_Zconn();
        return $context->{"Zconn"};
    }
    return $context->{"Zconn"};
} else {
    $context->{"Zconn"} = &new_Zconn();
    return $context->{"Zconn"};
}

Of course, in the context of batch import, this gets called for
every record imported (with '.xml' as the storage format) ... just
wondering if there's a better way ... see the questions below for details.
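
One idea I've been toying with (just a sketch, untested; it assumes the
ZOOM module's exception interface, i.e. that a dead connection makes the
next operation die with a ZOOM::Exception, and new_Zconn() is our existing
constructor) would be to drop the probe search entirely and only reconnect
when an operation actually fails:

use ZOOM;

# cache the connection; create it only once
sub Zconn {
    my $context = shift;
    $context->{Zconn} ||= &new_Zconn();
    return $context->{Zconn};
}

# run an operation against the cached connection, rebuilding it once on failure
sub with_Zconn {
    my ( $context, $op ) = @_;      # $op is a coderef that does the real work
    my $result = eval { $op->( Zconn($context) ) };
    if ($@) {
        die $@ unless ref($@) && $@->isa('ZOOM::Exception');
        # the cached connection went away: rebuild it and retry once
        $context->{Zconn} = &new_Zconn();
        $result = $op->( $context->{Zconn} );
    }
    return $result;
}

That would at least save one search round trip to Zebra for every record
we import.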

IMPORT PROCESS

We've got a script called 'bulkmarcimport' that takes a file full of
MARC records and imports them into Zebra as well as into some MySQL
fields. It can be set to perform a 'commit' operation after a certain
number of records, but I haven't seen any performance difference
related to how often the 'commit' happens (i.e., committing once every
100 records seems about the same speed as once every 10000).

The process looks something like this:

Grab a MARC record with MARC::Batch
Add some local use fields, etc.
Feed it to old-koha routines to put some stuff in MySQL
Hand it to MARC::File::XML to create an XML object
	MARC::File::XML converts from MARC-8 to UTF-8 where necessary
Feed it to Zebra via perl-zoom
Perform a commit operation every so often
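
A stripped-down sketch of the Zebra end of that loop (this is not the
actual bulkmarcimport.pl code; the ZOOM::Package 'specialUpdate'/'commit'
calls and get_cached_Zconn() are my shorthand for the extended-services
interface and for whatever accessor the Context module provides):

use MARC::Batch;
use MARC::File::XML;
use ZOOM;

my $batch = MARC::Batch->new( 'USMARC', '/home/jmf/koha.mrc' );
my $Zconn = get_cached_Zconn();    # placeholder for the Context accessor
my $count = 0;

while ( my $record = $batch->next() ) {
    # ... add local-use fields, update MySQL via the old Koha routines ...

    # serialize to MARCXML; this is where the MARC-8 to UTF-8 work happens
    my $xml = MARC::File::XML::record($record);

    # hand the record to Zebra through an extended-services update package
    my $pkg = $Zconn->package();
    $pkg->option( action => 'specialUpdate' );
    $pkg->option( record => $xml );
    $pkg->send('update');
    $pkg->destroy();

    # commit every so often (the -commit option)
    if ( ++$count % 100 == 0 ) {
        my $commit = $Zconn->package();
        $commit->send('commit');
        $commit->destroy();
    }
}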

BENCHMARKS

I recently tried the import with 5000 records. The first few hundred
were rolling along at about 3-5 per second, but this eventually
dropped to 1-2 per second and finally to about 1 record every 2 seconds.
The whole batch imported in 40604 seconds (an average of roughly 8
seconds per record).

Running a benchmark on an import of 100 records yields the following:
sam:/home/koha/testing/cvsrepo/koha# perl -d:DProf misc/migration_tools/bulkmarcimport.pl -d -n 100 -commit 100 -file /home/jmf/koha.mrc
deleting biblios
COMMIT OPERATION SUCCESSFUL
100 MARC records imported in 59.907173871994 seconds

sam:/home/koha/testing/cvsrepo/koha# dprofpp tmon.out                           
Exporter::export_ok_tags has -1 unstacked calls in outer
AutoLoader::AUTOLOAD has -1 unstacked calls in outer
Exporter::Heavy::heavy_export has 12 unstacked calls in outer
bytes::AUTOLOAD has -1 unstacked calls in outer
Exporter::Heavy::heavy_export_ok_tags has 1 unstacked calls in outer
POSIX::__ANON__ has 1 unstacked calls in outer
POSIX::load_imports has 1 unstacked calls in outer
Exporter::export has -12 unstacked calls in outer
Storable::thaw has 1 unstacked calls in outer
bytes::length has 1 unstacked calls in outer
POSIX::AUTOLOAD has -2 unstacked calls in outer
Total Elapsed Time = 39.96495 Seconds
  User+System Time = 18.82495 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 19.4   3.657 20.081  25211   0.0001 0.0008  MARC::Charset::marc8_to_utf8
 17.1   3.223  9.841 264274   0.0000 0.0000  MARC::Charset::Table::get_code
 13.6   2.566  2.566 264257   0.0000 0.0000  Storable::mretrieve
 11.3   2.141  0.000 264257   0.0000 0.0000  Storable::thaw
 8.57   1.613  2.934 528514   0.0000 0.0000  Class::Accessor::__ANON__
 8.05   1.516  1.516 264274   0.0000 0.0000  SDBM_File::FETCH
 7.07   1.331  2.669 264257   0.0000 0.0000  MARC::Charset::Code::char_value
 7.02   1.321  1.321 528514   0.0000 0.0000  Class::Accessor::get
 6.80   1.281 11.123 264274   0.0000 0.0000  MARC::Charset::Table::lookup_by_marc8
 4.81   0.906  0.906 264257   0.0000 0.0000  MARC::Charset::_process_escape
 2.34   0.441  0.837  14532   0.0000 0.0001  MARC::Record::field
 2.10   0.396  0.396 264274   0.0000 0.0000  MARC::Charset::Table::db
 2.09   0.393  0.393 196731   0.0000 0.0000  MARC::Field::tag
 1.83   0.344  0.344  25714   0.0000 0.0000  MARC::File::XML::escape
 1.66   0.313 21.170    503   0.0006 0.0421  MARC::File::XML::record

So ... that's all fine: mostly MARC::* is to blame for the speed issues
there. However, that doesn't explain the dramatic slowdown over time.
So I ran another benchmark, this time with 5000 records.
Here are the results:

# dprofpp tmon.out
Exporter::export_ok_tags has -1 unstacked calls in outer
AutoLoader::AUTOLOAD has -1 unstacked calls in outer
Exporter::Heavy::heavy_export has 12 unstacked calls in outer
bytes::AUTOLOAD has -1 unstacked calls in outer
Exporter::Heavy::heavy_export_ok_tags has 1 unstacked calls in outer
POSIX::__ANON__ has 1 unstacked calls in outer
POSIX::load_imports has 1 unstacked calls in outer
Exporter::export has -12 unstacked calls in outer
utf8::AUTOLOAD has -1 unstacked calls in outer
utf8::SWASHNEW has 1 unstacked calls in outer
Storable::thaw has 1 unstacked calls in outer
bytes::length has 1 unstacked calls in outer
POSIX::AUTOLOAD has -2 unstacked calls in outer
Total Elapsed Time = 37695.61 Seconds
  User+System Time = 33524.23 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 97.9   32848 32848.515  10170   3.2299 3.2299  Net::Z3950::ZOOM::connection_search_pqf
 0.44   147.7 788.26 103492   0.0001 0.0008  MARC::Charset::marc8_to_utf8
 0.37   124.2 400.37 126313   0.0000 0.0000  MARC::Charset::Table::get_code
 0.36   121.2 121.20 126295   0.0000 0.0000  Storable::mretrieve
 0.20   68.49 68.491 126313   0.0000 0.0000  SDBM_File::FETCH
 0.20   67.63  0.000 126295   0.0000 0.0000  Storable::thaw
 0.17   56.76 56.767 252590   0.0000 0.0000  Class::Accessor::get
 0.17   56.11 112.88 252590   0.0000 0.0000  Class::Accessor::__ANON__
 0.15   48.71 449.091 126313   0.0000 0.0000  MARC::Charset::Table::lookup_by_marc8
 0.12   41.82 95.098 126295   0.0000 0.0000  MARC::Charset::Code::char_value
 0.10   33.27 33.274 126295   0.0000 0.0000  MARC::Charset::_process_escape
 0.06   18.81 18.811 126313   0.0000 0.0000  MARC::Charset::Table::db
 0.05   17.07 32.295 728288   0.0000 0.0000  MARC::Record::field
 0.05   15.44 15.443 802346   0.0000 0.0000  MARC::Field::tag
 0.04   13.86 828.46  25241   0.0005 0.0328  MARC::File::XML::record

QUESTIONS

So I think we could gain some import speed if we were to preprocess
the set of MARC records by doing the MARC-8 to UTF-8 conversion and
fixing position 9 in the leader _before_ we begin the import. But that
still leaves us with an import that starts out fairly fast and then
starts seriously lagging. And if I'm not mistaken, it's the connection
that is taking so much time over the long haul.
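
Something like this one-pass preprocessor is what I have in mind (a rough
sketch, untested; the file names are placeholders and the
MARC::Field/MARC::Charset calls are from memory):

#!/usr/bin/perl
use strict;
use warnings;
use MARC::Batch;
use MARC::Field;
use MARC::Charset qw( marc8_to_utf8 );

my $batch = MARC::Batch->new( 'USMARC', '/home/jmf/koha.mrc' );
open my $out, '>', '/home/jmf/koha-utf8.mrc' or die "can't write output: $!";

while ( my $record = $batch->next() ) {
    # convert every variable field's subfield values from MARC-8 to UTF-8
    for my $field ( $record->fields() ) {
        next if $field->is_control_field();
        my @subfields;
        for my $pair ( $field->subfields() ) {
            my ( $code, $value ) = @$pair;
            push @subfields, $code, marc8_to_utf8($value);
        }
        my $new = MARC::Field->new(
            $field->tag(), $field->indicator(1), $field->indicator(2),
            @subfields
        );
        $field->replace_with($new);
    }

    # flag the record as UTF-8 in Leader/09 so the import doesn't reconvert
    my $leader = $record->leader();
    substr( $leader, 9, 1 ) = 'a';
    $record->leader($leader);

    print {$out} $record->as_usmarc();
}
close $out;

Then bulkmarcimport could be pointed at the pre-converted file, and (if I
understand MARC::File::XML correctly) it should see Leader/09 = 'a' and
skip marc8_to_utf8 entirely, which is where most of the per-record CPU
time goes in the 100-record profile above.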

So, my questions:

Can we just process the raw MARC? Why did we choose the '.xml'
storage method in Zebra, and is it a good choice? Would '.sgml' or
'.marc' be a better choice (because we could batch import directly
instead of feeding '.xml' records one at a time)? Could we somehow use
'.marc' for the import and then switch to '.xml'?

Any suggestions on how to handle the connection in a more efficient
way?

Cheers,

-- 
Joshua Ferraro               VENDOR SERVICES FOR OPEN-SOURCE SOFTWARE
President, Technology       migration, training, maintenance, support
LibLime                                Featuring Koha Open-Source ILS
jmf at liblime.com |Full Demos at http://liblime.com/koha |1(888)KohaILS




