[Koha-zebra] Import Speed
Joshua Ferraro
jmf at liblime.com
Thu Mar 2 16:44:22 CET 2006
Hi all,
So we've finally got records importing into Zebra, but there are some
serious performance issues, some related to MARC::Charset and some
(I suspect) to Perl-ZOOM/Zebra.
So ... let me lay out what we're doing and then show some benchmarks
I've done before I get to my questions.
CONNECTION OBJECT
All of our connections to Zebra are handled in our Context module.
My tests indicate that it maintains the same connection throughout
all the operations. The connection maintainer checks the connection
like this:
    if (defined($context->{"Zconn"})) {
        $Zconn = $context->{"Zconn"};
        # probe the cached connection with a throwaway search
        $rs = $Zconn->search_pqf('@attr 1=4 mineral');
        if ($Zconn->errcode() != 0) {
            $context->{"Zconn"} = &new_Zconn();
            return $context->{"Zconn"};
        }
        return $context->{"Zconn"};
    } else {
        $context->{"Zconn"} = &new_Zconn();
        return $context->{"Zconn"};
    }
}
Of course, in the context of batch import, this gets called for
every record import (with '.xml' as the storage format) ... just
wondering if there's a better way ... see below for details.
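One possibly cheaper pattern (a sketch only, not what we run today): cache the connection and reconnect only when an operation actually fails, instead of issuing a probe search on every access. The helper name with_connection and the coderef parameters are hypothetical; $connect stands in for whatever builds a new connection (new_Zconn above).

```perl
use strict;
use warnings;

# Sketch: reuse the cached connection; rebuild it and retry once
# only if the operation itself dies. Avoids a probe search per call.
sub with_connection {
    my ($context, $connect, $op) = @_;
    $context->{Zconn} ||= $connect->();          # lazily connect
    my $result = eval { $op->($context->{Zconn}) };
    if ($@) {                                    # stale/broken connection
        $context->{Zconn} = $connect->();        # reconnect
        $result = $op->($context->{Zconn});      # retry once
    }
    return $result;
}
```

For a batch import this means the happy path costs nothing extra per record; we only pay the reconnect when Zebra actually drops us.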
IMPORT PROCESS
We've got a script called 'bulkmarcimport' that can take a
file full of MARC records and import them into Zebra as well as
some MySQL fields. It can be set to perform a 'commit' operation
after a certain number of records, but I haven't seen any performance
issues related to when the 'commit' happens (i.e., committing
once every 100 records seems about the same speed as once every
10,000).
The process looks something like this:
Grab a MARC record with MARC::Batch
Add some local use fields, etc.
Feed it to old-koha routines to put some stuff in MySQL
Hand it to MARC::File::XML to create an XML object
MARC::File::XML converts from MARC-8 to UTF-8 where necessary
Feed it to Zebra via perl-zoom
Perform a commit operation every so often
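The commit-every-N part of that loop can be sketched like this (a toy illustration only; import_loop, $handle_record and $commit are hypothetical stand-ins for the real bulkmarcimport code):

```perl
use strict;
use warnings;

# Sketch: process records one at a time, committing every $every
# records, with a final commit for any remainder.
sub import_loop {
    my ($records, $handle_record, $commit, $every) = @_;
    my $n = 0;
    for my $rec (@$records) {
        $handle_record->($rec);                  # MySQL + XML + Zebra work
        $commit->() if ++$n % $every == 0;       # periodic commit
    }
    $commit->() if $n % $every;                  # flush the tail
    return $n;
}
```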
BENCHMARKS
I recently tried the import with 5000 records. The first few hundred
were rolling along at about 3-5 per second, but this eventually
dropped to 1-2 per second and finally to about 1 every 2 seconds. The
whole batch imported in 40604 seconds.
Running a benchmark on an import of 100 records yields the following:
sam:/home/koha/testing/cvsrepo/koha# perl -d:DProf misc/migration_tools/bulkmarcimport.pl -d -n 100 -commit 100 -file /home/jmf/koha.mrc
deleting biblios
COMMIT OPERATION SUCCESSFUL
100 MARC records imported in 59.907173871994 seconds
sam:/home/koha/testing/cvsrepo/koha# dprofpp tmon.out
Exporter::export_ok_tags has -1 unstacked calls in outer
AutoLoader::AUTOLOAD has -1 unstacked calls in outer
Exporter::Heavy::heavy_export has 12 unstacked calls in outer
bytes::AUTOLOAD has -1 unstacked calls in outer
Exporter::Heavy::heavy_export_ok_tags has 1 unstacked calls in outer
POSIX::__ANON__ has 1 unstacked calls in outer
POSIX::load_imports has 1 unstacked calls in outer
Exporter::export has -12 unstacked calls in outer
Storable::thaw has 1 unstacked calls in outer
bytes::length has 1 unstacked calls in outer
POSIX::AUTOLOAD has -2 unstacked calls in outer
Total Elapsed Time = 39.96495 Seconds
User+System Time = 18.82495 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
19.4 3.657 20.081 25211 0.0001 0.0008 MARC::Charset::marc8_to_utf8
17.1 3.223 9.841 264274 0.0000 0.0000 MARC::Charset::Table::get_code
13.6 2.566 2.566 264257 0.0000 0.0000 Storable::mretrieve
11.3 2.141 0.000 264257 0.0000 0.0000 Storable::thaw
8.57 1.613 2.934 528514 0.0000 0.0000 Class::Accessor::__ANON__
8.05 1.516 1.516 264274 0.0000 0.0000 SDBM_File::FETCH
7.07 1.331 2.669 264257 0.0000 0.0000 MARC::Charset::Code::char_value
7.02 1.321 1.321 528514 0.0000 0.0000 Class::Accessor::get
6.80 1.281 11.123 264274 0.0000 0.0000 MARC::Charset::Table::lookup_by_marc8
4.81 0.906 0.906 264257 0.0000 0.0000 MARC::Charset::_process_escape
2.34 0.441 0.837 14532 0.0000 0.0001 MARC::Record::field
2.10 0.396 0.396 264274 0.0000 0.0000 MARC::Charset::Table::db
2.09 0.393 0.393 196731 0.0000 0.0000 MARC::Field::tag
1.83 0.344 0.344 25714 0.0000 0.0000 MARC::File::XML::escape
1.66 0.313 21.170 503 0.0006 0.0421 MARC::File::XML::record
So ... that's all good; mostly MARC::* is to blame for the speed
issues there. However, that doesn't explain the dramatic slowdown
over time, so I did another benchmark, this time with 5000 records.
Here were the results:
# dprofpp tmon.out
Exporter::export_ok_tags has -1 unstacked calls in outer
AutoLoader::AUTOLOAD has -1 unstacked calls in outer
Exporter::Heavy::heavy_export has 12 unstacked calls in outer
bytes::AUTOLOAD has -1 unstacked calls in outer
Exporter::Heavy::heavy_export_ok_tags has 1 unstacked calls in outer
POSIX::__ANON__ has 1 unstacked calls in outer
POSIX::load_imports has 1 unstacked calls in outer
Exporter::export has -12 unstacked calls in outer
utf8::AUTOLOAD has -1 unstacked calls in outer
utf8::SWASHNEW has 1 unstacked calls in outer
Storable::thaw has 1 unstacked calls in outer
bytes::length has 1 unstacked calls in outer
POSIX::AUTOLOAD has -2 unstacked calls in outer
Total Elapsed Time = 37695.61 Seconds
User+System Time = 33524.23 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
97.9 32848 32848.515 10170 3.2299 3.2299 Net::Z3950::ZOOM::connection_search_pqf
0.44 147.7 788.26 103492 0.0001 0.0008 MARC::Charset::marc8_to_utf8
0.37 124.2 400.37 126313 0.0000 0.0000 MARC::Charset::Table::get_code
0.36 121.2 121.20 126295 0.0000 0.0000 Storable::mretrieve
0.20 68.49 68.491 126313 0.0000 0.0000 SDBM_File::FETCH
0.20 67.63 0.000 126295 0.0000 0.0000 Storable::thaw
0.17 56.76 56.767 252590 0.0000 0.0000 Class::Accessor::get
0.17 56.11 112.88 252590 0.0000 0.0000 Class::Accessor::__ANON__
0.15 48.71 449.091 126313 0.0000 0.0000 MARC::Charset::Table::lookup_by_marc8
0.12 41.82 95.098 126295 0.0000 0.0000 MARC::Charset::Code::char_value
0.10 33.27 33.274 126295 0.0000 0.0000 MARC::Charset::_process_escape
0.06 18.81 18.811 126313 0.0000 0.0000 MARC::Charset::Table::db
0.05 17.07 32.295 728288 0.0000 0.0000 MARC::Record::field
0.05 15.44 15.443 802346 0.0000 0.0000 MARC::Field::tag
0.04 13.86 828.46 25241 0.0005 0.0328 MARC::File::XML::record
QUESTIONS
So I think we could gain some import speed if we were to preprocess
the set of MARC records, doing the MARC-8 to UTF-8 conversion and
fixing position 9 in the leader _before_ we begin the import. But that
still leaves us with an import that starts out fairly fast and then
lags seriously. And if I'm not mistaken, it's the connection that
is taking so much time over the long haul.
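The leader fix in that preprocessing pass is tiny on its own: in MARC 21, leader position 9 set to 'a' marks the record as UCS/Unicode. A sketch (the helper name mark_leader_utf8 is made up; the real pass would also run marc8_to_utf8 over the field data):

```perl
use strict;
use warnings;

# Sketch: after converting the record data from MARC-8 to UTF-8,
# mark the leader accordingly. Leader/09 = 'a' means UCS/Unicode
# in MARC 21 (blank means MARC-8).
sub mark_leader_utf8 {
    my $leader = shift;
    substr($leader, 9, 1) = 'a';
    return $leader;
}
```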
So, my questions:
Can we just process the raw MARC? Why did we choose the '.xml'
storage method in Zebra, and is it a good choice? Would '.sgml' or
'.marc' be a better choice (because we could batch import directly
instead of feeding '.xml's one at a time)? Could we somehow use
'.marc' for the import and then switch to '.xml'?
Any suggestions on how to handle the connection in a more efficient
way?
Cheers,
--
Joshua Ferraro VENDOR SERVICES FOR OPEN-SOURCE SOFTWARE
President, Technology migration, training, maintenance, support
LibLime Featuring Koha Open-Source ILS
jmf at liblime.com |Full Demos at http://liblime.com/koha |1(888)KohaILS