[Koha-devel] Search Engine Changes : let's get some solr

Galen Charlton gmcharlt at gmail.com
Mon Nov 15 05:46:15 CET 2010


Hi,

2010/11/14 Frédéric Demians <frederic at tamil.fr>:
> A MARCXML document is very simple XML which doesn't need a
> full-fledged XML parser. I'm just saying that as long as the MARCXML
> records stored in Koha are valid -- if that isn't already the case --
> we can avoid using a heavyweight parser that hurts performance and
> isn't required. Of course, we need to continue using a SAX parser for
> incoming records.

I've measured, and your parser is, in fact, pretty fast -- *if* you
feed it only MARCXML that meets narrower constraints than the
MARC21slim schema permits.  However, I see no good reason to impose
that artificial restriction on Koha; having biblioitems.marcxml
contain MARCXML that validates against MARC21slim is sufficient.

Two parsers doing similar operations is an invitation for subtle
bugs.  The pure Perl parser you propose currently doesn't handle
namespace prefixes (which are allowed in MARC21slim records), wouldn't
handle a document whose attributes aren't in the order you expect
(attribute order is not significant per the XML specification), and
will blithely accept non-well-formed XML without complaining (this is
*not* a good thing).  It also doesn't recognize and correctly handle
XML entities.  Obviously you could address much of this in your code,
but I suspect you'd end up with an XML parser that is slower and still
buggier than any of the standard parser modules.
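To make the point concrete, here is a small sketch (in Python, using the standard library's conforming parser, purely as an illustration -- the Perl modules under discussion behave the same way per the XML spec): attribute order and namespace prefixes must not affect how a datafield is read, and non-well-formed input must be rejected.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"

# The same datafield three ways: attributes in the "usual" order,
# attributes reordered, and the element carrying an explicit prefix.
a = '<datafield xmlns="%s" tag="245" ind1="1" ind2="0"/>' % MARC_NS
b = '<datafield xmlns="%s" ind2="0" ind1="1" tag="245"/>' % MARC_NS
c = '<m:datafield xmlns:m="%s" tag="245" ind1="1" ind2="0"/>' % MARC_NS

for doc in (a, b, c):
    el = ET.fromstring(doc)
    # A conforming parser resolves all three to the same element name
    # and the same attribute set, regardless of lexical order or prefix.
    assert el.tag == "{%s}datafield" % MARC_NS
    assert el.attrib["tag"] == "245"

# Non-well-formed input is rejected rather than silently accepted:
try:
    ET.fromstring('<datafield tag="245">')  # missing close tag
except ET.ParseError:
    pass
```

A hand-rolled parser that matches on byte patterns would treat these three inputs differently, which is exactly the class of subtle bug at issue.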

Fortunately, I've found an approach that is significantly faster than
MARC::File::XML with SAX: dropping SAX from MARC::File::XML entirely
and using XML::LibXML's DOM parser instead [1].  It is faster [2] than
using XML::LibXML::SAX::Parser [3], XML::SAX::Expat [4], and even
XML::SAX::ExpatXS [5].  A pure Perl approach based on your work [6]
does win the race [7], but it also fails some of MARC::File::XML's
test cases, and I'm sure it would lose that speed once extended to
handle the full range of what constitutes a valid MARCXML document.
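For readers unfamiliar with the DOM approach, the shape of it is: parse one record into a tree, then walk the tree to collect the leader, control fields, and data fields.  The branch at [1] does this in Perl with XML::LibXML; the following is only an illustrative sketch of the same pattern using Python's standard-library parser.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.loc.gov/MARC21/slim}"

record = ET.fromstring(
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<leader>00000nam a2200000 a 4500</leader>'
    '<controlfield tag="001">12345</controlfield>'
    '<datafield tag="245" ind1="1" ind2="0">'
    '<subfield code="a">A title</subfield>'
    '</datafield>'
    '</record>'
)

# Walk the DOM tree once, collecting fields in document order.
leader = record.findtext(NS + "leader")
fields = []
for cf in record.findall(NS + "controlfield"):
    fields.append((cf.get("tag"), cf.text))
for df in record.findall(NS + "datafield"):
    subs = [(sf.get("code"), sf.text) for sf in df.findall(NS + "subfield")]
    fields.append((df.get("tag"), df.get("ind1"), df.get("ind2"), subs))

assert fields[0] == ("001", "12345")
assert fields[1][3] == [("a", "A title")]
```

The whole record is in memory as a tree for the duration of the walk, which is what raises the memory question addressed below.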

But, one might ask, what about memory usage with a DOM parser?
MARC::File::XML as used by Koha (and in general) is geared towards
parsing one record at a time; it currently has no provision for
loading an entire file into memory.  A DOM tree for a typical MARCXML
record is not a big deal, and even a record carrying several thousand
items would still be manageable.  (Of course, as we all know, one of
the most significant gains to be had will come from changing Koha so
that it doesn't embed item data in bib MARC tags as a matter of
course.)  In fact, we already have proof that we'd be no worse off as
far as memory consumption is concerned: XML::LibXML::SAX::Parser, as
it happens, isn't a traditional SAX parser.  What it actually does is
load the XML document into a DOM tree, then walk the tree and fire off
SAX events.  In other words, we're *already* using a DOM parser.
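The one-record-at-a-time pattern is easy to sketch.  Again in Python for illustration (the Perl equivalent would sit inside MARC::File::XML's record reader): each <record> becomes a small DOM subtree that is processed and then discarded, so a whole collection file never lives in memory at once.

```python
import io
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# A tiny stand-in for a large MARCXML collection file.
collection = io.StringIO(
    '<collection xmlns="http://www.loc.gov/MARC21/slim">'
    '<record><leader>00000nam a2200000 a 4500</leader></record>'
    '<record><leader>00000nam a2200000 a 4500</leader></record>'
    '</collection>'
)

count = 0
for event, elem in ET.iterparse(collection, events=("end",)):
    if elem.tag == MARC_NS + "record":
        count += 1   # process this record's DOM subtree here
        elem.clear() # then free it before reading the next record
assert count == 2
```

Only one record's tree is ever held between iterations, which is why per-record DOM parsing doesn't change Koha's memory profile.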

In any event, I would be grateful if people would test the DOM version
of MARC::File::XML.  It passes MARC::File::XML's test suite
successfully, but more testing to verify that it won't break things
would help a great deal.

By the way, I did also try XML::Twig, but that didn't turn out to be
faster than XML::LibXML::SAX::Parser, and in some cases was slower.

[1] http://git.librarypolice.com/?p=marcpm.git;a=shortlog;h=refs/heads/use-dom-instead-of-sax
[2] http://librarypolice.com/nytprof/run-libxml-dom-2/index.html
[3] http://librarypolice.com/nytprof/run-sax-libxml-sax-parser/
[4] http://librarypolice.com/nytprof/run-sax-expat/index.html
[5] http://librarypolice.com/nytprof/run-sax-expatxs/index.html
[6] http://git.librarypolice.com/?p=marcpm.git;a=shortlog;h=refs/heads/pure-perl
[7] http://librarypolice.com/nytprof/run-pp/

Regards,

Galen
-- 
Galen Charlton
gmcharlt at gmail.com
