[Koha-devel] Search Engine Changes : let's get some solr

Frédéric Demians frederic at tamil.fr
Sun Nov 14 23:57:41 CET 2010


 >> But since records stored into Koha are cleanly UTF-8 encoded, are
 >> well formed XML and respect a minimalist schema,
 > That is the ideal. In practice, Koha currently does not enforce either
 > of your two assumptions in that statement; patches to tighten that up
 > would be a good idea.

I don't understand. Do you mean that the biblioitems.marcxml field and its
mirror in Zebra can contain something other than valid MARCXML? Invalidly
encoded characters shouldn't change anything, whichever parser is used. I
see bug #2916 in Bugzilla. Is there something more?

 >> we could parse them much more quickly directly in Perl.
 > I question whether a pure Perl implementation would be faster.
 > XML::LibXML::SAX, XML::SAX::Expat, and XML::SAX::ExpatXS have the
 > advantage that much of the parsing is handled by C code.

My tests show that pure Perl parsing is twelve times as fast as parsing
with a SAX parser. A test script is here:

     http://tinyurl.com/23xaqkg

 > Are you suggesting that we adopt yet another hand-crafted, pure Perl
 > XML parser, one that does not support namespaces (a lot of MARCXML
 > data in the wild does reference the marc namespace) and does not check
 > for well formed XML *and* adopt a new MARC module that appears to be
 > all of a few days old and lacks test cases for use in Koha?

I've neither said nor suggested that. The issue pointed out by
Henri-Damien is that in C4::Search we build the MARC::Record from
ISO2709, because building the same MARC::Record from MARCXML is much
slower. As a result, we're limited to a record size of 99,999 bytes,
since the ISO2709 leader stores the record length in five digits. My
point is that we could build a MARC::Record from the MARCXML returned by
Zebra using a pure Perl parser. So I pointed to some code, explaining
that it could be ported, i.e. adapted to generate a MARC::Record as
usually used in Koha. Have I proposed substituting a new (immature?)
MARC module, for whatever motive? I don't think so.
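
To make the idea concrete, here is a minimal sketch of regex-based
MARCXML parsing. It is written in Python purely for illustration; in
Koha the equivalent would be Perl and would fill a MARC::Record rather
than a plain dict. The names parse_marcxml and SAMPLE are hypothetical,
and the sketch assumes trusted, well-formed records whose attributes
always appear in the order tag, ind1, ind2. It does not check
well-formedness and does not decode numeric character references.

```python
import re
from xml.sax.saxutils import unescape

# Optional namespace prefix such as "marc:" (assumption: a simple name).
NS = r"(?:\w+:)?"
LEADER_RE = re.compile(rf"<{NS}leader>(.*?)</{NS}leader>", re.S)
CONTROL_RE = re.compile(
    rf'<{NS}controlfield\s+tag="(\w{{3}})">(.*?)</{NS}controlfield>', re.S)
DATA_RE = re.compile(
    rf'<{NS}datafield\s+tag="(\w{{3}})"\s+ind1="([^"]?)"\s+ind2="([^"]?)">'
    rf"(.*?)</{NS}datafield>", re.S)
SUBFIELD_RE = re.compile(
    rf'<{NS}subfield\s+code="(.)">(.*?)</{NS}subfield>', re.S)


def text(value):
    """Decode the five predefined XML entities in element content."""
    return unescape(value, {"&quot;": '"', "&apos;": "'"})


def parse_marcxml(xml):
    """Return a dict with the leader and a list of fields.

    Control fields are (tag, value) tuples; data fields are
    (tag, ind1, ind2, [(code, value), ...]) tuples.
    """
    leader = LEADER_RE.search(xml)
    record = {"leader": text(leader.group(1)) if leader else "", "fields": []}
    for tag, value in CONTROL_RE.findall(xml):
        record["fields"].append((tag, text(value)))
    for tag, ind1, ind2, body in DATA_RE.findall(xml):
        subfields = [(code, text(value))
                     for code, value in SUBFIELD_RE.findall(body)]
        record["fields"].append((tag, ind1, ind2, subfields))
    return record


# Hypothetical sample record, not taken from a real catalogue.
SAMPLE = """<record>
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">12345</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Cats &amp; dogs</subfield>
    <subfield code="c">A. Author</subfield>
  </datafield>
</record>"""

record = parse_marcxml(SAMPLE)
for field in record["fields"]:
    print(field)
```

Such a shortcut is only reasonable for records Koha has already
validated and stored itself; it is not a general XML parser.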

 > On a more general note, XML parsing is a (mostly) solved problem in
 > Perl. I don't think the way forward is to interpose hand-crafted
 > pure-Perl-parsing of the MARCXML. To suggest an alternative approach
 > that I think would bear fruit and be less error prone, we can
 > try other standard XML parsers such as XML::Twig.

A MARCXML document is very simple XML that doesn't need a full-fledged
XML parser. I'm just saying that once the MARCXML records stored in Koha
are guaranteed to be valid, if that isn't already the case, we can avoid
a heavyweight parser that hurts performance and isn't required. Of
course, we still need to use a SAX parser for incoming records.
--
Frédéric



