[Koha-devel] Search Engine Changes : let's get some solr

Galen Charlton gmcharlt at gmail.com
Sun Nov 14 22:28:57 CET 2010


Hi,

2010/11/14 Frédéric Demians <frederic at tamil.fr>:
> MARCXML parsing is slow because MARC::File::XML uses a SAX parser to do the
> job and do some 'magic' encoding-decoding to-from MARC8--Galen could correct
> me if I'm wrong.

Properly used (i.e., with BinaryEncoding => utf8 when parsing known
UTF-8 MARCXML records), MARC::File::XML doesn't automatically
transcode from MARC-8 to UTF-8, so that's a nonissue.  Admittedly,
there are still some circumstances where MARC::File::XML does
inappropriately try to inject a MARC-8 to UTF-8 conversion.  Patches
to improve MARC::File::XML are welcome.
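
To illustrate what "properly used" can look like, here is a minimal sketch. The sample record is made up for illustration, and it assumes MARC::Record and MARC::File::XML are installed. Passing the encoding both at import time and to new_from_xml tells the module the input is already UTF-8:

```perl
use strict;
use warnings;
use MARC::Record;
# Declare up front that records are UTF-8, so no MARC-8 handling is wanted.
use MARC::File::XML ( BinaryEncoding => 'utf8', RecordFormat => 'MARC21' );

# Hypothetical sample record, for illustration only.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">12345</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Example title</subfield>
  </datafield>
</record>
XML

# The second argument names the encoding of the incoming MARCXML.
my $record = MARC::Record->new_from_xml( $xml, 'UTF-8' );
print $record->title, "\n";
```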

> But since records stored into Koha are cleanly UTF-8
> encoded, are well formed XML and respect a minimalist schema,

That is the ideal.  In practice, Koha currently does not enforce
either of your two assumptions in that statement; patches to tighten
that up would be a good idea.

> we could parse
> them much more quickly directly in Perl.

I question whether a pure Perl implementation would be faster.
XML::LibXML::SAX, XML::SAX::Expat, and XML::SAX::ExpatXS have the
advantage that much of the parsing is handled by C code.

> I've done some experimentation. It
> works easily. This code could be ported in five minutes:
>
> http://tinyurl.com/3x3d6b9

Are you suggesting that we adopt yet another hand-crafted, pure Perl
XML parser, one that does not support namespaces (a lot of MARCXML
data in the wild does reference the marc namespace) and does not check
for well formed XML *and* adopt a new MARC module that appears to be
all of a few days old and lacks test cases for use in Koha?

What you propose is interesting, and I'm sure you'll pursue it, but it
would need more time to bake.

On a more general note, XML parsing is a (mostly) solved problem in
Perl.  I don't think the way forward is to interpose hand-crafted
pure-Perl parsing of the MARCXML.  As an alternative approach
that I think would bear fruit and be less error prone, we can
try other standard XML parsers such as XML::Twig.
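
As a rough sketch of what an XML::Twig experiment might look like (not a benchmark, and not code from Koha), the following pulls 245$a out of a namespaced collection. The sample data and the 'marc' prefix are assumptions chosen for illustration. map_xmlns is used so that handlers fire on the namespace URI whatever prefix the input uses, and parse() dies if the input is not well formed:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data, for illustration only.
my $xml = <<'XML';
<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>00000nam a2200000 a 4500</leader>
    <controlfield tag="001">12345</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Example title</subfield>
    </datafield>
  </record>
</collection>
XML

my @titles;
my $twig = XML::Twig->new(
    # Map the MARCXML namespace URI to a fixed prefix for use in handlers.
    map_xmlns     => { 'http://www.loc.gov/MARC21/slim' => 'marc' },
    twig_handlers => {
        'marc:record' => sub {
            my ( $t, $record ) = @_;
            for my $field ( $record->children('marc:datafield') ) {
                next unless ( $field->att('tag') // '' ) eq '245';
                for my $sub ( $field->children('marc:subfield') ) {
                    push @titles, $sub->text
                        if ( $sub->att('code') // '' ) eq 'a';
                }
            }
            $t->purge;    # free memory once a record has been handled
        },
    },
);

$twig->parse($xml);    # dies on input that is not well formed
print "$_\n" for @titles;
```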

Regards,

Galen
-- 
Galen Charlton
gmcharlt at gmail.com

