[Koha-devel] Search Engine Changes : let's get some solr

Sun Nov 14 22:42:18 CET 2010

On 14/11/2010 22:28, Galen Charlton wrote:
> Hi,
> 
> 2010/11/14 Frédéric Demians <frederic at tamil.fr>:
>> MARCXML parsing is slow because MARC::File::XML uses a SAX parser to do the
>> job and does some 'magic' encoding and decoding to and from MARC-8. Galen can
>> correct me if I'm wrong.
> 
> Properly used (i.e., with BinaryEncoding => utf8 when parsing known
> UTF-8 MARCXML records), MARC::File::XML doesn't automatically
> transcode from MARC-8 to UTF-8, so that's a nonissue.  Admittedly,
> there are still some circumstances where MARC::File::XML does
> inappropriately try to inject a MARC-8 to UTF-8 conversion.  Patches
> to improve MARC::File::XML are welcome.
> 
>> But since records stored into Koha are cleanly UTF-8
>> encoded, are well formed XML and respect a minimalist schema,
> 
> That is the ideal.  In practice, Koha currently does not enforce
> either of your two assumptions in that statement; patches to tighten
> that up would be a good idea.
Some work on this has been pushed to the BibLibre-various branch,
touching C4::Charset, C4::Biblio and C4::Search.
We used it to get Korean correctly displayed... and searched.

> 
>> we could parse
>> them much more quickly directly in Perl.
> 
> On a more general note, XML parsing is a (mostly) solved problem in
> Perl.  I don't think the way forward is to interpose hand-crafted
> pure-Perl-parsing of the MARCXML.  To suggest an alternative approach
> that I think would bear fruit and be less error prone, we can
> try other standard XML parsers such as XML::Twig.
Having played with that module, I found it quite neat and fast at
parsing; it was designed for parsing large XML documents on the fly.
But again, we would have to improve C4::XSLT.pm...
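
To make the on-the-fly idea concrete, here is a minimal sketch of
record-at-a-time MARCXML parsing. It is not XML::Twig and not Koha code:
it uses Python's standard-library xml.etree.ElementTree.iterparse as a
stand-in for the same streaming technique. The sample records, the
iter_titles helper and the choice of fields 001 and 245$a are all
assumptions made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Namespace used by MARCXML ("MARC21 slim") documents.
MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# Hypothetical sample data, invented for this sketch.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">1</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">First title</subfield>
    </datafield>
  </record>
  <record>
    <controlfield tag="001">2</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">한국어 제목</subfield>
    </datafield>
  </record>
</collection>
"""


def iter_titles(stream):
    """Yield (record id, 245$a) pairs, handling one record at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != MARC_NS + "record":
            continue
        record_id = elem.findtext(f"{MARC_NS}controlfield[@tag='001']")
        title = elem.findtext(
            f"{MARC_NS}datafield[@tag='245']/{MARC_NS}subfield[@code='a']"
        )
        yield record_id, title
        # Drop the finished record's children so memory use stays low.
        elem.clear()


# Feed the parser UTF-8 bytes, as a file on disk would provide.
titles = list(iter_titles(io.BytesIO(SAMPLE.encode("utf-8"))))
for record_id, title in titles:
    print(record_id, title)
```

The point of the sketch is only the shape of the loop: each record is
handled and released as soon as its closing tag is seen, instead of
building the whole document in memory first.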

-- 
Henri-Damien LAURENT