[Koha-devel] Record parsers

Thomas Dukleth kohadevel at agogme.com
Thu Nov 18 22:16:54 CET 2010


[Much of the discussion of record parsers has very little to do with
Solr/Lucene specifically, the subject under which the preceding discussion
of record parsers appeared.]


Reply inline:


Previous Subject: Re: [Koha-devel] Search Engine Changes : let's get some
solr

1.  GENERAL PURPOSE XML PARSER.

On Mon, November 15, 2010 13:58, Ian Walls wrote:
> Just to throw in on something I read earlier in this thread, I'd say that
> for a general practice with Koha going forward, we should pick a single
> XML
> parser that can handle arbitrary schemas, and use that.

Having a general purpose XML parser would be very useful as one step
towards greater generalisation and abstraction in Koha.  Picking a single
XML parser for all use cases might be an optimisation mistake which we
would come to regret in future.
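
As a rough sketch of what "general purpose" could mean in practice, the
following uses XML::LibXML, which Koha already depends upon, to pull
values out of any record schema by XPath; the routine and the example
expressions are illustrative only, not a proposed API.

    use strict;
    use warnings;
    use XML::LibXML;
    use XML::LibXML::XPathContext;

    sub extract_values {
        my ($xml_string, $namespace_uri, $xpath) = @_;

        # Parse with no assumption about which record schema is in use.
        my $doc = XML::LibXML->load_xml( string => $xml_string );

        # Bind a prefix to whatever namespace the caller's schema uses,
        # so the same routine serves MARCXML, MODS, Dublin Core, etc.
        my $xpc = XML::LibXML::XPathContext->new($doc);
        $xpc->registerNs( 'r', $namespace_uri );

        return map { $_->textContent } $xpc->findnodes($xpath);
    }

    # MARCXML:
    #   extract_values( $marcxml, 'http://www.loc.gov/MARC21/slim',
    #       '//r:datafield[@tag="245"]/r:subfield[@code="a"]' );
    # MODS:
    #   extract_values( $modsxml, 'http://www.loc.gov/mods/v3',
    #       '//r:titleInfo/r:title' );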


2.  METADATA SCHEMA AGNOSTIC RECORDS.

>  I would very much
> like to make Koha not just MARC-agnostic, but metadata schema agnostic,
> and
> coding ourselves into a corner now (even for a noticeable performance
> boost), would make life difficult later.  As I think the rest of the
> thread
> attests, there are other ways to improve our XML parsing.
>
> If this had already been resolved earlier in the conversation, I apologize
> for redundancy; I haven't had my morning coffee yet.

The issue of a general purpose XML parser had been considered tangentially
but without the appropriate context of metadata schema agnostic records.

I think that considering record parsers which are not MARC or MARCXML
specific is important for long term development.


2.1.  INTERNAL RECORD FORMATS.

For some future development, Koha should not be dependent upon a metadata
exchange record syntax for anything other than lossless data input and
data output.  An internal record syntax should be optimised for particular
library management system functions.

The general state of Koha may not be ready for the work which would be
required to ensure that changing the base record format would be lossless.
 However, we should be enabling the future possibility by implementing
abstraction when opportunities arise.

Frédéric Demians recognises the distinction between internal record use
for Koha and external record use for interfacing with the world.  Previous
discussion in the "MARC record size limit" thread had also considered
non-XML record syntaxes such as YAML.

On Mon, November 15, 2010 05:56, Frédéric Demians wrote:

[...]

> It's a design choice. MARCXML is the Koha internal serialization format
> for MARC records. There is no obligation to conform to MARC21slim
> schema. We even could choose another serialization format as it has
> already been discussed. biblioitems.marcxml isn't open to the wide.

[...]

> And we could benefit of it if
> pure Perl parsing is a real performance gain. That is for the good
> reason.

However, the prospect of using a Koha-specific record syntax parser for
record creation or modification scares me.  I would much prefer to accept
somewhat lower efficiency in exchange for the validity constraints of a
Perl module widely tested outside of Koha.
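
To illustrate the sort of validity constraint I have in mind, a record
could be checked against the MARC21slim schema with XML::LibXML::Schema
before being stored, rather than trusting a home-grown parser; the local
path to the XSD below is an assumption made for the sake of the sketch.

    use strict;
    use warnings;
    use XML::LibXML;
    use XML::LibXML::Schema;

    # Hypothetical path to a local copy of the MARC21slim XSD.
    my $schema = XML::LibXML::Schema->new(
        location => '/usr/share/xml/MARC21slim.xsd'
    );

    sub is_storable {
        my ($marcxml) = @_;
        my $doc = eval { XML::LibXML->load_xml( string => $marcxml ) };
        return 0 unless $doc;                # reject records which do not parse
        eval { $schema->validate($doc) };    # validate() dies on invalid input
        return $@ ? 0 : 1;                   # reject records which do not validate
    }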


2.1.1.  REASON FOR INTERNAL RECORD FORMATS.

An example is a record format optimised for indexing, which would store
information such as language of material in a clear, appropriate place for
indexing.  Records optimised for indexing would be distinct from both the
primary form of the record optimised for editing and an alternate form
optimised for display.

MARC often uses one or more of several different places, with varying
forms of presentation, for the same information.  Examples include
language of material, which may be multiple and may refer to the language
from which the material was translated; the muddle of recording content
type, material type, carrier type, and their various relationships; the
muddle of date forms and similar numeric and sequential designators; the
muddle of ordered classification and similar hierarchical designators;
transcribed and natural language record content with no controlled
vocabulary; etc.

[In the interest of time, I omit providing detailed examples.]

Consider the case of language of material.  Enhancing records to use
fixed fields or fixed subfields for better indexing is insufficient to
capture the complexity of the language use cases, and XPath indexing of
MARC records cannot cope well enough with all the possibilities.  The
information can, however, be parsed out of MARC records reliably into a
record specially optimised for indexing; storing the information within
MARC in an easily indexable manner is the real problem.
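
As a sketch of what such parsing could look like, the following gathers
language of material from the places MARC commonly records it (008/35-37,
and 041 $a with 041 $h for the language translated from) into a flat
structure suited to indexing; the output keys are illustrative only.

    use strict;
    use warnings;
    use MARC::Record;

    sub languages_for_indexing {
        my ($record) = @_;    # a MARC::Record object
        my %out = ( language => [], original_language => [] );

        # Fixed field: 008 positions 35-37 hold a single language code.
        if ( my $f008 = $record->field('008') ) {
            my $data = $f008->data;
            if ( defined $data && length $data >= 38 ) {
                my $code = substr( $data, 35, 3 );
                # Skip blank or fill-character values.
                push @{ $out{language} }, $code if $code =~ /^[a-z]{3}$/;
            }
        }

        # 041 is repeatable: $a = language(s) of the item,
        # $h = original language(s) of a translation.
        for my $f041 ( $record->field('041') ) {
            push @{ $out{language} },          $f041->subfield('a');
            push @{ $out{original_language} }, $f041->subfield('h');
        }

        # De-duplicate and drop empty values before handing off to the indexer.
        for my $key ( keys %out ) {
            my %seen;
            $out{$key} =
                [ grep { defined && length && !$seen{$_}++ } @{ $out{$key} } ];
        }
        return \%out;
    }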


3.  ENABLING FUTURE DEVELOPMENT.

Generalising and abstracting record parsing would enable future
development, such as records normalised for a particular purpose, without
being dependent upon MARC.  Developments which enable future work do not
require a commitment to a particular development idea; they simply loosen
the practical constraints of development by leaving less work to be done
before some future development can be provided.


Thomas Dukleth
Agogme
109 E 9th Street, 3D
New York, NY  10003
USA
http://www.agogme.com
+1 212-674-3783



