[Koha-devel] Search Engine Changes : let's get some solr

Thomas Dukleth kohadevel at agogme.com
Fri Oct 8 21:02:54 CEST 2010


1.  Z39.50/SRU SUPPORT.

If I have any specific concern about the prospect of new development
complicating long term development, it is the possibility of breaking or
neglecting necessary Z39.50/SRU server support in the process of adding
excessively generic Solr/Lucene indexing.  Z39.50 and SRU are important
library standards for record sharing, which is vital to the good
functioning of the library community.

I commend Henri-Damien Laurent for taking the issue of Z39.50/SRU support
seriously and finding JZKit as a possible solution for Z39.50/SRU support
using Solr/Lucene.


2.  AVOIDING FEATURE REGRESSION OR BLOCKS TO FUTURE DEVELOPMENT.

Popular implementations of Solr/Lucene in library automation systems have
made all the mistakes of sacrificing the precision needed for serious
library research in return for the high recall with poor relevancy often
found in Google, which may satisfy merely casual queries.

I share the concern that working with Zebra is too much like working with
a black box into which one cannot peer.  I make no claim that existing
Z39.50/SRU Zebra support in Koha is ideal but merely that it should not be
too easily sacrificed for something else with its own problems which are
merely less familiar to us.  I suggest that we retain the existing
Z39.50/SRU Zebra support in Koha while adding other options which may
improve local indexing.

The full use of Bib-1 position, structure, and completeness attributes for
Z39.50, or of the ordered prox CQL operator for SRU, would allow the
precise queries needed for serious research.  The lack of a completeness
operator in CQL is a serious deficiency for SRU.  Index Data may still
need to develop support in Zebra for the ordered prox CQL operator, which
would most likely require paying to support that effort, however much it
would be appreciated by the Koha community.  Zebra certainly has bugs, as
does all software.  [See the end of the document for the Index Data
promise about bugs.]  Ultimately, I see no manageable way to have a free
software library automation system without paying Index Data for some
support for something, even if that would merely be for the Z39.50/SRU
client programming libraries.
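To make the contrast concrete, here is a rough sketch of what such queries
look like.  The index choices and exact modifier syntax vary between
implementations and profiles, so treat these as illustrations rather than
strings to copy:

```
# Z39.50, PQF notation: a subject search as a phrase occupying the
# complete field (Bib-1: 1=21 subject heading, 3=1 first in field,
# 4=1 phrase, 6=3 complete field)
@attr 1=21 @attr 3=1 @attr 4=1 @attr 6=3 "philosophy history"

# SRU, CQL: ordered proximity requiring the first term to precede the
# second; modifier names depend on the CQL version and the server
subject = "philosophy" prox/unit=word/distance<=1/ordered subject = "history"
```

Note there is no CQL analogue of the Bib-1 completeness attribute in the
second query, which is the deficiency identified above.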

Solr/Lucene may now be a good choice for internal indexing in Koha. 
Lucene was not considered fairly during 2005 testing for Koha because the
Perl bindings at that time were notoriously slow.  Solr and Lucene have
long had the mind share and development advantage of being Apache
Foundation projects, which Zebra will never match; hence the forthcoming
inclusion of Solr/Lucene indexing in the next major versions of Pazpar2
and Zebra.  However, Solr/Lucene has had problems which should not go
unconsidered in evaluating or actually implementing Solr/Lucene based
indexing in Koha.

I am not certain what point 8 from Henri-Damien's message is specifically
meant to criticise.  Is the complaint against indexing based on the DOM in
general, or against the frustration of needing to migrate from an
inefficient, deprecated means of using the DOM to a more efficient means
of using it?

On Mon, October 4, 2010 08:10, LAURENT Henri-Damien wrote:

[...]

>    8. we use a deprecated way to define indexes for biblios (grs1) and
> the tool developped by indexdata to change to DOM has many flaws. we
> could manage and do with it. But is it worth the strive ?

[...]

I contend that although working with the DOM can be difficult at times,
the DOM helps provide needed flexibility and precision in indexing.
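For readers unfamiliar with the DOM filter, the flexibility in question
comes from defining indexes as an XSLT stylesheet applied to the record
DOM.  The fragment below is only an illustrative sketch in the style of
Zebra's DOM-filter index definitions; the index names and template choices
here are my own assumptions, not copied from Koha's configuration:

```xml
<!-- Illustrative Zebra DOM-filter index definition: XSLT selects
     nodes precisely in the MARCXML tree, then emits z:index elements
     naming the target indexes and their types (w = word, p = phrase). -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.com/zebra-2.0"
    xmlns:marc="http://www.loc.gov/MARC21/slim">

  <xsl:template match="/marc:record">
    <z:record>
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- Index 245$a as both a word and a phrase title index. -->
  <xsl:template match="marc:datafield[@tag='245']/marc:subfield[@code='a']">
    <z:index name="Title:w Title:p">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

  <!-- Suppress text nodes not matched by an index template. -->
  <xsl:template match="text()"/>
</xsl:stylesheet>
```

Because the match expressions are full XPath, an index can be as precise
or as broad as the cataloguing data warrants, which is exactly the
flexibility I contend the DOM provides.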


2.1.  HISTORICAL LACK OF PRECISION IN SOLR/LUCENE.

Solr/Lucene may have been a poor choice during the 2004 - 2006 period of
sponsoring Perl Zoom and developing Zebra in Koha.  Lucene had originally
been developed for full-text indexing of unstructured documents.  Solr had
originally been merely an easy-to-configure front end to a subset of
Lucene functionality.  Solr became a popular choice for the simplest free
software OPACs.  I have always tried to subject choices taken in Koha to
personal reconsideration and made a modest investigation of the
capabilities of Lucene and Solr/Lucene in 2007.  I consulted widely and
attended some conferences asking questions of the most expert implementers
of library automation systems who had been using Lucene or Solr/Lucene.  I
tried to consult with people working to solve real problems rather than
merely relying upon possibly incomplete documentation.  In 2007, Solr
provided no support for indexing to serve important concepts used for
obtaining precision in library systems.


2.1.1.  ASPECTS OF PRECISION HISTORICALLY UNSUPPORTED BY SOLR/LUCENE.

Hierarchy, where some content is subsidiary to other content and content
derives meaning from its place in the hierarchy, had no support in Solr
circa 2007.  The relationship of fields to subfields is an example of
hierarchy in MARC records.  Namespace hierarchies are examples of
hierarchy in XML records and are accessible by XPath queries.  Hierarchy
is a fundamental feature of classification and retrieval for easily
including wanted record sets and excluding unwanted record sets.
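As an illustration of the hierarchy at stake, a short sketch using only
Python's standard library; the MARCXML fragment is invented for the
example, and the XPath keeps subfields tied to their parent field rather
than flattening them:

```python
# Sketch: field-to-subfield hierarchy in a MARCXML-style record,
# addressed with an XPath expression.  The record content is made up.
import xml.etree.ElementTree as ET

record = ET.fromstring("""
<record>
  <datafield tag="650">
    <subfield code="a">Philosophy</subfield>
    <subfield code="x">History</subfield>
  </datafield>
  <datafield tag="245">
    <subfield code="a">An example title</subfield>
  </datafield>
</record>
""")

# XPath preserves the hierarchy: only subfields *within* a 650 field
# match, so a subject index never picks up the title subfield $a.
subjects = [sf.text for sf in
            record.findall(".//datafield[@tag='650']/subfield")]
print(subjects)  # ['Philosophy', 'History']
```

An indexer that discards the parent field and keeps only a flat bag of
subfield values cannot make this distinction.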

Sequential order, where the order of separate record sub-elements is
relevant to meaning, had no support in Solr circa 2007.  Philosophy -
History, meaning 'history of philosophy', is an entirely different subject
from History - Philosophy, meaning 'philosophy of history'.  Note the
inversion of word order between the individual controlled vocabulary
elements and the corresponding English phrase with the same meaning.  The
sequential order of fields within a record or MARC subfields within a
particular field are examples of sequential order in MARC records.  The
sequential order of namespaces within a record and the order of repeated
elements within the same namespace are examples of sequential order in XML
records accessible by XPath queries.  Sequential order is a fundamental
feature of meaning in language and is not necessarily reducible to phrase
strings where interceding terms may or may not be present and word order
may be inverted as in the example given.
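The point about sequential order can be sketched in the same way; the data
is again invented, and the comparisons only illustrate what an index must
preserve:

```python
# Sketch: subfield order carries meaning.  Philosophy - History (history
# of philosophy) must not match a query for History - Philosophy
# (philosophy of history).  A keyword bag loses that distinction; an
# ordered subfield sequence keeps it.
import xml.etree.ElementTree as ET

field = ET.fromstring("""
<datafield tag="650">
  <subfield code="a">Philosophy</subfield>
  <subfield code="x">History</subfield>
</datafield>
""")

# findall returns subfields in document order.
sequence = [sf.text for sf in field.findall("subfield")]

# Ordered comparison matches the heading as entered...
print(sequence == ["Philosophy", "History"])       # True
# ...but not the inverted heading with the opposite meaning.
print(sequence == ["History", "Philosophy"])       # False
# A mere keyword bag cannot tell the two headings apart.
print(set(sequence) == {"History", "Philosophy"})  # True
```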


2.1.2.  ALTERNATIVES FOR PRECISION USING LUCENE.

In 2005, work at Bibliothèque de l'Université Laval (originators of
RAMEAU) had produced LIUS (Lucene Index Update and Search) to overcome
some limitations of Lucene, adding fielded indexing of the very simplest
flat-field metadata found in some general purpose document types and XPath
indexing for XML documents, http://sourceforge.net/projects/lius/ .
Laval now uses the Solr/Lucene based Constellio,
http://www.constellio.com/ .

In 2007, I had been informed by a programmer of library automation systems
working in the pharmaceutical industry, if I remember his job correctly,
that hierarchical indexing and sequential indexing could be done in Lucene
but that there was no support for such indexing in Solr.  Precision is
very important for both scientific and business purposes in the
pharmaceutical industry.  Despite valid criticism of some business
practises within the pharmaceutical industry, lives are often at stake in
their work.

We should treat the quality of information retrieval in library automation
systems as if lives are at stake.  Lives will sometimes be at stake in the
research which people do.


2.2.  CONSEQUENCES OF LACK OF PRECISION.

Sadly, the concept of precision has not signified in the minds of those
developing the popular free software OPACs using Solr/Lucene, or some of
their non-free equivalents.  Examples of the consequences, from which Koha
is not excluded, are: using only $a in faceting despite the presence of
other important subfields; jumbling all the subfields from all similar
fields independently; and returning irrelevant results because subfields
have been treated as mere independent keywords devoid of contextual
meaning, even in the context of a query using an authority controlled
field.

Human nature, to which Koha is not immune, carries some impetus to
oversimplify for an expected advantage.  Oversimplification in the context
of a library automation system could eliminate the user's ability to
access the real complexity and richness of relationships in bibliographic
records in exchange for speed or robustness.  Such oversimplification
exists to a large extent in every actual library automation system.

I may be raising a false alarm about the possibility that some feature
advance may complicate or block better improvements in the future. Yet, I
prefer to take a vigilant stance rather than be sorry later for not having
raised a concern.


2.3.  CURRENT SUITABILITY OF SOLR/LUCENE.

I note significant improvements identified in the Solr/Lucene changelog
from version 1.3 in 2008 and later.

The DataImportHandler was added in version 1.3.  DataImportHandler has
options for XPath-based indexing.
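As a hedged sketch of those options, a DataImportHandler configuration can
map a restricted XPath subset onto Solr fields via XPathEntityProcessor.
The file name, field names, and record layout below are invented for
illustration:

```xml
<!-- Illustrative Solr DataImportHandler configuration (Solr 1.3+).
     XPathEntityProcessor iterates over records in an XML file and
     fills Solr fields from a limited XPath subset, which does include
     attribute-equality predicates such as [@tag='245']. -->
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="record"
            processor="XPathEntityProcessor"
            url="marcxml-export.xml"
            forEach="/collection/record">
      <field column="title"
             xpath="/collection/record/datafield[@tag='245']/subfield[@code='a']"/>
      <field column="subject"
             xpath="/collection/record/datafield[@tag='650']/subfield"/>
    </entity>
  </document>
</dataConfig>
```

Note that even here the subject mapping flattens the 650 subfields into
one multivalued field; whether the XPath subset suffices to preserve the
hierarchy and order discussed above is part of what experimentation would
need to establish.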

Solr still seems to have no support for ordered proximity searches.
Perhaps XPath-based indexing would address the problem.  A possible
workaround, modifying the Lucene code in SolrQueryParser to return
SpanNearQuery instead of PhraseQuery, may be a very undesirable remedy,
breaking one feature to fix another.

Whether the improvements in Solr/Lucene are sufficient to overcome the
past limitations which I have identified would require experimentation.


3.  SUPPORT MODELS FOR NEEDED PROGRAMMING LIBRARIES.

It is good that companies such as Knowledge Integration,
http://www.k-int.com/ , developers of JZKit, http://www.k-int.com/jzkit ,
are providing some free software competition and complementary work to
what is available from Index Data.

Note that the JZKit developer, Ian Ibbotson, uses Yaz as a Z39.50 client,
http://k-int.blogspot.com/2008/05/exposing-solr-services-as-z3950-server.html
, leaving a dependency on Index Data for client-side Z39.50 services.

There is some prevarication at Index Data against fully embracing free
software in everything they do.  Inevitably, they need revenue to be
sustainable.  The following thought about a possible shortage of Index
Data development time, and its consequences, is merely speculative but not
uninformed.  Index Data may not have enough developers with sufficient
experience to further the development of the underlying programming
libraries which we use at the rate the library community hopes.
Contracting for Index Data development in the absence of sufficient
development time to go around might amount to bidding for the priority of
the development which you need as much as to sharing the cost of
development with others.

Would working with Knowledge Integration which has even fewer developers
be significantly different in terms of development costs?  Does Knowledge
Integration need less money for a given amount of work than Index Data
does to be sustainable?

Consider that JZKit seems to have no documentation worth identifying as
such.  The source code repository contains about four pages of
outlines for documentation with only one sentence of actual content,
http://www.k-int.com/developer/downloads .  There are some comments in the
source code which I understand are used as documentation for JZKit.  Yet
the comments are too few and incomplete to be of sufficient use to me and,
from what I have noted, to others as well.  There are some virtually empty
example configuration files which could be used as a basis for speculating
how configuration works.  JZKit supposedly has a mailing list but I have
not found it.

Index Data does provide documentation even if we have often found it
inadequate for our needs in Koha development.  I suspect that sufficient
documentation at Knowledge Integration, as at Index Data, requires a
support contract and, as we know, carries no guarantee of completeness.
Writing clear
and thorough documentation is hard work.  Writing documentation is the
last thing which programmers generally want to do.  Lack of good
documentation is a common characteristic of free software.

If JZKit also turned out to need some missing feature or better
functionality, would the situation be any different for Knowledge
Integration development than for Index Data development?  See the
unfortunate position of Knowledge Integration on GPL contributions or AGPL
3 contributions in the case of JZKit,
http://www.k-int.com/developer/participate .

The library community needs to find the means of working more
cooperatively to ensure a steady availability of development resources at
companies such as Index Data and Knowledge Integration for sustainable
shared development.

I hope that Index Data may eventually be won over from their sometimes
prevaricative position towards free software.  Yet they need to be
sustainable by some means.  I do not find the position of Knowledge
Integration to be any different and note that they do not have a link to
the source code repository for OpenHarvest,
http://www.k-int.com/developer/downloads .  Index Data does have a long
history of supporting free software for libraries.  Index Data also makes
an extraordinary, almost unbelievable, promise in their support contracts
to fix any bug within a set number of days.

The issue of how to share the cost of support contracts for programming
libraries provided by companies such as Index Data or Knowledge
Integration across multiple Koha support companies or even outside of the
Koha community needs to be considered.

[...]


Thomas Dukleth
Agogme
109 E 9th Street, 3D
New York, NY  10003
USA
http://www.agogme.com
+1 212-674-3783




