[Koha-devel] Search Engine Changes : let's get some solr

Mon Oct 11 10:57:21 CEST 2010

Reply inline:

On Sun, October 10, 2010 22:26, glawson at rhcl.org wrote:

[...]

1.  PRECISION AND BUGS IN KOHA ZEBRA IMPLEMENTATION.

> Does the term "precision" have a meaning significantly different when
> applied to the indexing of databases than it does in general use?

The ordinary language use of precision has basically the same meaning as
the more specialist uses which measure precision mathematically.

>
> Quote:
>
> "The precision of a measurement system, also called reproducibility or
> repeatability, is the degree to which repeated measurements under
> unchanged conditions show the same results."
> http://en.wikipedia.org/wiki/Accuracy_and_precision

The measurement theory Wikipedia article is fine.  The Wikipedia article
treating the same concepts for information retrieval is
http://en.wikipedia.org/wiki/Precision_and_recall .  Precision has a
disambiguation page, http://en.wikipedia.org/wiki/Precision .
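
To make the information retrieval sense concrete, here is a minimal sketch
of how precision and recall are computed for a single query.  The function
name and the record ids are hypothetical and purely illustrative; nothing
here is taken from Koha or Zebra.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the records the search engine returned
    relevant:  ids of the records a human judged relevant to the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant records that were actually returned

    # Precision: what fraction of the returned records are relevant?
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the relevant records were returned?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical record ids, for illustration only.
    p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
    print(f"precision={p:.2f} recall={r:.2f}")
```

With these made-up ids, two of the four returned records are relevant and
two of the five relevant records are returned, so a result set can be
fairly precise and still miss much of what the user wanted.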

>
> I was not aware that precision was a problem with Zebra, although we have
> found, to our great dismay with our recent implementation of Koha, that
> relevancy, which I would call accuracy, is. More correctly and
> scientifically stated perhaps, relevancy sucks.

I did not assert that precision is necessarily a problem with Zebra. 
Henri-Damien Laurent listed some Zebra bugs which could be construed as
affecting precision by not returning a result set or returning a result
set in which some multi-byte characters are mangled.  Zebra bugs need to
be fixed.

What I did assert is that we are not yet using some options in Zebra's
Z39.50 support which allow for better precision when matching sets of
controlled terms, and which are especially useful for subject and
classification based queries.  There has not been sufficient programming
time available to
implement the options to which I referred.  The underlying Koha
implementation of authority control and support for classification needs
improvement first.
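
As an illustration only, and not necessarily the options I have in mind
above, the following sketch builds PQF query strings using standard Bib-1
attributes.  The helper name pqf_term is hypothetical.  Whether a given
Zebra configuration honours each attribute combination is an assumption
which would need to be verified against the index definitions.

```python
# Bib-1 attribute values (Z39.50 Bib-1 attribute set).
USE_SUBJECT = 21           # type 1 (use): subject heading
POSITION_FIRST_IN_FIELD = 1  # type 3 (position)
STRUCTURE_PHRASE = 1       # type 4 (structure)
TRUNCATION_NONE = 100      # type 5 (truncation): do not truncate
COMPLETENESS_FIELD = 3     # type 6 (completeness): complete field


def pqf_term(term, use, exact=False):
    """Build a PQF query for one term (hypothetical helper).

    With exact=True the term must match the whole field as a phrase,
    which suits controlled headings better than keyword matching.
    """
    if '"' in term:
        # This sketch does not attempt PQF escaping.
        raise ValueError("term must not contain a double quote")
    attrs = [f"@attr 1={use}"]
    if exact:
        attrs += [
            f"@attr 3={POSITION_FIRST_IN_FIELD}",
            f"@attr 4={STRUCTURE_PHRASE}",
            f"@attr 5={TRUNCATION_NONE}",
            f"@attr 6={COMPLETENESS_FIELD}",
        ]
    return " ".join(attrs) + f' "{term}"'


if __name__ == "__main__":
    # Loose keyword-style subject search.
    print(pqf_term("Information retrieval", USE_SUBJECT))
    # Whole-heading match, intended for controlled subject terms.
    print(pqf_term("Information retrieval", USE_SUBJECT, exact=True))
```

The second form trades recall for precision: it should exclude records
whose subject headings merely contain the words somewhere in a longer
heading.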

I suspect that the relevancy issues to which you are referring are
different and relate to ranking the result set, possibly adding indexes
appropriate to your organisation, and using appropriate fielded queries. 
You would need to describe the problematic results which you get for
particular queries for anyone to know which problem you are referring to.
Chris Cormack may have identified at least one of the problems for you in
his reply:
http://lists.koha-community.org/pipermail/koha-devel/2010-October/034470.html

> Do I incorrectly
> associate accuracy with relevancy?

I think that you correctly associate accuracy with relevancy.

I am not certain where recall would fit in comparing information retrieval
terms with measurement theory terms.

>
> Quote:
>
> "...accuracy of a measurement system is the degree of closeness of
> measurements of a quantity to its actual (true) value."
> IBID

2.  CONSIDERING OPTIONS CAREFULLY.

My concern is that we should not completely abandon the Zebra indexing
system which we now have because of some bugs which could be fixed.  We
could have a sophisticated Z39.50/SRU server and Solr/Lucene indexing,
with the bugs fixed, by working with Index Data under a moderate service
contract.
Given the undocumented but apparently unsophisticated feature set of
JZKit, I suspect that it will be more expensive to have Knowledge
Integration develop a sufficiently sophisticated Z39.50/SRU server
implementation in JZKit.

The large Solr/Lucene community may support many of our interests well
without needing to depend upon us financially, which would be a real
advantage.  However, such a non-library community, despite the presence of
some library community members, is liable to take a very long time to
appreciate the value of sophisticated implementations of precision to
support all types of library queries well.

[...]

Thomas Dukleth
Agogme
109 E 9th Street, 3D
New York, NY  10003
USA
http://www.agogme.com
+1 212-674-3783