[Koha-devel] Solr / zebra / search in Koha 3.10 => starting a workgroup

Fri Mar 30 18:15:14 CEST 2012

Hello all,

As you know, our main goal for the October 2012 release of Koha is to
introduce Solr as an alternative search engine.
BibLibre has already explained the improvements this search engine will
bring on its blog:
http://drupal.biblibre.com/en/blog/entry/solr-developments-for-koha

During the hackfest in Marseille, a group of four people (Claire,
Henri-Damien, Juan and Zeno) worked out how this could be introduced
smoothly. The first goal is that a library wanting to keep running Zebra
still can. As some libraries may want to use a search engine other than
Zebra or Solr, we want to follow a path that leads to better modularity.
I also think most of us agree that the current search code is ugly and
very hard to maintain or improve.

The hackfesters have produced a drawing explaining how we could name the
different packages:
https://docs.google.com/a/biblibre.com/drawings/d/1ZdsQsoThYgIVSgH3LqgRZy17xm9X7XkLT6RG3fDYCzs/edit,
with a page on the wiki:
http://wiki.koha-community.org/wiki/Switch_to_Solr_RFC#.23kohahack12

In this drawing (read from bottom to top), there are two main layers,
"Search" and "Index", which are responsible for searching and indexing
respectively. The "Conf" object will be responsible for retrieving the
configuration (the current getIndexes), the "Query" object for building
the query in the grammar of the search engine in use, and the "Plugin"
object for processing records before indexing (for example, normalizing
data).
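
To make the layering concrete, here is a minimal Python sketch. It is not
Koha code and not the API of the prototypes on GitHub: every class and
method name below is hypothetical, the engine is a toy in-memory stand-in
for Zebra or Solr, and the "index:term" query grammar is an assumption
made only for illustration.

```python
class Conf:
    """Holds the index configuration (the role getIndexes plays today)."""

    def __init__(self, indexes):
        self._indexes = dict(indexes)  # index name -> record field

    def get_indexes(self):
        return dict(self._indexes)


class Plugin:
    """Hypothetical pre-indexing hook: trims and lowercases every value."""

    def process(self, record):
        return {field: value.strip().lower() for field, value in record.items()}


class Query:
    """Builds a query string in the toy engine's 'index:term' grammar."""

    def build(self, index, term):
        return f"{index}:{term.strip().lower()}"


class InMemoryEngine:
    """Toy backend standing in for Zebra or Solr; purely illustrative."""

    def __init__(self):
        self._postings = {}  # (index, value) -> set of record ids

    def add(self, index, value, record_id):
        self._postings.setdefault((index, value), set()).add(record_id)

    def lookup(self, query):
        index, _, value = query.partition(":")
        return sorted(self._postings.get((index, value), set()))


class Index:
    """Indexing layer: runs the plugins, then feeds the engine."""

    def __init__(self, engine, conf, plugins=()):
        self.engine = engine
        self.conf = conf
        self.plugins = list(plugins)

    def index_record(self, record_id, record):
        for plugin in self.plugins:
            record = plugin.process(record)
        for index_name, field in self.conf.get_indexes().items():
            if field in record:
                self.engine.add(index_name, record[field], record_id)


class Search:
    """Search layer: delegates query building, then asks the engine."""

    def __init__(self, engine, query_builder):
        self.engine = engine
        self.query_builder = query_builder

    def search(self, index, term):
        return self.engine.lookup(self.query_builder.build(index, term))


# Example wiring; the field names are made up for the example.
conf = Conf({"title": "245a", "author": "100a"})
engine = InMemoryEngine()
indexer = Index(engine, conf, plugins=[Plugin()])
indexer.index_record(1, {"245a": "  Koha Manual ", "100a": "Doe, Jane"})
indexer.index_record(2, {"245a": "Solr in Action"})
searcher = Search(engine, Query())
```

In this sketch, only InMemoryEngine and Query know anything about the
engine's storage or grammar, so supporting another engine would mean
replacing those two pieces while Index, Search, Conf and Plugin stay
untouched.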

Claire (from BibLibre) made a first implementation of this organization,
available on GitHub:
https://github.com/clrh/wip-searchengine-layer/tree/master/lib/SearchEngine
Juan (from xercode) also worked on this organization, on the Zebra side.
His code is also available on GitHub:
https://github.com/xercode/Data-SearchEngine-Zebra
Henri-Damien is now continuing the work of implementing Zebra within
this global structure.

In the meantime, two other directions have been followed:
* Frédéric (Demians, from Tamil) wrote a daemon for Zebra indexing (see
http://git.tamil.fr/?p=Koha-Contrib-Tamil;a=summary). This resulted in
bug http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=7759, which
documents how to introduce this daemon for indexing. Liz (and maybe
others) is using it without any problems. The git repository also
contains some other tools, but what they actually do is not completely
clear to me (Frédéric, feel free to add some info...)
* Galen (Charlton, from Equinox) wrote some code that you can see in
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=7818; it is in
"needs signoff" status. The bug description covers a lot of things: DOM
indexing for biblios (plus a tool to automatically generate the DOM XSL
from record.abs), a data normalizer, and an indexer (Koha::Indexer).
Unless I've missed something (Galen, tell me if I'm wrong), for now only
the DOM indexing has been submitted; the normalizer and indexer have
not.

What we all agree on: we should have a clearer way to Normalize /
Index / Search in Koha. That's great!

The structure described by the hackfesters is great because it's
independent of the search engine you use.
I think large portions (if not all) of Koha::Contrib::Tamil could be
used to write the Zebra indexing layer.
I also think that the DOM indexing part of what Galen has submitted can
be signed off and pushed without any risk, but the normalizer and
indexer parts will need coordination, to avoid having BibLibre/xercode
working in one direction and Galen in another. I really like the idea of
a normalizer that is not necessarily tied to MARC; that could be useful
in the future.

That's why I propose organizing an IRC meeting (date and time to be
defined, but it will be in the European afternoon / US morning) with all
volunteers, to coordinate their efforts. I think this meeting should be
held regularly (monthly?).
After each meeting, a summary of the conclusions would be written up on
the wiki and posted to this mailing list.

My proposal: if you're interested in participating in this effort,
please reply to this mail. I'll then start a doodle to find a suitable
time. I propose 2 hours for the first meeting, then, hopefully, shorter
meetings. Juan/Galen/Zeno, you're already counted as being interested in
this topic ;-)
-- 
Paul POULAIN
http://www.biblibre.com
Expert en Logiciels Libres pour l'info-doc
Tel : (33) 4 91 81 35 08