[Koha-devel] Search Engine Changes : let's get some solr

LAURENT Henri-Damien henridamien.laurent at biblibre.com
Mon Oct 4 10:10:41 CEST 2010


Hi
As you may already have read in Paul's previous message about
"BibLibre strategy for 3.4 and next version", we are growing and want to
stay as involved in the community as before. Paul promised some POCs;
here is one. We have also worked on Plack and on support, and we built a
set of scripts to search for memory leaks. We'll demonstrate that later.


Zebra is fast and embeds a native Z39.50 server. But it also has some
major drawbacks that we have to cope with in our everyday work, making
it quite difficult to maintain.

   1.  Zebra config files are a nightmare. You cannot drive the
configuration easily: indexes cannot be listed, changed, or edited via
HTTP or through the application; everything is hardcoded in files on
disk. So you cannot say "I want this index in the OPAC, that one in the
intranet". (It could be done by scraping ccl.properties, then record.abs
and bib1.att… but what a HELL.) In short, you cannot easily customize
the configuration to define the indexes you want. And people do not get
a translation of the indexes, since they are all hardcoded in
ccl.properties and we have no translation process that would let CCL
attributes be translated into different languages.

   2.  No real-time indexing: relying on a crontab is poor. When you add
an authority while creating a biblio, you have to wait several minutes
before you can finish your biblio. (This might be solvable, since Zebra
can index biblios via Z39.50 extended services, but it is hard, would
need testing, and when the community first tested it a performance
problem showed up on indexing.)

   3. No way to access, process, or delete data easily. If your indexes
or your data have problems, you have to reindex everything, and indexing
errors are quite difficult to detect.

   4. During the indexing of a file, if there is a problem in your data,
zebraidx just fails silently… and that is NOT safe. You have no way to
know WHICH biblio made the process crash. We had a LOT of trouble with
the Aix-Marseille universities, whose Arabic transliterated biblios made
zebra/icu crash completely! We had to write a recursive script to find
the 14 biblios out of 730,000 that made Zebra crash (even though they
were properly stored & displayed).

   5. Facets are not working properly: they are computed only on the
displayed results, because there are problems with diacritics & facets
that cannot be solved as of today. And no one can provide a solution (we
discussed it with Index Data and no clear solution was provided).

   6. Zebra does not evolve anymore. There is no real community around
it; it is just an open-source Index Data product. We sent many questions
to the list and never got answers. We could pay for better support, but
the required fee is quite a deterrent and the benefit is still
questionable.

   7. ICU & Zebra are colleagues, not really friends: right truncation
does not work, fuzzy search does not work, and facets have problems.

   8. We use a deprecated way of defining indexes for biblios (GRS-1),
and the tool developed by Index Data for migrating to DOM indexing has
many flaws. We could manage and make do with it, but is it worth the
effort?

I think that everyone agrees that we have to refactor C4::Search.
Indeed, the query parser is not able to manage all the configuration
options independently. And using USMARC as the internal format for
biblios comes with a serious limitation: the ISO 2709 record length is
capped at 99,999 bytes, which is not enough for big biblios with many
items.
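To make that limit concrete, here is a small Python sketch of the
constraint: the ISO 2709 / MARC leader stores the total record length in
its first five characters, so no record larger than 99,999 bytes can be
encoded at all. (The function is illustrative, not Koha code.)

```python
# The ISO 2709 / MARC leader stores the total record length in its
# first five characters, so a record can never exceed 99999 bytes.
MAX_MARC_LENGTH = 99999  # largest value expressible in 5 decimal digits

def leader_length_field(record: bytes) -> str:
    """Return the zero-padded leader length field for a serialized MARC
    record, or raise if the record is too large to encode at all."""
    n = len(record)
    if n > MAX_MARC_LENGTH:
        raise ValueError(
            f"record is {n} bytes; ISO 2709 caps records at {MAX_MARC_LENGTH}"
        )
    return str(n).zfill(5)

print(leader_length_field(b"x" * 42))  # -> 00042
```

A biblio with many items easily blows past that cap once every item is
serialized into the record, which is exactly the problem described above.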

BibLibre has invested in a catalogue based on Solr.
A university in France contracted us for that development.
This university is in touch with the whole community here in France, and
Solr will certainly be adopted by libraries France-wide.
We are planning to release the code on our git early next spring and to
rebase onto whatever Koha version is current at that time, 3.4 or 3.6.


Why?

Solr indexes data over HTTP.
It can provide fuzzy search, search on synonyms, and suggestions.
It can provide faceted search and stemming.
UTF-8 support is built in.
The community is impressively reactive, numerous, and efficient.
And the documentation is very good and exhaustive.
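As a rough sketch of what "indexing over HTTP" means in practice, here
is how a record add and a faceted query look as plain HTTP requests
against a Solr core. The core URL and the field names (id, title,
author) are illustrative assumptions, not Koha's actual schema, and the
code only builds the requests rather than sending them.

```python
import json
from urllib.parse import urlencode

# Illustrative Solr core URL (an assumption, not a real Koha deployment).
SOLR = "http://localhost:8983/solr/biblios"

def update_request(doc):
    """Build the URL and JSON body for adding one document via /update."""
    return SOLR + "/update?commit=true", json.dumps([doc])

def facet_query(q, facet_field):
    """Build a /select URL asking for facet counts on one field."""
    params = {"q": q, "wt": "json", "facet": "true",
              "facet.field": facet_field}
    return SOLR + "/select?" + urlencode(params)

# Hypothetical record and query; the field names are made up.
url, body = update_request({"id": "bib1", "title": "Example record"})
print(url)
print(facet_query("jean", "author"))
```

Everything goes through two HTTP endpoints, /update and /select, which
is what makes indexes inspectable and editable from a web application,
unlike Zebra's on-disk config files.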

You can see the results on solr.biblibre.com and catalogue.solr.biblibre.com

http://catalogue.solr.biblibre.com/cgi-bin/koha/opac-search.pl?q=jean
http://solr.biblibre.com/cgi-bin/koha/admin/admin-home.pl
you can log in there with demo/demo as login/password

http://solr.biblibre.com/cgi-bin/koha/solr/indexes.pl
is the page where people can manage their indexes and links.

a) Librarians can define their own indexes, and there is a plugin that
fetches data from rejected authorities and from authorised_values
(something that could/should have been achieved with Zebra, but only
with major work on XSLT).

b) The line count of C4/Search.pm could be shrunk tenfold.
You can test from the poc_solr branch on
git://git.biblibre.com/koha_biblibre.git
But you have to install Solr first.

Any feedback/idea welcome.
-- 
Henri-Damien LAURENT
BibLibre
