[Koha-devel] Search Engine Changes : let's get some solr

MJ Ray mjr at software.coop
Thu Nov 11 14:09:48 CET 2010


LAURENT Henri-Damien wrote:
> involved in the community as previously. Paul promised some POCs, here
> is one available. [...]

Sorry for taking a while to look at this, but it raised so many
questions in my mind when I first read it, and I've been a bit busy,
so I thought I'd leave it a while and see if some were covered by
others.  Some were (thanks!) but many are left, so here we go:

What's a POC?  Piece Of Code?  (I assume it's not the C I'd usually
mean in that abbreviation ;-) )

> zebra is fast and embeds a native z3950 server.  But it also has some
> major drawbacks we have to cope with in our everyday life, making it
> quite difficult to maintain.
> 
>    1.  zebra config files are a nightmare.

I've librarians editing zebra config files.  They've seen far worse
from the awful library management systems of the past.

> You can't drive the
> configuration files easily.  Namely: you can't edit indexes via HTTP or
> configuration; everything is in files hardcoded on disk.

We could fix this by providing an HTTP interface if anyone wanted.
This isn't a problem unique to Zebra: some Koha configuration is
only in files on disk.  Being in a config file is not the same as
being hardcoded!

So, this is solvable if someone wanted it enough.  Does anyone
want me to take this enhancement forwards?
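To make that concrete, here's the rough shape of what I mean: a tiny
HTTP front-end over an index-definition file, so it could be read and
updated without shell access.  This is a hypothetical sketch in
Python, nothing that exists in Koha today; the path and port are made
up, and authentication plus the reload/reindex step after an edit are
deliberately left out:

    # Hypothetical sketch only: serve and accept an index-definition
    # file over HTTP.  Path and port are illustrative.
    import http.server

    CONFIG_PATH = "/etc/koha/zebradb/biblios/etc/record.abs"  # illustrative

    class IndexConfigHandler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            # hand back the current index definitions
            with open(CONFIG_PATH, "rb") as f:
                body = f.read()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.end_headers()
            self.wfile.write(body)

        def do_PUT(self):
            # accept a replacement definition file
            length = int(self.headers["Content-Length"])
            with open(CONFIG_PATH, "wb") as f:
                f.write(self.rfile.read(length))
            self.send_response(204)
            self.end_headers()

    if __name__ == "__main__":
        server = http.server.HTTPServer(("localhost", 8018), IndexConfigHandler)
        server.serve_forever()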

> [...] And people do not get a translation of the indexes, since all
> the indexes are hardcoded in ccl.properties and we do not have a
> translation process so that ccl attributes could be translated into
> different languages.

This sounds like a problem in our translation process.  Would the
translation manager like to consider it, please?

>    2.  no real-time indexing: the use of a crontab is poor: when you
> add an authority while creating a biblio, you have to wait several
> minutes to finish your biblio

This is being considered in bug 5165.  It's a problem in how we are
using Zebra, really.
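(For anyone following along at home, the crontab approach being
criticised is usually just a line along these lines, with -z limiting
the run to the queue of changed records; the exact path and frequency
vary by install:

    # every five minutes, index whatever is waiting in the zebraqueue
    */5 * * * * /path/to/koha/misc/migration_tools/rebuild_zebra.pl -b -a -z

So the delay is whatever the cron interval is, plus indexing time.)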

> (might be solved since zebra has some way to
> index biblios via z3950 extended services, but that is hard and should
> be tested, and when the community first tested it, a performance
> problem was raised on indexing.)

Does someone have a link to this performance problem, please?

>    3. no way to access/process/delete data easily.  If you have indexes
> in it or have some problems with your data, you have to reindex
> everything, and indexing errors are quite difficult to detect.

I'm not entirely sure what is being wanted here.  Indexing problems
have been a bit nasty on many systems.

>    4. during the indexing of a file, if you have a problem in your
> data, zebraidx just fails silently…

Example?

> And this is NOT secure.

What security data does zebra leak in this failure case?

> And you have
> no way to know WHICH biblio made the process crash. [...]

It's quite possible, but Koha has made that mistake too.  In one
recent less serious example, I found that Koha knew which biblio was
at fault, but didn't bother to report the biblio details in the
failure error message.  However, if it's an actual crash, it can be
difficult to generate an error from a crashing C process.  Maybe
you could dump core and examine it, but bisecting the input data
like HDL did is probably quicker.

>    5. facets are not working properly: they are built only on the
> results displayed, because there are problems with diacritics & facets
> that can't be solved as of today.  And no one can provide a solution
> (we spoke about that with indexdata and no clear solution was really
> provided).

Does anyone have a link to that conversation, please?  I'd like to
know more about it before we hit it for real.

>    6. zebra does not evolve anymore.  There is no real community around
> it; it's just an open-source indexdata product.  We sent many questions
> on-list and never got answers.  We could pay for better support, but the
> fee required is quite a deterrent and the benefit is still questionable.

It's disappointing there's no community, but that happens sometimes.
I guess we could try and make it part of our community, if it's
important enough.  It needs some different skills, but nothing
completely inconsistent with ours.

What fee is being asked for what benefit?

>    7. icu & zebra are colleagues, not really friends: right truncation
> not working, fuzzy search not working, and facets.

Those are pretty specific claims, directly contradicting
http://www.indexdata.com/zebra/doc/querymodel-rpn.html#querymodel-bib1-truncation
http://www.indexdata.com/zebra/doc/querymodel-zebra.html#querymodel-zebra-attr-scan
and so on.  Anyone else like to comment on them?
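For reference, the documented way to ask Zebra for right truncation is
the Bib-1 truncation attribute, e.g. a title search for comput* from
yaz-client in PQF, where 5=1 means right truncation (whether it still
behaves once ICU is switched on is exactly the question):

    find @attr 1=4 @attr 5=1 comput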

>    8. we use a deprecated way to define indexes for biblios (grs1), and
> the tool developed by indexdata to change to DOM has many flaws.  We
> could manage and make do with it.  But is it worth the effort?

I think so.

> I think that everyone agrees that we have to refactor C4::Search.
> Indeed, the query parser is not able to manage all the configuration
> options independently.  And the use of usmarc as the internal format
> for biblios comes with a serious limitation of 9999 bytes, which for
> big biblios with many items is not enough.

Where does that 9999-byte limit come from?  Could some methods from
the analytic records RFC give us a route around it?
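If the answer is the ISO 2709 framing, then the digits run out like
this: the leader gives five characters for the whole record's length
and each directory entry four for a single field's length.  A
back-of-envelope sketch in Python (the ~120-byte size for a 952 item
field is just a guess for illustration):

    # ISO 2709 / MARC framing limits
    MAX_RECORD_LENGTH = 10 ** 5 - 1   # leader: 5 digits  -> 99999 bytes per record
    MAX_FIELD_LENGTH  = 10 ** 4 - 1   # directory: 4 digits -> 9999 bytes per field

    # roughly how many ~120-byte 952 item fields fit before the record
    # limit bites, ignoring the rest of the biblio
    print(MAX_RECORD_LENGTH // 120)   # ~833, in the best case

Either way, records with very many items hit a wall somewhere.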

> BibLibre investigated a catalogue based on solr.
> A university in France contracted us for that development.
> This university is in contact with the whole community here in France,
> and solr will certainly be adopted by all the libraries France-wide. [...]

That's disappointing.  While it's not a problem for universities, the
big problem I see with Solr http://lucene.apache.org/solr/ is that it
is Java, which poses many management challenges for smaller libraries
and requires very different skills to current Koha deployments.  It
seems a bit like throwing the baby out with the bath water, at first
glance.

> Solr indexes data with HTTP.

Why is this the top benefit?  We're smart enough to write for whatever
protocol, so I'm not sure I understand.
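For anyone who hasn't looked at it, the mechanics are just an HTTP
POST of an XML payload to Solr's update handler.  A minimal sketch in
Python, assuming a stock instance on localhost and with made-up field
names (not Koha's schema):

    # Post one document to a stock Solr instance and commit it.
    import urllib.request

    SOLR_UPDATE = "http://localhost:8983/solr/update"

    doc = """<add>
      <doc>
        <field name="id">biblio_42</field>
        <field name="title">An example record</field>
        <field name="author">Doe, Jane</field>
      </doc>
    </add>"""

    def post_xml(xml):
        req = urllib.request.Request(
            SOLR_UPDATE,
            data=xml.encode("utf-8"),
            headers={"Content-Type": "text/xml; charset=utf-8"},
        )
        return urllib.request.urlopen(req).read()

    post_xml(doc)          # send the document
    post_xml("<commit/>")  # make it searchable

Convenient, yes, but as I say, we could put an HTTP layer in front of
almost anything.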

> It can provide fuzzy search, search on synonyms, and suggestions.
> It can provide facet search and stemming.

In theory, so can Zebra.
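For comparison, the facet side of a Solr query is just a couple of
request parameters.  Another sketch against a stock instance, with
made-up field names:

    # A faceted search: author facet counts alongside the results,
    # with right truncation in the query (histor*).
    import urllib.parse, urllib.request

    params = urllib.parse.urlencode({
        "q": "title:histor*",
        "facet": "true",
        "facet.field": "author",
        "wt": "json",
    })
    url = "http://localhost:8983/solr/select?" + params
    print(urllib.request.urlopen(url).read().decode("utf-8"))

Both sketches assume the mapping from MARC into sensible fields has
already been decided, which is the harder part with either engine.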

> utf8 support is embedded.

Hmmm, we'll see.  (I thought Java internally preferred some other
Unicode form.)

> The community is really impressively reactive, numerous and efficient.
> And the documentation is very good and exhaustive.

Well, those are both good things.

> a) Librarians can define their own indexes, and there is a plugin that
> fetches data from rejected authorities and from authorised_values (that
> could/should have been achieved with zebra but only with major work on
> xslt).

I've read that to be so configurable online, Solr must be allowed to
create files in its own installation directory, which seems like a
security problem (or at least, against Debian policy).  Is that true?
For example
http://sourceforge.net/mailarchive/forum.php?thread_name=ABC31E122EEAC44897B337BDEA897736BE58B9663A%40VUEX2.vuad.villanova.edu&forum_name=vufind-general

Can we predefine indexes?

But it could be achieved with zebra?

> b) C4/Search.pm's lines of code could be shrunk ten times.
> You can test from poc_solr branch on
> git://git.biblibre.com/koha_biblibre.git
> But you have to install solr.

Among other questions, how does its performance compare?

Are there drawbacks to adopting it that counterbalance the benefits of
overcoming the above-stated problems with Zebra?

Hope that helps move the discussion on,
-- 
MJ Ray (slef), member of www.software.coop, a for-more-than-profit co-op.
Past Koha Release Manager (2.0), LMS programmer, statistician, webmaster.
In My Opinion Only: see http://mjr.towers.org.uk/email.html
Available for hire for Koha work http://www.software.coop/products/koha

