[Koha-devel] Search Engine Changes : let's get some solr

Fouts, Clay cfouts at liblime.com
Mon Oct 4 22:16:43 CEST 2010


Adding to the aforementioned limitations and desired improvements, another
issue with Zebra is that its index builds scale very poorly. Indexes must
be built serially, which is a seriously problematic limitation for large
catalogs. A parallelized index creation process would be a huge benefit.
Does Solr have that capability?
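
If Solr simply accepts concurrent HTTP updates, client-side parallelism
might already be enough. A rough, untested sketch (the chunk files and
the stock Solr URL are assumptions on my part):

    use Parallel::ForkManager;
    use LWP::UserAgent;

    # pre-split batches, already in Solr's <add><doc>... XML format
    my @chunks = glob('export/chunk-*.xml');
    my $pm     = Parallel::ForkManager->new(8);   # 8 concurrent workers

    for my $chunk (@chunks) {
        $pm->start and next;                      # fork one worker per chunk
        open my $fh, '<', $chunk or die "$chunk: $!";
        my $xml = do { local $/; <$fh> };         # slurp the whole chunk
        LWP::UserAgent->new->post(
            'http://localhost:8983/solr/update',
            Content_Type => 'text/xml',
            Content      => $xml,
        );
        $pm->finish;
    }
    $pm->wait_all_children;

Whether the server side keeps up under that kind of concurrent load is
exactly what I'd want to know.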

Also, what do you have in mind for continuing Z39.50 support? This is a
must-have feature for many libraries. I've been investigating the
possibility of using MongoDB or a similar dynamic indexer to replace Zebra,
but the need to write a Z39.50 front end adds a great deal more work to the
project.
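
To give an idea of the scope: the server skeleton itself is small if
something like Index Data's Net::Z3950::SimpleServer is used as the
server layer. A rough sketch, with all the real work hiding behind the
two stub handlers:

    use Net::Z3950::SimpleServer;

    sub search_handler {
        my ($args) = @_;
        # $args->{QUERY} arrives as RPN; it has to be translated into
        # whatever the backend (MongoDB, Solr, ...) understands.
        $args->{HITS} = 0;        # stub: run the search, report the count
    }

    sub fetch_handler {
        my ($args) = @_;
        # $args->{OFFSET} is the 1-based position in the result set; the
        # record must come back in the requested syntax, usually MARC.
        $args->{RECORD} = '';     # stub: ISO2709 blob for that record
    }

    Net::Z3950::SimpleServer->new(
        SEARCH => \&search_handler,
        FETCH  => \&fetch_handler,
    )->launch_server('z3950-frontend', @ARGV);

The query translation and record serialization behind those stubs are
where the "great deal more work" lives.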

Clay


On Mon, Oct 4, 2010 at 1:10 AM, LAURENT Henri-Damien <
henridamien.laurent at biblibre.com> wrote:

> Hi
> As you may have read in Paul's previous message about
> "BibLibre strategy for 3.4 and next version", we are growing and want
> to stay involved in the community as before. Paul promised some POCs;
> here is one. We have also worked on Plack and on support, and we built
> a base of scripts to hunt for memory leaks. We'll demonstrate that later.
>
>
> Zebra is fast and embeds a native Z39.50 server. But it also has some
> major drawbacks that we cope with in our everyday work, which make it
> quite difficult to maintain.
>
>   1. Zebra config files are a nightmare. You can't manage the
> configuration easily: indexes can't be listed, changed, or edited via
> HTTP or any interface, and you can't say "I want this index at the
> OPAC, that one in the intranet", because everything is hardcoded in
> files on disk. (It could be done by scraping ccl.properties, then
> record.abs and bib1.att... but what a HELL.) So you cannot easily
> customize the set of indexes you want. And people do not get a
> translation of the indexes, since they are all hardcoded in
> ccl.properties and we have no translation process that would let CCL
> attributes be translated into different languages.
>
>   2. No real-time indexing: relying on a crontab is poor. When you add
> an authority while creating a biblio, you have to wait several minutes
> before you can finish the biblio. (This might be solvable, since Zebra
> can index biblios via Z39.50 extended services, but it is hard, would
> need testing, and when the community first tested it, a performance
> problem was raised on indexing.)
>
>   3. No way to access, process, or delete data easily. If your indexes
> or your data have problems, you have to reindex the whole thing, and
> indexing errors are quite difficult to detect.
>
>   4. While indexing a file, if there is a problem in your data,
> zebraidx just fails silently... which is NOT safe, and you have no way
> to know WHICH biblio made the process crash. We had a LOT of trouble
> with the Aix-Marseille universities, which have some Arabic
> transliterated biblios that make zebra/icu crash completely! We had to
> write a recursive script (see the sketch after this list) to find the
> 14 biblios out of 730,000 that make Zebra crash (even though they are
> properly stored & displayed).
>
>   5. Facets are not working properly: they are computed only on the
> displayed results, and there are problems with diacritics & facets
> that can't be solved as of today. And no one can provide a solution
> (we spoke about that with Index Data and no clear solution was really
> provided).
>
>   6. Zebra does not evolve any more. There is no real community around
> it; it's just one open-source Index Data product. We sent many
> questions to the list and never got answers. We could pay for better
> support, but the required fee is quite a deterrent and the benefit is
> still questionable.
>
>   7. ICU & Zebra are colleagues, not really friends: right truncation
> is not working, fuzzy search is not working, and neither are facets.
>
>   8. We use a deprecated way of defining indexes for biblios (GRS-1),
> and the tool developed by Index Data for moving to DOM has many flaws.
> We could manage and live with it, but is it worth the struggle?
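>
> For the record, the "recursive script" mentioned in point 4 boils down
> to a bisect over the record set. A minimal sketch, where index_chunk()
> is a hypothetical helper that runs zebraidx on a batch of records and
> returns false when it crashes:
>
>     sub find_bad_records {
>         my (@records) = @_;
>         # the batch indexes cleanly: no culprit in here
>         return () if index_chunk(@records);
>         # a single record that still crashes is a culprit
>         return @records if @records == 1;
>         my $mid = int(@records / 2);
>         return (find_bad_records(@records[0 .. $mid - 1]),
>                 find_bad_records(@records[$mid .. $#records]));
>     }
>
> Each level halves the suspect set, so the 14 culprits among 730,000
> records turn up after a few hundred zebraidx runs rather than one run
> per record.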
>
> I think everyone agrees that we have to refactor C4::Search. Indeed,
> the query parser is not able to manage all the configuration options
> independently. And using USMARC as the internal format for biblios
> comes with a serious limitation of 9999 bytes, which is not enough for
> big biblios with many items.
>
> BibLibre has invested in a catalogue based on Solr.
> A university in France contracted us for that development.
> This university is in contact with the whole community here in France,
> and Solr will certainly be adopted by libraries France-wide.
> We are planning to release the code on our git early next spring and to
> rebase it on whatever Koha version is released at that time, 3.4 or 3.6.
>
>
> Why?
>
> Solr indexes data over HTTP.
> It can provide fuzzy search, search on synonyms, and suggestions.
> It can provide facet search and stemming.
> UTF-8 support is built in.
> The community is impressively reactive, numerous, and efficient.
> And the documentation is very good and exhaustive.
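>
> To make "indexes over HTTP" concrete: both feeding and querying Solr
> are plain HTTP calls. A minimal sketch, assuming a stock Solr install
> on localhost and hypothetical field names:
>
>     use LWP::UserAgent;
>     use URI;
>
>     my $ua   = LWP::UserAgent->new;
>     my $base = 'http://localhost:8983/solr';
>
>     # add one document, then commit: no config file edit, no full reindex
>     $ua->post("$base/update", Content_Type => 'text/xml',
>         Content => '<add><doc>'
>                   . '<field name="id">biblio_42</field>'
>                   . '<field name="title">Jean de Florette</field>'
>                   . '</doc></add>');
>     $ua->post("$base/update", Content_Type => 'text/xml',
>         Content => '<commit/>');
>
>     # faceted query on an (assumed) author field
>     my $uri = URI->new("$base/select");
>     $uri->query_form(q => 'jean', facet => 'true',
>                      'facet.field' => 'author');
>     print $ua->get($uri)->decoded_content;
>
> Compare that with scraping ccl.properties and record.abs to achieve
> the same thing with Zebra.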
>
> You can see the results on solr.biblibre.com and
> catalogue.solr.biblibre.com
>
> http://catalogue.solr.biblibre.com/cgi-bin/koha/opac-search.pl?q=jean
> http://solr.biblibre.com/cgi-bin/koha/admin/admin-home.pl
> you can log in there with the demo/demo login/password
>
> http://solr.biblibre.com/cgi-bin/koha/solr/indexes.pl
> is the page where people can manage their indexes and links.
>
> a) Librarians can define their own indexes, and there is a plugin that
> fetches data from rejected authorities and from authorised_values (this
> could/should have been achievable with Zebra, but only with major work
> on XSLT).
>
> b) The line count of C4/Search.pm could be shrunk tenfold.
> You can test from the poc_solr branch on
> git://git.biblibre.com/koha_biblibre.git
> but you have to install Solr first.
>
> Any feedback/idea welcome.
> --
> Henri-Damien LAURENT
> BibLibre
> _______________________________________________
> Koha-devel mailing list
> Koha-devel at lists.koha-community.org
> http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel