[Koha-bugs] [Bug 12478] Elasticsearch support for Koha

bugzilla-daemon at bugs.koha-community.org
Tue Sep 1 13:43:01 CEST 2015


http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=12478

--- Comment #100 from Jonathan Druart <jonathan.druart at bugs.koha-community.org> ---
(In reply to Robin Sheat from comment #91)
> (In reply to Jonathan Druart from comment #81)
> > Note the following:
> > MariaDB [koha_es_unimarc]>  insert into search_field (name, type) select
> > distinct mapping, type from elasticsearch_mapping;
> > Query OK, 73 rows affected, 57 warnings (0.05 sec)
> > Records: 73  Duplicates: 0  Warnings: 57
> > 
> > MariaDB [koha_es_unimarc]> show warnings;
> > +---------+------+--------------------------------------------+
> > | Level   | Code | Message                                    |
> > +---------+------+--------------------------------------------+
> > | Warning | 1265 | Data truncated for column 'type' at row 1  |
> 
> Hmm, I remember that, but I'm not 100% sure it mattered. Could be wrong
> though.

It's caused by inserting an empty string into an enum field.
I am not sure about the consequences.
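
To make the warning easier to reason about, here is a minimal Python sketch of
how a non-strict MySQL/MariaDB server treats a value that is not a member of
an ENUM: it stores the empty "error" value and raises warning 1265. This is a
simplified model, not the server's code. The function name, the ALLOWED
members and the sample rows are all hypothetical; the real definition of
search_field.type is not shown in this thread.

```python
def store_enum(value, allowed):
    """Simplified model of a non-strict ENUM assignment.

    Returns (stored_value, warned). A value outside `allowed` is stored
    as the empty string and counts as a "Data truncated" warning.
    """
    if value in allowed:
        return value, False
    return "", True


# Hypothetical enum members, for illustration only.
ALLOWED = ("string", "date", "number")

# Hypothetical (mapping, type) rows, standing in for elasticsearch_mapping.
rows = [("title", "string"), ("acqdate", "date"), ("subject", "")]

# Names of the rows that would trigger a warning on insert.
warned = [name for name, type_ in rows if store_enum(type_, ALLOWED)[1]]
print(warned)
```

A check like this, run before the insert, would at least list which mappings
carry an empty type.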

> Here it is:
> 
> http://elasticsearch.koha.catalystdemo.net.nz/files/koha_es_marc21.sql.bz2
> 
> it's not the best data, but it's good enough for messing about with.

Great, thanks. Another set of data :)

> > It comes from the 008
> > > "Pictura murală*" has "pubdate":"||||" (/_search?q=_id:39&pretty)
> > 008 090409|||||||||xx |||||||||||||| ||und||
> > > The Korean Go Association's learn to play go  "pubdate":"uuuu"
> > 008 971030muuuu9999nyua          000 0 eng 
> > 
> > But the index should not contain an invalid date.
> 
> Hmm. I don't know if we can put validation into the fixer rules. I'll have
> to explore that some further. Possibly also telling ES that this must be a
> number could cause bad data to get rejected, but it may reject the whole
> record, not sure.
> 
> Do you happen to know how zebra handles that?

Absolutely no idea.

> > For Solr (you can find the code on the BibLibre repo at
> > https://git.biblibre.com/biblibre/koha_biblibre/commits/dev/solr Browse
> > C4/Search/), we used a system of plugins. And there is a Date plugin
> > (https://git.biblibre.com/biblibre/koha_biblibre/blob/
> > bd38ce1811289fcfbd75a37ec99fc4cd3c5d37f4/C4/Search/Plugins/Date.pm) which
> > does this job.
> > A plugin can be linked to a mapping.
> 
> We probably can't directly reuse that, at present we're using Catmandu to do
> the data conversion and interfacing with ES for the most part. But it's
> possible I can hook something in somewhere.

We will have to do some data pre-processing before indexing the records.
I need to learn more about ES, but with Solr we had to process the date values
for date type mappings.
Otherwise it is not possible to query this index correctly (for instance
date ranges, sorting, etc.).
By the way, the date type is only used on acqdate and copydate; why not on
other dates (at least pubdate)?
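
As an illustration of the kind of pre-processing meant here, a minimal Python
sketch that reads Date 1 from a MARC21 008 field (character positions 07-10)
and keeps it only when it is a full four-digit year. It is neither the Solr
Date plugin nor a Catmandu fix; the function name is hypothetical, and
partially known dates such as "19uu" are simply dropped.

```python
import re
from typing import Optional

# Exactly four ASCII digits, e.g. "1985".
_YEAR = re.compile(r"[0-9]{4}")


def pubdate_from_008(field_008: str) -> Optional[int]:
    """Return Date 1 (008/07-10) as an int, or None if it is not a full year."""
    date1 = field_008[7:11]
    if _YEAR.fullmatch(date1):
        return int(date1)
    return None


# The two 008 values quoted above carry no usable year.
print(pubdate_from_008("090409|||||||||xx |||||||||||||| ||und||"))
print(pubdate_from_008("971030muuuu9999nyua          000 0 eng "))
```

Returning None lets the indexer omit pubdate for such records instead of
sending "||||" or "uuuu" to a date or number mapping.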

> (In reply to Jonathan Druart from comment #90)
> > Something else, there is a sort issue in the facets:
> > 
> > [Some entries]
> >  Zeitoun, Ariel,
> >  Ó Cadhain, Máirtín.
> >  Ślez, Ts..
> > 
> > Ó should be after O, not after Z.
> 
> Line 573 of opac/opac-search.pl does a sort with cmp, which isn't very
> unicode aware. I'm putting that in the not-my-problem bin as it's in
> upstream :)
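
The ordering problem can be reproduced outside Koha. A minimal Python sketch,
assuming it is acceptable to sort on the base letters with accents stripped:
this is only an approximation of real collation (the Unicode Collation
Algorithm, available in Perl through Unicode::Collate, would be the more
complete answer), and the function name is hypothetical.

```python
import unicodedata


def fold_key(label):
    """Sort key: base letters without combining marks, case-folded.

    The original label is kept as a tie-breaker so the order is stable
    between entries that differ only by accents.
    """
    decomposed = unicodedata.normalize("NFKD", label)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), label)


facets = ["Zeitoun, Ariel,", "Ó Cadhain, Máirtín.", "Ślez, Ts.."]

# A plain code point sort puts "Ó" and "Ś" after "Z"; the folded key does not.
print(sorted(facets))
print(sorted(facets, key=fold_key))
```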

Yes, and, IMO, there is a design issue here.
We should not reuse the pl and tt files.
How do you plan to add features that Zebra cannot provide? :)
I am not sure it will be maintainable to add conditions (if SE == 'ES') in the TT.
For instance, for the facets, we would like to display them as ES returns them
(ordered by most used), and add the number of occurrences.
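
A minimal Python sketch of that facet display, assuming the response has the
shape of an ES terms aggregation (a list of buckets, each with "key" and
"doc_count"). The function name, the "author" facet name and the counts are
made up for illustration; they are not taken from the Koha code or from real
data.

```python
def facet_labels(es_response, facet_name):
    """Return "value (count)" labels, most used first."""
    buckets = es_response["aggregations"][facet_name]["buckets"]
    # The sort is stable, so buckets with equal counts keep the order
    # in which they appear in the response.
    ordered = sorted(buckets, key=lambda b: b["doc_count"], reverse=True)
    return ["{} ({})".format(b["key"], b["doc_count"]) for b in ordered]


# Hypothetical response fragment with invented counts.
sample = {
    "aggregations": {
        "author": {
            "buckets": [
                {"key": "Zeitoun, Ariel,", "doc_count": 2},
                {"key": "Ó Cadhain, Máirtín.", "doc_count": 5},
                {"key": "Ślez, Ts..", "doc_count": 2},
            ]
        }
    }
}

for label in facet_labels(sample, "author"):
    print(label)
```

With the counts shown, ordering by usage also sidesteps the alphabetical
sorting question for facets.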
