[Koha-bugs] [Bug 12478] Elasticsearch support for Koha

bugzilla-daemon at bugs.koha-community.org
Fri Aug 28 02:19:14 CEST 2015


http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=12478

--- Comment #80 from Robin Sheat <robin at catalyst.net.nz> ---
(In reply to Jonathan Druart from comment #79)
> The first problem I had was finding a MARC21 DB (since the UNIMARC mappings
> are not defined, I cannot test with a UNIMARC DB).

The UNIMARC mappings should be defined, though they haven't been tested.

> I have used the one created for the sandboxes
> (http://git.koha-community.org/gitweb/?p=contrib/global.git;a=blob;f=sandbox/
> sql/sandbox1.sql.gz;h=19268bccb43b2a33d5644b7d86cbb1abb323016b;hb=HEAD). But
> there are only 436 biblios, which is not enough to test some things (facets,
> for instance).
> Or maybe you can share your DB?

I could, but I think we'll get more useful results from different databases.

> Here are some notes:
> 
> 1/ Add deps to C4/Installer/PerlDependencies.pm

Yeah, I'm mostly waiting for things to settle (which they have now.)

> 2/ The number of tests provided is very low.

Yes, I've been meaning to go back and add a pile more.

> 3/ catalyst/elastic_search is 1004 commits behind origin/master, please
> rebase

It's just a tedious process, so I keep putting it off :) I should do that
soon, though.

> 4/ The message "No 'elasticsearch' block is defined in koha-conf.xml" should
> be raised before starting the indexing process, not when committing the
> first batch.

Added to my TODO.
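
A minimal sketch of the fail-fast check asked for in point 4, written in
Python rather than the script's Perl. The function name and the sample config
layout (including the <server> element) are assumptions for illustration, not
Koha code; only the error message is taken from the report above.

```python
import xml.etree.ElementTree as ET


def require_elasticsearch_block(conf_xml: str) -> ET.Element:
    """Return the <elasticsearch> element, or raise before any indexing starts.

    Hypothetical helper: it only illustrates validating the configuration
    up front instead of discovering the problem on the first commit.
    """
    root = ET.fromstring(conf_xml)
    block = root.find(".//elasticsearch")
    if block is None:
        raise RuntimeError(
            "No 'elasticsearch' block is defined in koha-conf.xml"
        )
    return block


# Assumed, simplified config layout for the example.
SAMPLE_WITH_BLOCK = (
    "<yazgfs><config>"
    "<elasticsearch><server>localhost:9200</server></elasticsearch>"
    "</config></yazgfs>"
)
SAMPLE_WITHOUT_BLOCK = "<yazgfs><config></config></yazgfs>"
```

Calling require_elasticsearch_block() once at startup, before the first record
is read, would surface the missing block immediately.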

> 5/ You really need to tune the default value for the commit :)
> commit 100:  perl misc/search_tools/rebuild_elastic_search.pl -b  77.57s
> user 0.86s system 91% cpu 1:25.62 total
> commit 1000: perl misc/search_tools/rebuild_elastic_search.pl -b  24.68s
> user 0.52s system 79% cpu 31.595 total
> For Solr, we used 5000.
> Yes I know, it's configurable.

I just picked a number and haven't gone back to it. I'm also thinking about
dropping the committing entirely and just feeding records straight into
Catmandu, letting it do its own batching rather than doubling up on it. More
experimentation is needed, really, but increasing the default is definitely a
sensible thing to do.

FWIW, committing at 5,000:

real    2m14.627s
user    1m13.272s
sys     0m2.228s

100:

real    6m6.280s
user    4m45.268s
sys     0m2.828s

That's a fair difference :)
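
To make the effect of the commit size concrete, here is a small Python sketch
that splits a record stream into batches and counts how many commits result.
It is not how rebuild_elastic_search.pl works internally; the function names
are made up for the example, and the idea that per-commit overhead explains
the timings above is an assumption.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(records: Iterable[T], commit_size: int) -> Iterator[List[T]]:
    """Yield lists of at most commit_size records, in order."""
    if commit_size < 1:
        raise ValueError("commit_size must be at least 1")
    it = iter(records)
    while True:
        batch = list(islice(it, commit_size))
        if not batch:
            return
        yield batch


def count_commits(n_records: int, commit_size: int) -> int:
    """Number of commits needed to index n_records at a given commit size."""
    return sum(1 for _ in batches(range(n_records), commit_size))
```

For the 436-biblio sandbox database mentioned earlier, count_commits(436, 100)
gives 5 commits while count_commits(436, 1000) gives 1, so a larger default
means fewer round trips to the indexer.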

> 6/ Verbose does not work as expected, it could be fixed with

Oops. TODOed.

> 
> 7/ perl -e "use
> Pod::Checker;podchecker('misc/search_tools/rebuild_elastic_search.pl')";
> *** WARNING: empty section in previous paragraph at line 36 in file
> misc/search_tools/rebuild_elastic_search.pl
> *** ERROR: =over on line 38 without closing =back at line EOF in file
> misc/search_tools/rebuild_elastic_search.pl

TODOed.

> 8/ 2 occurrences of "Solr" reintroduced in installer/data/mysql/sysprefs.sql
> and koha-tmpl/intranet-tmpl/prog/en/modules/admin/preferences/admin.pref

Must have come about when merging. TODOed.

> 9/ Test!
> I have run some searches with the same DB (the one from the sandbox): one
> on a local install using your remote branch, and another using master
> (sandbox7, provided by BibLibre).
> 
> a. Search for 'd' (screenshot opac_search_for_d_sort_by_relevance.png, ES on
> the left, Zebra on the right).
> Main differences:
> - 183 vs 182 results (?) 

I wouldn't necessarily expect them to be the same, especially for a fairly
meaningless search.

> - the order is not the same (makes sense)
> - Locations and Places facets are missing

Yeah, they're not faceted yet. Added that to my TODO list before I forget
again.

> - 6 entries are displayed in the facets for ES (current behavior is 5). 
> 
> b. Search for 'd', sort by title AZ (screenshot
> opac_search_for_d_sort_by_title.png)
> - Zebra displays only 1 facet

That's probably Zebra being wrong then :)

> - The order is still completely different

I'm not sure which is right in this case, though I'm doing some work on the
sorting at the moment that would allow you to pick which of the fields that end
up in title you want to sort by. For example, it might be that ES is putting
the ones with a lower series title near the start, even though it displays a
different title. That'll be tuneable when I'm done with the current stuff.

> c. Search for 'harry', sort by title AZ (screenshot
> opac_search_for_harry_sort_by_title.png)
> - The 'Show more' link is displayed even if only 2 entries are available
> for a facet

Thought I'd fixed that, I'll have to have a look again.

> - The order is still different ("The discovery of heaven" should be sorted
> either before Dollhouse (if "the" is a stopword) or after "Hareios*")

Dollhouse probably has another title field that's actually being used, as noted
above.

> - The availability is wrong for ES (The item for Dollhouse is not for loan)

Why is it not for loan? Is it by policy, because there are no items, or because
all items are issued?

> d. Search for Books (limit by item type in the adv search), sort by pubdate
> (screenshot limit_by_book_sort_by_pubdate.png)
> - "Return to the last advanced search" link is not displayed

I wonder how it knows to show that...

I can't actually find that string in my checkout at all.

> - The item types facet contains several entries, which does not make sense

Curious. Are there situations where you have a biblio-level itemtype that
differs from the item-level item type, or where one biblio might have multiple
items with different item types? At the moment, I think they're all being
thrown into one facet pot.

> - The number of results differs greatly (395 vs 364)

Probably due to biblio-vs-item itemtype selection not being supported yet. If
you can find it giving you a record that plain shouldn't match though, that'd
be interesting.

> - The order is still completely different. I had a look in the index and
> found:
> "Pictura murală*" has "pubdate":"||||" (/_search?q=_id:39&pretty)
> The Korean Go Association's learn to play go  "pubdate":"uuuu"
> (/_search?q=_id:155&pretty)
> Where do these values come from? Shouldn't it be a date, or at least an
> integer?

Could be the mapping is funny/broken for that. My test system has things like:

"pubdate":"1998"

though, which implies that it's correct. The actual mapping comes from:

INSERT INTO `elasticsearch_mapping` (`indexname`, `mapping`, `facet`,
`suggestible`, `type`, `marc21`, `unimarc`, `normarc`) VALUES
('biblios','pubdate',FALSE,FALSE,'','008_/7-10','100a_/9-12','008_/7-10');

On the other hand, it does have:

"date-entered-on-file":"61006"

which doesn't look right no matter how you carve it.
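
For reference, in MARC21 the 008 positions 07-10 hold "Date 1", where "u"
stands for an unknown digit and "|" is the fill character (no attempt to
code), so "uuuu" and "||||" can legitimately appear in the source records. The
Python sketch below shows a range such as 7-10 being cut out of a fixed-length
field. Reading the '008_/7-10' mapping as 0-based inclusive positions is an
assumption, and the helper names and sample field values are made up for the
example.

```python
from typing import Optional


def extract_positions(field_value: str, start: int, end: int) -> str:
    """Return characters start..end (0-based, inclusive) of a fixed field."""
    return field_value[start:end + 1]


def pubdate_year(raw: str) -> Optional[int]:
    """Return the year as an int, or None for coded values like 'uuuu'."""
    if raw.isascii() and raw.isdigit():
        return int(raw)
    return None


def sample_008(date1: str) -> str:
    """Build an illustrative 40-character 008 field with the given Date 1."""
    return "980508" + "s" + date1 + " " * 4 + "xxu" + " " * 17 + "eng" + " " + "d"
```

With these helpers, a record whose Date 1 is "1998" yields the integer 1998,
while "uuuu" and "||||" come back as None, which an indexer could skip or map
to a missing value rather than storing the raw string.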

> It's not easy to know what is indexed where. Did you have a look at the
> indexes configuration page the Solr stuff had?
> It provided an interface to configure the different mappings, it was very
> useful.

I haven't yet got to the point where I have the time to make an interface. At
the moment it's all configured in elasticsearch_mapping.sql, which is somewhat
human readable/editable. After loading the data into a table, that file
rewrites it into a set of tables in a form that'll be more conducive to
having a GUI on top, but is less human readable.

BTW, if you add

<trace_to>Stderr</trace_to>

to the <elasticsearch> block, it'll dump all the chatter with ES out to stderr,
which is useful for seeing what exactly is going on. I warn you, there is a lot
there though.

Thanks for testing, even if I have a pile more things to fix now :)
