[Koha-bugs] [Bug 19893] New: Alternative optimized indexing for Elasticsearch

Fri Dec 29 14:27:10 CET 2017

https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=19893

            Bug ID: 19893
           Summary: Alternative optimized indexing for Elasticsearch
 Change sponsored?: ---
           Product: Koha
           Version: master
          Hardware: All
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P5 - low
         Component: Searching - Elasticsearch
          Assignee: koha-bugs at lists.koha-community.org
          Reporter: glasklas at gmail.com

At our library, perhaps owing to a larger-than-average number of biblios, a full
re-index takes an unacceptable amount of time to complete (> 24h). We also had an
issue with indexing becoming increasingly slow as new mappings were added.
After some profiling with NYTProf it became clear that most of this overhead is
in Catmandu::Store::ElasticSearch and Catmandu::MARC. After giving it some
thought, the simplest way to resolve this issue seemed to be to replace these
libraries with Koha-specific code, since the functionality they provide is not
that hard to re-implement more efficiently. Due to the complexity of Catmandu,
optimizing these libraries would most likely be more challenging (and some
parts cannot be optimized at all because of limitations in the architecture of
Catmandu/Fix). The main benefits include:

1) Increased indexing performance (about twice as fast overall, six times as
fast when comparing only the time spent in update_index()), due to more
efficient JSON conversion and fewer Elasticsearch requests.
2) With Catmandu, indexing speed decreases as more mappings are added; with the
alternative algorithm, indexing speed stays more or less constant no matter how
many mappings you add.
3) Negligible indexing start-up time. For example, we have an issue with the
book drop machine, where each return takes a couple of seconds because of the
Catmandu start-up overhead.
4) More transparent code and less complexity compared with Catmandu.
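
To illustrate points 1 and 2, here is a minimal sketch in Python (not the Perl
code from the patch) of the general idea: invert the mappings once into a
lookup table, walk each record a single time, and batch the resulting documents
into one Elasticsearch bulk request body. The mapping format, the record
layout, and all function names are assumptions made for illustration only.

```python
import json
from collections import defaultdict

# Hypothetical mappings: search field -> list of (MARC tag, subfield code).
MAPPINGS = {
    "title": [("245", "a"), ("245", "b")],
    "author": [("100", "a"), ("700", "a")],
    "subject": [("650", "a")],
}


def build_lookup(mappings):
    """Invert the mappings once: (tag, subfield) -> list of search fields."""
    lookup = defaultdict(list)
    for search_field, sources in mappings.items():
        for tag, code in sources:
            lookup[(tag, code)].append(search_field)
    return lookup


def record_to_document(record, lookup):
    """Walk the record's subfields once, with one dict lookup per subfield."""
    doc = defaultdict(list)
    for tag, code, value in record["subfields"]:
        for search_field in lookup.get((tag, code), ()):
            doc[search_field].append(value)
    return dict(doc)


def bulk_body(records, lookup):
    """Build one newline-delimited JSON body for the Elasticsearch _bulk API.

    Assumes the target index (and, on older versions, the type) is given in
    the request URL, so each action line only carries the document id.
    """
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_id": record["id"]}}))
        lines.append(json.dumps(record_to_document(record, lookup)))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    lookup = build_lookup(MAPPINGS)
    records = [
        {"id": "1", "subfields": [("245", "a", "Dune"),
                                  ("100", "a", "Herbert, Frank")]},
        {"id": "2", "subfields": [("245", "a", "Emma"), ("999", "c", "2")]},
    ]
    print(bulk_body(records, lookup), end="")
```

In this sketch the per-record work depends on the number of subfields in the
record rather than on a separate pass per mapping, and many records share a
single HTTP request.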

With this patch the largest bottleneck is instead MARC::Record::as_xml_record.
Using binary MARC21 as the serialization format would probably be a lot faster,
but I still chose MARC-XML because of the binary format's record length
limitation (which could be exceeded by records with many items). Still, I will
probably look into faster MARC-XML serialization options in the future to
address this.
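
For context on the length limitation: the binary format stores the total record
length in five digits of the leader, so a record cannot exceed 99999 bytes. The
Python sketch below estimates when a record with many embedded items would
cross that limit. The base and per-item sizes are made-up figures for
illustration, not measurements.

```python
# The MARC leader stores the record length in positions 00-04 (five digits).
MAX_RECORD_LENGTH = 99999


def fits_in_binary_marc(base_bytes, item_bytes, item_count):
    """Return True if the estimated record size fits the five-digit length."""
    return base_bytes + item_bytes * item_count <= MAX_RECORD_LENGTH


def max_items(base_bytes, item_bytes):
    """Largest item count that still fits, under the same rough estimate."""
    return (MAX_RECORD_LENGTH - base_bytes) // item_bytes


if __name__ == "__main__":
    # Assumed sizes: 2000 bytes of bibliographic data, 250 bytes per item.
    print(max_items(2000, 250))                  # 391
    print(fits_in_binary_marc(2000, 250, 500))   # False
```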

I also attach profiling results with and without the patch applied.
