[Koha-bugs] [Bug 25265] Elasticsearch - Batch editing items on a biblio can lead to incorrect index

bugzilla-daemon at bugs.koha-community.org bugzilla-daemon at bugs.koha-community.org
Fri Sep 11 14:17:32 CEST 2020


https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=25265

--- Comment #22 from Nick Clemens <nick at bywatersolutions.com> ---
(In reply to Marcel de Rooy from comment #19)
> General
> Module changes, no tests ? 

I will work on something here

> We are now fetching MARC records in ModZebra for Elastic. 

We were before these patches

> that if we use Elastic we should be leaving ModZebra as soon as we can ;)

Agreed, but that is a bigger job, this is addressing a bug

> The operation of sending biblionumbers and biblio records in two lists makes
> me a bit suspicious. There is no check anymore that these lists are in sync.
> If there would be a gap somehow, we would be indexing records on the wrong
> id? Might be just theoretical but looks suboptimal.

As you note later, we only pass a record from ModAuthority, otherwise we are
building the lists

> It seems that ModZebra is still inserting records into zebraqueue when we
> use Elastic. Why? Should we skip and cleanup ?

Z3950 was the original main reason, we have the z3950responder now, but keeping
zebra in sync
is not a bad thing. Eventually yes, but for now the dafety net has been useful.
Beyond the scope of this bug

> The for loop in ModZebra seems to be suboptimal? In a loop you are checking
> if a record count is zero and if so, insert one record. 
> Why dont you do one DELETE statement and one INSERT ? delete from
> zebra_queue where biblio_auth_number IN ($str) etc.
> Actually thinking about that: if you do one db call per record, why dont you
> add the loop in batch mod without touching ModZebra ?
> [Wrote this earlier: see my conclusion too.]

That is essentially what we have now, a loop in the batch operations, calling
ModZebra for each record.
This could be improved, but I don't think this adds complexity so much as
highlighting what we do

> 
> Koha/SearchEngine/Elasticsearch/Indexer.pm dates from 2013 but still
> includes:
> $indexer->update_index_background does just refer to update_index: TODO
> implement in the future - I don't know the best way of doing this yet.
> Same $indexer->delete_index_background # TODO: Should be made async

This is work we should do, we don't have sponsors or the infrastructure yet,
hopefully the task queue can help here

> 
> C4/AuthoritiesMarc.pm:    ModZebra( $authid, 'specialUpdate',
> 'authorityserver', $record );
> Is this the only use of the record parameter in ModZebra ? Apparently. For
> Zebra we don't need it, for Elastic we now do.

This code is weird, I agree, but beyond the scope of this bug


> This part in ModZebra is really troublesome too. We do not even check if we
> got a biblio or an authority record !
>             unless ($record) {
>                 $record = GetMarcBiblio({
>                     biblionumber => $biblionumber,
>                     embed_items  => 1 });
>             }

This code is beyond the scope here, we are in C4/Biblio.pm - we really should
only
deal with biblio, and we do, except for the call above


> Conclusion:
> Looks like we can still improve a lot of the Elastic code. 

Yes we definitely can

> Adding the
> current code in the old ModZebra is not the way to go imo. 

Rewriting and moving this will be a bigger work and this is a bug that affects
stables

> Move that code to
> SearchEngine by just adding a call. 

> Since we only have one use of the record
> parameter in ModZebra, I would be inclined to remove it for consistency. I
> understand that it is useful as optimization though.

Please file a new bug for this

> Adding a loop somewhere else around ModZebra is probably the pragmatic way
> too. We never call it for lists. Perhaps adding an iteration routine in a
> module ?

I don't see why looping in one place or the other matters. In fact, look at
this comment in the code
    # true ModZebra commented until indexdata fixes zebraDB crashes (it seems
they occur on multiple updates
    # at the same time


> Changing status.

Your comments are well appreciated Marcel, and they are true. We have technical
debt here, and we need to rethink some of our methods.
You have taken a very broad look at the code, and identified issues that affect
many areas.
Currently though, we have a bug that is affecting users of ES, and we need to
support these users so we can continue to test and
improve the ES code. I attempted to take a path here that makes smaller
changes, the basic idea of calling for index updates by batch rather than
individually I 
think is sound.

I will work on tests and take a look at the Zebra loop and would ask you to
reconsider some of the larger points as future enhancements

-- 
You are receiving this mail because:
You are watching all bug changes.


More information about the Koha-bugs mailing list