[Koha-devel] Proposed "metadata" table for Koha

David Cook dcook at prosentient.com.au
Wed Dec 9 00:27:12 CET 2015


Tomas, do you mean that “rebuild_zebra.pl -daemon” is used from 3.22 onward, or that it’s the preferable way of doing it but not the actual default? I’m a bit confused : ).

 

For the moment, I’m going to write the OAI-PMH importer to use Zebra for matching, and we’ll see during the testing phase if it’s fast enough. 

 

In theory, a stale Zebra should only matter if a new record is downloaded via OAI-PMH and not added to Zebra before an updated version of that record is downloaded and actioned. That scenario should be rare, although in practice I can imagine someone adding a record and then fixing a spelling mistake a second or two later, and those two rapid-fire changes reaching downstream close together. I may also try to build some heuristics into the OAI-PMH importer so that it checks its own OAI-PMH tables before trying to check Zebra. That could help mitigate scenarios where the Zebra queue gets backed up or Zebra dies but there is already a trail of previous OAI-PMH updates that can be used for matching. It just wouldn’t cover the case where a record already exists in the catalogue via a non-OAI-PMH import and Zebra is down or unavailable; in that case, there would be duplicate records. But without a way of checking the source of truth, I don’t see any other options…
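For what it’s worth, the fallback order I have in mind could be sketched as a small decision helper (illustrative only: the actual lookups against the importer’s own OAI-PMH tables and the Zebra health check are assumed to happen elsewhere, and none of these names are real Koha code):

```shell
# choose_match_source LOCAL_MATCH ZEBRA_UP -> which matcher to trust.
choose_match_source() {
    local_match="$1"   # "yes" if a prior harvest of this OAI identifier exists
    zebra_up="$2"      # "yes" if Zebra answered a health check
    if [ "$local_match" = "yes" ]; then
        echo "oai-tables"        # trust our own harvest history first
    elif [ "$zebra_up" = "yes" ]; then
        echo "zebra"             # fall back to the Zebra-based matching rule
    else
        echo "no-match-possible" # duplicates are possible, as noted above
    fi
}
```

The point of the ordering is just that the importer’s own tables can never be stale with respect to its own harvests, whereas Zebra can be.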

 

David Cook

Systems Librarian

Prosentient Systems

72/330 Wattle St, Ultimo, NSW 2007

 

From: Tomas Cohen Arazi [mailto:tomascohen at gmail.com] 
Sent: Wednesday, 9 December 2015 5:41 AM
To: David Cook <dcook at prosentient.com.au>
Cc: Jesse <pianohacker at gmail.com>; koha-devel <koha-devel at lists.koha-community.org>
Subject: Re: [Koha-devel] Proposed "metadata" table for Koha

 



2015-12-01 1:44 GMT-03:00 David Cook <dcook at prosentient.com.au>:
>
> My main concern about Zebra is with it not being fast enough. Tomas mentioned that Zebra runs updates every 5 seconds, but it looks to me like rebuild_zebra.pl (via /etc/cron.d/koha-common) is only run every 5 minutes on a Debian package install. At least that was the case in Koha 3.20. That’s a huge gap when it comes to importing records. Say you import record A at 5:01pm… and then you try to import record A again at 5:03pm using a matching rule. The update from 5:01pm hasn’t been processed yet, so you wind up with 2 copies of record A in your Koha catalogue.

You are right that the default setup is a cronjob that checks the queue every 5 minutes. I planned to change that default behaviour to set USE_INDEXER_DAEMON=yes in /etc/default/koha-common and have the cron line commented out, or made conditional on USE_INDEXER_DAEMON=no. But I got distracted in the couple of weeks before the release and forgot to post a patch for that. Anyway, rebuild_zebra.pl -daemon should be run by default.
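In concrete terms, the intended default might look something like this (a sketch, not the actual patch; the guard on the cron line is illustrative rather than what the package ships):

```shell
# /etc/default/koha-common — let the init script start the indexer daemon
# (rebuild_zebra.pl -daemon) instead of relying on cron:
USE_INDEXER_DAEMON="yes"

# /etc/cron.d/koha-common — only fall back to the 5-minute batch reindex
# when the daemon is disabled (illustrative; cron lines would need to
# source /etc/default/koha-common themselves to see the variable):
# */5 * * * * root . /etc/default/koha-common; [ "$USE_INDEXER_DAEMON" = "no" ] && koha-rebuild-zebra -q $(koha-list --enabled)
```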

 

> We run openSUSE and we define our own Zebra indexer, which does run every 5 seconds. I haven’t stress tested it yet, but 5 seconds might be a bit long even under ideal circumstances, if an import job is running every 2-3 seconds. Sure, that 2-3 seconds might be a bit optimistic… maybe it will also be every 5 seconds. But what happens if the Zebra queue backs up? Someone runs “touch_all_biblios.pl” or something like that and fills up the zebraqueue while you’re importing records. Zebra is going to be out of date.

True

 

> There needs to be a source of truth, and that’s the metadata record in MySQL. Zebra is an indexed cache, and while usually it doesn’t matter too much if that cache is a little bit stale, it can matter when you’re importing records.

I agree the importing step should rely on the source of truth.

 

> Another scenario… what happens if Zebra is down? You’re going to get duplicate records because the matcher won’t work. That said, we could mitigate that by programmatically double-checking that Zebra is actually alive before commencing an import. There could be other heuristics as well… like not running an import (which uses matching rules) unless the zebraqueue has no more than X waiting updates. Ideally, X would be 0, but that’s unlikely in large systems, and it’s impossible if you’re importing at a rate of more than one record every 5 seconds (and anything slower would be absurdly slow).

I wouldn't create a workaround for Zebra being down just to be able to match existing records... I would just print a red box saying the tool is not available.

 

> I am curious as to whether the default Zebra update time for Debian packages is 5 minutes or 5 seconds. While it doesn’t affect me too much as I don’t use Debian, it matters for regular users of Koha.

OK, I'll provide the aforementioned patch :-P

--
Tomás Cohen Arazi
Theke Solutions (http://theke.io)
✆ +54 9351 3513384
GPG: B76C 6E7C 2D80 551A C765  E225 0A27 2EA1 B2F3 C15F

 
