[Koha-devel] Proposed "metadata" table for Koha

David Cook dcook at prosentient.com.au
Tue Dec 1 05:44:57 CET 2015


I can’t remember if I sent another message just to Tomas or if it was to the list, but I can see how a “metadata” table might be heavy-handed, so I suggested an “identifier” table instead, which would extract identifiers from the record for easy recall. Of course, as I said in the previous email, Zebra is going to do a better job in some cases because it can normalize values for indexing and retrieval, while MySQL would not.
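To make the “identifier” idea a bit more concrete, here is a rough sketch of what I mean (the table name and columns are hypothetical, not an existing Koha schema): at import time we would pull the interesting identifiers out of the MARC::Record and write them to a small, indexed table keyed by biblionumber.

    use Modern::Perl;
    use MARC::Record;
    use C4::Context;

    # Hypothetical sketch: copy identifiers out of a MARC record into an
    # "identifier" table so they can be matched with a plain SQL lookup.
    # $record is a MARC::Record; the identifier table does not exist in Koha.
    sub update_identifiers {
        my ( $biblionumber, $record ) = @_;
        my $dbh = C4::Context->dbh;

        # Re-imports should replace identifier rows, not accumulate them
        $dbh->do( 'DELETE FROM identifier WHERE biblionumber = ?', undef, $biblionumber );

        my @rows;
        if ( my $f001 = $record->field('001') ) {
            push @rows, [ '001', $f001->data ];
        }
        for my $field ( $record->field('020'), $record->field('022') ) {
            my $value = $field->subfield('a');
            push @rows, [ $field->tag, $value ] if defined $value;
        }

        my $sth = $dbh->prepare(
            'INSERT INTO identifier (biblionumber, qualifier, value) VALUES (?, ?, ?)'
        );
        $sth->execute( $biblionumber, @$_ ) for @rows;
    }

A matching rule could then be a straightforward indexed lookup on (qualifier, value) instead of a query against Zebra or the serialized MARCXML.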

 

My main concern about Zebra is that it might not be fast enough. Tomas mentioned that Zebra runs updates every 5 seconds, but it looks to me like rebuild_zebra.pl (via /etc/cron.d/koha-common) is only run every 5 minutes on a Debian package install. At least that was the case in Koha 3.20. That’s a huge gap when it comes to importing records. Say you import record A at 5:01pm… and then you try to import record A again at 5:03pm using a matching rule. The update from 5:01pm hasn’t been processed yet, so you wind up with two copies of record A in your Koha catalogue.

 

We run openSUSE and we define our own Zebra indexer, which does run every 5 seconds. I haven’t stress-tested it yet, but 5 seconds might be a bit long even under ideal circumstances, if an import job is running every 2-3 seconds. Sure, that 2-3 seconds might be a bit optimistic… maybe it will also be every 5 seconds. But what happens if the Zebra queue backs up? Someone runs “touch_all_biblios.pl” or something like that and fills up the zebraqueue while you’re importing records, and Zebra is going to be out of date.
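For what it’s worth, the backlog itself is easy to see: the zebraqueue table marks processed rows with done = 1, so a quick count of the unprocessed rows (a rough sketch below) tells you how far behind the indexer is at any moment.

    use Modern::Perl;
    use C4::Context;

    # Rough sketch: count zebraqueue rows that Zebra has not yet processed.
    # Rows get done = 1 once rebuild_zebra.pl (or an equivalent indexer)
    # has handled them.
    my $dbh = C4::Context->dbh;
    my ($pending) = $dbh->selectrow_array(
        'SELECT COUNT(*) FROM zebraqueue WHERE done = 0'
    );
    print "Zebra indexing backlog: $pending update(s)\n";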

 

There needs to be a source of truth, and that’s the metadata record in MySQL. Zebra is an indexed cache, and while usually it doesn’t matter too much if that cache is a little bit stale, it can matter when you’re importing records.

 

Another scenario… what happens if Zebra is down? You’re going to get duplicate records because the matcher won’t work. That said, we could mitigate that by programmatically double-checking that Zebra is actually alive before commencing an import. There could be other heuristics as well… like not running an import (which uses matching rules) unless the zebraqueue has no more than X waiting updates. Ideally, X would be 0, but that’s unlikely in large systems. It’s also impossible if you’re importing at a rate of more than one record every 5 seconds (and one record every 5 seconds would be an absurdly slow import rate anyway).
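As a rough illustration only (none of this exists in Koha today, and the backlog threshold is arbitrary), a pre-flight check before a match-based import might look something like this:

    use Modern::Perl;
    use C4::Context;

    # Illustrative only: refuse to run a match-based import unless the Zebra
    # biblio server answers and the indexing queue is (nearly) empty.
    sub zebra_ready_for_matching {
        my ($max_backlog) = @_;

        # Is the Zebra biblio server reachable at all? Note that the ZOOM
        # connection is lazy, so a real check would probably need to run a
        # trivial search to force a round trip.
        my $alive = eval {
            my $zconn = C4::Context->Zconn( 'biblioserver', 0 );
            defined $zconn;
        };
        return 0 unless $alive;

        # Is the indexing backlog small enough to trust the matcher?
        my ($pending) = C4::Context->dbh->selectrow_array(
            'SELECT COUNT(*) FROM zebraqueue WHERE done = 0'
        );
        return $pending <= $max_backlog ? 1 : 0;
    }

    warn "Zebra is down or behind; matching may create duplicates\n"
        unless zebra_ready_for_matching(0);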

 

So MySQL is the source of truth… but we can’t very well do an ExtractValue query on biblioitems.marcxml when we have a database with over a million rows (and the threshold for a speedy query is probably much lower than that).
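For anyone who hasn’t run into it, the kind of query I mean is sketched below. Because MySQL cannot index inside the XML blob, ExtractValue() has to parse the stored MARCXML of every row it examines, so each lookup is effectively a full table scan:

    use Modern::Perl;
    use C4::Context;

    # Sketch: look up a record by control number straight from
    # biblioitems.marcxml. ExtractValue() cannot use an index, so MySQL
    # parses the MARCXML blob for every row it scans.
    my $dbh = C4::Context->dbh;
    my $sth = $dbh->prepare(q{
        SELECT biblionumber
        FROM   biblioitems
        WHERE  ExtractValue(marcxml, '//controlfield[@tag="001"]') = ?
    });
    $sth->execute('123456789');
    my ($biblionumber) = $sth->fetchrow_array;

That is fine on a small test database and hopeless at a million rows, which is exactly why an indexed “metadata” or “identifier” table is appealing.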

 

(On a related note, I think we’re also going to run into massive problems with authority records, as I don’t think the 001 of incoming records is saved at all. Maybe the MARC templates can handle that scenario, though; I admit I haven’t looked at them much. We should probably be moving the 001 into the 035 for both bibliographic and authority records during an import…)
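To make that last point concrete, the transformation I have in mind during import would be something like the sketch below (the “(org)number” prefixing is the usual convention; the helper itself is hypothetical):

    use Modern::Perl;
    use MARC::Record;
    use MARC::Field;

    # Hypothetical sketch: before saving an incoming record, copy its 001
    # (prefixed with the 003, if present) into an 035 so the original
    # control number survives Koha rewriting the 001.
    sub preserve_control_number {
        my ($record) = @_;

        my $f001 = $record->field('001') or return;
        my $control_number = $f001->data;

        my $org   = $record->field('003') ? $record->field('003')->data : '';
        my $value = $org ? "($org)$control_number" : $control_number;

        # Don't add a duplicate 035 if one already carries this value
        return if grep { ( $_->subfield('a') // '' ) eq $value } $record->field('035');

        $record->insert_fields_ordered(
            MARC::Field->new( '035', ' ', ' ', a => $value )
        );
    }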

 

--

 

I suppose it’s possible that we’ll just have to suck it up if Zebra isn’t fast enough. I suppose the onus is on me (and whoever else is interested in not using Zebra for matching, like Katrina and Andreas) to prove that Zebra is too slow under stress.

 

I am curious as to whether the default Zebra update interval for Debian packages is 5 minutes or 5 seconds. While it doesn’t affect me much, since I don’t use Debian, it matters for regular users of Koha.

 

 

David Cook

Systems Librarian

Prosentient Systems

72/330 Wattle St, Ultimo, NSW 2007

 

From: Jesse [mailto:pianohacker at gmail.com] 
Sent: Tuesday, 1 December 2015 12:03 PM
To: David Cook <dcook at prosentient.com.au>
Cc: koha-devel <koha-devel at lists.koha-community.org>
Subject: Re: [Koha-devel] Proposed "metadata" table for Koha

 

If we do this, I very much vote for doing it the way Tomas is describing (aka, storing entire chunks of metadata as blobs). Koha 2.2 had a row-per-subfield structure kind of like what you're suggesting, and it required a lot of monkeying around to accurately represent all the vagaries and ordering of MARC subfields. It was also (from what I remember) quite a disk hog. 

 

2015-11-30 15:42 GMT-07:00 David Cook <dcook at prosentient.com.au>:

Just to contradict myself a bit, it might be worth mentioning that Zebra will do a better job with ISSN and ISBN matching, as I think it normalizes those strings. That would be nicer than a plain string-matching SQL query…
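For example (just a sketch using the Business::ISBN module from CPAN), the normalization I mean is roughly: strip the hyphens and upgrade ISBN-10s to ISBN-13, so that differently formatted ISBNs compare equal:

    use Modern::Perl;
    use Business::ISBN;

    # Sketch: normalise differently formatted ISBNs to a common form before
    # comparing them, the way an index-time normaliser would.
    sub normalised_isbn {
        my ($raw) = @_;
        my $isbn = Business::ISBN->new($raw);
        return unless $isbn && $isbn->is_valid;
        return $isbn->as_isbn13->as_string([]);    # 13 digits, no hyphens
    }

    say normalised_isbn('0-596-52724-1');        # 9780596527242
    say normalised_isbn('978-0-596-52724-2');    # 9780596527242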

 

David Cook

Systems Librarian

Prosentient Systems

72/330 Wattle St, Ultimo, NSW 2007

 

From: Tomas Cohen Arazi [mailto:tomascohen at gmail.com] 
Sent: Tuesday, 1 December 2015 1:20 AM
To: David Cook <dcook at prosentient.com.au>
Cc: koha-devel <koha-devel at lists.koha-community.org>
Subject: Re: [Koha-devel] Proposed "metadata" table for Koha

 

 

 

2015-11-29 21:52 GMT-03:00 David Cook <dcook at prosentient.com.au>:

Hi all:

 

For those not following along at http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662, we’ve recently started talking about the possibility of adding a “metadata” table to Koha.

 

The basic schema I have in mind would be something like: metadata.id, metadata.record_id, metadata.scheme, metadata.qualifier, metadata.value.

 

The row would look like: 1, 1, marc21, 001, 123456789

 

It might also be necessary to store “metadata.record_type” so that we know which table metadata.record_id points at. This obviously has a lot of disadvantages… redundant data between “metadata” rows, no database cascades via foreign keys, etc. However, it might be necessary in the short term as a temporary measure.
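Put as DDL, it would be something like the sketch below (column types, sizes, and indexes are just guesses to illustrate the idea):

    use Modern::Perl;
    use C4::Context;

    # Sketch only: the table described above, including the extra
    # record_type column. Types and indexes are illustrative guesses.
    my $dbh = C4::Context->dbh;
    $dbh->do(q{
        CREATE TABLE metadata (
            id          INT(11)      NOT NULL AUTO_INCREMENT,
            record_id   INT(11)      NOT NULL,  -- e.g. biblionumber or authid
            record_type VARCHAR(16)  NOT NULL,  -- which table record_id points at
            scheme      VARCHAR(16)  NOT NULL,  -- e.g. 'marc21'
            qualifier   VARCHAR(32)  NOT NULL,  -- e.g. '001' or '020$a'
            value       VARCHAR(255) NOT NULL,
            PRIMARY KEY (id),
            KEY lookup (qualifier, value(100)),
            KEY record (record_type, record_id)
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8
    });

    # The example row from above:
    $dbh->do(q{
        INSERT INTO metadata (record_id, record_type, scheme, qualifier, value)
        VALUES (1, 'biblio', 'marc21', '001', '123456789')
    });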

 

Of course, adding “yet another place” to store metadata might not seem like a great idea. We already store metadata in biblioitems.marcxml (and biblioitems.marc), Zebra, and other biblio/biblioitems/items relational database fields. Do we really need a new place to worry about data?

 

I think we should have a metadata_record table storing the serialized metadata plus whatever other information is needed (basically the fields Koha::MetadataRecord has...), and let the full-text engine do the job of accessing those values.
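Roughly (just a sketch; the exact columns would follow whatever Koha::MetadataRecord actually needs, so treat the names and types below as guesses):

    use Modern::Perl;
    use C4::Context;

    # Sketch only: one row per record, storing the whole serialized record
    # plus enough information to interpret it. Names and types are guesses.
    C4::Context->dbh->do(q{
        CREATE TABLE metadata_record (
            id          INT(11)     NOT NULL AUTO_INCREMENT,
            record_id   INT(11)     NOT NULL,   -- biblionumber, authid, ...
            record_type VARCHAR(16) NOT NULL,
            format      VARCHAR(16) NOT NULL,   -- e.g. 'marcxml'
            `schema`    VARCHAR(16) NOT NULL,   -- e.g. 'MARC21', 'UNIMARC'
            metadata    LONGTEXT    NOT NULL,   -- the serialized record itself
            timestamp   TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP
                                    ON UPDATE CURRENT_TIMESTAMP,
            PRIMARY KEY (id),
            KEY record (record_type, record_id)
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8
    });

The point is that Koha itself only ever needs to store or fetch the whole blob; anything that looks like searching or matching on individual values belongs to the full-text engine.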

 

The codebase is already too bloated from trying to band-aid our "minimal" usage of the search engines' features. Of course, while trying to fix that we might find our search engine has problems and/or broken functionality (Zebra facets are so slow that they're not cool). But we should definitely get rid of tons of code in favour of using the search engine more, and probably make QueryParser the standard, with a driver for ES...

 

-- 

Tomás Cohen Arazi

Theke Solutions (http://theke.io)
✆ +54 9351 3513384
GPG: B76C 6E7C 2D80 551A C765  E225 0A27 2EA1 B2F3 C15F


_______________________________________________
Koha-devel mailing list
Koha-devel at lists.koha-community.org
http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel
website : http://www.koha-community.org/
git : http://git.koha-community.org/
bugs : http://bugs.koha-community.org/




-- 

Jesse Weaver
