[Koha-devel] Proposed "metadata" table for Koha

David Cook dcook at prosentient.com.au
Mon Nov 30 23:36:22 CET 2015


I’m not 100% sure what I think yet, but in the past I was certainly in favour of a metadata_record table that stored the serialized metadata and whatever else it needed to support that. I still think it’s a reasonable idea to have that table.
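
For illustration only, here is a minimal sketch of what such a table might look like, created via DBI. The table name, column names, and types are my own guesses (roughly the fields Koha::MetadataRecord carries: the serialized record, its format and schema, and a link back to the biblio), not an agreed design:

    use DBI;

    # Hypothetical sketch only: table and column names are illustrative,
    # not a proposed final schema.
    my $dbh = DBI->connect( 'DBI:mysql:database=koha', 'koha_user', 'password',
        { RaiseError => 1 } );

    $dbh->do(<<'SQL');
    CREATE TABLE IF NOT EXISTS metadata_record (
        id            INT(11)     NOT NULL AUTO_INCREMENT,
        biblionumber  INT(11)     NOT NULL,     -- link back to biblio
        format        VARCHAR(16) NOT NULL,     -- e.g. 'marcxml'
        `schema`      VARCHAR(16) NOT NULL,     -- e.g. 'MARC21'
        metadata      LONGTEXT    NOT NULL,     -- the serialized record itself
        timestamp     TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP
                                  ON UPDATE CURRENT_TIMESTAMP,
        PRIMARY KEY (id),
        KEY metadata_record_bibnoidx (biblionumber)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8
    SQL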

 

In general, I’m in favour of using the full text engine for searching, although as Katrin has noted on http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662, Zebra isn’t necessarily updated fast enough to be used for “matching” when importing records, especially when records are being downloaded and imported every 2-3 seconds. Also, what happens if Zebra goes down? Suddenly your catalogue gets flooded with duplicates. One way around that is to query Zebra to make sure it is alive before starting an import, but that doesn’t solve the speed problem. I don’t think there’s any reliable way of knowing whether the record you want to match on has already been indexed (or re-indexed) in Koha. Don’t we only update Zebra once every 60 seconds?
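
As a rough sketch of that “is Zebra alive” check, using C4::Search::SimpleSearch wrapped in an eval; the query and the exact return values are illustrative and may differ between Koha versions:

    use C4::Search qw( SimpleSearch );

    # Sketch: fire a trivial query at Zebra before starting an import.
    # If the search dies or reports an error, assume Zebra is down and
    # abort, rather than running a matcher that will silently match nothing.
    sub zebra_is_alive {
        my ( $error, $results, $hits );
        eval {
            ( $error, $results, $hits ) = SimpleSearch( 'kw=a', 0, 1 );
        };
        return 0 if $@ or $error;
        return 1;
    }

    unless ( zebra_is_alive() ) {
        die "Zebra appears to be down; aborting import to avoid duplicates\n";
    }

Of course, as noted above, this only protects against Zebra being down; it does nothing about the index being seconds or minutes behind the database.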

 

The OAI-PMH import wouldn’t be the only thing affected by the indexing delay. The OCLC Connexion daemon and any staged MARC import both use Zebra for matching, and if Zebra hasn’t indexed the relevant additions or updates, matching won’t work when it should. With records in the hundreds, thousands, or millions, that can cause major problems with both duplicates and failed updates.

 

Maybe a “metadata” table is overkill. Off the top of my head, I can’t see a lot of advantages to storing mass quantities of metadata in the relational database, but some way of keeping record identifiers in the relational database would be doable.
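
To make that concrete, here is a very rough sketch of what I mean by keeping identifiers in the database: a hypothetical table mapping an external identifier (an OAI identifier, a 001/003 pair, a 035$a) to a biblionumber, and a straight SQL lookup at import time instead of a Zebra match. The table and column names here are invented purely for illustration:

    # Hypothetical table, e.g.:
    #   CREATE TABLE record_identifiers (
    #       identifier   VARCHAR(255) NOT NULL,
    #       biblionumber INT(11)      NOT NULL,
    #       PRIMARY KEY (identifier)
    #   );

    use C4::Context;

    # Look the incoming record's identifier up in the database itself,
    # so matching never depends on how fresh the Zebra index is.
    sub find_existing_biblionumber {
        my ($identifier) = @_;
        my $dbh = C4::Context->dbh;
        my ($biblionumber) = $dbh->selectrow_array(
            'SELECT biblionumber FROM record_identifiers WHERE identifier = ?',
            undef, $identifier
        );
        return $biblionumber;    # undef means "no match, safe to add"
    }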

 

If we think about the metadata in terms of a “source of truth”, the relational database is always going to contain the source of truth. Zebra is basically just an indexed cache, and when it comes to importing records I’d rather query the source of truth than a cache, because the cache might be stale. At the moment, it’s going to be stale by anywhere from 1 to 59 seconds, and longer if Zebra has a lot of indexing jobs to work through when it receives an update.

 

Maybe there’s a way to mitigate that, like waiting to start an import until Zebra has reported that it emptied the zebraqueue X seconds ago, although on a busy system the zebraqueue may never be empty. There’s always the possibility that you’re going to miss something, and that possibility doesn’t exist in the relational database, because it’s the source of truth: if the identifier doesn’t exist in the database, then it doesn’t exist for that record (or there’s a software bug, which can be fixed).
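
For completeness, a sketch of what that mitigation might look like: poll the zebraqueue table until nothing is left marked as pending (done = 0), or give up after a timeout. Even then, an empty queue doesn’t guarantee the index itself is current, which is why I’d still rather match against the database:

    use C4::Context;

    # Sketch only: wait (up to $timeout seconds) for zebraqueue to drain
    # before trusting Zebra for matching. done = 0 marks unprocessed entries.
    sub wait_for_zebraqueue {
        my ($timeout) = @_;
        my $dbh = C4::Context->dbh;
        for ( 1 .. $timeout ) {
            my ($pending) = $dbh->selectrow_array(
                'SELECT COUNT(*) FROM zebraqueue WHERE done = 0'
            );
            return 1 if $pending == 0;
            sleep 1;
        }
        return 0;    # queue never drained; caller decides whether to proceed
    }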

 

While we probably could use the search engine more throughout Koha, I think it might not be wise to use it during an import. 

 

(As for the QueryParser, I totally agree about making it the standard and creating a driver for ES. I chatted with Robin about this a bit over the past few months, but I haven’t had time to help out with that. The QueryParser isn’t quite right for Zebra yet either, so it would probably make sense to focus on finalizing the PQF driver first.)


David Cook

Systems Librarian

Prosentient Systems

72/330 Wattle St, Ultimo, NSW 2007

 

From: Tomas Cohen Arazi [mailto:tomascohen at gmail.com] 
Sent: Tuesday, 1 December 2015 1:20 AM
To: David Cook <dcook at prosentient.com.au>
Cc: koha-devel <koha-devel at lists.koha-community.org>
Subject: Re: [Koha-devel] Proposed "metadata" table for Koha

 

I think we should have a metadata_record table storing the serialized metadata and whatever other information is needed (basically the fields Koha::MetadataRecord has...), and let the fulltext engine do the job of accessing those values.

 

The codebase is already too bloated from trying to band-aid our "minimal" usage of the search engines' features. Of course, while trying to fix that we might find our search engine has problems and/or broken functionality (Zebra facets are so slow that they are not really usable). But we should definitely get rid of tons of code in favour of using the search engine more, probably make QueryParser the standard, and have a driver for ES...

 

-- 

Tomás Cohen Arazi

Theke Solutions (http://theke.io)
✆ +54 9351 3513384
GPG: B76C 6E7C 2D80 551A C765  E225 0A27 2EA1 B2F3 C15F

