[Koha-bugs] [Bug 10662] Build OAI-PMH Harvesting Client

bugzilla-daemon at bugs.koha-community.org
Mon Apr 20 03:09:36 CEST 2020


https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662

--- Comment #291 from David Cook <dcook at prosentient.com.au> ---
On one hand, I feel like we're so close with the current patches. They just
need more unit tests.

On the other hand, the unit tests for the task scheduling and concurrent
processing are actually quite challenging to write. Also, these patches are a
huge chunk of functionality, which would increase Koha's overall size.

I am tempted to take this code and split it into two parts: 
1. Koha plugin for Web functionality
2. Standalone OAI-PMH harvester

My thinking is the Koha plugin will allow you to connect to a separately
packaged OAI-PMH Harvester API in order to add/start/stop/update/remove
harvesting tasks. Easy!
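
To sketch what I have in mind (the endpoint, field names, and port are all
made up), the harvester's task API could be a small HTTP service; start/stop/
update/remove would follow the same pattern as add/list:

    package main

    import (
        "encoding/json"
        "net/http"
    )

    // A harvesting task as the plugin would submit it. Fields are
    // illustrative only.
    type Task struct {
        ID       string `json:"id"`
        BaseURL  string `json:"base_url"` // OAI-PMH repository to harvest
        Interval int    `json:"interval"` // seconds between harvests
    }

    // In-memory store for the sketch; a real harvester would persist
    // tasks and guard this map with a mutex.
    var tasks = map[string]Task{}

    func main() {
        http.HandleFunc("/tasks", func(w http.ResponseWriter, r *http.Request) {
            switch r.Method {
            case http.MethodPost: // "add" a task
                var t Task
                if err := json.NewDecoder(r.Body).Decode(&t); err != nil {
                    http.Error(w, err.Error(), http.StatusBadRequest)
                    return
                }
                tasks[t.ID] = t
                w.WriteHeader(http.StatusCreated)
            case http.MethodGet: // list tasks
                json.NewEncoder(w).Encode(tasks)
            }
        })
        http.ListenAndServe(":8080", nil)
    }

The Koha plugin would then just be a thin HTTP client against these endpoints.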

The standalone OAI-PMH harvester will then take care of scheduling tasks and
downloading records at high speed. The harvester will then run actions on each
downloaded batch of records. Not too hard!
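
Under the hood, harvesting a task mostly means walking ListRecords responses
by resumption token. A minimal Go sketch (the metadataPrefix and endpoint are
placeholders, and real code would need retries and OAI error handling):

    package main

    import (
        "encoding/xml"
        "io"
        "net/http"
        "net/url"
    )

    // Just enough of the OAI-PMH envelope to follow resumption tokens.
    type oaiPage struct {
        Token string `xml:"ListRecords>resumptionToken"`
    }

    // harvest walks a ListRecords chain and pushes each raw response
    // page onto a channel for a separate goroutine to process.
    func harvest(base string, pages chan<- []byte) error {
        defer close(pages)
        q := "?verb=ListRecords&metadataPrefix=marc21"
        for {
            resp, err := http.Get(base + q)
            if err != nil {
                return err
            }
            body, err := io.ReadAll(resp.Body)
            resp.Body.Close()
            if err != nil {
                return err
            }
            pages <- body
            var page oaiPage
            if err := xml.Unmarshal(body, &page); err != nil {
                return err
            }
            if page.Token == "" {
                return nil // harvest complete
            }
            q = "?verb=ListRecords&resumptionToken=" + url.QueryEscape(page.Token)
        }
    }

    func main() {
        pages := make(chan []byte, 10)
        go harvest("https://repo.example.org/oai", pages) // error handling omitted
        for range pages {
            // hand each page to the record-batch actions
        }
    }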

This is where things get interesting. Ideally, there would be a Koha API backed
by a queue and Koha worker(s) to handle the processing of records. But that
doesn't currently exist. I can use the Koha plugin to inject an API route, but
there is no existing queue mechanism. Uh oh!

The API could be used to store the records in a database table, but then I
would need a Koha-based worker to access that database table and apply all the
Koha-specific rules to the data. 

At this point it would be nice to have RabbitMQ for the queue; the Koha plugin
could then provide a Koha worker, which a sysadmin could start manually. That
said, we don't strictly need RabbitMQ. The Koha worker could just tap into the
database directly (until RabbitMQ is available).
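
To illustrate the database-as-queue fallback (the table and column names are
made up, and the real worker would be Perl inside the Koha plugin; I'm only
using Go here to keep all the sketches in one language):

    package main

    import (
        "database/sql"
        "log"
        "time"

        _ "github.com/go-sql-driver/mysql"
    )

    func main() {
        db, err := sql.Open("mysql", "koha:koha@/koha") // hypothetical DSN
        if err != nil {
            log.Fatal(err)
        }
        for {
            // Hypothetical staging table filled by the import API.
            row := db.QueryRow(
                "SELECT id, record FROM oai_harvest_queue WHERE status = 'new' LIMIT 1")
            var id int
            var record []byte
            switch err := row.Scan(&id, &record); err {
            case sql.ErrNoRows:
                time.Sleep(5 * time.Second) // queue empty; back off
            case nil:
                // ...apply the Koha-specific import rules to record...
                db.Exec("UPDATE oai_harvest_queue SET status = 'done' WHERE id = ?", id)
            default:
                log.Fatal(err)
            }
        }
    }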

So in the end there are really 3 parts:
1. Koha plugin (Web functionality)
2. Koha plugin (Import Worker functionality)
3. Standalone OAI-PMH harvester

Alternatively, the import API could handle all the Koha-related processing. I
could do some tests to see how fast the web API could process the data. The
OAI-PMH download will probably always be faster than the upload, so the
OAI-PMH harvester would need its own internal queue for the downloaded
records, but that could keep the Koha side of things slimmer. Plus, if Koha
did implement a message queue like RabbitMQ, that change could be made
transparently on the Koha side without affecting the OAI-PMH harvester.
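
That internal queue is exactly what Go channels are good at: a buffered
channel between downloader and uploader absorbs the speed mismatch, because
the downloader blocks as soon as the buffer fills. A bare sketch (the Koha
import URL is made up):

    package main

    import (
        "bytes"
        "net/http"
        "sync"
    )

    func main() {
        // Buffered channel = the harvester's internal queue. The fast
        // downloader blocks once 100 batches are waiting, so a slow
        // Koha upload throttles the harvest instead of eating memory.
        queue := make(chan []byte, 100)

        var wg sync.WaitGroup
        wg.Add(1)
        go func() {
            defer wg.Done()
            for batch := range queue {
                // POST each batch to the (hypothetical) Koha import API
                // provided by the plugin.
                resp, err := http.Post("https://koha.example.org/api/oai/import",
                    "application/xml", bytes.NewReader(batch))
                if err == nil {
                    resp.Body.Close()
                }
            }
        }()

        // The downloader would feed and close the queue, e.g. the
        // harvest() sketch above: harvest(baseURL, queue).
        close(queue) // nothing queued in this bare sketch
        wg.Wait()
    }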

Ok so...

1. Koha plugin (Web UI to interact with OAI-PMH harvester)
2. Koha plugin (Import API to receive harvested OAI-PMH records)
3. Standalone OAI-PMH harvester

I think that this makes sense. People could use the plugin, and if they like
it, we could try again to get it into Koha master.

I am actually interested in rewriting the OAI-PMH harvester in Golang to take
advantage of its strengths in concurrent programming. By using the Koha plugin
to provide/consume APIs, we can use the best tool for the job for the actual
OAI-PMH work. Note too that OAI-PMH harvesting itself isn't actually
Koha-specific. The only Koha-specific aspects are the scheduling and the
record import. There's no real need to have the OAI-PMH code in the Koha
codebase.
