[Koha-bugs] [Bug 10662] Build OAI-PMH Harvesting Client

bugzilla-daemon at bugs.koha-community.org
Mon Apr 20 02:21:19 CEST 2020


https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662

--- Comment #290 from David Cook <dcook at prosentient.com.au> ---
I've looked more at Bug 22417, and it's got me thinking. 

The OAI-PMH harvester has a few core needs:

1. Instant communication between Web UI and OAI-PMH harvester to coordinate
harvesting tasks
2. Ability to schedule tasks
3. Ability to execute and repeat download tasks in parallel in the background
4. Ability to save downloaded records
5. Ability to import records into Koha

The first is achieved by exchanging JSON messages over a Unix socket. (I'd
actually like to change this to a TCP socket and exchange JSON messages over
an HTTP API. Using HTTP would make it easier to communicate over a network,
would simplify the client code by relying on standard communication
mechanisms, and would provide authentication methods that could help secure
the OAI-PMH harvester. I could also provide a Docker image that runs the
OAI-PMH harvester in a separate container from the Koha application server.)
The current approach works fairly well and is easy enough to achieve.
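
To make the HTTP variant concrete, here is a minimal sketch of a client
submitting a harvesting task as a JSON message over HTTP. The /tasks
endpoint, the X-API-Key header, and the message fields are illustrative
assumptions, not part of the current harvester:

use strict;
use warnings;
use LWP::UserAgent;
use JSON qw(encode_json decode_json);

# Hypothetical endpoint and API key; the current harvester listens on a
# Unix socket rather than HTTP, so these names are illustrative only.
my $harvester_url = 'http://localhost:8080/tasks';
my $api_key       = 'secret-api-key';

my $ua = LWP::UserAgent->new( timeout => 10 );

# Submit a harvesting task as a JSON message over HTTP.
my $response = $ua->post(
    $harvester_url,
    'Content-Type' => 'application/json',
    'X-API-Key'    => $api_key,
    Content        => encode_json({
        action   => 'create',
        url      => 'https://example.org/oai',
        verb     => 'ListRecords',
        metadata => 'marcxml',
        interval => 3600,
    }),
);

if ( $response->is_success ) {
    my $reply = decode_json( $response->decoded_content );
    print "Harvester accepted task " . $reply->{task_id} . "\n";
} else {
    warn "Harvester returned " . $response->status_line . "\n";
}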

The second is provided by a bespoke scheduler, built into the OAI-PMH
harvester, that uses POE timers. Given the granularity of the scheduling, I
don't see this changing soon. In theory, a generic Koha task scheduler could
replace this functionality, but that seems unlikely in the near term. This is
arguably one of the most complex parts of the OAI-PMH harvester.
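
For anyone not familiar with POE, the timer mechanism looks roughly like the
following. This is only a minimal sketch of a self-re-arming timer with an
assumed 60-second interval; the real scheduler tracks per-task intervals and
is considerably more involved:

use strict;
use warnings;
use POE;

# Minimal sketch of a repeating POE timer. The 60-second interval and the
# dispatch_due_tasks() helper are placeholders; the real scheduler keeps a
# per-task schedule rather than a single global tick.
POE::Session->create(
    inline_states => {
        _start => sub {
            $_[KERNEL]->delay( tick => 60 );
        },
        tick => sub {
            dispatch_due_tasks();              # placeholder for real work
            $_[KERNEL]->delay( tick => 60 );   # re-arm the timer
        },
    },
);

sub dispatch_due_tasks {
    print "Checking for harvesting tasks that are due...\n";
}

POE::Kernel->run();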

Currently, the OAI-PMH harvester uses an in-memory queue for download tasks
and a database queue for import tasks. I've thought a lot about how RabbitMQ
might be used to replace these queues. It could be useful for replacing the
in-memory download queue, and the download workers could be split out of the
existing OAI-PMH harvester.

As for the import tasks, the download workers need to save the records and
enqueue an import task as soon as possible. At the moment, they save the
records to disk and add an import task to the database with a pointer to the
records on disk. That works well enough, but it assumes that you have the
disk space and that the import worker has access to the same local disk. I've
been thinking it would be better to either 1) save the records to the
database and enqueue a RabbitMQ message with a pointer to the database rows,
or 2) send the records themselves in a RabbitMQ message. I think the first
option is probably better, because it gives more visibility: you can't easily
inspect all the messages sitting in a RabbitMQ queue, whereas you can query
the database. That said, saving to disk is going to be faster than sending
the data over a network. (In the past I think I've sent individual records
over the network; in this case I'd be sending whole batches of records at
once.)

For a download worker to send records, though, it would need credentials for
either the database or RabbitMQ, so perhaps it would be better to use an
import API with an API key, although that would involve receiving all the
data over the network and then sending it on to the database, which is also
slow. I haven't tested the actual speeds. The import API would save the data
to the database and then enqueue an import task.
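
To make the first option concrete, here is a minimal sketch of a download
worker saving a batch and enqueuing a pointer message, assuming RabbitMQ's
STOMP plugin (as used by the Bug 22417 task queue work). The table name,
queue name, and credentials are invented for illustration and don't exist in
the current schema:

use strict;
use warnings;
use DBI;
use JSON qw(encode_json);
use Net::Stomp;

my $dbh = DBI->connect( 'DBI:mysql:database=koha', 'koha', 'password',
    { RaiseError => 1 } );

# 1. Save the downloaded batch of records (e.g. one ListRecords response).
my $repository  = 'https://example.org/oai';
my $records_xml = '...';    # the downloaded records
$dbh->do(
    'INSERT INTO oai_harvester_batches (repository, records) VALUES (?, ?)',
    undef, $repository, $records_xml
);
my $batch_id = $dbh->last_insert_id( undef, undef, undef, undef );

# 2. Enqueue a lightweight import task that only points at the batch.
my $stomp = Net::Stomp->new( { hostname => 'localhost', port => 61613 } );
$stomp->connect( { login => 'guest', passcode => 'guest' } );
$stomp->send(
    {
        destination => '/queue/oai-import',
        body        => encode_json( { batch_id => $batch_id } ),
    }
);
$stomp->disconnect;

The import worker would then fetch the batch by batch_id, import it, and
delete or flag the row, which keeps the queue messages small while leaving
the pending work visible in the database.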

Of course, this would just be a reworking of what's already here. The
benefits of reworking the queues and workers are arguable at this point,
although there are certainly benefits to changing from a bespoke
client/server communication protocol to HTTP over TCP.
