[Koha-bugs] [Bug 10662] Build OAI-PMH Harvesting Client

bugzilla-daemon at bugs.koha-community.org bugzilla-daemon at bugs.koha-community.org
Fri Jun 17 03:04:16 CEST 2022


https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662

--- Comment #314 from David Cook <dcook at prosentient.com.au> ---
Fortunately, since we have RabbitMQ now, some of it should be a lot easier! (I
was just looking at bug 27421, which asynchronously stages, imports, and
reverts MARC imports. It could be helpful for this work too.)

*The hard part for the OAI-PMH harvester in Koha is the task scheduling.*

Koha doesn't have any way to let users define their own task schedules. (Back
in 2018, Frido also mentioned Koha support companies might not want to let
librarians set task schedules for OAI-PMH anyway for performance/API rate
limiting reasons.)

That said, if letting librarians control the scheduling doesn't matter to you,
you could just create a cronjob that sends OAI-PMH tasks to RabbitMQ (or a
plugin that uses the nightly plugin cronjob).

Then all that's left is to create a Koha::BackgroundJob::OAIPMHHarvest class.

-- 

The background job class could probably encapsulate the entire task. (For bug
10662, the requirement was to download records every 3 seconds, so I had to
split the harvest/download and import tasks into two separate asynchronous
tasks to achieve fast enough download speeds.)

For bug 10662, I also had a requirement to handle very long XML streams over
HTTP rather than the usual short XML responses. That is technically allowed by
the OAI-PMH specification, but it meant writing a custom downloader. It was
high performance, but it meant I had to add even more code.
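
As a rough illustration of what handling a long XML stream involves (a Python
sketch, not the Perl downloader written for this bug), an incremental parser
can hand over each record as soon as its closing tag arrives and then discard
its contents, instead of loading the whole response first. The function name
iter_oai_records and the sample data are made up for the example; only the
OAI-PMH namespace and the header/identifier/datestamp element names come from
the specification.

```python
import io
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def iter_oai_records(stream):
    """Yield (identifier, datestamp) for each <record> as soon as it is parsed."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == OAI_NS + "record":
            header = elem.find(OAI_NS + "header")
            identifier = header.findtext(OAI_NS + "identifier")
            datestamp = header.findtext(OAI_NS + "datestamp")
            yield identifier, datestamp
            # Discard the record's children so the bulk of it can be freed.
            elem.clear()

# Made-up sample response standing in for a long HTTP stream.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier><datestamp>2022-06-01T00:00:00Z</datestamp></header></record>
    <record><header><identifier>oai:example.org:2</identifier><datestamp>2022-06-02T00:00:00Z</datestamp></header></record>
  </ListRecords>
</OAI-PMH>
"""

records = list(iter_oai_records(io.BytesIO(SAMPLE)))
```

Any file-like object can be passed in place of the BytesIO wrapper, for example
the response object returned by urllib.request.urlopen.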

In theory, you might be able to have Koha::BackgroundJob::OAIPMH::Download,
Koha::BackgroundJob::OAIPMH::Stage and Koha::BackgroundJob::OAIPMH::Import
classes. The scheduler (e.g. cronjob) could enqueue a
Koha::BackgroundJob::OAIPMH::Download task which downloads the records. That
task could then enqueue a Koha::BackgroundJob::OAIPMH::Stage task to stage them
(i.e. run the matcher/duplicate finder and ideally do some OAI-PMH specific
checks), which in turn could enqueue the final
Koha::BackgroundJob::OAIPMH::Import task to run the actual import.
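
A rough sketch of that chain, in Python rather than Koha's Perl and with an
in-memory queue standing in for RabbitMQ. Every name here (enqueue, download,
stage, import_records, other_job, run_worker) is hypothetical and only
illustrates the hand-off between steps, not Koha's actual background job API.

```python
from collections import deque

queue = deque()  # stand-in for the RabbitMQ-backed job queue
log = []         # records the order in which jobs ran

def enqueue(job, payload):
    queue.append((job, payload))

def download(payload):
    # Hypothetical: a real job would fetch records from payload["url"].
    records = ["<record 1>", "<record 2>"]
    enqueue(stage, {"records": records})

def stage(payload):
    # Hypothetical: run matching and OAI-PMH specific checks here.
    staged = [record for record in payload["records"] if record]
    enqueue(import_records, {"staged": staged})

def import_records(payload):
    # Hypothetical: write the staged records to the catalogue.
    log.append(("imported", len(payload["staged"])))

def other_job(payload):
    # Any unrelated background job waiting in the same queue.
    pass

def run_worker():
    """Single worker: take one job at a time until the queue is empty."""
    while queue:
        job, payload = queue.popleft()
        log.append(("ran", job.__name__))
        job(payload)

# The scheduler (e.g. a cronjob) only enqueues the first step.
enqueue(download, {"url": "https://repository.example.org/oai"})
enqueue(other_job, {})
run_worker()
```

Because each step puts its successor at the back of the queue, other_job gets
its turn between download and stage rather than waiting for the whole harvest.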

(The advantage of breaking it into 3 different tasks is that Koha by default
only has 1 background job worker, so very long tasks could prevent other tasks
from running in a timely way.)

(However, if we had more than 1 background job worker, I'd be a little
concerned about race conditions. For example, Worker A imports Record 1-B, and
then Worker B tries to import Record 1-A, an older version of the same record.
There needs to be a sanity check to make sure that incoming records only
overwrite older records.)
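
One way such a check could look, as a Python sketch under some assumptions:
records are keyed by their OAI identifier, datestamps use the seconds
granularity (YYYY-MM-DDThh:mm:ssZ), and a plain dict stands in for the database.
The names parse_datestamp and import_if_newer are made up for the example. With
several real workers, the comparison and the write would also need to happen
atomically (for instance inside one database transaction), which this
single-process sketch does not show.

```python
from datetime import datetime, timezone

def parse_datestamp(value):
    """Parse an OAI-PMH UTC datestamp such as 2022-06-17T01:04:16Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)

def import_if_newer(store, identifier, datestamp, record):
    """Store the record unless an equal or newer version is already present."""
    incoming = parse_datestamp(datestamp)
    existing = store.get(identifier)
    if existing is not None and existing[0] >= incoming:
        return False  # stale or duplicate: keep what is already stored
    store[identifier] = (incoming, record)
    return True

store = {}
# Worker A imports the newer version first...
first = import_if_newer(
    store, "oai:example.org:1", "2022-06-02T00:00:00Z", "Record 1-B")
# ...then Worker B arrives late with the older version and is refused.
second = import_if_newer(
    store, "oai:example.org:1", "2022-06-01T00:00:00Z", "Record 1-A")
```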

Depending on how bug 27421 works, Koha::BackgroundJob::OAIPMH::Stage and
Koha::BackgroundJob::OAIPMH::Import could potentially be subclasses of
Koha::BackgroundJob::StageMARCForImport and the corresponding import job class
from that bug, respectively. That said, I don't really like Koha's built-in
MARC import classes for OAI-PMH, because once records are staged they're
imported without any sanity checks. Also, record matching rules are
user-controllable and solely MARC-based, so they're unreliable and not great
for matching incoming OAI-PMH records to past harvested records.
