[Koha-bugs] [Bug 10662] Build OAI-PMH Harvesting Client

Fri Nov 13 15:52:51 CET 2015

http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662

--- Comment #25 from Andreas Hedström Mace <andreas.hedstrom.mace at sub.su.se> ---
The National Library of Sweden has, together with Stockholm University Library,
provided the funding required for David to finalize his work on the harvester.
Stockholm University Library has been testing the OAI-PMH harvester extensively
of late, and has provided feedback and been in discussion with David about the
development of the harvester. Here I’ll try to summarize our discussions. David
will probably have to fill in the gaps where needed, and provide further
detail!

Our use case
We are harvesting records from the Swedish union catalogue LIBRIS, which
provides records in MARCXML. Today only bibliographic records are harvested,
but we hope to add functionality in the future to also allow holdings to be
harvested (this is a separate development and won’t be discussed further
here).

We would want to harvest repeatedly and often, preferably every 5 seconds or
so, to always have up-to-date records in our local system. Cataloging is done
in LIBRIS.

Core functionality
* Harvesting works: we have tried harvesting records, editing/deleting them at
the source and then reharvesting them. All works as intended.
* We also tried to delete a record in Koha and then do a harvest – the intended
error message is displayed (“Harvested records in error state”).
* It’s very good that the HTTP and OAI-PMH parameters for the OAI server target
can be tested directly! (I was trying to set up the LIBRIS SRU server in Koha
the other day and was frustrated that I had to go to cataloging to test whether
or not I had set up the correct parameters…)

All in all, the harvester works as intended!

Major issues

Repeated harvests
The harvester as built today is made to run either a one-time harvest or
repeating harvests with long intervals in between, such as once every night.
For those use cases, performing the scheduling in the GUI and then running the
job with the cronjob (the download and the import parameters) is not a problem.
But for frequently repeated tasks, this divided responsibility is highly
problematic.
We would like to have all harvests (or tasks) set from the GUI! To facilitate
this, David has proposed to change the harvester to work as a daemon instead.
The reasons for this are as follows:

* Using the daemon, all scheduling can be handled by the GUI
* Using the daemon, you could harvest every few seconds. The original intent
with the cronjob was that it would be set once and never looked at again. The
harvesting would just happen in the background. But since you want more control
and to run the harvest every few seconds, a daemon is the way to go. 
* The key benefit of using the daemon is that you can control it from the GUI
and that it can manage the harvests. Trying to set/schedule a cronjob from the
GUI would be a bad idea. 
* If you’re trying to re-harvest every few seconds, a cronjob could easily get
out of control. You could easily have competing processes and no way to control
them at all. A cronjob couldn’t be a communications centre in the way described.
The way I envision it, the daemon will communicate with the Web GUI. You could
start, stop, and pause harvests. The daemon would also be in charge of the
actual harvest, as it could control its own activity. You can’t really control
a cronjob. The cron daemon starts cronjobs based on its own unique syntax and
that’s it. It’s just a scheduler. It’s not a controller. The daemon I’m talking
about would be a controller. You could tell it “STOP 1” and it would stop
running the harvest with the 1 identifier.
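To make the controller idea a bit more concrete, here is a minimal Python
sketch. It is only an illustration and not David's design: the class name
HarvestController, the harvest_once callable and the reply strings are all
hypothetical, and only START and STOP are handled (pause is left out). It shows
one worker per harvest identifier, a guard against competing workers for the
same harvest, and a “STOP 1” style command.

```python
import threading


class HarvestController:
    """Toy controller: one worker thread per harvest id (all names hypothetical)."""

    def __init__(self, harvest_once, interval=5.0):
        self.harvest_once = harvest_once  # assumed callable that takes a harvest id
        self.interval = interval          # seconds to wait between two harvests
        self._stops = {}                  # harvest id -> Event used to stop its worker
        self._threads = {}                # harvest id -> worker thread

    def _run(self, harvest_id, stop):
        # Harvest, then sleep until the interval elapses or a stop is requested.
        while not stop.is_set():
            self.harvest_once(harvest_id)
            stop.wait(self.interval)

    def handle(self, command):
        """Handle a text command such as 'START 1' or 'STOP 1'."""
        parts = command.split()
        if len(parts) != 2:
            return "ERROR bad command"
        verb, harvest_id = parts[0].upper(), parts[1]
        if verb == "START":
            if harvest_id in self._threads:
                # Never run two competing workers for the same harvest.
                return "ERROR already running"
            stop = threading.Event()
            worker = threading.Thread(
                target=self._run, args=(harvest_id, stop), daemon=True
            )
            self._stops[harvest_id] = stop
            self._threads[harvest_id] = worker
            worker.start()
            return "OK started " + harvest_id
        if verb == "STOP":
            stop = self._stops.pop(harvest_id, None)
            if stop is None:
                return "ERROR not running"
            stop.set()
            self._threads.pop(harvest_id).join()  # wait for the worker to finish
            return "OK stopped " + harvest_id
        return "ERROR unknown verb"
```

In a real daemon the commands would arrive from the Web GUI over some channel
(a socket, a database table or similar); that part is deliberately left out
here.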

Ideally, David could provide more detail on the proposed daemon approach.

We had some initial reservations about the use of a daemon for the harvester,
mainly as this would be a background process that might be hard to
evaluate/work with for a systems administrator, to which David replied:

* Why would it be hard for a systems administrator to evaluate/work with a
daemon? It seems to me that it would actually be easier for sysadmins to
evaluate/work with a daemon, as it can be monitored and controlled as a
separate process. It’s much easier to control than a cronjob.

It would be good to have input from others in the community on the merits of
having the harvester run as a daemon!

Matching rules
At the moment there are no matching rules for the harvester per se. The only
matching that is done is based on the OAI-PMH unique identifier. If there’s
already a record in Koha with the same title, but not the same OAI-PMH unique
identifier, you will get a duplicate.
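The effect can be shown with a small Python sketch. This is not the harvester's
actual code; the record layout, the import_record function and the identifier
oai:example.org:123 are made up for illustration. It only mimics matching on
the OAI-PMH identifier and nothing else.

```python
def import_record(catalogue, record):
    """Add or update `record`, matching on the OAI-PMH identifier only."""
    for existing in catalogue:
        if existing.get("oai_id") == record["oai_id"]:
            existing.update(record)
            return "updated"
    catalogue.append(dict(record))
    return "added"


# A record catalogued locally before any harvesting, so it has no OAI identifier.
catalogue = [{"oai_id": None, "title": "Pippi Långstrump"}]
# The same title arrives from the (hypothetical) OAI-PMH source.
harvested = {"oai_id": "oai:example.org:123", "title": "Pippi Långstrump"}

print(import_record(catalogue, harvested))  # added: same title, no shared identifier
print(import_record(catalogue, harvested))  # updated: the identifier now matches
print(len(catalogue))                       # 2, i.e. the title is duplicated
```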

Not having matching rules will essentially make the harvester useless for us,
and I would guess for anyone harvesting from a union catalogue. We don’t want
to add a lot of unnecessary duplicates to our local catalogue. For libraries
that are already running Koha and want to start using the harvester, there
would be a lot of duplicates (possibly everything!). Also, we do not want to
limit libraries to a single source to harvest from: there might be a need in
the future to harvest from multiple sources.

We suggest that the “Staged Marc Management” tool should be used to actually
import the records into Koha, so that the matching rules that apply there would
be used. Alternatively, this functionality could be copied/mirrored for the
harvester.

Small issues
* Viewing a server target, the page doesn’t have a back button or working
breadcrumbs. David has suggested that he might not add a back-button but will
fix the breadcrumbs.
* The reset repository harvest button should have a warning or a help text next
to it, explaining that all harvested records will be removed.
* A help text should be added next to the Until parameter, explaining that it
should not be set for repeated harvests. Otherwise, as the From parameter is
auto-updated with each harvest, Until might end up earlier than From, which
will cause the harvester to fail.
* More detailed information should be presented under “View”, preferably lists
of records imported (where you can click on the bib-id to go to the actual
record), lists of deleted records, updated records etc. We will draw up what we
would like to see in terms of details and send to David. We can also post it
here, if others are interested?
* It would be great if multiple sets could be provided for one OAI server.
* The first time a new server is added, pressing the “Test HTTP and OAI-PMH
parameters” will send you back to the OAI-PMH server targets (oai_client.pl)
page, like you would expect the save button to do. David has confirmed that
this is a bug.
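
On the From/Until issue in the list above: the OAI-PMH specification requires
that from is not later than until, and datestamps are expressed in UTC. The
following Python sketch shows the kind of check that could catch the problem
before a request is sent. The function name list_records_url, the base URL and
the metadataPrefix value are assumptions for the example, not the harvester's
real code or LIBRIS's real settings.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def oai_datestamp(dt):
    """Format a datetime as an OAI-PMH UTC datestamp (YYYY-MM-DDThh:mm:ssZ)."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def list_records_url(base_url, metadata_prefix, from_dt, until_dt=None):
    """Build a ListRecords URL, refusing an Until that lies before From."""
    if until_dt is not None and until_dt < from_dt:
        raise ValueError("Until is earlier than From; leave Until empty "
                         "for repeated harvests")
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": oai_datestamp(from_dt),
    }
    if until_dt is not None:
        params["until"] = oai_datestamp(until_dt)
    return base_url + "?" + urlencode(params)


# Hypothetical repeated harvest: From has been auto-updated past a fixed Until.
fixed_until = datetime(2015, 11, 13, 12, 0, 0, tzinfo=timezone.utc)
updated_from = datetime(2015, 11, 13, 14, 0, 0, tzinfo=timezone.utc)
try:
    list_records_url("https://example.org/oai", "marcxml", updated_from, fixed_until)
except ValueError as err:
    print("refused:", err)
```

With Until left empty, the same call simply builds a request for everything
changed since From.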
