[Koha-bugs] [Bug 10662] Build OAI-PMH Harvesting Client

Tue Sep 8 05:42:15 CEST 2015

http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662

--- Comment #19 from David Cook <dcook at prosentient.com.au> ---
Hi all!

I've finally got something up for testing, so please everyone take some time to
test it out. So much has changed since I first started working on this back in
2013, but hopefully it should provide all the functionality that you need.

I'm sure that the user interface could use more attention, so I'd love to
receive feedback on that.

I'd also love to hear back about how the feature works. The key component is
the "oai_harvester.pl" cronjob, which will be set up by a system administrator.
I don't think there's much that a web user can do to affect that, although I
have seen other bugs talking about giving web users control over scheduling
tasks. I think web users controlling scheduling would be outside the scope of
this bug.

Unlike "Staged MARC Management", there is currently no way of un-importing
and re-importing. You can only "reset repository harvest", which deletes all
currently harvested records and allows you to schedule a fresh harvest.
While I originally was going to leverage the "Staged MARC Management" code, I
decided that giving web users control over selectively un-importing and
re-importing batches of records harvested via OAI-PMH could be really
problematic. That is, you might un-import a batch which deletes 10 records,
import a new batch which contains those 10 records, then try to re-import an
earlier batch of those 10 records. Even if the (optional) record matching rules
were set up perfectly, your Koha records would be wrong; they'd be for an older
version of the upstream record. I decided that once a record has been added to
Koha, all further updates and deletions should be automatic. And if a record is
deleted from Koha (other than by "resetting the harvest"), then it cannot be
re-added; the attempt will instead generate an error (since you can't currently
"undelete" a bibliographic record or re-add it with the same biblionumber).
However, I'm happy to discuss options for handling records that have been
deleted from Koha. There is code that checks if the record has been deleted in
Koha, so it would be trivial to add a new record with a new biblionumber,
although I'd have to update some other code which expects a unique OAI-PMH
identifier to be tied to only 1 Koha biblionumber whereas in this case it would
have 2 or more.
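
For discussion purposes, the rules described above can be sketched roughly as
follows. This is an illustrative Python sketch, not the actual Koha code (which
is Perl). The names "decide_action", "biblio_for_identifier" and
"existing_biblionumbers" are hypothetical, and the SKIP case for a record that
is already deleted upstream before it was ever harvested is an assumption that
the comment above does not cover.

```python
from enum import Enum


class Action(Enum):
    ADD = "add"
    UPDATE = "update"
    DELETE = "delete"
    SKIP = "skip"
    ERROR = "error"


def decide_action(oai_identifier, upstream_deleted,
                  biblio_for_identifier, existing_biblionumbers):
    """Decide what to do with one harvested record.

    biblio_for_identifier: dict mapping an OAI-PMH identifier to the single
        biblionumber it was imported as (hypothetical structure).
    existing_biblionumbers: set of biblionumbers still present in the catalogue
        (hypothetical structure).
    """
    biblionumber = biblio_for_identifier.get(oai_identifier)

    if biblionumber is None:
        # Never harvested before. Skipping an upstream deletion here is an
        # assumption; there is nothing local to delete.
        return Action.SKIP if upstream_deleted else Action.ADD

    if biblionumber not in existing_biblionumbers:
        # The record was deleted locally, so it cannot be re-added under the
        # same biblionumber: report an error instead.
        return Action.ERROR

    # The record exists locally: updates and deletions are applied
    # automatically.
    return Action.DELETE if upstream_deleted else Action.UPDATE
```

Allowing a deleted record to come back under a new biblionumber would mean
changing "biblio_for_identifier" from a one-to-one mapping into a one-to-many
mapping, which is the extra work mentioned above.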

In fact, I'm happy to discuss every part of this code. 

Some of you might be interested in improving performance. At the moment,
"oai_harvester.pl" runs synchronously: first all the records are downloaded
into the database, and only then are they processed and imported into Koha. For
initial imports or large imports, this takes hours. However, I've recently
gained a lot of experience using POE (Perl
Object Environment). Using POE, I could presumably write an asynchronous
program which could import records as they're received, rather than waiting for
the entire harvest to complete. Unfortunately, POE was removed from Koha's
dependencies in the past year or so, but I don't think it would be problematic
to add it to the dependencies once again. 
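
To illustrate the "import records as they're received" idea, here is a minimal
producer/consumer sketch. It uses Python's asyncio rather than POE, purely as a
stand-in to show the shape of the approach; it is not a proposal for the actual
implementation. The functions "harvest", "import_records", "run_harvest" and
"fake_fetch_page" are hypothetical, and the fake fetcher stands in for real
OAI-PMH ListRecords requests, which are paged using a resumptionToken.

```python
import asyncio


async def harvest(fetch_page, queue):
    """Producer: fetch one page at a time and queue records immediately."""
    token = None
    while True:
        records, token = await fetch_page(token)
        for record in records:
            await queue.put(record)
        if not token:  # no resumption token means the list is complete
            break
    await queue.put(None)  # sentinel: tells the consumer to stop


async def import_records(queue, import_record):
    """Consumer: import each record as soon as it is available."""
    imported = 0
    while True:
        record = await queue.get()
        if record is None:
            break
        import_record(record)
        imported += 1
    return imported


async def run_harvest(fetch_page, import_record, max_pending=100):
    # A bounded queue stops the download from running far ahead of the import.
    queue = asyncio.Queue(maxsize=max_pending)
    _, imported = await asyncio.gather(
        harvest(fetch_page, queue),
        import_records(queue, import_record),
    )
    return imported


async def fake_fetch_page(token):
    """Stand-in for a network request: returns (records, next_token)."""
    pages = {
        None: (["rec1", "rec2"], "page2"),
        "page2": (["rec3"], None),
    }
    await asyncio.sleep(0)  # yield control, as a real request would
    return pages[token]
```

Calling asyncio.run(run_harvest(fake_fetch_page, print)) would import the
records from the first page while the second page is still being requested,
instead of waiting for the whole harvest to finish.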

--

Despite my posting these patches, the work isn't done yet.

A keen observer will note that there is a lack of consistency in naming. I
variously say "oai_client", "oai_server", "oai_target", and "oai repository".
It's not always clear which one I mean, even though I know what I mean. I want to
be clear in differentiating this feature from Koha's OAI-PMH server as well. I
would be receptive to comments about preferred terminology in both the backend
and the web app.

Additionally, I also need to do the following:

1) Add unit tests
2) Revise the embedded POD in the code
3) Add help pages (and possibly hints/tips in the templates for web users)

I'm going to hold off on these 3 tasks for the moment until we get further into
the testing. Otherwise they'll just need to be revised again after more code
iterations. (That said, it would have been smart to have written unit tests
from the beginning as I built up the code. Alas. Next time.)
