[Koha-bugs] [Bug 10662] Build OAI-PMH Harvesting Client

bugzilla-daemon at bugs.koha-community.org
Mon Sep 2 09:21:49 CEST 2013


http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662

--- Comment #3 from David Cook <dcook at prosentient.com.au> ---
Created attachment 20757
  -->
http://bugs.koha-community.org/bugzilla3/attachment.cgi?id=20757&action=edit
Bug 10662 - Build OAI-PMH Harvesting Client

N.B. This feature is still a work in progress. This commit represents
my work to date on the OAI-PMH harvester, but it's not complete. While
the core classes are operational, I still need to improve the cronjob,
the DC to MARC21 XSLT, and the internal workings of the classes (mostly
in terms of error handling, edge cases, and reporting).

This patch adds several new files to Koha:

Overview:
1) C4/OAI/Harvester.pm: This contains 2 classes. The Harvester class
sets up the 6 OAI-PMH verbs (2 of which harvest records) and imports
records into Koha. The Harvester::Record class is a helper class for
processing records, transforming metadata, etc. At the moment, this is
hardwired for MARC21 but it's easy enough to expand it out to other
flavours (provided there are XSLTs there for the metadata transforms).

2) koha-tmpl/intranet-tmpl/prog/en/xslt/DC2MARC21slim.xsl: This is an
XSLT which transforms oai_dc into MARC21. This is a lightweight XSLT
based on one I found from the Library of Congress. I improved the
Leader and I will endeavour to improve the 008 and map more fields in
a more orderly way than this. However, it's a good start.

3) koha-tmpl/intranet-tmpl/prog/en/xslt/MARC21slim2MARC21enhanced.xsl:
This simply strips 999 fields from incoming records, adds the OAI-PMH
unique identifier, and adds a 999 field if that unique identifier has
already been imported in the past (the 999 is added so that automatic
matching and replacement can take place on import).

4) misc/cronjobs/oai_harvester.pl: At the moment, this script takes
database config for an OAI-PMH repository to create a Harvester object,
then tries each OAI-PMH verb and imports the resulting records into
Koha.

5) t/db_dependent/OAI_harvester.t: This is a unit test which uses
Koha as the OAI-PMH repository and client in a circular loop. It
should test the high level methods and go from retrieving database
config to importing records. It uses the "rollback" method so you
won't have a bunch of records imported into your database (although
your autoincrement will probably go up anyway).
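
As a rough illustration of the six OAI-PMH verbs mentioned in item 1,
here is a minimal sketch of how a request URL is put together for each
verb. This is not the code in C4/OAI/Harvester.pm (which is Perl); the
function name build_request_url and the example.org base URL are
assumptions made up for this example. The verb names and the "verb"
query parameter come from the OAI-PMH 2.0 protocol.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
OAI_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)

# The two verbs whose responses carry full records.
HARVESTING_VERBS = ("ListRecords", "GetRecord")


def build_request_url(base_url, verb, params=None):
    """Return the GET URL for one OAI-PMH request.

    base_url and params are supplied by the caller; params is a plain
    dict because "from" is an OAI-PMH argument but a Python keyword.
    """
    if verb not in OAI_VERBS:
        raise ValueError("Unknown OAI-PMH verb: %s" % verb)
    query = {"verb": verb}
    query.update(params or {})
    return base_url + "?" + urlencode(query)


if __name__ == "__main__":
    # Hypothetical repository URL, for illustration only.
    base = "http://example.org/oai"
    print(build_request_url(base, "Identify"))
    print(build_request_url(base, "ListRecords", {"metadataPrefix": "oai_dc"}))
```

Passing the arguments as a dict keeps the sketch close to the protocol's
own argument names (metadataPrefix, from, until, set, identifier).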

Test Plan (For MARC21 users):
0) Apply the patch and run updatedatabase.pl (it will add two new
tables to your database: oai_harvest and oai_harvest_repositories).
I've documented them in kohastructure.sql.

1) For starters, run the OAI_harvester.t test. It should cycle
through all the tests without any problems. If there are problems,
let me know on Bugzilla, in IRC, or via email.

2a) Next, if you feel safe importing records into your database, use the
config from the unit test as an example and create an entry in your
oai_harvest_repositories table. Be careful with the "import_mode".
"Automatic" will automatically stage and import any records
harvested via OAI-PMH. "Manual" will stage them but not import them.

2b) Once you're somewhat sure of your config, run oai_harvester.pl. By
default it does not follow resumption tokens automatically (so your
import batch will most likely contain only 50 records). You can change
this in the cronjob itself.

If you fully import this batch, try running "oai_harvester.pl" again.
You shouldn't get any results (as you've already imported that batch
of records). To try out the "replace/update" feature: update one of
the original records you imported (out of that first 50), re-index,
and re-run "oai_harvester.pl". You should now get 1 record in your
batch which should automatically match the original. In "automatic"
mode, it would automatically replace the record (although this can
still be reverted in the Manage Staged MARC Records tool). In "manual"
mode, you'll notice in the management interface that there has
been a match for that record with the original.

3) To more fully explore: Examine C4/OAI/Harvester.pm. I wrote the POD
at the end of August, so it's a bit out of date but it should be
mostly accurate. I've tried to include as many comments as possible
along the way as well.

It's not a very sophisticated module. However, there are a lot of
different scenarios that I'm trying to account for. I might've missed
some use/edge cases, written bad code, or handled errors poorly. A
good place to start might be any FIXME or TODO messages.
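
Step 2b refers to token resumption. For anyone unfamiliar with it, the
sketch below shows the general idea under the OAI-PMH 2.0 protocol: a
ListRecords response may carry a resumptionToken element, and a
non-empty token means more batches are available. This is only an
illustration in Python, not how Harvester.pm does it; the function
names and the sample XML are assumptions invented for the example.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def extract_resumption_token(response_xml):
    """Return the resumption token in a response, or None.

    None is returned both when the element is absent and when it is
    empty (an empty element marks the last batch of a list).
    """
    root = ET.fromstring(response_xml)
    element = root.find(".//" + OAI_NS + "resumptionToken")
    if element is None:
        return None
    token = (element.text or "").strip()
    return token or None


def next_request_params(token):
    """Arguments for the follow-up request.

    resumptionToken is an exclusive argument: it is sent with the verb
    only, without metadataPrefix, from, until, or set.
    """
    return {"resumptionToken": token}


# Made-up response fragment, for illustration only.
SAMPLE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<ListRecords><resumptionToken>abc123</resumptionToken></ListRecords>"
    "</OAI-PMH>"
)

if __name__ == "__main__":
    token = extract_resumption_token(SAMPLE)
    if token is not None:
        print(next_request_params(token))
```

A harvester that follows tokens would repeat the request with these
arguments until no token comes back; one that does not would stop after
the first batch.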

--

As I said at the top, this is still a work in progress, but I'd be
happy to get any feedback or advice on how to proceed with this
feature.

I'll continue working on it and post updates.
