[Koha-bugs] [Bug 10662] Build OAI-PMH Harvesting Client

bugzilla-daemon at bugs.koha-community.org
Wed Nov 30 00:22:02 CET 2016


https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662

David Cook <dcook at prosentient.com.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|Patch doesn't apply         |In Discussion

--- Comment #108 from David Cook <dcook at prosentient.com.au> ---
I've been thinking more about matching, and how using an OAI-PMH identifier
really isn't enough, especially as the identifier is unique only within its
repository. You could have two separate repositories using the exact same
identifier, so you need to check the OAI-PMH repository URL as well.

https://www.openarchives.org/OAI/openarchivesprotocol.html#UniqueIdentifier
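
In other words, whatever matchpoint we store needs to be the (repository URL,
identifier) pair rather than the identifier alone. A trivial sketch of that
idea in Python (the structure and names here are made up purely for
illustration, nothing Koha actually has):

    # The OAI-PMH identifier alone is ambiguous across repositories, so the
    # lookup is keyed on (repository URL, identifier) together.
    harvested_links = {
        ("http://repo-a.example.org/oai", "oai:example:1234"): 42,  # biblionumber 42
        ("http://repo-b.example.org/oai", "oai:example:1234"): 77,  # same identifier, other repo
    }

    def find_biblionumber(repository_url, oai_identifier):
        # Both halves of the key are required for an unambiguous match.
        return harvested_links.get((repository_url, oai_identifier))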

There are a few ways of verifying that two records describe the same upstream
record, but they all involve some analysis.

https://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm#Identifiers

And that analysis gets tricky when you want to match against MARCXML records
using Zebra, especially since different frameworks may or may not contain the
fields that you'd store the OAI-PMH data in for the purposes of matching.

--

I'm thinking of maybe making a sort of tiered search... where we first search
the database for OAI-PMH details... and if none are found, we fall back to a
Zebra search. However, that's problematic, as it introduces inconsistencies
between the different methods of importing.
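
Roughly what I have in mind, as a sketch (the two lookups are hypothetical
stand-ins for whatever Koha would actually provide, so they're passed in as
callables here):

    def match_harvested_record(repository_url, oai_identifier, marcxml,
                               lookup_oai_link, zebra_match):
        # Tier 1: the (repository URL, identifier) pair recorded at harvest
        # time, looked up directly in the database.
        biblionumber = lookup_oai_link(repository_url, oai_identifier)
        if biblionumber is not None:
            return biblionumber
        # Tier 2: fall back to a Zebra search against the incoming MARCXML,
        # which only helps if the framework stores the matchpoint fields at all.
        return zebra_match(marcxml)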

--

To date, we think about importing only in terms of MARCXML... and that makes
some sense. So with OAI-PMH, surely we can still just think of it in terms of
MARCXML. Except that the harvested record isn't necessarily the same as the
imported record. 

Although maybe it should be. 

Maybe instead of using OAI-specific details, we should require the use of the
035 field (System Control Number):
http://www.loc.gov/marc/bibliographic/bd035.html

I don't know how realistic that is though. How many organisations actually have
registered MARC Organisation codes?

Maybe that's the prerogative of the Koha user rather than the Koha system
though. 

It looks like VuFind uses the MARC 001 for matching
(https://github.com/vufind-org/vufind/blob/master/import/marc.properties),
although that's obviously highly problematic. It has some facility for adding a
prefix to the 001 for uniqueness, but that's a hack.

A sample harvest of DSpace's oai_dc into VuFind's Solr indexes uses the OAI-PMH
identifier for matching, it seems (https://vufind.org/wiki/indexing:dspace),
but as I noted above that's also technically problematic, as you may have the
same identifier in multiple repositories. In theory it shouldn't happen... but
it could.

It looks like DSpace uses the OAI-PMH identifier (stored in the database, it
seems) for matching as well
(https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/harvest/OAIHarvester.java#L485).
As noted above, this has issues if the identifier isn't unique outside the
repository.

DSpace has a sanity check to make sure the item hasn't been deleted before
trying to do an update... I've been thinking I could match an OAI-PMH
identifier to a biblionumber using a foreign key with ON DELETE RESTRICT, so
that you can't delete a bib record without unlinking it from its OAI-PMH
provenance record.
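
A minimal sketch of that constraint (table and column names invented for
illustration; Koha itself sits on MySQL/MariaDB, but SQLite shows the same
ON DELETE RESTRICT behaviour):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this switched on

    conn.execute("CREATE TABLE biblio (biblionumber INTEGER PRIMARY KEY)")
    conn.execute("""
        CREATE TABLE oai_harvest_link (
            repository_url TEXT NOT NULL,
            oai_identifier TEXT NOT NULL,
            biblionumber   INTEGER NOT NULL
                REFERENCES biblio (biblionumber) ON DELETE RESTRICT,
            PRIMARY KEY (repository_url, oai_identifier)
        )
    """)

    conn.execute("INSERT INTO biblio VALUES (42)")
    conn.execute("INSERT INTO oai_harvest_link VALUES (?, ?, ?)",
                 ("http://repo-a.example.org/oai", "oai:example:1234", 42))

    # Deleting the bib record while the provenance link still exists is
    # blocked, which is exactly the safety net described above.
    try:
        conn.execute("DELETE FROM biblio WHERE biblionumber = 42")
    except sqlite3.IntegrityError as error:
        print("delete blocked:", error)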

So VuFind and DSpace don't have the most sophisticated of matchers, and they
both have problems that I'd like to avoid.

--

I recall Leif suggesting that we export some data to MySQL tables (e.g. 001,
003, 020, 022, 035), but that's not without its difficulties either, and it
also keeps us locked into MARC.
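
For what it's worth, the extraction side of that idea is straightforward
enough. A rough sketch with pymarc (the target tables themselves, and what
you'd do with the extracted values, are left out):

    from pymarc import parse_xml_to_array

    def extract_matchpoints(marcxml_path):
        # Collect the candidate matchpoint fields from each harvested record.
        rows = []
        for record in parse_xml_to_array(marcxml_path):
            def control(tag):
                fields = record.get_fields(tag)
                return fields[0].data if fields else None
            rows.append({
                "001": control("001"),
                "003": control("003"),
                "020": [a for f in record.get_fields("020") for a in f.get_subfields("a")],
                "022": [a for f in record.get_fields("022") for a in f.get_subfields("a")],
                "035": [a for f in record.get_fields("035") for a in f.get_subfields("a")],
            })
        return rows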

--

I also remember Mirko mentioning Catmandu::OAI, but it's just a layer over
HTTP::OAI, and HTTP::OAI is flawed in a few ways and won't meet Stockholm
University Library's requirement of having an OAI-PMH harvester that parses an
XML stream.

In any case... downloading the OAI-PMH records is the easy part. The hard part
is what to do with them once we're loading them into Koha...

--

For a truly robust solution, I think we'd need to overhaul Koha's importing
facilities, and I'm not sure of the best way to do that.

We need to be able to link any number of arbitrary identifiers with a
particular Koha "record", and we need to be able to query those identifiers in
a way that allows for rapid matching.

To be honest, this is something that Linked Data seems to be good at. You have
a particular subject, and then you can link data to it arbitrarily. Then for
your query, you could look for "?subject <http://koha/hasOAIPMHid>
<oai:koha:5000>" or "?subject <http://koha/marc/controlNumber> '666'"
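
Just to illustrate that shape of query, a quick sketch with rdflib (the
predicates are the made-up ones from the example above, not anything that
exists in Koha):

    from rdflib import Graph, Literal, URIRef

    g = Graph()
    bib = URIRef("http://koha/biblio/42")

    # Attach however many identifiers we like to the one bib record.
    g.add((bib, URIRef("http://koha/hasOAIPMHid"), URIRef("oai:koha:5000")))
    g.add((bib, URIRef("http://koha/marc/controlNumber"), Literal("666")))

    # Matching is then just a lookup on whichever identifier we happen to hold.
    results = g.query(
        "SELECT ?subject WHERE { ?subject <http://koha/hasOAIPMHid> <oai:koha:5000> }"
    )
    for row in results:
        print(row.subject)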

Of course, ElasticSearch would work just as well, although you'd still want to
save that data somewhere as a source of truth. We don't tend to use things like
Zebra/Solr/ElasticSearch as the sole repository of data, since we want to be
able to refresh them from a source of truth.

I suppose both a triplestore and an RDBMS have the same problem. You can link an
identifier to a record, but what if you lose that link? You wind up with
duplicates. I suppose the best you can do is try to prevent people from
destroying links by accident.

-- 
You are receiving this mail because:
You are watching all bug changes.

