[Koha-bugs] [Bug 10662] Build OAI-PMH Harvesting Client

Tue Nov 24 22:54:54 CET 2015

http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662

--- Comment #44 from Andreas Hedström Mace <andreas.hedstrom.mace at sub.su.se> ---
(In reply to David Cook from comment #41)
> I agree once again with Katrin. I think I've said before (either here or via
> email) that using Zebra for matching can be very unreliable. 
> 
> Currently, I use the unique OAI-PMH identifiers to handle all harvested
> records, and that's quite robust, since that identifier should be
> persistent. However, that obviously doesn't help with matching OAI-PMH
> harvested records against local records created via other methods.

I think we can all agree that any matching rules should not rely on Zebra. But
I do not think that matching only on OAI-PMH identifiers would work for
harvests from union catalogs. If I understand it correctly, it would force every
library that wants to start using Koha with OAI-PMH harvests either to migrate
via OAI-PMH or to end up with a duplicate of each of its records (since none of
the local records will have OAI-PMH identifiers). In our case that would be
about 1.2 million duplicates.

> In the short-term, perhaps merging bibliographic records would have to occur
> manually. Or maybe a deduplication tool could be created to semi-automate
> that task... although I think that tool would have to prevent any deletion
> of OAI-PMH harvested records.

Handling 1.2 million duplicates manually, or even semi-automatically, will most
likely not be possible. Although it might be technically difficult, I still
think some sort of matching rules are necessary.
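
To make the idea concrete, here is a rough sketch of what a layered matching
rule could look like: try the OAI-PMH identifier first, then fall back to keys
that local records may already carry. Everything in it is an assumption for
illustration only: the Record class, its field names, and the choice of
control number and ISBN as fallback keys are hypothetical, not Koha's schema or
existing matching code. It compares stored values directly and does not
involve Zebra.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    # Hypothetical, simplified record; not Koha's actual data model.
    biblionumber: int
    oai_identifier: Optional[str] = None  # only present on harvested records
    control_number: Optional[str] = None  # assumed match key, e.g. MARC 001
    isbn: Optional[str] = None            # assumed match key


def normalize_isbn(isbn: Optional[str]) -> Optional[str]:
    """Strip hyphens and spaces so differently formatted ISBNs compare equal."""
    if not isbn:
        return None
    return isbn.replace("-", "").replace(" ", "").upper()


def find_match(incoming: Record, local_records: List[Record]) -> Optional[Record]:
    """Return the first local record that matches, or None if nothing matches."""
    # Rule 1: a record harvested earlier carries the same OAI-PMH identifier.
    if incoming.oai_identifier:
        for rec in local_records:
            if rec.oai_identifier == incoming.oai_identifier:
                return rec
    # Rule 2: fall back to the control number for records created by other means.
    if incoming.control_number:
        for rec in local_records:
            if rec.control_number == incoming.control_number:
                return rec
    # Rule 3: last resort, compare normalized ISBNs.
    isbn = normalize_isbn(incoming.isbn)
    if isbn:
        for rec in local_records:
            if normalize_isbn(rec.isbn) == isbn:
                return rec
    return None
```

With rules like these, a migrated record without an OAI-PMH identifier could
still be matched on its control number instead of being duplicated. Which keys
are reliable enough would of course depend on the union catalog in question.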

> Actually, this hearkens back to my previous comment. It would be good if
> each record had a simple way of identifying its origin. So you couldn't
> delete a record obtained via OAI-PMH unless its parent repository was
> deleted from Koha or unless you used an OAI-PMH management tool to delete
> records for that repository. 
> 
> I think providing this "source" or "origin" would need to be done
> consistently or rather... extensibly. I wouldn't want it to be OAI-PMH
> specific as that would be short-sighted. 

Marking the source/origin of a record sounds like a good idea to me, if it can
be easily incorporated.
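
As a minimal sketch of how such a source/origin marker might be kept generic
rather than OAI-PMH specific, and used to guard deletions: the Origin class,
the source type strings, and the can_delete function below are all made up for
illustration and do not correspond to anything that exists in Koha.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Origin:
    # Hypothetical origin marker attached to each bibliographic record.
    source_type: str                 # e.g. "oai-pmh", "svc/import_bib", "manual"
    source_id: Optional[str] = None  # e.g. which repository or import client


# Assumed policy: records from these source types are managed by their source.
PROTECTED_SOURCE_TYPES = {"oai-pmh"}


def can_delete(origin: Origin, via_source_tool: bool = False) -> bool:
    """Allow deletion of protected records only through their own management tool."""
    if origin.source_type in PROTECTED_SOURCE_TYPES:
        return via_source_tool
    return True
```

Because the source type is just a label, other import paths could be added to
the protected set later without changing the record structure.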

> At the moment, everything that goes through svc/import_bib uses a
> webservices import_batch... but that's not very unique. It would be
> interesting to have unique identifiers for import sources. So you might use
> the svc/import_bib with the connexion_import_daemon.pl, or with MARCEdit, or
> your home-grown script, or whatever. It would be interesting to distinguish
> those separately... and maybe prevent writes/deletions for records that are
> entered via connexion_import_daemon.pl and home-grown script XYZ, while
> leaving ones imported via MARCEdit to be managed however since you just
> exported some original records and re-imported them via MARCEdit after
> making some changes.

I was going to ask how this would tie in with the development of the REST API,
but David's comment #42 explains that.
