[Koha-bugs] [Bug 10662] Build OAI-PMH Harvesting Client

bugzilla-daemon at bugs.koha-community.org bugzilla-daemon at bugs.koha-community.org
Mon Nov 16 07:08:03 CET 2015


http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662

--- Comment #27 from David Cook <dcook at prosentient.com.au> ---
The first time I started working on this feature, I thought about using "Staged
Marc Management", but there were problems with it that I don't fully recall
(it was over 2 years ago). I do have some memories though:

1. I wouldn't want the harvests accessible via the "Staged Marc Management"
tool, because selective "import"/"undo import" of harvests would be highly
problematic.

You could import 100 records, unimport those 100 records, import 50 records,
and then try to re-import the original 100 records, which include older copies
of those 50. In this case, you might overwrite the newer 50 records with the
older 100 records. Of course, you could opt not to overwrite matches... but
that relies on there being a good matcher, which there very well might not be.

Plus, if you don't overwrite matches and have that setting defined at an
OAI-PMH server level, you're never going to get newer records updating older
records, which is also bad.
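
To make the scenario above concrete, here is a minimal sketch in Python. It is
a toy in-memory model, not Koha's staging code: the names catalogue,
import_batch, undo_import and replay_scenario are all hypothetical, and "v1"
and "v2" just stand for an older and a newer copy of the same record.

```python
# Toy model only: a dict stands in for the catalogue (record id -> version).
# None of these names exist in Koha; they are assumptions for illustration.

def import_batch(catalogue, batch, overwrite=True):
    """Import a batch; replace existing records only if overwrite is set."""
    for rec_id, version in batch.items():
        if overwrite or rec_id not in catalogue:
            catalogue[rec_id] = version


def undo_import(catalogue, batch):
    """Remove every record that belongs to the given batch."""
    for rec_id in batch:
        catalogue.pop(rec_id, None)


def replay_scenario():
    """Replay: import 100, undo, import 50 newer, re-import the old 100."""
    catalogue = {}
    old_batch = {f"rec{i}": "v1" for i in range(100)}
    new_batch = {f"rec{i}": "v2" for i in range(50)}

    import_batch(catalogue, old_batch)
    undo_import(catalogue, old_batch)
    import_batch(catalogue, new_batch)
    import_batch(catalogue, old_batch)  # the older batch wins again

    # Records from the newer batch that were rolled back to the older copy.
    return [rec_id for rec_id in new_batch if catalogue[rec_id] == "v1"]


def never_updates():
    """With overwrite disabled, the newer copies never replace the old ones."""
    catalogue = {}
    import_batch(catalogue, {"rec0": "v1"}, overwrite=False)
    import_batch(catalogue, {"rec0": "v2"}, overwrite=False)
    return catalogue["rec0"]
```

In this model replay_scenario() reports all 50 newer records as rolled back,
and never_updates() shows the opposite failure: with overwriting switched off,
the older copy stays in place.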

2. The "Staged Marc Management" record matcher relies on Zebra, which makes it
prone to not always matching correctly. If something hasn't been indexed
correctly, you'll get duplicate records. It also relies on that Koha
instance's indexing configuration.

In some tests, I've forced the unique OAI-PMH identifier to be placed in the
035$a field... but that field isn't indexed by default. So it would be useless
for matching without an update to the Zebra indexing... which can be achieved
but it's another point of failure.

The matching also relies on import rules defined in Koha. If a staff member
accidentally deletes your OAI-PMH matching rule, you're going to quickly get
many, many duplicate records.

--

I chose to do my own import rules - using only the unique OAI-PMH identifier -
because it was the most reliable way of making sure that harvested records
weren't duplicated against themselves/each other.
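
As a rough illustration of matching on the identifier alone, here is a minimal
sketch in Python. It is not the code from my patch: HarvestedRecord,
HarvestStore and upsert are hypothetical names, the store is just a dict, and
the identifier "oai:example.org:123" is made up.

```python
from dataclasses import dataclass


@dataclass
class HarvestedRecord:
    oai_identifier: str  # unique identifier from the OAI-PMH record header
    marcxml: str         # record payload (placeholder text in this sketch)


class HarvestStore:
    """Hypothetical store keyed only by the OAI-PMH identifier."""

    def __init__(self):
        self._by_identifier = {}

    def upsert(self, record):
        """Add the record, or overwrite the one with the same identifier."""
        is_new = record.oai_identifier not in self._by_identifier
        self._by_identifier[record.oai_identifier] = record
        return "added" if is_new else "updated"

    def get(self, oai_identifier):
        return self._by_identifier.get(oai_identifier)

    def __len__(self):
        return len(self._by_identifier)


store = HarvestStore()
first = store.upsert(HarvestedRecord("oai:example.org:123", "<record>old</record>"))
second = store.upsert(HarvestedRecord("oai:example.org:123", "<record>new</record>"))
```

Harvesting the same identifier twice leaves one record in the store, holding
the most recently harvested copy. There is no index lookup and no
user-editable matching rule involved, which is the point.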

In the event that you're harvesting holdings, you also need to have the
original bibliographic record in Koha. That means that if you're doing
duplicate matching, it must overwrite local bibliographic records 100% of the
time. Otherwise, your holdings won't know which bibliographic record to bind
to. If you're using "Staged Marc Management", it's easy to accidentally
misconfigure it so that you're not overwriting local bibliographic records,
and then you have problems again.

Another reason I chose to do my own import rules is that I don't think you
can trust the user to manage the OAI-PMH harvester configuration completely.

--

That all said, I think perhaps the "Staged Marc Management" system might be
able to be leveraged... I just don't want it to be configurable by end users,
since it needs very particular settings in order to work correctly. 

Unfortunately, this means that you're going to lose some of the functionality
you want, like being able to look at all the records in a harvest.

However, the idea of a "harvest" doesn't really make sense if you're using the
harvester every few seconds. Each "harvest" might only have 1-2 records in it,
so the concept of harvests becomes a bit unhelpful.

--

Ultimately, I think we'll need to discuss the import and duplication part of
the feature more...
