[Koha-bugs] [Bug 10662] Build OAI-PMH Harvesting Client

bugzilla-daemon at bugs.koha-community.org
Mon Jan 23 23:58:30 CET 2023


https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662

--- Comment #319 from David Cook <dcook at prosentient.com.au> ---
(In reply to Koha Team University Lyon 3 from comment #318)
> We would like to use an existing OAI-PMH harvester and 2 of them seem to be
> good candidates:
> - Catmandu harvester
> - HTTP::OAI::Harvester module
> 
> Both are in Perl. At the moment, we thought that the second could be a
> better choice because it's more up to date and we don't necessarily need to
> use Catmandu.

Currently, we're still stuck using HTTP::OAI 3.27 from 2011 in Koha for OAI-PMH
server functionality. Bug 17704 tracks getting a later version of HTTP::OAI
working with Koha, but it has been open for about 6 years now.
(HTTP::OAI was also a dead project for a few years but it was resurrected in
2017 by one of the Catmandu authors. On that note, I think it is a good idea to
avoid Catmandu.)

One problem with HTTP::OAI that I encountered back in 2016 was that it needed
to parse the entire XML response into a DOM Document tree before doing anything
with it, rather than processing records incrementally as the response was
parsed.

This usually isn't a problem because most repositories use resumptionToken
elements and limit responses to approximately 100 records. But LIBRIS in Sweden
would stream the entire response back without resumptionToken elements, so a
single XML response could contain the entire catalogue's worth of records.
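
To illustrate the resumptionToken flow control described above, here is a
minimal sketch. It is written in Python with only the standard library, so it
is not HTTP::OAI's API and not Koha code; the names harvest, fetch_url and the
fetch parameter are hypothetical. It assumes the repository answers ListRecords
in pages and signals the last page with a missing or empty resumptionToken.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def fetch_url(url):
    """Default fetcher (hypothetical helper): return the raw response body."""
    with urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix="oai_dc", fetch=fetch_url):
    """Yield <record> elements page by page, following resumptionToken."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        # Each page is parsed whole; that is cheap when a page holds ~100 records.
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        yield from root.iter(OAI_NS + "record")

        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no token, or an empty one: the list is complete

        # A resumed request carries only the verb and the resumptionToken.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

With a repository that never sends a resumptionToken, this loop makes a single
request and the whole catalogue arrives as one page, which is the LIBRIS case.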

That said, in theory the HTTP::OAI module uses event-driven SAX XML parsing, so
it shouldn't be building a DOM Document tree from the response. Maybe the dev
environment I was using in 2016 didn't have the correct SAX parser
dependencies, so it unintentionally fell back to a DOM-based parser.
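
As a rough illustration of the difference, the sketch below handles each record
as soon as its closing tag is parsed instead of building a tree for the whole
response first. It uses Python's xml.etree.ElementTree.iterparse, which is an
incremental parser rather than SAX, so it shows the idea only and says nothing
about how HTTP::OAI works internally; iter_record_identifiers is a hypothetical
name.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def iter_record_identifiers(stream):
    """Yield each record's identifier as soon as its </record> tag is parsed."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == OAI_NS + "record":
            identifier = elem.find(f"{OAI_NS}header/{OAI_NS}identifier")
            yield identifier.text if identifier is not None else None
            # Free the record's subtree; only an empty shell stays attached
            # to the parent, so memory no longer scales with record content.
            elem.clear()
```

Because the function accepts any file-like object, it could be fed an HTTP
response body directly, so a very large single response would not need to be
held in memory as one document.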

Plus, I suppose we could say that the Koha OAI-PMH harvester doesn't support
OAI-PMH repositories that don't use resumptionToken elements for flow control. 

Or, since HTTP::OAI is no longer dead, that issue can always be pursued with
the current maintainer.

So overall... HTTP::OAI is probably the way to go. Just wanted to add a warning
about my past experience with it.

> What we would like is to use all the import tools already existing in Koha
> (XSLT, Record matching rules, MARC modification templates, Stage MARC for
> import, Manage MARC overlay rules).
> 
> We would like to add an OAI-PMH setting (like Z39.50 / SRU) in the staff
> interface with URL, SET, XML Format, authentication login, biblio/authority
> records, deleted records handling, email for logs, XSLT file, encoding,
> items handling, profile import.

Sounds like a plan. I suspect it will involve a lot of testing. It might be
worthwhile to break some of that functionality out into separate tickets, so
that the whole patch set doesn't need to be re-tested for minor fixes outside
the core harvester functionality.

> Every harvest would be scheduled only via the cronjobs.

That should make it easy to implement and test.



More information about the Koha-bugs mailing list