[Koha-bugs] [Bug 10662] Build OAI-PMH Harvesting Client

bugzilla-daemon at bugs.koha-community.org bugzilla-daemon at bugs.koha-community.org
Wed Dec 2 08:25:54 CET 2015


http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662

--- Comment #50 from David Cook <dcook at prosentient.com.au> ---
(In reply to Katrin Fischer from comment #33)
> I was told recently that 2-3 seconds is quite standard for OAI-PMH harvests.
> 

Katrin, who said this to you? Andreas was also interested in harvesting every
2-3 seconds, but that doesn't seem very feasible to me.

Today I tried downloading records, and I was able to download 21 records from a
Swedish server in 4-5 seconds.

The OAI-PMH harvester uses synchronous code for downloading records, so if
you have multiple OAI-PMH servers, it has to download from Server A, then
Server B, then Server C... and only then does it start processing records.

If each server takes 5 seconds, that's 15 seconds before you even start
processing the first record. 
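
To illustrate, here's roughly what that synchronous flow looks like with
HTTP::OAI (the base URLs below are just placeholders, not real configuration):

    #!/usr/bin/perl
    # Each server is harvested in turn, so the total wall time is the
    # sum of the individual downloads.
    use Modern::Perl;
    use HTTP::OAI;

    my @base_urls = (
        'http://servera.example.org/oai',
        'http://serverb.example.org/oai',
        'http://serverc.example.org/oai',
    );

    my @records;
    for my $url (@base_urls) {
        my $harvester = HTTP::OAI::Harvester->new( baseURL => $url );
        my $response  = $harvester->ListRecords( metadataPrefix => 'marcxml' );
        die $response->message if $response->is_error;
        while ( my $record = $response->next ) {
            push @records, $record;
        }
    }
    # ...only now can processing/importing begin...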

I think I might be able to find some asynchronous code for downloading records
with Perl, but even then it might take 5 seconds or longer just to download
records... that's longer than the ideal 2-3 seconds. Plus, the asynchronous
code would require me to stop using the HTTP::OAI module and write my own
asynchronous version of it... which would take some time and would probably be
more error-prone, given the speed at which I'm trying to develop.
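
For what it's worth, here's a sketch of what overlapping the downloads might
look like using AnyEvent::HTTP (just one option, and only a sketch: it
bypasses HTTP::OAI entirely, so the raw ListRecords XML would still need to be
parsed by hand or by a new asynchronous OAI-PMH layer):

    #!/usr/bin/perl
    # Fire all the requests at once; the total wait is roughly the
    # slowest server rather than the sum of all three.
    use Modern::Perl;
    use AnyEvent;
    use AnyEvent::HTTP;

    my @urls = map {"$_?verb=ListRecords&metadataPrefix=marcxml"} (
        'http://servera.example.org/oai',
        'http://serverb.example.org/oai',
        'http://serverc.example.org/oai',
    );

    my $cv = AE::cv;    # condvar, signalled once every request has finished
    my %responses;

    for my $url (@urls) {
        $cv->begin;
        http_get $url, sub {
            my ( $body, $headers ) = @_;
            $responses{$url} = $body if $headers->{Status} =~ /^2/;
            $cv->end;
        };
    }

    $cv->recv;
    # %responses now holds one raw ListRecords XML document per server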

I suppose 21 records might be a lot for a harvest running every 2-3 seconds...
I just tried the query
"verb=ListRecords&metadataPrefix=marcxml&from=2015-12-01T18:01:45Z&until=2015-12-01T18:01:47Z",
and my browser downloaded 4 records in 1 second.
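
For a harvester running on that sort of schedule, the from/until window would
presumably be computed afresh on each pass; something like the following,
where the 2-second window and the base URL are only illustrative:

    # OAI-PMH timestamps are UTC, hence gmtime().
    use Modern::Perl;
    use POSIX qw(strftime);
    use URI;

    my $now   = time;
    my $from  = strftime( '%Y-%m-%dT%H:%M:%SZ', gmtime( $now - 2 ) );
    my $until = strftime( '%Y-%m-%dT%H:%M:%SZ', gmtime($now) );

    my $uri = URI->new('http://servera.example.org/oai');
    $uri->query_form(
        verb           => 'ListRecords',
        metadataPrefix => 'marcxml',
        from           => $from,
        until          => $until,
    );
    say $uri;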

I suppose it might only take another 1-2 seconds to process those 4 records and
import them into Koha. That's just a guess though, as I haven't written the
necessary new processing/importing code yet. 

I suppose if I'm sending HTTP requests asynchronously, and if it only takes 1
second to fetch a handful of records, it might be doable in 2-3 seconds... but
the more records there are to fetch from the server, the longer the download
is going to take, and that blows out the overall time. If 2-3 seconds is just
an ideal, it might not matter if it takes 5-10 seconds.

I'm keen for feedback from potential users of the OAI-PMH harvester. How much
does timing/frequency matter to you?

This might be a case of premature optimisation. It might be better for me to
focus on building a functional system, and then worry about improving the
speed later. I did that recently on a different project, and it worked quite
well: I spent the majority of my time meeting the functional requirements, and
then a few hours tweaking the code for large performance gains. However, if
the system would need to be re-designed to gain those performance increases,
that approach seems wasteful.

--

Another thought I had was to build an "import_oai" API into Koha, and then
write the actual OAI-PMH harvester in a language which is asynchronous by
design, like Node.js. Not that I'm excellent with Node.js: I've written code
in my spare time which fetches data from a database and asynchronously updates
a third-party REST API, but it's certainly not elegant... and requiring
Node.js adds a layer of complexity that I doubt the Koha community would want
to support in any way, shape, or form. But we could create an import API and
then rely on individual libraries to supply their own OAI-PMH harvesters...
although for that to work successfully, we would need standards for the
conversation between harvesters and importers.

I'm thinking the "import_oai" or "import?type=oai" API might be a good idea in
any case, although I'm not sure how Apache would cope with being hammered by an
OAI-PMH harvester sending it multiple XML records every few seconds.
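
To make the idea concrete, here's a purely hypothetical sketch of a harvester
handing one record to such an API. No such endpoint exists yet; the URL and
response handling are invented for illustration:

    use Modern::Perl;
    use LWP::UserAgent;

    my $ua  = LWP::UserAgent->new( timeout => 30 );
    my $xml = '<record>...</record>';    # one harvested record, as MARCXML

    # The /svc/import_oai path is hypothetical.
    my $response = $ua->post(
        'http://koha.example.org/cgi-bin/koha/svc/import_oai',
        Content_Type => 'text/xml',
        Content      => $xml,
    );
    die 'Import failed: ' . $response->status_line
        unless $response->is_success;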

Perhaps it's worthwhile having one daemon for downloading records, and another
for importing records. Perhaps it's worth writing a forking server to handle
incoming records in parallel. 
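
A forking importer could be fairly simple with something like
Parallel::ForkManager. Again only a sketch: get_next_batch() and
import_batch() stand in for whatever queue and import logic end up sitting
between the two daemons:

    use Modern::Perl;
    use Parallel::ForkManager;

    # Placeholder queue and import logic, just to make the sketch run.
    my @queue = ( ['record1.xml'], ['record2.xml'], ['record3.xml'] );
    sub get_next_batch { return shift @queue }
    sub import_batch   { my ($batch) = @_; sleep 1 }  # stand-in for slow work

    my $pm = Parallel::ForkManager->new(4);  # at most 4 concurrent importers

    while ( my $batch = get_next_batch() ) {
        $pm->start and next;     # parent keeps looping; child falls through
        import_batch($batch);    # child does the slow import work
        $pm->finish;             # child exits
    }
    $pm->wait_all_children;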

--

Honestly, though, I would ask that people think further about the frequency of
harvests. Is every 2-3 seconds really necessary? Do we really need it to
perform that quickly?

If so, I'm open to ideas about how to achieve it. I have lots of ideas, as
outlined above, but I'm more than happy to hear suggestions, and even happier
to be told not to worry about the speed.

Unless people think it's a concern, I'm going to continue development with
slower synchronous code. I want to make this harvester as modular as possible,
so that future upgrades don't require a rewrite of the whole system.

Right now, I see the bottleneck being the downloading of records and the
passing of those records to a processor/importer. The importer, at least for
KB, is going to be difficult in terms of the logic involved, but I'm not
necessarily that worried about its speed at this point. So I might prototype a
synchronous downloader as quickly as I can, and spend more time on the
importer and on refactoring the existing code.

-- 
You are receiving this mail because:
You are watching all bug changes.

