[Koha-devel] [Koha] OAI-PMH harvester

Arthur Suzuki arthur.suzuki at biblibre.com
Wed Nov 23 14:44:21 CET 2022


Hello there,
If I may suggest a good harvester library, Catmandu may do the job
pretty well.
I've not used its OAI module, but I have used Catmandu to harvest from
a JSON source and transform the result into a UNIMARC file, with pretty
good success so far.
It can export seamlessly to ISO 2709 or MARCXML.
https://metacpan.org/dist/Catmandu-OAI
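
For anyone who has not looked at the protocol itself, here is a rough
sketch of what any OAI-PMH harvester has to do under the hood: issue
ListRecords and keep following the resumptionToken until the repository
stops returning one. This is not Catmandu or Koha code. It is plain
Python used only for illustration, and the function names (harvest,
fetch_url) and the example URL are made up for the sketch.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def fetch_url(url):
    """Default fetcher (hypothetical helper): HTTP GET, returns the body."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix="oai_dc", set_spec=None, fetch=fetch_url):
    """Yield every <record> element, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    while True:
        body = fetch(base_url + "?" + urllib.parse.urlencode(params))
        root = ET.fromstring(body)
        error = root.find("oai:error", OAI_NS)
        if error is not None:
            if error.get("code") == "noRecordsMatch":
                return  # an empty result is not a failure
            raise RuntimeError(f"OAI-PMH error {error.get('code')}: {error.text}")
        yield from root.iterfind("oai:ListRecords/oai:record", OAI_NS)
        token = root.find("oai:ListRecords/oai:resumptionToken", OAI_NS)
        if token is None or not (token.text or "").strip():
            return  # missing or empty token marks the last page
        # A resumed request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Usage would look something like
`for record in harvest("https://example.org/oai", set_spec="books"): ...`.
Passing the fetcher in as a parameter is only there so the loop can be
exercised without a network connection.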

Best,
Arthur

On 2022-11-22 15:57, Mike D. wrote:
> Hey,
> I'm really glad to see the OAI-PMH harvester debate going on for Koha.
> I think if we choose a good external harvester with support, we can
> save a lot of energy and resources when implementing the related
> activities in the system.
> Harvesting the records is only part of the story. The easy part. Since
> the result of harvesting is a lot of records, most of the time we
> can't avoid post-processing and merging them with the records in the
> local database. For example, you may need to update records from a
> source that holds millions of records while the local database holds
> only hundreds of thousands, so only a slice of that huge amount is
> relevant. If we design the processing workflow wrong, it will take
> unnecessarily long and burn valuable resources.
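
To make that point concrete, here is a small sketch of the kind of
early filtering I understand this to mean: drop everything that has no
counterpart in the local database before any expensive processing
happens. It is Python for illustration only, not Koha code. The
function name, the tuple layout and the idea of matching on a shared
identifier are all assumptions, and it assumes both sides store
datestamps in the same ISO 8601 format so that they compare correctly
as strings.

```python
def relevant_updates(harvested, local_index):
    """Yield (identifier, payload) for records worth post-processing.

    harvested:   iterable of (identifier, datestamp, payload) tuples,
                 streamed from the remote source
    local_index: dict mapping identifier -> datestamp of the local copy
    """
    for identifier, datestamp, payload in harvested:
        local_stamp = local_index.get(identifier)
        if local_stamp is None:
            continue  # not in the local database, so not relevant here
        if datestamp > local_stamp:
            yield identifier, payload  # remote copy is newer than ours
```

Because the harvested side is consumed as a stream and the local side
is a single dictionary lookup, memory use depends on the size of the
local database rather than on the millions of remote records.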
> I would hereby like to invite us all to stay in touch, to debate and
> share our experiences. Let's get this area moving towards a successful
> finish.
> 
> Take care.
> 
> Michal
> 
> On Tue, 22 Nov 2022 at 15:13, BOUIS Sonia
> <sonia.bouis at univ-lyon3.fr> wrote:
> 
>> Hi,
>> Thanks to David, Tomas, Michal and Michael for your replies.
>> 
>> So we have decided to evaluate several external OAI-PMH clients that
>> could be used by Koha and to choose one at the end of January.
>> There is a lot to do after that. We discussed the background jobs,
>> and cronjobs seem to be appropriate. We thought that the settings in
>> the Koha intranet should only define URLs, sets, or XSLT sheets (for
>> example, to transform DC XML into MARCXML).
>> 
>> We are only at the beginning of the process 😊
>> 
>> Kind regards,
>> Sonia
>> 
>> ------------------------------
>> 
>> Message: 2
>> Date: Wed, 26 Oct 2022 10:37:49 +1100
>> From: "David Cook" <dcook at prosentient.com.au>
>> To: "'Tomas Cohen Arazi'" <tomascohen at gmail.com>, "'BOUIS Sonia'"
>>         <sonia.bouis at univ-lyon3.fr>
>> Cc: "'koha'" <koha at lists.katipo.co.nz>, "'koha-devel'"
>>         <koha-devel at lists.koha-community.org>
>> Subject: Re: [Koha-devel] [Koha] OAI-PMH harvester
>> Message-ID: <07af01d8e8ca$dfbddef0$9f399cd0$@prosentient.com.au>
>> Content-Type: text/plain; charset="utf-8"
>> 
>> Hi Sonia,
>> 
>> I’m excited to hear that KohaLA would like to finance an OAI-PMH
>> client in Koha! This functionality has been brewing in the back of my
>> mind ever since I first raised bug 10662 back in 2013.
>> 
>> As Tomas says, I think that the background jobs are a key component
>> for processing incoming OAI-PMH records.
>> 
>> However, the ***missing component right now is the scheduling of the
>> OAI-PMH harvesting tasks***, and I think this is where opinions get
>> divided. Below, I’ll provide some history and opinions on Koha
>> OAI-PMH.
>> 
>> --
>> 
>> With 10662, the sponsored goal was for Koha library staff to schedule
>> OAI-PMH harvests through the Web UI. However, Fridolin from BibLibre
>> raised a point with me at Kohacon18 about how letting library staff
>> control the timing of harvesting tasks could be a problem for support
>> vendors. If too many libraries using the same public IP address tried
>> to harvest from the same OAI-PMH repository, they could be rate
>> limited or blocked. There could also be server load concerns. So
>> there probably needs to be a balance between user configuration and
>> system configuration. If I recall correctly, this is how DSpace’s
>> OAI-PMH harvester works. Users set up targets and can start/stop
>> harvests, but things like frequency and concurrency are handled by
>> the system configuration.
>> 
>> Based on my experience working on OAI-PMH on and off for nearly 10
>> years and as a Koha support vendor, I think my preference would be
>> for sysadmins to handle most of the OAI-PMH harvesting details.
>> 
>> The sponsorship for 10662 had certain requirements that many other
>> libraries might not have, which is what made me think that it might
>> be better to have an external client that connects to Koha. I thought
>> maybe I could get the ordinary requirements pushed into Koha, and
>> then handle extraordinary requirements externally. However, an
>> external harvester won’t perform as fast as an internal harvester.
>> (The compromise would be to write the harvester in such a way that
>> people could provide different OAI-PMH harvester Perl modules that
>> all stage records using the same core Koha modules.)
>> 
>> Even then… the scheduling would depend on a library’s needs. Back in
>> 2013, I had a Koha OAI-PMH harvester which worked as a cronjob. It
>> would run each night. However, some libraries want to run OAI-PMH
>> harvests as frequently as every 3 seconds. A cronjob’s shortest
>> interval is 60 seconds, so it wouldn’t work for that requirement.
>> 
>> If a cronjob isn’t suitable, then I think you’d need a daemon created
>> by a new command like “koha-oai --start <instance_name>”. It could
>> read a configuration file and handle scheduling accordingly. With
>> 10662, I used the POE module, because I knew it well and it has some
>> timer tools for scheduling tasks. If I were to work on it again, I’d
>> probably use Mojo::IOLoop instead these days, since Mojolicious is
>> already part of Koha while POE is not. (That said, using modules like
>> Mojo and POE is tricky, because they’re difficult to test using
>> automation. That was one of the stumbling blocks with 10662. While
>> the 10662 harvester worked very well, it was difficult to unit test.
>> In hindsight, I should’ve written it in a way that was easier to unit
>> test, but it had a lot of event-driven code which made things more
>> difficult.)
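
On the testability problem: one common way around it is to keep the
scheduling loop free of any real timers by passing the clock and the
sleep function in as parameters, so a unit test can substitute fakes.
The sketch below is Python for illustration only, not the 10662 code
and not Mojo::IOLoop. The name run_schedule and all of its parameters
are hypothetical.

```python
import time


def run_schedule(task, interval, clock=time.monotonic, sleep=time.sleep,
                 max_runs=None):
    """Call task() every `interval` seconds and return the number of runs.

    clock and sleep are injectable so tests never wait on real time;
    max_runs=None means "run forever", as a daemon would.
    """
    runs = 0
    next_run = clock()
    while max_runs is None or runs < max_runs:
        delay = next_run - clock()
        if delay > 0:
            sleep(delay)
        task()
        runs += 1
        # Schedule from the planned start time to avoid gradual drift.
        next_run += interval
        now = clock()
        if next_run < now:
            # The task overran its slot: resume from now instead of
            # firing a burst of catch-up runs.
            next_run = now
    return runs
```

A sub-minute interval such as 3 seconds is then just `interval=3`, and
a test can drive the loop with a fake clock that advances only when the
fake sleep is called.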
>> 
>> Another option would be to create a generic daemon for task
>> scheduling in general (e.g. “koha-schedule”). Koha could use this for
>> many things, but it’s a project in itself.
>> 
>> --
>> 
>> The process of downloading OAI-PMH records and importing MARCXML into
>> Koha is actually fairly straightforward. The difficulty is the
>> scheduling and management of tasks (and unit testing).
>> 
>> I don’t know the answer that will make everyone happy. There are lots
>> of different ways of managing and scheduling the tasks. Based on my
>> experience, I’d suggest targeting the simplest approach first,
>> because complexity will make it less likely for the project to
>> succeed.
>> 
>> On that note, I’d be happy to test/QA any OAI-PMH harvester put
>> forward. When I was writing OAI-PMH harvester patches, I found it
>> really hard to get QA, so I’m happy to be that resource for someone
>> else. I’ve spent a lot of time thinking about this topic, so I’m
>> happy to provide advice, warnings, emotional support 😉.
>> 
>> David Cook
>> Senior Software Engineer
>> Prosentient Systems
>> Suite 7.03
>> 6a Glen St
>> Milsons Point NSW 2061
>> Australia
>> 
>> Office: 02 9212 0899
>> Online: 02 8005 0595
>> 
>> From: Koha-devel <koha-devel-bounces at lists.koha-community.org> On
>> Behalf Of Tomas Cohen Arazi
>> Sent: Wednesday, 26 October 2022 3:46 AM
>> To: BOUIS Sonia <sonia.bouis at univ-lyon3.fr>
>> Cc: koha <koha at lists.katipo.co.nz>; koha-devel
>> <koha-devel at lists.koha-community.org>
>> Subject: Re: [Koha-devel] [Koha] OAI-PMH harvester
>> 
>> I think with background jobs we have most of the framework that is
>> needed to deal with this within Koha.
>> 
>> Best regards
>> 
>> On Tue, 25 Oct 2022 at 7:08, BOUIS Sonia
>> <sonia.bouis at univ-lyon3.fr> wrote:
>> 
>> Hi,
>> KohaLA would like to finance an OAI-PMH client in Koha, but we have
>> questions that we want to raise with the community.
>> There have already been attempts to propose an OAI-PMH client:
>> - https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662 :
>> it's an old project that doesn't seem compatible with the current
>> version of Koha
>> - https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=25905 :
>> the scope is more to use an external OAI-PMH client and to connect it
>> to Koha
>> 
>> Our main question is about the way to handle this. Do you think that
>> it's a better idea to use external software or a Perl routine and to
>> find a way to connect it to Koha? Or would it be better to write a
>> new module in Koha from scratch, so that Koha has its own OAI-PMH
>> client?
>> 
>> Please let us hear your thoughts about this project.
>> 
>> Kind regards
>> 
>> Sonia
>> 
>> Sonia BOUIS
>> ------------------------------------------------------
>> Head of the Library IT Service (Service informatique documentaire)
>> Département d'Appui à la Recherche et aux Projets (DARP)
>> Bibliothèques universitaires, Université Jean Moulin Lyon 3
>> Street address: Manufacture des Tabacs | 6 cours Albert Thomas |
>> LYON 8e
>> Postal address: Bibliothèque de la Manufacture | 1C avenue des Frères
>> Lumière | CS 78242 - 69372 LYON CEDEX 08
>> 
>> Direct line: 33 (0)4 78 78 79 03
>> 
>> http://bu.univ-lyon3.fr | Follow us: Facebook
>> <https://www.facebook.com/bulyon3/> | Twitter
>> <https://twitter.com/bulyon3> | Instagram
>> <https://www.instagram.com/bu.lyon3/?hl=fr>
>> 
>> _______________________________________________
>> 
>> Koha mailing list  http://koha-community.org Koha at lists.katipo.co.nz
>> <mailto:Koha at lists.katipo.co.nz>
>> Unsubscribe: https://lists.katipo.co.nz/mailman/listinfo/koha
>> 
>> ------------------------------
>> 
>> Subject: Digest Footer
>> 
>> _______________________________________________
>> Koha-devel mailing list
>> Koha-devel at lists.koha-community.org
>> https://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel
>> website : https://www.koha-community.org/ git :
>> https://git.koha-community.org/ bugs : 
>> https://bugs.koha-community.org/
>> 
>> 
>> ------------------------------
>> 
>> End of Koha-devel Digest, Vol 203, Issue 15
>> *******************************************

-- 
Arthur Suzuki, 🌈🏔️
Développeur @BibLibre

