[Koha-devel] [Koha] OAI-PMH harvester

BOUIS Sonia sonia.bouis at univ-lyon3.fr
Mon Jan 23 17:47:03 CET 2023


Hi,
Just to let you know that during the KohaLa hackathon (until Wednesday), we are thinking about the OAI-PMH harvester. I have added our first thoughts to the Bugzilla ticket: https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662#c318

Kind regards,
Sonia

-----Original Message-----
From: Arthur Suzuki [mailto:arthur.suzuki at biblibre.com]
Sent: Wednesday, 23 November 2022 14:44
To: Mike D. <black23 at gmail.com>; BOUIS Sonia <sonia.bouis at univ-lyon3.fr>
Cc: koha at lists.katipo.co.nz; koha-devel at lists.koha-community.org
Subject: Re: [Koha] OAI-PMH harvester

Hello there,
If I may suggest a good harvester library, Catmandu may do the job pretty well.
I haven't used its OAI module, but I have used Catmandu to harvest from a JSON source and transform the result into a UNIMARC file, with pretty good success so far.
It can export seamlessly to ISO 2709 or MARCXML.
https://metacpan.org/dist/Catmandu-OAI
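
For readers new to the protocol, here is a minimal Python sketch of the two pieces any OAI-PMH harvester has to get right: building a ListRecords request and following the resumptionToken to the next page. It is not Catmandu's API. The function names, the example.org URL and the "books" set are hypothetical; only the verb and argument names come from the OAI-PMH 2.0 specification.

```python
from typing import Optional
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None,
                     resumption_token: Optional[str] = None) -> str:
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces
        # metadataPrefix and set on follow-up requests.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def next_token(response_xml: str) -> Optional[str]:
    """Return the resumptionToken of a response, or None on the last page."""
    root = ET.fromstring(response_xml)
    token = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()
```

A harvest would then fetch list_records_url(...), stage the records, and repeat with the returned token until next_token() gives None.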

Best,
Arthur

On 2022-11-22 15:57, Mike D. wrote:
> Hey,
> I'm really glad to see the OAI-PMH harvester debate going on for Koha. 
> I think if we choose a good external harvester with support, we can 
> save a lot of energy and resources for implementing the related 
> activities in the system.
> Harvesting the records is only part of the story. The easy part. Since 
> the result of harvesting is a lot of records, most of the time we can't 
> avoid post-processing and merging with the records in the local 
> database. For example, you may need to update records from a source 
> that holds millions of records while the local database holds only 
> hundreds of thousands. Only a slice of that huge amount is relevant. 
> If we design the processing workflow wrong, it will take unnecessarily 
> long and burn valuable resources.
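
One way to picture the "only a slice is relevant" point is a filter that runs before any expensive merging. The Python sketch below is an illustration under assumptions, not a Koha routine: it assumes harvested and local records can be matched on a shared identifier, and that datestamps are ISO 8601 UTC strings of the same granularity, so they compare correctly as plain strings. The name relevant_updates is hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def relevant_updates(harvested: Iterable[Tuple[str, str]],
                     local_datestamps: Dict[str, str]) -> Iterator[Tuple[str, str]]:
    """Yield only harvested (identifier, datestamp) pairs worth processing.

    A pair is kept when the identifier already exists locally and the
    remote datestamp is newer than the local one. Everything else is
    dropped before any merge work is done.
    """
    for identifier, datestamp in harvested:
        local = local_datestamps.get(identifier)
        if local is not None and datestamp > local:
            yield identifier, datestamp
```

Because it is a generator over a dictionary lookup, the remote side can be streamed record by record while only the local identifiers are held in memory.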
> I would like to invite us all to stay in touch, to debate and share 
> our experiences. Let's get this area moving towards a successful 
> finish.
> 
> Take care.
> 
> Michal
> 
> On Tue, 22 Nov 2022 at 15:13, BOUIS Sonia 
> <sonia.bouis at univ-lyon3.fr>
> wrote:
> 
>> Hi,
>> Thanks to David, Tomas, Michal and Michael for your replies.
>> 
>> So we have decided to evaluate several external OAI-PMH clients that 
>> could be used by Koha and to choose one at the end of January. There 
>> is a lot to do after that. We discussed the background jobs, and 
>> cronjobs seem to be appropriate. We thought that the settings in the 
>> Koha intranet should only define URLs, sets, or XSLT sheets 
>> (for example, to transform DC XML into MARCXML).
>> 
>> We are only at the beginning of the process 😊
>> 
>> Kind regards,
>> Sonia
>> 
>> ------------------------------
>> 
>> Message: 2
>> Date: Wed, 26 Oct 2022 10:37:49 +1100
>> From: "David Cook" <dcook at prosentient.com.au>
>> To: "'Tomas Cohen Arazi'" <tomascohen at gmail.com>, "'BOUIS Sonia'"
>>         <sonia.bouis at univ-lyon3.fr>
>> Cc: "'koha'" <koha at lists.katipo.co.nz>, "'koha-devel'"
>>         <koha-devel at lists.koha-community.org>
>> Subject: Re: [Koha-devel] [Koha] OAI-PMH harvester
>> Message-ID: <07af01d8e8ca$dfbddef0$9f399cd0$@prosentient.com.au>
>> Content-Type: text/plain; charset="utf-8"
>> 
>> Hi Sonia,
>> 
>> 
>> 
>> I’m excited to hear that KohaLA would like to finance an OAI-PMH 
>> client in Koha! This functionality is always brewing in the back of 
>> my mind, since I first raised 10662 back in 2013.
>> 
>> 
>> 
>> As Tomas says, I think that the background jobs are a key component 
>> for processing incoming OAI-PMH records.
>> 
>> 
>> 
>> However, the ***missing component right now is the scheduling of the 
>> OAI-PMH harvesting tasks***, and I think this is where opinions get 
>> divided. Below, I’ll provide some history and opinions on Koha 
>> OAI-PMH.
>> 
>> 
>> 
>> --
>> 
>> 
>> 
>> With 10662, the sponsored goal was for Koha library staff to schedule 
>> OAI-PMH harvests through the Web UI. However, Fridolin from BibLibre 
>> raised a point with me at Kohacon18 about how letting library staff 
>> control the timing of harvesting tasks could be a problem for support 
>> vendors. If too many libraries using the same public IP address tried 
>> to harvest from the same OAI-PMH repository, they could be rate 
>> limited or blocked. There could also be server load concerns. So 
>> there probably needs to be a balance between user configuration and 
>> system configuration. If I recall correctly, this is how DSpace’s 
>> OAI-PMH harvester works. Users set up targets and can start/stop 
>> harvests, but things like frequency and concurrency are handled by 
>> the system configuration.
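
The split described above, where users choose targets and the system decides how often a repository may be contacted, can be sketched in Python as a per-host throttle. This is an illustration only: HostThrottle and its parameters are hypothetical names, and the state lives in one process. Libraries sharing a public IP across several Koha instances would need shared state (a database row or a lock file, for example), which is not shown here.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a system-wide minimum interval between harvests per host."""

    def __init__(self, min_interval_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval = min_interval_seconds
        self.clock = clock  # injectable so the class can be unit tested
        self._last: Dict[str, float] = {}

    def try_acquire(self, repository_url: str) -> bool:
        """Return True if a harvest of this repository may start now."""
        host = urlsplit(repository_url).netloc
        now = self.clock()
        last = self._last.get(host)
        if last is not None and now - last < self.min_interval:
            return False  # too soon: another harvest hit this host recently
        self._last[host] = now
        return True
```

Keying on the host rather than the full URL means two targets on the same repository (different sets, say) count against the same limit.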
>> 
>> 
>> 
>> Based on my experience working on OAI-PMH on and off for nearly 10 
>> years and as a Koha support vendor, I think my preference would be 
>> for sysadmins to handle most of the OAI-PMH harvesting details.
>> 
>> 
>> 
>> The sponsorship for 10662 had certain requirements that many other 
>> libraries might not have, which is what made me think that it might 
>> be better to have an external client that connects to Koha. I thought 
>> maybe I could get the ordinary requirements pushed into Koha, and 
>> then handle extraordinary requirements externally. However, an 
>> external harvester won’t perform as fast as an internal harvester. 
>> (The compromise would be to write the harvester in such a way that 
>> people could provide different OAI-PMH harvester Perl modules that 
>> all stage records using the same core Koha modules.)
>> 
>> 
>> 
>> Even then… the scheduling would depend on a library’s needs. Back in 
>> 2013, I had a Koha OAI-PMH harvester which worked as a cronjob. It 
>> would run each night. However, some libraries want to run OAI-PMH 
>> harvests as frequently as every 3 seconds. Cron cannot schedule a job 
>> more often than once a minute, so it wouldn’t work for that requirement.
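
To make the sub-minute case concrete, here is a minimal Python sketch of the kind of fixed-interval loop a daemon would need in place of cron. It is not the 10662 code, which used POE. The name run_every and the injectable clock and sleep parameters are assumptions, added so the loop can be tested without waiting. A real daemon would loop until stopped instead of counting iterations.

```python
import time
from typing import Callable


def run_every(interval_seconds: float,
              task: Callable[[], None],
              iterations: int,
              clock: Callable[[], float] = time.monotonic,
              sleep: Callable[[float], None] = time.sleep) -> None:
    """Run task at a fixed interval, compensating for the task's own runtime."""
    next_run = clock()
    for _ in range(iterations):
        task()
        # Schedule relative to the previous target time, not to "now",
        # so a slow task does not make the schedule drift.
        next_run += interval_seconds
        delay = next_run - clock()
        if delay > 0:
            sleep(delay)
```

With interval_seconds=3 and a task that takes one second, the loop sleeps two seconds between runs, so the harvest still starts every three seconds.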
>> 
>> 
>> 
>> If a cronjob isn’t suitable, then I think you’d need a daemon created 
>> by a new command like “koha-oai --start <instance_name>”. It could 
>> read a configuration file and handle scheduling accordingly. With 
>> 10662, I used the POE module, because I knew it well and it has some 
>> timer tools for scheduling tasks. If I were to work on it again, I’d 
>> probably use Mojo::IOLoop instead these days, since Mojolicious is 
>> already part of Koha while POE is not. (That said, using modules like 
>> Mojo and POE is hard, because they’re difficult to test using 
>> automation. That was one of the stumbling blocks with 10662. While 
>> the 10662 harvester worked very well, it was difficult to unit test. 
>> In hindsight, I should’ve written it in a way that was easier to unit 
>> test, but it had a lot of event-driven code which made things more 
>> difficult.)
>> 
>> 
>> 
>> Another option would be to create a generic daemon for task 
>> scheduling in general (e.g. “koha-schedule”). Koha could use this for 
>> many things, but it’s a project in itself.
>> 
>> 
>> 
>> --
>> 
>> 
>> 
>> The process of downloading OAI-PMH records and importing MARCXML into 
>> Koha is actually a fairly straightforward process. The difficulty is 
>> the task scheduling and management of tasks (and unit testing).
>> 
>> 
>> 
>> I don’t know the answer that will make everyone happy. There are lots 
>> of different ways of managing and scheduling the tasks. Based on my 
>> experience, I’d suggest targeting the simplest approach first, 
>> because complexity will make it less likely for the project to succeed.
>> 
>> 
>> 
>> On that note, I’d be happy to test/QA any OAI-PMH harvester put 
>> forward.
>> When I was writing OAI-PMH harvester patches, I found it really hard 
>> to get QA, so I’m happy to be that resource for someone else. I’ve 
>> spent a lot of time thinking about this topic, so happy to provide 
>> advice, warnings, emotional support 😉.
>> 
>> 
>> 
>> David Cook
>> 
>> Senior Software Engineer
>> 
>> Prosentient Systems
>> 
>> Suite 7.03
>> 
>> 6a Glen St
>> 
>> Milsons Point NSW 2061
>> 
>> Australia
>> 
>> 
>> 
>> Office: 02 9212 0899
>> 
>> Online: 02 8005 0595
>> 
>> 
>> 
>> From: Koha-devel <koha-devel-bounces at lists.koha-community.org> On 
>> Behalf Of Tomas Cohen Arazi
>> Sent: Wednesday, 26 October 2022 3:46 AM
>> To: BOUIS Sonia <sonia.bouis at univ-lyon3.fr>
>> Cc: koha <koha at lists.katipo.co.nz>; koha-devel 
>> <koha-devel at lists.koha-community.org>
>> Subject: Re: [Koha-devel] [Koha] OAI-PMH harvester
>> 
>> 
>> 
>> I think with background jobs we have most of the framework that is 
>> needed to deal with this within Koha.
>> 
>> 
>> 
>> Best regards
>> 
>> 
>> 
>> On Tue, 25 Oct 2022 at 7:08, BOUIS Sonia 
>> <sonia.bouis at univ-lyon3.fr> wrote:
>> 
>> Hi,
>> KohaLA would like to finance an OAI-PMH client in Koha, but we have 
>> questions that we want to raise with the community.
>> There have already been attempts to propose an OAI-PMH client:
>> - https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662 : 
>> it's an old project that doesn't seem compatible with the current 
>> version of Koha
>> - https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=25905 : 
>> the scope is more to use an external OAI-PMH client and to connect it 
>> to Koha
>> 
>> Our main question is about the way to handle this. Do you think it's 
>> a better idea to use external software or a Perl routine and to find 
>> a way to connect it to Koha? Or would it be better to write a new 
>> module in Koha from scratch, so that Koha has its own OAI-PMH client?
>> 
>> Please let us hear your thoughts about this project.
>> 
>> Kind regards
>> 
>> Sonia
>> 
>> Sonia BOUIS
>> ------------------------------------------------------
>> Head of the Library IT Service, Research and Project Support 
>> Department (DARP), University Libraries, Université Jean Moulin 
>> Lyon 3. STREET ADDRESS > Manufacture des Tabacs | 6 cours Albert 
>> Thomas | LYON 8e. POSTAL ADDRESS > Bibliothèque de la Manufacture | 
>> 1C avenue des Frères Lumière | CS 78242 - 69372 LYON CEDEX 08
>> 
>> Direct line: 33 (0)4 78 78 79 03
>> 
>> http://bu.univ-lyon3.fr | Follow us > 
>> Facebook <https://www.facebook.com/bulyon3/> | 
>> Twitter <https://twitter.com/bulyon3> | 
>> Instagram <https://www.instagram.com/bu.lyon3/?hl=fr>
>> 
>> _______________________________________________
>> 
>> Koha mailing list  http://koha-community.org Koha at lists.katipo.co.nz 
>> <mailto:Koha at lists.katipo.co.nz>
>> Unsubscribe: https://lists.katipo.co.nz/mailman/listinfo/koha
>> 
>> 
>> ------------------------------
>> 
>> Subject: Digest Footer
>> 
>> _______________________________________________
>> Koha-devel mailing list
>> Koha-devel at lists.koha-community.org
>> https://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel
>> website : https://www.koha-community.org/ git :
>> https://git.koha-community.org/ bugs : 
>> https://bugs.koha-community.org/
>> 
>> 
>> ------------------------------
>> 
>> End of Koha-devel Digest, Vol 203, Issue 15
>> *******************************************

--
Arthur Suzuki, 🌈🏔️
Developer @BibLibre

