[Koha-devel] [Koha] OAI-PMH harvester
Arthur Suzuki
arthur.suzuki at biblibre.com
Wed Nov 23 14:44:21 CET 2022
Hello there,
If I may suggest a good harvester library, Catmandu may do the job
pretty well.
I haven't used its OAI module, but I have used Catmandu to harvest from
a JSON source and transform the data into a UNIMARC file, with pretty
good success so far.
It can export seamlessly to ISO 2709 or MARCXML.
https://metacpan.org/dist/Catmandu-OAI
Best,
Arthur
On 2022-11-22 15:57, Mike D. wrote:
> Hey,
> I'm really glad to see the OAI-PMH harvester debate going on for Koha.
> I think if we choose a good, well-supported external harvester, we can
> save a lot of the energy and resources needed to implement the related
> functionality in the system.
> Harvesting the records is only part of the story. The easy part. Since
> the result of harvesting is a lot of records, most of the time we can't
> avoid post-processing: merging with the records in the local database.
> For example, you may need to update records from a source that holds
> millions of records while the local database holds hundreds of
> thousands. Only a slice of that huge amount is relevant. If we design
> the processing workflow wrong, it will take unnecessarily long and burn
> valuable resources.
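
One cheap way to keep only the relevant slice is to filter the harvested stream against the identifiers already held locally before any expensive merge step. This is a minimal Python sketch, not taken from Koha or from any harvester: the `id` key, the record shape and the function name are all assumptions made for illustration.

```python
from typing import Iterable, Iterator


def relevant_records(harvested: Iterable[dict], local_ids: set[str]) -> Iterator[dict]:
    """Yield only harvested records whose identifier is held locally.

    The harvested side is consumed lazily, so millions of source records
    never sit in memory at once; only the smaller set of local
    identifiers does.
    """
    for record in harvested:
        if record["id"] in local_ids:  # "id" is a hypothetical key
            yield record


# Hypothetical data: three harvested records, two of which exist locally.
harvested = ({"id": f"oai:example.org:{n}", "title": f"Record {n}"} for n in range(1, 4))
local_ids = {"oai:example.org:1", "oai:example.org:3"}
matches = list(relevant_records(harvested, local_ids))
```

The set lookup is constant time per record, so the cost grows with the size of the source rather than with source size times local size.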
> I would like to invite us all to stay in touch, to debate and share our
> experiences. Let's get this area moving towards a successful finish.
>
> Take care.
>
> Michal
>
> On Tue, 22 Nov 2022 at 15:13, BOUIS Sonia
> <sonia.bouis at univ-lyon3.fr> wrote:
>
>> Hi,
>> Thanks to David, Tomas, Michal and Michael for your replies.
>>
>> So we have decided to evaluate several external OAI-PMH clients that
>> could be used by Koha, and to choose one at the end of January.
>> There is a lot to do after that. We discussed the background jobs, and
>> cronjobs seem to be appropriate. We thought that the settings in the
>> Koha intranet should only define URLs, sets, or XSLT sheets (for
>> example, to transform DC XML into MARCXML).
>>
>> We are only at the beginning of the process 😊
>>
>> Kind regards,
>> Sonia
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Wed, 26 Oct 2022 10:37:49 +1100
>> From: "David Cook" <dcook at prosentient.com.au>
>> To: "'Tomas Cohen Arazi'" <tomascohen at gmail.com>, "'BOUIS Sonia'"
>> <sonia.bouis at univ-lyon3.fr>
>> Cc: "'koha'" <koha at lists.katipo.co.nz>, "'koha-devel'"
>> <koha-devel at lists.koha-community.org>
>> Subject: Re: [Koha-devel] [Koha] OAI-PMH harvester
>> Message-ID: <07af01d8e8ca$dfbddef0$9f399cd0$@prosentient.com.au>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hi Sonia,
>>
>>
>>
>> I’m excited to hear that KohaLA would like to finance an OAI-PMH
>> client in Koha! This functionality has been brewing in the back of my
>> mind ever since I first raised 10662 back in 2013.
>>
>>
>>
>> As Tomas says, I think that the background jobs are a key component
>> for processing incoming OAI-PMH records.
>>
>>
>>
>> However, the ***missing component right now is the scheduling of the
>> OAI-PMH harvesting tasks***, and I think this is where opinions get
>> divided. Below, I’ll provide some history and opinions on Koha OAI-PMH.
>>
>>
>>
>> --
>>
>>
>>
>> With 10662, the sponsored goal was for Koha library staff to schedule
>> OAI-PMH harvests through the Web UI. However, Fridolin from BibLibre
>> raised a point with me at Kohacon18 about how letting library staff
>> control the timing of harvesting tasks could be a problem for support
>> vendors. If too many libraries using the same public IP address tried
>> to harvest from the same OAI-PMH repository, they could be rate
>> limited or blocked. There could also be server load concerns. So there
>> probably needs to be a balance between user configuration and system
>> configuration. If I recall correctly, this is how DSpace’s OAI-PMH
>> harvester works. Users set up targets and can start/stop harvests, but
>> things like frequency and concurrency are handled by the system
>> configuration.
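
To make the split between user and system configuration concrete, here is a minimal Python sketch of a per-host throttle. It is not how DSpace or 10662 does it; the class name, the `min_interval` setting and the method name are assumptions for illustration. The idea is that users still choose targets, while a sysadmin-owned setting decides how often any one repository host may be contacted.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between harvests of the same repository host."""

    def __init__(self, min_interval: float, clock=time.monotonic):
        self.min_interval = min_interval  # seconds; a system-level setting
        self._clock = clock               # injectable so tests need no real waiting
        self._last_request = {}           # host -> time of the last allowed request

    def seconds_until_allowed(self, url: str) -> float:
        """Return 0.0 and record the request if a harvest may start now;
        otherwise return the remaining wait in seconds."""
        host = urlsplit(url).hostname
        now = self._clock()
        last = self._last_request.get(host)
        if last is None or now - last >= self.min_interval:
            self._last_request[host] = now
            return 0.0
        return self.min_interval - (now - last)
```

Because the throttle is keyed on the host rather than on the full URL, two libraries harvesting different sets from the same repository share one limit, which is the situation described above.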
>>
>>
>>
>> Based on my experience working on OAI-PMH on and off for nearly 10
>> years and as a Koha support vendor, I think my preference would be for
>> sysadmins to handle most of the OAI-PMH harvesting details.
>>
>>
>>
>> The sponsorship for 10662 had certain requirements that many other
>> libraries might not have, which is what made me think that it might be
>> better to have an external client that connects to Koha. I thought
>> maybe I could get the ordinary requirements pushed into Koha, and then
>> handle extraordinary requirements externally. However, an external
>> harvester won’t perform as fast as an internal harvester. (The
>> compromise would be to write the harvester in such a way that people
>> could provide different OAI-PMH harvester Perl modules that all stage
>> records using the same core Koha modules.)
>>
>>
>>
>> Even then… the scheduling would depend on a library’s needs. Back in
>> 2013, I had a Koha OAI-PMH harvester which worked as a cronjob. It
>> would run each night. However, some libraries want to run OAI-PMH
>> harvests as frequently as every 3 seconds. A cronjob’s shortest
>> interval is 60 seconds, so that wouldn’t work for that requirement.
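
For sub-minute intervals, the core of such a scheduler is just a loop that sleeps until the next due time. The Python sketch below is only an illustration of that idea, not the 10662 code or any existing Koha command; the function name and parameters are assumptions. The clock and sleep functions are passed in so the loop can be unit tested without real waiting.

```python
import time


def run_every(interval: float, task, iterations: int,
              clock=time.monotonic, sleep=time.sleep) -> None:
    """Call `task` every `interval` seconds, `iterations` times.

    Due times are computed from the first start rather than from the end
    of each run, so a slow task does not make later runs drift.
    """
    next_run = clock()
    for _ in range(iterations):
        delay = next_run - clock()
        if delay > 0:
            sleep(delay)
        task()
        next_run += interval


if __name__ == "__main__":
    # Hypothetical usage: run a stand-in harvest task every 3 seconds, 3 times.
    run_every(3.0, lambda: print("harvest tick"), iterations=3)
```

A real daemon would loop until told to stop instead of counting iterations; the bounded loop just keeps the example finite.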
>>
>>
>>
>> If a cronjob isn’t suitable, then I think you’d need a daemon created
>> by a new command like “koha-oai --start <instance_name>”. It could
>> read a configuration file and handle scheduling accordingly. With
>> 10662, I used the POE module, because I knew it well and it has some
>> timer tools for scheduling tasks. If I were to work on it again, I’d
>> probably use Mojo::IOLoop instead these days, since Mojolicious is
>> already part of Koha while POE is not. (That said, using modules like
>> Mojo and POE is difficult, because they’re hard to test using
>> automation. That was one of the stumbling blocks with 10662. While the
>> 10662 harvester worked very well, it was difficult to unit test. In
>> hindsight, I should’ve written it in a way that was easier to unit
>> test, but it had a lot of event-driven code which made things more
>> difficult.)
>>
>>
>>
>> Another option would be to create a generic daemon for task scheduling
>> in general (e.g. “koha-schedule”). Koha could use this for many
>> things, but it’s a project in itself.
>>
>>
>>
>> --
>>
>>
>>
>> Downloading OAI-PMH records and importing MARCXML into Koha is
>> actually a fairly straightforward process. The difficulty is the
>> scheduling and management of tasks (and unit testing).
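
To show how small the download side can be, here is a minimal Python sketch of a ListRecords loop that follows OAI-PMH resumption tokens, using only the standard library. It is not Koha code and not Catmandu; the function names and the `fetch` parameter are assumptions for illustration, and error handling, `from`/`until`/`set` arguments and retries are deliberately left out.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def http_get(url: str) -> bytes:
    """Default fetcher; a fake one can be passed in for tests."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def list_records(base_url: str, metadata_prefix: str, fetch=http_get):
    """Yield every <record> element, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urllib.parse.urlencode(params)))
        yield from root.iter(OAI_NS + "record")
        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no token, or an empty one, marks the last page
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Passing `fetch` in as a parameter keeps the network out of unit tests, which speaks to the testing difficulty mentioned above.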
>>
>>
>>
>> I don’t know the answer that will make everyone happy. There are lots
>> of different ways of managing and scheduling the tasks. Based on my
>> experience, I’d suggest targeting the simplest approach first, because
>> complexity will make it less likely for the project to succeed.
>>
>>
>>
>> On that note, I’d be happy to test/QA any OAI-PMH harvester put
>> forward. When I was writing OAI-PMH harvester patches, I found it
>> really hard to get QA, so I’m happy to be that resource for someone
>> else. I’ve spent a lot of time thinking about this topic, so I’m happy
>> to provide advice, warnings, emotional support 😉.
>>
>>
>>
>> David Cook
>>
>> Senior Software Engineer
>>
>> Prosentient Systems
>>
>> Suite 7.03
>>
>> 6a Glen St
>>
>> Milsons Point NSW 2061
>>
>> Australia
>>
>>
>>
>> Office: 02 9212 0899
>>
>> Online: 02 8005 0595
>>
>>
>>
>> From: Koha-devel <koha-devel-bounces at lists.koha-community.org> On
>> Behalf
>> Of Tomas Cohen Arazi
>> Sent: Wednesday, 26 October 2022 3:46 AM
>> To: BOUIS Sonia <sonia.bouis at univ-lyon3.fr>
>> Cc: koha <koha at lists.katipo.co.nz>; koha-devel <
>> koha-devel at lists.koha-community.org>
>> Subject: Re: [Koha-devel] [Koha] OAI-PMH harvester
>>
>>
>>
>> I think with background jobs we have most of the framework that is
>> needed to deal with this within Koha.
>>
>>
>>
>> Best regards
>>
>>
>>
>> On Tue, 25 Oct 2022 at 7:08, BOUIS Sonia <sonia.bouis at univ-lyon3.fr>
>> wrote:
>>
>> Hi,
>> KohaLA would like to finance an OAI-PMH client in Koha, but we have
>> questions that we want to raise with the community.
>> There have already been attempts to propose an OAI-PMH client:
>> - https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662 :
>> it's an old project that doesn't seem compatible with the current
>> version of Koha
>> - https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=25905 :
>> the scope is more to use an external OAI-PMH client and to connect it
>> to Koha
>>
>> Our main question is about how to handle this. Do you think that it's
>> a better idea to use external software or a Perl routine and to find a
>> way to connect it to Koha? Or would it be better to write a new module
>> in Koha from scratch, so that Koha has its own OAI-PMH client?
>>
>> Please let us hear your thoughts about this project.
>>
>> Kind regards
>>
>> Sonia
>>
>> Sonia BOUIS
>> ------------------------------------------------------
>> Head of the Library IT Service
>> Département d'Appui à la Recherche et aux Projets (DARP)
>> Bibliothèques universitaires, Université Jean Moulin Lyon 3
>> Street address: Manufacture des Tabacs | 6 cours Albert Thomas |
>> LYON 8e
>> Postal address: Bibliothèque de la Manufacture | 1C avenue des Frères
>> Lumière | CS 78242 - 69372 LYON CEDEX 08
>>
>> Direct line: 33 (0)4 78 78 79 03
>>
>> http://bu.univ-lyon3.fr | Follow us:
>> Facebook <https://www.facebook.com/bulyon3/> |
>> Twitter <https://twitter.com/bulyon3> |
>> Instagram <https://www.instagram.com/bu.lyon3/?hl=fr>
>>
>> _______________________________________________
>>
>> Koha mailing list http://koha-community.org
>> Koha at lists.katipo.co.nz
>> Unsubscribe: https://lists.katipo.co.nz/mailman/listinfo/koha
>>
>> ------------------------------
>>
>> Subject: Digest Footer
>>
>> _______________________________________________
>> Koha-devel mailing list
>> Koha-devel at lists.koha-community.org
>> https://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel
>> website : https://www.koha-community.org/ git :
>> https://git.koha-community.org/ bugs :
>> https://bugs.koha-community.org/
>>
>>
>> ------------------------------
>>
>> End of Koha-devel Digest, Vol 203, Issue 15
>> *******************************************
--
Arthur Suzuki, 🌈🏔️
Developer @BibLibre