[Koha-devel] zebraidx with 1 large MARC file rather than many small MARC files
Fridolin SOMERS
fridolin.somers at biblibre.com
Tue Feb 26 09:36:28 CET 2019
Hi,
The problem with reindexing the full catalogue is the huge XML files that
get generated in /tmp/. Some servers don't have enough space, so at
BibLibre we use a shell script to reindex step by step:
https://git.biblibre.com/biblibre/tools/src/branch/master/zebra/rebuild_full.sh
And if it crashes, you can restart from the last good step ;)
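The linked script is the real implementation; the core idea can be sketched as below. The record count, batch size, and the `--offset`/`--length` batching flags of rebuild_zebra.pl are illustrative assumptions here, and `echo` is left in as a dry-run guard:

```shell
#!/bin/sh
# Sketch of step-by-step reindexing: export and index in small batches
# so the temporary XML in /tmp stays small and a crash only loses one step.
# (Hypothetical figures; see the BibLibre script above for the real thing.)
TOTAL=77000                  # total number of biblio records (example)
STEP=10000                   # records per batch
CMD="echo rebuild_zebra.pl"  # drop the echo to actually run it

offset=0
while [ "$offset" -lt "$TOTAL" ]; do
    # Assumes rebuild_zebra.pl supports --offset/--length style batching.
    $CMD -b -v --offset "$offset" --length "$STEP" || exit 1
    offset=$((offset + STEP))
done
```

Because each step is independent, restarting after a crash is just a matter of resuming the loop from the last offset that succeeded.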
Best regards,
On 04/02/2019 at 07:24, David Cook wrote:
> To answer my own question.
>
> I have a zebraidx running on 461MB (or 77000 records) and it's only using 2%
> of memory on a 4GB system, so I'm thinking it is using a stream reader and
> updating the shadow files on disk as it goes through the massive MARC file.
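That kind of spot-check can be done with standard procps tools; a small sketch (nothing Zebra-specific here, just `pgrep` and `ps`):

```shell
# Spot-check the resident memory of a running zebraidx process.
# ps reports RSS and VSZ in kilobytes.
pid=$(pgrep -n zebraidx || true)   # newest matching process, if any
if [ -n "$pid" ]; then
    ps -o rss=,vsz= -p "$pid"
else
    echo "no zebraidx process running"
fi
```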
>
> In that case, while it might be slow to export records to that file,
> zebraidx probably does read 1 large file much faster than many small files.
>
> David Cook
> Systems Librarian
> Prosentient Systems
> 72/330 Wattle St
> Ultimo, NSW 2007
> Australia
> Office: 02 9212 0899
> Direct: 02 8005 0595
>
> From: koha-devel-bounces at lists.koha-community.org
> [mailto:koha-devel-bounces at lists.koha-community.org] On Behalf Of David Cook
> Sent: Monday, 4 February 2019 5:10 PM
> To: 'Koha Devel' <koha-devel at lists.koha-community.org>
> Cc: tomascohen at theke.io
> Subject: [Koha-devel] zebraidx with 1 large MARC file rather than many small
> MARC files
>
> Hi all,
>
> I haven't looked into it too deeply, but I was curious if Zebra would have
> better performance indexing with 1 large MARC file versus many small MARC
> files.
>
> At the moment, we generate 1 huge MARC file and then pass that to zebraidx
> as an argument.
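For context, the single-file approach boils down to something like the following. The paths and config file name are illustrative, not Koha's actual ones (Koha drives this via misc/migration_tools/rebuild_zebra.pl), and `echo` is left in as a dry-run guard:

```shell
# Illustrative only: export everything to one big file, then index it.
EXPORT=/tmp/biblio-export.xml              # hypothetical export path
ZCFG=/etc/koha/zebradb/zebra-biblios.cfg   # hypothetical config path
ZEBRAIDX="echo zebraidx"                   # drop the echo to run for real

# (export step elided) ... produces $EXPORT containing all records

$ZEBRAIDX -c "$ZCFG" update "$EXPORT"      # index the one large file
$ZEBRAIDX -c "$ZCFG" commit                # make shadow updates live
```

The `commit` step is what flips the shadow registers live, which is why an interrupted `update` leaves the search index untouched.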
>
> Is that something we've always done or was it done as a performance
> enhancement?
>
> I haven't looked at the Zebra internals to see whether it reads the entire
> file into memory and then processes it, or whether it parses the XML with a
> stream reader. Zebraidx can also take a list of files from stdin, but if
> you had tonnes of small files that could be troublesome.
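Whether or not the stdin route works for a given zebraidx version (check the documentation linked below), `xargs` gives the same effect by turning a file list into arguments. A sketch, with hypothetical paths and a dry-run `echo`:

```shell
# Sketch: index many small MARCXML files by feeding a file list to xargs,
# which batches them into zebraidx invocations. Paths are hypothetical.
ZEBRAIDX="echo zebraidx"                   # drop the echo to run for real

find /tmp/marc-exports -name '*.xml' -print | \
    xargs $ZEBRAIDX -c /etc/koha/zebradb/zebra-biblios.cfg update
```

One caveat: `xargs` may split a very long file list across several `zebraidx` runs, so a final `commit` would still be needed once at the end if shadow registers are enabled.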
>
> I suppose it doesn't matter too much as we march on to Elasticsearch, but I
> figure lots of people are still using Zebra and probably will be for a long
> time, so it's perhaps worth thinking about.
>
> https://software.indexdata.com/zebra/doc/zebraidx.html
--
Fridolin SOMERS <fridolin.somers at biblibre.com>
BibLibre, France - software and system maintainer