[Koha-devel] zebraidx with 1 large MARC file rather than many small MARC files

Fridolin SOMERS fridolin.somers at biblibre.com
Tue Feb 26 09:36:28 CET 2019


Hi,

The problem with reindexing the full catalogue is the huge XML files that 
get generated in /tmp/.
Some servers don't have enough space.
So at BibLibre we use a shell script to reindex step by step:
https://git.biblibre.com/biblibre/tools/src/branch/master/zebra/rebuild_full.sh
And if it crashes, you can restart from the last good step ;)
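
The step-by-step idea can be sketched roughly like this (a minimal, hypothetical sketch, not the actual rebuild_full.sh: STEP_SIZE, TOTAL and STATE_FILE are illustrative placeholders, and the real zebraidx call is only shown as a comment):

```shell
#!/bin/sh
# Hypothetical sketch of a resumable, step-by-step rebuild: index records
# in fixed-size slices and record the last completed offset, so a crash
# can resume from the last good step instead of starting over.

STEP_SIZE=10000
TOTAL=77000                               # assumed record count, for illustration
STATE_FILE="${STATE_FILE:-/tmp/rebuild_state}"

offset=0
[ -f "$STATE_FILE" ] && offset=$(cat "$STATE_FILE")

while [ "$offset" -lt "$TOTAL" ]; do
    echo "indexing records $offset to $((offset + STEP_SIZE - 1))"
    # The real work would go here: export just this slice to a small file
    # in /tmp, then e.g.
    #   zebraidx -c "$CONFIG" update "$SLICE_FILE" && zebraidx -c "$CONFIG" commit
    offset=$((offset + STEP_SIZE))
    echo "$offset" > "$STATE_FILE"        # checkpoint the last good step
done
```

Because each slice only produces a small export file and the state file is updated after every step, /tmp never has to hold the whole catalogue at once.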

Best regards,

Le 04/02/2019 à 07:24, David Cook a écrit :
> To answer my own question.
> 
>   
> 
> I have a zebraidx running on 461 MB (about 77,000 records) and it's only using 2%
> of memory on a 4GB system, so I'm thinking it uses a stream reader and
> updates the shadow files on disk as it goes through the massive MARC file.
> 
>   
> 
> In that case, while it might be slow to export records to that file,
> zebraidx probably does read 1 large file much faster than many small files.
> 
>   
> 
> David Cook
> 
> Systems Librarian
> 
> Prosentient Systems
> 
> 72/330 Wattle St
> 
> Ultimo, NSW 2007
> 
> Australia
> 
>   
> 
> Office: 02 9212 0899
> 
> Direct: 02 8005 0595
> 
>   
> 
> From: koha-devel-bounces at lists.koha-community.org
> [mailto:koha-devel-bounces at lists.koha-community.org] On Behalf Of David Cook
> Sent: Monday, 4 February 2019 5:10 PM
> To: 'Koha Devel' <koha-devel at lists.koha-community.org>
> Cc: tomascohen at theke.io
> Subject: [Koha-devel] zebraidx with 1 large MARC file rather than many small
> MARC files
> 
>   
> 
> Hi all,
> 
>   
> 
> I haven't looked into it too deeply, but I was curious if Zebra would have
> better performance indexing with 1 large MARC file versus many small MARC
> files.
> 
>   
> 
> At the moment, we generate 1 huge MARC file and then pass that to zebraidx
> as an argument.
> 
>   
> 
> Is that something we've always done or was it done as a performance
> enhancement?
> 
>   
> 
> I haven't looked at the Zebra internals to see whether it reads the entire
> file into memory and then processes it or if it parses the XML using a
> stream reader. Zebraidx can also take a list of files from stdin*, but if
> you had tonnes of small files that could be troublesome.
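> 
> (One common way to handle tonnes of small files without building a single
> enormous argument list is to batch the names with find/xargs. A sketch with
> placeholder paths; "echo zebraidx update" stands in for a real
> "zebraidx -c $CONFIG update" call so it runs anywhere:
> 
> ```shell
> #!/bin/sh
> # Sketch: feed many small export files to an indexer in bounded batches.
> # EXPORT_DIR and the batch size of 2 are illustrative placeholders.
> EXPORT_DIR=$(mktemp -d)
> for i in 1 2 3 4 5; do touch "$EXPORT_DIR/rec$i.xml"; done
> 
> # -print0/-0 keeps unusual filenames safe; -n caps files per invocation.
> batch_count=$(find "$EXPORT_DIR" -name '*.xml' -print0 \
>     | xargs -0 -n 2 echo zebraidx update \
>     | wc -l | tr -d ' ')
> echo "would run $batch_count zebraidx invocations"
> rm -r "$EXPORT_DIR"
> ```
> 
> With 5 files and a batch size of 2, that is 3 invocations; with a realistic
> batch size of a few hundred, even a very large export stays manageable.)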
> 
>   
> 
> I suppose it doesn't matter too much as we march on to Elasticsearch, but I
> figure lots of people are still using Zebra and probably will be for a long
> time, so it's perhaps worth thinking about.
> 
>   
> 
> https://software.indexdata.com/zebra/doc/zebraidx.html
> 
>   
> 
> David Cook
> 
> Systems Librarian
> 
> Prosentient Systems
> 
> 72/330 Wattle St
> 
> Ultimo, NSW 2007
> 
> Australia
> 
>   
> 
> Office: 02 9212 0899
> 
> Direct: 02 8005 0595
> 
>   
> 
> 
> 
> _______________________________________________
> Koha-devel mailing list
> Koha-devel at lists.koha-community.org
> http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel
> website : http://www.koha-community.org/
> git : http://git.koha-community.org/
> bugs : http://bugs.koha-community.org/
> 

-- 
Fridolin SOMERS <fridolin.somers at biblibre.com>
BibLibre, France - software and system maintainer
