[Koha-devel] zebraidx with 1 large MARC file rather than many small MARC files

Wed Feb 27 06:29:49 CET 2019

That's interesting. I think some of my larger Koha instances have run into
that space problem.

David Cook
Systems Librarian
Prosentient Systems
72/330 Wattle St
Ultimo, NSW 2007
Australia

Office: 02 9212 0899
Direct: 02 8005 0595

-----Original Message-----
From: koha-devel-bounces at lists.koha-community.org
[mailto:koha-devel-bounces at lists.koha-community.org] On Behalf Of Fridolin SOMERS
Sent: Tuesday, 26 February 2019 7:36 PM
To: koha-devel at lists.koha-community.org
Subject: Re: [Koha-devel] zebraidx with 1 large MARC file rather than many small MARC files

Hi,

The problem with reindexing the full catalogue is the huge XML files that will
be generated in /tmp/.
Some servers don't have enough space.
So at BibLibre we use a shell script to reindex step by step:
https://git.biblibre.com/biblibre/tools/src/branch/master/zebra/rebuild_full.sh
And if it crashes, you can restart from the last good step ;)
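
To illustrate the idea only (this is not the BibLibre script, which I have not
reproduced here), below is a minimal Python sketch of step-by-step indexing
with a restart point. The names plan_steps, reindex_in_steps, index_chunk and
the checkpoint file are all hypothetical; index_chunk stands in for whatever
exports and indexes one slice of records on your system.

```python
import json
from pathlib import Path
from typing import Callable, Iterator, Tuple


def plan_steps(total: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that together cover `total` records."""
    for offset in range(0, total, step):
        yield offset, min(step, total - offset)


def reindex_in_steps(
    total: int,
    step: int,
    index_chunk: Callable[[int, int], None],  # hypothetical: indexes one slice
    checkpoint: Path,                         # hypothetical: restart marker file
) -> None:
    """Index slice by slice, saving the next offset after each success."""
    start = 0
    if checkpoint.exists():
        start = json.loads(checkpoint.read_text())["next_offset"]

    for offset, length in plan_steps(total, step):
        if offset < start:
            continue  # this slice was completed in an earlier run
        # If index_chunk raises, the checkpoint still points at this slice,
        # so the next run resumes here instead of starting over.
        index_chunk(offset, length)
        checkpoint.write_text(json.dumps({"next_offset": offset + length}))
```

Because each slice is exported and indexed on its own, the temporary files only
ever need room for one slice rather than the whole catalogue, and a crash costs
at most one slice of work.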

Best regards,

On 04/02/2019 at 07:24, David Cook wrote:
> To answer my own question.
> 
> I have a zebraidx running on 461MB (or 77000 records) and it's only
> using 2% of memory on a 4GB system, so I'm thinking it is using a
> stream reader and updating the shadow files on disk as it goes through
> the massive MARC file.
> 
> In that case, while it might be slow to export records to that file,
> zebraidx probably does read 1 large file much faster than many small
> files.
> 
> From: koha-devel-bounces at lists.koha-community.org
> [mailto:koha-devel-bounces at lists.koha-community.org] On Behalf Of 
> David Cook
> Sent: Monday, 4 February 2019 5:10 PM
> To: 'Koha Devel' <koha-devel at lists.koha-community.org>
> Cc: tomascohen at theke.io
> Subject: [Koha-devel] zebraidx with 1 large MARC file rather than many 
> small MARC files
> 
> Hi all,
> 
> I haven't looked into it too deeply, but I was curious if Zebra would
> have better performance indexing with 1 large MARC file versus many
> small MARC files.
> 
> At the moment, we generate 1 huge MARC file and then pass that to
> zebraidx as an argument.
> 
> Is that something we've always done or was it done as a performance
> enhancement?
> 
> I haven't looked at the Zebra internals to see whether it reads the
> entire file into memory and then processes it or if it parses the XML
> using a stream reader. Zebraidx can also take a list of files from
> stdin*, but if you had tonnes of small files that could be troublesome.
> 
> I suppose it doesn't matter too much as we march on to ElasticSearch,
> but I figure lots of people are using Zebra still and probably will
> for a long time, so perhaps worth thinking about.
> 
> https://software.indexdata.com/zebra/doc/zebraidx.html
> 
> _______________________________________________
> Koha-devel mailing list
> Koha-devel at lists.koha-community.org
> http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel
> website : http://www.koha-community.org/
> git : http://git.koha-community.org/
> bugs : http://bugs.koha-community.org/
> 

-- 
Fridolin SOMERS <fridolin.somers at biblibre.com>
BibLibre, France - software and system maintainer