<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hi all,<br>
<br>
Those days, we're working on a pretty large DB ( >1M biblio), for
a customer that want to do many statistics on some fields.<br>
We discovered that something "simple" like:<br>
SELECT publicationyear, count(publicationyear) FROM biblioitems
GROUP BY publicationyear;<br>
<br>
was giving <b>no result in 10mn</b>.<br>
This is a test DB, not optimized, but we were surprised by the
results.<br>
After investigating we had the idea to create a biblioitems2 table
with the same structure <b>EXCEPT MARCXML and MARC fields<br>
<br>
</b>launch the same SQL query : <b>result in 3seconds</b> !<b><br>
</b>This could be reproduced on any query (on fields without index).<br>
<br>
I think it's because the innoDB is storing each line in one
"object", so, even if you need only one column, you have to read
everything.<br>
In our case, that was 12GB+ of data to read.<br>
biblioitems2 is just a few dozen MB.<br>
(all caching values are minimum and there's no index, so not
involved in the results)<br>
<br>
MY CONCLUSIONS:<br>
* the biblioitems.marc field must be removed quickly: it's useless
since years, and is only resulting in slowing things<br>
* the bilbioitems.marcxml field should be moved outside from this
table. Something like biblio_blob, with biblionumber,
biblioitemnumber and marcxml. When we need it, just join the tables.<br>
<br>
I'm almost sure it would have an important impact on Koha, as
biblioitems table is called and used "everywhere".<br>
<br>
any opinion ?<br>
<b></b>
<pre class="moz-signature" cols="72">--
Paul Poulain, Associé-gérant / co-owner
BibLibre, Services en logiciels libres pour les bibliothèques
BibLibre, Open Source software and services for libraries</pre>
</body>
</html>