[Koha-devel] DB design (MARC structure)

Mon Jun 21 02:51:02 CEST 2004

Joshua Ferraro a écrit :

>Paul a écrit,
>
>I'm still showing some problems with the searching speed with the latest
>CVS ... maybe there are still some index problems.  I tried an author 
>search on "o'brian, patrick" and the result took about 10-15 seconds to 
>return.  Directly from mysql I get:
>
>mysql> select distinct m1.bibid from biblio,biblioitems,marc_biblio,marc_word as m1,marc_word as m2,marc_word as m3 where biblio.biblionumber=marc_biblio.biblionumber and biblio.biblionumber=biblioitems.biblionumber and m1.bibid=marc_biblio.bibid and (m1.bibid=m2.bibid and m1.bibid=m3.bibid) and ((m1.word  like 'o%' and m1.tagsubfield in ('100a','110a', '700a', '710a'))and (m2.word like 'brian%' and m2.tagsubfield in('100a','110a', '700a', '710a'))and (m3.word like 'patrick%' and m3.tagsubfield in('100a','110a', '700a', '710a'))) order by biblio.title;
>
>77 rows in set (5.34 sec)
>
>here's the explain on that query:
>
>mysql> explain select distinct m1.bibid from biblio,biblioitems,marc_biblio,marc_word as m1,marc_word as m2,marc_word as m3 where biblio.biblionumber=marc_biblio.biblionumber and biblio.biblionumber=biblioitems.biblionumber and m1.bibid=marc_biblio.bibid and (m1.bibid=m2.bibid and m1.bibid=m3.bibid) and ((m1.word  like 'o%' and m1.tagsubfield in ('100a','110a', '700a', '710a'))and (m2.word like 'brian%' and m2.tagsubfield in('100a','110a', '700a', '710a'))and (m3.word like 'patrick%' and m3.tagsubfield in('100a','110a', '700a', '710a'))) order by biblio.title;
>+-------------+--------+------------------------+-------------+---------+--------------------------+------+-----------------------------------------------+
>| table       | type   | possible_keys          | key         | key_len | ref                      | rows | Extra                                        |
>+-------------+--------+------------------------+-------------+---------+--------------------------+------+-----------------------------------------------+
>| m2          | range  | bibid,word,Search_Marc | Search_Marc |     259 | NULL                     |  366 | Using where; Using temporary; Using filesort |
>| m1          | ref    | bibid,word,Search_Marc | bibid       |       8 | m2.bibid                 |   56 | Using where                                  |
>| marc_biblio | eq_ref | PRIMARY,biblionumber   | PRIMARY     |       8 | m1.bibid                 |    1 | Using where; Distinct                        |
>| biblio      | eq_ref | PRIMARY,blbnoidx       | PRIMARY     |       4 | marc_biblio.biblionumber |    1 | Distinct                                     |
>| biblioitems | ref    | bibnoidx               | bibnoidx    |       4 | biblio.biblionumber      |   12 | Using index; Distinct                        |
>| m3          | ref    | bibid,word,Search_Marc | bibid       |       8 | m1.bibid                 |   20 | Using where; Distinct                        |
>+-------------+--------+------------------------+-------------+---------+--------------------------+------+-----------------------------------------------+
>6 rows in set (0.04 sec)
>
>mysql>
>
>So it still looks like we're using temp and filesort--which I assume is
>causing the hangup ... if that is the best we can do we may need to start
>thinking about breaking up marc_word into sections (e.g., marc_word_title;
>marc_word_author, etc.).  The search is really accurate but just too slow
>for a production database as large as ours.  What do folks think, would that
>speed things up?
>  
>
I think it's too complex to code. The DB would need a big rewritte & 
multi-marc support would be a pain.
Some ideas to continue speeding things :
* analyse the table : 
http://www.databasejournal.com/features/mysql/article.php/10897_1382791_3
* is your my.cnf correctly set ? (ie with big caches : "Eliminating the 
filesort to speed things up is best done by calculating how big your 
result set of this operation can become and then increase the server 
variable sort_buffer_size which is the maximum size that mysqld will 
keep in memory before using a filesort instead. Note that if your 
resultset is extremely big then you might consume more memory than is 
advisable and then things might slow down because of swaping" hint given 
here : http://forums.devshed.com/showthread.php?t=149288. Setting 
sort_buffer_size=16M seems to make my harddisk silent ;-) )
* change the index to make it "unique". The problem being to change the 
table definition (with a lot of values in it, is will be quite hard). 
Not sure of the speed improvement
* move temporary index to RAM (how ?)

* if we add I "limit 0,200" to the query, things are faster. Maybe we 
could add a "0,50" or some systempref value.
* how many lines have you in your marc_word table ? mine is 9 000 000, 
and answers are faster than for you (around 5seconds), except when the 
result is >200 entries.
* do you have in marc_word only tags that are interesting (= did you 
discard, for example, 0xx tags). I think yes, even if, for instance, 
it's not a standard Koha hack ;-)
* multi-word is slower than single word (in your case o'brian is a 2 
word seach). (note I'm still thinking it's a not a good idea to index 1 
letter words).
* An idea could be to split marc_word into 10 tables, one for each nXX 
tag (0xx, 1xx, 2xx...) I'm not sure we had a big improvement here, 
because for some searches, we could have to query a lot of differents 
tables. another problem is that some tables would be almost empty (like 
0xx), and some would be huge (like 7xx probably).
* Remove ordering of the result. It will remove the "using filesort", 
which is bad. which improvement does it give Joshua ? maybe we could 
order the result with a perl script AFTER building the result (I'm not 
sure it would be a good idea, as it's not compatible with any limit 
clause (& supposes to retrieve all the result list)

HTH

-- 
Paul POULAIN
Consultant indépendant en logiciels libres
responsable francophone de koha (SIGB libre http://www.koha-fr.org)