[Koha-devel] DB design (MARC structure)

Joshua Ferraro jferraro at athenscounty.lib.oh.us
Mon Jun 21 06:31:06 CEST 2004


Paul wrote:
> >So it still looks like we're using temp and filesort--which I assume is
> >causing the hangup ... if that is the best we can do we may need to start
> >thinking about breaking up marc_word into sections (e.g., marc_word_title,
> >marc_word_author, etc.).  The search is really accurate but just too slow
> >for a production database as large as ours.  What do folks think, would
> >that speed things up?
> > 
> >
> I think it's too complex to code. The DB would need a big rewrite, and
> multi-MARC support would be a pain.
> Some ideas for further speedups:
> * analyse the table:
> http://www.databasejournal.com/features/mysql/article.php/10897_1382791_3
done:

mysql> analyze table marc_word;
+----------------+---------+----------+----------+
| Table          | Op      | Msg_type | Msg_text |
+----------------+---------+----------+----------+
| Koha.marc_word | analyze | status   | OK       |
+----------------+---------+----------+----------+
1 row in set (18 min 34.90 sec)

> * is your my.cnf correctly set? (i.e. with big caches: "Eliminating the
> filesort to speed things up is best done by calculating how big your
> result set of this operation can become and then increasing the server
> variable sort_buffer_size, which is the maximum size that mysqld will
> keep in memory before using a filesort instead. Note that if your
> resultset is extremely big then you might consume more memory than is
> advisable and then things might slow down because of swapping", a hint
> given here: http://forums.devshed.com/showthread.php?t=149288. Setting
> sort_buffer_size=16M seems to make my harddisk silent ;-) )
yep ... set to 64M
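
for the archives, roughly what that change looks like (a sketch, not our
exact config; "set global" needs MySQL 4.0.3 or later, otherwise put the
line in my.cnf and restart mysqld):

mysql> set global sort_buffer_size = 67108864;
mysql> show variables like 'sort_buffer_size';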

> * change the index to make it "unique". The problem is changing the
> table definition (with so many rows in it, it will be quite hard).
> Not sure of the speed improvement.
not sure how to do this ...
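
if it just means making the key on all the columns unique, maybe something
like this would do it? (untested, and it assumes that
(bibid, tagsubfield, tagorder, subfieldorder, word) really does identify a
row uniquely, which I haven't verified; with ~4 million rows the rebuild
would also take a while)

mysql> alter table marc_word
    -> add unique index (bibid, tagsubfield, tagorder, subfieldorder, word);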

> * move the temporary index to RAM (how?)
??
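
my best guess at what you mean: point mysqld's tmpdir at a ramdisk, so the
on-disk temporary tables land in RAM. Assuming a Linux tmpfs mounted at
/dev/shm (untested here):

# in my.cnf, then restart mysqld:
tmpdir = /dev/shm

mysql> show variables like 'tmpdir';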

> * if we add a "limit 0,200" to the query, things are faster. Maybe we
> could add a "limit 0,50" or some systempref value.
even when the result set is only about 70 rows or so, it's taking about 18
seconds to return a result DIRECTLY FROM MYSQL.
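
(for reference, the limit just goes on the end of the query; here's a
stripped-down single-table example only to show where it sits, not our
real query:)

mysql> select distinct bibid from marc_word
    -> where word like 'brian%'
    -> and tagsubfield in ('100a','110a','700a','710a')
    -> limit 0,50;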

> * how many rows do you have in your marc_word table? Mine has 9,000,000,
> and answers are faster than yours (around 5 seconds), except when the
> result is >200 entries.
mysql> select count(*) from marc_word;
+----------+
| count(*) |
+----------+
|  3933356 |
+----------+
1 row in set (0.00 sec)

> * does marc_word contain only the tags that are interesting (i.e. did
> you discard, for example, the 0xx tags)? I think yes, even if
> it's not a standard Koha hack ;-)
yes, our marc_word table is VERY clean ... it was rebuilt using the 
build_marc_word.pl script in updater/

> * multi-word is slower than single-word (in your case o'brian is a
> two-word search). (Note: I still think it's not a good idea to index
> one-letter words.)
we really need the accuracy that indexing one-letter words provides (think
of the 'o' in o'brian), and for us there is not much difference in index
size between indexing one-letter words and only two/three-letter words ...

> * An idea could be to split marc_word into 10 tables, one for each nxx
> tag block (0xx, 1xx, 2xx...). I'm not sure we'd get a big improvement
> here, because some searches would have to query a lot of different
> tables. Another problem is that some tables would be almost empty (like
> 0xx), and some would be huge (like 7xx, probably).
yeah, I still like the idea of setting up separate marc_word tables for
each search type ... though I realize, as you said, that it would require
quite a big change and lots of work ...
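
to sketch what I have in mind (hypothetical schema; the columns are just
copied from marc_word as shown in the explain below, the index names are
made up, and the author rows get copied into their own table):

mysql> create table marc_word_author (
    ->   bibid bigint(20) not null default 0,
    ->   tagsubfield varchar(4) not null default '',
    ->   tagorder tinyint(4) not null default 1,
    ->   subfieldorder tinyint(4) not null default 1,
    ->   word varchar(255) not null default '',
    ->   sndx_word varchar(255) not null default '',
    ->   key bibid (bibid),
    ->   key word (word)
    -> );
mysql> insert into marc_word_author
    -> select * from marc_word
    -> where tagsubfield in ('100a','110a','700a','710a');

an author search would then hit only the small table, with no
"tagsubfield in (...)" filter needed.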

> * Remove the ordering of the result. It will remove the "Using
> filesort", which is bad. What improvement does that give, Joshua? Maybe
> we could order the result with a Perl script AFTER building the result
> (I'm not sure it would be a good idea, as it's not compatible with any
> LIMIT clause and requires retrieving the whole result list).
hmmm, removing the ordering gets rid of "Using filesort" and does speed
things up from 18-25 secs to 8-10 secs directly from mysql (still too
long?).
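
for reference, the unordered variant is just the same query as below with
the "order by biblio.title" dropped (and the title added to the select
list, so a perl script could sort afterwards if we went that route):

mysql> select distinct m1.bibid, biblio.title from biblio,biblioitems,marc_biblio,marc_word as m1,marc_word as m2,marc_word as m3 where biblio.biblionumber=marc_biblio.biblionumber and biblio.biblionumber=biblioitems.biblionumber and m1.bibid=marc_biblio.bibid and (m1.bibid=m2.bibid and m1.bibid=m3.bibid) and ((m1.word like 'o%' and m1.tagsubfield in ('100a','110a','700a','710a')) and (m2.word like 'brian%' and m2.tagsubfield in ('100a','110a','700a','710a')) and (m3.word like 'patrick%' and m3.tagsubfield in ('100a','110a','700a','710a')));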

I've been looking at the optimization section of the MySQL manual
(http://dev.mysql.com/doc/mysql/en/EXPLAIN.html), and I have some questions
about our design.

First, the two EXPLAINs:

mysql> explain marc_word;
+---------------+--------------+------+-----+---------+-------+
| Field         | Type         | Null | Key | Default | Extra |
+---------------+--------------+------+-----+---------+-------+
| bibid         | bigint(20)   |      | MUL | 0       |       |
| tagsubfield   | varchar(4)   |      | MUL |         |       |
| tagorder      | tinyint(4)   |      | MUL | 1       |       |
| subfieldorder | tinyint(4)   |      | MUL | 1       |       |
| word          | varchar(255) |      | MUL |         |       |
| sndx_word     | varchar(255) |      | MUL |         |       |
+---------------+--------------+------+-----+---------+-------+
6 rows in set (0.00 sec)

mysql> explain select distinct m1.bibid from biblio,biblioitems,marc_biblio,marc_word as m1,marc_word as m2,marc_word as m3 where biblio.biblionumber=marc_biblio.biblionumber and biblio.biblionumber=biblioitems.biblionumber and m1.bibid=marc_biblio.bibid and (m1.bibid=m2.bibid and m1.bibid=m3.bibid) and ((m1.word  like 'o%' and m1.tagsubfield in ('100a','110a', '700a', '710a'))and (m2.word like 'brian%' and m2.tagsubfield in('100a','110a', '700a', '710a'))and (m3.word like 'patrick%' and m3.tagsubfield in('100a','110a', '700a', '710a'))) order by biblio.title;
+-------------+--------+------------------------+-------------+---------+--------------------------+------+-----------------------------------------------+
| table       | type   | possible_keys          | key         | key_len | ref                      | rows | Extra                                        |
+-------------+--------+------------------------+-------------+---------+--------------------------+------+-----------------------------------------------+
| m2          | range  | bibid,word,Search_Marc | Search_Marc |     259 | NULL                     |  366 | Using where; Using temporary; Using filesort |
| m1          | ref    | bibid,word,Search_Marc | bibid       |       8 | m2.bibid                 |   28 | Using where                                  |
| marc_biblio | eq_ref | PRIMARY,biblionumber   | PRIMARY     |       8 | m1.bibid                 |    1 | Using where; Distinct                        |
| biblio      | eq_ref | PRIMARY,blbnoidx       | PRIMARY     |       4 | marc_biblio.biblionumber |    1 | Distinct                                     |
| biblioitems | ref    | bibnoidx               | bibnoidx    |       4 | biblio.biblionumber      |   12 | Using index; Distinct                        |
| m3          | ref    | bibid,word,Search_Marc | bibid       |       8 | m1.bibid                 |   28 | Using where; Distinct                        |
+-------------+--------+------------------------+-------------+---------+--------------------------+------+-----------------------------------------------+
6 rows in set (0.01 sec)

According to the manual, "you can get a good indication of how good
a join is by taking the product of the values in the rows column of
the EXPLAIN output. This should tell you roughly how many rows MySQL
must examine to execute the query."  So for this example we have:
366 * 28 * 1 * 1 * 12 * 28 = 3,443,328 rows


