[Koha-devel] inverted list Proof of Concept [quest for search]

Fri May 27 03:01:35 CEST 2005

Hi,

After some chat with Kados, I've spent some time writing a proof of 
concept for my inverted list idea. It's not perfect, but it works quite 
well. I committed it to CVS a few minutes ago (the 
misc/build_marc_Tword.pl and C4::SearchMarcTest.pm files).

NPL should run it on their DB, which is really huge, and report how it 
performs.

How it works :
===============
* create the table marc_Tword with the following structure :
CREATE TABLE `marc_Tword` (
   `word` varchar(80) NOT NULL default '',
   `usedin` text NOT NULL,
   `tagsubfield` varchar(4) NOT NULL default '',
   PRIMARY KEY  (`word`,`tagsubfield`)
) TYPE=MyISAM;
* open a console & export PERL5LIB & KOHA_CONF as usual.
* fill this table with misc/build_marc_Tword.pl. Warning: this script 
uses a very memory-consuming but very fast method to fill the table: it 
builds everything in memory, then writes everything at once. Another 
method is provided (commented out), but it's about 100 times slower 
(really!)
* open opac-search.pl and replace
	use C4::SearchMarc;
by
	use C4::SearchMarcTest;
as the API hasn't changed, it will work immediately.
* go to opac-search (advanced search) & search whatever you want. It 
should work fine (except "Any word", which is not implemented for the 
moment)
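
To make the "everything in memory, then write" idea concrete, here is a 
minimal sketch. It is written in Python for brevity, not in the Perl of 
the actual script, and it is NOT the code of build_marc_Tword.pl. The 
names build_inverted_list, to_rows and IGNORE_LIST are hypothetical, and 
I assume here that usedin holds a comma-separated list of biblio numbers 
(the real format may differ).

```python
import re
from collections import defaultdict

# Hypothetical stop-word set, standing in for the script's %ignore_list.
IGNORE_LIST = {"the", "a", "of", "and"}


def build_inverted_list(subfields):
    """Build the whole inverted list in memory.

    subfields: iterable of (bibid, tagsubfield, value) tuples.
    Returns {(word, tagsubfield): sorted list of bibids}.
    """
    index = defaultdict(set)
    for bibid, tagsubfield, value in subfields:
        for word in re.findall(r"\w+", value.lower()):
            if word in IGNORE_LIST:
                continue
            index[(word, tagsubfield)].add(bibid)
    return {key: sorted(bibids) for key, bibids in index.items()}


def to_rows(index):
    """Flatten the index into (word, usedin, tagsubfield) rows.

    One row per (word, tagsubfield) pair, ready for a single bulk write.
    ASSUMPTION: usedin is a comma-separated list of biblio numbers.
    """
    return [
        (word, ",".join(str(b) for b in bibids), tag)
        for (word, tag), bibids in sorted(index.items())
    ]


rows = to_rows(build_inverted_list([
    (1, "200a", "Social change in France"),
    (2, "200a", "The change of seasons"),
    (3, "200a", "Social history"),
]))
```

The slow alternative would update one row per word occurrence as it 
goes; building the full dictionary first trades RAM for far fewer writes.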

LIMITS :
==========
* build_marc_Tword has problems with extended chars (mainly accented 
ones). So don't be afraid if you get SQL errors. They are not a problem 
for a POC.
* search results are always ordered by title, whatever you choose.
* search only handles WORDA and WORDB, not yet WORDA or WORDB, nor 
WORDA except WORDB. Due to the structure, those searches should cost 
exactly the same as the others.
You can test WORDA or WORDB very easily: in SearchMarcTest, at line 240,
replace
	@result = keys %intersect;
by
	@result = keys %union;
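
Why these searches cost the same can be seen in a small sketch (Python, 
not the Perl of SearchMarcTest; the function name combine and its mode 
argument are hypothetical). Once each word has given its list of biblio 
numbers, "and", "or" and "except" are just three different set 
operations on the same lists:

```python
def combine(postings, mode="and"):
    """Combine per-word lists of biblio numbers.

    postings: list of lists of bibids, one list per search word.
    mode: "and" (intersection), "or" (union) or "except"
          (first list minus all the others).
    """
    sets = [set(p) for p in postings]
    if not sets:
        return []
    if mode == "and":
        result = set.intersection(*sets)
    elif mode == "or":
        result = set.union(*sets)
    elif mode == "except":
        result = sets[0].difference(*sets[1:])
    else:
        raise ValueError("unknown mode: %s" % mode)
    return sorted(result)


worda = [1, 2, 3]
wordb = [2, 3, 4]
both = combine([worda, wordb], "and")
either = combine([worda, wordb], "or")
only_a = combine([worda, wordb], "except")
```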

Some info on performance/size :
============================
On my largest DB (1 900 000 rows in marc_subfield_table), the 
build_marc_Tword.pl script requires 60s for the 1st SQL read, then 120s 
to read the whole table, then another 100s to write the table. It 
takes up to 350-400 MB of RAM, and at the end, the marc_Tword table has 
245 000 rows, for 100MB of disk space. Note we could probably lower 
these figures by tuning %ignore_list better.

Tests :
========
A search on a title word (socia*, with a *) that gets 4500 results 
returns in less than 2 seconds (mysql cache being empty).
A search on another word (chan*) that gets 590 results takes the same time.
A search on both words (socia* chan*) gets 446 results and takes the 
same time.
Doing the same test on a 2.2 install is VERY different: my SCSI hard 
disk makes a lot of noise & the result for socia* chan* takes almost 
20 seconds to appear.

Testing with an OR (chan* OR socia*) gets 4930 results, in probably less 
than 2 seconds.
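
For readers who want to see how a truncated search like socia* can use 
the word table, here is a rough sketch in Python with an in-memory 
SQLite DB (not MySQL, and not the code of SearchMarcTest). The function 
search_prefix, the sample rows and the comma-separated format of usedin 
are all assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same columns as marc_Tword, with SQLite types instead of MySQL ones.
conn.execute(
    "CREATE TABLE marc_Tword ("
    " word TEXT NOT NULL,"
    " usedin TEXT NOT NULL,"
    " tagsubfield TEXT NOT NULL,"
    " PRIMARY KEY (word, tagsubfield))"
)
# Made-up sample rows: (word, usedin, tagsubfield).
conn.executemany("INSERT INTO marc_Tword VALUES (?, ?, ?)", [
    ("social", "1,3", "200a"),
    ("socialism", "4", "200a"),
    ("change", "1,2", "200a"),
    ("channel", "4,5", "200a"),
])


def search_prefix(conn, prefix, tagsubfield):
    """Return the set of biblio numbers for every word starting with prefix.

    ASSUMPTION: usedin is a comma-separated list of biblio numbers.
    The prefix is not escaped, so % or _ in it would act as wildcards.
    """
    cur = conn.execute(
        "SELECT usedin FROM marc_Tword WHERE word LIKE ? AND tagsubfield = ?",
        (prefix + "%", tagsubfield),
    )
    bibids = set()
    for (usedin,) in cur:
        bibids.update(int(b) for b in usedin.split(","))
    return bibids


socia = search_prefix(conn, "socia", "200a")
chan = search_prefix(conn, "chan", "200a")
both = sorted(socia & chan)    # socia* chan*
either = sorted(socia | chan)  # chan* OR socia*
```

Each search word costs one query on the word table, whatever the number 
of results, which matches the roughly constant timings above.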

(Those timings could probably be improved a little, as there are a lot 
of SQL queries to get the item number, status & biblio subtitle.)

Note that in the logs, you'll get an entry for every sql->execute (with 
the SQL executed). You can see that most queries are not about the 
search itself, but about the item number & status.

-- 
Paul POULAIN
Independent free software consultant
French-speaking lead for Koha (free ILS http://www.koha-fr.org)