[Koha-devel] marc_word and searching

Wed May 26 07:28:35 CEST 2004

At what point does marc_word become so big and clunky that it becomes a
liability instead of an asset?  NPL's marc_word table is full of 'junk'
entries like "(pa." (picked up when an ISBN has "(pa.)" after it to
denote paperback) and other such MARC oddities.  Our stopword file should
ideally be expanded to catch all of this junk, but I haven't done that
yet.  Now we're talking about adding punctuation marks and single letters!
I agree with Joshua that this is what should be done if we're going to
depend on marc_word and expect to get any meaningful search results.
My question is:  would it be more efficient to just use
marc_subfield_table for these searches and forget about marc_word?
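
To make the marc_subfield_table idea concrete, here is a minimal sketch of
a prefix match against the whole subfield value.  It is illustrative only:
it uses Python with an in-memory SQLite table rather than Koha's Perl and
MySQL, and the column names (bibid, tag, subfieldcode, subfieldvalue), the
sample rows, and the helper names build_demo_db and search_author are all
assumptions made for the example, not Koha's actual schema or code.

```python
import sqlite3

# Author fields taken from the query quoted below: 100a, 110a, 700a, 710a.
AUTHOR_FIELDS = (("100", "a"), ("110", "a"), ("700", "a"), ("710", "a"))


def build_demo_db():
    """Create a tiny stand-in for marc_subfield_table (assumed columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE marc_subfield_table ("
        "bibid INTEGER, tag TEXT, subfieldcode TEXT, subfieldvalue TEXT)"
    )
    conn.executemany(
        "INSERT INTO marc_subfield_table VALUES (?, ?, ?, ?)",
        [
            (1, "100", "a", "O'Brian, Patrick"),           # the author we want
            (2, "100", "a", "Brian, Patrick"),             # near miss
            (3, "245", "a", "O'Brian, Patrick : a life"),  # title, not author
        ],
    )
    return conn


def search_author(conn, term, fields=AUTHOR_FIELDS):
    """Return bibids whose author subfield starts with the term as typed."""
    # Escape LIKE wildcards so user input is matched literally.
    escaped = (
        term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    field_clause = " OR ".join(
        "(tag = ? AND subfieldcode = ?)" for _ in fields
    )
    sql = (
        "SELECT DISTINCT bibid FROM marc_subfield_table "
        "WHERE subfieldvalue LIKE ? ESCAPE '\\' "
        "AND (" + field_clause + ") ORDER BY bibid"
    )
    params = [escaped + "%"]
    for tag, code in fields:
        params.extend([tag, code])
    return [row[0] for row in conn.execute(sql, params)]


if __name__ == "__main__":
    conn = build_demo_db()
    print(search_author(conn, "o'brian, patrick"))
```

Because the apostrophe and comma stay in the stored value, the search term
needs no splitting or stripping.  The trade-off is that this only matches
from the start of the subfield; a search for "patrick o'brian" would still
need word-level handling of some kind.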

Stephen

Joshua Ferraro said:
> Paul et al,
>
> I've been trying to figure out how best to solve our ' and , problem
> with the marc searching and I've got a few comments to make about the
> way that the searches are currently done (using marc_word) and the
> problems with how marc_word stores data.
>
> So here's a classic example of an author that fails currently:
> o'brian, patrick
>
> Right now the search separates the 'o' and the 'brian' and the 'patrick'
> and the resulting query looks like this:
>
> select distinct m1.bibid from biblio,biblioitems,marc_biblio,marc_word as
> m1,marc_word as m2,marc_word as m3,marc_word as m4 where
> biblio.biblionumber=marc_biblio.biblionumber and
> biblio.biblionumber=biblioitems.biblionumber and
> m1.bibid=marc_biblio.bibid and (m1.bibid=m2.bibid and m1.bibid=m3.bibid
> and m1.bibid=m4.bibid) and ((m1.word  like 'o%' and m1.tag+m1.subfieldid
> in ('100a','110a', '700a', '710a'))and (m2.word like '\'%' and
> m2.tag+m2.subfieldid in('100a','110a', '700a', '710a'))and (m3.word like
> 'brian%' and m3.tag+m3.subfieldid in('100a','110a', '700a', '710a'))and
> (m4.word like 'patrick%' and m4.tag+m4.subfieldid in('100a','110a',
> '700a', '710a'))) order by biblio.title
>
> So there is at least one major problem with this query (which does not
> return any results): marc_word does not store values as small as ' or o.
> So of course there are no results ...
>
> Even if I replace the ' and , in the query with spaces and search on
> the result (I add the following after line 117 in SearchMarc.pm):
>
> # replace apostrophes and commas with spaces
> $value->[$i] =~ s/'/ /g;
> $value->[$i] =~ s/,/ /g;
>
> which turns out like:
>
> 'o brian patrick'
>
> it fails ('o' is too small for marc_word); and of course
>
> # delete apostrophes and commas outright
> $value->[$i] =~ s/'//g;
> $value->[$i] =~ s/,//g;
>
> resulting in:
>
> 'obrian patrick'
>
> fails too--the data simply isn't stored right for this kind of search.
>
> So I see two ways to fix this problem: 1) stop using marc_word for these
> kinds of searches and use marc_subfield_table (which has the whole
> "o'brian, patrick" in subfield_value), or 2) fix the way that marc_word
> stores small values (it should store everything, including , ' and single
> letters like 'a', 'o', etc.).
>
> Any comments?  Further suggestions?
>
> Joshua

-- 
Stephen Hedges
Skemotah Solutions, USA
www.skemotah.com  --  shedges at skemotah.com