[Koha-devel] marc_word and searching

Sun May 30 05:43:12 CEST 2004

On Fri, May 28, 2004 at 09:27:05AM +0200, paul POULAIN wrote:
> Joshua Ferraro wrote:
> 
> >Paul wrote:
> >
> >NPL had a tech meeting today focusing on OPAC searching, and we have
> >reached some tentative conclusions about how to proceed.  Running some
> >test searches using marc_subfield_table, we realized that a search model
> >based on that table is inadequate for our needs.  For example, a search
> >on 'patrick o'brian' using the 'like' syntax produces no results if the
> >database entry is stored as 'o'brian, patrick' (when the author is stored
> >in the 100a, that is the format).  On the other hand, a search using the
> >current marc_word model fails for reasons we have already talked about
> >(marc_word does not keep track of single characters, &c.).  But if the
> >marc_word table did index single characters, a search model based on
> >marc_word would work very well.  For example, searches on
> >'o'brian, patrick' and 'patrick o'brian' would both return the correct
> >records.  So our idea is to re-create our marc_word table so that it
> >indexes all characters from the tags and subfields that we want to use
> >for searches (we don't need all of them, as you pointed out; for
> >instance, we will never use 300 for a search).  So we have three basic
> >tasks:
> > 
> >
> Another idea that might be better:
> replace ' with _.
> Thus a search on o'brian looks up o_brian, which is what will be stored
> in the DB.
> The only limitation is that a search on brian alone won't succeed. Tell
> me if that's a problem.
> 
> Otherwise, we could add an 'also index one-letter words' option, but,
> IMHO, ONLY together with the 'do not index this subfield' feature.
It seems to me that indexing single-character words will lead to a more
accurate search--though the marc_word table will be a bit bloated.  I suppose
we should think about other punctuation marks too--do we change them
all to _ or do we leave them in the database?
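
To make the trade-off concrete, here is a minimal sketch in Python (not
Koha's actual Perl code) comparing the two options with the 'like' search.
All function names are hypothetical, and it assumes that a word search
succeeds only when every word of the query is found in the index:

```python
import re

def split_words(text):
    # Lowercase, then treat every non-alphanumeric character
    # (apostrophes, commas, spaces, underscores) as a separator.
    return re.findall(r"[^\W_]+", text.lower())

def build_index(value, min_len=1):
    # Words kept in the index; min_len=2 drops single characters.
    return {w for w in split_words(value) if len(w) >= min_len}

def word_search(query, value, min_len=1):
    # Assumption: every query word must be present in the index.
    return set(split_words(query)) <= build_index(value, min_len)

def underscore_words(text):
    # The alternative: replace ' with _ so that o'brian stays one word.
    return set(re.findall(r"\w+", text.lower().replace("'", "_")))

def underscore_search(query, value):
    return underscore_words(query) <= underscore_words(value)

stored = "O'Brian, Patrick"

# 'like'-style substring match: word order matters, so this fails.
print("patrick o'brian" in stored.lower())                   # False
# Single characters indexed: both word orders match.
print(word_search("patrick o'brian", stored))                # True
print(word_search("o'brian, patrick", stored))               # True
# Single characters not indexed: the word 'o' is never found.
print(word_search("patrick o'brian", stored, min_len=2))     # False
# Underscore variant: the full name matches, 'brian' alone does not.
print(underscore_search("patrick o'brian", stored))          # True
print(underscore_search("brian", stored))                    # False
```

In this sketch every other punctuation mark is simply a separator; turning
them all into _ would instead glue neighbouring words together the way
o_brian is glued.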

> Everybody can give their opinion here. Both solutions are easy to code.
> 
> >1.) write a script to re-create marc_word using the parameters we choose
> >for searching and including all characters.
> >
> >2.) fix Biblio.pm so that it will include all characters when it adds
> >records to marc_word (currently we add to our holdings using a modified
> >version of bulkmarcimport.pl that relies on Biblio.pm)
> >
> >3.) write a clean-up script to delete all the tags and subfields from
> >marc_word that we will never use (like 300)
> >
> >Does that seem like a sound plan to you, Paul?  Do you have any scripts
> >that will speed up the process of re-building our marc_word table?  If
> >not, we will write one ourselves.  Can you make the changes to Biblio.pm
> >that will force it to index single characters?
> > 
> >
> Yep, if we decide to do it.
> I have no fast script to rebuild the marc_word table :-(
> 
> >One final point about search results.  Currently the marc searching does
> >not pass all the variables to the template so that we can choose what
> >values to display (for example, Lord of the Rings: The Two Towers currently
> >displays as 'Lord of the Rings:' without the subtitle).  I suggest that
> >we set up a method of easily making marc fields available to the template
> >so that each library can decide exactly what marc fields they want to
> >display for the initial search results.
> > 
> >
> Already planned. I'll try to commit some code to CVS ASAP.
> "MARC view" is ready (in OPAC).
> We plan to add a system preference called 'ISBD' where the library can
> define its own biblio presentation.
> Something like:
> [200a;][200b/][(100c)]
> 
> The ; means a ; is added AFTER the 200a; the ( means a ( is added BEFORE
> the 100c.
> Not exactly an ISBD view, but not too far off either.
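
For illustration only, here is a minimal Python sketch of how a template
like that might be expanded. The syntax is inferred from the one example
above, the `render` function and the record layout are hypothetical, and
skipping a block when its subfield is absent is an assumption, not
something stated in this thread:

```python
import re

# One block: [ optional prefix, 3-digit tag, subfield code, optional suffix ]
BLOCK = re.compile(r"\[([^\[\]\d]*)(\d{3})([a-z0-9])([^\[\]]*)\]")

def render(template, record):
    # record maps (tag, subfield) -> value; a hypothetical layout.
    parts = []
    for prefix, tag, sub, suffix in BLOCK.findall(template):
        value = record.get((tag, sub))
        if value is None:
            continue  # assumption: omit the whole block, punctuation included
        parts.append(prefix + value + suffix)
    return "".join(parts)

# Placeholder values, not real catalogue data.
record = {("200", "a"): "A", ("200", "b"): "B", ("100", "c"): "C"}
print(render("[200a;][200b/][(100c)]", record))   # A;B/(C)
```

Any text outside the square brackets is ignored by this sketch.
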
Thanks Paul.

Joshua