[Koha-devel] marc_word and searching

Joshua Ferraro jferraro at athenscounty.lib.oh.us
Thu May 27 19:42:02 CEST 2004


Paul wrote:

NPL had a tech meeting today focusing on the OPAC searching and we have
reached some tentative conclusions about how to proceed.  Running some
test searches using marc_subfield_table we realized that a search model
based on that table is inadequate for our needs.  For example, a search
on 'patrick o'brian' using the SQL 'like' syntax produces no results if the
database entry is stored as 'o'brian, patrick' (which is the format used when
the author is stored in 100a).  On the other hand, a search using the current
marc_word model fails for reasons we have already talked about (marc_word
does not keep track of single characters, etc.).  But if the marc_word table
did index single characters, a search model based on marc_word would work
very well.  For example, searches on 'o'brian, patrick' and 'patrick o'brian'
would both return the correct records.  So our idea is to re-create our
marc_word table so that it indexes all characters from the tags and subfields
that we want to use for searches (we don't need all of them as you pointed
out; for instance, we will never use 300 for a search).  So we have three
basic tasks:

1.) write a script to re-create marc_word using the parameters we choose
for searching and including all characters.

2.) fix Biblio.pm so that it will include all characters when it adds records
to marc_word (currently we add to our holdings using a modified version of
bulkmarcimport.pl that relies on Biblio.pm)

3.) write a clean-up script to delete all the tags and subfields from marc_word
that we will never use (like 300)
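
To make the indexing idea concrete, here is a minimal sketch in Python
(not Koha's actual Perl code) of a word index that keeps single
characters.  The function names, the in-memory dictionary standing in
for marc_word, and the sample 100a values are all assumptions made for
illustration only.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split on punctuation and whitespace, keeping single-character words."""
    return [t for t in re.split(r"[\W_]+", text.lower()) if t]

def build_index(records):
    """Map each word to the ids of the records containing it.

    The dictionary is a hypothetical stand-in for the marc_word table.
    """
    index = defaultdict(set)
    for record_id, value in records.items():
        for word in tokenize(value):
            index[word].add(record_id)
    return index

def search(index, query):
    """Return ids of records that contain every query word, in any order."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical 100a values, keyed by record id.
records = {1: "O'Brian, Patrick", 2: "O'Brien, Tim", 3: "Brian, Patrick"}
index = build_index(records)

print(search(index, "patrick o'brian"))
print(search(index, "o'brian, patrick"))
```

Both queries match only record 1, because word order no longer matters
and the single letter 'o' is what separates "O'Brian, Patrick" from
"Brian, Patrick".  A plain substring match of 'patrick o'brian' against
'o'brian, patrick' finds nothing, which is the 'like' problem described
above.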

Does that sound like a reasonable plan to you, Paul?  Do you have any scripts
that will speed up the process of rebuilding our marc_word table?  If not, we
will write one ourselves.  Can you make the changes to Biblio.pm that will
force it to index single characters?

One final point about search results.  Currently the MARC searching does
not pass all the variables to the template, so we cannot choose which
values to display (for example, Lord of the Rings: The Two Towers currently
displays as 'Lord of the Rings:' without the subtitle).  I suggest that
we set up a method of easily making MARC fields available to the template
so that each library can decide exactly which MARC fields it wants to
display for the initial search results.
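
One possible shape for such a method, sketched in Python rather than
in Koha's own Perl and template code: a per-library mapping from
template variable names to MARC tag/subfield pairs.  The names
DISPLAY_FIELDS and template_vars, the record layout, and the sample
record are all hypothetical.

```python
# Hypothetical per-library configuration: template variable -> MARC subfields.
DISPLAY_FIELDS = {
    "title": [("245", "a"), ("245", "b")],   # title plus remainder of title
    "author": [("100", "a")],
}

def template_vars(record, display_fields=DISPLAY_FIELDS):
    """Build the values handed to the template from the configured subfields.

    `record` is assumed to be a dict keyed by (tag, subfield) tuples.
    Subfields missing from the record are skipped.
    """
    values = {}
    for name, subfields in display_fields.items():
        parts = [record[key] for key in subfields if key in record]
        values[name] = " ".join(parts)
    return values

# Hypothetical record for illustration.
record = {
    ("245", "a"): "Lord of the Rings:",
    ("245", "b"): "The Two Towers",
    ("100", "a"): "Tolkien, J. R. R.",
}

print(template_vars(record)["title"])
```

With 245b included in the mapping the title comes out in full; a
library that wants something different would only edit the mapping.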

Comments? Suggestions?

Joshua

On Wed, May 26, 2004 at 04:38:37PM +0200, paul POULAIN wrote:
> Stephen Hedges wrote:
> 
> >At what point does marc_word become so big and clunky that it becomes a
> >liability instead of an asset?  NPL's marc-word file is full of 'junk'
> >entries like "(pa." (picked up when an ISBN number has "(pa.)" after it to
> >denote paperback) and other such MARC oddities.  Our stopword file should
> >ideally be expanded to catch all of this junk, but I haven't done that
> >yet.  Now we're talking about adding punctuation marks and single letters!
> >I agree with Joshua that this is what should be done if we're going to
> >depend on using marc_word and expect to get any meaningful search results.
> >My question is:  maybe it would be more efficient to just use
> >marc_subfield_table for these searches and forget about marc_word?
> >
> You're right, Stephen...
> I have another idea that could be coded quickly: in the MARC
> framework, we could add a checkbox called "do NOT index this subfield".
> If checked, the subfield wouldn't be stored in marc_word (but would still
> be stored in marc_subfield_table).
> (It needs a script to clean the DB too, which should be quite easy:
> foreach subfield in marc_subfield_structure {
>    if checkbox checked {
>       delete from marc_word where subfield= this one
>    }
> }
> ...)
> 
> -- 
> Paul POULAIN
> Independent free software consultant
> French-language coordinator for Koha (free ILS http://www.koha-fr.org)
> 
> 
> 
> _______________________________________________
> Koha-devel mailing list
> Koha-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/koha-devel
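
The clean-up loop in Paul's quoted message could be sketched roughly
as follows.  This is an illustration only: it uses Python's sqlite3
module with an in-memory database instead of Koha's real database, and
the column names (tag, subfield, do_not_index) and sample rows are
assumptions, not the actual Koha schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical versions of the two tables.
conn.executescript("""
CREATE TABLE marc_subfield_structure (tag TEXT, subfield TEXT, do_not_index INTEGER);
CREATE TABLE marc_word (tag TEXT, subfield TEXT, word TEXT);
INSERT INTO marc_subfield_structure VALUES ('100', 'a', 0), ('300', 'a', 1);
INSERT INTO marc_word VALUES
    ('100', 'a', 'patrick'), ('300', 'a', 'illus'), ('300', 'a', 'cm');
""")

def purge_unindexed(conn):
    """Delete marc_word rows whose subfield is flagged as not to be indexed."""
    cur = conn.execute("""
        DELETE FROM marc_word
        WHERE EXISTS (
            SELECT 1 FROM marc_subfield_structure s
            WHERE s.tag = marc_word.tag
              AND s.subfield = marc_word.subfield
              AND s.do_not_index = 1
        )
    """)
    conn.commit()
    return cur.rowcount  # number of rows removed

deleted = purge_unindexed(conn)
remaining = conn.execute("SELECT tag, subfield, word FROM marc_word").fetchall()
print(deleted, remaining)
```

A single DELETE with a subquery does the same work as looping over
every subfield, and it also covers the third task in the list above
(dropping tags such as 300 that will never be searched).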



