[Koha-zebra] Koha Zebra Searching Report (from NPL)

Thu Mar 23 02:28:26 CET 2006

Joshua Ferraro wrote:

>Hi all,
>
>I had a day-long session with NPL staff members today to get some 
>feedback on the Koha Zebra search options as they stand right now.
>The demo we used is here: http://kohatest.liblime.com. It's the
>entire set (150K or so records) from NPL's database, indexed using
>the record.abs that I committed to CVS yesterday and the latest
>SearchMarc.pm.
>
>PRELIMINARY COMMENTS
>
>NPL was very pleased with the new system and we got a lot of
>positive feedback. They are of course quite familiar now with
>the open source process of feedback so were more than happy to
>offer suggestions for how to improve.
>So ... here we go ...
>  
>
Congrats on the demo!

>The advanced search should include true boolean searching as is
>standard on many library advanced search interfaces (select 
>from multiple types of searches and have the ability to AND OR 
>XOR NOT).
>  
>
Can't do XOR today. I suppose it would be a possible new feature, but 
I've frankly never heard of it in an ILS. Can a XOR b be mapped to

(a OR b) NOT (a AND b)? Or am I just showing my fading math skills to 
ill effect here?
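
The mapping above does hold. Here is a minimal Python sketch that checks the truth table and builds the equivalent query in PQF prefix notation, where @not is the binary and-not operator. The helper names are made up for illustration, and the search terms are only examples:

```python
from itertools import product


def xor_via_and_not(a: bool, b: bool) -> bool:
    """(a OR b) AND-NOT (a AND b), written with plain booleans."""
    return (a or b) and not (a and b)


def xor_pqf(left: str, right: str) -> str:
    """Build a PQF query for 'left XOR right'.

    PQF is prefix notation, and @not means and-not, so this reads as
    (left OR right) AND-NOT (left AND right).
    """
    return f"@not @or {left} {right} @and {left} {right}"


# Check all four input combinations against Python's own XOR (a != b).
for a, b in product([False, True], repeat=2):
    assert xor_via_and_not(a, b) == (a != b)

print(xor_pqf("horses", "psychology"))
```

So an XOR option in the search form could probably be handled by query rewriting alone, without anything new on the server side.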

>Search by format should work (currently doesn't). Also, there is
>some question as to whether 'format' differs from 'type' differs
>from 'category'. We need to take a closer look at the bib1 profile
>and see what is available.
>
>Search by Branch should work -- and really, I'm not sure that
>bib1 supports a 'Branch' option so I'm not sure how that would
>work ...
>  
>
Why do you see yourself limited to Bib-1? Within Koha, you can do 
whatever you want -- specifically extend Bib-1 into the 8000-range 
(IIRC) for local USE attributes or define a private set.

>Search by Call # (LCC or Dewey) should work.
>
>Search by Publisher didn't seem to be working -- it should use 
>at least 260$b but we need to better define where pub info is 
>supposed to be kept.
>
>Sort by date should use the 008 field's dates (no other field
>is normalized enough for sorting to be consistent -- unless
>someone on this list knows of another field).
>  
>
I think 008 is it if you want a really predictable date.
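
As a rough illustration of why 008 is predictable: in MARC 21 bibliographic records, Date 1 sits at fixed character positions 07-10 of the 008 field. A minimal Python sketch of pulling out a sort key follows. The function name and the sample 008 strings are made up, and partial dates such as "19uu" are simply treated as unknown:

```python
from typing import Optional


def date1_from_008(field_008: str) -> Optional[int]:
    """Return Date 1 (008 positions 07-10) as an int, or None if unusable."""
    date1 = field_008[7:11]
    # Reject short fields and partial dates such as "19uu" or blanks.
    if len(date1) == 4 and date1.isascii() and date1.isdigit():
        return int(date1)
    return None


# Hypothetical records: (title, 008 field).
records = [
    ("Older book", "060323s1998    xxu           000 0 eng d"),
    ("Newest book", "060323s2005    xxu           000 0 eng d"),
    ("Undated book", "060323q19uu    xxu           000 0 eng d"),
]

# Newest first; records without a usable date sort last.
by_date = sorted(records, key=lambda r: date1_from_008(r[1]) or 0, reverse=True)
for title, f008 in by_date:
    print(title, date1_from_008(f008))
```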

>SPELLCHECKING
>
>Does zebra have an integrated spellcheck feature? NPL was hoping
>that zebra would use variant spelling and misspelling (via a 
>soundex or something) to pull up results even when the record's
>form differs from the search term. If not, we should expand the
>spellchecking feature LibLime wrote to be more complete.
>  
>
It isn't soundex, but it will behave somewhat the same in many cases. 
Try searching with truncation=Regexp-2 (103). This enables 
error-tolerant searching. By default, one error (insert/delete/replace) 
per term will still lead to a match. More at 
http://www.indexdata.com/zebra/doc/protocol-support.tkl#search
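
To make "one error per term" concrete, here is a small Python sketch. It is not how Zebra implements it internally; it only illustrates which terms are one insert/delete/replace apart, and shows what a PQF query with the truncation attribute (type 5) set to 103 would look like. The function names are made up, and use attribute 4 (title) is just an example:

```python
def within_one_error(term: str, candidate: str) -> bool:
    """True if candidate differs from term by at most one inserted,
    deleted, or replaced character."""
    if term == candidate:
        return True
    if abs(len(term) - len(candidate)) > 1:
        return False
    # Skip the common prefix; i ends up at the first differing position.
    i = 0
    while i < min(len(term), len(candidate)) and term[i] == candidate[i]:
        i += 1
    if len(term) == len(candidate):
        # One replacement: everything after position i must match.
        return term[i + 1:] == candidate[i + 1:]
    # One insertion/deletion: drop one character from the longer string.
    longer, shorter = (term, candidate) if len(term) > len(candidate) else (candidate, term)
    return longer[i + 1:] == shorter[i:]


def fuzzy_pqf(use_attr: int, term: str) -> str:
    """PQF query asking for error-tolerant matching (truncation 5=103)."""
    return f'@attr 1={use_attr} @attr 5=103 "{term}"'


print(within_one_error("tango", "tanga"))   # replacement
print(within_one_error("tango", "tangos"))  # insertion
print(within_one_error("tango", "bingo"))   # two replacements
print(fuzzy_pqf(4, "tango"))
```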

>TITLE SEARCHING
>
>The consensus seems to be that the default relevance ranking should not
>be how many times the search terms appear in the record, but whether
>the title starts with or contains the search terms. So for instance,
>a title search on "the civil war" should pull up all the titles with
>that exact title first, followed by records that contain that phrase
>in the title (even if not starting with it), followed by records where
>the terms appear in some title field somewhere in the record. Contrast
>that with the current behavior which puts "Voices from the Civil War"
>first (I'm assuming this is because the terms "the, civil, war" appear
>more times in that record than any other, or because they appear the
>same number of times but that record was indexed more recently.)
>
>One-word titles like "Tango" should come up first with a title search
>for "Tango".
>  
>
This would, I believe, require new development. It's possible that one 
of the experimental ranking algorithms that are included might provide 
better results for these people, but I *think* that boosting the score 
for one field in a ranked keyword search would require an extension to 
the index structure.
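
Until something like that exists in Zebra, the ordering NPL describes could be approximated by re-sorting a result set on the Koha side. A minimal Python sketch of the tiers, under the assumption that the title strings are already available for each hit (the function names and sample titles are made up for illustration):

```python
import re


def _words(text: str) -> list:
    """Lowercase and split into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def title_tier(query: str, title: str) -> int:
    """Lower is better: 0 exact title, 1 title starts with the phrase,
    2 title contains the phrase, 3 all terms present, 4 anything else."""
    q, t = _words(query), _words(title)
    if not q:
        return 4
    if t == q:
        return 0
    if t[:len(q)] == q:
        return 1
    if any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1)):
        return 2
    if set(q) <= set(t):
        return 3
    return 4


titles = [
    "Voices from the Civil War",
    "The Civil War",
    "War and the civil order",
    "The Civil War: a narrative",
]
ranked = sorted(titles, key=lambda t: title_tier("the civil war", t))
print(ranked)
```

This only reorders hits that the server has already returned, so it is no substitute for ranking inside the index when the result set is large.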

>AUTHOR SEARCHING
>
>Again, the current relevance ranking doesn't quite cut it. A good
>example is a relevance ranked author search on "James Joyce". Some
>records sneak into high relevance because they have multiple authors
>with names like "James Henry" and "Paul Joyce" (take "Bob the Builder"
>in the NPL database as an example).
>
It might be worth checking whether one of the custom ranking algos does 
better on this. You can look in the NEWS file for instructions on how to 
enable them.

>Relevance ranking
>should account for proximity and use that as the highest ranking
>consideration to ensure that a search on "James Joyce" returns all the
>books by "James Joyce" first. Also, they requested that the default
>ranking secondarily sort the items by date as well because they often 
>are asked to find the 'latest' book by so and so. We concluded that 
>the copyright date stored in the 008 is probably the only date 
>normalized enough to use for sorting though I'm not sure if zebra can 
>use that for sorting.
>  
>
It could with the XSLT index rules of Zebra 1.4.

>SUBJECT SEARCHING
>
>They seemed pleased with the way subject searching was working; it
>will correctly find things like "horses--psychology" where the first
>term is in 650$a and the second in $x. However, it seems not to 
>rank things based on proximity within a tag -- meaning that a search
>on horses--psychology will pull up records containing:
>
>650$a horses
>$x pets
>
>650$a humans
>$x psychology
>
>and records with the actual 'horses--psychology' (650$a$x) subject
>heading aren't given any favor in the ranking (I misplaced my actual 
>example and the one above is one I invented).
>  
>
Same thing. I don't know how hard it would be to add a score for 
proximity. That data is at least in the index structure, but I've no 
idea how hard it would be to fit into the code. We can ask the Zebra 
wranglers what it would entail if you're interested.
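
Just to pin down what is being asked for, here is a Python sketch of a "same tag" bonus: a record scores higher when all the search terms occur within a single 650 field than when they are scattered across several. This is not Zebra code, the function name and scores are made up, and matching whole subfield values exactly is a deliberate simplification:

```python
def subject_score(query_terms, subject_fields):
    """Score a record's 650 fields against the query terms.

    subject_fields is a list with one entry per 650 tag, each entry
    being the list of subfield values in that tag.
    Returns 2 if one tag holds every term, 1 if the terms are all
    present but spread over several tags, 0 otherwise.
    """
    wanted = {t.lower() for t in query_terms}
    if not wanted:
        return 0
    seen = set()
    for subfields in subject_fields:
        in_field = {s.lower() for s in subfields} & wanted
        if in_field == wanted:
            return 2
        seen |= in_field
    return 1 if seen == wanted else 0


# The invented example from above, plus a record with the real heading.
scattered = [["Horses", "Pets"], ["Humans", "Psychology"]]
together = [["Horses", "Psychology"]]

print(subject_score(["horses", "psychology"], scattered))
print(subject_score(["horses", "psychology"], together))
```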

>SUBJECT HEADING SEARCH
>
>NPL would like to see a demonstration of a 'Subject Heading' search
>using authorities generated from the data to compile a list of
>authoritative headings (which would be compiled from multiple fields
>within a given subject tag such as 650$a$v$x, etc.). So I think 
>to do this right we'd need to look at putting our authority records
>in Zebra as well.
>  
>
Hmm. Not sure I fully grok the requirement here. You seem to suggest 
constructing a specific index key based on a concatenation of 
multiple fields (easy in the XSLT indexing rules of 1.4, not compatible 
with the 'melm' directive).
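
If the concatenated key is what is wanted, the idea can be sketched outside Zebra in a few lines of Python. This is only an illustration of the key construction, not of the XSLT rules themselves; the function name, the choice of subfield codes, and the "--" separator are assumptions:

```python
def heading_key(subfields, codes="avxyz", sep="--"):
    """Join selected subfields of one 650 field into a single heading.

    subfields is a list of (code, value) pairs in record order.
    """
    return sep.join(value.strip() for code, value in subfields if code in codes)


# Hypothetical 650 fields taken from several records.
fields = [
    [("a", "Horses"), ("x", "Psychology")],
    [("a", "Horses"), ("x", "Pets")],
    [("a", "Horses"), ("x", "Psychology")],
]

# A browsable list of distinct headings compiled from the data.
headings = sorted({heading_key(f) for f in fields})
print(headings)
```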

>SERIES TITLE SEARCHING
>
>Series title should pull from series title fields, not just general
>title fields. We need to compile a list of these fields and create
>an index for just series titles.
>  
>
Seems easy enough.

>KEYWORD SEARCHING
>
>I saved this one for last on purpose. We're not really sure what
>keyword searching is supposed to be :-). We suspect that patrons
>expect that the keyword search is equivalent to a web search engine
>search box like Google. Given our limitations in the MARC format,
>what are our ranking options? Should titles be given priority,
>authors, subjects? We're not really sure. However, NPL did like the
>idea of offering a two-column results page for keyword searching--
>one side with library items and the other with results from the web
>using Google's API or something.
>
>Well, that's it for now. NPL staff are going to continue looking at
>the demo and providing me with feedback so I'll pass it along as I
>receive it. Please feel free to add your own comments on the demo
>as well as comment on NPL's ideas.
>  
>
I think boosting the title is probably not the worst thing that could be 
done, maybe with a slightly smaller boost to the subject keywords.
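
To show the kind of weighting I mean, here is a toy Python sketch: count term hits per field and multiply by a per-field boost. The weights, field names, and sample records are invented for illustration and would need tuning against real searches:

```python
# Made-up boosts: title highest, subject slightly lower, the rest neutral.
FIELD_BOOST = {"title": 3.0, "subject": 2.0, "author": 1.0}


def keyword_score(query: str, record: dict) -> float:
    """Sum of term hits per field, weighted by that field's boost."""
    terms = query.lower().split()
    score = 0.0
    for field, text in record.items():
        words = text.lower().split()
        hits = sum(words.count(term) for term in terms)
        score += FIELD_BOOST.get(field, 1.0) * hits
    return score


rec_title = {"title": "Tango", "subject": "Dance"}
rec_subject = {"title": "Dances of Argentina", "subject": "Tango"}

print(keyword_score("tango", rec_title))
print(keyword_score("tango", rec_subject))
```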

Neat. Interesting feedback.

--Sebastian

-- 
Sebastian Hammer, Index Data
quinn at indexdata.com   www.indexdata.com
Ph: (603) 209-6853