[Koha-zebra] Koha Zebra Searching Report (from NPL)

Thu Mar 23 00:27:57 CET 2006

Hi all,

I had a day-long session with NPL staff members today to get some 
feedback on the Koha Zebra search options as they stand right now.
The demo we used is here: http://kohatest.liblime.com. It's the
entire set (150K or so records) from NPL's database, indexed using the
record.abs that I committed to CVS yesterday and the latest
SearchMarc.pm.

PRELIMINARY COMMENTS

NPL was very pleased with the new system and we got a lot of
positive feedback. They are of course quite familiar now with
the open source process of feedback so were more than happy to
offer suggestions for how to improve.
So ... here we go ...

The advanced search should include true boolean searching, as is
standard on many library advanced search interfaces (select
from multiple types of searches and have the ability to AND, OR,
XOR, NOT).
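
As a rough illustration of what the advanced search form would have to
generate, here is a small Python sketch that builds PQF (prefix query
format) strings, which Zebra accepts. The helper names (term, combine,
xor) and the sample search terms are made up for this example; the
attribute numbers are the standard Bib-1 use attributes. As far as I
know PQF has no XOR operator, so the sketch assumes XOR is rewritten
as (A OR B) NOT (A AND B).

```python
# Bib-1 "use" attribute numbers (attribute type 1) for a few access points.
USE_ATTRS = {"title": 4, "author": 1003, "subject": 21, "any": 1016}

# PQF is prefix notation: the operator comes first, then its two operands.
# Note that @not is binary in PQF and means "left AND NOT right".
PQF_OPERATORS = {"AND": "@and", "OR": "@or", "NOT": "@not"}


def term(index, text):
    """Build one PQF operand such as: @attr 1=4 "civil war"."""
    if '"' in text:
        # Assumption for this sketch: quoting rules are out of scope.
        raise ValueError("double quotes are not handled in this sketch")
    return f'@attr 1={USE_ATTRS[index]} "{text}"'


def combine(op, left, right):
    """Join two PQF operands (or sub-queries) with a boolean operator."""
    return f"{PQF_OPERATORS[op]} {left} {right}"


def xor(left, right):
    """Emulate XOR as (left OR right) NOT (left AND right)."""
    return combine("NOT", combine("OR", left, right),
                   combine("AND", left, right))


# (title "civil war" AND author "foote") NOT subject "fiction"
query = combine(
    "NOT",
    combine("AND", term("title", "civil war"), term("author", "foote")),
    term("subject", "fiction"),
)
print(query)
```

Because each sub-query is just a string, the form can nest as many
rows as it likes by feeding one combine() result into the next.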

Search by format should work (currently doesn't). Also, there is
some question as to whether 'format' differs from 'type' differs
from 'category'. We need to take a closer look at the bib1 profile
and see what is available.

Search by Branch should work -- and really, I'm not sure that
bib1 supports a 'Branch' option so I'm not sure how that would
work ...

Search by Call # (LCC or Dewey) should work.

Search by Publisher didn't seem to be working -- it should use
at least 260$b (the publisher name subfield) but we need to better
define where pub info is supposed to be kept.

Sort by date should use the 008 field's dates (no other field
is normalized enough for sorting to be consistent -- unless
someone on this list knows of another field).
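
To make the 008 idea concrete: in MARC 21, Date 1 sits in character
positions 07-10 of the 008. The Python sketch below pulls that value
out and sorts newest first, pushing records with unknown dates to the
end. The function names and the three sample records are invented for
illustration; this is not how Zebra itself would do the sort.

```python
def date1_from_008(field_008):
    """Return Date 1 (008 positions 07-10) as an int, or None if unusable."""
    if len(field_008) < 11:
        return None
    date1 = field_008[7:11]
    # Unknown digits are commonly coded as 'u' (e.g. "19uu"); treat as missing.
    return int(date1) if date1.isdigit() else None


def newest_first(records):
    """Sort records by Date 1, newest first, with undated records last."""
    def key(record):
        year = date1_from_008(record["008"])
        return (year is None, -(year or 0))
    return sorted(records, key=key)


# Invented sample data: only the 008 positions 06-10 matter here.
records = [
    {"title": "A", "008": "060323s1998    nyu           000 0 eng d"},
    {"title": "B", "008": "060323s2004    nyu           000 0 eng d"},
    {"title": "C", "008": "060323nuuuu    xx            000 0 eng d"},
]
print([r["title"] for r in newest_first(records)])
```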

SPELLCHECKING

Does zebra have an integrated spellcheck feature? NPL was hoping
that zebra would use variant spelling and misspelling (via a 
soundex or something) to pull up results even when the record's
form differs from the search term. If not, we should expand the
spellchecking feature LibLime wrote to be more complete.

TITLE SEARCHING

The consensus seems to be that the default relevance ranking should not
be how many times the search terms appear in the record, but whether
the title starts with or contains the search terms. So for instance,
a title search on "the civil war" should pull up all the titles with
that exact title first, followed by records that contain that phrase
in the title (even if not starting with it), followed by records where
the terms appear in some title field somewhere in the record. Contrast
that with the current behavior which puts "Voices from the Civil War"
first (I'm assuming this is because the terms "the, civil, war" appear
more times in that record than any other, or because they appear the
same number of times but that record was indexed more recently.)

One-word titles like "Tango" should come up first with a title search
for "Tango".
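
The ordering NPL described can be sketched as a set of tiers. The
Python below is only an illustration of the desired behavior, not of
anything Zebra does today: the function names and sample titles are
made up, and splitting "starts with" and "contains" into separate
tiers is an assumption about what staff meant.

```python
import re


def normalize(text):
    """Lowercase and drop punctuation so 'The Civil War.' matches the query."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def title_tier(query, title):
    """Lower tier means more relevant."""
    q, t = normalize(query), normalize(title)
    if not q:
        return 4
    if t == q:
        return 0                      # exact title
    if t.startswith(q + " "):
        return 1                      # title starts with the phrase
    if f" {q} " in f" {t} ":
        return 2                      # phrase appears inside the title
    if set(q.split()) <= set(t.split()):
        return 3                      # all terms present, any order
    return 4                          # no title match


def rank_titles(query, titles):
    # sorted() is stable, so ties keep their original order.
    return sorted(titles, key=lambda title: title_tier(query, title))


titles = [
    "Voices from the Civil War",
    "War and the civil service",
    "The Civil War: a narrative",
    "Tango",
    "The Civil War",
]
for title in rank_titles("the civil war", titles):
    print(title)
```

With these tiers "The Civil War" sorts ahead of "Voices from the Civil
War", and a one-word title like "Tango" lands in tier 0 for a search
on "Tango".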

AUTHOR SEARCHING

Again, the current relevance ranking doesn't quite cut it. A good
example is a relevance ranked author search on "James Joyce". Some
records sneak into high relevance because they have multiple authors
with names like "James Henry" and "Paul Joyce" (take "Bob the Builder"
in the NPL database as an example). The relevance ranking
should account for proximity and use that as the highest ranking
consideration to ensure that a search on "James Joyce" returns all the
books by "James Joyce" first. Also, they requested that the default
ranking secondarily sort the items by date as well because they often 
are asked to find the 'latest' book by so and so. We concluded that 
the copyright date stored in the 008 is probably the only date 
normalized enough to use for sorting though I'm not sure if zebra can 
use that for sorting.
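
A minimal Python sketch of that request: rank records where all the
query terms fall within a single author name ahead of records where
the terms are spread over several authors, then break ties by date,
newest first. The record layout, function names, and sample data
(including the years) are invented for illustration, and this says
nothing about whether Zebra can be configured to rank this way.

```python
import re


def name_words(name):
    """Split a name into a set of lowercase words, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", name.lower()))


def author_sort_key(query, record):
    terms = name_words(query)
    names = [name_words(author) for author in record["authors"]]
    same_author = any(terms <= words for words in names)
    all_present = terms <= set().union(*names)
    # Tier 0: one author matches every term; tier 1: terms are scattered
    # across different authors; tier 2: some term is missing entirely.
    tier = 0 if same_author else 1 if all_present else 2
    year = record.get("year")
    # Secondary sort: newest first, records without a year last.
    return (tier, year is None, -(year or 0))


def rank_by_author(query, records):
    return sorted(records, key=lambda record: author_sort_key(query, record))


# Invented sample records.
records = [
    {"title": "Bob the Builder", "authors": ["Henry, James", "Joyce, Paul"],
     "year": 2001},
    {"title": "Dubliners", "authors": ["Joyce, James"], "year": 1914},
    {"title": "Ulysses", "authors": ["Joyce, James"], "year": 1922},
]
print([r["title"] for r in rank_by_author("James Joyce", records)])
```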

SUBJECT SEARCHING

They seemed pleased with the way subject searching was working; it
will correctly find things like "horses--psychology" where the first
term is in 650$a and the second in $x. However, it seems not to 
rank things based on proximity within a tag -- meaning that a search
on horses--psychology will pull up records containing:

650$a horses
$x pets

650$a humans
$x psychology

and records with the actual 'horses--psychology' (650$a$x) subject
heading aren't given any favor in the ranking (I misplaced my actual 
example and the one above is one I invented).
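
To show the distinction in code, here is a small Python sketch that
favors records where every part of the heading occurs inside the same
650 field over records where the parts are split across different
650s. The data layout (each 650 as a list of subfield code/value
pairs) and the function name are assumptions made for this example,
and the sample records are the invented ones from above.

```python
def subject_tier(query, fields_650):
    """Return 0 for a same-field match, 1 for a cross-field match, else 2."""
    parts = [p.strip().lower() for p in query.split("--") if p.strip()]
    # One set of subfield values per 650 field.
    per_field = [{value.lower() for _code, value in field}
                 for field in fields_650]
    if any(all(p in values for p in parts) for values in per_field):
        return 0
    everything = set().union(*per_field)
    if all(p in everything for p in parts):
        return 1
    return 2


# Terms split across two different 650 fields.
scattered = [
    [("a", "Horses"), ("x", "Pets")],
    [("a", "Humans"), ("x", "Psychology")],
]
# The actual heading in one 650 field.
exact = [
    [("a", "Horses"), ("x", "Psychology")],
]
print(subject_tier("horses--psychology", scattered))
print(subject_tier("horses--psychology", exact))
```

Sorting results on this tier would put the real "horses--psychology"
heading ahead of the accidental match.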

SUBJECT HEADING SEARCH

NPL would like to see a demonstration of a 'Subject Heading' search
using authorities generated from the data to compile a list of
authoritative headings (which would be compiled from multiple fields
within a given subject tag such as 650$a$v$x, etc.). So I think
to do this right we'd need to look at putting our authority records
in Zebra as well.

SERIES TITLE SEARCHING

Series title should pull from series title fields, not just general
title fields. We need to compile a list of these fields and create
an index for just series titles.

KEYWORD SEARCHING

I saved this one for last on purpose. We're not really sure what
keyword searching is supposed to be :-). We suspect that patrons
expect that the keyword search is equivalent to a web search engine
search box like Google. Given our limitations in the MARC format,
what are our ranking options? Should titles be given priority,
authors, subjects? We're not really sure. However, NPL did like the
idea of offering a two-column results page for keyword searching--
one side with library items and the other with results from the web
using Google's API or something.

Well, that's it for now. NPL staff are going to continue looking at
the demo and providing me with feedback so I'll pass it along as I
receive it. Please feel free to add your own comments on the demo
as well as comment on NPL's ideas.

Cheers,

-- 
Joshua Ferraro               VENDOR SERVICES FOR OPEN-SOURCE SOFTWARE
President, Technology       migration, training, maintenance, support
LibLime                                Featuring Koha Open-Source ILS
jmf at liblime.com |Full Demos at http://liblime.com/koha |1(888)KohaILS