[Koha-zebra] Koha Zebra Searching Report (from NPL)

Thu Mar 23 04:43:40 CET 2006

Joshua Ferraro wrote:

>On Wed, Mar 22, 2006 at 08:28:26PM -0500, Sebastian Hammer wrote:
>  
>
>>Can't do XOR today. I suppose it would be a possible new feature, but 
>>I've frankly never heard of it in an ILS.. can a XOR b be mapped to
>>
>>(a OR b) NOT (a AND b) ?   or am I just showing my fading math skills to 
>>ill effect, here?
>>    
>>
>Yep, that's the correct mapping. Voyager's where NPL originally
>saw the XOR function.
>  
>
Ok. It can be faked in the front-end then, or implemented deeper in the 
guts of Zebra.
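
A minimal sketch of the front-end approach, assuming queries are sent as PQF. PQF's @not operator is a binary "and-not", so (a OR b) NOT (a AND b) can be written directly in prefix form. The helper names pqf_xor and xor_hits are hypothetical, and the operands are assumed to be valid PQF sub-queries already.

```python
def pqf_xor(a: str, b: str) -> str:
    """Build a PQF query for a XOR b as (a OR b) AND-NOT (a AND b).

    Hypothetical helper: a and b are assumed to be valid PQF sub-queries.
    """
    return f"@not @or {a} {b} @and {a} {b}"


def xor_hits(hits_a: set, hits_b: set) -> set:
    """Apply the same mapping to two result sets, to check it against a true XOR."""
    return (hits_a | hits_b) - (hits_a & hits_b)


if __name__ == "__main__":
    print(pqf_xor("@attr 1=4 cats", "@attr 1=4 dogs"))
```

The set version is only there to confirm the identity: it gives the same answer as Python's own symmetric difference (hits_a ^ hits_b).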

>>Why do you see yourself limited to Bib-1? Within Koha, you can do 
>>whatever you want -- specifically extend Bib-1 into the 8000-range 
>>(IIRC) for local USE attributes or define a private set.
>>    
>>
>Right, I was just hoping there was some way to map it to bib-1 as
>I assume that would be useful in cross-domain searching. If not we
>can certainly do a locally defined attribute or set.
>  
>
I think beyond what's in the Bath profile or the US national profile, 
you have little hope of interoperable search.. in my experience, 
cross-domain searching still entails the need to do query-mapping 
independently per target or for groups of targets with similar 
characteristics. I use the CCL parser that's available through the YAZ 
ZOOM API, and include a reference to a set of mapping directives as part 
of the configuration for each target.. that allows you to get pretty far 
towards an interoperable-feeling search with a minimum of code.
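
To illustrate the idea of per-target mapping (this is not the YAZ CCL parser, just a rough sketch of the same principle), the fragment below keeps a default table of field qualifiers and lets each target override individual entries. The target name "target-b", the override itself, and the function name map_query are made up for illustration. The attribute numbers are standard Bib-1 USE attributes (4 = title, 1003 = author, 1 = personal name, 1016 = any).

```python
# Default mapping from short field qualifiers to Bib-1 USE attributes (PQF syntax).
DEFAULT_MAP = {
    "ti": "@attr 1=4",      # title
    "au": "@attr 1=1003",   # author
    "any": "@attr 1=1016",  # any
}

# Hypothetical per-target overrides: "target-b" is assumed, for illustration,
# to support author searching only through the Personal name attribute.
TARGET_OVERRIDES = {
    "target-b": {"au": "@attr 1=1"},
}


def map_query(target: str, qualifier: str, term: str) -> str:
    """Translate one qualifier/term pair into PQF for the given target."""
    if '"' in term:
        raise ValueError("terms containing double quotes are not handled in this sketch")
    mapping = {**DEFAULT_MAP, **TARGET_OVERRIDES.get(target, {})}
    return f'{mapping[qualifier]} "{term}"'


if __name__ == "__main__":
    for target in ("target-a", "target-b"):
        print(target, map_query(target, "au", "Joyce, James"))
```

With the real CCL parser the tables would live in per-target configuration rather than in code, but the shape of the problem is the same: one user-facing query vocabulary, several target-specific translations.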

>>This would, I believe, require new development. It's possible that one 
>>of the experimental ranking algorithms that are included might provide 
>>better results for these people, but I *think* that boosting the score 
>>for one field in a ranked keyword search would require an extension to 
>>the index structure.
>>    
>>
>I've looked high and low for documentation on the ranking algorithms in
>Zebra but haven't found much more than a few sentences in the official
>docs and some list messages ...
>  
>
It isn't documented beyond what's in the code, AFAIK.

>>>AUTHOR SEARCHING
>>>
>>>Again, the current relevance ranking doesn't quite cut it. A good
>>>example is a relevance ranked author search on "James Joyce". Some
>>>records sneak into high relevance because they have multiple authors
>>>with names like "James Henry" and "Paul Joyce" (take "Bob the Builder"
>>>in the NPL database as an example).
>>>
>>>      
>>>
>>It might be worth checking whether one of the custom ranking algos did 
>>better on this.. you can look in the NEWS file for instructions on how to 
>>enable them.
>>    
>>
>Will do.
>
>>>relevance ranking
>>>should account for proximity and use that as the highest ranking
>>>consideration to ensure that a search on "James Joyce" returns all the
>>>books by "James Joyce" first. Also, they requested that the default
>>>ranking secondarily sort the items by date as well because they often 
>>>are asked to find the 'latest' book by so and so. We concluded that 
>>>the copyright date stored in the 008 is probably the only date 
>>>normalized enough to use for sorting though I'm not sure if zebra can 
>>>use that for sorting.
>>>
>>>
>>>      
>>>
>>It could with the XSLT index rules of Zebra 1.4.
>>    
>>
>Cool, and are there docs on that somewhere? :-)
>  
>
There will be by the time Zebra 1.4 is released. For now, it's 
pre-release stuff. However, the CVS version of Zebra contains an example 
setup under examples/alvis-oai/conf. I think for really gnarly indexing 
schemes, this is probably the wave of the future, since it's pretty much 
infinitely flexible. It should also be pretty easy to perl-map one of 
the existing ABS files into this format.

>>Same thing. I don't know how hard it would be to add a score for 
>>proximity.. that data is at least in the index structure, but I've no 
>>idea how hard it would be to fit into the code. We can ask the Zebra 
>>wranglers what it would entail if you're interested.
>>    
>>
>Yes, please do, we're very interested in that particular one.
>  
>
Ok.
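
For discussion purposes, here is a toy sketch of the behaviour NPL is asking for. It is not Zebra's ranking code and does not reflect how Zebra stores positions; the record layout (a dict with "authors" and "year") and the function names are assumptions. A record scores highest when all query terms fall inside a single author heading, lower when they are only spread across several headings, and ties are broken by newest date first.

```python
def author_score(query: str, authors: list[str]) -> int:
    """Toy score: 2 if one heading contains every query term,
    1 if the terms only appear across different headings, 0 otherwise."""
    terms = query.lower().split()
    headings = [set(a.lower().replace(",", " ").split()) for a in authors]
    if any(all(t in h for t in terms) for h in headings):
        return 2
    all_words = set().union(*headings) if headings else set()
    if all(t in all_words for t in terms):
        return 1
    return 0


def rank(records: list[dict], query: str) -> list[dict]:
    """Sort by the toy score first, then by year with the newest first."""
    return sorted(
        records,
        key=lambda r: (-author_score(query, r["authors"]), -r["year"]),
    )
```

A record whose authors are "Henry, James" and "Joyce, Paul" still matches "James Joyce", but it sorts after every record that has "Joyce, James" as a single heading.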

>>>SUBJECT HEADING SEARCH
>>>
>>>NPL would like to see a demonstration of a 'Subject Heading' search
>>>using authorities generated from the data to compile a list of
>>>authoritative headings (which would be compiled from multiple fields
>>>within a given subject tag such as $650$a$v$x, etc.). So I think 
>>>to do this right we'd need to look at putting our authority records
>>>in Zebra as well.
>>>
>>>      
>>>
>>Hmm. Not sure I fully grok the requirement here.. you seem to suggest 
>>both constructing a specific index key based on a concatenation of 
>>multiple fields (easy in the XSLT indexing rules of 1.4, not compatible 
>>with the 'melm' directive.
>>    
>>
>I'm unclear about the differences between 'elm' and 'melm'. The docs
>seem to indicate that they are the same...
>  
>
They are actually described as being quite different, but I can see how 
the nature of the difference could be more clear.

The 'elm' directive is the original.. its parameter structure is based 
on the way that Z39.50 abstract record models were typically represented 
in the old days.. hence the weird ordering of elements, etc. It also has 
the limitation that you can't address attributes, because the old Z39.50 
record model didn't have attributes.  The xelm directive was introduced 
to fix that.. it allows you to express tag paths in the XPATH style, and 
to address attributes, either in [predicates] or directly, for indexing.

The usmarc.abs file that comes with Zebra assumes that records were 
ingested in ISO2709 using the record type grs.marc.<absfilename>. The 
grs.marc input filter actually generates an internal abstract structure 
which is incompatible with MARCXML.. it looks more like 
<245><11><a>content</a></11></245>. When MARCXML came along it became 
clear that it'd be nicer to work with that.. so the grs.marcxml input 
filter was introduced to parse ISO2709 records and map them internally 
to MARCXML. Of course, if you're starting with MARCXML, you can just use 
grs.xml with the same effect.

But now the old usmarc.abs file won't work anymore, because MARCXML is 
all about attributes for field names and subfield codes, and the 'elm' 
directive can't handle that... in fact, to index 245$a, you'd have to 
write something like

xelm /*/datafield[@tag=245]/subfield[@code=a]     title

At some point, we got a bit of money from the LoC to develop a simple 
set of Bath level 0 indexing rules for Zebra.. I started working on 
that, but got so fed up with the syntax above that I rebelled and 
implemented the 'melm' directive (and it takes a lot for me to touch the 
innards of Zebra, in my old days), so instead of the above, I could write

melm 245$a  title

Which is totally equivalent to the above, but nice and to the point.. 
however, none of these mechanisms allows you to construct phrase indexes 
that span multiple subfields.. and they don't allow you to do cool stuff 
like extract a date from the guts of 008... in fact, there are lots of 
situations where you'd like to do some form of massaging on the input 
before processing. In the past, I would sometimes translate MARC records 
to an ASCII-line based format, and use the magic of the regexp input 
filters (http://www.indexdata.com/zebra/doc/record-model.tkl#id2530050) 
to massage the data at index/retrieval time... because I can write Tcl 
code in the input filters to do stuff to the data, the sky is the 
limit.. but, because I have to write Tcl code to accomplish anything, I 
become sad and gray-haired.  So when I build applications on Zebra these 
days, I am more likely to do some form of preprocessing of the records 
in Perl or similar BEFORE feeding them to Zebra.. not very satisfying, 
but it brings home the bacon.
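
As a rough sketch of that kind of preprocessing step (in Python rather than Perl, purely for illustration), the two helpers below cover the cases mentioned above: pulling a date out of the 008 and building one phrase key from several subfields. Date 1 occupies character positions 07-10 of the MARC 21 008 field. The function names, the list-of-tuples subfield layout, and the " -- " separator are assumptions, not anything Zebra prescribes.

```python
from typing import Optional


def date1_from_008(field_008: str) -> Optional[str]:
    """Return Date 1 (positions 07-10 of the 008), or None if it is not four digits."""
    date1 = field_008[7:11]
    if len(date1) == 4 and date1.isascii() and date1.isdigit():
        return date1
    return None


def subject_phrase(subfields, codes=("a", "v", "x", "y", "z"), sep=" -- "):
    """Join the selected subfields of ONE subject field into a single phrase key.

    subfields is a list of (code, value) pairs in record order.
    """
    return sep.join(value.strip() for code, value in subfields if code in codes)


if __name__ == "__main__":
    print(date1_from_008("850101s1922    nyu           000 1 eng d"))
    print(subject_phrase([("a", "Dublin (Ireland)"), ("v", "Fiction.")]))
```

Partially known dates such as "19uu" come back as None here; whether to drop them or normalize them for sorting is a policy decision this sketch leaves open.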

Well, in Zebra 1.4, XSLT comes to the rescue, in a way that only XSLT 
can do it, with lots of angle brackets and much verbosity.... for 
instance, in an XSLT index filter,

melm 245$a title:w

becomes

<xsl:template 
match="marc:record/marc:datafield[@tag='245']/marc:subfield[@code='a']">
  <z:index name="title" type="w">
    <xsl:value-of select="."/>
  </z:index>
</xsl:template>

Eek.

But of course the magic of that is that you could put just about 
anything you could possibly imagine instead of that simple 
<xsl:value-of> in the middle... using substring() to extract a date from 
008, a code from the leader, combining subfields, doing math, looking 
stuff up in supporting tables, etc... the sky is the limit, and I'd 
prefer this to programming in Tcl anytime. And of course, if you want a 
more compact configuration file, you could write something like

<koha:melm field="245$a" index="title:w"/>

and use XSLT to map that into the diatribe above before sending it to 
Zebra.. we might even offer some options like that as part of the 
software down the road. In addition to the stylesheet which maps records 
to 'index documents' like above, Zebra 1.4 can be configured to support 
multiple retrieval schemas (i.e. DC, MODS, MARCXML), simply by providing 
stylesheets for each desired schema -- the translation is done on the 
fly when records are retrieved.
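
The compact-to-verbose mapping could equally be done with a small script. The sketch below (Python here, though the same is easy in Perl or in XSLT itself) expands a 'melm'-style line into the template shown earlier. The function name melm_to_xslt is hypothetical, only the simple tag$subfield form is handled, and the xsl:, z: and marc: prefixes are assumed to be declared in the enclosing stylesheet.

```python
from xml.sax.saxutils import quoteattr


def melm_to_xslt(directive: str) -> str:
    """Expand e.g. 'melm 245$a title:w' into an xsl:template for an index filter.

    Only the simple 'melm TAG$CODE name[:type]' form is handled in this sketch.
    """
    keyword, field, index = directive.split()
    if keyword != "melm":
        raise ValueError(f"not a melm directive: {directive!r}")
    tag, code = field.split("$")
    name, _, index_type = index.partition(":")
    match = (
        f"marc:record/marc:datafield[@tag='{tag}']"
        f"/marc:subfield[@code='{code}']"
    )
    type_attr = f" type={quoteattr(index_type)}" if index_type else ""
    return (
        f'<xsl:template match="{match}">\n'
        f"  <z:index name={quoteattr(name)}{type_attr}>\n"
        '    <xsl:value-of select="."/>\n'
        "  </z:index>\n"
        "</xsl:template>"
    )


if __name__ == "__main__":
    print(melm_to_xslt("melm 245$a title:w"))
```

Running a script like this over an existing ABS file would give a starting point for an XSLT index filter, to be extended by hand wherever more than a plain value-of is needed.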

--Sebastian

>Thanks!
>
>  
>

-- 
Sebastian Hammer, Index Data
quinn at indexdata.com   www.indexdata.com
Ph: (603) 209-6853