[Koha-devel] Zebra hyphen truncating search

Wed Feb 4 02:51:23 CET 2015

Hi Colin:

That seems strange to me, since I'm bypassing Koha's Perl interface entirely
and just using Zebra via yaz-client (obviously with the config we have for
Zebra in Koha). 

I've been doing some research into ICU and Indexdata's treatment of it, and
I'm thinking it's an issue with the tokenization in newer versions of Zebra
(or rather the ICU utility in YAZ).

> It would be the current releases from indexdata on Debian and also built
> from source on Fedora. The indexdata debian packages track the source
> closely.

I figured that it would be a case of using recent versions of Zebra/YAZ from
Indexdata...

So, Zebra gained support for "Unicode-based indexing using ICU" in version
2.0.20 in 2007. "The implementation is based on the ICU utility part of YAZ
3.0.16 and later." (http://www.indexdata.com/zebra/doc/NEWS). 

I'm guessing that something changed between releases, since I can reproduce
the problem with Zebra 2.0.59 but not with 2.0.47. I'm wondering if it has to
do with the introduction of the "join" element in YAZ 4.2.49
(http://www.indexdata.com/yaz/doc/yaz-icu.html). When you use "join", it
creates one string from your tokens. If you don't use "join", it seems to
just use the first token.
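
To make that hypothesis concrete, here is a toy model of it in Python. This
is not YAZ code: tokenize() and index_term() are made-up names, the
hyphen-splitting tokenizer is an assumption, and the whole thing only mimics
the behaviour I think I'm seeing.

```python
import re


def tokenize(term):
    """Hypothetical tokenizer: split on hyphens and whitespace, drop empties."""
    return [tok for tok in re.split(r"[-\s]+", term) if tok]


def index_term(term, join=False, sep=""):
    """Toy model of the suspected behaviour, NOT the real YAZ ICU chain.

    With join=True all tokens are glued together using `sep`.
    With join=False only the first token survives, which is what the
    truncation at the hyphen looks like from the outside.
    """
    tokens = tokenize(term)
    if not tokens:
        return ""
    if join:
        return sep.join(tokens)
    return tokens[0]


if __name__ == "__main__":
    print(index_term("e-book"))             # first token only
    print(index_term("e-book", join=True))  # tokens joined into one string
```

If the real chain behaves like the join=False branch, that would explain why
a search term appears to be cut off at the hyphen.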

Anyway, I think I'm going to try to get in touch with Indexdata about this
one, as it really does look like a bug in the YAZ ICU utility.

In the meantime, I'm going to try some creative
transformations/transliterations, but I might have to code around this one,
probably just by removing all hyphens from search terms.
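
If it does come to coding around it, the hyphen stripping could look
something like the sketch below. It's Python rather than Koha's Perl, purely
for illustration, and strip_hyphens() is a made-up name, not anything in
Koha. It assumes that dropping every character in the Unicode "Pd" (dash
punctuation) general category is acceptable, which covers the ASCII hyphen
as well as the en and em dashes.

```python
import unicodedata


def strip_hyphens(term):
    """Remove every dash-like character (Unicode category Pd) from a term.

    Hypothetical helper: assumes the search term is already a decoded
    Unicode string and that hyphens carry no meaning worth keeping.
    """
    return "".join(ch for ch in term if unicodedata.category(ch) != "Pd")


if __name__ == "__main__":
    print(strip_hyphens("e-book"))
```

Whether the stripped term then matches what is actually in the index would
still need checking against a real Zebra database.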

Further reading for folks curious about ICU:

This lets you interactively try out ICU transform/transliteration rules, so
that's pretty cool.
http://demo.icu-project.org/icu-bin/translit

A list of properties used in Unicode Regular Expressions, which ICU uses:
http://www.unicode.org/reports/tr18/#General_Category_Property

Some documentation and examples on ICU Transformation Rules:
http://userguide.icu-project.org/transforms/general/rules

David Cook
Systems Librarian
Prosentient Systems
72/330 Wattle St, Ultimo, NSW 2007

> -----Original Message-----
> From: Colin Campbell [mailto:colin.campbell at ptfs-europe.com]
> Sent: Tuesday, 3 February 2015 8:03 PM
> To: David Cook
> Cc: 'Colin Campbell'; koha-devel at lists.koha-community.org
> Subject: Re: Zebra hyphen truncating search
> 
> On Tue, Feb 03, 2015 at 05:16:05PM +1100, David Cook wrote:
> > Hi Colin!
> >
> >
> >
> > I stumbled across an email you sent to the Indexdata Zebralist in 2013:
> > http://lists.indexdata.dk/pipermail/zebralist/2013-August/002576.html
> >
> >
> >
> > Were you ever able to solve those problems?
> 
> I think it's still the case. It can be further complicated by what the
> settings in the system are and how Search.pm interprets them in
> constructing the search. Especially the relevancy options.
> >
> > Also, what version of Zebra were you running? I've noticed this
> > problem with Zebra using 2.0.59, but I haven't been able to produce it
> > using Zebra 2.0.47. I had the exact same configuration files, MySQL
> > database, and Zebra indexes. Here's an example:
> It would be the current releases from indexdata on Debian and also built
> from source on Fedora. The indexdata debian packages track the source
> closely.
> 
> Cheers
> Colin
> 
> --
> Colin Campbell
> Chief Software Engineer,
> PTFS Europe Limited
> Content Management and Library Solutions
> +44 (0) 800 756 6803 (phone)
> +44 (0) 7759 633626  (mobile)
> colin.campbell at ptfs-europe.com
> skype: colin_campbell2
> 
> http://www.ptfs-europe.com