[Koha-bugs] [Bug 15555] Index 024$a into Identifier-other:u url register when source $2 is uri

Tue Jan 12 06:43:48 CET 2016

http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=15555

--- Comment #3 from David Cook <dcook at prosentient.com.au> ---
NOTE: The output I've provided in the above comments came from a Koha instance
using CHR indexing.

ICU indexing has different output, but the same behaviour.

Here's what I see with "format xml" and "elements zebra::index":

<index name="Identifier-other" type="w" seq="21"></index>
<index name="Identifier-other" type="w" seq="1"></index>
<index name="Identifier-other" type="w" seq="22"></index>
<index name="Identifier-other" type="w" seq="23"></index>
<index name="Identifier-other" type="w" seq="24"></index>
<index name="Identifier-other" type="w" seq="25"></index>
<index name="Identifier-other" type="p" seq="21"></index>
<index name="Identifier-other" type="p" seq="22"></index>
<index name="Identifier-other" type="p" seq="23"></index>
<index name="Identifier-other" type="p" seq="24"></index>
<index name="Identifier-other" type="p" seq="25"></index>
<index name="Identifier-other" type="u" seq="26">http://libris.kb.se/resource/bib/219553</index>

Here's what I see with "format xml" and "elements index":

<z:index name="Identifier-other:w Identifier-other:p">http://libris.kb.se/resource/bib/219553</z:index>
<z:index name="Identifier-other:u">http://libris.kb.se/resource/bib/219553</z:index>

However, the "elements index" output is misleading. It's basically just the
output of "xsltproc biblio-zebra-indexdefs.xsl <record>". It shows the data
before normalization, so it's essentially meaningless for working out what
actually ended up in the index.

-----------

Only advanced users will look at yaz-client though.

That being said, there are functional differences between ICU and CHR.

For instance, the following query will work in CHR but NOT in ICU:
id-other,phr=http libris kb se resource bib 219553

Conversely, the following query will work in ICU but not in CHR:
id-other,phr=http libriskbse resource bib 219553

That's because we've configured tokenization and normalization to work
differently between the two schemes. Fun, right?

ICU uses the "l" (line break) tokenize rule from
http://www.indexdata.com/yaz/doc/yaz-icu.html. That means it tokenizes on
slashes, spaces, and maybe some other characters I haven't discovered yet. You
can verify that with the following commands:

echo "THIS IS A TEST" | yaz-icu -x -c ./etc/zebradb/etc/phrases-icu.xml
echo "THIS/IS/A/TEST" | yaz-icu -x -c ./etc/zebradb/etc/phrases-icu.xml

Indeed, check out the following yaz-client output when using ICU:

Z> f id-other,phr=http://libris.kb.se/resource/bib/219553
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 12, setno 17
SearchResult-1: term=http cnt=12, term=libriskbse cnt=12, term=resource cnt=12, term=bib cnt=12, term=219553 cnt=12
records returned: 0
Elapsed: 0.001458

You can see the URL has been broken into 5 terms/tokens with ICU, while CHR
does the following:

Z> f id-other,phr=http://libris.kb.se/resource/bib/219553
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 12, setno 13
SearchResult-1: term=http cnt=12, term=libris cnt=12, term=kb cnt=12, term=se cnt=12, term=resource cnt=12, term=bib cnt=12, term=219553 cnt=12
records returned: 0
Elapsed: 0.001119

We actually have 7 terms/tokens in the case of CHR!
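
To make the difference concrete, here is a toy Python model of the two schemes.
It is only a sketch that reproduces the token lists in the yaz-client output
above; it is not how Zebra or ICU actually implement tokenization. The function
names chr_tokens and icu_tokens and both regexes are assumptions made for
illustration: the CHR model treats every run of letters/digits as a token, while
the ICU model breaks on whitespace and slashes and then strips the remaining
punctuation from each piece.

```python
import re

URL = "http://libris.kb.se/resource/bib/219553"


def chr_tokens(text):
    """Toy CHR model (assumption): every run of ASCII letters/digits is a token."""
    return re.findall(r"[a-z0-9]+", text.lower())


def icu_tokens(text):
    """Toy ICU model (assumption): break on whitespace and slashes,
    then drop any remaining punctuation inside each piece."""
    pieces = re.split(r"[\s/]+", text.lower())
    cleaned = (re.sub(r"[^a-z0-9]", "", piece) for piece in pieces)
    return [token for token in cleaned if token]


if __name__ == "__main__":
    print(len(chr_tokens(URL)), chr_tokens(URL))
    print(len(icu_tokens(URL)), icu_tokens(URL))
```

For the URL from this bug, the first line lists 7 tokens and the second lists 5,
matching the two SearchResult-1 lines above.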

And that means that we don't want to try to outsmart Zebra by pre-normalizing
our queries. We want to query Zebra with the exact same data that it indexed,
because that way the normalization will be the same! Science!
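
The same point as a small Python sketch. It uses a toy model of the two schemes
(regexes chosen only to reproduce the token lists shown earlier in this comment,
not Zebra's real rules) and treats a phrase search as "the query tokens appear
as a contiguous run in the indexed tokens". All names here (TOKENIZERS,
phrase_match, INDEXED) are hypothetical.

```python
import re

# Toy tokenizers (assumptions): they only mimic the token lists reported by
# yaz-client in this comment, not Zebra's actual CHR/ICU configuration.
TOKENIZERS = {
    "chr": lambda text: re.findall(r"[a-z0-9]+", text.lower()),
    "icu": lambda text: [
        token
        for token in (re.sub(r"[^a-z0-9]", "", piece)
                      for piece in re.split(r"[\s/]+", text.lower()))
        if token
    ],
}

INDEXED = "http://libris.kb.se/resource/bib/219553"


def phrase_match(scheme, query, indexed=INDEXED):
    """True if the query tokens form a contiguous run of the indexed tokens,
    with both sides tokenized by the same scheme."""
    tokenize = TOKENIZERS[scheme]
    doc, terms = tokenize(indexed), tokenize(query)
    if not terms:
        return False
    return any(doc[i:i + len(terms)] == terms
               for i in range(len(doc) - len(terms) + 1))


if __name__ == "__main__":
    queries = [
        INDEXED,                                  # raw data, as indexed
        "http libris kb se resource bib 219553",  # pre-normalized for CHR
        "http libriskbse resource bib 219553",    # pre-normalized for ICU
    ]
    for query in queries:
        print(query, {s: phrase_match(s, query) for s in TOKENIZERS})
```

In this model only the raw URL matches under both schemes; each hand-normalized
query matches under exactly one of them.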
