[Koha-bugs] [Bug 15541] Create additional normalizers for Record Matching Rules

Tue Jan 12 06:52:06 CET 2016

http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=15541

--- Comment #1 from David Cook <dcook at prosentient.com.au> ---
Here are my latest findings:

Input: "http://libris.kb.se/resource/bib/219553" 

C4::Matcher::_normalize() = "HTTPLIBRISKBSERESOURCEBIB219553"
Zebra CHR = "http libris kb se resource bib 219553"
Zebra ICU = "http libriskbse resource bib 219553"

It seems to me that the smartest thing to do is NOT to normalize with
C4::Matcher::_normalize(), because its output matches neither of the forms
that Zebra indexes, as shown above.
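
To make the mismatch concrete, here is a rough sketch in Python (not Koha's
Perl). Both functions are hypothetical approximations written only to
reproduce the outputs quoted above; they are not the real
C4::Matcher::_normalize() or Zebra's actual CHR rules.

```python
import re

URL = "http://libris.kb.se/resource/bib/219553"

def approx_matcher_normalize(value: str) -> str:
    """Hypothetical stand-in for C4::Matcher::_normalize():
    uppercase, drop punctuation, collapse whitespace."""
    value = re.sub(r"[^A-Z0-9\s]", "", value.upper())
    return re.sub(r"\s+", " ", value).strip()

def approx_chr_phrase(value: str) -> str:
    """Hypothetical stand-in for Zebra CHR phrase normalization:
    lowercase, with each run of punctuation turned into one space."""
    return re.sub(r"[^a-z0-9]+", " ", value.lower()).strip()

matcher_key = approx_matcher_normalize(URL)
chr_key = approx_chr_phrase(URL)

# Even ignoring case, the two keys differ, because one side drops the
# separators and the other turns them into spaces.
keys_agree = matcher_key.lower() == chr_key
```

Under these assumptions keys_agree is false: the matcher produces one long
token while the index holds seven space-separated ones.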

Zebra indexes "http://libris.kb.se/resource/bib/219553" as "http libris kb se
resource bib 219553" (CHR Phrase), as "http libriskbse resource bib 219553"
(ICU Phrase), or as "http://libris.kb.se/resource/bib/219553" (URL, which uses
a charmap whether indexing with CHR or ICU).

If we query Zebra with "http://libris.kb.se/resource/bib/219553", it will
normalize the query the same way that it normalized
"http://libris.kb.se/resource/bib/219553" when it was originally indexing it,
and we'll get a match.

Of course, we can't necessarily stop using C4::Matcher::_normalize() as it's
the default behaviour. Many people may count on that _normalize() without even
knowing it... even if it's potentially working badly.

I think what I want to do is create a new normalizer which does nothing, and
call it "none" or "raw". 

That way, I'm passing to Zebra the same thing that it's seen before, and it
will normalize it exactly the same way and the likelihood of an accurate match
increases considerably.
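
A minimal sketch of the idea, again in Python rather than Koha's Perl. The
registry, the function names and the "default" key are all assumptions made
for illustration; only the "none"/"raw" names come from the proposal above.

```python
import re
from typing import Callable, Dict

def legacy_normalize(value: str) -> str:
    """Hypothetical approximation of the existing default behaviour."""
    value = re.sub(r"[^A-Z0-9\s]", "", value.upper())
    return re.sub(r"\s+", " ", value).strip()

def passthrough(value: str) -> str:
    """The proposed do-nothing normalizer: return the value untouched."""
    return value

# Hypothetical registry: the default stays as it is, so existing
# matching rules keep their behaviour.
NORMALIZERS: Dict[str, Callable[[str], str]] = {
    "default": legacy_normalize,
    "none": passthrough,
    "raw": passthrough,
}

def build_match_key(value: str, normalizer: str = "default") -> str:
    """Apply the chosen normalizer before the value is sent to Zebra."""
    return NORMALIZERS[normalizer](value)

url = "http://libris.kb.se/resource/bib/219553"
raw_key = build_match_key(url, "raw")   # sent to Zebra unchanged
default_key = build_match_key(url)      # legacy behaviour preserved
```

With "raw", the query string is identical to what was indexed, so whichever
normalization Zebra applies (CHR, ICU or the URL charmap) is applied the same
way on both sides.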
