[Koha-bugs] [Bug 15541] New: Prevent normalization during matching/import process

Mon Jan 11 07:15:07 CET 2016

http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=15541

            Bug ID: 15541
           Summary: Prevent normalization during matching/import process
 Change sponsored?: ---
           Product: Koha
           Version: master
          Hardware: All
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P5 - low
         Component: MARC Bibliographic record staging/import
          Assignee: gmcharlt at gmail.com
          Reporter: dcook at prosentient.com.au
        QA Contact: testopia at bugs.koha-community.org

By default, when you're using a matching rule during an import, the data from
the incoming record for a match point is normalized using the following
substitutions:

$value =~ s/[.;:,\]\[\)\(\/'"]//g;
$value =~ s/^\s+//;
$value =~ s/\s+$//;
$value =~ s/\s+/ /g;

The first line removes all sorts of punctuation and other marks.

The second removes leading spaces.

The third removes trailing spaces.

The fourth converts multiple spaces into a single space.
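
To make the combined effect concrete, here is a minimal sketch that wraps the
four substitutions in a helper. The sub name normalize_match_value is
hypothetical, and the uc() call is an assumption inferred from the upper-cased
example given later in this report; it is not part of the quoted code.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the four substitutions quoted above.
sub normalize_match_value {
    my ($value) = @_;
    $value = uc $value;                   # assumption: not shown in the report
    $value =~ s/[.;:,\]\[\)\(\/'"]//g;    # strip punctuation and other marks
    $value =~ s/^\s+//;                   # strip leading whitespace
    $value =~ s/\s+$//;                   # strip trailing whitespace
    $value =~ s/\s+/ /g;                  # collapse runs of whitespace
    return $value;
}

# Prints HTTPLIBRISKBSERESOURCEBIB219553
print normalize_match_value('http://libris.kb.se/resource/bib/219553'), "\n";
```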

--

While this might work well in many cases, it's a problem when you're trying to
match using a URL.

During matching, using CHR Zebra indexing, the following occurs:
"http://libris.kb.se/resource/bib/219553" becomes
"HTTPLIBRISKBSERESOURCEBIB219553"

If you're using a URL index, it's stored in Zebra as
"http://libris.kb.se/resource/bib/219553", so no match is made.

If you're using a Phrase index, it's stored in Zebra as
"http libris kb se resource bib 219553".

If you're using a Word index, it's tokenized so that each alphanumeric
segment between punctuation marks is indexed separately.

--

The solution seems to be either to prevent normalization altogether, or to
"replace punctuation with spaces" in accordance with the default
word-phrase-utf.chr file. I'd have to review what would need to be done for ICU
indexing...
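
A rough sketch of the "replace punctuation with spaces" option might look like
the following. The sub name normalize_with_spaces is hypothetical, it reuses
the character class from the existing regex rather than reading the .chr file,
and it leaves case untouched; treat it as an illustration only.

```perl
use strict;
use warnings;

# Hypothetical variant: punctuation becomes a space instead of being removed.
sub normalize_with_spaces {
    my ($value) = @_;
    $value =~ s/[.;:,\]\[\)\(\/'"]/ /g;   # replace punctuation with a space
    $value =~ s/^\s+//;                   # strip leading whitespace
    $value =~ s/\s+$//;                   # strip trailing whitespace
    $value =~ s/\s+/ /g;                  # collapse runs of whitespace
    return $value;
}

# Prints http libris kb se resource bib 219553
print normalize_with_spaces('http://libris.kb.se/resource/bib/219553'), "\n";
```

With this variant the URL comes out in the same form as the Phrase index
example above.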

--

An additional problem is C4::Search::SimpleSearch, as it converts all colons
(i.e. ":") into equals signs (i.e. "="). This is a problem when you're
searching for a URL, as the URL is stored as
"http://libris.kb.se/resource/bib/219553" but you're searching for
"http=//libris.kb.se/resource/bib/219553". No match will be made.

In this case, I think we could pass an additional flag to
C4::Search::SimpleSearch asking it not to normalize the query. Since it's a
simple search, we are often able to make the CCL qualifier choices before
passing in the query argument, so this seems reasonable.
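
As a sketch of how such a flag could behave: the sub prepare_query below is a
hypothetical stand-in, not Koha's actual code, and the option name
skip_normalize is an assumption chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the query preparation step in a simple search.
sub prepare_query {
    my ($query, %options) = @_;
    # Assumed option name: when set, the colon-to-equals conversion is skipped.
    $query =~ s/:/=/g unless $options{skip_normalize};
    return $query;
}

my $url = 'http://libris.kb.se/resource/bib/219553';

# Default behaviour, prints http=//libris.kb.se/resource/bib/219553
print prepare_query($url), "\n";

# With the flag, prints http://libris.kb.se/resource/bib/219553
print prepare_query($url, skip_normalize => 1), "\n";
```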
