[Koha-bugs] [Bug 13064] New: Indexing problem with ICU on control characters

bugzilla-daemon at bugs.koha-community.org bugzilla-daemon at bugs.koha-community.org
Fri Oct 10 15:02:53 CEST 2014


http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=13064

            Bug ID: 13064
           Summary: Indexing problem with ICU on control characters
 Change sponsored?: ---
           Product: Koha
           Version: master
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P5 - low
         Component: Searching
          Assignee: gmcharlt at gmail.com
          Reporter: fridolyn.somers at biblibre.com
        QA Contact: testopia at bugs.koha-community.org

The ICU configuration files contain a rule that removes control characters:
  <transform rule="[:Control:] Any-Remove"/>
This rule is applied before tokenization.

The problem is that the "[:Control:]" character class includes line feed,
carriage return and tab. See http://www.regular-expressions.info/posixbrackets.html.
So when a field spanning several lines is indexed, the last word of each line
is joined with the first word of the next line. Those words are then not
searchable.

For example:
  First line
  Second line
This will become "First lineSecond line", tokenized as "First", "lineSecond"
and "line".
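
The effect can be reproduced outside of Zebra/ICU with a minimal Python
sketch. It is only an approximation: it assumes that "[:Control:]" corresponds
to the Unicode general category Cc, and it uses plain whitespace splitting as
a stand-in for the real ICU tokenizer. The helper names (remove_control,
replace_control_with_space, tokenize) are hypothetical and exist only for
this illustration.

```python
import unicodedata


def is_control(ch):
    # Assumption: ICU's [:Control:] matches Unicode general category "Cc",
    # which includes tab (U+0009), line feed (U+000A) and carriage return (U+000D).
    return unicodedata.category(ch) == "Cc"


def remove_control(text):
    # Mimics the effect of "[:Control:] Any-Remove": control characters are dropped.
    return "".join(ch for ch in text if not is_control(ch))


def replace_control_with_space(text):
    # One possible alternative (not necessarily the fix adopted for this bug):
    # turn control characters into spaces so word boundaries survive.
    return "".join(" " if is_control(ch) else ch for ch in text)


def tokenize(text):
    # Simplified stand-in for the ICU tokenizer: split on whitespace.
    return text.split()


field = "First line\nSecond line"

removed = remove_control(field)
removed_tokens = tokenize(removed)

spaced = replace_control_with_space(field)
spaced_tokens = tokenize(spaced)

print(repr(removed), removed_tokens)
print(repr(spaced), spaced_tokens)
```

With removal, the line feed disappears and "line" and "Second" are fused into
a single token; with replacement by a space, all four words stay separate.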

-- 
You are receiving this mail because:
You are watching all bug changes.