[Koha-bugs] [Bug 13064] Indexing problem with ICU on control characters

bugzilla-daemon at bugs.koha-community.org
Fri Oct 24 16:13:51 CEST 2014


http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=13064

Kyle M Hall <kyle at bywatersolutions.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #32296|0                           |1
        is obsolete|                            |

--- Comment #3 from Kyle M Hall <kyle at bywatersolutions.com> ---
Created attachment 32676
  -->
http://bugs.koha-community.org/bugzilla3/attachment.cgi?id=32676&action=edit
[PASSED QA] Bug 13064 - Indexing problem with ICU on control characters

The ICU configuration files contain a rule that removes control characters:
  <transform rule="[:Control:] Any-Remove"/>
This rule is applied before tokenization.

The problem is that the "[:Control:]" character class includes line feed,
carriage return and tab. See http://www.regular-expressions.info/posixbrackets.html.
So when a value spanning several lines is indexed, the last word of each line
is joined to the first word of the next line. Those words are then not
searchable.

For example:
  First line
  Second line
This becomes "First lineSecond line", which is tokenized as "First",
"lineSecond" and "line".
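
The effect in the example above can be reproduced outside Zebra with a small
Python sketch. It is only an approximation: strip_control() and tokenize() are
hypothetical helpers, Unicode general category "Cc" is assumed to match what
"[:Control:]" selects, and whitespace splitting stands in for the real ICU
tokenizer.

```python
import unicodedata


def strip_control(text):
    """Drop every character whose Unicode general category is Cc (Control)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


def tokenize(text):
    """Rough stand-in for the tokenizer: split on whitespace."""
    return text.split()


# Line feed, carriage return and tab all belong to category Cc.
for ch in "\n\r\t":
    print(repr(ch), unicodedata.category(ch))

# Removing control characters before tokenizing glues the two lines together.
field = "First line\nSecond line"
print(tokenize(strip_control(field)))
```

Because the line feed is deleted rather than treated as a separator, "Second"
never appears as a token of its own.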

Test plan:
- Use ICU in Zebra configuration
- Choose an indexed field, like 300$a
- Create a new record
- Enter several lines in the chosen field, such as:
  First line
  Second line
- Index this record
=> Without the patch, a search on "Second" does not return the record
=> With the patch, a search on "Second" returns the record
- Repeat the same tests with a tab and a carriage return instead of a line feed
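
The expected outcomes of the test plan can be illustrated with a rough Python
simulation. This is a sketch under assumptions, not the actual patch: the
"patched" branch assumes control characters end up acting as word separators
(here, replaced by a space), and index_tokens() and search() are hypothetical
names that do not exist in Koha or Zebra.

```python
import unicodedata


def index_tokens(text, patched):
    """Simulate indexing a field value.

    patched=False: control characters (category Cc) are removed outright.
    patched=True:  ASSUMPTION, control characters are replaced by a space.
    """
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cc":
            if patched:
                out.append(" ")
        else:
            out.append(ch)
    return "".join(out).lower().split()


def search(term, text, patched):
    """Return True if the term matches a whole indexed token."""
    return term.lower() in index_tokens(text, patched)


# Same check for line feed, carriage return and tab, as in the test plan.
for sep in ("\n", "\r", "\t"):
    value = "First line" + sep + "Second line"
    print(repr(sep),
          "without patch:", search("Second", value, patched=False),
          "with patch:", search("Second", value, patched=True))
```

For each of the three separators, the unpatched simulation fails to find
"Second" while the assumed patched behaviour finds it.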

Signed-off-by: Chris Cormack <chris at bigballofwax.co.nz>

Signed-off-by: Kyle M Hall <kyle at bywatersolutions.com>

-- 
You are receiving this mail because:
You are watching all bug changes.