[Koha-bugs] [Bug 13064] New: Indexing problem with ICU on control characters
bugzilla-daemon at bugs.koha-community.org
Fri Oct 10 15:02:53 CEST 2014
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=13064
Bug ID: 13064
Summary: Indexing problem with ICU on control characters
Change sponsored?: ---
Product: Koha
Version: master
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: P5 - low
Component: Searching
Assignee: gmcharlt at gmail.com
Reporter: fridolyn.somers at biblibre.com
QA Contact: testopia at bugs.koha-community.org
The ICU configuration files contain a rule to remove control characters:
<transform rule="[:Control:] Any-Remove"/>
This rule is applied before tokenization.
The problem is that the "[:Control:]" character class includes line feed,
carriage return and tab. See http://www.regular-expressions.info/posixbrackets.html.
So when several lines are indexed, the last word of each line is joined to the
first word of the next line. Those words are then not searchable.
For example:
First line
Second line
This will become "First lineSecond line", tokenized as "First", "lineSecond"
and "line".
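
The effect can be reproduced outside Zebra with a minimal Python sketch. It is
an illustration only, not the ICU chain itself: it assumes "[:Control:]"
corresponds to Unicode general category Cc, and the helper names
(strip_controls, controls_to_space, tokenize) are hypothetical. The second
helper shows one possible alternative (mapping controls to a space) purely for
contrast; it is not a fix proposed in this report.

```python
import unicodedata


def strip_controls(text):
    # Mimics "[:Control:] Any-Remove": drop every character whose Unicode
    # general category is Cc. Line feed, carriage return and tab are all Cc.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


def controls_to_space(text):
    # Hypothetical alternative for comparison: turn control characters into
    # a space so that word boundaries survive.
    return "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in text)


def tokenize(text):
    # Stand-in for the tokenizer: split on runs of whitespace.
    return text.split()


record = "First line\nSecond line"

print(tokenize(strip_controls(record)))     # ['First', 'lineSecond', 'line']
print(tokenize(controls_to_space(record)))  # ['First', 'line', 'Second', 'line']
```

With the removal rule applied first, "Second" never becomes a token of its
own, which matches the search failure described above.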
--
You are receiving this mail because:
You are watching all bug changes.