[Koha-bugs] [Bug 9729] Unable to use IT search terms such as C#, .NET, C++ in searching

Thu Jun 17 03:22:36 CEST 2021

https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=9729

--- Comment #10 from David Cook <dcook at prosentient.com.au> ---
Oh I've had some fun playing with ICU...

chain.xml:
<icu_chain locale="">
  <tokenize rule="l"/>
  <transliterate rule="[:Punctuation:] } [:WhiteSpace:] > ''"/>
  <transform rule="[:WhiteSpace:] Remove "/>
  <display/>
  <casemap rule="l"/>
</icu_chain>

echo -n '.NET. test' | yaz-icu -c chain.xml
1 1 '.net'' '.NET''
2 1 'test' 'test'

--
Here we tokenize on line-break boundaries (i.e. spaces), and then we apply our
transliterate and transform rules as per
http://userguide.icu-project.org/transforms/general. 

With the transliterate, we can use the following syntax:

"before_context { text_to_replace } after_context > completed_result |
result_to_revisit ;"

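So here the "text_to_replace" is the [:Punctuation:] and the "after_context" is
[:WhiteSpace:], and the completed result is meant to transliterate the
punctuation into nothing. (Note the trailing ' in the '.net'' output above,
though: ICU rule syntax treats '' as an escaped literal single quote, so an
empty right-hand side is probably what's needed to delete it outright.)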

So we trim the "." from the end of NET but we don't trim the "." from the
start. 
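
To make the intent of that context rule concrete, here is a rough Python
sketch. It is not ICU or yaz-icu, just an approximation under assumptions:
whitespace splitting stands in for the line-break tokenizer, Unicode general
categories starting with "P" stand in for [:Punctuation:], and the function
names are made up for illustration. It models the intended behaviour (the
punctuation is deleted), not necessarily what the chain above literally emits.

```python
import unicodedata


def is_punct(ch: str) -> bool:
    # Approximates ICU's [:Punctuation:] via Unicode general category P*.
    return unicodedata.category(ch).startswith("P")


def strip_punct_before_space(text: str) -> str:
    # Drop a punctuation character only when the NEXT character is
    # whitespace, mirroring "[:Punctuation:] } [:WhiteSpace:] > (nothing)".
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if is_punct(ch) and nxt.isspace():
            continue
        out.append(ch)
    return "".join(out)


def normalise(text: str) -> list[str]:
    # Tokenize on whitespace, then lowercase (the casemap step).
    return [tok.lower() for tok in strip_punct_before_space(text).split()]


print(normalise(".NET. test"))
print(normalise("Was that a good idea?"))
```

The second call shows the weakness: the final "?" has no whitespace after it,
so the rule never fires and the token keeps its punctuation.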

Of course, that doesn't really work in practice, because it misses sooo many
other scenarios, like punctuation at the very end of the input with no
whitespace after it:

echo -n 'Was that a good idea?' | yaz-icu -c chain.xml
1 1 'was' 'Was'
2 1 'that' 'that'
3 1 'a' 'a'
4 1 'good' 'good'
5 1 'idea?' 'idea?'

I'm not really sure how to solve this problem in an efficient way. We could
just map "C#", "C++", and ".NET" to "csharp", "cplusplus", and "dotnet", but
that's not a very scalable or comprehensive solution for all Koha users.
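
For illustration only, the mapping idea could look something like this in
Python. This is a sketch, not anything Koha or Zebra does today: TERM_MAP,
TRAILING, map_token and map_tokens are hypothetical names, and the set of
trailing punctuation to trim is an assumption.

```python
# Hypothetical lookup table for terms that plain punctuation stripping
# would mangle.
TERM_MAP = {"c#": "csharp", "c++": "cplusplus", ".net": "dotnet"}

# Assumed set of sentence punctuation that is safe to trim from the end.
TRAILING = ".,;:!?"


def map_token(token: str) -> str:
    lowered = token.lower()
    # Try the exact token first so "C#" and "C++" survive intact.
    if lowered in TERM_MAP:
        return TERM_MAP[lowered]
    # Otherwise trim trailing punctuation and try again.
    trimmed = lowered.rstrip(TRAILING)
    return TERM_MAP.get(trimmed, trimmed)


def map_tokens(text: str) -> list[str]:
    return [map_token(tok) for tok in text.split()]


print(map_tokens("I like C#, C++ and .NET."))
```

The catch is visible in the table itself: every new term needs a new entry,
and the same mapping has to be applied at both index and query time.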
