[Koha-bugs] [Bug 24720] Remove special characters from beginning of search fields

bugzilla-daemon at bugs.koha-community.org bugzilla-daemon at bugs.koha-community.org
Fri Sep 10 15:47:34 CEST 2021


https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=24720

--- Comment #11 from David Gustafsson <glasklas at gmail.com> ---
(In reply to David Cook from comment #6)
> I will say though that my own manual testing of the regular expression looks
> good. 
> 
> I've plugged in English, French, and Chinese into my tests and it's all
> working as expected. (I tried Arabic as well and I think that worked too,
> although I imagine Séverine could confirm that better than me.)
> 
> If we can just get those automated unit tests to prove that, then I think
> we'd be good to go.

I added one simple test. I first attempted to test multiple variants with
different scripts, but it turned out to be more effort than I think is
warranted for this simple line of code. Firstly, since multiple values on the
same sort field are concatenated, you have to add multiple different fields
with mappings for the different strings to make them testable; no big deal,
but is it really worth it? (We could break this out into a separate method and
test that, but then we would miss verifying that it is actually called in
Koha::SearchEngine::Elasticsearch::marc_records_to_documents and would not get
the same coverage, and we would be creating a method just to be able to test
this piece of code in isolation, which, if we were to treat the whole code
base this way consistently, would create a mess.) I don't know if it's an
issue with my terminal/editor or something else, but when testing other
scripts, like Chinese, I also get error messages like "UTF-8 "\xC3" does not
map to Unicode at /usr/share/perl5/MARC/File/Encode.pm line 35." before even
reaching the regular expression.

What we are testing is effectively the trivial regular expression
"s/^\W+//u", that is, replacing all non-word Unicode characters at the
beginning of the string with the empty string (removing them). The Perl core
implementation of separating word and non-word characters across different
scripts is not something we should be writing tests for; I would trust Perl in
this case. But sure, a test that makes sure we reach that piece of code, and
that the regexp does not do something completely unexpected, doesn't hurt.
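To illustrate what that regexp does (the Koha code itself is Perl; this is
just a sketch in Python, whose `\W` is likewise Unicode-aware by default in
Python 3, with hypothetical example strings):

```python
import re

def strip_leading_nonword(s: str) -> str:
    """Remove non-word characters from the start of a string.

    Mirrors the Perl s/^\W+//u: punctuation and symbols at the start
    are dropped, while letters from any script are kept.
    """
    return re.sub(r'^\W+', '', s)

print(strip_leading_nonword('[The] title'))      # 'The] title'
print(strip_leading_nonword('«Les misérables'))  # 'Les misérables'
print(strip_leading_nonword('《紅樓夢》'))        # '紅樓夢》'
print(strip_leading_nonword('!!!'))              # ''
```

Only leading non-word characters are touched; anything after the first word
character is left alone.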

(In reply to Nick Clemens from comment #7)
> I agree with the ask for unit tests - they can be added to tests for
> marc_records_to_documents in t/db_dependent/Koha/SearchEngine/Elasticsearch.t
> 
> Additionally, I think this should be optional in search field configs:
> 1 - Title already has options for 'non-filing' characters - so if you want
> to ignore a leading '[' you can 
> 2 - Author names may contain real punctuation, consider the band '!!!'
>     https://en.wikipedia.org/wiki/!!!

(In reply to David Cook from comment #8)
> Good point, Nick!
> 
> There's nothing more frustrating for a library user than the search not
> retrieving records that you know it has...

I made a mistake describing the bug as "Strip special chars from search
fields", so I can understand the confusion. It should be sort fields, not
search fields. I have corrected this in the commit message. This fix doesn't
have any impact on search fields, so it will not affect what is searched on
(the stripping of special characters for search fields is already handled in
Elasticsearch). The band name '!!!' would still be searchable, but it would
probably get the same sort weight as other strings consisting only of non-word
characters; this is an edge case that will probably never have any major
impact in real-life situations. Starting to allow some special characters and
not others could, I think, be a potentially larger source of confusion, and
it's also hard to know how to decide which characters should be exempted.
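A sketch of the sort-versus-search distinction (Python, with hypothetical
names; the real pipeline lives in Koha's Elasticsearch indexing code): the
stored value stays intact and searchable, and only the derived sort key has
its leading non-word characters stripped, so '!!!' collapses to an empty sort
key and sorts ahead of everything else.

```python
import re

def sort_key(name: str) -> str:
    # Hypothetical helper: strip leading non-word chars only when
    # building the sort key; the stored/searchable value is untouched.
    return re.sub(r'^\W+', '', name).lower()

artists = ['!!!', 'Beck', '"Weird Al" Yankovic', 'ABBA']
print(sorted(artists, key=sort_key))
# ['!!!', 'ABBA', 'Beck', '"Weird Al" Yankovic']
```

Note that '"Weird Al" Yankovic' still sorts under W and '!!!' is still present
(and findable) in the list; only its position changes.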

-- 
You are receiving this mail because:
You are watching all bug changes.

