[Koha-bugs] [Bug 28316] Fix ES crashes related to various punctuation characters

bugzilla-daemon at bugs.koha-community.org bugzilla-daemon at bugs.koha-community.org
Sun Aug 22 00:20:40 CEST 2021


https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=28316

--- Comment #75 from Andrew Nugged <nugged at gmail.com> ---
QueryRegexEscapeOptions
=========================

Now let's turn "QueryAutoTruncate" off and look at QueryRegexEscapeOptions.

It's all about RegEx in ES queries:
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/query-dsl-regexp-query.html

The Koha admin setting QueryRegexEscapeOptions has three options:



- escape,

  i.e. if the request has a regular expression between slashes:
      /book/
  it will be passed to ES in "disabled, not a RegEx anymore" form:
      \/book\/
  so ES gets this as regular text to search, and in this example
  just searches for the word 'book', ignoring the other non-alphanum chars.

  (*)NOTE: this means that passing a RegEx to ES is impossible, at all.
           But it also means that any book title like this will work in
           the search field:
               We are / They are
           so if we copy-paste this to the search field, ES gets:
               We are \/ They are
           and Koha just searches for it as it is.

  I assume this was added as a feature to disable the meaning of "/" as a
  special RegEx delimiter and allow librarians to search for
  "We are / They are" text directly.


- don't escape, 

  so:
      /go+d/
  will be passed to ES as it is, so it will be a real "Regular Expression".
  In this example it means "'g', then 'o' one or more times, then 'd'",
  so searching ES for /go+d/ is similar to searching for:
      god OR good
  ( yet not the same, because the RegEx 'go+d' is not limited to a few 'o's,
    so it might match 'gooooood' too)

  BTW it finds 7 books for me both ways (/go+d/ and god OR good) in
  the current master dev-test DB.

  (*)NOTE: but this means that now any '/' which appears in the text will
           cause ES to throw an exception if the syntax is not understood,
           i.e. search for exactly this:
               We are / They are
           now won't work, so users need to do the escaping themselves and ask:
               We are \/ They are
           (by backslashing '/' in the request field) to make ES not "die"
           and the search work.

   That's why, I assume, the 3rd option was introduced, to make 1 and 2 both work:



- unescape escaped,

  yeah, that's "The Trick" to make 1st-way requests like "We are / They are"
  work without ES failing, while still allowing users to bring in Regular
  Expressions, just in "pre-slashed form", i.e. what is pre-slashed WILL BE
  a RegEx, and what isn't won't be, but won't "throw a syntax error" either, so:
        We are / They are
  will be tossed to ES as "We are \/ They are" and will search
  for "We are They are", but this (NOW THE TRICK! BEHOooooLD!):
      We are / They are \/go+\/
  will be transferred to ES in this form:
      We are \/ They are /go+/
  and ES will search for the words 'we', 'they', 'are', all to be present, plus
  any one of the words: go, goo, gooo, goooo...
  ( this finds, btw,
      "Seesaw [sound recording] / Beth Hart, Joe Bonamassa"
    in the current test DB :-), hey! )
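
To make the three options above easier to compare, here is a minimal Python
sketch of the slash handling as described in this comment. It is NOT Koha's
Perl code: the function name process_slashes and the mode labels "escape",
"dont_escape" and "unescape_escaped" are made up for illustration, and the
handling of an already backslashed slash in "escape" mode (left as it is) is
an assumption.

```python
import re

# Matches either an already backslashed slash ("\/") or a bare slash ("/").
SLASH = re.compile(r"\\/|/")


def process_slashes(query: str, mode: str) -> str:
    """Hypothetical model of the three QueryRegexEscapeOptions modes."""
    if mode == "dont_escape":
        # Pass the query through untouched: /.../ reaches ES as a RegEx.
        return query
    if mode == "escape":
        # Every slash ends up backslashed, so no RegEx can reach ES.
        # Assumption: a slash that is already backslashed stays "\/".
        return SLASH.sub(lambda m: "\\/", query)
    if mode == "unescape_escaped":
        # Swap the two forms: bare "/" becomes "\/", and "\/" becomes "/".
        return SLASH.sub(
            lambda m: "/" if m.group(0) == "\\/" else "\\/", query
        )
    raise ValueError(f"unknown mode: {mode}")


print(process_slashes(r"/book/", "escape"))
print(process_slashes(r"/go+d/", "dont_escape"))
print(process_slashes(r"We are / They are \/go+\/", "unescape_escaped"))
```

The swap is done in a single pass on purpose: doing "escape everything" and
then "unescape the escaped ones" as two separate steps would lose track of
which slashes the user had backslashed.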




This is how it is expected to work. But the reality in current master is a
little worse.

Let's take "unescape escaped" and invent this request:
    \/go{1,2}d\/

which we EXPECT to be transferred to ES as the RegEx:
    /go{1,2}d/

which means we want ES to find for us: 'g', then one or at most two 'o's,
then 'd', i.e. this is exactly and ONLY for finding "good" or "god".
But the current master branch code breaks these expectations: it replaces "{"
and "}" with double quotes, so in reality this is transferred to ES as the
request:
    /go"1,2"d/
and obviously, nothing is found.
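
The effect of that replacement can be reproduced outside ES with a small
Python sketch. Python's "re" module is not the Lucene RegEx engine that ES
uses, but "+" and "{1,2}" mean the same in both, and re.fullmatch stands in
for the fact that Lucene patterns always have to match the whole term. The
word list and the helper name "matching" are made up for illustration.

```python
import re

# Sample terms, invented for the demonstration.
words = ["gd", "god", "good", "goood", "go"]


def matching(pattern, candidates):
    # fullmatch: the pattern has to cover the whole word, like a Lucene term.
    return [w for w in candidates if re.fullmatch(pattern, w)]


intended = "go{1,2}d"
# What the comment says current master does: braces become double quotes.
mangled = intended.replace("{", '"').replace("}", '"')

print(matching(intended, words))  # ['god', 'good']
print(mangled, matching(mangled, words))  # go"1,2"d []
```

With the quotes in place the pattern only matches the literal text
go"1,2"d, which no indexed word contains.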

But with this patch, the request:
    \/go{1,2}d\/

will be passed to ES as:
    /go{1,2}d/

and voila, 
  "7 result(s) found for '\/go{1,2}d\/'." in our test DB.
Huh!



Moreover, because this patch respects {,} inside a regex but escapes those
outside, the ones outside get passed escaped. So if you search for:
    \/go{1,2}d\/ [go]

WITHOUT the patch this is passed to ES as:
    /go"1,2"d/ [go]

and brings up a yellow "Error" text box;
WITH the patch it is passed to ES as:
    /go{1,2}d/ \[go\]

and because "\[" and "\]" are ignored by ES, it searches just for "good" or
"god" plus the word "go", so the matched text should contain "good to go"
or "go god" combinations (or others like those).

And yes, on the test DB it finds 3 results! With the patch.
And it gives "error/nothing" without the patch.
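
The "respect inside, escape outside" behaviour described above can be sketched
in Python as follows. This is only an illustration of the idea, not the
patch's Perl code: the names REGEX_SEGMENT, escape_plain and
escape_outside_regex are hypothetical, and the input is assumed to be the
query AFTER the slashes have already been processed.

```python
import re

# A RegEx segment: a bare "/", then any mix of backslash-escaped characters
# and characters other than "/" or "\", then a closing "/".
REGEX_SEGMENT = re.compile(r"(?<!\\)/(?:\\.|[^/\\])*/")


def escape_plain(text: str) -> str:
    # Backslash every [ ] { } that is not already backslashed.
    return re.sub(r"(?<!\\)([\[\]{}])", r"\\\1", text)


def escape_outside_regex(query: str) -> str:
    """Escape brackets and braces only outside /.../ segments."""
    out = []
    pos = 0
    for m in REGEX_SEGMENT.finditer(query):
        out.append(escape_plain(query[pos:m.start()]))  # plain text before
        out.append(m.group(0))                          # RegEx kept verbatim
        pos = m.end()
    out.append(escape_plain(query[pos:]))               # trailing plain text
    return "".join(out)


print(escape_outside_regex("/go{1,2}d/ [go]"))
```

For the example from this comment it prints /go{1,2}d/ \[go\], i.e. the
quantifier inside the slashes survives and only the brackets outside get
backslashed.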
