[Koha-bugs] [Bug 29697] Excessive use of StripNonXmlChars

Mon Dec 20 16:21:59 CET 2021

https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=29697

--- Comment #2 from Jonathan Druart <jonathan.druart+koha at gmail.com> ---
(In reply to Katrin Fischer from comment #1)
> Hi Joubu, can you give an example for such characters that would be
> stripped? I want to help, but not sure about why it was added. 

Short answer? I don't know.

The long answer is a rabbit hole.
The regex is

    $str =~ s/[^\x09\x0A\x0D\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;

And it's related to: https://en.wikipedia.org/wiki/Valid_characters_in_XML

    U+0009, U+000A, U+000D: these are the only C0 controls accepted in XML 1.0;
    U+0020–U+D7FF, U+E000–U+FFFD: this excludes some (not all) non-characters
      in the BMP (all surrogates, U+FFFE and U+FFFF are forbidden);
    U+10000–U+10FFFF: this includes all code points in supplementary planes,
      including non-characters.

So, some weird characters/non-characters :)
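
To make that a bit more concrete, here is a minimal sketch of what gets
stripped. It reuses the character class from the regex above, but the helper
name strip_non_xml_chars and the sample string are made up for illustration;
this is not the actual C4::Charset code.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the same character class as the regex above.
sub strip_non_xml_chars {
    my ($str) = @_;
    $str =~ s/[^\x09\x0A\x0D\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $str;
}

# Made-up sample: a tab (allowed) plus four code points outside the
# ranges listed above (vertical tab, NUL, escape, U+FFFE).
my $dirty = "Tab\x{09}ok, vertical tab\x{0B}, NUL\x{00}, escape\x{1B}, U+FFFE\x{FFFE}.";
my $clean = strip_non_xml_chars($dirty);

print "before: ", length($dirty), " characters\n";
print "after:  ", length($clean), " characters\n";
print "$clean\n";
```

In this sample the tab survives, while the vertical tab, NUL, escape and
U+FFFE are removed.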

> Maybe it would be enough to do it on import/saving a record?

Yes, that's actually what I was suggesting with "Either we assume the MARC::XML
that is stored is correct, or we need to add more StripNonXmlChars calls."

Also note that Galen wrote, at the time:

commit b549d7e1f1b7d518e16fa48af7360a38e8233fec
Date:   Fri Feb 8 16:35:18 2008 -0600
    added StripNonXmlChars to C4::Charset

"StripNonXmlChars should not necessarily be used, as it may be better to reject
a file or record if it contains that kind of encoding error."

We ended up using it from almost everywhere, inconsistently.
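
For comparison, a rough sketch of the "reject instead of strip" approach from
that commit message, applied once at import/save time. The function name
find_non_xml_chars and the sample record fragment are hypothetical, and how a
caller would actually reject the record is left open.

```perl
use strict;
use warnings;

# Hypothetical validator: returns the offsets of characters that are not
# valid in XML 1.0, without modifying the input.
sub find_non_xml_chars {
    my ($str) = @_;
    my @offsets;
    while ( $str =~ /[^\x09\x0A\x0D\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/g ) {
        # pos() is the offset just past the one-character match.
        push @offsets, pos($str) - 1;
    }
    return @offsets;
}

# Made-up fragment containing a backspace (U+0008) inside a subfield.
my $marcxml = "<subfield code=\"a\">Bad\x{08}title</subfield>";
my @bad = find_non_xml_chars($marcxml);
if (@bad) {
    warn "Rejecting record: invalid XML characters at offsets @bad\n";
}
```

Something like this would report the problem to whoever imports or saves the
record, instead of silently changing the data every time it is read.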
