[Koha-bugs] [Bug 29697] Excessive use of StripNonXmlChars
bugzilla-daemon at bugs.koha-community.org
Mon Dec 20 16:21:59 CET 2021
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=29697
--- Comment #2 from Jonathan Druart <jonathan.druart+koha at gmail.com> ---
(In reply to Katrin Fischer from comment #1)
> Hi Joubu, can you give an example for such characters that would be
> stripped? I want to help, but not sure about why it was added.
Short answer? I don't know.
The long answer is a rabbit hole.
The regex is
$str =~ s/[^\x09\x0A\x0D\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
And it's related to: https://en.wikipedia.org/wiki/Valid_characters_in_XML
U+0009, U+000A, U+000D: these are the only C0 controls accepted in XML 1.0;
U+0020–U+D7FF, U+E000–U+FFFD: this excludes some (not all) non-characters
in the BMP (all surrogates, U+FFFE and U+FFFF are forbidden);
U+10000–U+10FFFF: this includes all code points in supplementary planes,
including non-characters.
So, some weird characters/non-characters :)
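To make that concrete, here is a minimal sketch of what the character class removes. The sub name and the sample string are made up for illustration (this is not the C4::Charset code itself), and it only exercises C0 control characters; surrogates, U+FFFE and U+FFFF fall outside the allowed ranges as well but are not shown.

```perl
use strict;
use warnings;

# Illustrative wrapper around the regex quoted above (hypothetical name).
sub strip_non_xml_chars {
    my ($str) = @_;
    $str =~ s/[^\x09\x0A\x0D\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $str;
}

# Sample value with a vertical tab (U+000B), an ESC (U+001B) and a
# NUL (U+0000) embedded, plus a tab and a newline that XML 1.0 allows.
my $dirty = "Moby\x{0B} Dick\x{1B}\x{00}\tor the whale\n";
my $clean = strip_non_xml_chars($dirty);

# Tab and newline survive; VT, ESC and NUL are dropped.
printf "before: %d chars, after: %d chars\n", length($dirty), length($clean);
```

Typically these control characters sneak in from badly converted or binary-contaminated source data rather than from anything a cataloguer typed.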
> Maybe it would be enough to do it on import/saving a record?
Yes, that's actually what I was suggesting with "Either we assume the MARC::XML
that is stored is correct, or we need to add more StripNonXmlChars calls."
Also note that Galen wrote, at the time:
commit b549d7e1f1b7d518e16fa48af7360a38e8233fec
Date: Fri Feb 8 16:35:18 2008 -0600
added StripNonXmlChars to C4::Charset
"StripNonXmlChars should not necessarily be used, as it may be better to reject
a file or record if it contains that kind of encoding error."
We ended up using it from almost everywhere, inconsistently.
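For comparison, the "reject instead of strip" approach from that commit message could look roughly like the sketch below. The helper name and the sample record fragment are hypothetical, and nothing like this exists in Koha as far as this comment is concerned; it just reuses the same character class to report offending characters so the caller can refuse the record.

```perl
use strict;
use warnings;

# Hypothetical helper: list the characters StripNonXmlChars would remove,
# so a caller can reject the record rather than silently alter it.
sub find_non_xml_chars {
    my ($str) = @_;
    my @bad;
    while ($str =~ /([^\x09\x0A\x0D\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}])/g) {
        # pos() points just past the single matched character.
        push @bad, sprintf('U+%04X at offset %d', ord($1), pos($str) - 1);
    }
    return @bad;
}

# Made-up MARCXML fragment containing a vertical tab.
my @problems = find_non_xml_chars("<subfield code=\"a\">Moby\x{0B} Dick</subfield>");
if (@problems) {
    print "rejecting record: ", join(', ', @problems), "\n";
}
```

Doing this once at import/save time would be the "assume the stored MARC::XML is correct" option; everything reading the record afterwards would not need to call StripNonXmlChars at all.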