[Koha-devel] MySQL silently truncating strings due to 3 byte limit in 'utf8' charset

Barton Chittenden barton at bywatersolutions.com
Fri Mar 24 14:05:28 CET 2017


Excellent catch, David!

Please file a bug for that, if you haven't already, the use of MARC 880 for
CJK languages is quite common in US libraries.

Does Zebra handle 4 byte characters correctly?

--Barton

On Fri, Mar 24, 2017 at 2:08 AM, David Cook <dcook at prosentient.com.au>
wrote:

> Hi all,
>
>
>
> I ran into a problem this afternoon where MySQL was silently truncating my
> UTF8 encoded string in biblio_metadata.metadata and biblio.notes.
> Admittedly, the string had some funny looking data*, but it was all valid
> UTF8. I had zero problems with XML::LibXML or MARC::Record. However,
> whenever I would use DBI or the mysql command line client, I’d get a
> truncated string**. This meant a completely invalid XML record which is
> irretrievable from the database using Koha’s XML handling methods.
>
>
>
> However, I eventually found https://bugs.mysql.com/bug.php?id=53005 where
> they state that the MySQL ‘utf8’ charset has a 3 byte limit (
> https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8.html). That
> tends to be fine in most cases, but 𝔤 is 4 bytes (f0 9d 94 a4
> http://dev.networkerror.org/utf8/?start=120100&end=120200&
> cols=4&show_uni_int=on&show_uni_hex=on&show_html_ent=on&
> show_raw_hex=on&show_raw_bin=on). It was on this character that the
> truncation happened. Apparently, there is a 4 byte UTF-8 charset called
> utf8mb4: https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-
> utf8mb4.html. If you “SET NAMES = ‘utf8mb4’” for your client and set the
> charset for your database column, that will allow you to correctly insert
> the data.
>
>
>
> So the truncation doesn’t generate an error but it does generate a
> warning, which is much more obvious in the mysql CLI client than when using
> DBI:
>
> MariaDB [devkohadcook]> update biblio set abstract = 'We introduce a new
> perspective and a generalization of spectral networks for 4d N=2"
> style="position: relative;" tabindex="0" id="MathJax-Element-1-Frame"
> class="MathJax">=2 theories of class S" style="position: relative;"
> tabindex="0" id="MathJax-Element-2-Frame" class="MathJax"> associated
> to Lie algebras g=An" style="position: relative;" tabindex="0"
> id="MathJax-Element-3-Frame" class="MathJax">𝔤=An, Dn"
> style="position: relative;" tabindex="0" id="MathJax-Element-4-Frame"
> class="MathJax">Dn, E6" style="position: relative;" tabindex="0"
> id="MathJax-Element-5-Frame" class="MathJax">E6, and E7"
> style="position: relative;" tabindex="0" id="MathJax-Element-6-Frame"
> class="MathJax">E7. Spectral networks directly compute the BPS spectra
> of 2d theories on surface defects coupled to the 4d theories. A Lie
> algebraic interpretation of these spectra emerges naturally from our
> construction, leading to a new description of 2d-4d wall-crossing
> phenomena. Our construction also provides an efficient framework for the
> study of BPS spectra of the 4d theories. In addition, we consider novel
> types of surface defects associated with minuscule representations of g"
> style="position: relative;" tabindex="0" id="MathJax-Element-7-Frame"
> class="MathJax">𝔤.' where biblionumber = 50314;
>
> Query OK, 0 rows affected, 1 warning (0.00 sec)
>
> Rows matched: 1  Changed: 0  Warnings: 1
>
>
>
> Because it doesn’t raise an exception, I’d say it’s silent. You have to
> check for it. Using mysql, you can send “SHOW WARNINGS” after the insert
> query and you can see something like:
>
> +---------+------+------------------------------------------
> --------------------------------------+
>
> | Level   | Code | Message
>                                           |
>
> +---------+------+------------------------------------------
> --------------------------------------+
>
> | Warning | 1366 | Incorrect string value: '\xF0\x9D\x94\xA4=A...' for
> column 'abstract' at row 1 |
>
> +---------+------+------------------------------------------
> --------------------------------------+
>
> 1 row in set (0.01 sec)
>
>
>
> With DBI, it’s harder: http://stackoverflow.com/
> questions/7946655/dbi-perl-logging-mysql-warnings. Apparently you can use
> $dbh->{mysql_warning_count} to check for warnings, and then send “SHOW
> WARNINGS” via the same database handle and that should give you your
> warning result. Or you can run MySQL in “traditional” mode which treats
> warnings as errors.
>
>
>
> So this seems like an edgecase… it’s some wonky data (albeit valid UTF8
> data) in an otherwise English record.
>
>
>
> However, according to MySQL (I haven’t verified for myself), Chinese,
> Japanese, and Korean sometimes use 4 bytes for characters. So people using
> these languages with Koha could be getting truncated data without realising
> that’s what is happening or why.
>
>
>
> I’ve tried out utf8mb4 and it works great. I think the downside is that it
> uses more space for certain types of columns because it needs to allocate
> more space for that 4th byte. I don’t know whether we should consider
> switching from utf8 to utf8mb4, so that we’re inclusive of CJK and comply
> with RFC3629 which states that there’s a 4 byte maximum.
>
>
>
> I think it’s worth discussion though. This wouldn’t just affect libraries
> in Asia. It would affect any library which stores records with CJK
> characters in a MARC 880 “Alternate Graphic Representation” field.
>
>
>
> I’m at the end of my Friday work day, so I’m going to leave it there, but
> food for thought!
>
>
>
>
>
>
>
>
>
> *Aforementioned strange data:
>
> We introduce a new perspective and a generalization of spectral networks
> for 4d N=2" style="position: relative;" tabindex="0"
> id="MathJax-Element-1-Frame" class="MathJax">=2 theories of class S"
> style="position: relative;" tabindex="0" id="MathJax-Element-2-Frame"
> class="MathJax"> associated to Lie algebras g=An" style="position:
> relative;" tabindex="0" id="MathJax-Element-3-Frame" class="MathJax">𝔤=An,
> Dn" style="position: relative;" tabindex="0" id="MathJax-Element-4-Frame"
> class="MathJax">Dn, E6" style="position: relative;" tabindex="0"
> id="MathJax-Element-5-Frame" class="MathJax">E6, and E7"
> style="position: relative;" tabindex="0" id="MathJax-Element-6-Frame"
> class="MathJax">E7. Spectral networks directly compute the BPS spectra
> of 2d theories on surface defects coupled to the 4d theories. A Lie
> algebraic interpretation of these spectra emerges naturally from our
> construction, leading to a new description of 2d-4d wall-crossing
> phenomena. Our construction also provides an efficient framework for the
> study of BPS spectra of the 4d theories. In addition, we consider novel
> types of surface defects associated with minuscule representations of g"
> style="position: relative;" tabindex="0" id="MathJax-Element-7-Frame"
> class="MathJax">𝔤.
>
>
>
> ** Truncated string:
>
> We introduce a new perspective and a generalization of spectral networks
> for 4d N=2" style="position: relative;" tabindex="0"
> id="MathJax-Element-1-Frame" class="MathJax">=2 theories of class S"
> style="position: relative;" tabindex="0" id="MathJax-Element-2-Frame"
> class="MathJax"> associated to Lie algebras g=An" style="position:
> relative;" tabindex="0" id="MathJax-Element-3-Frame" class="MathJax">𝔤
>
>
>
> David Cook
>
> Systems Librarian
>
> Prosentient Systems
>
> 72/330 Wattle St
>
> Ultimo, NSW 2007
>
> Australia
>
>
>
> Office: 02 9212 0899
>
> Direct: 02 8005 0595
>
>
>
> _______________________________________________
> Koha-devel mailing list
> Koha-devel at lists.koha-community.org
> http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel
> website : http://www.koha-community.org/
> git : http://git.koha-community.org/
> bugs : http://bugs.koha-community.org/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.koha-community.org/pipermail/koha-devel/attachments/20170324/a468b877/attachment.html>


More information about the Koha-devel mailing list