[Koha-devel] Latin in Mysql [was: Questions from the great wall of install...]

Tue Sep 18 16:05:21 CEST 2012

At 02:37 PM 9/17/2012 +0800, Mark Tompsett wrote:
[snip]
>>>5) In MySQL â€œselect variables like â€˜%colla%â€™;â€ and
>>>â€œselect variables like â€˜%char%â€™;â€ generate non-utf8
>>>entries. Is this correct? Nothing like this is mentioned
>>>in the Debian instructions for packages.
>
>>I don't know what this question means.

Mysql and character sets can be daunting, because of the dominance over the 
years of Latinx (tnx WG) but is improving in the direction of UTF8 as more 
and more RFCs and developers wake up (on one of our servers I even have to 
use a dedicated, non-UTF8 instance of Mysql to take care of our "legacy" 7 
year-old, updated every year, accounting programs.)

What you should ideally find is:

mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

What you often find (particularly over a network, even using quite 
up-to-date tunneling) is:mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

>Those two MySQL queries confirm how default databases and tables are 
>created, as far as I know. Latin1 is not UTF-8, and may cause issues with 
>non-Latin charaters, correct? The question was what steps, if any, need to 
>be taken to ensure everything is UTF-8 compliant? I didn't read the 
>koha-create script yet, but I assume it would have forced UTF-8 compliance 
>on the database generation. Am I making a bad assumption?

I was fairly certain that so long as character_set_database | system | 
server are UTF8, Koha should not be in trouble (the potential danger is not 
being able to *restore* a dump, as Mysql does not necessarily report 
corruption while actually making the backup.)

BUT ... Mysql also uses "collation" char sets for textual fields. I've just 
had a looked at our older 3.6.7 db on the sandbox and the production 3.8.4, 
and find:

mysql> show table status where collation like '%latin%' \G
*************************** 1. row ***************************
            Name: fundmapping
[snip]
       Collation: latin1_swedish_ci
[snip]
1 row in set (0.09 sec)

I *think* (see <http://wiki.koha-community.org/wiki/DB_schema_bugs>) this 
is probably a legacy from 3.2 and is no longer required, then:

'DROP TABLE IF EXISTS fundmapping;'

If it's required (I just restored after DROPping the table) then:

'ALTER TABLE fundmapping CHARACTER SET utf8 COLLATE utf8_general_ci;'

Both command lines work on our sandbox:

mysql> show table status where collation like '%latin%' \G
Empty set (0.08 sec)

Note that we do not use acquisitions in Koha, so I really have no data to 
test with.  Maybe someone else could confirm the 'DROP' *if* the table is 
not required, or test whether non-latin utf8 entries in this table are 
correctly restorable?

Best - Paul