[Koha-translate] Language, Script, Country, Encoding - an Explanation

Dorian Meid meid at backstage.org
Sun Jan 20 15:53:04 CET 2008


I have noticed some uncertainty in the metadata submitted with your
translations, so I want to explain the basics a little.
Koha uses RFC4646 http://rfc.net/rfc4646.html for language
identification.
It states that a language is identified by several tags, separated
by hyphens:

Language tag - Script tag - Region/Country tag

The language tag is written in lowercase, the script tag is written  
in lowercase with the first letter in uppercase and the region or  
country tag is written in uppercase.
Example: zh-Hans-CN
zh is Chinese, Hans is the simplified Chinese script and CN is China.
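
To make the tag layout concrete, here is a minimal Python sketch that
splits a tag into its three parts and fixes the case as described
above. It is not Koha code; the name parse_language_tag and the
simplified pattern are my own assumptions, and RFC4646 allows further
subtags (variants, extensions, private use) that are ignored here.

```python
import re

# Simplified pattern (an assumption, not the full RFC4646 grammar):
# language = 2-3 letters, optional script = 4 letters,
# optional region = 2 letters or 3 digits.
TAG_PATTERN = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def parse_language_tag(tag):
    """Split a simple tag into subtags and normalise their case."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError("not a simple language-script-region tag: %r" % tag)
    script = match.group("script")
    region = match.group("region")
    return {
        "language": match.group("language").lower(),   # e.g. zh
        "script": script.title() if script else None,  # e.g. Hans
        "region": region.upper() if region else None,  # e.g. CN
    }


print(parse_language_tag("ZH-hans-cn"))
# {'language': 'zh', 'script': 'Hans', 'region': 'CN'}
```

Tags without a script, such as de-DE, come back with the script set
to None.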

The language is how you speak or what word you use to name a thing.
The language tags are standardised in ISO 639-1 or ISO 639-2
http://www.loc.gov/standards/iso639-2/php/code_list.php
As we are in a library environment it may be useful to mention the  
difference between ISO 639-2/T and ISO 639-2/B.
T refers to the terminology code and B refers to the bibliographic
code, e.g. German has the tag "deu" in ISO 639-2/T and "ger" in
ISO 639-2/B.
The reason for this inconvenience is that some libraries assigned  
some tags for languages (the B-tags) before the ISO (T)  
standardisation was made.
The T and B differences are only in the three-letter tags of ISO
639-2. So far we use the two-letter tags of ISO 639-1, but RFC4646
also allows the three-letter tags of ISO 639-2 for languages that
have no two-letter tag.
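
If you receive catalogue data with three-letter codes, the mapping to
the two-letter tags can be sketched like this in Python. The tables
hold only three example languages and the function name
to_two_letter is hypothetical; a real table would be filled from the
ISO 639-2 code list linked above.

```python
# Example entries only; both tables are illustrative, not complete.
B_TO_T = {"ger": "deu", "fre": "fra", "chi": "zho"}      # 639-2/B -> 639-2/T
T_TO_ALPHA2 = {"deu": "de", "fra": "fr", "zho": "zh"}    # 639-2/T -> 639-1


def to_two_letter(code):
    """Return the ISO 639-1 tag for a B or T code, if we know it."""
    code = code.lower()
    code = B_TO_T.get(code, code)        # bibliographic -> terminology
    return T_TO_ALPHA2.get(code, code)   # unknown codes pass through


print(to_two_letter("ger"))  # de
print(to_two_letter("deu"))  # de
```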

The script is what your characters look like, or what you draw to
represent a specific sound.
The script tags are standardised in ISO 15924
http://www.unicode.org/iso15924/codelists.html
You have to add the script tag if your language can be written in
more than one script, e.g. Hans for simplified Chinese or Hant for
traditional Chinese, or if the specified language is not written in
its usual script, e.g. de-Latf-DE for German in Fraktur.
You should omit the script tag, although you don't have to, if there
is only one commonly used script for your language.
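
The rule for leaving out the script tag could be sketched as follows.
The DEFAULT_SCRIPT table and the name build_tag are assumptions made
for this example; as far as I know the IANA language subtag registry
records such defaults in its Suppress-Script fields, and a real
implementation would read them from there.

```python
# Assumed default scripts for a few languages (illustrative only).
DEFAULT_SCRIPT = {"de": "Latn", "en": "Latn", "fr": "Latn"}


def build_tag(language, script=None, region=None):
    """Join subtags with hyphens, dropping a script that is the default."""
    language = language.lower()
    parts = [language]
    if script:
        script = script.title()
        # Keep the script only if it is not the usual one for the language.
        if DEFAULT_SCRIPT.get(language) != script:
            parts.append(script)
    if region:
        parts.append(region.upper())
    return "-".join(parts)


print(build_tag("de", "Latn", "DE"))  # de-DE
print(build_tag("de", "Latf", "DE"))  # de-Latf-DE
print(build_tag("zh", "Hans", "CN"))  # zh-Hans-CN
```

Chinese has no entry in the table, so its script tag is always kept.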

The region or country is where the language is spoken. This is
important because there are often differences between countries
that basically share the same language, e.g. British English and
American English.
The region/country tag is either a two-letter country code as
standardised in ISO 3166-1
http://www.iso.org/iso/country_codes/iso_3166_code_lists.htm
or a three-digit region code as standardised in UN M.49
http://unstats.un.org/unsd/methods/m49/m49.htm
Normally we use the ISO letter code, but the UN region code can be  
handy when specifying a language spoken in more than one country,  
e.g. es-005 (Spanish as spoken in South America).

A script tag tells us what your script should look like, but
computers are dumb. They don't know written characters, they just
know bytes. The assignment of written (visual) characters to byte
values is called character encoding. There are many different
character encodings and, to make it even worse, some scripts can be
encoded successfully in several different ways.
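
A small Python sketch shows the problem. It encodes one Cyrillic
letter with several encodings that all cover Cyrillic and prints the
resulting bytes. The choice of letter and encodings is mine, just for
illustration, and the helper name encode_in is hypothetical.

```python
def encode_in(text, encodings):
    """Return a dict mapping each encoding name to the encoded bytes."""
    return {name: text.encode(name) for name in encodings}


# "\u0416" is the Cyrillic capital letter ZHE, written as an escape so
# the example does not depend on the encoding of this file itself.
result = encode_in("\u0416", ["koi8-r", "cp1251", "iso8859-5", "utf-8"])

for name, raw in result.items():
    # Same written character, different byte values per encoding.
    print(name, raw.hex())
```

Each encoding produces different bytes for the same character, which
is why a document is unreadable if the wrong encoding is assumed.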
Traditional single-byte character encodings can represent only 128
or 256 characters. Unicode has room for more than a million
characters and can encode all scripts in use, so it is the preferred
choice for Koha themes and translations.
So please use UTF-8 for your document character encoding
http://www.unicode.org/standard/WhatIsUnicode.html
If you can't use UTF-8 or don't know how to use it, please ask the
list or at least specify the encoding you are using, so we can
transcode your document.
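
For those who want to try it themselves, transcoding can be sketched
in a few lines of Python once the original encoding is known. The
function name transcode_to_utf8 and the Latin-1 sample bytes are
assumptions for this example.

```python
def transcode_to_utf8(raw_bytes, source_encoding):
    """Decode bytes from the stated encoding and re-encode as UTF-8."""
    text = raw_bytes.decode(source_encoding)  # bytes -> characters
    return text.encode("utf-8")               # characters -> UTF-8 bytes


# The German word "Gruesse" with u-umlaut and sharp s, stored as Latin-1.
latin1_bytes = b"Gr\xfc\xdfe"
utf8_bytes = transcode_to_utf8(latin1_bytes, "latin-1")
print(utf8_bytes)  # b'Gr\xc3\xbc\xc3\x9fe'
```

This only works if the stated encoding is correct, so please always
tell us which one you used.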

Hope that helps.
Maybe this should be added to the readme on translate or the wiki.

Dorian Meid