[Koha-translate] Language, Script, Country, Encoding - an Explanation
Dorian Meid
meid at backstage.org
Sun Jan 20 15:53:04 CET 2008
I have noticed some uncertainty in the metadata submitted with
your translations, so I want to explain the basics a little.
Koha uses RFC4646 http://rfc.net/rfc4646.html for language
identification.
It states that a language is identified by several tags, separated
by hyphens:
Language tag - Script tag - Region/Country tag
The language tag is written in lowercase, the script tag is written
in lowercase with the first letter in uppercase and the region or
country tag is written in uppercase.
Example: zh-Hans-CN
zh is Chinese, Hans is the simplified Chinese script and CN is China.
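
The casing rules above can be sketched in a few lines of Python. This
is only an illustration, not Koha code: the function name normalize_tag
and the simplified pattern are assumptions, and the pattern covers just
the language, script and region parts described here, not the full
RFC4646 grammar.

```python
import re

# Simplified pattern (an assumption, not the full RFC4646 grammar):
# language = 2-3 letters, optional script = 4 letters,
# optional region = 2 letters or 3 digits.
TAG_RE = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def normalize_tag(tag):
    """Return the tag with conventional casing, e.g. zh-Hans-CN."""
    match = TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError("not a language[-Script][-REGION] tag: %r" % tag)
    parts = [match.group("language").lower()]        # language: lowercase
    if match.group("script"):
        parts.append(match.group("script").title())  # script: first letter uppercase
    if match.group("region"):
        parts.append(match.group("region").upper())  # region: uppercase
    return "-".join(parts)


print(normalize_tag("ZH-hans-cn"))  # zh-Hans-CN
print(normalize_tag("en-gb"))       # en-GB
```

A tag that does not fit the three-part shape raises ValueError, so
malformed metadata is noticed early instead of being silently accepted.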
The language is how you speak or what word you use to name a thing.
The language tags are standardised in ISO 639-1 or ISO 639-2
http://www.loc.gov/standards/iso639-2/php/code_list.php
As we are in a library environment it may be useful to mention the
difference between ISO 639-2/T and ISO 639-2/B.
T refers to the terminology code and B refers to the bibliographic
code, e.g. German has the tag "deu" in ISO 639-2/T and "ger" in
ISO 639-2/B.
The reason for this inconvenience is that some libraries assigned
some tags for languages (the B-tags) before the ISO (T)
standardisation was made.
The T and B differences exist only in the three-letter tags of ISO
639-2. So far we use the two-letter tags of ISO 639-1, but RFC4646
also allows 639-2 tags for languages that have no two-letter tag.
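
As a small illustration of the T/B difference, here is a sketch that
maps either three-letter form to the two-letter tag we use. The table
is only an excerpt of four languages whose T and B codes differ, and
the names CODES and to_two_letter are made up for this example.

```python
# Excerpt only: (ISO 639-1, ISO 639-2/T, ISO 639-2/B)
CODES = [
    ("de", "deu", "ger"),  # German
    ("fr", "fra", "fre"),  # French
    ("nl", "nld", "dut"),  # Dutch
    ("zh", "zho", "chi"),  # Chinese
]

# Build one lookup table that accepts any of the three forms.
LOOKUP = {}
for two_letter, t_code, b_code in CODES:
    for code in (two_letter, t_code, b_code):
        LOOKUP[code] = two_letter


def to_two_letter(code):
    """Return the ISO 639-1 tag for a 639-1, 639-2/T or 639-2/B code.

    Returns None if the code is not in the excerpt above.
    """
    return LOOKUP.get(code.lower())


print(to_two_letter("ger"))  # de
print(to_two_letter("deu"))  # de
```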
The script is what your characters look like or what you paint to
produce a specific sound.
The script tags are standardised in ISO 15924
http://www.unicode.org/iso15924/codelists.html
You have to add the script tag if your language can be written in
more than one script, e.g. Hans for simplified Chinese or Hant for
traditional Chinese, or if the specified language is not written in
its usual script, e.g. de-Latf-DE for German in Fraktur.
You should, but don't have to, omit the script tag if there is only
one commonly used script for your language.
The region or country is where the language is spoken. This is
important because there are often differences between countries
that basically share the same language, e.g. British English and
American English.
The region/country tag is either a two-letter country code as
standardised in ISO 3166-1
http://www.iso.org/iso/country_codes/iso_3166_code_lists.htm
or a three-digit region code as standardised in UN M.49
http://unstats.un.org/unsd/methods/m49/m49.htm
Normally we use the ISO letter code, but the UN region code can be
handy when specifying a language spoken in more than one country,
e.g. es-005 (Spanish as spoken in South America).
When given a script tag we know what your script should look like, but
computers are dumb. They don't know written characters, they just know
bytes. The assignment of written (visual) characters to byte values
is called character encoding. There are many different character
encodings, and to make it even worse, some scripts can be
successfully encoded in several different ways.
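
To make this concrete, here is a short sketch showing the same text
turned into different bytes by different encodings. The helper name
encode_all and the sample word are chosen just for this example.

```python
def encode_all(text, encodings):
    """Return a dict mapping each encoding name to the encoded bytes."""
    return {name: text.encode(name) for name in encodings}


# The same visible characters, three different byte sequences.
for name, data in encode_all("Bücher", ["utf-8", "iso-8859-1", "utf-16-le"]).items():
    print(name, data)

# Reading bytes with the wrong encoding produces garbage:
# UTF-8 bytes for "ü" interpreted as ISO 8859-1.
print("ü".encode("utf-8").decode("iso-8859-1"))  # Ã¼
```

This is why the encoding of a document has to be known: the bytes alone
do not say which characters they stand for.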
Traditional single-byte character encodings can assign only 128 or
256 characters. Unicode has room for more than a million characters
and can encode all scripts in use, so it is the preferred choice for
Koha themes and translations.
So please use UTF-8 for your document character encoding
http://www.unicode.org/standard/WhatIsUnicode.html
If you can't use UTF-8 or don't know how to use it, please ask the
list or at least specify the encoding you are using, so we can
transcode your document.
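
For the curious, transcoding is simple once the source encoding is
known. The following is a minimal sketch, not the procedure actually
used for Koha translations; the function name transcode_to_utf8 is
made up for this example.

```python
def transcode_to_utf8(data, source_encoding):
    """Decode bytes from the stated encoding and re-encode them as UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in
    source_encoding, which usually means the stated encoding is wrong.
    """
    return data.decode(source_encoding).encode("utf-8")


# A document saved as ISO 8859-1 (Latin-1) ...
latin1_bytes = "Bücher".encode("iso-8859-1")
# ... becomes the same text in UTF-8.
utf8_bytes = transcode_to_utf8(latin1_bytes, "iso-8859-1")
print(utf8_bytes.decode("utf-8"))  # Bücher
```

Note that this only works when the stated encoding is correct, which is
why telling us the encoding matters so much.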
Hope that helps.
Maybe this should be added to the readme on translate or the wiki.
Dorian Meid