[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

bugzilla-daemon at bugs.koha-community.org bugzilla-daemon at bugs.koha-community.org
Tue Dec 8 01:13:48 CET 2015


http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #9 from David Cook <dcook at prosentient.com.au> ---
(In reply to Galen Charlton from comment #8)
> (In reply to David Cook from comment #7)
> > Do we really need to remove accents
> > for that?
> 
> Per bug 7411, there was apparently an issue searching on usernames with
> diacritics, although in retrospect that may simply have been an issue with
> mismatched Unicode normalization forms -- impossible to tell now.
> 
> The current patchset for bug 7679 also proposes to use Text::Unaccent, but
> I'm dubious about that one.

It's surprising that Text::Unaccent doesn't appear to be working correctly,
since it is using iconv for the heavy lifting, and iconv seems to be pretty
good when it comes to character conversions.

I can't speak to Hebrew or Greek (while I thought I wasn't bad with the modern
Greek alphabet, I didn't know they used accents...), but Arabic sure is
interesting.

So we have the following string:
مُدَرِّسَة

If we run the following:
echo "مُدَرِّسَة" | xxd -p

We get this hex:
d985d98fd8afd98ed8b1d990d991d8b3d98ed8a90a
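The same bytes can be reproduced without the shell. Here's a minimal Python
sketch (Python purely for illustration, not what Koha uses), assuming the
string is the ten code points spelled out as escapes below, which is my reading
of the hex above. The variable names are made up. Note that echo appends a
newline, which is where the trailing 0a comes from.

```python
# Code points written as escapes so the combining marks can't get mangled in transit.
word = "\u0645\u064f\u062f\u064e\u0631\u0650\u0651\u0633\u064e\u0629"

# echo appends "\n", so add it here to match the xxd output byte for byte.
hex_bytes = (word + "\n").encode("utf-8").hex()
print(hex_bytes)
```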

If we look at the first couple of bytes there using a UTF-8 table
(http://www.utf8-chartable.de/unicode-utf8-table.pl):

d985 = م = ARABIC LETTER MEEM
d98f = ُ = ARABIC DAMMA
Together, these are written like مُ 

However, if you add the letter "dal":
d8af = د = ARABIC LETTER DAL

You'll get something like the following:
مُد

We'd recognize that as the start of the string (its right-hand end, since
Arabic is written right to left): "مُدَرِّسَة"

I had forgotten that Hebrew only has consonants in its alphabet, and it appears
Arabic is the same. So that "damma" indicates a vowel sound but isn't a letter
per se. I'd say it's a diacritic and this would agree:
https://en.wikipedia.org/wiki/Arabic_diacritics#.E1.B8.8Cammah
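Unicode's own character data points the same way. As a rough sketch, assuming
Python's standard unicodedata module as a stand-in for looking things up in the
table by hand, this prints the name and general category of the first three
code points. "Lo" is an ordinary letter and "Mn" is a nonspacing mark.

```python
import unicodedata

prefix = "\u0645\u064f\u062f"  # meem, damma, dal

# (code point, Unicode name, general category) for each character
info = [
    (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
    for ch in prefix
]
for row in info:
    print(*row)
```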

So the output for "Strip NonspacingMark" looks good in the very first case at
least:

Strip NonspacingMark     - مُدَرِّسَة => مدرسة
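For anyone wanting to check that result independently, here is a minimal sketch
of what I assume "Strip NonspacingMark" boils down to: dropping every character
whose general category is Mn. It's written in Python rather than Perl and is
not the code from the patch; strip_nonspacing_marks is a hypothetical name.

```python
import unicodedata

def strip_nonspacing_marks(text):
    """Drop every character whose general category is Mn (Nonspacing_Mark)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# The Arabic example string, written as escapes to keep the marks intact.
word = "\u0645\u064f\u062f\u064e\u0631\u0650\u0651\u0633\u064e\u0629"
stripped = strip_nonspacing_marks(word)
print(stripped)
```

In this string the vowel marks are already separate code points, so no
normalization step is needed before filtering.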

Although I don't know if it makes sense semantically as I don't read Arabic. If
I understand correctly, you can omit vowel sounds from written Arabic and rely
purely on context for meaning?
(https://en.wikipedia.org/wiki/Arabic_alphabet#Vowels)

At a glance, the Strip NonspacingMark output looks OK for Greek too, as those
diacritics appear to be there purely for pronunciation, like in languages
written in the Roman alphabet.
(https://en.wikipedia.org/wiki/Modern_Greek#Phonology_and_orthography)
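One wrinkle with Greek is that accented letters are usually stored as single
precomposed code points, so a plain "drop category Mn" filter finds nothing to
remove unless the text is decomposed first. A rough Python sketch of that,
assuming NFD decomposition before the filter and NFC recomposition after; the
sample word and the function names are mine, not from the patch:

```python
import unicodedata

def drop_mn(text):
    """Remove nonspacing marks without any normalization."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

def strip_marks(text):
    """Decompose (NFD), remove nonspacing marks, then recompose (NFC)."""
    return unicodedata.normalize("NFC", drop_mn(unicodedata.normalize("NFD", text)))

greek = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"  # "Ελληνικά", precomposed alpha with tonos

unchanged = drop_mn(greek)   # the accent is part of U+03AC, so nothing is removed
plain = strip_marks(greek)   # NFD splits U+03AC into U+03B1 + U+0301 first
print(unchanged, plain)
```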
