[Koha-devel] Diacriticals, Unicode, and PDF's

Tue Sep 29 05:50:27 CEST 2009

The problem is not really with Koha; it is with the PDF format.  I worked on
this a while back and concluded that it cannot be solved cleanly without
serious trade-offs:

   - controlling more aspects of the process, such as requiring specific
   fonts on the user's system, or
   - a dramatic increase in file size (orders of magnitude), from either
   embedding the font in the PDF or producing what are effectively
   page-sized images, or
   - relying on non-free PDF components, or dropping support for common
   versions of Acrobat Reader, or
   - custom character set conversion into ASCII (as far as possible), i.e.
   data loss.
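
To make the last trade-off concrete, here is a minimal Python sketch of
that kind of ASCII folding.  It is an illustration only, not Koha code:
the function name to_ascii_lossy and the "?" placeholder are hypothetical
choices.  It decomposes each character, drops the combining accents, and
replaces anything still outside ASCII.

```python
import unicodedata


def to_ascii_lossy(text: str, placeholder: str = "?") -> str:
    """Fold text to ASCII; hypothetical helper, for illustration only."""
    # NFKD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop the accent, keep the base letter
        # Anything still outside ASCII has no plain equivalent: data loss.
        out.append(ch if ord(ch) < 128 else placeholder)
    return "".join(out)


if __name__ == "__main__":
    for sample in ["Vilnius", "Žemaitė", "图书馆"]:
        print(f"{sample!r} -> {to_ascii_lossy(sample)!r}")
```

Lithuanian text keeps its base letters but loses its diacritics, while
Chinese text is reduced entirely to placeholders, which is why this
option amounts to data loss rather than a fix.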

The CPAN modules had quite poor APIs for dealing with Unicode data.
Any of the available methods would require a heavy overhaul of the code and
of the approach to labels in general.

In my opinion, development time might be better spent on piping the data
into an external, known-good, Unicode-capable print tool or something like
OpenOffice.  Generating PDFs out of (FOSS) Perl just didn't seem to be a
viable answer.

I would be interested to see any counter-examples of FOSS Perl producing
compact, cross-platform PDFs containing UTF-8 data such as Chinese or
Lithuanian... that don't require specific fonts.

--Joe

2009/9/28 Nathan Gray <kolibrie at graystudios.org>

> On Mon, Sep 28, 2009 at 09:21:39PM -0400, Chris Nighswonger wrote:
> > The UTF to PDF conversion issue appears to be primarily caused by the
> > fact that the PDF stream uses glyphIDs rather than Unicode to display
> > strings. Thus there is no direct, one-to-one Unicode-to-glyphID
> > relationship. That *some* Unicode chars come across OK is due
> > more to chance than to design: it happens when the
> > Unicode value *happens* to match the font's glyphID. What really should
> > be happening is that a "ToUnicode" table is built and
> > embedded in the PDF file so that the relationship from Unicode to
> > glyphID is properly defined.
>
> [snip]
>
> > Any thoughts, information, suggestions, etc. are most gratefully
> > appreciated.
>
> The cairographics project has done a lot of work on PDFs and text
> to glyph translation, if I remember correctly.
>
>  http://cairographics.org
>
> A Google search with these terms is a good start:
>
>  cairo graphics pdf text to glyph
>
> It looks like they rely on the Pango libraries (something called
> pangocairo in particular).
>
> -kolibrie
>
> http://lists.koha.org/mailman/listinfo/koha-devel
>