[Koha-patches] [PATCH] Bug 4828: Clean diacritics from SIP-written messages

Thu Feb 17 21:23:39 CET 2011

On 17/02/2011 18:43, Ian Walls wrote:
> Non-ASCII characters tend to break SIP machines.  This patch scrubs
> diacritics from any message written out to the SIP client.  It won't help
> with non-Roman scripts, but any accent marks will be removed by Unicode
> normalization.

Nice stuff from Dan :D
There are also Text::Unidecode and Text::Undiacritic, which do much the
same thing.
Nonetheless, from my experience and point of view, SIP has been
successfully installed and used with UTF-8 support, without error
detection, in some of the libraries we support.
And for French libraries, having diacritics in titles is an important
requirement.
Hope that helps.
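
For anyone who wants to try the normalization step outside of Koha, here
is a minimal standalone sketch of the same idea as the patch quoted below:
decode the UTF-8 bytes, decompose to NFD, drop the combining marks, then
encode to ASCII. The name strip_diacritics is only illustrative and is not
part of the patch; the sketch assumes the input arrives as UTF-8 encoded
bytes, as it does in clean_text.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper (not in the patch): same steps as clean_text,
# with the target encoding fixed to ASCII.
sub strip_diacritics {
    my $bytes = shift // '';

    # Decode UTF-8 bytes to Perl characters, then decompose so that each
    # accented letter becomes a base letter followed by combining marks.
    my $text = NFD(decode_utf8($bytes));

    # Remove the combining marks, leaving only the base letters.
    $text =~ s/\pM+//g;

    # Anything still outside ASCII is replaced with '?' by encode().
    return encode('ascii', $text);
}

print strip_diacritics("\xC3\xA9t\xC3\xA9"), "\n";   # UTF-8 bytes for "été", prints "ete"
print strip_diacritics("Fran\xC3\xA7ois"), "\n";     # UTF-8 bytes for "François", prints "Francois"
```

This also shows the limitation mentioned above: the accents are lost for
good, and text in non-Latin scripts comes out as question marks.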
-- 
Henri-Damien LAURENT

> 
> Based on work described by Dan Scott in his post to open-ils-dev at list.georgialibraries.org
> on Jan 04, 2010.  http://www.mail-archive.com/open-ils-dev@list.georgialibraries.org/msg04127.html
> 
> Tested with 3M SIP emulator, as well as live on two different Koha installs
> ---
>  C4/SIP/Sip.pm |   36 +++++++++++++++++++++++++++++++++++-
>  1 files changed, 35 insertions(+), 1 deletions(-)
> 
> diff --git a/C4/SIP/Sip.pm b/C4/SIP/Sip.pm
> index 8a0f067..7841d6a 100644
> --- a/C4/SIP/Sip.pm
> +++ b/C4/SIP/Sip.pm
> @@ -9,6 +9,9 @@ use warnings;
>  use English;
>  use Exporter;
>  
> +use Encode;
> +use Unicode::Normalize;
> +
>  use Sys::Syslog qw(syslog);
>  use POSIX qw(strftime);
>  use Socket qw(:crlf);
> @@ -142,6 +145,37 @@ sub boolspace {
>      return $bool ? 'Y' : ' ';
>  }
>  
> +sub clean_text {
> +    my $text = shift || '';
> +
> +    # hardcoded to ASCII since Koha configs don't take encoding as institution params
> +    my $target_encoding = 'ascii';
> +
> +    # Convert our incoming UTF8 data into Perl's internal string format
> +
> +    # Also convert to Normalization Form D, as the ASCII, iso-8859-1,
> +    # and latin-1 encodings (at least) require this to substitute
> +    # characters rather than simply returning a string truncated
> +    # after the first non-ASCII character
> +    $text = NFD(decode_utf8($text));
> +
> +    if ($target_encoding eq 'ascii') {
> +
> +        # Try to maintain a reasonable version of the content by
> +        # stripping diacritics from the text, given that the SIP client
> +        # wants just plain ASCII. This is the base requirement according
> +        # to the SIP2 specification.
> +
> +        $text =~ s/\pM+//og;
> +    }
> +
> +    # Characters that cannot be represented in the target encoding will
> +    # generally be replaced with a question mark (?) character.
> +    $text = encode($target_encoding, $text);
> +
> +    return $text;
> +}
> +
>  
>  # read_SIP_packet($file)
>  #
> @@ -218,7 +252,7 @@ sub write_msg {
>      my ($self, $msg, $file) = @_;
>      my $cksum;
>  
> -    # $msg = encode_utf8($msg);
> +    $msg = clean_text($msg);
>      if ($error_detection) {
>          if (defined($self->{seqno})) {
>              $msg .= 'AY' . $self->{seqno};