[Koha-patches] [PATCH] bug 2926: fix staging import hang

Galen Charlton galen.charlton at liblime.com
Sat Jun 6 23:45:54 CEST 2009


Fixes a hang of the staging import tool when it
attempts to process a MARC21 record that claims
that it's UTF-8 when it is not.  The staging import
will now attempt to fix the character encoding of such
records.

Also added a FIXME to bulkmarcimport.pl, which because
of its use of MARC::Batch will skip over such records -
better than the original hang of the staging import, but
worse than the staging import's new ability to fix such
records.
---
 C4/Charset.pm                          |   18 +++++++++++++++++-
 misc/migration_tools/bulkmarcimport.pl |    6 ++++++
 2 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/C4/Charset.pm b/C4/Charset.pm
index 001cf01..5c5e7ce 100644
--- a/C4/Charset.pm
+++ b/C4/Charset.pm
@@ -153,7 +153,23 @@ sub MarcToUTF8Record {
         $marc =~ s/^\s+//;
         $marc =~ s/\s+$//;
         $marc_blob_is_utf8 = IsStringUTF8ish($marc);
-        $marc_record = MARC::Record->new_from_usmarc($marc);
+        eval {
+            $marc_record = MARC::Record->new_from_usmarc($marc);
+        };
+        if ($@) {
+            # if we fail the first time, one likely problem
+            # is that we have a MARC21 record that says that it's
+            # UTF-8 (Leader/09 = 'a') but contains non-UTF-8 characters.
+            # We'll try parsing it again.
+            substr($marc, 9, 1) = ' ';
+            eval {
+                $marc_record = MARC::Record->new_from_usmarc($marc);
+            };
+            if ($@) {
+                # it's hopeless; return an empty MARC::Record
+                return MARC::Record->new(), 'failed', ['could not parse MARC blob'];
+            }
+        }
     }
 
     # If we do not know the source encoding, try some guesses
diff --git a/misc/migration_tools/bulkmarcimport.pl b/misc/migration_tools/bulkmarcimport.pl
index 5fc8e4c..2a48540 100755
--- a/misc/migration_tools/bulkmarcimport.pl
+++ b/misc/migration_tools/bulkmarcimport.pl
@@ -164,6 +164,12 @@ RECORD: while (  ) {
     eval { $record = $batch->next() };
     if ( $@ ) {
         print "Bad MARC record: skipped\n";
+        # FIXME - because MARC::Batch->next() combines grabbing the next
+        # blob and parsing it into one operation, a correctable condition
+        # such as a MARC-8 record claiming that it's UTF-8 can't be recovered
+        # from because we don't have access to the original blob.  Note
+        # that the staging import can deal with this condition (via
+        # C4::Charset::MarcToUTF8Record) because it doesn't use MARC::Batch.
         next;
     }
     last unless ( $record );
-- 
1.5.6.5




More information about the Koha-patches mailing list