[Koha-patches] [PATCH] Script to repair MARC21 leader/09.

Joe Atzberger joe.atzberger at liblime.com
Fri Apr 3 22:42:20 CEST 2009


Acquisitions process seems to be adding records with incorrect
representation of the MARC encoding in leader/09.  It should be
'a' meaning UTF-8, for all Koha's internalized records, but in
many cases it appears blank (for MARC-8).  This script diagnoses
and repairs the value in the leader, depending on runtime options.

The symptom of this problem is that high-value UNICODE characters
in the record will cause Koha to crash whenever it tries to parse
the MARCXML, giving a "Wide character" fatal.  While we work on
fixing the input, this script will fix the existing data.
---
 misc/maintenance/leader_fix.pl |  221 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 221 insertions(+), 0 deletions(-)
 create mode 100755 misc/maintenance/leader_fix.pl

diff --git a/misc/maintenance/leader_fix.pl b/misc/maintenance/leader_fix.pl
new file mode 100755
index 0000000..e0ca66f
--- /dev/null
+++ b/misc/maintenance/leader_fix.pl
@@ -0,0 +1,221 @@
+#!/usr/bin/perl
+#
+# Copyright 2009 Liblime
+#
+# This file is part of Koha.
+#
+# Koha is free software; you can redistribute it and/or modify it under the
+# terms of the GNU General Public License as published by the Free Software
+# Foundation; either version 2 of the License, or (at your option) any later
+# version.
+#
+# Koha is distributed in the hope that it will be useful, but WITHOUT ANY
+# WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
+# A PARTICULAR PURPOSE.  See the GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License along with
+# Koha; if not, write to the Free Software Foundation, Inc., 59 Temple Place,
+# Suite 330, Boston, MA  02111-1307 USA
+
+use strict;
+use warnings;
+
+use MARC::Record;
+use MARC::File::XML;
+use Getopt::Long qw(:config auto_help auto_version);
+use Pod::Usage;
+
+use C4::Biblio;
+use C4::Charset;
+use C4::Context;
+use C4::Debug;
+
+use vars qw($VERSION);
+
+BEGIN {
+    # find Koha's Perl modules
+    # test carefully before changing this
+    use FindBin;
+    eval { require "$FindBin::Bin/../kohalib.pl" };
+    $VERSION = 0.02;
+}
+
+our $debug;
+
+## OPTIONS
+my $help    = 0;
+my $man     = 0;
+my $verbose = 0;
+
+my $limit;      # undef, not zero.
+my $offset  = 0;
+my $dump    = 0;
+my $summary = 1;
+my $fix     = 0;
+
+GetOptions(
+    'help|?'    => \$help,
+    'man'       => \$man,
+    'verbose=i' => \$verbose,
+    'limit=i'   => \$limit,
+    'offset=i'  => \$offset,
+    'dump!'     => \$dump,
+    'summary!'  => \$summary,
+    'fix!'      => \$fix,
+) or pod2usage(2);
+pod2usage( -verbose => 2 ) if ($man);
+pod2usage( -verbose => 2 ) if ($help and $verbose);
+pod2usage(1) if $help;
+
+if ($debug) {
+    $summary++;
+    $verbose++;
+}
+
+my $marcflavour = C4::Context->preference('marcflavour');
+
+my $all = C4::Context->dbh->prepare("SELECT COUNT(*) FROM biblioitems");
+$all->execute;
+my $total = $all->fetchrow;
+
+my $count_query = "SELECT COUNT(*) FROM biblioitems WHERE substr(marc, 10, 1)  = ?";
+my $query       = "SELECT     *    FROM biblioitems WHERE substr(marc, 10, 1) <> ?";
+
+my $sth = C4::Context->dbh->prepare($count_query);
+$sth->execute('a');
+my $count    = $sth->fetchrow;
+my $badcount = $total-$count;
+
+if ($summary) {
+    print  "# biblioitems with leader/09 = 'a'\n";
+    printf "# %9s match\n",   $count;
+    printf "# %9s  BAD \n",   $badcount;
+    printf "# %9s total\n\n", $total;
+    printf "# Examining %s BAD record(s), offset %d:\n", ($limit || 'all'), $offset;
+}
+
+my $bad_recs = C4::Context->dbh->prepare($query);
+$bad_recs->execute('a');
+$limit or $limit = $bad_recs->rows();   # limit becomes max if unspecified
+$limit += $offset if $offset;           # increase limit for offset
+my $i = 0;
+
+$marcflavour or die "No marcflavour (MARC21 or UNIMARC) set in syspref";
+
+MARC::File::XML->default_record_format($marcflavour) or die "FAILED MARC::File::XML->default_record_format($marcflavour)";
+
+while ( my $row = $bad_recs->fetchrow_hashref() ) {
+    (++$i > $limit) and last;
+    (  $i > $offset) or next;
+    my $xml = $row->{marcxml};
+    $xml =~ s/.*(\<leader\>)/$1/s;
+    $xml =~ s/(\<\/leader\>).*/$1/s;
+    # $xml now pared down to just the <leader> element
+    printf "# %4d of %4d: biblionumber %s : %s\n", $i, $badcount, $row->{biblionumber}, $xml;
+    my $stripped = StripNonXmlChars($row->{marcxml});
+    ($stripped eq $row->{marcxml}) or printf STDERR "%d NON-XML Characters removed!!\n", (length($row->{marcxml}) - length($stripped));
+    my $record = eval { MARC::Record::new_from_xml( $stripped, 'utf8', $marcflavour ) };
+    if ($@ or not $record) {
+        print STDERR "ERROR in MARC::Record::new_from_xml(\$marcxml, 'utf8', $marcflavour): $@\n\tSkipping $row->{biblionumber}\n";
+        next;
+    }
+    if ($fix) {
+        $record->encoding('UTF-8');
+        if (ModBiblioMarc($record, $row->{biblionumber})) {
+            printf "# %4d of %4d: biblionumber %s : <leader>%s</leader>\n", $i, $badcount, $row->{biblionumber}, $record->leader();
+        } else {
+            print STDERR "ERROR in ModBiblioMarc(\$record, $row->{biblionumber})\n";
+        }
+    }
+    $dump and print $row->{marcxml}, "\n";
+}
+
+__END__
+
+=head1 NAME
+
+leader_fix.pl - Repair missing leader position 9 value ("a" for MARC21 - UTF8).
+
+=head1 SYNOPSIS
+
+leader_fix.pl [ -h ] [ -m ] [ -v ] [ -d ] [ -s ] [ -l 7 ] [ -o 4 ] [ -f ]
+
+Help Options:
+   -h --help -?   Brief help message
+   -m --man       Full documentation, same as --help --verbose
+      --version   Prints version info
+
+Feeback Options:
+   -d --dump      Dump MARCXML of biblioitems processed, default OFF
+   -s --summary   Print initial summary of good and bad biblioitems counted, default ON
+   -v --verbose   Increase verbosity of output, default OFF
+
+Run Options:
+   -f --fix       Save repaired leaders to biblioitems.marcxml, 
+   -l --limit     Number of biblioitems to display or fix
+   -o --offset    Number of biblioitems to skip (not displayed or fixed)
+
+=head1 OPTIONS
+
+=over 8
+
+=item B<--fix>
+
+This is the most important option.  Without it, the script just tells you about the problem records.
+With --fix, the script fixes the same records.
+
+=item B<--limit=N>
+
+Like a LIMIT statement in SQL, this contrains the number of records targeted by the script to an integer N.  
+The default is to target all records with bad leaders.
+
+=item B<--offset=N>
+
+Like an OFFSET statement in SQL, this tells the script to skip N of the targetted records.
+The default is 0, i.e. skip none of them.
+
+=back
+
+The binary ON/OFF options can be negated like:
+   B<--nosummary>   Do not display summary.
+   B<--nodump>      Do not dump MARCXML.
+   B<--nofix>       Do not change any records.  This is the default mode.
+
+=head1 DESCRIPTION
+
+Koha expects to have all MARXML records internalized in UTF-8 encoding.  This 
+presents a problem when records have been inserted with the leader/09 showing
+blank for MARC8 encoding.  This script is used to determine the extent of the 
+problem and to fix the affected leaders.
+
+Run leader_fix.pl the first time with no options, and assuming you agree that the leaders
+presented need fixing, run it again with B<--fix>.  
+
+=head1 USAGE EXAMPLES
+
+B<leader_fix.pl>
+
+In the most basic form, displays summary of biblioitems examined
+and the leader from any found without /09 = a.
+
+B<leader_fix.pl --fix>
+
+Fixes the same biblioitems, displaying summary and each leader before/after change.
+
+B<leader_fix.pl --limit=3 --offset=15 --nosummary --dump>
+
+Dumps MARCXML from the 16th, 17th and 18th bad records found.
+
+B<leader_fix.pl -l 3 -o 15 -s 0 -d>
+
+Same thing as previous example in terse form.
+
+=head1 TO DO
+
+Allow biblionumbers to be piped into STDIN as the selection mechanism.
+
+=head1 SEE ALSO
+
+C4::Biblio
+
+=cut
-- 
1.5.6.5




More information about the Koha-patches mailing list