[Koha-bugs] [Bug 7419] Add authority deduplication script

Wed Sep 4 21:01:24 CEST 2013

http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=7419

Jared Camins-Esakov <jcamins at cpbibliography.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #18714|0                           |1
        is obsolete|                            |

--- Comment #35 from Jared Camins-Esakov <jcamins at cpbibliography.com> ---
Created attachment 20783
  -->
http://bugs.koha-community.org/bugzilla3/attachment.cgi?id=20783&action=edit
Bug 7419: General-purpose record deduplicator

This patch adds a script for deduplicating records. It is most useful
for authority records but, by design, could easily be extended to
bibliographic records if someone has a good use case.

See the follow-up for an updated test plan.

Complete POD documentation:

SYNOPSIS
         dedup_records.pl --match=1 -a

         dedup_records.pl --match="LC-card-number/010a" --select="date" \
           --limit="authid > 367123592" -a

         dedup_records.pl --match="Match/100abcdefghijklmnopqrstuvwxyz" \
           --select="source=DLC" --select="date" \
           --limit="authtypecode='PERSO_NAME'" -a

DESCRIPTION
       This script will identify duplicate records, and either suggest
       that you merge them (in the case of bibliographic records) or
       automatically merge them for you (in the case of authority records).

OPTIONS
       --help  Prints this help

       -v|--verbose
               Print verbose log information (warning: very verbose!).

       -t|--test
               Do not actually make any changes to the database, just
               report what changes would be made.

       -r|--report
               Print a report of what happened during the run.

       -l|--limit=s
               Only process those records that match the user-specified
               WHERE clause (the WHERE is implied and should not be
               included on the command line).

       -a|--authorities
               Check for duplicate authorities rather than duplicate
               bibliographic records.

       -s|--select=s
               Repeatable. Specify how to identify which record to
               prefer. See the section on SELECTORS below.

       -m|--match=s
               Specifies the matching rule to use. This can be the
               numeric ID of a matching rule that you have already
               configured (preferred), or you can specify a matching
               rule on the command-line in the following format:

                   <index1>/<tag1><subfield1>[##<index2>/<tag2><subfield2>[##...]]

               Examples:

                   at/152b##he-main/2..a##he/2..bxyzt##ident/009@
                   authtype/152b##he-main,ext/2..a##he,ext/2..bxyz

                   sn,ne,st-numeric/001##authtype/152b##he-main,ext/2..a##he,ext/2..bxyz

       -c|--check=s
               Only relevant when you are using a matching rule
               specified on the command line. Specifies sanity checks to
               use to ensure that the records are really duplicate. The
               format is <tag1><subfields1>[,<tag2><subfields2>[,...]]

               Examples:

                   200abxyz will check subfields a,b,x,y,z of 200 fields
                   009@,152b will check 009 data and 152$b subfields
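
The --match and --check formats above can be read as follows. This is a
minimal Python sketch, not the script's own parser: the function names
parse_match_rule and parse_checks are hypothetical, and it assumes that
a tag is always the first three characters of each field specification
(wildcards such as "2.." included) and that everything after the tag is
the subfield list.

```python
def parse_match_rule(rule):
    """Split '<index>/<tag><subfields>[##...]' into a list of matchpoints."""
    matchpoints = []
    for part in rule.split("##"):
        # Everything before the first '/' is the index specification; it is
        # kept verbatim (it may contain commas, as in 'he-main,ext').
        index, _, field = part.partition("/")
        # Assumption: the tag is exactly three characters long.
        matchpoints.append(
            {"index": index, "tag": field[:3], "subfields": field[3:]}
        )
    return matchpoints


def parse_checks(spec):
    """Split '<tag><subfields>[,...]' into a list of sanity checks."""
    checks = []
    for part in spec.split(","):
        # Same assumption as above: three-character tag, then subfields.
        checks.append({"tag": part[:3], "subfields": part[3:]})
    return checks


if __name__ == "__main__":
    for matchpoint in parse_match_rule("at/152b##he-main/2..a##ident/009@"):
        print(matchpoint)
    for check in parse_checks("009@,152b"):
        print(check)
```

Under these assumptions, "200abxyz" yields tag 200 with subfields
"abxyz", and "009@" yields tag 009 with "@" in place of a subfield code.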

SELECTORS
       This script supports a number of selectors for choosing which
       record is "better."

       score   Prefer the record which is the best match based on the
               specified matching rule. This will probably only be
               useful in cases where the matching rule will not match
               the source record, since the source record will
               automatically be given a score of 2 * the matching rule
               threshold if it wasn't picked up by the matcher.

       date    Prefer the record which is newer based on the 005 field.

       source=ABC
               MARC21 only. Prefer records which come from ABC based on
               the 003 field.

       usage   Authorities only. Prefer the record used in the most
               bibliographic records.

       ppn     UNIMARC only. Prefer records which have a PPN in the 009
               field.
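
One way to picture repeated --select options is as successive
tie-breakers. The Python sketch below is illustrative only. It assumes
that selectors are applied in the order given (the documentation above
does not say how they combine), it represents each record as a plain
dict with hypothetical keys "003", "005" and "usage", and it covers only
the date, source and usage selectors.

```python
def selector_key(record, selectors):
    """Build a sort key for a record; a larger key means 'more preferred'."""
    key = []
    for selector in selectors:
        if selector == "date":
            # 005 holds a timestamp of the form yyyymmddhhmmss.f, so plain
            # string comparison orders records from older to newer.
            key.append(record.get("005", ""))
        elif selector.startswith("source="):
            wanted = selector.split("=", 1)[1]
            # True sorts above False, so records from the wanted source win.
            key.append(record.get("003") == wanted)
        elif selector == "usage":
            # Hypothetical key: number of bibliographic records that use
            # this authority.
            key.append(record.get("usage", 0))
        else:
            raise ValueError("selector not covered by this sketch: " + selector)
    return tuple(key)


def pick_preferred(records, selectors):
    """Return the record ranked highest by the selectors, in order."""
    return max(records, key=lambda record: selector_key(record, selectors))


if __name__ == "__main__":
    candidates = [
        {"id": 1, "003": "DLC", "005": "20120101000000.0"},
        {"id": 2, "003": "OCoLC", "005": "20130904120000.0"},
        {"id": 3, "003": "DLC", "005": "20130101000000.0"},
    ]
    print(pick_preferred(candidates, ["source=DLC", "date"])["id"])
```

With --select="source=DLC" --select="date", this sketch prefers the
newest of the DLC records; with --select="date" alone it prefers the
newest record overall.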
