[Koha-bugs] [Bug 7419] Add authority deduplication script
bugzilla-daemon at bugs.koha-community.org
Wed Sep 4 21:01:24 CEST 2013
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=7419
Jared Camins-Esakov <jcamins at cpbibliography.com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #18714|0                           |1
        is obsolete|                            |
--- Comment #35 from Jared Camins-Esakov <jcamins at cpbibliography.com> ---
Created attachment 20783
  --> http://bugs.koha-community.org/bugzilla3/attachment.cgi?id=20783&action=edit
Bug 7419: General-purpose record deduplicator
This patch adds a script for deduplicating records. It is most useful
for authority records but by design could be easily extended for use
with bibliographic records, if someone had a good use case.
See the follow-up for an updated test plan.
Complete POD documentation:
SYNOPSIS
dedup_records.pl --match=1 -a
dedup_records.pl --match="LC-card-number/010a" --select="date" \
--limit="authid > 367123592" -a
dedup_records.pl --match="Match/100abcdefghijklmnopqrstuvwxyz" \
--select="source=DLC" --select="date" \
--limit="authtypecode='PERSO_NAME'" -a
DESCRIPTION
This script will identify duplicate records, and either suggest
that you merge them (in the case of bibliographic records) or
automatically merge them for you (in the case of authority records).
OPTIONS
--help Prints this help
-v|--verbose
Print verbose log information (warning: very verbose!).
-t|--test
Do not actually make any changes to the database, just
report what changes would be made.
-r|--report
Print a report of what happened during the run.
-l|--limit=s
Only process those records that match the user-specified
WHERE clause (the WHERE is implied and should not be
included on the command line).
-a|--authorities
Check for duplicate authorities rather than duplicate
bibliographic records.
-s|--select=s
Repeatable. Specify how to identify which record to
prefer. See the section on SELECTORS below.
-m|--match=s
Specifies the matching rule to use. This can be the
numeric ID of a matching rule that you have already
configured (preferred), or you can specify a matching
rule on the command-line in the following format:
<index1>/<tag1><subfield1>[##<index2>/<tag2><subfield2>[##...]]
Examples:
at/152b##he-main/2..a##he/2..bxyzt##ident/009@
authtype/152b##he-main,ext/2..a##he,ext/2..bxyz
sn,ne,st-numeric/001##authtype/152b##he-main,ext/2..a##he,ext/2..bxyz
-c|--check=s
Only relevant when you are using a matching rule
specified on the command line. Specifies sanity checks to
use to ensure that the records are really duplicate. The
format is <tag1><subfields1>[,<tag2><subfields2>[,...]]
Examples:
200abxyz will check subfields a,b,x,y,z of 200 fields
009@,152b will check 009 data and 152$b subfields
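
To make the --match and --check formats above concrete, here is a minimal Python sketch of how such strings could be split into their parts. It is not the script's actual (Perl) implementation. The names MatchPoint, parse_match_rule and parse_checks are hypothetical, and the assumptions that a tag is always exactly three characters and that everything after it is the subfield list (with "@" standing for the field's own data) are inferred only from the examples.

```python
from typing import NamedTuple


class MatchPoint(NamedTuple):
    """One <index>/<tag><subfields> component of a command-line matching rule."""
    index: str      # search index, possibly with comma-separated modifiers
    tag: str        # assumed to be three characters; "." appears as a wildcard in the examples
    subfields: str  # subfield codes; "" or "@" when no subfield codes are given


def parse_match_rule(rule: str) -> list[MatchPoint]:
    """Split a rule such as 'at/152b##he-main/2..a' into match points."""
    points = []
    for part in rule.split("##"):
        index, sep, field = part.partition("/")
        if not sep or not index or len(field) < 3:
            raise ValueError(f"malformed match point: {part!r}")
        points.append(MatchPoint(index, field[:3], field[3:]))
    return points


def parse_checks(spec: str) -> list[tuple[str, str]]:
    """Split a --check value such as '009@,152b' into (tag, subfields) pairs."""
    checks = []
    for part in spec.split(","):
        if len(part) < 3:
            raise ValueError(f"malformed check: {part!r}")
        checks.append((part[:3], part[3:]))
    return checks
```

For example, parse_match_rule("at/152b##ident/009@") yields two match points, one on index "at" for tag 152 subfield b and one on index "ident" for tag 009 with "@" in place of subfield codes.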
SELECTORS
This script supports a number of selectors for choosing which
record is "better."
score Prefer the record which is the best match based on the
specified matching rule. This will probably only be
useful in cases where the matching rule will not match
the source record, since the source record will
automatically be given a score of 2 * the matching rule
threshold if it wasn't picked up by the matcher.
date Prefer the record which is newer based on the 005 field.
source=ABC
MARC21 only. Prefer records which come from ABC based on
the 003 field.
usage Authorities only. Prefer the record used in the most
bibliographic records.
ppn UNIMARC only. Prefer records which have a PPN in the 009
field.
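
As an illustration of how repeated --select options might combine, the following Python sketch ranks candidate records by each selector in turn. It assumes that selectors given earlier on the command line take priority and later ones act as tie-breakers; the text above does not state this. The record layout (a dict with "003", "005", "009", "usage" and "score" keys) and the function names are hypothetical and are not taken from the script.

```python
def selector_key(selector):
    """Return a function mapping a record to a preference value (higher is better)."""
    if selector == "date":
        # 005 timestamps in the same format sort chronologically as strings.
        return lambda rec: rec.get("005", "")
    if selector.startswith("source="):
        wanted = selector.split("=", 1)[1]
        return lambda rec: rec.get("003") == wanted
    if selector == "usage":
        return lambda rec: rec.get("usage", 0)
    if selector == "score":
        return lambda rec: rec.get("score", 0)
    if selector == "ppn":
        return lambda rec: bool(rec.get("009"))
    raise ValueError(f"unknown selector: {selector!r}")


def pick_preferred(records, selectors):
    """Pick the record that ranks highest; earlier selectors win (an assumption)."""
    keys = [selector_key(s) for s in selectors]
    return max(records, key=lambda rec: tuple(key(rec) for key in keys))
```

With --select="source=DLC" --select="date", this sketch would prefer the newest record whose 003 field is DLC, falling back to the newest record overall only when no candidate comes from DLC.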