[Koha-bugs] [Bug 7818] New: support DOM mode for Zebra indexing of bibliographic records

Fri Mar 23 22:07:47 CET 2012

http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=7818

          Priority: P5 - low
 Change sponsored?: ---
            Bug ID: 7818
          Assignee: gmcharlt at gmail.com
           Summary: support DOM mode for Zebra indexing of bibliographic
                    records
        QA Contact: koha.sekjal at gmail.com
          Severity: enhancement
    Classification: Unclassified
                OS: All
          Reporter: gmcharlt at gmail.com
          Hardware: All
            Status: NEW
           Version: unspecified
         Component: Architecture, internals, and plumbing
           Product: Koha

Koha DOM Indexing for bibliographic records
-------------------------------------------

Overview
========
Koha uses the Zebra search engine to implement bibliographic and authority
record search.  Zebra comes with several filters that support the indexing of
various document types, including MARC and XML.  Koha currently uses Zebra’s
GRS-1 filter for indexing MARC and Zebra’s DOM filter for indexing authority
records (represented as MARCXML).

To oversimplify a bit, the GRS-1 filter allows for constructing the following
kinds of search indexes for MARC records:

* Keyword and phrase indexes of the contents of a MARC subfield
* Keyword and phrase indexes of the contents of a fixed field

What the GRS-1 filter cannot provide, however, is the ability to index a phrase
that spans subfield boundaries.  For example, given the subject heading:

650 $a Cats $x pharmacology $x catnip

it is not possible to construct a search that will reliably match all instances
of the phrase “cats pharmacology catnip”.  This in turn means that it is not
possible to reliably search on a bibliographic or authority heading without
extensive post-search pruning of the result set.
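
To illustrate the limitation, here is a minimal Python sketch.  It is not Koha
or Zebra code, and the function names are hypothetical.  It models index keys
as plain lowercase strings and contrasts phrase keys built one subfield at a
time with a single phrase key built from the whole heading.

```python
# Illustrative only: index keys are modelled as plain lowercase strings.
heading = [("a", "Cats"), ("x", "pharmacology"), ("x", "catnip")]


def per_subfield_phrases(subfields):
    """One phrase key per subfield; no key crosses a subfield boundary."""
    return [value.lower() for _code, value in subfields]


def whole_heading_phrase(subfields):
    """One phrase key for the entire heading, spanning all subfields."""
    return " ".join(value.lower() for _code, value in subfields)


query = "cats pharmacology catnip"
print(query in per_subfield_phrases(heading))  # False
print(query == whole_heading_phrase(heading))  # True
```

Only the second form gives a phrase search something to match against without
pruning the result set afterwards.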

The DOM filter, in contrast, takes an XML document (in this case, a MARCXML
record) and applies a stylesheet to generate another XML document that contains
all of the strings to be indexed by Zebra.  This allows for much more
flexibility, and in particular, allows for retrieving records by normalized
heading.

Furthermore, since the indexing definitions are ultimately expressed as an XML
stylesheet, the DOM filter allows for an arbitrary amount of manipulation of
tokens and phrases to be made during indexing.

Goals
=====

The goals of this development project are to:

* Add support for the Zebra DOM filter for indexing bibliographic records in
Koha, similar to existing support for DOM indexing of authority records.
* Make use of DOM indexing to allow for complete and partial searches of
headings fields.
* Provide indexing definitions to allow for searching of title phrases that
span subfield boundaries in the 245.

Components
==========

Indexing Preparation
====================
A new class hierarchy (Koha::Indexer) will be created to manage preparation of
biblio and authority XML records for indexing.  In the case of a bib record,
the steps are:

biblionumber
→ MARC::Record representation of bib
→ one or more Koha::Indexer::MARC::RecordNormalizer routines that perform
operations on the MARC::Record object.  The one normalizer to be added for this
project will be Koha::Indexer::MARC::RecordNormalizer::EmbedItems.
→ MARCXML representation of bib
→ Zebra indexing pipeline

Two tables will be defined to allow for specifying pre-indexing record
normalizer profiles:

record_norm_profile
  code varchar2(10) PK
  name varchar2(100)
  metadata_type varchar2(10) -- currently only ‘MARC’
  record_type varchar2(10) -- ‘biblio’ or ‘authority’

record_norm_profile_normalizer
  seq int -- order in which normalizer is to be applied
  record_norm_profile FK references record_norm_profile.code
  normalizer varchar2(100) -- references available
                           -- Koha::Indexer::$metadata_type::RecordNormalizer classes
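
As a rough sketch of how a profile could drive the pipeline, the Python below
applies normalizers in ascending seq order.  It is not the proposed Perl
implementation: the Record class, the registry, the "default" profile code and
the context argument are all hypothetical, and using tag 952 for embedded items
is an assumption made only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Stand-in for a MARC::Record: a list of (tag, value) fields."""
    fields: list = field(default_factory=list)


def embed_items(record, context):
    """Hypothetical analogue of EmbedItems: append one field per item."""
    extra = [("952", item) for item in context.get("items", [])]
    return Record(record.fields + extra)


# Registry of available normalizers, keyed by the name stored in the profile.
NORMALIZERS = {"EmbedItems": embed_items}

# Rows shaped like record_norm_profile_normalizer: (seq, profile code, normalizer).
PROFILE_ROWS = [(1, "default", "EmbedItems")]


def normalize(record, profile_code, context):
    """Apply the profile's normalizers in ascending seq order."""
    rows = sorted(row for row in PROFILE_ROWS if row[1] == profile_code)
    for _seq, _code, name in rows:
        record = NORMALIZERS[name](record, context)
    return record


bib = Record([("245", "Cats and catnip")])
result = normalize(bib, "default", {"items": ["item-1", "item-2"]})
print(result.fields)
```

Each normalizer returns a new record rather than modifying its input, which
keeps the steps independent of one another.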

The normalization pipeline will be used by rebuild_zebra.pl.

DOM indexing
============

The koha-indexdefs-to-zebra.xsl stylesheets currently used for DOM indexing of
authorities will be made available to biblio as well.  In addition, XSLT
functions will be added that can:

-- trim nonfiling characters
-- calculate the applicable thesaurus of a bibliographic subject heading

A script (misc/maintenance/convert_record_abs_to_dom.pl) will be written that
will convert indexing definitions from record.abs to DOM format.  For example:

melm 022$a      ISSN:w,Identifier-standard:w

would become something like this:
  <kohaidx:index_subfields tag="022" subfields="a">
    <kohaidx:target_index>ISSN:w</kohaidx:target_index>
    <kohaidx:target_index>Identifier-standard:w</kohaidx:target_index>
  </kohaidx:index_subfields>
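
The conversion of that one line can be sketched as follows.  This is a Python
illustration rather than the planned Perl script; it handles only the simple
"melm TAG$SUBFIELD targets" form shown above, and the function name is
hypothetical.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Matches e.g. "melm 022$a      ISSN:w,Identifier-standard:w".
MELM_RE = re.compile(r"^melm\s+(\d{3})\$(\w)\s+(\S+)$")


def melm_to_dom(line):
    """Convert one simple record.abs melm line to a DOM index definition."""
    match = MELM_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unsupported record.abs line: {line!r}")
    tag, subfield, targets = match.groups()
    out = [
        f"<kohaidx:index_subfields tag={quoteattr(tag)} "
        f"subfields={quoteattr(subfield)}>"
    ]
    for target in targets.split(","):
        out.append(
            f"  <kohaidx:target_index>{escape(target)}</kohaidx:target_index>"
        )
    out.append("</kohaidx:index_subfields>")
    return "\n".join(out)


print(melm_to_dom("melm 022$a      ISSN:w,Identifier-standard:w"))
```

A real converter would also need to cover the other record.abs directives,
which this sketch rejects with a ValueError.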

In addition to converting existing record.abs config files, new standard
indexing definitions will be added to index authority-controlled heading
fields the same way as they are currently indexed in authority records.  For
example, topical subject headings will use:

  <kohaidx:index_heading tag="650" subfields="abvxyz" subdivisions="vxyz">
    <kohaidx:target_index>Subject-topical-heading:w</kohaidx:target_index>
    <kohaidx:target_index>Subject-topical-heading:p</kohaidx:target_index>
    <kohaidx:target_index>Subject-topical-heading:s</kohaidx:target_index>
    <kohaidx:target_index>Heading:w</kohaidx:target_index>
    <kohaidx:target_index>Heading:p</kohaidx:target_index>
    <kohaidx:target_index>Heading:s</kohaidx:target_index>
  </kohaidx:index_heading>

Headings index definitions will be provided in at least two variants:

* heading as a phrase
* heading as a phrase with insertions to include subdivision markers and
thesaurus -- this type is meant to be used for future work for linking bib
records to authority headings

Searching enhancements
======================
The new standard indexing definitions will include two whole-245-as-phrase
indexes, with one variant including nonfiling characters and the other
excluding them.

OPAC display enhancements
=========================
The OPAC display XSLT (MARC21slim2OPACDetail.xsl) will be changed so that
subject heading links will construct phrase searches instead of searches on the
subfield $9.

The Koha interface should be updated to allow users to select partial
subject headings from the OPAC and staff client Details pages.
For example, on a record with subject heading “History -- United States -- 18th
Century -- Periodicals”, clicking:

* History would search for “History”
* United States would search for “History -- United States”
* 18th Century would search for “History -- United States -- 18th Century”
* Periodicals would search for “History -- United States -- 18th Century --
Periodicals”
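
The link targets in this example are the cumulative prefixes of the heading.
A minimal Python sketch of that calculation follows; the function name and the
assumption that the displayed separator is " -- " are illustrative only.

```python
SEPARATOR = " -- "


def partial_heading_searches(heading):
    """Pair each heading component with the phrase its link would search for."""
    parts = heading.split(SEPARATOR)
    return [
        (part, SEPARATOR.join(parts[: i + 1]))
        for i, part in enumerate(parts)
    ]


example = "History -- United States -- 18th Century -- Periodicals"
for label, phrase in partial_heading_searches(example):
    print(f"{label!r} -> {phrase!r}")
```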

The advanced search page template will include an exact-title search that will
perform a whole-phrase search on the new whole-245-as-phrase indexes.

The subject search in the masthead and advanced search templates will be
changed to use the new subject heading index, allowing for partial heading
searches.

Installation and upgrade considerations
=======================================
During installation, a new query will be added to Makefile.PL to specify the
biblio indexing mode.  The default value will be DOM.
During upgrade, an option will be presented to convert the indexing definitions
to DOM and reindex the biblio records.  The specific upgrade steps will
include:

-- running misc/maintenance/convert_record_abs_to_dom.pl to create a
biblio-koha-indexdefs.xml that includes the existing indexing definitions as
well as the new standard bib headings indexes.
-- using koha-indexdefs-to-zebra.xsl and an XSLT processor to create a
biblio-zebra-indexdefs.xml to be used as the stylesheet that Zebra uses
directly to convert incoming MARCXML records to their indexed phrases.
-- running rebuild_zebra.pl -b -r on the entire database
