[Koha-devel] switching from marc_words to zebra [LONG]

Paul POULAIN paul.poulain at free.fr
Mon Jul 4 11:10:06 CEST 2005


As discussed with Joshua on IRC, here are my views on how to move from 
the 2.2 MySQL-based DB to the 3.0 Zebra-based DB:

2.2 structure :
===============
In Koha 2.2, there are a lot of tables to manage biblios:
* biblio / items / biblioitems / additionalauthors / bibliosubtitle / 
bibliosubject: they contain the data in a "decoded" form, i.e. one that 
does not depend on the MARC flavour. A title is called "biblio.title", 
not NNN$x where NNN is 245 in MARC21 and 200 in UNIMARC! The primary key 
for a biblio is biblio.biblionumber.

* marc_biblio: a table that contains only a little information:
- biblionumber (biblio PK)
- bibid (MARC PK; a design mistake I made, for sure)
- frameworkcode (used to know which cataloguing form Koha must use; see 
marc_*_structure below)

* marc_subfield_table: this table contains the MARC data, one row per 
subfield, with an "order" column to keep track of MARC field & subfield 
order (so that repeated fields are retrieved in the correct order).

* marc_word: the -huge- table that is the index for searches. The 
structure works correctly with small to medium data sizes, but 50,000 
complete MARC biblios is the upper limit the system can handle.

* marc_*_structure (* being tag & subfield): the table where the 
library defines how its MARC works. It contains, for each 
field/subfield, a lot of information: what the field/subfield contains 
("1st statement of responsibility"), where to put it in a given 
framework (in the MARC editor), whether the value must be copied to the 
"decoded" part of the DB (200$f => biblio.author in UNIMARC), and what 
kind of constraint applies during typing (for example, show a list of 
possible values).

2.2 DB API
==========
The DB API is located in C4/Biblio.pm for biblio/items management & 
C4/SearchMarc.pm for search tools.

The key phrase here is: "heavy use of MARC::Record".

All biblios are stored in a MARC::Record object, as are items. Just 
be warned that all item information must be in the same MARC tag, so an 
item's MARC::Record contains only one MARC::Field.
In UNIMARC it's usually the 995 field, and in MARC21 the 952. (It can be 
anything in Koha, but all item info must be in the same field, and this 
field must contain only item info.)
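The "one field per item" constraint can be sketched like this (a Python stand-in for brevity; Koha itself is Perl with MARC::Record, and the helper names and data layout here are hypothetical):

```python
# Illustrative sketch only: a field is modeled as a (tag, subfields) pair.
ITEM_TAG = "995"  # UNIMARC convention; would be "952" in MARC21

def make_item_record(barcode, branch):
    """An item is a record containing exactly ONE field, the item tag."""
    return [(ITEM_TAG, {"f": barcode, "b": branch})]

def is_valid_item_record(record):
    # All item info must be in one field, and that field must contain
    # only item info -- so the record holds a single ITEM_TAG field.
    return len(record) == 1 and record[0][0] == ITEM_TAG

item = make_item_record("39999000001", "MAIN")
print(is_valid_item_record(item))  # True
```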


Biblio.pm :
------------
All record creation/modification is done through the NEWxxxyyyy subs.
perldoc C4/Biblio.pm will give you some info on how it works.

In a few words: when adding a biblio, Koha calls NEWnewbiblio with a 
MARC::Record.
NEWnewbiblio calls MARCnewbiblio to handle the MARC storage, then 
MARCmarc2koha to build the structure for the non-MARC part of the DB, then 
calls OLDnewbiblio to create the non-MARC (decoded) part of the biblio.
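The call chain can be sketched like this (Python for brevity; the real subs are Perl in C4/Biblio.pm, the sub names are kept, but the bodies here are placeholders, not the real implementation):

```python
def MARCnewbiblio(marc_record):
    """Stand-in: store the MARC form (marc_biblio + marc_subfield_table in 2.2)."""
    print("storing MARC part")

def MARCmarc2koha(marc_record):
    """Stand-in: map MARC subfields to the 'decoded' columns
    (e.g. 200$a -> title, 200$f -> author in UNIMARC)."""
    return {"title": marc_record.get("200a"),
            "author": marc_record.get("200f")}

def OLDnewbiblio(decoded):
    """Stand-in: create the non-MARC rows in biblio/biblioitems; return the PK."""
    print(f"inserting decoded biblio: {decoded}")
    return 1  # biblio.biblionumber

def NEWnewbiblio(marc_record):
    # The call chain described above, in order.
    MARCnewbiblio(marc_record)
    decoded = MARCmarc2koha(marc_record)
    return OLDnewbiblio(decoded)

biblionumber = NEWnewbiblio({"200a": "A title", "200f": "An author"})
```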

SearchMarc.pm :
---------------
The sub that does the search is catalogsearch. The sub parameters:
  $dbh, => DB handle
  $tags, => array with MARC tags/subfields: ['200a','600f']
  $and_or, => array with 'and' or 'or' for each search term
  $excluding, => array with 'not' for each search term that must be excluded
  $operator, => =, <=, >=, ..., contains, start
  $value, => the value to search. Can have a * or % at the end of each word
  $offset, => the offset in the complete list, to return only the needed info
  $length, => the number of results to return
  $orderby, => how to order the search
  $desc_or_asc, => order asc or desc
  $sqlstring => an alternate SQL string that can replace 
tags/and_or/excluding/operator

catalogsearch retrieves a list of bibids, then, for each bibid to 
return, finds the interesting values (those shown in the result list). 
This includes the complete item status (available, issued, not for loan...)
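To illustrate how the parallel arrays combine, here is a rough sketch (Python; the helper is hypothetical and the real Perl sub builds its SQL against marc_word differently):

```python
def build_where(tags, and_or, excluding, operator, values):
    """Combine catalogsearch-style parallel arrays into one boolean
    expression (illustration only, not real query generation)."""
    clauses = []
    for i, tag in enumerate(tags):
        clause = f"{tag} {operator[i]} '{values[i]}'"
        if excluding[i]:                 # the $excluding array
            clause = f"NOT ({clause})"
        if clauses:                      # the $and_or array joins terms
            clause = f"{and_or[i]} {clause}"
        clauses.append(clause)
    return " ".join(clauses)

print(build_where(["200a", "600f"], ["and", "and"], [False, True],
                  ["contains", "="], ["history", "Smith"]))
# → 200a contains 'history' and NOT (600f = 'Smith')
```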

move to ZEBRA
=============

DB structure : biblio handling
------------------------------
I think we can remove marc_biblio, marc_subfield_table, and marc_word 
(of course).
marc_biblio contains only one important piece of information, the 
framework (used to know which cataloguing form Koha must use in the MARC 
editor). It can be moved to the biblio table.
marc_subfield_table contains the MARC data. We could store it either in 
raw ISO2709 format in the biblio table or only in Zebra. I suspect it's 
better to store it twice (in Zebra AND in the biblio table). When you do 
a search (in Zebra), you enter a query and get a list of results. This 
list can be built with the data returned by Zebra. Then the user clicks 
on a given biblio to see the detail.
Here, we can read the raw MARC from the biblio table, and thus avoid a 
useless Zebra call (at the price of an SQL query, but it's based on the 
primary key, so as fast as possible).
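A minimal sketch of the "store twice" idea, using SQLite as a stand-in (table and column names are illustrative; real Koha is Perl on MySQL):

```python
import sqlite3

# The raw ISO2709 blob lives in the biblio table, so the detail page
# can skip Zebra entirely and do a primary-key lookup instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE biblio (biblionumber INTEGER PRIMARY KEY,"
             " frameworkcode TEXT, marc BLOB)")
conn.execute("INSERT INTO biblio VALUES (1, 'BKS', ?)",
             (b"raw ISO2709 bytes",))

def get_detail(biblionumber):
    # Primary-key lookup: as fast as SQL gets, no Zebra round trip.
    row = conn.execute("SELECT marc FROM biblio WHERE biblionumber = ?",
                       (biblionumber,)).fetchone()
    return row[0]  # would be parsed with MARC::Record in real Koha

print(get_detail(1))
```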

marc_word is completely dropped, as search is done with Zebra.

DB structure : items handling
-----------------------------
Item info can be stored in the same structure as for biblios: save the 
raw item MARC data in the items table.

Koha <=> Zebra
--------------
It should really not be a pain to move to Zebra with this structure: 
every call with a MARC::Record (the NEWxxxxyyyy subs) manages the storing 
of the MARC::Record in the marc_* tables. We could replace this code 
with a Zebra insert/update, using biblio.biblionumber as the primary key.
How to manage biblios and items? My idea here would be to store biblio 
+ all item information in Zebra, using a full MARC::Record that 
contains biblio and items.
When NEWnewitem (or NEWmoditem) is called, the full biblio MARC::Record 
is rebuilt from the biblio MARC::Record and all item MARC::Records, and 
updated in Zebra. It can be a little CPU-consuming to update Zebra every 
time an item is modified, but it should not be too much, as in libraries, 
biblios & items don't change that often.

So we would have :
NEWnewbiblio :
* create biblio/biblioitems table entry (including MARC record in raw 
format)
* create zebra entry, with the provided Perl API.

NEWnewitem :
* create the items entry (including the MARC record in raw format)
* read biblio MARC record & previously existing items // append new item 
// update zebra entry with the provided Perl API.

NEWmodbiblio :
* modify the biblio entry (in biblio/biblioitems table)
* read the full MARC record (including items) // update Zebra entry

NEWmoditem :
* modify the item entry (in items table)
* read the full MARC record (including items) // update Zebra entry
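The NEWmoditem flow above might look roughly like this (a Python sketch; zebra_update is a stand-in for the real Perl Zebra API, and the record representation is simplified to (tag, subfields) pairs):

```python
def zebra_update(biblionumber, record):
    # Stand-in for the Zebra insert/update call described above.
    print(f"zebra: record {biblionumber} updated ({len(record)} fields)")

def rebuild_full_record(biblio_fields, item_fields):
    # Full record = biblio fields followed by one item field per item.
    return list(biblio_fields) + list(item_fields)

def NEWmoditem(biblionumber, biblio_fields, items, index, new_item):
    items[index] = new_item                           # modify items entry
    full = rebuild_full_record(biblio_fields, items)  # rebuild full record
    zebra_update(biblionumber, full)                  # push whole record
    return full

biblio = [("200", {"a": "A title"})]
items = [("995", {"f": "barcode-1"})]
full = NEWmoditem(1, biblio, items, 0, ("995", {"f": "barcode-1-fixed"}))
```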


Note that this relies on Zebra returning results in ISO2709. We could 
use XML or other fancy possibilities, but Koha makes heavy use of 
MARC::Record, so we don't need to reinvent the wheel.
What is great with Zebra is that we can index ISO2709 data, but show 
users whatever we want (including XML). So Koha internals can be whatever ;-)

The MARC Editor
===============
Some users think the Koha MARC editor could be improved. The best 
solution would be, imho, to provide an API so a library can use an 
external MARC editor if it prefers.
However, some libraries are happy with what exists, so the MARC editor 
should be kept (& improved where possible), and the marc_*_structure 
tables are still needed. Some fields could probably be removed, as they 
are related to search (like seealso) and will be handled by the Zebra 
config file. This still has to be investigated.

For libraries that prefer an external MARC editor, we could create a 
webservice, where the user does an HTTP request with ISO2709 data in 
the parameters, along with the requested operation.
This should be quite easy to do (the problem being to know how the 
external software can handle this; if someone has an idea or 
experience with this, feel free to post here ;-) )
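One possible shape for such a webservice dispatch (a Python sketch; the operation names and the transport layer are assumptions, not an existing API -- a real version would sit behind an HTTP/CGI layer):

```python
def handle_request(operation, iso2709_data):
    """Dispatch an (operation, raw ISO2709 payload) pair, as an external
    MARC editor might POST it.  Returns an HTTP-like (status, body)."""
    operations = {
        "newbiblio": lambda rec: f"created biblio from {len(rec)} bytes",
        "modbiblio": lambda rec: f"updated biblio from {len(rec)} bytes",
    }
    if operation not in operations:
        return (400, "unknown operation")
    return (200, operations[operation](iso2709_data))

print(handle_request("newbiblio", b"raw ISO2709 record"))
print(handle_request("delete", b""))
```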

Data search
===========
I won't say a lot about search, as someone else has taken the ball on 
this ;-) I just think SearchMarc.pm should be deeply modified! As all 
the information will be in Zebra, it can use only the Zebra search API.

A question remains:
in a biblio/item, the item status (issued, transferred, returned, 
reserved, waiting...) changes quite often. So is it better to save the 
status in the Zebra DB, and thus update the Zebra entry (biblio+items) 
every time an item status is modified, or is it better to keep this 
information only in the items/reserve/issues tables & read it from MySQL 
every time it's needed?
An open question that the Zebra guys can probably answer. NPL has, for 
example, 600,000 issues per year (and hopefully 600,000 returns ;-) ), 
plus some (how many?) reserves, branch transfers...

The authority problem
=====================
Authorities have to be linked to the biblios that use them, so that when 
an authority is modified, all biblios using it are automatically 
modified (script in misc/merge_authority.pl in Koha cvs & 2.2.x).

To keep track of the link, Koha uses a $9 local subfield. In UNIMARC, 
the $3 can also be used for this. I don't know if something equivalent 
to $3 exists in MARC21 (could not find information on 
http://www.loc.gov/marc/).
Many scripts make heavy use of the marc_subfield_table $9 data. For 
example, when you find an authority in the authority module, you get the 
number of biblios using this authority. This number is calculated with 
an SQL query on the $9 subfield.
To handle this with Zebra, we have 2 solutions:
- create a table with just the link (biblionumber / authority number) 
that we could query
- query Zebra for the exact $9 subfield value

I don't know zebra enough to be sure of the best way to do it. Any 
suggestion/experience welcomed.
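The first solution (a link table) could look like this, sketched with SQLite (table and column names are hypothetical):

```python
import sqlite3

# A plain link table mirroring the $9 subfield, so "how many biblios
# use this authority?" becomes a cheap SQL count instead of a Zebra query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE auth_link (biblionumber INTEGER,"
             " authid INTEGER)")
conn.executemany("INSERT INTO auth_link VALUES (?, ?)",
                 [(1, 42), (2, 42), (3, 7)])  # two biblios use authority 42

def biblios_using(authid):
    """Count the biblios linked to a given authority number."""
    return conn.execute("SELECT COUNT(*) FROM auth_link WHERE authid = ?",
                        (authid,)).fetchone()[0]

print(biblios_using(42))  # 2
```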

The authority problem (another one...)
======================================
Authorities are MARC::Records too... (without items)
So they also have auth_structure & auth_word & all the info that exists 
for biblios (except the item level, as there are no "authority" items).
So we could imagine having 2 Zebra databases: one for biblios and one 
for authorities.
Everything earlier in this mail can be copied here. That's something 
we could investigate after moving MARC biblios to Zebra, as we would 
have more experience with the tool.

"Trivial" querying
==================
Someone may ask "why should we keep the biblio/biblioitems/items 
tables?", as everything is in Zebra.
First, as Koha is multi-MARC, remember it's very complex to know what a 
"title" is just from a MARC record.
The same person will answer "yes, but with Biblio/MARCmarc2koha, you can 
transform your MARC::Record into a semantically meaningful hash".

I answer to this:
yes, but without those tables, SQL-querying the database would be 
completely impossible for developers, as we could not know in MySQL "if 
we have authors filled by bulkmarcimport", or "do we have the 
itemcallnumber correctly modified for item #158763".
That's a second reason to keep those tables in MySQL.
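The kind of ad-hoc developer query those decoded tables allow, sketched with SQLite (the schema is illustrative; with only raw MARC in Zebra, no such plain SQL would be possible):

```python
import sqlite3

# Did bulkmarcimport fill the author column?  Trivial against the
# decoded biblio table, impossible against a raw MARC blob.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE biblio (biblionumber INTEGER PRIMARY KEY,"
             " title TEXT, author TEXT)")
conn.executemany("INSERT INTO biblio VALUES (?, ?, ?)",
                 [(1, "A title", "An author"),
                  (2, "No author here", None)])

missing = conn.execute("SELECT COUNT(*) FROM biblio"
                       " WHERE author IS NULL").fetchone()[0]
print(missing)  # 1
```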

-- 
Paul POULAIN
Independent consultant in free software
French-speaking Koha coordinator (free ILS, http://www.koha-fr.org)



