[Koha-devel] Experimental support for MARC records as a proper datatype in the PostgreSQL database

Marc Balmer marc at msys.ch
Sun Jun 10 17:43:09 CEST 2012


Sitting in one of the nice pubs in Edinburgh during KohaCon, I had this
idea to add MARC records as a proper datatype to the PostgreSQL database
server.  After a discussion with Marc Véron and Dobrica Pavlinusic about
what that could mean, I decided to just try it and I have now a basic
implementation (or, more a proof of concept).  So here is some
information on this:

If MARC records are a proper datatype, that means they are stored right
in the database, are backed-up, can be restored, replicated etc. just
with the standard database tools.  If then a function is provided to
access individual fields of a MARC record, then this can be used in SQL
expressions, e.g. for selects or to create views etc.  As PostgreSQL
supports functional indexes, you can create indexes on individual MARC
fields, giving you super fast access to your data.

How does it look?

There is a datatype called 'marc' for now:

CREATE TABLE books (
    id serial,
    sig varchar(16),
    marc_record marc
);

MARC records are loaded into the database as raw, binary records,
encoded in hexadecimal:

INSERT INTO books (sig, marc_record) VALUES ('a01b', '3030383830.....');

To access individual MARC fields, the function 'marc_field()' was
created, it returns VARCHAR:

SELECT id, marc_field(marc_record, '020') AS isbn FROM books WHERE sig =
'a01b';

Of course MARC fields can be used to search data:

SELECT SELECT id, marc_field(marc_record, '020') AS isbn FROM books
WHERE marc_field(marc_record, '245') like 'whatever%';

An index on a specific field can easily be created:

CREATE INDEX books_isbn_idx ON books (marc_field(marc_record, '020'));

As a syntactic sugar, the expression marc_record@'020' is equal to
marc_field(marc_record, '020').

Marijana Glavica kindly let me use a MARC database of ~250'000 records
to make some tests, here are some real world examples:

books=# \d test_marc
                         Table "public.test_marc"
 Column |  Type   |                       Modifiers
--------+---------+--------------------------------------------------------
 id     | integer | not null default nextval('test_marc_id_seq'::regclass)
 marc21 | marc    |

books=# select count(id) from test_marc;
 count
--------
 246727
(1 row)

How many croatian books do they have?

books=# select count(id) from test_marc where
substring(marc_field(marc21, '008'), 37, 3) = 'hrv';
 count
-------
 52582
(1 row)

Of course much more complex queries are possible with this, and using
the right indexes it is really, really fast.  One query I tested when
from 8.4 seconds to 0.21 ms with the right index.  That's a speedup of
40'000.

Thanks to Marijana, Dobrica, and Marc for feedback, interesting
discussions and ideas!

Feedback and suggestions are of course more than welcome.

- Marc


More information about the Koha-devel mailing list