[Koha-devel] proposed marc schema

Tonnesen Steve tonnesen at cmsd.bc.ca
Thu May 30 11:34:02 CEST 2002


I ran some benchmarks on Sergey's proposed MARC schema, because I was a
little concerned about the performance cost of having to do a join to
retrieve a biblio.


To recap, here are the schemas I used.  The first table is my
"all-in-one" schema; the next two are Sergey's:

CREATE TABLE marc_subfield_table (
                subfieldid bigint(20) unsigned NOT NULL auto_increment,
                tagid bigint(20) NOT NULL default '0',
                tag char(3) NOT NULL default '',
                bibid bigint(20) NOT NULL default '0',
                subfieldorder tinyint(4) NOT NULL default '0',
                subfieldcode char(1) NOT NULL default '',
                subfieldvalue varchar(255) default NULL,
                valuebloblink bigint(20) default NULL,
                PRIMARY KEY (subfieldid),
                KEY (bibid,tagid,tag,subfieldcode),
                KEY (bibid,tag,subfieldcode,subfieldvalue)
                ) TYPE=MyISAM;
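
For concreteness, with the all-in-one table a full biblio comes back in
a single indexed lookup.  A minimal sketch, using a made-up bibid of 42:

  -- all-in-one schema: one lookup, no join (bibid 42 is a placeholder)
  SELECT tag, subfieldorder, subfieldcode, subfieldvalue
    FROM marc_subfield_table
   WHERE bibid = 42
   ORDER BY tagid, subfieldorder;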


CREATE TABLE marc_field_table_sergey (
       fieldid  bigint(20) unsigned NOT NULL auto_increment,
       bibid    bigint(20) NOT NULL default '0',
       tagid    bigint(20) NOT NULL default '0',
       tag      char(3) NOT NULL default '',
       PRIMARY KEY (fieldid),
       KEY (bibid),
       KEY (tagid),
       KEY (tag)
);

CREATE TABLE marc_subfield_table_sergey (
       subfieldid  bigint(20) unsigned NOT NULL auto_increment,
       fieldid bigint(20),
       subfieldorder tinyint(4) NOT NULL default '0',
       subfieldcode char(1) NOT NULL default '',
       subfieldvalue varchar(255) default NULL,
       valuebloblink bigint(20) default NULL,
       PRIMARY KEY (subfieldid),
       KEY (fieldid)
);
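
Retrieving the same biblio under Sergey's schema takes the join I was
worried about.  A minimal sketch, again with a made-up bibid of 42:

  -- split schema: field and subfield tables joined on fieldid
  SELECT f.tag, s.subfieldorder, s.subfieldcode, s.subfieldvalue
    FROM marc_field_table_sergey f
    JOIN marc_subfield_table_sergey s ON s.fieldid = f.fieldid
   WHERE f.bibid = 42
   ORDER BY f.fieldid, s.subfieldorder;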

Here are the results of my benchmark:

getdata-steve:  0:08.29 0:08.36 0:08.24 0:08.25 0:08.28
getdata-sergey: 0:07.96 0:07.86 0:07.88 0:07.87 0:07.98

Times shown are elapsed times, five runs each, for retrieving 2000 random
records from a database of 5000 entries.  It's close, but Sergey's schema
is marginally faster (about 4-5% here), and it's unarguably better SQL, so
let's go with splitting the tag and subfield tables.
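
The getdata-* driver scripts aren't included here, but the core of one
benchmark iteration looks something like this in plain SQL (a sketch only,
using the all-in-one table as the example):

  -- pick a random bibid, then fetch that record; the driver repeats
  -- this 2000 times (assumes bibids run densely from 1 to 5000)
  SET @r = FLOOR(1 + RAND() * 5000);
  SELECT tag, subfieldorder, subfieldcode, subfieldvalue
    FROM marc_subfield_table
   WHERE bibid = @r
   ORDER BY tagid, subfieldorder;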

Note that I didn't use 80,000 records in my benchmark this time, mainly
because I didn't want to wait to generate a database that large.  With
only 5000 entries in the database, mysqld was able to cache the entire
database in RAM on my test machine.  There was basically no hard drive
access during the benchmark (after I ran it a couple of times to make sure
all the data was cached).

Before the data was all cached, the lookups were taking more like 12-15
seconds for 2000 records.

Steve.
