[Koha-devel] Catalogue API (fwd)

Alan Millar am12 at bolis.com
Tue May 21 23:33:03 CEST 2002


Hi Steve-  Sorry I missed the IRC chat.  However, the IRC web logs are 
great!  Glad I could catch up at least after the fact.  You've written 
an excellent summary here.  I'm beginning to see the light :-) and I 
think Pat's IRC comment is correct that we do have more in common in 
view here than it may have first appeared.

> I'm hopeful that we can develop a cataloguing API that will alleviate some
> of these problems.  

> getinfo(1563,'author')  returns Herman Melville  from the 100 a subfield
> setinfo(1563,'author', 'Melville, Herman')  updates author info 

Yes, this will go a long way towards simplifying it for a random 
contributing developer like myself.  I can imagine that such a map 
could be database driven, with a table of values along the lines of:

  author,100,a
  title,245,a
  isbn,020,a

and so on.  Perhaps even multiple MARC flavors could be handled that way 
with no (or minimal) coding changes.
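To make the idea concrete, here is a minimal sketch of that table-driven map in Python.  The dict stands in for the database table of (field, tag, subfield) rows, and `marc_store` is a purely hypothetical stand-in for the real MARC tables; none of these names come from the actual Koha code.

```python
# Hypothetical field map: logical name -> (MARC tag, subfield code).
# In practice this would be loaded from a database table, so a second
# MARC flavor would just be a second set of rows.
FIELD_MAP = {
    "author": ("100", "a"),
    "title":  ("245", "a"),
    "isbn":   ("020", "a"),
}

# Stand-in for the MARC data store: {bibid: {(tag, subfield): value}}.
marc_store = {
    1563: {("100", "a"): "Herman Melville"},
}

def getinfo(bibid, field):
    """Resolve a logical field name through the map, then fetch the value."""
    tag, subfield = FIELD_MAP[field]
    return marc_store.get(bibid, {}).get((tag, subfield))

def setinfo(bibid, field, value):
    """Resolve the field name the same way, then store the new value."""
    tag, subfield = FIELD_MAP[field]
    marc_store.setdefault(bibid, {})[(tag, subfield)] = value
```

The point is that contributing code never mentions "100a" directly, so remapping a field is a data change, not a code change.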

> 1.  acqui.simple and marcimport.pl == MARC compatible
> 
>   acqui.simple and marcimport.pl will import marc records and "squash" the
>   data into the standard Koha database.  There can be considerable loss of
>   information in this process.  

Right.  It would only be viable if combined with the next item, Marc 
tables for non-kohadb data:

> 2.  Use Marc tables only for data that is not already represented by the
>     koha databases.
> 
>   I think this is a bad idea.  It'd be too confusing about what
>   information was stored where.

Perhaps it is.  However, it might work if we adopted some sort of 
table-driven field structure like the above, such as

    title = 245a = kohadb:biblio.title
    author = 100a = kohadb:biblio.author
    lccall = 050a = marc
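A sketch of how that routing column might work, again with hypothetical names (the `kohadb` and `marc_only` dicts are stand-ins, not the real schema):

```python
# Logical name -> (tag, subfield, storage location).  The third column
# says whether the value lives in a koha table or only in the MARC tables.
FIELD_MAP = {
    "title":  ("245", "a", "kohadb:biblio.title"),
    "author": ("100", "a", "kohadb:biblio.author"),
    "lccall": ("050", "a", "marc"),
}

# Stand-in stores: koha columns keyed by "table.column", MARC-only
# data keyed by (tag, subfield).
kohadb = {1563: {"biblio.title": "Moby Dick",
                 "biblio.author": "Melville, Herman"}}
marc_only = {1563: {("050", "a"): "PS2384 .M6"}}

def getinfo(bibid, field):
    """Route a read to the koha tables or the MARC-only tables via the map."""
    tag, subfield, store = FIELD_MAP[field]
    if store.startswith("kohadb:"):
        return kohadb.get(bibid, {}).get(store[len("kohadb:"):])
    return marc_only.get(bibid, {}).get((tag, subfield))
```

Callers still just say `getinfo(bibid, 'lccall')`; the map decides where the data actually lives, which is what would keep the "what is stored where" confusion out of application code.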

>   I envision separate indexes being
>   generated to facilitate searching.  It would be easy, for example, to
>   create an index that contained the following information:
> 
>       bibid, title, author, dewey, isbn
> 
>   Then a search for bibid would be identical to the search that you gave
>   in your example using the koha database.

OK, I think this is the piece that I was missing from the picture.  That 
would allow simple queries like I'm looking for.
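For illustration, such a search index is just a flat table, so the simple query falls out naturally.  This sketch uses SQLite and invented sample data; the real backend and column set are still being worked out:

```python
import sqlite3

# Build the proposed flat search index in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE searchindex
                (bibid INTEGER PRIMARY KEY,
                 title TEXT, author TEXT, dewey TEXT, isbn TEXT)""")
conn.execute("INSERT INTO searchindex VALUES (?, ?, ?, ?, ?)",
             (1563, "Moby Dick", "Melville, Herman", "813.3", "0553213113"))

# A lookup by bibid is then identical to today's koha-database query.
row = conn.execute("SELECT title FROM searchindex WHERE bibid = ?",
                   (1563,)).fetchone()
```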

>  Not [Note?] that you could potentially
>   lose information stored in the marc record if there is more than one
>   author, for example.

Or, there could also be another index table that paralleled the current 
"additionalauthors" table.

>  These "search indexes" are still vaporware.  Paul
>   and I are trying to work out the specifics of what the new backend
>   database will look like.

My suggestion: what about using the existing structure of the KohaDB 
(perhaps with minor mods) for these search index tables?  Code 
transition would happen by first isolating all the writes to the 
existing tables and making them go through API's.  Then any existing 
code that is just reading the data wouldn't have to change.

> 5.  You mention MARC-XML and suggest that we not tie ourselves to the MARC
>     format.
> 
>   We are not proposing that bibliographic data be stored in MARC format,
>   only that the back end database be capable of storing any and all
>   information that a MARC record can store (whether it is a conventional
>   MARC record or the as-yet-unspecified MARC-XML format record).  The
>   existing MARC format is just a flat file that uses control character
>   separators and a directory index for the tags.  The MARC-XML
>   specification will be the same format, but will use XML tags for
>   separating the data instead of control characters.  We would not want to
>   use either of these formats internally, although we would probably want
>   to be capable of importing/exporting those formats.

I definitely agree here.  I'm just now understanding that file format 
with the tag index and control character separators.  It seems to me 
that the other (original?) Marc representation can be considered yet a 
third format.  I'm thinking of ones I see where the tag number is at the 
beginning of each line and the $ character separates the subfields.  I'm 
presuming that this is what Marc-familiar librarians work on, and then 
it is translated to be stored in the control-character file format (or 
something else like the database we are discussing).

Based on this idea of the abstraction of the Marc data being the 
important part, where "100a = Author" regardless of storage, then I 
really like your suggestion on IRC of

> The schema has columns like bibid,tag,subfieldcode,subfieldvalue, so...
>   (bibid, tag, subfieldcode, subfieldvalue) = (16654, 165, 'a',
>   'http://www.mysite.com/')
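That one-row-per-subfield schema is easy to try out.  A sketch in SQLite, with invented sample rows (the 165/'a' row just follows the quoted example; the 700 rows show repeated subfields):

```python
import sqlite3

# One row per subfield occurrence, exactly as quoted from IRC.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE marc_subfield
                (bibid INTEGER, tag TEXT,
                 subfieldcode TEXT, subfieldvalue TEXT)""")
conn.executemany("INSERT INTO marc_subfield VALUES (?, ?, ?, ?)", [
    (16654, "165", "a", "http://www.mysite.com/"),
    (16654, "700", "a", "First Additional Author"),
    (16654, "700", "a", "Second Additional Author"),
])

# Pulling any one subfield is a plain SELECT -- no "$"-string parsing,
# and repeated subfields are just additional rows.
authors = [r[0] for r in conn.execute(
    """SELECT subfieldvalue FROM marc_subfield
       WHERE bibid = ? AND tag = '700' AND subfieldcode = 'a'
       ORDER BY rowid""", (16654,))]
```

Any language with a SQL interface can get at an individual subfield this way, which is the portability argument in a nutshell.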

In my opinion, then, I would vote against the other suggestion of 
putting all subfields into one record separated by "$" characters.
It appears to be an artifact of the "straight-text" Marc format that 
doesn't exist in the control-character file format or proposed XML 
format, and therefore isn't really part of the "true" abstracted data.

It also increases the complexity of getting at any one subfield and 
potentially restricts the types of languages or tool sets that could 
work on the data. I love perl and regexps but I don't think we want a 
database that excludes non-perl access by design.

And relational database engines are designed to support data growth by 
adding rows.  Perhaps the list of additional authors won't ever be that 
big for any book I know of, but it seems short-sighted to design around 
ever-wider text strings, where an arbitrary-length list of subfields 
would be limited by column width or by the engine's support for 
unconventional data types.

Thanks for the comprehensive view; I'm starting to get the picture :-)

- Alan


-- 
Alan Millar                  Email: am12 at bolis.com




