[Koha-zebra] A few Zebra Questions

Sebastian Hammer quinn at indexdata.com
Thu Jan 5 02:51:16 CET 2006


Mike Rylander wrote:

>On 1/4/06, Sebastian Hammer <quinn at indexdata.com> wrote:
>
>[big ol' snip]
>
>  
>
>>>Another question that immediately occurs is: _what_ speed issues?
>>>Have you actually seen any?  Do you have any numbers?
>>>
>>>
>>>      
>>>
>>I'd like to hear the answer to this too. But my sense is that updating a
>>single record in a multimillion-record database does take some
>>significant period of time -- much more than updating a single row in
>>an RDBMS, for sure. It matters if you're scaling to a major library with
>>multiple circulation desks.
>>    
>>
>
>Warning: rant follows. :)
>  
>
Not much of a rant, was it? Good comments, though.  :-)

>This is exactly the concern, unless I misunderstand the OP.  With a
>centralized system running, say, 250+ libs with more than 1,500 circ
>and reference desk clients it would be one of the primary speed
>related concerns.
>  
>
Yes indeed. We had this same discussion early on. At that time, I 
suggested that Zebra wasn't really intended to be a transaction-oriented 
system.

>I believe the desire here is for Koha to both scale to large
>installations, and also offer advanced search/filter options.  Keeping
>the item status as close to the item and record identifiers obviously
>increases the flexibility of searches and filters, but it imposes a
>much greater maintenance cost.  So with the knowledge that it would be
>slower than in an RDBMS, the question becomes "how much slower, and
>where is the tipping point?".
>  
>
Quite. Fortunately, it's dead easy to find the tipping point with the 
new Perl ZOOM API (a rough sketch follows the steps below):

1) Index a suitably large database -- I'd say a few million bib records.
2) Write a little Perl loop that randomly fetches one record using a 
random barcode search or similar, changes it slightly (something 
roughly equivalent to flipping a circ status bit), and updates it 
again.
3) Run this, and see how it goes.
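
Something like this, roughly -- note that the host/port, database name,
barcode use attribute, and <status> element below are all invented and
would need adjusting to the local setup:

    #!/usr/bin/perl
    # Crude fetch-modify-update benchmark against a Zebra server.
    use strict;
    use warnings;
    use ZOOM;
    use Time::HiRes qw(time);

    my $conn = ZOOM::Connection->new('localhost:9999/bibs');
    $conn->option(preferredRecordSyntax => 'xml');

    my $cycles = 1000;
    my $start  = time();

    for (1 .. $cycles) {
        # Fetch one record by a random (fake) barcode.
        my $barcode = sprintf('%09d', int(rand(2_000_000)));
        my $rs = $conn->search_pq(qq(\@attr 1=1016 "$barcode"));
        next unless $rs->size() > 0;

        my $xml = $rs->record(0)->render();
        # Stand-in for flipping a circ status bit.
        $xml =~ s{<status>available</status>}{<status>checked-out</status>};

        # Push the modified record back via an extended-services package.
        my $p = $conn->package();
        $p->option(action => 'specialUpdate');
        $p->option(record => $xml);
        $p->send('update');
        $p->destroy();
    }

    my $elapsed = time() - $start;
    printf "%d fetch/modify/update cycles in %.1fs (%.1f/sec)\n",
        $cycles, $elapsed, $cycles / $elapsed;

(If the server runs with shadow registers enabled, you'd also need to
send a 'commit' package now and then before the updates become
visible.)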

You know roughly how fast someone at a checkout counter works, so that 
will give you a good idea of how fast this needs to be. A few hundred 
to a thousand circ desks going flat out, scanning barcodes every 
second or two, will generate a *lot* of transactions -- on top of the 
regular OPAC and admin client traffic. Even assuming that each of your 
1,500 workstations generates only one event every 30 seconds, that's 
still 50 events per second, and a circ station produces a lot more 
traffic than an average OPAC user.

>Part of that depends on what the most important filter would be. 
>IMHO, the most important status/state related item information would
>be those variables that affect item visibility to the patron, so I'll
>make an example of that.
>
>If you don't want items that are LOST or MISSING, or records that only
>have items in those states, to show up in the OPAC (because the
>patron, by definition, cannot use them), then that can be condensed
>into a "patron visibility" flag on the record.  It may be worth the
>cost of updating that flag when it is calculated to have changed, and
>not otherwise.  This gives you the functionality from the specific use
>case above, but it limits the flexibility of the system.  Staff don't
>get to search directly for items that are LOST or MISSING, just
>records that wouldn't show up because the constituent items are all in
>that state.
>  
>
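
A minimal sketch of that kind of derived flag (the status values and
the helper are invented for illustration):

    # Recompute a record-level "patron visibility" flag from its items.
    sub patron_visible {
        my @item_statuses = @_;
        # Visible as long as at least one item is in a usable state.
        return scalar grep { $_ ne 'LOST' && $_ ne 'MISSING' }
            @item_statuses;
    }

The point being that you recompute this only when an item changes
state, and push an update to Zebra only if the flag actually flipped.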

>The thing to watch out for when denormalizing data to increase speed
>is that you'll do it over and over again.  Using the example above,
>there are probably 20 flags one could invent to solve specific issues
>like that, but then you've got to check, calculate, and possibly
>update the value of all of those flags on every item update.  At some
>point the denormalization costs too much in the application layer,
>and you might as well just move the raw data into the records,
>updating at every change.
>  
>
True.

I don't think that distinguishing between different non-available 
conditions would, in itself, impose a performance hit... it'd be fine 
to include the equivalent of a bitmask in the record -- a series of 
tokens denoting different conditions -- up to a point. ANDing a 
boolean filter over 20 different conditions onto each end-user search 
might start to stretch things a little if you also want to handle 
50-100 searches per second.  :-)
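
For example, filtering lost/missing records out of a title search in
PQF might look something like this (the status index, 1=1016 here, is
again invented):

    # Title search AND NOT (status = lost OR status = missing).
    my $pq = '@not @attr 1=4 "dinosaurs" '
           . '@or @attr 1=1016 lost @attr 1=1016 missing';
    my $rs = $conn->search_pq($pq);

One or two clauses like that per query are cheap; AND-NOT-ing a couple
of dozen of them onto every OPAC search is where it would start to
hurt.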

>So, the first step is probably to design some use cases.  If they
>seem to be comprehensive and the required data are easily identified,
>then tests can be done and a decision can be made as to whether any of
>this is worth the update costs inside zebra, and which plan is
>"better".
>  
>
I think this sounds like a pretty good plan. For the purposes of 
testing use cases, we can probably help determine which things will 
impact performance and which things won't. But the bottom line is, the 
only way to find out is to try it. If a relatively simple test 
indicates that Zebra as it stands is too slow for these 
transaction-type things, then that is important information... the 
results of that test might provide us at ID with some ideas for 
optimizations, and it might give y'all some inspiration for which 
functionality is or is not realistic to support.

Right now we really have no information whatsoever... even our own 
performance tests have mostly focused on *adding* records, not 
updating them. If it turns out that doing a minor mod on a record is 
in fact much, much faster than I've been suggesting, then that changes 
the discussion a little bit, I think.

--Sebastian


-- 
Sebastian Hammer, Index Data
quinn at indexdata.com   www.indexdata.com
Ph: (603) 209-6853