[Koha-bugs] [Bug 14367] History for MARC records. Roll back changes on a timeline or per field.

bugzilla-daemon at bugs.koha-community.org bugzilla-daemon at bugs.koha-community.org
Tue Apr 11 03:19:16 CEST 2017


https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14367

--- Comment #11 from David Cook <dcook at prosentient.com.au> ---
(In reply to Chris from comment #9)
> I like the idea of storing history with timestamp and author. But storing
> whole record seems like overkill to me. Especially for huge libraries.
> 
> So instead of whole history, just a "delta" of data changed (somewhat git
> does) and as mentioned - not JSON but TABLE.

After doing some reading, it appears that Git actually does save the whole file
during a commit. It will compress it using zlib, but it stores the whole file.
However, when it does its garbage collection, it may discard old whole files
and replace them with deltas for more efficient storage.

Using Compress::Zlib, which is a core Perl module, I compressed a 6.7KB,
153-line MARCXML file to 1.3KB.

Let's say we have 1,000,000 old versions at 1.3KB each... that's about 1,270MB
or 1.24GB. Not super huge but not tiny either.

If we used GNU diff/patch or even 'git diff --no-index file1 file2', we could
get diffs and store those. Using diff, I made a 467B patch for a one line
change, and with git diff I got a 474B patch for the same one line change. 

Presuming 1,000,000 uncompressed patches of the same size, that's 445MB or
.434GB.

If you compress the patch, you can get it down to 227B, so 216MB or .211GB for
1,000,000 compressed patches. That seems pretty decent to me.
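
The patch-size estimates above can be reproduced with a few lines of
arithmetic. This is only a sketch in Python (not Koha code); the function name
total_storage is made up for illustration, and it assumes binary units
(1MB = 1024 * 1024 bytes), which is what the figures above appear to use.

```python
def total_storage(bytes_per_version, versions=1_000_000):
    """Return (megabytes, gigabytes) for `versions` items of the given size.

    Assumes binary units: 1MB = 1024**2 bytes, 1GB = 1024**3 bytes.
    """
    total_bytes = bytes_per_version * versions
    return total_bytes / 1024**2, total_bytes / 1024**3


# Per-patch sizes taken from the measurements above.
for label, size in [("uncompressed diff patch", 467), ("compressed patch", 227)]:
    mb, gb = total_storage(size)
    print(f"{label}: {mb:.0f}MB or {gb:.3f}GB")
```

Swapping in a different per-version size or version count shows how quickly
the total grows for a larger catalogue.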

So Compress::Zlib is easy to use and it's a core module. Awesome. 
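
As a rough illustration of the compress-and-restore round trip, here is a
sketch using Python's zlib module as a stand-in for Compress::Zlib (both wrap
the same zlib library). The record below is a made-up MARCXML-like string, not
the 153-line file measured above, so the sizes it prints will differ.

```python
import zlib

# Hypothetical stand-in for a MARCXML record; a real one would come from Koha.
fields = "".join(
    f'<datafield tag="650" ind1=" " ind2="0">'
    f'<subfield code="a">Subject {n}</subfield></datafield>'
    for n in range(20)
)
record = f"<record>{fields}</record>".encode("utf-8")

blob = zlib.compress(record)      # what would be stored for each old version
restored = zlib.decompress(blob)  # what a rollback would read back

print(len(record), "bytes ->", len(blob), "bytes")
print("round trip ok:", restored == record)
```

MARCXML is repetitive markup, which is why it tends to compress well.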

I think making the deltas may be harder. You can use Git.pm to take advantage
of 'git diff --no-index file1 file2', but git diff exits with a non-zero status
when the files differ, so Git.pm throws an exception which you have to catch,
and then you pull the diff out of the exception. Not that elegant... plus you'd
have to write the records to temporary files (same with using GNU diff).

I've seen lots of diff modules on CPAN... and Text::Diff seems to get
recommended a fair bit (https://metacpan.org/pod/Text::Diff). It also shows up
in the O'Reilly book
(http://docstore.mik.ua/orelly/perl4/cook/ch08_23.htm). I found it was already
installed on my Koha server, although I don't use the Debian packages, so I
can't say whether they pull it in. It looks like it's required by
Test::Differences, which lots of modules use, so that's probably why it's on my
system. I've just tried it out and it's easy to use. It looks like we'd want to
use Text::Patch for applying the diff to rebuild a record... although I haven't
tried it.

Using Text::Diff I got a patch that was 378B uncompressed.
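
To make the "store a delta, rebuild the record" idea concrete, here is a
minimal sketch in Python using the standard difflib module. It is a stand-in
for the Text::Diff / Text::Patch pair, not their API: the delta format (a list
of line ranges plus replacement lines) and the names make_delta and
apply_delta are invented for this example, and the two "records" are
hypothetical.

```python
import difflib


def make_delta(source_lines, target_lines):
    """Return a compact delta: (start, end, replacement_lines) tuples
    describing how to turn source_lines into target_lines."""
    matcher = difflib.SequenceMatcher(a=source_lines, b=target_lines, autojunk=False)
    return [
        (i1, i2, target_lines[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # unchanged runs are not stored
    ]


def apply_delta(source_lines, delta):
    """Rebuild the target version by applying a delta to source_lines."""
    result = list(source_lines)
    # Apply from the end so earlier line indexes stay valid.
    for start, end, replacement in reversed(delta):
        result[start:end] = replacement
    return result


# Hypothetical record versions, one field per line.
previous = ['<subfield code="a">Old title</subfield>',
            '<subfield code="c">Some author</subfield>']
current = ['<subfield code="a">New title</subfield>',
           '<subfield code="c">Some author</subfield>',
           '<subfield code="h">Added field</subfield>']

# Keep the current record whole and store only what is needed to go back.
rollback = make_delta(current, previous)
print("rollback ok:", apply_delta(current, rollback) == previous)
```

Storing the delta from the current record back to the previous one means the
live record stays whole and only history entries need rebuilding.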
