[Koha-devel] dumpdict => something to investigate...
Paul POULAIN
paul.poulain at free.fr
Fri Oct 13 15:04:56 CEST 2006
I'm forwarding a mail from Eric Lease Morgan about building a spelling
dictionary from zebra. Sounds like an idea to investigate later...
Eric Lease Morgan wrote:
>
> By exploiting zebraidx's dumpdict option I have been able to create an
> Aspell dictionary and accompanying lookup script paving the way for a
> Did You Mean? (alternative spelling) service against my zebra indexes.
> It is not perfect but a decent start.
>
> First is zebra2aspell.pl:
>
> #!/usr/bin/perl
>
> # zebra2aspell.pl - create an Aspell dictionary from a zebra index
>
> # Eric Lease Morgan <emorgan at nd.edu>
> # October 5, 2006
>
> # require
> use strict;
>
> # define the zebraidx and aspell binaries
> my $ZEBRAIDX = '/usr/local/bin/zebraidx dumpdict';
> my $ASPELL = '/usr/local/bin/aspell --lang=en create master '
>            . '/home/emorgan/idzebra-2.0.2/examples/marc21/aspell.dict';
>
> # initialize input and output words
> my @words;
>
> # get the list of words from the index
> open my $input, '-|', $ZEBRAIDX or die "Can't run $ZEBRAIDX: $!";
> while ( <$input> ) {
>
>     chomp;                     # get rid of trailing newline
>     next if ( ! /^\d\d:\s/ );  # only look for word lines, not debugging
>     s/^\d\d:\s+\d+\s//;        # remove "leader"
>     s/\s-\d.*$//;              # remove "trailer"
>     next if ( / / );           # no words containing spaces; why do they exist?
>     next if ( /\W/ );          # no non-word characters
>     next if ( /\d/ );          # no words containing digits
>     next if ( ! $_ );          # has content
>     push @words, $_;
>
> }
> close $input;
>
> # remove duplicates; from Perl Cookbook pg. 102
> my %seen = ();
> @words = grep { ! $seen{$_} ++ } @words;
>
> # build a list aspell can use
> my $words;
> foreach ( @words ) { $words .= $_ . "\n" }
>
> # create a dictionary
> open my $output, '|-', $ASPELL or die "Can't run $ASPELL: $!";
> print $output $words;
> close $output or warn "aspell exited abnormally: $?";
>
> # done
> exit;
>
>
> Next is lookup.pl. Usage: ./lookup.pl foobar
>
> #!/usr/bin/perl
>
> # lookup.pl - look up a word in an Aspell dictionary
> # and return alternative spellings
>
> # Eric Lease Morgan <emorgan at nd.edu>
> # October 5, 2006
>
>
> # require
> use Text::Aspell;
> use strict;
>
> # define
> use constant DICTIONARY => './aspell.dict';
>
> # get the query
> my $query = $ARGV[0];
>
> # branch accordingly
> if ( ! $query ) { print "Usage: $0 word\n" }
> else {
>
>     # initialize dictionary
>     my $dictionary = Text::Aspell->new;
>     $dictionary->set_option( 'master', DICTIONARY );
>
>     # get suggestions
>     my @suggestions = $dictionary->suggest( $query );
>
>     # display the suggestions
>     print "Alternative spellings for $query:\n";
>     foreach ( @suggestions ) { print $_, "\n" }
>
> }
>
> # done
> exit;
>
>
> --Eric "It Feels Good To Hack Again" Morgan
> University Libraries of Notre Dame
>
>
>
>
> _______________________________________________
> Zebralist mailing list
> Zebralist at lists.indexdata.dk
> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/zebralist
>
>
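
For anyone who wants to try the filtering step without a zebra install, here
is a minimal sketch of the same logic as the loop in zebra2aspell.pl, pulled
into a function. The name extract_words and the sample lines are hypothetical:
the samples were made up only to fit the regular expressions in Eric's script
and are not real zebraidx dumpdict output, so check them against your own
index before relying on this.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# extract_words (hypothetical helper): reduce dumpdict-style lines to a
# list of unique candidate words, using the same filters as the loop in
# zebra2aspell.pl.
sub extract_words {
    my @lines = @_;
    my ( %seen, @words );
    for my $line (@lines) {
        chomp $line;
        next unless $line =~ /^\d\d:\s/;    # keep word lines only
        $line =~ s/^\d\d:\s+\d+\s//;        # strip the "leader"
        $line =~ s/\s-\d.*$//;              # strip the "trailer"
        next if $line eq '';                # nothing left
        next if $line =~ /\W/;              # non-word characters (includes spaces)
        next if $line =~ /\d/;              # digits
        push @words, $line unless $seen{$line}++;    # drop duplicates
    }
    return @words;
}

# ASSUMPTION: invented sample lines shaped to match the regexes above.
my @sample = (
    "01: 12 library -1 0\n",
    "01: 3 library -1 0\n",
    "01: 7 isbn10 -1 0\n",
    "01: 2 new york -1 0\n",
    "some debugging output\n",
    "02: 5 zebra -1 0\n",
);

print "$_\n" for extract_words(@sample);
```

The resulting list can be joined with newlines and piped to aspell exactly as
the original script does. Keeping the filter in a function makes it easy to
test against a saved copy of the dumpdict output.
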
--
Paul POULAIN and Henri Damien LAURENT
Independent consultants
in free software and library science (http://www.koha-fr.org)
Tel : 04 91 31 45 19