[Koha-devel] dumpdict => something to investigate...

Paul POULAIN paul.poulain at free.fr
Fri Oct 13 15:04:56 CEST 2006


I'm copying a mail from Eric Lease Morgan about a spelling dictionary built
from a zebra index. Sounds like an idea to investigate later...


Eric Lease Morgan wrote:
> 
> By exploiting zebraidx's dumpdict option I have been able to create an 
> Aspell dictionary and accompanying lookup script paving the way for a 
> Did You Mean? (alternative spelling) service against my zebra indexes. 
> It is not perfect but a decent start.
> 
> First is zebra2aspell.pl:
> 
>   #!/usr/bin/perl
> 
>   # zebra2aspell.pl - create an Aspell dictionary from a zebra index
> 
>   # Eric Lease Morgan <emorgan at nd.edu>
>   # October 5, 2006
> 
>   # require
>   use strict;
> 
>   # define the zebraidx and aspell binaries
>   my $ZEBRAIDX  = '/usr/local/bin/zebraidx dumpdict';
>   my $ASPELL    = '/usr/local/bin/aspell --lang=en create master /home/emorgan/idzebra-2.0.2/examples/marc21/aspell.dict';
> 
>   # initialize input and output words
>   my @words;
> 
>   # get the list of words from the index
>   open INPUT, "$ZEBRAIDX |" or die "Cannot run $ZEBRAIDX: $!";
>   while ( <INPUT> ) {
> 
>       chomp;                     # get rid of trailing newline
>       next if ( ! /^\d\d:\s/ );  # only look for word lines, not debugging
>       s/^\d\d:\s+\d+\s//;        # remove "leader"
>       s/\s-\d.*$//;              # remove "trailer"
>       next if ( / / );           # no words containing spaces; why do they exist?
>       next if ( /\W/ );          # no non-word characters
>       next if ( /\d/);           # no words containing digits
>       next if ( ! $_ );          # has content
>       push @words, $_;
>     
>   }
>   close INPUT;
> 
>   # remove duplicates; from perl cookbook pg. 102
>   my %seen = ();
>   @words = grep { ! $seen{$_} ++ } @words;
> 
>   # build a list aspell can use
>   my $words;
>   foreach ( @words ) { $words .= $_ . "\n" }
> 
>   # create a dictionary
>   open OUTPUT, "| $ASPELL" or die "Cannot run $ASPELL: $!";
>   print OUTPUT $words;
>   close OUTPUT;
> 
>   # done
>   exit;
> 
> 
> Next is lookup.pl. Usage: ./lookup.pl foobar
> 
>   #!/usr/bin/perl
> 
>   # lookup.pl - look up a word in an Aspell dictionary
>   #             and return alternative spellings
> 
>   # Eric Lease Morgan <emorgan at nd.edu>
>   # October 5, 2006
> 
> 
>   # require
>   use Text::Aspell;
>   use strict;
> 
>   # define
>   use constant DICTIONARY => './aspell.dict';
> 
>   # get the query
>   my $query = $ARGV[0];
> 
>   # branch accordingly
>   if ( ! $query ) { print "Usage: $0 word\n" }
>   else {
>            
>       # initialize dictionary
>       my $dictionary = Text::Aspell->new;
>       $dictionary->set_option( 'master', DICTIONARY );
>     
>       # get suggestions
>       my @suggestions = $dictionary->suggest( $query );
> 
>       # display the suggestions
>       print "Alternative spellings for $query:\n";
>       foreach ( @suggestions ) { print $_, "\n" }
>        
>   }
> 
>   # done
>   exit;
> 
> 
> --Eric "It Feels Good To Hack Again" Morgan
> University Libraries of Notre Dame
> 
> 
> 
> 
> _______________________________________________
> Zebralist mailing list
> Zebralist at lists.indexdata.dk
> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/zebralist
> 
> 
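
The filtering and de-duplication steps in zebra2aspell.pl can be tried
without a zebra index. Below is a minimal sketch that wraps the same regular
expressions in two subroutines. The names extract_word and unique_words are
made up for this example, and the sample lines are hypothetical: they were
written only to match the patterns in the script and have not been checked
against real zebraidx dumpdict output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reduce one dumpdict-style line to a bare term. Returns nothing when the
# line should be skipped. The patterns are the ones used in zebra2aspell.pl.
sub extract_word {
    my ($line) = @_;
    chomp $line;
    return unless $line =~ /^\d\d:\s/;    # keep word lines, skip debugging
    $line =~ s/^\d\d:\s+\d+\s//;          # remove "leader"
    $line =~ s/\s-\d.*$//;                # remove "trailer"
    return if $line eq '';                # nothing left
    return if $line =~ /\W/;              # spaces or other non-word characters
    return if $line =~ /\d/;              # digits
    return $line;
}

# Drop duplicates while keeping first-seen order.
sub unique_words {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Hypothetical sample lines (an assumption, not real zebraidx output).
my @sample = (
    "01: 12 origami -1 extra",
    "01: 3 origami -1 extra",
    "01: 7 paper folding -1 extra",
    "01: 2 isbn10 -1 extra",
    "some debugging line",
    "02: 5 zebra -2 extra",
);

my @words = unique_words( map { extract_word($_) } @sample );
print "$_\n" for @words;
```

With these sample lines only "origami" and "zebra" survive: the duplicate,
the term containing a space, the term containing digits and the line without
a leader are all dropped. Keeping the filter in a subroutine makes it easy to
adjust once the real dumpdict format has been looked at more closely.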


-- 
Paul POULAIN and Henri Damien LAURENT
Independent consultants
in free software and library science (http://www.koha-fr.org)
Tel: 04 91 31 45 19




