WordSieve

WordSieve is a utility that extracts lists of unique words from combinations of files, and I've been using it in one form or another for decades. I first wrote it in Turbo Pascal back in 1988, and I depended on a later version written in Delphi to compile my Penguin Dictionary of Computing, using it to generate word lists from archived texts. The Ruby version is remarkable in two ways: on a modern PC it runs far faster than the fully compiled version used to run, and the core logic of the program has been reduced to a single line containing a regular expression.

Please note that this is just a static screenshot and doesn't run. To make it work, download and install shoes-0.r244-win32.zip from the previous page and then the file wordsieve.shy from this page. I hope to learn how to make it run in situ soon.

The Shoes version has lots of buttons for all the useful options, like controlling the letter-case of the output file, filtering out common words, and choosing a logical combination from the options AND, OR, NOT, XOR and SORT (a one-file operation that merely sorts the unique words from Document 1).
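To make the logical combinations concrete, here is a minimal sketch in plain Ruby (no Shoes). It is not the WordSieve source: the helper names unique_words and combine, the regular expression, and the reading of NOT as "words in Document 1 that are not in Document 2" are all my assumptions for illustration.

```ruby
require 'set'

# Hypothetical helper: extract the set of unique words from a string.
# The regular expression is an assumption, not the one WordSieve uses.
def unique_words(text)
  text.downcase.scan(/[a-z]+(?:'[a-z]+)*/).to_set
end

# Hypothetical helper: combine two word sets and return a sorted array.
# NOT is assumed to mean "in the first set but not in the second".
def combine(op, words1, words2 = Set.new)
  result =
    case op
    when :and  then words1 & words2   # words in both documents
    when :or   then words1 | words2   # words in either document
    when :not  then words1 - words2   # words only in Document 1
    when :xor  then words1 ^ words2   # words in exactly one document
    when :sort then words1            # one-file operation
    else raise ArgumentError, "unknown operation: #{op}"
    end
  result.to_a.sort
end

doc1 = unique_words("The cat sat on the mat")
doc2 = unique_words("The dog sat on the log")
puts combine(:xor, doc1, doc2).join(" ")
```

Filtering out common words would fit the same pattern: subtract a set of common words from the result before sorting.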

The Count options add a count of the occurrences of each word, either to the left or to the right of the word itself, so you can easily sort the final file either alphabetically or by frequency. I'm providing the Shoes source code and a compiled Shoes wordsieve.shy file that should just run if you have Ruby/Shoes installed, as well as my own dictionary of common words as both a text file and a Ruby array constant.
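As a rough illustration of the Count options, the sketch below tallies words and puts the count on either side of each word. The helper name counted_lines, the side: parameter, the tab separator and the regular expression are my assumptions, not details taken from the WordSieve source.

```ruby
# Hypothetical helper: count each word in a string and return one
# line per word, with the count to the left or right of the word.
def counted_lines(text, side: :left)
  counts = Hash.new(0)
  text.downcase.scan(/[a-z]+(?:'[a-z]+)*/) { |word| counts[word] += 1 }

  # Sort alphabetically by word, then format each entry.
  counts.sort.map do |word, n|
    side == :left ? "#{n}\t#{word}" : "#{word}\t#{n}"
  end
end

puts counted_lines("the cat and the hat", side: :left)
```

With the count on the left, a numeric sort of the output file orders it by frequency; with the count on the right, a plain sort keeps it alphabetical.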

User Interface

A word about the way the interface works. First, press the two buttons to choose your two files, Document1 and Document2, then choose a logical combination from the drop-down box. Finally, press the Output To> button and type in or browse to the filename in which the results will be placed. Simply entering this filename performs the filtering operation and writes the results.