Make wom words.R.pl

If you don't understand why you want a words.R.wom file, read EquivalencySearching and words.R.wom first.


What words.R.wom looks like

It's simply a tab-delimited data file with two fields per line: the search form (what the user types) on the left and the index form (the word as it is stored in the index) on the right. You enter the term on the left and, unbeknownst to the end user, the search is actually run on the form on the right.
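As a rough illustration of that layout, the sketch below reads a couple of lines in the two-field format and groups the right-hand forms under the left-hand form the user would type. The sample lines and the hash name %indexed_forms_for are made up for this example; they are not taken from a real words.R.wom.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines in the words.R.wom layout:
# typed form, a tab, then the form as it appears in the index.
my @wom_lines = (
    "ete\t&eacute;t&eacute;\n",
    "word\t&lsb;word&rsb;\n",
);

# Map each typed form to the list of indexed forms it stands for.
my %indexed_forms_for;
for my $line (@wom_lines) {
    chomp(my $copy = $line);
    my ($typed, $indexed) = split /\t/, $copy, 2;
    push @{ $indexed_forms_for{$typed} }, $indexed;
}

for my $typed (sort keys %indexed_forms_for) {
    print "$typed => @{ $indexed_forms_for{$typed} }\n";
}
```

Several index forms can share one typed form, which is the whole point of an equivalency file.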


Making your words.R.wom file

So, you need to generate a two-field list of equivalencies. One script that does this already ships with your install, at /your/install/philologic/make_wom_words.R.pl:

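# Reads words.R on standard input and prints one line per word:
# search form, tab, index form.
while (<>) {
    chomp;    # chomp, not chop: chop would eat a letter if the last line lacks a newline
    $index_form = $_;
    $search_form = $_;
    # Strip bracket, brace and parenthesis entities.
    $search_form =~ s/\&lsb;//g;
    $search_form =~ s/\&rsb;//g;
    $search_form =~ s/\&lcb;//g;
    $search_form =~ s/\&rcb;//g;
    $search_form =~ s/\&lpn;//g;
    $search_form =~ s/\&rpn;//g;
    $search_form =~ s/\&rpar;//g;
    $search_form =~ s/\&lpar;//g;
    # Strip bare angle brackets.
    $search_form =~ s/\<//g;
    $search_form =~ s/\>//g;
    # Strip guillemets (&#171; and &#187;) and the middle dot (&#183;).
    $search_form =~ s/\&\#171;//g;
    $search_form =~ s/\&\#187;//g;
    $search_form =~ s/\&\#183;//g;
    # Reduce any remaining named entity to its first letter (&eacute; becomes e).
    $search_form =~ s/&([a-zA-Z])[^;]*;/$1/g;
    # Uncomment to drop apostrophes as well.
    # $search_form =~ s/'//g;
    $lineout = $search_form . "\t" . $index_form . "\n";
    print $lineout;
}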

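As you can see, this script strips a handful of punctuation SGML entities and bare angle brackets outright, then reduces every remaining named entity to its first letter, so that &eacute; is searched as e.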

So, you'd run:

cat words.R | ./make_wom_words.R.pl > words.R.wom
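To see what those substitutions do to individual entries before running the script over a whole words.R, a condensed sketch like the following can help. The function name to_search_form and the three sample words are our own; the regular expressions are meant to be equivalent to the ones in make_wom_words.R.pl, folded into fewer lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same substitutions as make_wom_words.R.pl, wrapped in a function
# (the name to_search_form is ours, not part of PhiloLogic).
sub to_search_form {
    my ($form) = @_;
    # Drop bracket, brace, parenthesis, guillemet and middle-dot entities.
    $form =~ s/&(?:lsb|rsb|lcb|rcb|lpn|rpn|lpar|rpar|#171|#187|#183);//g;
    # Drop bare angle brackets.
    $form =~ s/[<>]//g;
    # Reduce any remaining named entity to its first letter.
    $form =~ s/&([a-zA-Z])[^;]*;/$1/g;
    return $form;
}

# Hypothetical index forms, printed in the words.R.wom layout.
for my $index_form ('&eacute;t&eacute;', '&lsb;word&rsb;', 'plain') {
    print to_search_form($index_form), "\t", $index_form, "\n";
}
```

A word with no entities comes out unchanged, so it simply maps to itself.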

Here's another example:

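#! /usr/bin/perl

# characters.pl must be findable through @INC. Perl 5.26 and later no longer
# search the current directory, so use './characters.pl' (or run perl -I.)
# if the file sits beside this script.
require('characters.pl');

while (<>) {
    chomp;    # chomp, not chop: chop would eat a letter if the last line lacks a newline
    $index_form = $_;
    $search_form = $_;

    # Romanize the Unicode form; utf2rom is defined in characters.pl.
    $romsearchform = utf2rom($search_form);

    # Romanized form on the left, original index form on the right.
    $lineout = $romsearchform . "\t" . $index_form . "\n";
    print $lineout;
}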

And then, in characters.pl, you have a function that returns the romanization of the Unicode text. It looks like this:

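# Romanize $text: replace each non-ASCII character listed in %sgml2utf
# with the first letter of its SGML entity name.
sub utf2rom {    # no "()" prototype: the sub takes one argument
    my $text = $_[0];
    while (my ($key, $value) = each(%sgml2utf)) {
        # Skip entries whose value is already plain ASCII letters.
        next if $value =~ /^[A-Za-z]+$/;
        # \Q...\E keeps the character bytes from being read as regex syntax.
        next unless $text =~ /\Q$value\E/;
        # For "&cap..." keys take the letter after "&cap"; otherwise the letter after "&".
        my $roman = ($key =~ /\&cap/) ? substr($key, 4, 1) : substr($key, 1, 1);
        $text =~ s/\Q$value\E/$roman/g;
    }
    return $text;
}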

That's pretty rough: it looks up the SGML entity that corresponds to each Unicode character and uses the first letter of the entity name as the romanization. It works for most purposes.
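To get a feel for where the first-letter rule holds up and where it falls short, here is a small self-contained sketch. The three-entry table is a hypothetical stand-in for the real %sgml2utf in characters.pl (whose contents are not shown on this page), the function name first_letter_rom is ours, and the "&cap" branch is left out for brevity.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical three-entry stand-in for the %sgml2utf table in characters.pl;
# each value is the UTF-8 byte sequence of the character.
my %sgml2utf = (
    '&eacute;' => "\xC3\xA9",   # e with acute accent
    '&szlig;'  => "\xC3\x9F",   # German sharp s
    '&aelig;'  => "\xC3\xA6",   # ae ligature
);

# Same first-letter rule as utf2rom, without the "&cap" branch.
sub first_letter_rom {
    my ($text) = @_;
    while (my ($entity, $utf) = each %sgml2utf) {
        my $roman = substr($entity, 1, 1);   # letter right after the "&"
        $text =~ s/\Q$utf\E/$roman/g;
    }
    return $text;
}

print first_letter_rom("\xC3\xA9t\xC3\xA9"), "\n";   # prints "ete"
print first_letter_rom("Stra\xC3\x9Fe"), "\n";       # prints "Strase", not "Strasse"
print first_letter_rom("C\xC3\xA6sar"), "\n";        # prints "Casar", not "Caesar"
```

Accented vowels come out as expected, while characters that conventionally romanize to two letters lose one of them. If that matters for your texts, a hand-made table of exceptions is probably the simplest remedy.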


Either/or Unicode/Roman switching in crapser

Now that all your Unicode (or whatever you've set up equivalencies for) is being searched through its equivalent, you may run into the problem that entering the underlying Unicode itself no longer works: the entry that matches the Unicode is in the right-hand field of words.R.wom, and we only search the left-hand one. This can be solved with a modification to crapser.

In place of:

$pattern =~ s/^(.*)$/^$prefix$1\t/;

use:

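# Two consecutive high-bit bytes mean the user typed a multi-byte character.
if ($pattern =~ /[\177-\377][\177-\377]/) {
    # Match the right-hand field: a tab, the pattern, then end of line.
    $pattern =~ s/^(.*)$/\t$1\$/;
}
else {
    # Match the left-hand field, as before.
    $pattern =~ s/^(.*)$/^$prefix$1\t/;
}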
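The effect of the two branches can be checked outside crapser with a sketch along these lines. The helper name build_wom_pattern and the two sample lines are hypothetical, and $prefix is passed in as an empty string because its real value is set elsewhere in crapser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper wrapping the modified crapser logic;
# $prefix is a parameter here because its real value is set elsewhere in crapser.
sub build_wom_pattern {
    my ($pattern, $prefix) = @_;
    if ($pattern =~ /[\177-\377][\177-\377]/) {
        # Multi-byte input: match the indexed form in the right-hand field.
        $pattern =~ s/^(.*)$/\t$1\$/;
    }
    else {
        # Plain input: match the typed form in the left-hand field.
        $pattern =~ s/^(.*)$/^$prefix$1\t/;
    }
    return $pattern;
}

# Hypothetical words.R.wom lines, written as UTF-8 bytes.
my @wom = ("ete\t\xC3\xA9t\xC3\xA9", "etre\t\xC3\xAAtre");

# Try the romanized form first, then the Unicode form of the same word.
for my $typed ("ete", "\xC3\xA9t\xC3\xA9") {
    my $regex = build_wom_pattern($typed, '');
    my @hits  = grep { /$regex/ } @wom;
    print scalar(@hits), " match(es)\n";
}
```

Both inputs should find the same single line: the romanized form through the left-hand field and the Unicode form through the right-hand field.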