Equivalency searching

The Problem

Often you want to give users results that are close to what they searched for but not an exact match. Examples of this kind of behavior include disregarding punctuation, searching for Unicode characters via Roman transliterations or vice versa, ignoring the case of letters in a search string, similarity searching, etc. There are many levels of search term evaluation in Philologic, and these different but related problems can be solved in various ways.

Think about it both ways:

    • If I search for a certain string, what should I find?
    • If I am looking for a certain kind of result, what should I be able to search on in order to find it?

In order to get things to work the way you want, you have to understand how Philologic works and figure out what parts to tweak.

Searching Overview

Indexing

When you load a database into Philologic, an index of all the words in the text is created. What is considered to be a word unit is determined by textload.cfg. Look for:

$CHARSINWORD = "[\&A-Za-z0-9\177-\377][\&A-Za-z0-9\177-\377\\-\_\';]*";

That's where you define your word pattern, which lists all the characters that DO NOT cause the word to break. Any other character breaks the word, leaving you with two separate words (just as a space would). More on this at: Textload.cfg.
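To get a feel for where a given pattern breaks words, you can try it on a sample string outside Philologic. The following is only a sketch: it assumes the pattern behaves like an ordinary Perl regular expression applied repeatedly to the text, and split_words is a made-up helper, not part of Philologic. The loader itself may treat the text differently.

```perl
use strict;
use warnings;

# The word pattern exactly as it appears in textload.cfg.
my $CHARSINWORD = "[\&A-Za-z0-9\177-\377][\&A-Za-z0-9\177-\377\\-\_\';]*";

# Hypothetical helper: return every run of characters the pattern accepts.
sub split_words {
    my ($text) = @_;
    my @found = ($text =~ /($CHARSINWORD)/g);
    return @found;
}

# The comma, the spaces and the full stop are not in the pattern,
# so they break words; the apostrophe and the hyphen do not.
my @words = split_words("l'homme-machine, c'est moi.");
print join("|", @words), "\n";   # prints: l'homme-machine|c'est|moi
```

Removing the hyphen from the second bracket of the pattern would make the same input yield "l'homme" and "machine" as separate words.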

Pattern Creation

search3t gets the input from the web browser and uses philosubs.pl and crapser to construct the search string. In philosubs.pl, clean_word_pattern performs some preliminary cleaning, then crapser expands the word pattern with folded accents, transliterations, etc. The pattern is then sent to search3 which returns the results.

Kinds of Equivalency

Unicode <=> Roman

Capital letters match all accents

As described in the crapser entry, crapser replaces some capital letters with a pattern that matches likely accents. If you want to add to this list, just put in some letters or codes separated by pipes (|):

%ACCENTS = ( 'A', "(a|\xc3\xa0|\xc3\xa1|\xc3\xa2|\xc3\xa3|\xc3\xa4)",
             'C', "(c|\xc3\xa7)",
             'E', "(e|\xc3\xa8|\xc3\xa9|\xc3\xaa|\xc3\xab)",
             'I', "(i|\xc3\xac|\xc3\xad|\xc3\xae|\xc3\xaf)",
             'N', "(n|\xc3\xb1)",
             'O', "(o|\xc3\xb2|\xc3\xb3|\xc3\xb4|\xc3\xb5|\xc3\xb6)",
             'U', "(u|\xc3\xb9|\xc3\xba|\xc3\xbb|\xc3\xbc)",
             # Now X will match on X or 'ecks'
             'X', "(x|ecks)",
             'Y', "[y\375\377]" );
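As a rough illustration of what such a table does to a search term, the sketch below expands capitals into their alternations and tests the result against an indexed form. It is not crapser's actual code: expand_capitals is a hypothetical helper, the table is a cut-down copy of the one above, and lower-casing the unlisted capitals is an assumption.

```perl
use strict;
use warnings;

# A cut-down table in the same shape as %ACCENTS above (UTF-8 byte escapes).
my %ACCENTS = (
    'C', "(c|\xc3\xa7)",
    'E', "(e|\xc3\xa8|\xc3\xa9|\xc3\xaa|\xc3\xab)",
);

# Hypothetical helper: replace each listed capital with its alternation
# and lower-case any capital that has no entry in the table.
sub expand_capitals {
    my ($term) = @_;
    $term =~ s/([A-Z])/exists($ACCENTS{$1}) ? $ACCENTS{$1} : lc($1)/ge;
    return $term;
}

my $pattern = expand_capitals('cafE');
my $indexed = "caf\xc3\xa9";    # "café" as UTF-8 bytes

print "match\n" if $indexed =~ /^$pattern$/;   # prints: match
```

Because the alternation lists the plain letter first, a search for cafE in this sketch matches both "cafe" and "café".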

crapser replaces search terms with other words based on words.R.wom

Unicode and Roman equivalencies are handled via crapser which uses a two-field file called words.R.wom to make the matches. Here's how it works. When you load your database, Philologic generates a list of all words in the database and writes it out as words.R. It also makes words.R.wom which is (at this point) just a two-fielded list with the exact same words in each field. However, if you replace the second field with something else, it establishes a relationship between the two strings that says: when you get a user searching for stringA, replace that with stringB and search for that instead.

Here's an example. In our words.R.wom, we have this line:

melee meleé

Anyone searching for melee will find meleé in the text.
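The replacement idea can be pictured as a simple lookup table. The sketch below is an illustration only, not how crapser is implemented: load_wom is a hypothetical helper, the sample lines are invented, and it assumes one search form may appear on several lines, each pointing at a different indexed form.

```perl
use strict;
use warnings;

# Invented words.R.wom content: search form, a tab, then the indexed form.
my $wom = "melee\tmele\xc3\xa9\nmelee\tmelee\nchat\tchat\n";

# Hypothetical loader: map each search form to every indexed form listed for it.
sub load_wom {
    my ($content) = @_;
    my %forms;
    for my $line (split /\n/, $content) {
        my ($search_form, $index_form) = split /\t/, $line, 2;
        next unless defined $index_form;    # skip lines without a tab
        push @{ $forms{$search_form} }, $index_form;
    }
    return \%forms;
}

my $forms   = load_wom($wom);
my @targets = @{ $forms->{melee} || [] };
print scalar(@targets), " indexed forms for melee\n";   # prints: 2 indexed forms for melee
```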

You can generate your words.R.wom file any way you want, as long as you come out with a file that contains two fields with words separated by tabs. See the make_wom_words.R.pl entry for some examples.

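#! /usr/bin/perl
# Reads a word list (one word per line) and prints words.R.wom lines:
# the romanized search form, a tab, then the word as indexed.
require('characters.pl');   # expected to provide utf2rom()

while (<>) {
    chomp;                  # strip the newline only (chop would eat any last character)
    $index_form = $_;
    $search_form = $_;
    $romsearchform = utf2rom($search_form);
    $lineout = $romsearchform . "\t" . $index_form . "\n";
    print $lineout;
}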

characters.pl is a

Case Folding

Roman

By default, case is stripped during the creation of the word index -- indexed words are stored in lower case. Then, crapser folds all search terms to lower case when you run a search, so upper will match lower and vice-versa.

Non-Roman

For non-Roman scripts in Unicode, the situation is different. Philologic will not make any changes to the case during indexing -- it will index quite literally whatever Unicode is there. So what do you do if you want case-folded searches on Unicode? Well, you could theoretically do the job with words.R.wom, creating a separate entry for each word containing the characters you want case folded -- one upper, one lower. But the combinatorics quickly get out of hand if you want to be able to search for UpperLowerUpperUpper, e.g., and match whatever combination has been indexed.
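To see how quickly the number of entries grows, the sketch below enumerates every upper/lower combination of a word. case_variants is a hypothetical helper written for this illustration; the sample word is the three Cyrillic letters U+043C U+0438 U+0440, given as escapes so the script does not depend on the encoding of the file.

```perl
use strict;
use warnings;

# Hypothetical helper: list every upper/lower case combination of a word.
sub case_variants {
    my ($word) = @_;
    my @variants = ('');
    for my $ch (split //, $word) {
        # Characters without a case distinction contribute one form only.
        my %seen;
        my @forms = grep { !$seen{$_}++ } (lc($ch), uc($ch));
        my @next;
        for my $prefix (@variants) {
            push @next, $prefix . $_ for @forms;
        }
        @variants = @next;
    }
    return @variants;
}

my @short = case_variants("\x{43c}\x{438}\x{440}");   # three cased letters
my @long  = case_variants("abcdefghij");              # ten cased letters

print scalar(@short), "\n";   # prints: 8
print scalar(@long), "\n";    # prints: 1024
```

The count doubles with every cased letter, so a words.R.wom built this way needs 2^n lines for an n-letter word.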

See here: UnicodeCaseFolding for a description of how to hack crapser to allow case-folded Unicode searches. This should also be in your Philo3 (second release) goodies directory.

Punctuation

Most punctuation problems can be taken care of simply by setting $CHARSINWORD correctly. See Textload.cfg. For an explanation of how to have a character treated as word-breaking punctuation and still be able to search with it, see CharsInWord. It all boils down to setting:

$word =~ s/\-/\+/g;

in clean_word_pattern in philosubs.pl.

Headwords

It is important to note that headword searches are handled completely differently, and therefore none of the above techniques will work for equivalency searching in headwords. Headword searches do not use crapser, but instead use subdocgimme, which searches through divindex.raw. By default, there is no way to search on headwords in anything but their underlying form -- i.e. exactly as they appeared in the input text.

You can work around this, however. Using a modified subdocgimme found in the goodies folder of your installation, you can search on a normalized field in your divindex.raw. To generate the new divindex.raw, you can use a script like this:

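#! /usr/bin/perl
# Reads divindex.raw and prints each line with one extra, final field:
# the romanized form of the headword (the second tab-separated field).
require('characters.pl');   # expected to provide utf2rom()

while (<>) {
    chomp;                  # strip the newline only
    @parts = split(/\t/, $_);
    $search_form = $parts[1];
    $romsearchform = utf2rom($search_form);
    foreach $foo (@parts) {
        print $foo . "\t";
    }
    print $romsearchform . "\n";
}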

You are simply adding a last field which is the romanized form of your headword. Pop the modified subdocgimme from goodies into place, and you are ready to go.
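If you want to check the effect on a single line before touching a real divindex.raw, a sketch along these lines may help. Everything in it is an assumption made for illustration: the sample line is invented, and strip_diacritics is a stand-in that only removes combining marks using the core Encode and Unicode::Normalize modules. It is not the utf2rom routine from characters.pl, and real transliteration of a non-Roman script needs more than this.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFD);

# Stand-in for utf2rom (assumption): decompose, then drop combining marks.
sub strip_diacritics {
    my ($bytes) = @_;
    my $decomposed = NFD(decode('UTF-8', $bytes));
    $decomposed =~ s/\p{Mn}//g;
    return encode('UTF-8', $decomposed);
}

# Append the normalized headword (second field) as a new last field.
sub add_normalized_field {
    my ($line) = @_;
    my @parts = split /\t/, $line, -1;    # -1 keeps trailing empty fields
    return join("\t", @parts, strip_diacritics($parts[1]));
}

# Invented line in a divindex.raw-like shape: id, headword, one more field.
my $sample = "1 2 3\tmele\xc3\xa9\tdiv1";
print add_normalized_field($sample), "\n";
```

The printed line is the sample with a fourth field, melee, added at the end.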

If you want to do this across a range of databases, you can use a script like this:

#!/usr/bin/perl
# This script creates the indexes for all of the philologic dsal files
# Put your database names below and set the other vars

@dbnames = ('grierson', 'turner', 'pali', 'steingass');

foreach $dbname (@dbnames) {
    print "Doing $dbname...\n";

    # Set these
    $script = "/projects/dsal/scripts/for-philo/make_divindex_roman.pl";
    $raw = '/var/lib/philologic/databases/' . $dbname . '/divindex.raw';
    $backup = '/var/lib/philologic/databases/' . $dbname . '/divindex.raw.old';
    $path = '/var/lib/philologic/databases';

    print "cp $raw $backup\n";
    $res = `cp $raw $backup`;
    print $res . "\n";

    print "$script $backup > $raw\n";
    $res = `$script $backup > $raw`;
    print $res . "\n";

    print "mv $path/$dbname/subdocgimme $path/$dbname/subdocgimme.old\n";
    $res = `mv $path/$dbname/subdocgimme $path/$dbname/subdocgimme.old`;
    print $res . "\n";

    print "cp /projects/dsal/scripts/for-philo/subdocgimme $path/$dbname\n";
    $res = `cp /projects/dsal/scripts/for-philo/subdocgimme $path/$dbname/`;
    print $res . "\n";

    print "\nDone.\n\n";
}