Unicode Case Folding

Problem

You want to search for both the uppercase and lowercase forms of a Unicode character and return matches for either one. In other words, you want case folding.

POSSIBLE ELEGANT SOLUTION

Doesn't Perl have Unicode-aware regular expressions? Sure. So why can't we use POSIX character classes to neutralize case distinctions? Because it doesn't seem to work here! Could we simply find each uppercase character and replace it with an (uppercase|lowercase) alternation? If so, skip everything below and do that instead.
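One way to try that idea, as a minimal sketch rather than a tested replacement for anything in crapser: decode the UTF-8 search term into characters, let Perl's built-in uc and lc supply the two cases, and encode the result back to bytes. The helper name expand_cases is hypothetical, and the sketch assumes the input is valid UTF-8 and that the core Encode module is available.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of crapser): replace every cased non-ASCII
# character in a UTF-8 byte string with an "(upper|lower)" alternation.
sub expand_cases {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);    # bytes -> characters
    $chars =~ s{([^\x00-\x7F])}{
        my ($upper, $lower) = (uc $1, lc $1);
        $upper ne $lower ? "($upper|$lower)" : $1;
    }ge;
    return encode('UTF-8', $chars);         # characters -> bytes
}

# Example: the trailing e-acute is expanded to match either case.
my $pattern = expand_cases("caf\xC3\xA9");
```

ASCII letters are left alone here on the assumption that grep -i already handles them. Characters whose uppercase form is more than one character long (the German sharp s, for example) would produce a longer alternative, so check the output before relying on it.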

Solution

Maybe you could do this with words.R.wom, but the combinatorics could get ugly if it had to hold an entry for every possible mix of upper and lower case in each word. Better to hack something onto crapser to take care of this. Basically, you need a hash of character equivalencies that pairs each uppercase character with its lowercase form, so that either one can be expanded to match both. Stick that at the top of crapser like so:

#!/usr/bin/perl

# $Id: crapser-egrep-2field.plin,v 2.1 2004/08/23 21:45:03 o Exp $

# These cases are Unicode character equivalencies, upper -> lower and lower -> upper.

# We need to search on either one and return either one, so they are subbed in

# down below to do casefolding.

$cases{"\xC3\x80"} = "\xC3\xA0";

$cases{"\xC3\x81"} = "\xC3\xA1";

$cases{"\xC3\x82"} = "\xC3\xA2";

$cases{"\xC3\x83"} = "\xC3\xA3";

# .... etc .....

Then, down below in the while loop, add:

# Here we substitute in both cases of a Unicode character defined above

# so that we can do case-folded Unicode searches.

while (($a, $b) = each(%cases)) {

s/$a|$b/\($a|$b\)/g;

}

That should do it. The entire crapser file from the Mamluk project is included below for reference. Unfortunately, the equivalency table is not complete, so you might want to build your own.
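If you do build your own table, Perl's own case mappings can generate it. The following is a rough sketch under a few assumptions: the helper name build_cases and the code point range are illustrative only, and the output mirrors the %cases layout shown above (UTF-8 bytes of the uppercase form as the key, UTF-8 bytes of the lowercase form as the value).

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build an upper -> lower table of UTF-8 byte strings
# for every code point in the range [$first, $last].
sub build_cases {
    my ($first, $last) = @_;
    my %cases;
    for my $cp ($first .. $last) {
        my $upper = chr $cp;
        utf8::upgrade($upper);    # force Unicode case rules for U+0080..U+00FF
        my $lower = lc $upper;
        # Skip characters with no distinct single-character lowercase form.
        next if $lower eq $upper or length($lower) != 1;
        $cases{ encode('UTF-8', $upper) } = encode('UTF-8', $lower);
    }
    return %cases;
}

# Latin-1 Supplement through Latin Extended-B (range chosen for illustration).
my %cases = build_cases(0x00C0, 0x024F);
```

Widening the range covers more scripts at the cost of more substitutions per search term. The result depends on the Unicode tables shipped with your version of Perl.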

crapser:

#!/usr/bin/perl

# $Id: crapser-egrep-2field.plin,v 2.1 2004/08/23 21:45:03 o Exp $

# These cases are Unicode character equivalencies, upper -> lower and lower -> upper.

# We need to search on either one and return either one, so they are subbed in

# down below to do casefolding.

$cases{"\xC3\x80"} = "\xC3\xA0";

$cases{"\xC3\x81"} = "\xC3\xA1";

$cases{"\xC3\x82"} = "\xC3\xA2";

$cases{"\xC3\x83"} = "\xC3\xA3";

$cases{"\xC3\x84"} = "\xC3\xA4";

$cases{"\xC3\x85"} = "\xC3\xA5";

$cases{"\xC3\x86"} = "\xC3\xA6";

$cases{"\xC3\x87"} = "\xC3\xA7";

$cases{"\xC3\x88"} = "\xC3\xA8";

$cases{"\xC3\x89"} = "\xC3\xA9";

$cases{"\xC3\x8A"} = "\xC3\xAA";

$cases{"\xC3\x8B"} = "\xC3\xAB";

$cases{"\xC3\x8C"} = "\xC3\xAC";

$cases{"\xC3\x8D"} = "\xC3\xAD";

$cases{"\xC3\x8E"} = "\xC3\xAE";

$cases{"\xC3\x8F"} = "\xC3\xAF";

$cases{"\xC3\x90"} = "\xC3\xB0";

$cases{"\xC3\x91"} = "\xC3\xB1";

$cases{"\xC3\x92"} = "\xC3\xB2";

$cases{"\xC3\x93"} = "\xC3\xB3";

$cases{"\xC3\x94"} = "\xC3\xB4";

$cases{"\xC3\x95"} = "\xC3\xB5";

$cases{"\xC3\x96"} = "\xC3\xB6";

$cases{"\xC3\x98"} = "\xC3\xB8";

$cases{"\xC3\x99"} = "\xC3\xB9";

$cases{"\xC3\x9A"} = "\xC3\xBA";

$cases{"\xC3\x9B"} = "\xC3\xBB";

$cases{"\xC3\x9C"} = "\xC3\xBC";

$cases{"\xC3\x9D"} = "\xC3\xBD";

$cases{"\xC3\x9E"} = "\xC3\xBE";

$cases{"\xC3\xBF"} = "\xC5\xB8";

$cases{"\xC4\x80"} = "\xC4\x81";

$cases{"\xC4\x82"} = "\xC4\x83";

$cases{"\xC4\x84"} = "\xC4\x85";

$cases{"\xC4\x86"} = "\xC4\x87";

$cases{"\xC4\x88"} = "\xC4\x89";

$cases{"\xC4\x8A"} = "\xC4\x8B";

$cases{"\xC4\x8C"} = "\xC4\x8D";

$cases{"\xC4\x8E"} = "\xC4\x8F";

$cases{"\xC4\x90"} = "\xC4\x91";

$cases{"\xC4\x92"} = "\xC4\x93";

$cases{"\xC4\x94"} = "\xC4\x95";

$cases{"\xC4\x96"} = "\xC4\x97";

$cases{"\xC4\x98"} = "\xC4\x99";

$cases{"\xC4\x9A"} = "\xC4\x9B";

$cases{"\xC4\x9C"} = "\xC4\x9D";

$cases{"\xC4\x9E"} = "\xC4\x9F";

$cases{"\xC4\xA0"} = "\xC4\xA1";

$cases{"\xC4\xA2"} = "\xC4\xA3";

$cases{"\xC4\xA4"} = "\xC4\xA5";

$cases{"\xC4\xA6"} = "\xC4\xA7";

$cases{"\xC4\xA8"} = "\xC4\xA9";

$cases{"\xC4\xAA"} = "\xC4\xAB";

$cases{"\xC4\xAC"} = "\xC4\xAD";

$cases{"\xC4\xAE"} = "\xC4\xAF";

$cases{"\xC4\xB2"} = "\xC4\xB3";

$cases{"\xC4\xB4"} = "\xC4\xB5";

$cases{"\xC4\xB6"} = "\xC4\xB7";

$cases{"\xC4\xB9"} = "\xC4\xBA";

$cases{"\xC4\xBB"} = "\xC4\xBC";

$cases{"\xC4\xBD"} = "\xC4\xBE";

$cases{"\xC4\xBF"} = "\xC5\x80";

$cases{"\xC5\x81"} = "\xC5\x82";

$cases{"\xC5\x83"} = "\xC5\x84";

$cases{"\xC5\x85"} = "\xC5\x86";

$cases{"\xC5\x87"} = "\xC5\x88";

$cases{"\xC5\x8A"} = "\xC5\x8B";

$cases{"\xC5\x8C"} = "\xC5\x8D";

$cases{"\xC5\x8E"} = "\xC5\x8F";

$cases{"\xC5\x90"} = "\xC5\x91";

$cases{"\xC5\x92"} = "\xC5\x93";

$cases{"\xC5\x94"} = "\xC5\x95";

$cases{"\xC5\x96"} = "\xC5\x97";

$cases{"\xC5\x98"} = "\xC5\x99";

$cases{"\xC5\x9A"} = "\xC5\x9B";

$cases{"\xC5\x9C"} = "\xC5\x9D";

$cases{"\xC5\x9E"} = "\xC5\x9F";

$cases{"\xC5\xA0"} = "\xC5\xA1";

$cases{"\xC5\xA2"} = "\xC5\xA3";

$cases{"\xC5\xA4"} = "\xC5\xA5";

$cases{"\xC5\xA6"} = "\xC5\xA7";

$cases{"\xC5\xA8"} = "\xC5\xA9";

$cases{"\xC5\xAA"} = "\xC5\xAB";

$cases{"\xC5\xAC"} = "\xC5\xAD";

$cases{"\xC5\xAE"} = "\xC5\xAF";

$cases{"\xC5\xB0"} = "\xC5\xB1";

$cases{"\xC5\xB2"} = "\xC5\xB3";

$cases{"\xC5\xB4"} = "\xC5\xB5";

$cases{"\xC5\xB6"} = "\xC5\xB7";

$cases{"\xC5\xB9"} = "\xC5\xBA";

$cases{"\xC5\xBB"} = "\xC5\xBC";

$cases{"\xC5\xBD"} = "\xC5\xBE";

$cases{"\xC6\x81"} = "\xC9\x93";

$cases{"\xC6\x82"} = "\xC6\x83";

$cases{"\xC6\x84"} = "\xC6\x85";

$cases{"\xC6\x86"} = "\xC9\x94";

$cases{"\xC6\x87"} = "\xC6\x88";

$cases{"\xC6\x89"} = "\xC9\x96";

$cases{"\xC6\x8A"} = "\xC9\x97";

$cases{"\xC6\x8B"} = "\xC6\x8C";

$cases{"\xC6\x8E"} = "\xC7\x9D";

$cases{"\xC6\x8F"} = "\xC9\x99";

$cases{"\xC6\x90"} = "\xC9\x9B";

$cases{"\xC6\x91"} = "\xC6\x92";

$cases{"\xC6\x93"} = "\xC9\xA0";

$cases{"\xC6\x94"} = "\xC9\xA3";

$cases{"\xC6\x95"} = "\xC7\xB6";

$cases{"\xC6\x96"} = "\xC9\xA9";

$cases{"\xC6\x97"} = "\xC9\xA8";

$cases{"\xC6\x98"} = "\xC6\x99";

$cases{"\xC6\x9A"} = "\xC8\xBD";

$cases{"\xC6\x9C"} = "\xC9\xAF";

$cases{"\xC6\x9D"} = "\xC9\xB2";

$cases{"\xC6\x9E"} = "\xC8\xA0";

$cases{"\xC6\x9F"} = "\xC9\xB5";

$cases{"\xC6\xA0"} = "\xC6\xA1";

$cases{"\xC6\xA2"} = "\xC6\xA3";

$cases{"\xC6\xA4"} = "\xC6\xA5";

$cases{"\xC6\xA6"} = "\xCA\x80";

$cases{"\xC6\xA7"} = "\xC6\xA8";

$cases{"\xC6\xA9"} = "\xCA\x83";

$cases{"\xC6\xAC"} = "\xC6\xAD";

$cases{"\xC6\xAE"} = "\xCA\x88";

$cases{"\xC6\xAF"} = "\xC6\xB0";

$cases{"\xC6\xB1"} = "\xCA\x8A";

$cases{"\xC6\xB2"} = "\xCA\x8B";

$cases{"\xC6\xB3"} = "\xC6\xB4";

$cases{"\xC6\xB5"} = "\xC6\xB6";

$cases{"\xC6\xB7"} = "\xCA\x92";

$cases{"\xC6\xB8"} = "\xC6\xB9";

$cases{"\xC6\xBC"} = "\xC6\xBD";

$cases{"\xC6\xBF"} = "\xC7\xB7";

$cases{"\xC7\x84"} = "\xC7\x85";

$cases{"\xC7\x87"} = "\xC7\x88";

$cases{"\xC7\x8A"} = "\xC7\x8B";

$cases{"\xC7\x8D"} = "\xC7\x8E";

$cases{"\xC7\x8F"} = "\xC7\x90";

$cases{"\xC7\x91"} = "\xC7\x92";

$cases{"\xC7\x93"} = "\xC7\x94";

$cases{"\xC7\x95"} = "\xC7\x96";

$cases{"\xC7\x97"} = "\xC7\x98";

$cases{"\xC7\x99"} = "\xC7\x9A";

$cases{"\xC7\x9B"} = "\xC7\x9C";

$cases{"\xC7\x9E"} = "\xC7\x9F";

$cases{"\xC7\xA0"} = "\xC7\xA1";

$cases{"\xC7\xA2"} = "\xC7\xA3";

$cases{"\xC7\xA4"} = "\xC7\xA5";

$cases{"\xC7\xA6"} = "\xC7\xA7";

$cases{"\xC7\xA8"} = "\xC7\xA9";

$cases{"\xC7\xAA"} = "\xC7\xAB";

$cases{"\xC7\xAC"} = "\xC7\xAD";

$cases{"\xC7\xAE"} = "\xC7\xAF";

$cases{"\xC7\xB1"} = "\xC7\xB2";

$cases{"\xC7\xB4"} = "\xC7\xB5";

$cases{"\xC7\xB8"} = "\xC7\xB9";

$cases{"\xC7\xBA"} = "\xC7\xBB";

$cases{"\xC7\xBC"} = "\xC7\xBD";

$cases{"\xC7\xBE"} = "\xC7\xBF";

$cases{"\xC8\x80"} = "\xC8\x81";

$cases{"\xC8\x82"} = "\xC8\x83";

$cases{"\xC8\x84"} = "\xC8\x85";

$cases{"\xC8\x86"} = "\xC8\x87";

$cases{"\xC8\x88"} = "\xC8\x89";

$cases{"\xC8\x8A"} = "\xC8\x8B";

$cases{"\xC8\x8C"} = "\xC8\x8D";

$cases{"\xC8\x8E"} = "\xC8\x8F";

$cases{"\xC8\x90"} = "\xC8\x91";

$cases{"\xC8\x92"} = "\xC8\x93";

$cases{"\xC8\x94"} = "\xC8\x95";

$cases{"\xC8\x96"} = "\xC8\x97";

$cases{"\xC8\x98"} = "\xC8\x99";

$cases{"\xC8\x9A"} = "\xC8\x9B";

$cases{"\xC8\x9C"} = "\xC8\x9D";

$cases{"\xC8\x9E"} = "\xC8\x9F";

$cases{"\xC8\xA2"} = "\xC8\xA3";

$cases{"\xC8\xA4"} = "\xC8\xA5";

$cases{"\xC8\xA6"} = "\xC8\xA7";

$cases{"\xC8\xA8"} = "\xC8\xA9";

$cases{"\xC8\xAA"} = "\xC8\xAB";

$cases{"\xC8\xAC"} = "\xC8\xAD";

$cases{"\xC8\xAE"} = "\xC8\xAF";

$cases{"\xC8\xB0"} = "\xC8\xB1";

$cases{"\xC8\xB2"} = "\xC8\xB3";

$cases{"\xC8\xBB"} = "\xC8\xBC";

$cases{"\xC9\x81"} = "\xCA\x94";

$cases{"\xE1\xB8\x80"} = "\xE1\xB8\x81";

$cases{"\xE1\xB8\x82"} = "\xE1\xB8\x83";

$cases{"\xE1\xB8\x84"} = "\xE1\xB8\x85";

$cases{"\xE1\xB8\x86"} = "\xE1\xB8\x87";

$cases{"\xE1\xB8\x88"} = "\xE1\xB8\x89";

$cases{"\xE1\xB8\x8A"} = "\xE1\xB8\x8B";

$cases{"\xE1\xB8\x8C"} = "\xE1\xB8\x8D";

$cases{"\xE1\xB8\x8E"} = "\xE1\xB8\x8F";

$cases{"\xE1\xB8\x90"} = "\xE1\xB8\x91";

$cases{"\xE1\xB8\x92"} = "\xE1\xB8\x93";

$cases{"\xE1\xB8\x94"} = "\xE1\xB8\x95";

$cases{"\xE1\xB8\x96"} = "\xE1\xB8\x97";

$cases{"\xE1\xB8\x98"} = "\xE1\xB8\x99";

$cases{"\xE1\xB8\x9A"} = "\xE1\xB8\x9B";

$cases{"\xE1\xB8\x9C"} = "\xE1\xB8\x9D";

$cases{"\xE1\xB8\x9E"} = "\xE1\xB8\x9F";

$cases{"\xE1\xB8\xA0"} = "\xE1\xB8\xA1";

$cases{"\xE1\xB8\xA2"} = "\xE1\xB8\xA3";

$cases{"\xE1\xB8\xA4"} = "\xE1\xB8\xA5";

$cases{"\xE1\xB8\xA6"} = "\xE1\xB8\xA7";

$cases{"\xE1\xB8\xA8"} = "\xE1\xB8\xA9";

$cases{"\xE1\xB8\xAA"} = "\xE1\xB8\xAB";

$cases{"\xE1\xB8\xAC"} = "\xE1\xB8\xAD";

$cases{"\xE1\xB8\xAE"} = "\xE1\xB8\xAF";

$cases{"\xE1\xB8\xB0"} = "\xE1\xB8\xB1";

$cases{"\xE1\xB8\xB2"} = "\xE1\xB8\xB3";

$cases{"\xE1\xB8\xB4"} = "\xE1\xB8\xB5";

$cases{"\xE1\xB8\xB6"} = "\xE1\xB8\xB7";

$cases{"\xE1\xB8\xB8"} = "\xE1\xB8\xB9";

$cases{"\xE1\xB8\xBA"} = "\xE1\xB8\xBB";

$cases{"\xE1\xB8\xBC"} = "\xE1\xB8\xBD";

$cases{"\xE1\xB8\xBE"} = "\xE1\xB8\xBF";

$cases{"\xE1\xB9\x80"} = "\xE1\xB9\x81";

$cases{"\xE1\xB9\x82"} = "\xE1\xB9\x83";

$cases{"\xE1\xB9\x84"} = "\xE1\xB9\x85";

$cases{"\xE1\xB9\x86"} = "\xE1\xB9\x87";

$cases{"\xE1\xB9\x88"} = "\xE1\xB9\x89";

$cases{"\xE1\xB9\x8A"} = "\xE1\xB9\x8B";

$cases{"\xE1\xB9\x8C"} = "\xE1\xB9\x8D";

$cases{"\xE1\xB9\x8E"} = "\xE1\xB9\x8F";

$cases{"\xE1\xB9\x90"} = "\xE1\xB9\x91";

$cases{"\xE1\xB9\x92"} = "\xE1\xB9\x93";

$cases{"\xE1\xB9\x94"} = "\xE1\xB9\x95";

$cases{"\xE1\xB9\x96"} = "\xE1\xB9\x97";

$cases{"\xE1\xB9\x98"} = "\xE1\xB9\x99";

$cases{"\xE1\xB9\x9A"} = "\xE1\xB9\x9B";

$cases{"\xE1\xB9\x9C"} = "\xE1\xB9\x9D";

$cases{"\xE1\xB9\x9E"} = "\xE1\xB9\x9F";

$cases{"\xE1\xB9\xA0"} = "\xE1\xB9\xA1";

$cases{"\xE1\xB9\xA2"} = "\xE1\xB9\xA3";

$cases{"\xE1\xB9\xA4"} = "\xE1\xB9\xA5";

$cases{"\xE1\xB9\xA6"} = "\xE1\xB9\xA7";

$cases{"\xE1\xB9\xA8"} = "\xE1\xB9\xA9";

$cases{"\xE1\xB9\xAA"} = "\xE1\xB9\xAB";

$cases{"\xE1\xB9\xAC"} = "\xE1\xB9\xAD";

$cases{"\xE1\xB9\xAE"} = "\xE1\xB9\xAF";

$cases{"\xE1\xB9\xB0"} = "\xE1\xB9\xB1";

$cases{"\xE1\xB9\xB2"} = "\xE1\xB9\xB3";

$cases{"\xE1\xB9\xB4"} = "\xE1\xB9\xB5";

$cases{"\xE1\xB9\xB6"} = "\xE1\xB9\xB7";

$cases{"\xE1\xB9\xB8"} = "\xE1\xB9\xB9";

$cases{"\xE1\xB9\xBA"} = "\xE1\xB9\xBB";

$cases{"\xE1\xB9\xBC"} = "\xE1\xB9\xBD";

$cases{"\xE1\xB9\xBE"} = "\xE1\xB9\xBF";

$cases{"\xE1\xBA\x80"} = "\xE1\xBA\x81";

$cases{"\xE1\xBA\x82"} = "\xE1\xBA\x83";

$cases{"\xE1\xBA\x84"} = "\xE1\xBA\x85";

$cases{"\xE1\xBA\x86"} = "\xE1\xBA\x87";

$cases{"\xE1\xBA\x88"} = "\xE1\xBA\x89";

$cases{"\xE1\xBA\x8A"} = "\xE1\xBA\x8B";

$cases{"\xE1\xBA\x8C"} = "\xE1\xBA\x8D";

$cases{"\xE1\xBA\x8E"} = "\xE1\xBA\x8F";

$cases{"\xE1\xBA\x90"} = "\xE1\xBA\x91";

$cases{"\xE1\xBA\x92"} = "\xE1\xBA\x93";

$cases{"\xE1\xBA\x94"} = "\xE1\xBA\x95";

$cases{"\xE1\xBA\xA0"} = "\xE1\xBA\xA1";

$cases{"\xE1\xBA\xA2"} = "\xE1\xBA\xA3";

$cases{"\xE1\xBA\xA4"} = "\xE1\xBA\xA5";

$cases{"\xE1\xBA\xA6"} = "\xE1\xBA\xA7";

$cases{"\xE1\xBA\xA8"} = "\xE1\xBA\xA9";

$cases{"\xE1\xBA\xAA"} = "\xE1\xBA\xAB";

$cases{"\xE1\xBA\xAC"} = "\xE1\xBA\xAD";

$cases{"\xE1\xBA\xAE"} = "\xE1\xBA\xAF";

$cases{"\xE1\xBA\xB0"} = "\xE1\xBA\xB1";

$cases{"\xE1\xBA\xB2"} = "\xE1\xBA\xB3";

$cases{"\xE1\xBA\xB4"} = "\xE1\xBA\xB5";

$cases{"\xE1\xBA\xB6"} = "\xE1\xBA\xB7";

$cases{"\xE1\xBA\xB8"} = "\xE1\xBA\xB9";

$cases{"\xE1\xBA\xBA"} = "\xE1\xBA\xBB";

$cases{"\xE1\xBA\xBC"} = "\xE1\xBA\xBD";

$cases{"\xE1\xBA\xBE"} = "\xE1\xBA\xBF";

$cases{"\xE1\xBB\x80"} = "\xE1\xBB\x81";

$cases{"\xE1\xBB\x82"} = "\xE1\xBB\x83";

$cases{"\xE1\xBB\x84"} = "\xE1\xBB\x85";

$cases{"\xE1\xBB\x86"} = "\xE1\xBB\x87";

$cases{"\xE1\xBB\x88"} = "\xE1\xBB\x89";

$cases{"\xE1\xBB\x8A"} = "\xE1\xBB\x8B";

$cases{"\xE1\xBB\x8C"} = "\xE1\xBB\x8D";

$cases{"\xE1\xBB\x8E"} = "\xE1\xBB\x8F";

$cases{"\xE1\xBB\x90"} = "\xE1\xBB\x91";

$cases{"\xE1\xBB\x92"} = "\xE1\xBB\x93";

$cases{"\xE1\xBB\x94"} = "\xE1\xBB\x95";

$cases{"\xE1\xBB\x96"} = "\xE1\xBB\x97";

$cases{"\xE1\xBB\x98"} = "\xE1\xBB\x99";

$cases{"\xE1\xBB\x9A"} = "\xE1\xBB\x9B";

$cases{"\xE1\xBB\x9C"} = "\xE1\xBB\x9D";

$cases{"\xE1\xBB\x9E"} = "\xE1\xBB\x9F";

$cases{"\xE1\xBB\xA0"} = "\xE1\xBB\xA1";

$cases{"\xE1\xBB\xA2"} = "\xE1\xBB\xA3";

$cases{"\xE1\xBB\xA4"} = "\xE1\xBB\xA5";

$cases{"\xE1\xBB\xA6"} = "\xE1\xBB\xA7";

$cases{"\xE1\xBB\xA8"} = "\xE1\xBB\xA9";

$cases{"\xE1\xBB\xAA"} = "\xE1\xBB\xAB";

$cases{"\xE1\xBB\xAC"} = "\xE1\xBB\xAD";

$cases{"\xE1\xBB\xAE"} = "\xE1\xBB\xAF";

$cases{"\xE1\xBB\xB0"} = "\xE1\xBB\xB1";

$cases{"\xE1\xBB\xB2"} = "\xE1\xBB\xB3";

$cases{"\xE1\xBB\xB4"} = "\xE1\xBB\xB5";

$cases{"\xE1\xBB\xB6"} = "\xE1\xBB\xB7";

$cases{"\xE1\xBB\xB8"} = "\xE1\xBB\xB9";

$cases{"\xE2\x92\xB6"} = "\xE2\x93\x90";

$cases{"\xE2\x92\xB7"} = "\xE2\x93\x91";

$cases{"\xE2\x92\xB8"} = "\xE2\x93\x92";

$cases{"\xE2\x92\xB9"} = "\xE2\x93\x93";

$cases{"\xE2\x92\xBA"} = "\xE2\x93\x94";

$cases{"\xE2\x92\xBB"} = "\xE2\x93\x95";

$cases{"\xE2\x92\xBC"} = "\xE2\x93\x96";

$cases{"\xE2\x92\xBD"} = "\xE2\x93\x97";

$cases{"\xE2\x92\xBE"} = "\xE2\x93\x98";

$cases{"\xE2\x92\xBF"} = "\xE2\x93\x99";

$cases{"\xE2\x93\x80"} = "\xE2\x93\x9A";

$cases{"\xE2\x93\x81"} = "\xE2\x93\x9B";

$cases{"\xE2\x93\x82"} = "\xE2\x93\x9C";

$cases{"\xE2\x93\x83"} = "\xE2\x93\x9D";

$cases{"\xE2\x93\x84"} = "\xE2\x93\x9E";

$cases{"\xE2\x93\x85"} = "\xE2\x93\x9F";

$cases{"\xE2\x93\x86"} = "\xE2\x93\xA0";

$cases{"\xE2\x93\x87"} = "\xE2\x93\xA1";

$cases{"\xE2\x93\x88"} = "\xE2\x93\xA2";

$cases{"\xE2\x93\x89"} = "\xE2\x93\xA3";

$cases{"\xE2\x93\x8A"} = "\xE2\x93\xA4";

$cases{"\xE2\x93\x8B"} = "\xE2\x93\xA5";

$cases{"\xE2\x93\x8C"} = "\xE2\x93\xA6";

$cases{"\xE2\x93\x8D"} = "\xE2\x93\xA7";

$cases{"\xE2\x93\x8E"} = "\xE2\x93\xA8";

$cases{"\xE2\x93\x8F"} = "\xE2\x93\xA9";

$cases{"\xE2\xB0\xAE"} = "\xE2\xB1\x9E";

$cases{"\xEF\xBC\xA1"} = "\xEF\xBD\x81";

$cases{"\xEF\xBC\xA2"} = "\xEF\xBD\x82";

$cases{"\xEF\xBC\xA3"} = "\xEF\xBD\x83";

$cases{"\xEF\xBC\xA4"} = "\xEF\xBD\x84";

$cases{"\xEF\xBC\xA5"} = "\xEF\xBD\x85";

$cases{"\xEF\xBC\xA6"} = "\xEF\xBD\x86";

$cases{"\xEF\xBC\xA7"} = "\xEF\xBD\x87";

$cases{"\xEF\xBC\xA8"} = "\xEF\xBD\x88";

$cases{"\xEF\xBC\xA9"} = "\xEF\xBD\x89";

$cases{"\xEF\xBC\xAA"} = "\xEF\xBD\x8A";

$cases{"\xEF\xBC\xAB"} = "\xEF\xBD\x8B";

$cases{"\xEF\xBC\xAC"} = "\xEF\xBD\x8C";

$cases{"\xEF\xBC\xAD"} = "\xEF\xBD\x8D";

$cases{"\xEF\xBC\xAE"} = "\xEF\xBD\x8E";

$cases{"\xEF\xBC\xAF"} = "\xEF\xBD\x8F";

$cases{"\xEF\xBC\xB0"} = "\xEF\xBD\x90";

$cases{"\xEF\xBC\xB1"} = "\xEF\xBD\x91";

$cases{"\xEF\xBC\xB2"} = "\xEF\xBD\x92";

$cases{"\xEF\xBC\xB3"} = "\xEF\xBD\x93";

$cases{"\xEF\xBC\xB4"} = "\xEF\xBD\x94";

$cases{"\xEF\xBC\xB5"} = "\xEF\xBD\x95";

$cases{"\xEF\xBC\xB6"} = "\xEF\xBD\x96";

$cases{"\xEF\xBC\xB7"} = "\xEF\xBD\x97";

$cases{"\xEF\xBC\xB8"} = "\xEF\xBD\x98";

$cases{"\xEF\xBC\xB9"} = "\xEF\xBD\x99";

$cases{"\xEF\xBC\xBA"} = "\xEF\xBD\x9A";

%ACCENTS = ( 'A', "(a|\xc4\x80|\xc3\x84|\xc3\x82|\xc4\x81|\xc3\xa0|\xc3\xa1|\xc3\xa2|\xc3\xa3|\xc3\xa4)",

'C', "(c|\xc4\x8d|\xc4\x8c|\xc3\x87|\xc3\xa7)",

'D', "(d|\xe1\xb8\x8c|\xe1\xb8\x8d)",

'E', "(e|\xc3\x89|\xc3\x88|\xc3\xa8|\xc3\xa9|\xc3\xaa|\xc3\xab)",

'G', "(g|\xc4\x9e|\xc4\x9f)",

'H', "(h|\xe1\xb8\xa4|\xe1\xb8\xa5)",

'I', "(i|\xc3\x8f|\xc4\xb1|\xc4\xb0|\xc4\xaa|\xc4\xab|\xc3\xac|\xc3\xad|\xc3\xae|\xc3\xaf)",

'K', "(k|\xe1\xb8\xb2|\xe1\xb8\xb3)",

'N', "(n|\xc3\xb1)",

'O', "(o|\xc3\x96|\xc3\xb2|\xc3\xb3|\xc3\xb4|\xc3\xb4|\xc3\xb6)",

'S', "(s|\xe1\xb9\xa2|\xe1\xb9\xa3|\xc8\x98|\xc5\x9f|\xc5\x9e|\xc5\xa1|\xc5\xa0)",

'T', "(t|\xe1\xb9\xad|\xe1\xb9\xac)",

'U', "(u|\xc3\x9c|\xc5\xaa|\xc5\xab|\xc3\xb9|\xc3\xba|\xc3\xbb|\xc3\xbc)",

'Z', "(z|\xe1\xba\x92|\xe1\xba\x93|\xc5\xbd|\xc5\xbe)",

"'", "('|\xca\xbf|\xca\xbe)",

'Y', "[y\375\377]",

);

$DOTPATTERN = "([a-zA-Z0-9]|[\xa0-\xc3][\xa0-\xc3])";

while (<>)

{

chomp;

$foo = `echo "got it: $_\n" > /Volumes/data/var/lib/philologic/databases/mamluk/crapout`;

# Here we substitute in both cases of a Unicode character defined above

# so that we can do case-folded Unicode searches.

while (($a, $b) = each(%cases)) {

s/$a|$b/\($a|$b\)/g;

}

$foo = `echo "after: $_\n" >> /Volumes/data/var/lib/philologic/databases/mamluk/crapout`;

$prefix = /^\256\?/ ? "\256\?" : /^\256/ ? "\256" : "";

$prefix = "";

s/^\256\?*//;

s/(\([^\)]*)(\|)([^\)]*\))/$1#PIPE#$3/g;

@patterns = split ('\|', $_);

foreach $pattern (@patterns)

{

# $pattern =~ s/^(.*)$/^$prefix$1\$/;

$pattern =~ s/^(.*)$/^$prefix$1\t/;

}

$_ = join ("|", @patterns);

s/#PIPE#/\|/g;

s/[ACDEGHIKNOSTUZ'Y]/$ACCENTS{$&}/ge;

$foo = `echo '$_' >> /Volumes/data/var/lib/philologic/databases/mamluk/crapout`;

s/(\.)([^\*])/$DOTPATTERN$2/g;

tr/A-Z/a-z/;

system ("/usr/bin/grep -E -i -e \"$_\" < " . $ENV{SYSTEM_DIR} . "words.R.wom" . " | /sw/bin/gawk -F\"\t\" '{print \$2}' ");

}