
Unicode Case Folding

Problem

You want a search for either the upper-case or the lower-case form of a Unicode character to return both forms. In other words, you want case folding.


POSSIBLE ELEGANT SOLUTION

Doesn't Perl have Unicode-aware regular expressions? Sure. So why can't we use POSIX character classes to neutralize case distinctions? Because that doesn't seem to work here. Can we simply search for each upper-case character and replace it with an (upper-case|lower-case) alternation? If so, skip the rest of this page and do that instead.
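
One plausible reason the built-in approach "doesn't seem to work" (this is an assumption, not something the page confirms) is that the query reaches the script as raw UTF-8 bytes. Perl's /i only applies Unicode case rules to decoded character strings. The sketch below compares the two situations using the core Encode module; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for A-grave (U+00C0) and a-grave (U+00E0),
# as they would arrive from a file or a pipe.
my $upper_bytes = "\xC3\x80";
my $lower_bytes = "\xC3\xA0";

# On undecoded bytes, /i has no idea these two sequences are a case pair.
my $bytes_match = ($upper_bytes =~ /\A\Q$lower_bytes\E\z/i) ? 1 : 0;

# After decoding, each is a single character and /i uses Unicode rules.
my $upper = decode('UTF-8', $upper_bytes);
my $lower = decode('UTF-8', $lower_bytes);
my $chars_match = ($upper =~ /\A\Q$lower\E\z/i) ? 1 : 0;

print "bytes: $bytes_match, characters: $chars_match\n";
```

Even if this works inside Perl, the final pattern here is handed to an external grep, whose -i handling of non-ASCII characters depends on the locale it runs under, so the table-based approach below may still be the practical choice.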


Solution

You might be able to do this in words.R.wom itself, but the combinatorics get ugly if it has to hold an entry for every possible mix of upper and lower case in each word. It is better to hack something onto crapser to take care of this. Basically, you need a table (a Perl hash) of character equivalencies between upper and lower case. Stick that at the top of crapser like so:

#!/usr/bin/perl
# $Id: crapser-egrep-2field.plin,v 2.1 2004/08/23 21:45:03 o Exp $

# These cases are Unicode characters upper -> lower and lower -> upper equivalencies.
# We need to search on either one and return either one, so they are subbed in
# down below to do casefolding.

$cases{"\xC3\x80"} = "\xC3\xA0";
$cases{"\xC3\x81"} = "\xC3\xA1";
$cases{"\xC3\x82"} = "\xC3\xA2";
$cases{"\xC3\x83"} = "\xC3\xA3";
# .... etc .....
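
Rather than typing every pair by hand, the table could probably be generated from Perl's own case mappings. The following is only a sketch and not part of the original crapser. The code point ranges are an assumption (Latin-1 Supplement through Latin Extended-B, plus Latin Extended Additional), and it needs Perl 5.12 or later for the unicode_strings feature.

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # Unicode rules for lc/uc on U+0080..U+00FF too
use Encode qw(encode);

my %cases;
for my $cp (0x00C0 .. 0x024F, 0x1E00 .. 0x1EFF) {
    my $upper = chr($cp);
    my $lower = lc $upper;
    next if $lower eq $upper;        # no distinct lower-case form
    next if length($lower) != 1;     # lower-case form is more than one character
    next if uc($lower) ne $upper;    # mapping does not round-trip, skip it
    # Store both sides as UTF-8 byte strings, the form the hand-written table uses.
    $cases{ encode('UTF-8', $upper) } = encode('UTF-8', $lower);
}

printf "%d pairs generated\n", scalar keys %cases;
```

The round-trip check deliberately drops awkward characters such as titlecase digraphs and dotted capital I, so the result is a conservative table rather than full Unicode case folding.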

Then, down below in the while loop, add:

# Here we substitute in both cases of a Unicode character defined above
# so that we can do case-folded Unicode searches.

while (($a, $b) = each(%cases)) {
s/$a|$b/\($a|$b\)/g;
}

That should do it. The entire crapser file from the Mamluk project is included below for reference. Unfortunately, its equivalency table is far from complete, so you might want to build your own.
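
To see what the substitution loop does to a query, here is a small standalone sketch with a two-entry table and a made-up query. It uses named variables and \Q...\E quoting instead of the bare $a and $b of the original, which is a stylistic assumption and not how crapser itself is written.

```perl
use strict;
use warnings;

# A tiny stand-in for the full table (UTF-8 bytes, upper => lower).
my %cases = (
    "\xC3\x80" => "\xC3\xA0",   # A-grave => a-grave
    "\xC3\x89" => "\xC3\xA9",   # E-acute => e-acute
);

# UTF-8 bytes for the French word "ete" with both e's accented.
my $query = "\xC3\xA9t\xC3\xA9";

# Replace every occurrence of either case with a group matching both.
while (my ($upper, $lower) = each %cases) {
    $query =~ s/\Q$upper\E|\Q$lower\E/($upper|$lower)/g;
}

print "$query\n";
```

Each accented e becomes a two-way group, so the pattern passed on to grep matches the word whichever case the index happens to store.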

crapser:

#!/usr/bin/perl
# $Id: crapser-egrep-2field.plin,v 2.1 2004/08/23 21:45:03 o Exp $

# These cases are Unicode characters upper -> lower and lower -> upper equivalencies.
# We need to search on either one and return either one, so they are subbed in
# down below to do casefolding.

$cases{"\xC3\x80"} = "\xC3\xA0";
$cases{"\xC3\x81"} = "\xC3\xA1";
$cases{"\xC3\x82"} = "\xC3\xA2";
$cases{"\xC3\x83"} = "\xC3\xA3";
$cases{"\xC3\x84"} = "\xC3\xA4";
$cases{"\xC3\x85"} = "\xC3\xA5";
$cases{"\xC3\x86"} = "\xC3\xA6";
$cases{"\xC3\x87"} = "\xC3\xA7";
$cases{"\xC3\x88"} = "\xC3\xA8";
$cases{"\xC3\x89"} = "\xC3\xA9";
$cases{"\xC3\x8A"} = "\xC3\xAA";
$cases{"\xC3\x8B"} = "\xC3\xAB";
$cases{"\xC3\x8C"} = "\xC3\xAC";
$cases{"\xC3\x8D"} = "\xC3\xAD";
$cases{"\xC3\x8E"} = "\xC3\xAE";
$cases{"\xC3\x8F"} = "\xC3\xAF";
$cases{"\xC3\x90"} = "\xC3\xB0";
$cases{"\xC3\x91"} = "\xC3\xB1";
$cases{"\xC3\x92"} = "\xC3\xB2";
$cases{"\xC3\x93"} = "\xC3\xB3";
$cases{"\xC3\x94"} = "\xC3\xB4";
$cases{"\xC3\x95"} = "\xC3\xB5";
$cases{"\xC3\x96"} = "\xC3\xB6";
$cases{"\xC3\x98"} = "\xC3\xB8";
$cases{"\xC3\x99"} = "\xC3\xB9";
$cases{"\xC3\x9A"} = "\xC3\xBA";
$cases{"\xC3\x9B"} = "\xC3\xBB";
$cases{"\xC3\x9C"} = "\xC3\xBC";
$cases{"\xC3\x9D"} = "\xC3\xBD";
$cases{"\xC3\x9E"} = "\xC3\xBE";
$cases{"\xC3\xBF"} = "\xC5\xB8";
$cases{"\xC4\x80"} = "\xC4\x81";
$cases{"\xC4\x82"} = "\xC4\x83";
$cases{"\xC4\x84"} = "\xC4\x85";
$cases{"\xC4\x86"} = "\xC4\x87";
$cases{"\xC4\x88"} = "\xC4\x89";
$cases{"\xC4\x8A"} = "\xC4\x8B";
$cases{"\xC4\x8C"} = "\xC4\x8D";
$cases{"\xC4\x8E"} = "\xC4\x8F";
$cases{"\xC4\x90"} = "\xC4\x91";
$cases{"\xC4\x92"} = "\xC4\x93";
$cases{"\xC4\x94"} = "\xC4\x95";
$cases{"\xC4\x96"} = "\xC4\x97";
$cases{"\xC4\x98"} = "\xC4\x99";
$cases{"\xC4\x9A"} = "\xC4\x9B";
$cases{"\xC4\x9C"} = "\xC4\x9D";
$cases{"\xC4\x9E"} = "\xC4\x9F";
$cases{"\xC4\xA0"} = "\xC4\xA1";
$cases{"\xC4\xA2"} = "\xC4\xA3";
$cases{"\xC4\xA4"} = "\xC4\xA5";
$cases{"\xC4\xA6"} = "\xC4\xA7";
$cases{"\xC4\xA8"} = "\xC4\xA9";
$cases{"\xC4\xAA"} = "\xC4\xAB";
$cases{"\xC4\xAC"} = "\xC4\xAD";
$cases{"\xC4\xAE"} = "\xC4\xAF";
$cases{"\xC4\xB2"} = "\xC4\xB3";
$cases{"\xC4\xB4"} = "\xC4\xB5";
$cases{"\xC4\xB6"} = "\xC4\xB7";
$cases{"\xC4\xB9"} = "\xC4\xBA";
$cases{"\xC4\xBB"} = "\xC4\xBC";
$cases{"\xC4\xBD"} = "\xC4\xBE";
$cases{"\xC4\xBF"} = "\xC5\x80";
$cases{"\xC5\x81"} = "\xC5\x82";
$cases{"\xC5\x83"} = "\xC5\x84";
$cases{"\xC5\x85"} = "\xC5\x86";
$cases{"\xC5\x87"} = "\xC5\x88";
$cases{"\xC5\x8A"} = "\xC5\x8B";
$cases{"\xC5\x8C"} = "\xC5\x8D";
$cases{"\xC5\x8E"} = "\xC5\x8F";
$cases{"\xC5\x90"} = "\xC5\x91";
$cases{"\xC5\x92"} = "\xC5\x93";
$cases{"\xC5\x94"} = "\xC5\x95";
$cases{"\xC5\x96"} = "\xC5\x97";
$cases{"\xC5\x98"} = "\xC5\x99";
$cases{"\xC5\x9A"} = "\xC5\x9B";
$cases{"\xC5\x9C"} = "\xC5\x9D";
$cases{"\xC5\x9E"} = "\xC5\x9F";
$cases{"\xC5\xA0"} = "\xC5\xA1";
$cases{"\xC5\xA2"} = "\xC5\xA3";
$cases{"\xC5\xA4"} = "\xC5\xA5";
$cases{"\xC5\xA6"} = "\xC5\xA7";
$cases{"\xC5\xA8"} = "\xC5\xA9";
$cases{"\xC5\xAA"} = "\xC5\xAB";
$cases{"\xC5\xAC"} = "\xC5\xAD";
$cases{"\xC5\xAE"} = "\xC5\xAF";
$cases{"\xC5\xB0"} = "\xC5\xB1";
$cases{"\xC5\xB2"} = "\xC5\xB3";
$cases{"\xC5\xB4"} = "\xC5\xB5";
$cases{"\xC5\xB6"} = "\xC5\xB7";
$cases{"\xC5\xB9"} = "\xC5\xBA";
$cases{"\xC5\xBB"} = "\xC5\xBC";
$cases{"\xC5\xBD"} = "\xC5\xBE";
$cases{"\xC6\x81"} = "\xC9\x93";
$cases{"\xC6\x82"} = "\xC6\x83";
$cases{"\xC6\x84"} = "\xC6\x85";
$cases{"\xC6\x86"} = "\xC9\x94";
$cases{"\xC6\x87"} = "\xC6\x88";
$cases{"\xC6\x89"} = "\xC9\x96";
$cases{"\xC6\x8A"} = "\xC9\x97";
$cases{"\xC6\x8B"} = "\xC6\x8C";
$cases{"\xC6\x8E"} = "\xC7\x9D";
$cases{"\xC6\x8F"} = "\xC9\x99";
$cases{"\xC6\x90"} = "\xC9\x9B";
$cases{"\xC6\x91"} = "\xC6\x92";
$cases{"\xC6\x93"} = "\xC9\xA0";
$cases{"\xC6\x94"} = "\xC9\xA3";
$cases{"\xC6\x95"} = "\xC7\xB6";
$cases{"\xC6\x96"} = "\xC9\xA9";
$cases{"\xC6\x97"} = "\xC9\xA8";
$cases{"\xC6\x98"} = "\xC6\x99";
$cases{"\xC6\x9A"} = "\xC8\xBD";
$cases{"\xC6\x9C"} = "\xC9\xAF";
$cases{"\xC6\x9D"} = "\xC9\xB2";
$cases{"\xC6\x9E"} = "\xC8\xA0";
$cases{"\xC6\x9F"} = "\xC9\xB5";
$cases{"\xC6\xA0"} = "\xC6\xA1";
$cases{"\xC6\xA2"} = "\xC6\xA3";
$cases{"\xC6\xA4"} = "\xC6\xA5";
$cases{"\xC6\xA6"} = "\xCA\x80";
$cases{"\xC6\xA7"} = "\xC6\xA8";
$cases{"\xC6\xA9"} = "\xCA\x83";
$cases{"\xC6\xAC"} = "\xC6\xAD";
$cases{"\xC6\xAE"} = "\xCA\x88";
$cases{"\xC6\xAF"} = "\xC6\xB0";
$cases{"\xC6\xB1"} = "\xCA\x8A";
$cases{"\xC6\xB2"} = "\xCA\x8B";
$cases{"\xC6\xB3"} = "\xC6\xB4";
$cases{"\xC6\xB5"} = "\xC6\xB6";
$cases{"\xC6\xB7"} = "\xCA\x92";
$cases{"\xC6\xB8"} = "\xC6\xB9";
$cases{"\xC6\xBC"} = "\xC6\xBD";
$cases{"\xC6\xBF"} = "\xC7\xB7";
$cases{"\xC7\x84"} = "\xC7\x85";
$cases{"\xC7\x87"} = "\xC7\x88";
$cases{"\xC7\x8A"} = "\xC7\x8B";
$cases{"\xC7\x8D"} = "\xC7\x8E";
$cases{"\xC7\x8F"} = "\xC7\x90";
$cases{"\xC7\x91"} = "\xC7\x92";
$cases{"\xC7\x93"} = "\xC7\x94";
$cases{"\xC7\x95"} = "\xC7\x96";
$cases{"\xC7\x97"} = "\xC7\x98";
$cases{"\xC7\x99"} = "\xC7\x9A";
$cases{"\xC7\x9B"} = "\xC7\x9C";
$cases{"\xC7\x9E"} = "\xC7\x9F";
$cases{"\xC7\xA0"} = "\xC7\xA1";
$cases{"\xC7\xA2"} = "\xC7\xA3";
$cases{"\xC7\xA4"} = "\xC7\xA5";
$cases{"\xC7\xA6"} = "\xC7\xA7";
$cases{"\xC7\xA8"} = "\xC7\xA9";
$cases{"\xC7\xAA"} = "\xC7\xAB";
$cases{"\xC7\xAC"} = "\xC7\xAD";
$cases{"\xC7\xAE"} = "\xC7\xAF";
$cases{"\xC7\xB1"} = "\xC7\xB2";
$cases{"\xC7\xB4"} = "\xC7\xB5";
$cases{"\xC7\xB8"} = "\xC7\xB9";
$cases{"\xC7\xBA"} = "\xC7\xBB";
$cases{"\xC7\xBC"} = "\xC7\xBD";
$cases{"\xC7\xBE"} = "\xC7\xBF";
$cases{"\xC8\x80"} = "\xC8\x81";
$cases{"\xC8\x82"} = "\xC8\x83";
$cases{"\xC8\x84"} = "\xC8\x85";
$cases{"\xC8\x86"} = "\xC8\x87";
$cases{"\xC8\x88"} = "\xC8\x89";
$cases{"\xC8\x8A"} = "\xC8\x8B";
$cases{"\xC8\x8C"} = "\xC8\x8D";
$cases{"\xC8\x8E"} = "\xC8\x8F";
$cases{"\xC8\x90"} = "\xC8\x91";
$cases{"\xC8\x92"} = "\xC8\x93";
$cases{"\xC8\x94"} = "\xC8\x95";
$cases{"\xC8\x96"} = "\xC8\x97";
$cases{"\xC8\x98"} = "\xC8\x99";
$cases{"\xC8\x9A"} = "\xC8\x9B";
$cases{"\xC8\x9C"} = "\xC8\x9D";
$cases{"\xC8\x9E"} = "\xC8\x9F";
$cases{"\xC8\xA2"} = "\xC8\xA3";
$cases{"\xC8\xA4"} = "\xC8\xA5";
$cases{"\xC8\xA6"} = "\xC8\xA7";
$cases{"\xC8\xA8"} = "\xC8\xA9";
$cases{"\xC8\xAA"} = "\xC8\xAB";
$cases{"\xC8\xAC"} = "\xC8\xAD";
$cases{"\xC8\xAE"} = "\xC8\xAF";
$cases{"\xC8\xB0"} = "\xC8\xB1";
$cases{"\xC8\xB2"} = "\xC8\xB3";
$cases{"\xC8\xBB"} = "\xC8\xBC";
$cases{"\xC9\x81"} = "\xCA\x94";
$cases{"\xE1\xB8\x80"} = "\xE1\xB8\x81";
$cases{"\xE1\xB8\x82"} = "\xE1\xB8\x83";
$cases{"\xE1\xB8\x84"} = "\xE1\xB8\x85";
$cases{"\xE1\xB8\x86"} = "\xE1\xB8\x87";
$cases{"\xE1\xB8\x88"} = "\xE1\xB8\x89";
$cases{"\xE1\xB8\x8A"} = "\xE1\xB8\x8B";
$cases{"\xE1\xB8\x8C"} = "\xE1\xB8\x8D";
$cases{"\xE1\xB8\x8E"} = "\xE1\xB8\x8F";
$cases{"\xE1\xB8\x90"} = "\xE1\xB8\x91";
$cases{"\xE1\xB8\x92"} = "\xE1\xB8\x93";
$cases{"\xE1\xB8\x94"} = "\xE1\xB8\x95";
$cases{"\xE1\xB8\x96"} = "\xE1\xB8\x97";
$cases{"\xE1\xB8\x98"} = "\xE1\xB8\x99";
$cases{"\xE1\xB8\x9A"} = "\xE1\xB8\x9B";
$cases{"\xE1\xB8\x9C"} = "\xE1\xB8\x9D";
$cases{"\xE1\xB8\x9E"} = "\xE1\xB8\x9F";
$cases{"\xE1\xB8\xA0"} = "\xE1\xB8\xA1";
$cases{"\xE1\xB8\xA2"} = "\xE1\xB8\xA3";
$cases{"\xE1\xB8\xA4"} = "\xE1\xB8\xA5";
$cases{"\xE1\xB8\xA6"} = "\xE1\xB8\xA7";
$cases{"\xE1\xB8\xA8"} = "\xE1\xB8\xA9";
$cases{"\xE1\xB8\xAA"} = "\xE1\xB8\xAB";
$cases{"\xE1\xB8\xAC"} = "\xE1\xB8\xAD";
$cases{"\xE1\xB8\xAE"} = "\xE1\xB8\xAF";
$cases{"\xE1\xB8\xB0"} = "\xE1\xB8\xB1";
$cases{"\xE1\xB8\xB2"} = "\xE1\xB8\xB3";
$cases{"\xE1\xB8\xB4"} = "\xE1\xB8\xB5";
$cases{"\xE1\xB8\xB6"} = "\xE1\xB8\xB7";
$cases{"\xE1\xB8\xB8"} = "\xE1\xB8\xB9";
$cases{"\xE1\xB8\xBA"} = "\xE1\xB8\xBB";
$cases{"\xE1\xB8\xBC"} = "\xE1\xB8\xBD";
$cases{"\xE1\xB8\xBE"} = "\xE1\xB8\xBF";
$cases{"\xE1\xB9\x80"} = "\xE1\xB9\x81";
$cases{"\xE1\xB9\x82"} = "\xE1\xB9\x83";
$cases{"\xE1\xB9\x84"} = "\xE1\xB9\x85";
$cases{"\xE1\xB9\x86"} = "\xE1\xB9\x87";
$cases{"\xE1\xB9\x88"} = "\xE1\xB9\x89";
$cases{"\xE1\xB9\x8A"} = "\xE1\xB9\x8B";
$cases{"\xE1\xB9\x8C"} = "\xE1\xB9\x8D";
$cases{"\xE1\xB9\x8E"} = "\xE1\xB9\x8F";
$cases{"\xE1\xB9\x90"} = "\xE1\xB9\x91";
$cases{"\xE1\xB9\x92"} = "\xE1\xB9\x93";
$cases{"\xE1\xB9\x94"} = "\xE1\xB9\x95";
$cases{"\xE1\xB9\x96"} = "\xE1\xB9\x97";
$cases{"\xE1\xB9\x98"} = "\xE1\xB9\x99";
$cases{"\xE1\xB9\x9A"} = "\xE1\xB9\x9B";
$cases{"\xE1\xB9\x9C"} = "\xE1\xB9\x9D";
$cases{"\xE1\xB9\x9E"} = "\xE1\xB9\x9F";
$cases{"\xE1\xB9\xA0"} = "\xE1\xB9\xA1";
$cases{"\xE1\xB9\xA2"} = "\xE1\xB9\xA3";
$cases{"\xE1\xB9\xA4"} = "\xE1\xB9\xA5";
$cases{"\xE1\xB9\xA6"} = "\xE1\xB9\xA7";
$cases{"\xE1\xB9\xA8"} = "\xE1\xB9\xA9";
$cases{"\xE1\xB9\xAA"} = "\xE1\xB9\xAB";
$cases{"\xE1\xB9\xAC"} = "\xE1\xB9\xAD";
$cases{"\xE1\xB9\xAE"} = "\xE1\xB9\xAF";
$cases{"\xE1\xB9\xB0"} = "\xE1\xB9\xB1";
$cases{"\xE1\xB9\xB2"} = "\xE1\xB9\xB3";
$cases{"\xE1\xB9\xB4"} = "\xE1\xB9\xB5";
$cases{"\xE1\xB9\xB6"} = "\xE1\xB9\xB7";
$cases{"\xE1\xB9\xB8"} = "\xE1\xB9\xB9";
$cases{"\xE1\xB9\xBA"} = "\xE1\xB9\xBB";
$cases{"\xE1\xB9\xBC"} = "\xE1\xB9\xBD";
$cases{"\xE1\xB9\xBE"} = "\xE1\xB9\xBF";
$cases{"\xE1\xBA\x80"} = "\xE1\xBA\x81";
$cases{"\xE1\xBA\x82"} = "\xE1\xBA\x83";
$cases{"\xE1\xBA\x84"} = "\xE1\xBA\x85";
$cases{"\xE1\xBA\x86"} = "\xE1\xBA\x87";
$cases{"\xE1\xBA\x88"} = "\xE1\xBA\x89";
$cases{"\xE1\xBA\x8A"} = "\xE1\xBA\x8B";
$cases{"\xE1\xBA\x8C"} = "\xE1\xBA\x8D";
$cases{"\xE1\xBA\x8E"} = "\xE1\xBA\x8F";
$cases{"\xE1\xBA\x90"} = "\xE1\xBA\x91";
$cases{"\xE1\xBA\x92"} = "\xE1\xBA\x93";
$cases{"\xE1\xBA\x94"} = "\xE1\xBA\x95";
$cases{"\xE1\xBA\xA0"} = "\xE1\xBA\xA1";
$cases{"\xE1\xBA\xA2"} = "\xE1\xBA\xA3";
$cases{"\xE1\xBA\xA4"} = "\xE1\xBA\xA5";
$cases{"\xE1\xBA\xA6"} = "\xE1\xBA\xA7";
$cases{"\xE1\xBA\xA8"} = "\xE1\xBA\xA9";
$cases{"\xE1\xBA\xAA"} = "\xE1\xBA\xAB";
$cases{"\xE1\xBA\xAC"} = "\xE1\xBA\xAD";
$cases{"\xE1\xBA\xAE"} = "\xE1\xBA\xAF";
$cases{"\xE1\xBA\xB0"} = "\xE1\xBA\xB1";
$cases{"\xE1\xBA\xB2"} = "\xE1\xBA\xB3";
$cases{"\xE1\xBA\xB4"} = "\xE1\xBA\xB5";
$cases{"\xE1\xBA\xB6"} = "\xE1\xBA\xB7";
$cases{"\xE1\xBA\xB8"} = "\xE1\xBA\xB9";
$cases{"\xE1\xBA\xBA"} = "\xE1\xBA\xBB";
$cases{"\xE1\xBA\xBC"} = "\xE1\xBA\xBD";
$cases{"\xE1\xBA\xBE"} = "\xE1\xBA\xBF";
$cases{"\xE1\xBB\x80"} = "\xE1\xBB\x81";
$cases{"\xE1\xBB\x82"} = "\xE1\xBB\x83";
$cases{"\xE1\xBB\x84"} = "\xE1\xBB\x85";
$cases{"\xE1\xBB\x86"} = "\xE1\xBB\x87";
$cases{"\xE1\xBB\x88"} = "\xE1\xBB\x89";
$cases{"\xE1\xBB\x8A"} = "\xE1\xBB\x8B";
$cases{"\xE1\xBB\x8C"} = "\xE1\xBB\x8D";
$cases{"\xE1\xBB\x8E"} = "\xE1\xBB\x8F";
$cases{"\xE1\xBB\x90"} = "\xE1\xBB\x91";
$cases{"\xE1\xBB\x92"} = "\xE1\xBB\x93";
$cases{"\xE1\xBB\x94"} = "\xE1\xBB\x95";
$cases{"\xE1\xBB\x96"} = "\xE1\xBB\x97";
$cases{"\xE1\xBB\x98"} = "\xE1\xBB\x99";
$cases{"\xE1\xBB\x9A"} = "\xE1\xBB\x9B";
$cases{"\xE1\xBB\x9C"} = "\xE1\xBB\x9D";
$cases{"\xE1\xBB\x9E"} = "\xE1\xBB\x9F";
$cases{"\xE1\xBB\xA0"} = "\xE1\xBB\xA1";
$cases{"\xE1\xBB\xA2"} = "\xE1\xBB\xA3";
$cases{"\xE1\xBB\xA4"} = "\xE1\xBB\xA5";
$cases{"\xE1\xBB\xA6"} = "\xE1\xBB\xA7";
$cases{"\xE1\xBB\xA8"} = "\xE1\xBB\xA9";
$cases{"\xE1\xBB\xAA"} = "\xE1\xBB\xAB";
$cases{"\xE1\xBB\xAC"} = "\xE1\xBB\xAD";
$cases{"\xE1\xBB\xAE"} = "\xE1\xBB\xAF";
$cases{"\xE1\xBB\xB0"} = "\xE1\xBB\xB1";
$cases{"\xE1\xBB\xB2"} = "\xE1\xBB\xB3";
$cases{"\xE1\xBB\xB4"} = "\xE1\xBB\xB5";
$cases{"\xE1\xBB\xB6"} = "\xE1\xBB\xB7";
$cases{"\xE1\xBB\xB8"} = "\xE1\xBB\xB9";
$cases{"\xE2\x92\xB6"} = "\xE2\x93\x90";
$cases{"\xE2\x92\xB7"} = "\xE2\x93\x91";
$cases{"\xE2\x92\xB8"} = "\xE2\x93\x92";
$cases{"\xE2\x92\xB9"} = "\xE2\x93\x93";
$cases{"\xE2\x92\xBA"} = "\xE2\x93\x94";
$cases{"\xE2\x92\xBB"} = "\xE2\x93\x95";
$cases{"\xE2\x92\xBC"} = "\xE2\x93\x96";
$cases{"\xE2\x92\xBD"} = "\xE2\x93\x97";
$cases{"\xE2\x92\xBE"} = "\xE2\x93\x98";
$cases{"\xE2\x92\xBF"} = "\xE2\x93\x99";
$cases{"\xE2\x93\x80"} = "\xE2\x93\x9A";
$cases{"\xE2\x93\x81"} = "\xE2\x93\x9B";
$cases{"\xE2\x93\x82"} = "\xE2\x93\x9C";
$cases{"\xE2\x93\x83"} = "\xE2\x93\x9D";
$cases{"\xE2\x93\x84"} = "\xE2\x93\x9E";
$cases{"\xE2\x93\x85"} = "\xE2\x93\x9F";
$cases{"\xE2\x93\x86"} = "\xE2\x93\xA0";
$cases{"\xE2\x93\x87"} = "\xE2\x93\xA1";
$cases{"\xE2\x93\x88"} = "\xE2\x93\xA2";
$cases{"\xE2\x93\x89"} = "\xE2\x93\xA3";
$cases{"\xE2\x93\x8A"} = "\xE2\x93\xA4";
$cases{"\xE2\x93\x8B"} = "\xE2\x93\xA5";
$cases{"\xE2\x93\x8C"} = "\xE2\x93\xA6";
$cases{"\xE2\x93\x8D"} = "\xE2\x93\xA7";
$cases{"\xE2\x93\x8E"} = "\xE2\x93\xA8";
$cases{"\xE2\x93\x8F"} = "\xE2\x93\xA9";
$cases{"\xE2\xB0\xAE"} = "\xE2\xB1\x9E";
$cases{"\xEF\xBC\xA1"} = "\xEF\xBD\x81";
$cases{"\xEF\xBC\xA2"} = "\xEF\xBD\x82";
$cases{"\xEF\xBC\xA3"} = "\xEF\xBD\x83";
$cases{"\xEF\xBC\xA4"} = "\xEF\xBD\x84";
$cases{"\xEF\xBC\xA5"} = "\xEF\xBD\x85";
$cases{"\xEF\xBC\xA6"} = "\xEF\xBD\x86";
$cases{"\xEF\xBC\xA7"} = "\xEF\xBD\x87";
$cases{"\xEF\xBC\xA8"} = "\xEF\xBD\x88";
$cases{"\xEF\xBC\xA9"} = "\xEF\xBD\x89";
$cases{"\xEF\xBC\xAA"} = "\xEF\xBD\x8A";
$cases{"\xEF\xBC\xAB"} = "\xEF\xBD\x8B";
$cases{"\xEF\xBC\xAC"} = "\xEF\xBD\x8C";
$cases{"\xEF\xBC\xAD"} = "\xEF\xBD\x8D";
$cases{"\xEF\xBC\xAE"} = "\xEF\xBD\x8E";
$cases{"\xEF\xBC\xAF"} = "\xEF\xBD\x8F";
$cases{"\xEF\xBC\xB0"} = "\xEF\xBD\x90";
$cases{"\xEF\xBC\xB1"} = "\xEF\xBD\x91";
$cases{"\xEF\xBC\xB2"} = "\xEF\xBD\x92";
$cases{"\xEF\xBC\xB3"} = "\xEF\xBD\x93";
$cases{"\xEF\xBC\xB4"} = "\xEF\xBD\x94";
$cases{"\xEF\xBC\xB5"} = "\xEF\xBD\x95";
$cases{"\xEF\xBC\xB6"} = "\xEF\xBD\x96";
$cases{"\xEF\xBC\xB7"} = "\xEF\xBD\x97";
$cases{"\xEF\xBC\xB8"} = "\xEF\xBD\x98";
$cases{"\xEF\xBC\xB9"} = "\xEF\xBD\x99";
$cases{"\xEF\xBC\xBA"} = "\xEF\xBD\x9A";



%ACCENTS = ( 'A', "(a|\xc4\x80|\xc3\x84|\xc3\x82|\xc4\x81|\xc3\xa0|\xc3\xa1|\xc3\xa2|\xc3\xa3|\xc3\xa4)",
'C', "(c|\xc4\x8d|\xc4\x8c|\xc3\x87|\xc3\xa7)",
'D', "(d|\xe1\xb8\x8c|\xe1\xb8\x8d)",
'E', "(e|\xc3\x89|\xc3\x88|\xc3\xa8|\xc3\xa9|\xc3\xaa|\xc3\xab)",
'G', "(g|\xc4\x9e|\xc4\x9f)",
'H', "(h|\xe1\xb8\xa4|\xe1\xb8\xa5)",
'I', "(i|\xc3\x8f|\xc4\xb1|\xc4\xb0|\xc4\xaa|\xc4\xab|\xc3\xac|\xc3\xad|\xc3\xae|\xc3\xaf)",
'K', "(k|\xe1\xb8\xb2|\xe1\xb8\xb3)",
'N', "(n|\xc3\xb1)",
'O', "(o|\xc3\x96|\xc3\xb2|\xc3\xb3|\xc3\xb4|\xc3\xb5|\xc3\xb6)",
'S', "(s|\xe1\xb9\xa2|\xe1\xb9\xa3|\xc8\x98|\xc5\x9f|\xc5\x9e|\xc5\xa1|\xc5\xa0)",
'T', "(t|\xe1\xb9\xad|\xe1\xb9\xac)",
'U', "(u|\xc3\x9c|\xc5\xaa|\xc5\xab|\xc3\xb9|\xc3\xba|\xc3\xbb|\xc3\xbc)",
'Z', "(z|\xe1\xba\x92|\xe1\xba\x93|\xc5\xbd|\xc5\xbe)",
"'", "('|\xca\xbf|\xca\xbe)",
'Y', "(y|\xc3\xbd|\xc3\xbf)",
);

$DOTPATTERN = "([a-zA-Z0-9]|[\xa0-\xc3][\xa0-\xc3])";

while (<>)
{

chomp;


$foo = `echo "got it: $_\n" > /Volumes/data/var/lib/philologic/databases/mamluk/crapout`;

# Here we substitute in both cases of a Unicode character defined above
# so that we can do case-folded Unicode searches.

while (($a, $b) = each(%cases)) {
s/$a|$b/\($a|$b\)/g;
}

$foo = `echo "after: $_\n" >> /Volumes/data/var/lib/philologic/databases/mamluk/crapout`;

$prefix = /^\256\?/ ? "\256\?" : /^\256/ ? "\256" : "";
$prefix = "";
s/^\256\?*//;

s/(\([^\)]*)(\|)([^\)]*\))/$1#PIPE#$3/g;

@patterns = split ('\|', $_);

foreach $pattern (@patterns)
{
# $pattern =~ s/^(.*)$/^$prefix$1\$/;
$pattern =~ s/^(.*)$/^$prefix$1\t/;
}

$_ = join ("|", @patterns);
s/#PIPE#/\|/g;

s/[ACDEGHIKNOSTUZ'Y]/$ACCENTS{$&}/ge;

$foo = `echo '$_' >> /Volumes/data/var/lib/philologic/databases/mamluk/crapout`;

s/(\.)([^\*])/$DOTPATTERN$2/g;
tr/A-Z/a-z/;

system ("/usr/bin/grep -E -i -e \"$_\" < " . $ENV{SYSTEM_DIR} . "words.R.wom" . " | /sw/bin/gawk -F\"\t\" '{print \$2}' ");
}
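
The least obvious part of the script is the #PIPE# dance. The case-folding step introduces "|" inside parentheses, but the script later splits the query on "|" to anchor each user-level alternative separately. The sketch below isolates just that step. The function name anchor_alternatives is hypothetical, and the input is assumed to be a query that has already been through the case-folding loop.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of crapser: anchor each top-level
# alternative of an already case-folded query.
sub anchor_alternatives {
    my ($query) = @_;

    # Hide the pipe inside each (x|y) group so split leaves it alone.
    $query =~ s/(\([^)]*)\|([^)]*\))/${1}#PIPE#${2}/g;

    # Split on the user's own top-level pipes.
    my @patterns = split /\|/, $query;

    # Anchor each alternative: start of line, then the tab that ends field 1.
    $_ = "^$_\t" for @patterns;

    # Put the alternatives back together and restore the hidden pipes.
    my $joined = join '|', @patterns;
    $joined =~ s/#PIPE#/|/g;
    return $joined;
}

my $pattern = anchor_alternatives("(\xC3\x89|\xC3\xA9)t(\xC3\x89|\xC3\xA9)|foo");
print "$pattern\n";
```

As in the original, the protecting regular expression only hides one pipe per parenthesised group, which is enough for the two-way groups the case-folding loop produces.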