Problem: we want a search for either Al-Qabah or Al Qabah to find both forms, Al-Qabah and Al Qabah.
Solution: The default $CHARSINWORD in textload.cfg will work just fine -- it treats - as a word-breaking character. Run the load with that. That means you can search for Al Qabah and find Al-Qabah. Cool. But now if you search for Al-Qabah, you won't find anything. Damn.
So add this to clean_word_pattern in philosubs.pl:
$word =~ s/\-/\+/g;
This replaces each hyphen with +, the URL-encoded form of a space, so a search for Al-Qabah is run as a search for Al Qabah. The downside is that you can no longer distinguish between hyphenated and non-hyphenated terms, but that is the tradeoff.
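To see the effect of that substitution on its own, here is a minimal sketch. The helper name normalize_hyphens is hypothetical and is not the real clean_word_pattern from philosubs.pl; it only isolates the one line above.

```perl
use strict;
use warnings;

# Hypothetical helper, not the actual clean_word_pattern in philosubs.pl:
# it only demonstrates what the hyphen substitution does to a query term.
sub normalize_hyphens {
    my ($word) = @_;
    $word =~ s/-/+/g;    # "+" is the URL-encoded form of a space
    return $word;
}

print normalize_hyphens('Al-Qabah'), "\n";    # prints Al+Qabah
print normalize_hyphens('Al Qabah'), "\n";    # no hyphen, so the term is unchanged
```

The backslashes in the original s/\-/\+/g are harmless but not needed; both forms do the same thing.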
Problem: Some Unicode characters should be treated as word-breaking, non-indexed characters, such as «, … and ›, among others. In textload.cfg, add these to the @UnicodeWordBreakers list.
@UnicodeWordBreakers = ('\xe2\x80\x93', # U+2013 – EN DASH
Some of these are pesky MS quotes. I'm sure there will be others. The textloader already handles a bunch of SGML character entities. I suspect this list will be expanded.
--> A few more to add
'\xce\x87', # U+0387 GREEK ANO TELEIA
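When adding entries, it is easy to pair the wrong bytes with a code point. The following sketch prints the UTF-8 byte escapes for a given code point, in the same form as the list entries. The helper name utf8_escape is hypothetical; utf8::encode is a Perl core function.

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 bytes of a code point
# written as a run of \x escapes, e.g. \xe2\x80\x93 for U+2013.
sub utf8_escape {
    my ($codepoint) = @_;
    my $bytes = chr($codepoint);
    utf8::encode($bytes);    # in place: characters become UTF-8 octets
    return join '', map { sprintf('\\x%02x', ord($_)) } split //, $bytes;
}

# En dash, «, …, › and the Greek ano teleia.
for my $cp (0x2013, 0x00AB, 0x2026, 0x203A, 0x0387) {
    printf "U+%04X => '%s'\n", $cp, utf8_escape($cp);
}
```

For example, this gives '\xe2\x80\x93' for U+2013 and '\xce\x87' for U+0387, while U+00B7 MIDDLE DOT comes out as '\xc2\xb7'.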
Note the great dynamic chart with Perl literals:
We've hacked a custom warning into philoload that alerts you if your $CHARSINWORD variable is not set to the standard value, so you don't end up doing a bunch of loads with the wrong setting. This may be included in a future release. See CharsinwordWarning.
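The idea behind such a check can be sketched as below. This is not the code actually added to philoload: the function name charsinword_warning is hypothetical, and the sample patterns are placeholders rather than the real textload.cfg default.

```perl
use strict;
use warnings;

# Hypothetical check, not the code that was added to philoload.
# The caller passes in both the current and the standard pattern.
sub charsinword_warning {
    my ($current, $standard) = @_;
    return undef if defined $current && $current eq $standard;
    my $shown = defined $current ? $current : '(undefined)';
    return "WARNING: \$CHARSINWORD is '$shown', expected '$standard'\n";
}

# Placeholder patterns for illustration only, not the real default.
my $message = charsinword_warning('[a-z-]+', '[a-z]+');
warn $message if defined $message;
```

Comparing the setting as a plain string keeps the check simple, at the cost of flagging patterns that are equivalent but written differently.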