CharsInWord
Hyphens
Problem: we want a search for either Al-Qabah or Al Qabah to find both forms, Al-Qabah and Al Qabah.
Solution: The default $CHARSINWORD in textload.cfg will work just fine -- it treats - as a word-breaking character. Run the load with that. That means you can search for Al Qabah and find Al-Qabah. Cool. But now if you search for Al-Qabah, you won't find anything. Damn.
So add this to clean_word_pattern in philosubs.pl:
$word =~ s/\-/\+/g;
This replaces every hyphen in the search term with +, the URL-encoded form of a space. The downside is that you can no longer distinguish between hyphenated and non-hyphenated terms, but that is the tradeoff.
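To make the effect of that substitution concrete, here is a minimal sketch. The subroutine name replace_hyphens is hypothetical and stands in for just this one step; the real clean_word_pattern in philosubs.pl does more than this.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the hyphen step of clean_word_pattern.
sub replace_hyphens {
    my ($word) = @_;
    $word =~ s/\-/\+/g;    # same substitution as above: every "-" becomes "+"
    return $word;
}

# Both spellings of the query end up as the same pattern.
for my $query ('Al-Qabah', 'Al+Qabah') {
    print "$query => ", replace_hyphens($query), "\n";
}
```

Both queries come out as Al+Qabah, which is why the hyphenated and unhyphenated searches return the same hits after the change.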
Some more stuff
More stuff...
Word Breaking (non-indexing) Unicode Punctuation
Problem: There are Unicode characters that should be treated as word-breaking, non-indexed characters, such as « or … or › among others.
Solution: In textload.cfg, add these to the list @UnicodeWordBreakers.
@UnicodeWordBreakers = ('\xe2\x80\x93', # U+2013 – EN DASH
                        '\xe2\x80\x94', # U+2014 — EM DASH
                        '\xe2\x80\x98', # U+2018 ‘ LEFT SINGLE QUOTATION MARK
                        '\xe2\x80\x99', # U+2019 ’ RIGHT SINGLE QUOTATION MARK
                        '\xe2\x80\x9c', # U+201C “ LEFT DOUBLE QUOTATION MARK
                        '\xe2\x80\x9d', # U+201D ” RIGHT DOUBLE QUOTATION MARK
                        '\xe2\x80\xb9', # U+2039 ‹ SINGLE LEFT-POINTING
                                        # ANGLE QUOTATION MARK
                        '\xe2\x80\xba', # U+203A › SINGLE RIGHT-POINTING
                                        # ANGLE QUOTATION MARK
                        '\xc2\xab',     # U+00AB « LEFT-POINTING DOUBLE
                                        # ANGLE QUOTATION MARK
                        '\xc2\xbb',     # U+00BB » RIGHT-POINTING DOUBLE
                                        # ANGLE QUOTATION MARK
                        '\xe2\x80\xa6'  # U+2026 … HORIZONTAL ELLIPSIS
                       );
Some of these are pesky MS quotes. I'm sure there will be others. The textloader already handles a bunch of SGML character entities. I suspect this list will be expanded.
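As a rough illustration of how a list in this form can act as a set of word breakers, the sketch below joins the entries into one regex and splits a line on them. How the textloader actually consumes @UnicodeWordBreakers is not shown here, so treat split_words and the splitting logic as assumptions, not as the loader's code.

```perl
use strict;
use warnings;

# Shortened copy of the list. The single-quoted entries are regex
# escapes; they only become bytes once compiled into a pattern.
my @UnicodeWordBreakers = (
    '\xe2\x80\x94',    # U+2014 EM DASH
    '\xc2\xab',        # U+00AB left double angle quote
    '\xc2\xbb',        # U+00BB right double angle quote
    '\xe2\x80\xa6',    # U+2026 HORIZONTAL ELLIPSIS
);

# Join the entries into one alternation and compile it once.
my $breakers = join '|', @UnicodeWordBreakers;
my $break_re = qr/(?:$breakers)/;

# Hypothetical helper: turn each breaker into a space, then split.
# Expects UTF-8 encoded bytes, not decoded characters.
sub split_words {
    my ($bytes) = @_;
    $bytes =~ s/$break_re/ /g;
    return grep { length } split /\s+/, $bytes;
}

# The text «Bonjour»…monde written out as UTF-8 bytes.
my $line = "\xc2\xabBonjour\xc2\xbb\xe2\x80\xa6monde";
print join('|', split_words($line)), "\n";    # prints "Bonjour|monde"
```

Matching on bytes only works if the text has not been decoded into Perl character strings; on decoded text the same characters would have to be written as code points instead.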
--> A few more to add
'\xce\x87', # U+0387 Greek ano teleia
'\xe2\x80\xa0', # U+2020 dagger
'\xcd\xbe', # U+037E Greek question mark
Note the great dynamic chart with Perl literals:
http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=string-literal&htmlent=1
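If the chart is not at hand, the same literals can be worked out locally from the code point. This is a small sketch using the core Encode module; utf8_literal is a hypothetical helper name, not part of the loader.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the '\x..' literal for one code point.
sub utf8_literal {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr($codepoint));    # character -> UTF-8 bytes
    return join '', map { sprintf '\x%02x', ord($_) } split //, $bytes;
}

# Greek ano teleia, dagger, Greek question mark.
for my $cp (0x0387, 0x2020, 0x037E) {
    printf "U+%04X => '%s'\n", $cp, utf8_literal($cp);
}
```

Checking an entry this way before adding it to the list is a cheap guard against a byte sequence and its code point comment drifting apart.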
WARNING for non-standard $charsinword setting
We've hacked a custom warning into philoload that alerts you if your $charsinword variable is not set to the standard value, so you don't end up doing a bunch of loads with the wrong setting. This may be included in a future release. See CharsinwordWarning.