Charsinword

Hyphens

Problem: we want a search for either Al-Qabah or Al Qabah to find both Al-Qabah and Al Qabah.

Solution: The default $CHARSINWORD in textload.cfg will work just fine -- it treats - as a word-breaking character. Run the load with that. That means you can search for Al Qabah and find Al-Qabah. Cool. But now if you search for Al-Qabah, you won't find anything. Damn.

So add this to clean_word_pattern in philosubs.pl:

$word =~ s/\-/\+/g;

This replaces each hyphen in the search term with + (a URL-encoded space). The downside is that you can no longer distinguish between hyphenated and non-hyphenated terms, but that is the tradeoff.
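To see the effect in isolation, here is a minimal sketch of that substitution applied to a search term. The sub name normalize_hyphens is made up for illustration; it is not a function in philosubs.pl, and the real clean_word_pattern presumably does more than this one line.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the single line added to clean_word_pattern.
sub normalize_hyphens {
    my ($word) = @_;
    $word =~ s/\-/\+/g;    # every hyphen becomes "+", i.e. a URL-encoded space
    return $word;
}

print normalize_hyphens('Al-Qabah'), "\n";    # prints Al+Qabah
print normalize_hyphens('Al Qabah'), "\n";    # prints Al Qabah (unchanged)
```

Both spellings of the query end up as two separate words, which is what the index contains when - is a word-breaking character.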

Some more stuff

More stuff...

Word Breaking (non-indexing) Unicode Punctuation

Problem: There are Unicode characters that should be treated as word-breaking, non-indexed characters, such as « or … or › among others.

Solution: In textload.cfg, add these to the list @UnicodeWordBreakers.

@UnicodeWordBreakers = ('\xe2\x80\x93', # U+2013 – EN DASH
                        '\xe2\x80\x94', # U+2014 — EM DASH
                        '\xe2\x80\x98', # U+2018 ‘ LEFT SINGLE QUOTATION MARK
                        '\xe2\x80\x99', # U+2019 ’ RIGHT SINGLE QUOTATION MARK
                        '\xe2\x80\x9c', # U+201C “ LEFT DOUBLE QUOTATION MARK
                        '\xe2\x80\x9d', # U+201D ” RIGHT DOUBLE QUOTATION MARK
                        '\xe2\x80\xb9', # U+2039 ‹ SINGLE LEFT-POINTING ANGLE QUOTATION MARK
                        '\xe2\x80\xba', # U+203A › SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
                        '\xc2\xab',     # U+00AB « LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
                        '\xc2\xbb',     # U+00BB » RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
                        '\xe2\x80\xa6'  # U+2026 … HORIZONTAL ELLIPSIS
                       );

Some of these are pesky MS quotes. The textloader already handles a bunch of SGML character entities. I'm sure there will be other characters like these, so I suspect this list will be expanded.
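How the loader consumes @UnicodeWordBreakers is not shown here, so the following is only a rough sketch of the idea, not the textloader's actual code. It assumes the entries are joined into one regular expression and matched against raw UTF-8 bytes; the sub name break_words and the shortened list are made up for the example.

```perl
use strict;
use warnings;

# Shortened copy of the list, for illustration only.
my @UnicodeWordBreakers = ('\xe2\x80\x93',    # U+2013 EN DASH
                           '\xc2\xab',        # U+00AB
                           '\xc2\xbb',        # U+00BB
                           '\xe2\x80\xa6');   # U+2026 HORIZONTAL ELLIPSIS

# The entries are single-quoted, so they hold a literal backslash-x sequence.
# Interpolating them into a regex is what turns each one into a byte match.
my $breakers = join '|', @UnicodeWordBreakers;
my $break_re = qr/(?:$breakers)/;

# Hypothetical helper: replace every breaker with a space, then split on whitespace.
sub break_words {
    my ($bytes) = @_;
    (my $clean = $bytes) =~ s/$break_re/ /g;
    return grep { length } split /\s+/, $clean;
}

# Input is a UTF-8 byte string written with double-quoted escapes.
my @words = break_words("\xc2\xabbonjour\xc2\xbb le monde\xe2\x80\xa6");
print join('|', @words), "\n";    # prints bonjour|le|monde
```

The point of the sketch is that the breakers separate words but never become part of an indexed word themselves.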

--> A few more to add

'\xce\x87',     # U+0387 · GREEK ANO TELEIA
'\xe2\x80\xa0', # U+2020 † DAGGER
'\xcd\xbe',     # U+037E ; GREEK QUESTION MARK

Note the great dynamic chart with perl literals:

http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=string-literal&htmlent=1
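If you would rather generate a literal than look it up, something like the following should produce the same byte escapes from a code point. This is a small sketch using the core Encode module; the sub name utf8_literal is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a Unicode code point into a '\x..' byte-escape
# string in the same style as the entries in @UnicodeWordBreakers.
sub utf8_literal {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr($codepoint));
    return join '', map { sprintf('\\x%02x', ord($_)) } split //, $bytes;
}

printf "U+%04X => '%s'\n", $_, utf8_literal($_) for (0x00AB, 0x2020, 0x2026);
# U+00AB => '\xc2\xab'
# U+2020 => '\xe2\x80\xa0'
# U+2026 => '\xe2\x80\xa6'
```

It is also a quick way to double-check that the code point in a comment matches the bytes next to it.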

WARNING for non-standard $charsinword setting

We've hacked a custom warning into philoload that will alert you if your $charsinword variable is not set to the standard value, so you don't end up doing a bunch of loads with the wrong setting. This may be included in a future release. See CharsinwordWarning.