
Charsinword


Hyphens

Problem: we want to be able to search for Al-Qabah or Al Qabah and find Al-Qabah OR Al Qabah with either one.

Solution: The default $CHARSINWORD in textload.cfg will work just fine -- it treats - as a word-breaking character. Run the load with that. That means you can search for Al Qabah and find Al-Qabah. Cool. But now if you search for Al-Qabah, you won't find anything. Damn.

So add this to clean_word_pattern in philosubs.pl:

    $word =~ s/\-/\+/g;

This replaces every hyphen in the search term with +, the URL-encoded form of a space, so a query for Al-Qabah is run as Al Qabah. The downside is that you can no longer distinguish between hyphenated and non-hyphenated terms, but that is the tradeoff.
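To see what the substitution does to a query term, here is a minimal standalone sketch. The sub name normalize_hyphens is hypothetical and stands in for the one line added to clean_word_pattern; the real routine in philosubs.pl does more than this.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the single line added to clean_word_pattern.
sub normalize_hyphens {
    my ($word) = @_;
    $word =~ s/\-/\+/g;    # every hyphen becomes +, i.e. a URL-encoded space
    return $word;
}

# Both spellings of the query end up as the same search term.
for my $query ('Al-Qabah', 'Al+Qabah') {
    print "$query => ", normalize_hyphens($query), "\n";
}
```

Both forms of the query normalize to Al+Qabah, which is why either one finds the same hits once the hyphen has been treated as a word breaker at load time.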


Some more stuff


More stuff...


Word Breaking (non-indexing) Unicode Punctuation

Problem: There are Unicode characters that should be treated as word breaking, non-indexed characters, such as « or … or › among others. In textload.cfg, add these to the list @UnicodeWordBreakers.

@UnicodeWordBreakers = ('\xe2\x80\x93', # U+2013 – EN DASH
                        '\xe2\x80\x94', # U+2014 — EM DASH
                        '\xe2\x80\x98', # U+2018 ‘ LEFT SINGLE QUOTATION MARK
                        '\xe2\x80\x99', # U+2019 ’ RIGHT SINGLE QUOTATION MARK
                        '\xe2\x80\x9c', # U+201C “ LEFT DOUBLE QUOTATION MARK
                        '\xe2\x80\x9d', # U+201D ” RIGHT DOUBLE QUOTATION MARK
                        '\xe2\x80\xb9', # U+2039 ‹ SINGLE LEFT-POINTING ANGLE QUOTATION MARK
                        '\xe2\x80\xba', # U+203A › SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
                        '\xc2\xab',     # U+00AB « LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
                        '\xc2\xbb',     # U+00BB » RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
                        '\xe2\x80\xa6'  # U+2026 … HORIZONTAL ELLIPSIS
                       );

Some of these are pesky MS quotes. I'm sure there will be others. The textloader already handles a bunch of SGML character entities. I suspect this list will be expanded.
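The following is a rough sketch of how a list like this can act as a set of word breakers. It is not the textloader's own code: split_words is a hypothetical helper, and the assumption is that the text is handled as raw UTF-8 bytes. Note that the entries are single-quoted, so each one is the literal text \xe2\x80\x93 and so on; the escapes are only expanded when the string is interpolated into a regular expression.

```perl
use strict;
use warnings;

# A subset of the list above, single-quoted exactly as in textload.cfg.
my @UnicodeWordBreakers = ('\xe2\x80\x93', '\xe2\x80\x94',
                           '\xc2\xab', '\xc2\xbb', '\xe2\x80\xa6');

# Join the entries into one alternation; the regex engine expands \xNN.
my $breakers = join '|', @UnicodeWordBreakers;

# Hypothetical helper: replace each breaker with a space, then split.
sub split_words {
    my ($bytes) = @_;    # raw UTF-8 bytes, not decoded characters
    $bytes =~ s/(?:$breakers)/ /g;
    return grep { length } split /\s+/, $bytes;
}

# Guillemets and an ellipsis around two words, written as UTF-8 bytes.
my $sample = "\xc2\xab" . "bonjour" . "\xc2\xbb" . "\xe2\x80\xa6" . "monde";
print join('|', split_words($sample)), "\n";
```

With the sample above the punctuation is dropped and the two words come out separately, which is the behaviour you want from a word-breaking, non-indexed character.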

--> A few more to add

'\xce\x87', # U+0387 Greek ano teleia
'\xe2\x80\xa0', # U+2020 dagger
'\xcd\xbe', # U+037E Greek question mark

This dynamic chart lists the UTF-8 bytes of each character as Perl string literals:

http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=string-literal&htmlent=1
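If you would rather generate an entry than look it up, a small helper can do it with the core Encode module. This is only a sketch; utf8_literal is a hypothetical name, not part of PhiloLogic. It is also a handy way to double-check that a code point in a comment matches the bytes next to it.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a code point into the \xNN form used in
# @UnicodeWordBreakers.
sub utf8_literal {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr($codepoint));
    return join '', map { sprintf('\\x%02x', ord($_)) } split //, $bytes;
}

# Print ready-to-paste entries for a few code points.
printf "'%s', # U+%04X\n", utf8_literal($_), $_
    for (0x0387, 0x00B7, 0x2020, 0x037E);
```

For example, U+0387 (Greek ano teleia) comes out as \xce\x87, while U+00B7 (middle dot) comes out as \xc2\xb7, so the two need separate entries if both occur in your texts.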


WARNING for non-standard $charsinword setting

We've hacked a custom warning into philoload that alerts you if your $charsinword variable is not set to the standard value, so you don't end up doing a bunch of loads with the wrong setting. This may be included in a future release. See CharsinwordWarning.
