Hyphens
Problem: we want a search for either Al-Qabah or Al Qabah to find both Al-Qabah and Al Qabah.
Solution: The default $CHARSINWORD in textload.cfg works just fine, since it treats - as a word-breaking character. Run the load with that. Now you can search for Al Qabah and find Al-Qabah. Cool. But if you search for Al-Qabah, you won't find anything. Damn. So add this to clean_word_pattern in philosubs.pl:
$word =~ s/\-/\+/g;
This replaces each hyphen in the search term with a + (a URL-encoded space), so Al-Qabah is searched as Al Qabah. The downside is that you can no longer distinguish between hyphenated and non-hyphenated terms, but that is the tradeoff.
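As a rough illustration of the substitution above, here is a minimal standalone sketch. The sub name normalize_hyphens is hypothetical; the real clean_word_pattern in philosubs.pl does more than this, and only the s/// line comes from these notes.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the hyphen handling added to clean_word_pattern.
# It only demonstrates the single substitution described in the notes.
sub normalize_hyphens {
    my ($word) = @_;
    $word =~ s/\-/\+/g;    # '+' is a URL-encoded space
    return $word;
}

print normalize_hyphens('Al-Qabah'), "\n";    # prints Al+Qabah
print normalize_hyphens('Al+Qabah'), "\n";    # already-split form is unchanged
```

Both spellings end up as the same query string, which is why the two searches return the same hits.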
Word Breaking (non-indexing) Unicode Punctuation
Problem: There are Unicode characters that should be treated as word-breaking, non-indexed characters, such as « or … or ›, among others.
Solution: In textload.cfg, add these to the list @UnicodeWordBreakers, written as UTF-8 byte sequences:
@UnicodeWordBreakers = ('\xe2\x80\x93', # U+2013 – EN DASH
Some of these are pesky MS quotes. I'm sure there will be others. The textloader already handles a bunch of SGML character entities, and I suspect this list will be expanded.
A few more to add:
'\xce\x87', # U+0387 GREEK ANO TELEIA
Note the great dynamic chart with Perl literals: http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=string-literal&htmlent=1
WARNING for non-standard $charsinword setting
We've hacked a custom warning into philoload that will alert you if your $charsinword variable is not set to the standard, so you don't end up doing a bunch of loads with the wrong setting. This may be included in a future release. See CharsinwordWarning.
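For new @UnicodeWordBreakers entries, the byte literals can also be generated locally rather than looked up. The following is a minimal sketch, assuming the core Encode module is available; the helper name utf8_literal is hypothetical and not part of PhiloLogic.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 bytes of a code point as a
# '\x..' escape string, in the style used for @UnicodeWordBreakers entries.
sub utf8_literal {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr($codepoint));
    return join '', map { sprintf('\\x%02x', ord($_)) } split //, $bytes;
}

# EN DASH, «, …, › and GREEK ANO TELEIA
for my $cp (0x2013, 0x00AB, 0x2026, 0x203A, 0x0387) {
    printf "'%s', # U+%04X\n", utf8_literal($cp), $cp;
}
```

Each printed line can be pasted into the list, with the character name added to the comment by hand.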