Find it here: /your/install/path/philologic/textload.cfg. By default, this is probably /var/lib/philologic/textload.cfg.


textload.cfg sets various parameters for the text loading done by philoload, such as which characters break words.


Define Word Pattern

By defining the regular expression used to match words, you control which characters break words during loading, so that the pieces are indexed as separate words. The right choice depends on the conventions of the language you are working with: in French an apostrophe probably breaks words, but in English it does not.

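Define the variable $CHARSINWORD as a regular expression that matches everything you want indexed as part of a word. In the examples further down it consists of two character classes: the first lists the characters that may begin a word, the second the characters that may continue it. Any character outside those classes breaks the word, just as a space does.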
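
To see what such a pattern does, here is a minimal sketch in Python (not PhiloLogic's own Perl loader, which is not reproduced here). It transliterates the "Normal Characters" pattern listed further down this page and assumes single-byte, Latin-1 style text, so the octal range \177-\377 is written \x7f-\xff. The second pattern, with the apostrophe removed, is a hypothetical variant made up for illustration, as are the names split_words, WITH_APOSTROPHE and WITHOUT_APOSTROPHE.

```python
import re

# Transliteration of the "Normal Characters" pattern into Python syntax.
# Assumption: single-byte (Latin-1 style) text, so \177-\377 becomes \x7f-\xff.
WITH_APOSTROPHE = re.compile(r"[&A-Za-z0-9\x7f-\xff][&A-Za-z0-9\x7f-\xff_';]*")

# Hypothetical variant: the apostrophe is dropped from the second class,
# so it is no longer a word character and breaks the word.
WITHOUT_APOSTROPHE = re.compile(r"[&A-Za-z0-9\x7f-\xff][&A-Za-z0-9\x7f-\xff_;]*")


def split_words(pattern, text):
    """Return every match of the word pattern, scanning left to right."""
    return pattern.findall(text)


english = split_words(WITH_APOSTROPHE, "don't stop")
french = split_words(WITHOUT_APOSTROPHE, "l'homme qui rit")

print(english)  # ["don't", 'stop']
print(french)   # ['l', 'homme', 'qui', 'rit']
```

With the apostrophe in the second class, "don't" stays one word; without it, "l'homme" is split into "l" and "homme". How the real loader treats the pieces afterwards is outside the scope of this sketch.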

Here are some example word patterns from one of ARTFL's systems:

# ------------------------ Define Word Pattern ----------------------
# What word pattern do you want to use? This is important.
# We will want to add optional characters like {[]} for MSS
# notation and then set a function to delete these for the index
# in order to search across them. [Note, leave "_" in the
# second pattern to handle tags in words, etc., see below]
# Normal Characters.....
# $CHARSINWORD = "[\&A-Za-z0-9\177-\377][\&A-Za-z0-9\177-\377\_\';]*";

# Use this if you want hyphens to be included as word characters
# (Note the hyphen, double escaped)...
# This is used for some DSAL dicos (but NOT Mamluk as we thought earlier...
# Mamluk can use the normal one because we change hyphens to spaces in
# clean_word_pattern in philosubs.)
#$CHARSINWORD = "[\&A-Za-z0-9\177-\377][\&A-Za-z0-9\177-\377\\-\_\';]*";

# EEBO Specific:
#$CHARSINWORD = "[\&A-Za-z0-9\177-\377][\&A-Za-z0-9\^\|\~\+\177-\377\_\';]*";

# OVI Specific:
$CHARSINWORD = "[\&A-Za-z0-9\177-\377][\&A-Za-z0-9\^\|\~\+\177-\377\#\_\';\\[\\]\\(\\)]*";
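
The effect of these variants can be compared with a small Python sketch. It is an illustration only, not part of PhiloLogic: the patterns are transliterated by hand, single-byte (Latin-1 style) text is assumed, and the sample strings and the names NORMAL, HYPHEN, OVI and split_words are made up. In the Perl originals the hyphen is written \\- because the pattern sits in a double-quoted string, where \\ collapses to a single backslash before the regular expression is compiled; in a Python raw string a single \- is enough.

```python
import re

# Characters allowed to start a word (the first class in every pattern above).
# Assumption: single-byte (Latin-1 style) text, so \177-\377 is written \x7f-\xff.
FIRST = r"[&A-Za-z0-9\x7f-\xff]"

# Hand transliterations of three of the patterns in the excerpt.
NORMAL = re.compile(FIRST + r"[&A-Za-z0-9\x7f-\xff_';]*")
HYPHEN = re.compile(FIRST + r"[&A-Za-z0-9\x7f-\xff\-_';]*")
OVI = re.compile(FIRST + r"[&A-Za-z0-9^|~+\x7f-\xff#_';\[\]()]*")


def split_words(pattern, text):
    """Return every match of the word pattern, scanning left to right."""
    return pattern.findall(text)


normal_words = split_words(NORMAL, "a well-known text")
hyphen_words = split_words(HYPHEN, "a well-known text")
leading_hyphen = split_words(HYPHEN, "-known")
ovi_words = split_words(OVI, "parol[a] detto")

print(normal_words)    # ['a', 'well', 'known', 'text']
print(hyphen_words)    # ['a', 'well-known', 'text']
print(leading_hyphen)  # ['known']
print(ovi_words)       # ['parol[a]', 'detto']
```

Note that a hyphen (or bracket) is only accepted inside a word: the first class does not contain it, so a leading hyphen is skipped rather than indexed.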

Also see:

  • EquivalencySearching - broad discussion of the differences between the underlying text and search strings, and how to make searches find what you want.
  • CharsInWord - tells you how to get hyphens to act as you want them to. The name of this page should probably be changed.