Solr: Text Analyzers
A text analyzer performs tasks such as:
- case normalization
- query expansion using synonyms
Components of a text analyzer
Character Filters (<charFilter>)
Process a stream of text prior to tokenization
- MappingCharFilterFactory (maps characters to replacements, for example Unicode to ASCII)
- HTMLStripCharFilterFactory (extracts text from html docs)
- PatternReplaceCharFilterFactory (regex pattern)
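As a rough sketch, the three character filters above could be declared at the start of an analyzer as follows. The field type name text_html, the mapping file name, and the regex are assumptions chosen for illustration, not values taken from these notes.

```xml
<!-- Hypothetical field type; the name "text_html" is illustrative -->
<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <!-- Strip HTML markup before tokenization -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Replace characters using a mapping file (file name is an assumption) -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <!-- Illustrative regex: collapse runs of hyphens into a single hyphen -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-{2,}" replacement="-"/>
    <!-- Character filters run first; the tokenizer receives their output -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Character filters run in the order they are listed, so the HTML stripping here happens before the mapping and the regex replacement.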
Tokenizers (<tokenizer>)
Take text in the form of a character stream and split it into tokens, most of the time skipping insignificant bits like whitespace and joining punctuation.
- UAX29URLEmailTokenizer (This behaves like StandardTokenizer with the additional property of recognizing e-mail addresses and URLs as single tokens)
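A minimal sketch of selecting this tokenizer in a field type might look like the following. The field type name text_contact is hypothetical, and the lowercase filter is added only to make the analyzer a little more realistic.

```xml
<!-- Hypothetical field type; the name "text_contact" is illustrative -->
<fieldType name="text_contact" class="solr.TextField">
  <analyzer>
    <!-- Each analyzer has exactly one tokenizer -->
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <!-- Optional: normalize case after tokenization -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this tokenizer, an e-mail address or URL in the input would be expected to survive as one token instead of being split at its punctuation.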
Find more details at https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
Token Filters (<filter>)
Consume one stream of tokens, known as a TokenStream, and generate another. Hence, they can be chained one after another indefinitely. A token filter may perform complex analysis by processing multiple tokens in the stream at once, but in most cases it processes each token sequentially and decides whether to keep, replace, or drop it.
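The chaining described above can be sketched as a list of filter elements after the tokenizer. The field type name text_chain and the stop word file name are assumptions for illustration.

```xml
<!-- Hypothetical field type; the name "text_chain" is illustrative -->
<fieldType name="text_chain" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Replaces each token with its lowercase form -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Drops tokens listed in the stop word file (file name is an assumption) -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Filters are applied in the order they appear, each one consuming the token stream produced by the element before it.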
Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form. For example, a stemming algorithm might reduce "running" and "runs" to just "run". If you want to improve the precision of search results but retain the recall benefits, consider indexing the data in two fields, one stemmed and the other not stemmed. Stemming is language specific.
Stemmers in English
- KStemFilterFactory (less aggressive than PorterStemmer)
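The two-field approach mentioned above could be sketched roughly as follows. All field and field type names (title, title_stemmed, text_exact, text_stemmed) are hypothetical.

```xml
<!-- Unstemmed variant: favors precision -->
<fieldType name="text_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Stemmed variant: favors recall -->
<fieldType name="text_stemmed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Field names are illustrative -->
<field name="title" type="text_exact" indexed="true" stored="true"/>
<field name="title_stemmed" type="text_stemmed" indexed="true" stored="false"/>

<!-- Index the same source text into the stemmed field as well -->
<copyField source="title" dest="title_stemmed"/>
```

A query can then search both fields and, if desired, weight matches on the unstemmed field more heavily.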
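Synonyms
Synonym expansion is generally applied either at query time or at index time, but not both.
Example filter definition:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
Example entries in synonyms.txt:
- i-pod, i pod => ipod (explicit mapping: terms on the left are replaced by the term on the right)
- ipod, i-pod, i pod (equivalent terms: with expand="true", each term expands to all of them)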
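One way to apply synonyms on only one side is to declare separate index and query analyzers. The sketch below places the filter at query time; the field type name text_syn is hypothetical, and moving the filter into the index analyzer instead would be the other option.

```xml
<!-- Hypothetical field type; the name "text_syn" is illustrative -->
<fieldType name="text_syn" class="solr.TextField">
  <!-- Index time: no synonym filter -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query time: synonyms are applied here only -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Which side suits a given synonym list is a judgment call; this sketch only shows how the two analyzers are separated.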
A free thesaurus is WordNet (http://wordnet.princeton.edu/).