TF IDF
measure of word importance
The topic of a document can be characterized by its words. But frequent words are not necessarily important, e.g. "the", "and", and other stop words.
Rare words are not necessarily important either, e.g. "albeit".
Suppose we have N documents in a corpus
Define t as a term (word) and d as a document (a bag of words).
TF term frequency
The term frequency of word t in document d is the count of t in d, normalized by the document length (the total number of words in d):
tf(t, d) = (count of t in d) / (total number of words in d)
Note there are other ways to normalize tf such as log, binary, max, etc.
The purpose is to reduce the impact of document length; otherwise a longer document would obviously have a higher raw count of t.
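A minimal Python sketch of this tf definition, assuming each document is already tokenized into a list of lowercase words (the tokenization step is not part of these notes):

from collections import Counter

def tf(term, doc):
    # term frequency: count of `term` in `doc`, normalized by document length
    counts = Counter(doc)
    return counts[term] / len(doc)

# e.g. tf("cat", ["the", "cat", "sat", "on", "the", "mat"]) == 1/6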
DF document frequency
The document frequency of t is the number of documents that contain the word t.
So if t appears in k of the N documents, then df = k. We should normalize it by the corpus size as well, i.e. df / N.
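A sketch of the (unnormalized) document frequency under the same assumption, where docs is a list of tokenized documents:

def df(term, docs):
    # document frequency: number of documents that contain `term`
    return sum(1 for doc in docs if term in doc)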
IDF inverse document frequency
The idea is that if a word appears in all documents, it is less useful for telling documents apart.
A rare word helps classification more than a common word does, so we take the inverse of the normalized document frequency: 1 / (df/N) = N/df.
For a common word, N/df tends to 1, and for a rare word it tends to N. When the corpus is large, N is large, so the value can blow up. To dampen this effect, we take the log:
idf = log(N / df)
In practice we use
idf = log(N / (df + 1))
to avoid division by zero, since some words in the vocabulary may not appear in any document.
A frequent word like the stop word "the" would have idf ~ 0.
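A sketch of idf with the +1 smoothing above, reusing the df function from the previous sketch:

import math

def idf(term, docs):
    # inverse document frequency: log(N / (df + 1))
    return math.log(len(docs) / (df(term, docs) + 1))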
TF.IDF
The final TF.IDF score is simply the product of tf and idf:
TF.IDF = tf * idf
In this way it considers both the term frequency within a single document (the higher the better)
and the number of documents that contain the term (the fewer the better).
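Putting the sketches together (the tiny corpus below is made up for illustration):

def tf_idf(term, doc, docs):
    # TF.IDF score of `term` in `doc`, relative to the corpus `docs`
    return tf(term, doc) * idf(term, docs)

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["stocks", "fell", "on", "monday"],
]
# "the" appears in 2 of the 3 documents, so idf = log(3/3) = 0 and its TF.IDF is 0
print(tf_idf("the", docs[0], docs))
# "stocks" appears in only 1 document, so it gets a positive score there
print(tf_idf("stocks", docs[2], docs))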