TF IDF
measure of word importance
The topic of a document can be characterized by its words. But frequent words are not necessarily important, e.g. "the", "and", and other stop words.
Rare words are not necessarily important either, e.g. "albeit".
Suppose we have N documents in a corpus
Define t as a term (word) and d as a document (a bag of words).
TF term frequency
The term frequency of word t in document d is the count of t in d, normalized by the document length (the total number of words in d):
tf(t, d) = (count of t in d) / (total number of words in d)
Note there are other ways to normalize tf such as log, binary, max, etc.
The purpose is to reduce the impact of document length; otherwise a longer document would obviously have a higher raw count of t.
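A minimal Python sketch of this tf definition, assuming each document is already tokenized into a list of lowercase words (the tokenization step is not part of these notes):

from collections import Counter

def tf(term, doc):
    # term frequency: count of `term` in `doc`, normalized by document length
    counts = Counter(doc)
    return counts[term] / len(doc)

# e.g. tf("cat", ["the", "cat", "sat", "on", "the", "mat"]) == 1/6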
DF document frequency
The document frequency of t is the number of documents that contain the word t.
So if t appears in k of the N documents, then df = k. We should normalize it by the corpus size as well, i.e. df / N.
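A sketch of the (unnormalized) document frequency under the same assumption, where docs is a list of tokenized documents:

def df(term, docs):
    # document frequency: number of documents that contain `term`
    return sum(1 for doc in docs if term in doc)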
IDF inverse document frequency
The idea is that if a word appears in all documents, it is less useful for telling documents apart.
A rare word helps classification more than a common word does, so we take the inverse of the normalized document frequency: 1 / (df/N) = N/df.
For a common word, N/df tends to 1, and for a rare word it tends to N. When the corpus is large, N is large, so the value can blow up. To dampen this effect, we take the log:
idf = log(N / df)
In practice we use
idf = log(N / (df + 1))
to avoid division by zero, since some words in the vocabulary may not appear in any document.
A frequent word like the stop word "the" would have idf ~ 0.
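A sketch of idf with the +1 smoothing above, reusing the df function from the previous sketch:

import math

def idf(term, docs):
    # inverse document frequency: log(N / (df + 1))
    return math.log(len(docs) / (df(term, docs) + 1))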
TF.IDF
The final TF.IDF score is simply the product of tf and idf:
TF.IDF = tf * idf
In this way it considers both the term frequency within a single document (the higher the better)
and the number of documents that contain the term (the fewer the better).
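Putting the sketches together (the tiny corpus below is made up for illustration):

def tf_idf(term, doc, docs):
    # TF.IDF score of `term` in `doc`, relative to the corpus `docs`
    return tf(term, doc) * idf(term, docs)

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["stocks", "fell", "on", "monday"],
]
# "the" appears in 2 of the 3 documents, so idf = log(3/3) = 0 and its TF.IDF is 0
print(tf_idf("the", docs[0], docs))
# "stocks" appears in only 1 document, so it gets a positive score there
print(tf_idf("stocks", docs[2], docs))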