Bag of Words Model
In natural language processing and information retrieval, a text or document can be represented by the bag (multiset) of its words. It is a simplifying representation that just lists the words, ignoring grammar and even word order.
In practice it is used to generate features that characterize a text or document. We can calculate different measures based on the list of words. The most common one is frequency, i.e. the number of times each word appears in the list. Email spam detection is a successful application of the bag of words model plus word frequency.
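Word frequency over a bag of words can be sketched in a few lines of Python; the tokenization here (lowercase, split on whitespace) is a simplifying assumption, not a fixed part of the model:

```python
from collections import Counter

def bag_of_words(text):
    """Split text into lowercase words and count how often each appears."""
    words = text.lower().split()
    return Counter(words)

doc = "the cat sat on the mat"
print(bag_of_words(doc))
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```

The resulting counts can be fed directly into a classifier as features, which is essentially what a naive Bayes spam filter does.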
When it comes to document classification, more features can be calculated, such as document frequency and then tf-idf.
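As a rough sketch of the idea, here is tf-idf computed with one common weighting scheme (raw term count times log of inverse document frequency); real libraries use various smoothed variants, so treat the exact formula as an assumption:

```python
import math
from collections import Counter

def tf_idf(docs):
    """For each document, score each word by tf * log(N / df),
    where tf is the word's count in that document, N is the number
    of documents, and df is how many documents contain the word."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # document frequency: number of documents containing each word
    df = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        tf = Counter(words)
        scores.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return scores

docs = ["xiaoming likes food", "xiaohong likes books"]
print(tf_idf(docs))
```

Note that a word appearing in every document (like "likes" here) gets idf = log(1) = 0, so it contributes nothing as a distinguishing feature.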
So the bag of words model itself is just a very simple idea: get the list of words, and generate features based on that list for further machine learning purposes.
Obviously the bag of words model doesn't use any context information from adjacent words. E.g. given "xiaoming likes food" and "xiaohong likes books", it doesn't know that "likes" always follows someone's name.
So people use the n-gram model, which is an extremely simple idea as well.
Just put consecutive words together as a single unit. E.g. "xiaoming likes food" generates the 2-gram (bigram) list {"xiaoming likes", "likes food"}. Similarly there are 3-grams, 4-grams, etc., hoping that the n consecutive words capture useful information. The bag of words model is just the special case of n-grams where n = 1.
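Extracting n-grams is a one-liner over a sliding window; a minimal sketch:

```python
def ngrams(text, n):
    """Return the list of n-grams (n consecutive words) in text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("xiaoming likes food", 2))  # ['xiaoming likes', 'likes food']
print(ngrams("xiaoming likes food", 1))  # bag of words: the n = 1 case
```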
An n-gram language model predicts the probability of the n-th word given the previous n-1 words in the sequence, which can be treated as a Markov chain: the next word depends only on a fixed-length window of history.