
Document representation for information retrieval systems

posted 10 Feb 2013, 05:57 by Diego Tosato
I am trying to figure out which document representation is best for an information retrieval (IR) system. Two options are popular in the state of the art: (1) BOW (Bag Of Words), adopted by IR systems like Lucene (http://lucene.apache.org/core/); (2) BOF (Bag Of Features), exploited by LETOR (http://research.microsoft.com/en-us/um/beijing/projects/letor/). BOW is apparently able to capture document semantics, but only with small vocabularies (e.g., 5000 words), so it does not scale. BOF, on the other hand, is independent of the vocabulary size and can easily be combined with machine learning techniques to build an ad hoc ranking system, even though it cannot capture semantics. The question is: which matters most for an IR system, capturing semantics or learning how to rank documents? In other words: which representation is better, BOW or BOF?
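To make the contrast concrete, here is a minimal Python sketch of the two representations. It is only an illustration under my own assumptions: the toy corpus, the whitespace tokenizer, and the four features (summed term frequency, summed IDF, summed TF-IDF, document length) are hypothetical choices in the spirit of LETOR-style features, not the actual Lucene or LETOR implementations. The function names `bow_vector` and `bof_vector` are made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real systems use analyzers).
    return text.lower().split()


def bow_vector(doc, vocabulary):
    # Bag of words: one dimension per vocabulary word, so the vector
    # length grows with the vocabulary.
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]


def bof_vector(query, doc, corpus):
    # Bag of features: a fixed number of query-document statistics,
    # whatever the vocabulary size. The four features are illustrative.
    doc_tokens = tokenize(doc)
    counts = Counter(doc_tokens)
    query_terms = tokenize(query)
    n_docs = len(corpus)

    def idf(term):
        # Smoothed inverse document frequency over the toy corpus.
        df = sum(1 for d in corpus if term in tokenize(d))
        return math.log((n_docs + 1) / (df + 1))

    tf_sum = sum(counts[t] for t in query_terms)
    idf_sum = sum(idf(t) for t in query_terms)
    tfidf_sum = sum(counts[t] * idf(t) for t in query_terms)
    return [tf_sum, idf_sum, tfidf_sum, len(doc_tokens)]


corpus = [
    "lucene indexes documents as bags of words",
    "letor provides features for learning to rank",
    "ranking documents with learned features",
]
vocabulary = sorted({word for d in corpus for word in tokenize(d)})

bow = bow_vector(corpus[0], vocabulary)
bof = bof_vector("ranking documents", corpus[2], corpus)

print(len(bow))  # equals len(vocabulary)
print(len(bof))  # always 4 in this sketch
```

The BOW vector has as many dimensions as the vocabulary has words, while the BOF vector keeps the same length however large the vocabulary becomes. A fixed length like this is what makes it convenient to feed into a learned ranking model.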

http://www.reddit.com/r/MachineLearning/comments/188s60/document_representation_for_information_retrieval/