Understanding Indexing and Search

Post date: Aug 22, 2014 1:13:43 AM

Playing around with databases recently introduces me to the concept of indexing. An index is a data structure that is designed for efficient search tasks. A search task must be well-defined with searchable fields, i.e. objects that have name-value format. An example search task is to "find all fruits that are orange". In this case, the only searchable field has "fruit" as its name, and the value that "fruit" can take on are "orange", "apple", "banana", "pineapple", etc. Each searchable field can be indexed, i.e. structured for efficient value look-up, by a tree data structure. A common choice of data structure for index implementation is the B-tree [1], or more recently, Fractal-tree [4]. Note that each leaf in the tree, beside storing the value of the field, can also contains pointers to other data objects. This is one way that an index can store together searchable and non-searchable fields in a single document. In addition, if the value of the field has a lot of redundancy, it may need to be analyzed and tokenized first [2].

A database, beside storing data, must also be able to service queries for data. A query creates a search task that leverages the index data structure to find the location of the queried data in a large database. It then use the location information to retrieve the queried data quickly [6]. This approach is much faster than scanning through the entire database, at the expense of additional storage space of the indexing information.

The type of indexing described earlier is static, i.e. indexing is always done for new data regardless of query statistics. This can be inefficient in certain cases. A natural extension/improvement is to take into account query statistics to dynamically decide whether to index on a data object. This is known as dynamic indexing, and a naive implementation of dynamic indexing can be found in [3]. 

The most famous search and indexing engine is probably Lucene [5]. 

[1] http://docs.mongodb.org/manual/indexes/

[2] http://oak.cs.ucla.edu/cs144/projects/lucene/

[3] http://ravendb.net/docs/2.0/http-api/indexes/dynamic-indexes

[4] http://www.tokutek.com/2013/06/tokumx-fractal-trees-with-mongodb/

[5] http://lucene.apache.org/core/4_9_0/demo/overview-summary.html#overview_description

[6] http://www.lucenetutorial.com/lucene-in-5-minutes.html