Use probability to predict the next word or phrase:
Load large-scale data and read the files sentence by sentence. The word combinations within a single sentence carry meaning.
Based on this fact, we build a meaningful N-Gram model that gives us good predictions.
Model format is <key, value>: <word combination, count>
Take one sentence as an example: "how to build Ngram model from scratch". The following are its 2-gram and 3-gram <key, value> pairs, e.g. <"how to", 1>, <"to build", 1>, ..., <"how to build", 1>, <"to build Ngram", 1>, ...
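A minimal sketch of how the mapper side could generate these pairs (function names are illustrative, not from the original project):

```python
def ngrams(sentence, n):
    """Split a sentence into n-grams: every window of n consecutive words."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "how to build Ngram model from scratch"

# The mapper emits each n-gram as <key, value> = <word combination, 1>.
pairs = [(g, 1) for g in ngrams(sentence, 2) + ngrams(sentence, 3)]
# e.g. ("how to", 1), ("to build", 1), ..., ("how to build", 1), ...
```

With a configured N, the same loop simply runs for every n from 2 up to N.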
The reducer takes these pairs as input and sums up the count for each specific key.
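The summing step can be sketched like this (a simplified stand-in for the actual reducer; the shuffle phase of MapReduce is what really groups the pairs by key):

```python
from collections import defaultdict

def reduce_counts(mapper_pairs):
    """Sum the counts for each n-gram key, as the reducer would."""
    totals = defaultdict(int)
    for key, count in mapper_pairs:
        totals[key] += count
    return dict(totals)

# e.g. the 2-gram "how to" seen in three different sentences:
pairs = [("how to", 1), ("how to", 1), ("how to", 1), ("to build", 1)]
result = reduce_counts(pairs)
# result -> {"how to": 3, "to build": 1}
```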
The N-Gram model is then used as input to build the DB format, which helps us make predictions based on probability. Although the N-Gram model contains all combinations (up to the configured N), we only pick the top-K most probable terms to write into the DB. This also improves SQL query performance (e.g. LIKE '%').
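A sketch of the top-K selection under the assumption that each n-gram is split into a prefix plus its last word, and only the K most frequent followers per prefix are written as DB rows (function and row names are hypothetical):

```python
from collections import defaultdict
import heapq

def top_k_followers(ngram_counts, k):
    """For each n-gram <phrase, count>, split into (prefix, last word)
    and keep only the k most frequent following words per prefix."""
    by_prefix = defaultdict(list)
    for phrase, count in ngram_counts.items():
        words = phrase.split()
        prefix, follower = " ".join(words[:-1]), words[-1]
        by_prefix[prefix].append((count, follower))

    # DB rows: (prefix, following_word, count)
    rows = []
    for prefix, followers in by_prefix.items():
        for count, follower in heapq.nlargest(k, followers):
            rows.append((prefix, follower, count))
    return rows

counts = {"how to": 50, "how about": 30, "how do": 20, "how come": 5}
rows = top_k_followers(counts, 2)
# only "to" and "about" survive for the prefix "how"
```

Keeping only top-K rows per prefix shrinks the table, so a prefix lookup (or a LIKE scan) touches far fewer rows than the full model would.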
Mapper output format + Reducer output format (DB)