Use probability to predict the next word or phrase:
Load large-scale data and read the files sentence by sentence. The word combinations within a single sentence carry meaning.
Based on this fact, we build a meaningful N-Gram model that gives us good predictions.
Model format is <key, value>: <word combination, count>
Take one sentence as an example: "how to build Ngram model from scratch". The following are its 2-gram and 3-gram <key, value> pairs, e.g. <"how to", 1>, <"to build", 1>, ..., <"how to build", 1>, <"to build Ngram", 1>, ...
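A minimal sketch of how the mapper side could generate these pairs (function names are illustrative, not from the original project):

```python
def ngrams(sentence, n):
    """Split a sentence into n-grams: every window of n consecutive words."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "how to build Ngram model from scratch"

# The mapper emits each n-gram as <key, value> = <word combination, 1>.
pairs = [(g, 1) for g in ngrams(sentence, 2) + ngrams(sentence, 3)]
# e.g. ("how to", 1), ("to build", 1), ..., ("how to build", 1), ...
```

With a configured N, the same loop simply runs for every n from 2 up to N.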
The reducer takes these pairs as input and sums up the count for each specific key.
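The summing step can be sketched like this (a simplified stand-in for the actual reducer; the shuffle phase of MapReduce is what really groups the pairs by key):

```python
from collections import defaultdict

def reduce_counts(mapper_pairs):
    """Sum the counts for each n-gram key, as the reducer would."""
    totals = defaultdict(int)
    for key, count in mapper_pairs:
        totals[key] += count
    return dict(totals)

# e.g. the 2-gram "how to" seen in three different sentences:
pairs = [("how to", 1), ("how to", 1), ("how to", 1), ("to build", 1)]
result = reduce_counts(pairs)
# result -> {"how to": 3, "to build": 1}
```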
The N-Gram model is then used as input to build the DB format, which helps us make predictions based on probability. Although the N-Gram model contains all combinations (up to the configured N), we only pick the top-K most probable terms to write into the DB. This also improves SQL query performance (e.g. LIKE '%').
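A sketch of the top-K selection under the assumption that each n-gram is split into a prefix plus its last word, and only the K most frequent followers per prefix are written as DB rows (function and row names are hypothetical):

```python
from collections import defaultdict
import heapq

def top_k_followers(ngram_counts, k):
    """For each n-gram <phrase, count>, split into (prefix, last word)
    and keep only the k most frequent following words per prefix."""
    by_prefix = defaultdict(list)
    for phrase, count in ngram_counts.items():
        words = phrase.split()
        prefix, follower = " ".join(words[:-1]), words[-1]
        by_prefix[prefix].append((count, follower))

    # DB rows: (prefix, following_word, count)
    rows = []
    for prefix, followers in by_prefix.items():
        for count, follower in heapq.nlargest(k, followers):
            rows.append((prefix, follower, count))
    return rows

counts = {"how to": 50, "how about": 30, "how do": 20, "how come": 5}
rows = top_k_followers(counts, 2)
# only "to" and "about" survive for the prefix "how"
```

Keeping only top-K rows per prefix shrinks the table, so a prefix lookup (or a LIKE scan) touches far fewer rows than the full model would.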
Mapper output format + Reducer output format (DB)