Stacking model

In a nutshell: train multiple models, then use those models' outputs to train a final model.

The assumption is that each individual model is good at something but not everything. The final model combines the merits of all the models.

Input

1 train dataset (80% of the data)

1 test dataset (20% of the data)

First, train 3 separate base models M1, M2 and M3:

1. XGBoost: use xgboost.cv to tune and find the best parameters.

2. Logistic regression (LR) with cross-validation, giving another set of parameters.

3. A neural network (NN) with a few layers; the network design is fixed in advance.
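The three base models above can be sketched as follows. This is a minimal sketch on synthetic data: sklearn's GradientBoostingClassifier stands in for XGBoost (swap in xgboost.cv for the real parameter search), LogisticRegressionCV does the cross-validated LR tuning, and a small MLP plays the role of the fixed NN. All variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier   # stand-in for XGBoost
from sklearn.linear_model import LogisticRegressionCV     # LR tuned by CV
from sklearn.neural_network import MLPClassifier          # small fixed NN
from sklearn.model_selection import train_test_split

# synthetic data, split 80% train / 20% test as in the Input section
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# M1: boosted trees (parameters would come from xgboost.cv in practice)
M1 = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
# M2: logistic regression, regularisation strength chosen by 5-fold CV
M2 = LogisticRegressionCV(cv=5, max_iter=1000).fit(X_train, y_train)
# M3: a small neural network with a fixed architecture
M3 = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000,
                   random_state=0).fit(X_train, y_train)
```

Each of M1, M2 and M3 is fitted on the full train dataset; these are the models used at test time later on.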

We now have 3 models. Next we need their predictions on another set of training data, so that those predictions can be used to train the final stacking model.

However, there is no second train dataset, and reserving one would waste training data.

So we simply reuse the same train dataset as follows.

Divide the train dataset into 5 folds

for each fold i in the 5 folds:

   test_temp = fold i

   train_temp = the other 4 folds combined

   train Model 1' (XGBoost) on train_temp

   train Model 2' (LR) on train_temp

   train Model 3' (NN) on train_temp

   predict test_temp with Model 1', 2' and 3' separately; store the results separately as M1', M2' and M3'

   /* Don't train Models 1', 2' and 3' on the whole train dataset, otherwise their predictions on the train dataset would be overfitted.

      The 5-fold CV here is not for parameter tuning. It simply provides prediction values on the train dataset from Models 1', 2' and 3'

without predicting on exactly the data each one was trained on.

M1', M2' and M3' will be used to train the final stacked model.

Models 1', 2' and 3' are different from Models M1, M2 and M3, which are trained on the full train dataset.

   */

   

Now every record in the train dataset has the predictions M1', M2' and M3'.
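The 5-fold loop above can be sketched like this: every training record receives a prediction from base models that never saw it. This is a minimal sketch on synthetic data with sklearn stand-ins (GradientBoostingClassifier in place of XGBoost); the `oof` matrix and other names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier   # stand-in for XGBoost
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import KFold

X_train, y_train = make_classification(n_samples=400, n_features=10,
                                       random_state=0)

base_models = [
    GradientBoostingClassifier(random_state=0),           # Model 1'
    LogisticRegression(max_iter=1000),                    # Model 2'
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                  random_state=0),                        # Model 3'
]

# oof column j holds the out-of-fold predictions: 0 = M1', 1 = M2', 2 = M3'
oof = np.zeros((len(X_train), len(base_models)))

for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X_train):
    # train_temp = 4 folds, test_temp = the held-out fold
    for j, model in enumerate(base_models):
        m = clone(model).fit(X_train[train_idx], y_train[train_idx])
        # predict only the held-out fold, so no record is predicted
        # by a model that was trained on it
        oof[test_idx, j] = m.predict_proba(X_train[test_idx])[:, 1]
```

sklearn's `cross_val_predict` performs this same out-of-fold prediction in one call per model, if you prefer not to write the loop by hand.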

Use M1', M2' and M3' (optionally along with any existing features in the train dataset) to train another LR model.

This LR model is the final stacked model.

Output and Test

Use Models M1, M2 and M3 to predict on the test dataset.

Feed the results from M1, M2 and M3 into the stacked LR model to get the final result.
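Putting the whole procedure together, a minimal end-to-end sketch on synthetic data looks like this (sklearn stand-ins again, with GradientBoostingClassifier in place of XGBoost; all names are illustrative):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier   # stand-in for XGBoost
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

base_models = [
    GradientBoostingClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
]

# 1) out-of-fold predictions M1', M2', M3' on the train dataset
oof = np.zeros((len(X_train), len(base_models)))
for tr, te in KFold(n_splits=5, shuffle=True,
                    random_state=0).split(X_train):
    for j, model in enumerate(base_models):
        m = clone(model).fit(X_train[tr], y_train[tr])
        oof[te, j] = m.predict_proba(X_train[te])[:, 1]

# 2) the final stacked model: LR trained on the out-of-fold predictions
stacker = LogisticRegression().fit(oof, y_train)

# 3) test time: M1, M2, M3 fitted on the FULL train dataset predict the
#    test dataset, and their outputs feed the stacked LR
fitted = [clone(m).fit(X_train, y_train) for m in base_models]
test_meta = np.column_stack(
    [m.predict_proba(X_test)[:, 1] for m in fitted])
final_pred = stacker.predict(test_meta)
```

sklearn also ships a `StackingClassifier` that wraps this entire pattern (out-of-fold predictions for the meta-model, full-data refits for test time) in a single estimator.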