How to Win a Data Science Competition
August 2019
Module 1: Introduction
Created a Kaggle account: https://www.kaggle.com/raybellwaves
Course overview
- Intro to competitions
- Feature preprocessing and extraction
- EDA
- Validation
- Data leaks
- Metrics
- Mean-encodings
- Advanced features
- Hyperparameter optimization
- Ensembles
- Final solutions
- Winning solutions
Competition mechanics
Data, Model, Submission, Evaluation, Leaderboard
Here is an example of a Model http://blog.kaggle.com/2016/04/08/homesite-quote-conversion-winners-write-up-1st-place-kazanova-faron-clobber/
Submit only predictions.
The competition specifies which evaluation metric to use.
Analyze data -> fit model -> submit -> see public score -> repeat.
Other platforms: Kaggle, DrivenData, ...
Real World Applications vs Competitions
Understanding of the business problem; problem formalization; data collection; data preprocessing; modelling; a way to evaluate the model in real life; a way to deploy the model
Recap of main ML algorithms
- Linear
- Tree-based (e.g. Random Forest and Gradient Boosted Decision Trees). Libraries include dmlc XGBoost and Microsoft LightGBM
- kNN
- Neural Networks
Linear models are good for sparse high-dimensional data.
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
Quiz
In a RandomForest model we average 100 similarly performing trees, trained independently. So the order of trees does not matter in RandomForest, and the performance drop from removing any one tree will be very similar on average.
In a GBDT model we have a sequence of trees, each improving on the predictions of all previous ones. So, if we drop the first tree, the sum of all the remaining trees will be biased and overall performance should drop. If we drop the last tree, the sum of all previous trees won't be affected, so performance will change insignificantly (provided we have enough trees).
Each tree in the forest is independent of the others, so two RandomForest models with 500 trees each are essentially the same as a single RandomForest model with 1000 trees.
Decision Tree - Decision surface consists of lines parallel to the axis and it is sharp.
Random Forest - Decision surface consists of lines parallel to the axis and its boundaries are smooth
GBM Notebook
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science.html
https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
Software/Hardware requirements
https://aws.amazon.com/ec2/spot/; https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html - Cheapest cloud compute option
https://github.com/VowpalWabbit/vowpal_wabbit - Fast out of core machine learning
http://www.libfm.org/; https://www.csie.ntu.edu.tw/~cjlin/libffm/ - used for sparse data such as click-through rate prediction
https://github.com/RGF-team/rgf/tree/master/FastRGF; https://arxiv.org/pdf/1109.0887.pdf - ensemble trees
Scikit-learn v0.17 includes TSNE algorithms
Pandas Basics assignment
Data from https://www.kaggle.com/c/competitive-data-science-final-project/data
Feature preprocessing and generation with respect to models
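Feature preprocessing depends on the model.
Random Forest doesn't need OHE.
If we want to predict apple sales every week and there is a linear trend, a tree-based model such as gradient boosted decision trees will struggle, because trees cannot extrapolate beyond the target values seen in training.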
Numeric features
Tree based models don't care about the scale of the variable.
kNN does care about scale.
For linear models, the effect of regularization depends on feature scale. Scale is also important for gradient descent methods.
The choice of scaler is another hyperparameter, e.g. MinMaxScaler, StandardScaler.
Can clip features values to bounds e.g. 1st and 99th percentile to get rid of outliers.
You can rank numeric features e.g. scipy.stats.rankdata
log transform np.log(1 + x). Raise to the power <1: np.sqrt(x + 2/3). These drive outliers closer to the mean.
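The transforms above can be tried on a toy feature. This is a minimal sketch, assuming NumPy and SciPy are available; the array values are made up for illustration.

```python
import numpy as np
from scipy.stats import rankdata

# Toy feature with one large outlier (made-up values).
x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

# Clip to the 1st and 99th percentiles to tame outliers.
lower, upper = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lower, upper)

# Rank transform: the spacing between sorted values becomes uniform.
x_rank = rankdata(x)

# Log and square-root transforms pull large values toward the bulk of the data.
x_log = np.log1p(x)          # same as np.log(1 + x)
x_sqrt = np.sqrt(x + 2 / 3)
```

For rank and percentile transforms the same mapping has to be applied to train and test, for example by fitting it on the concatenated data.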
Create additional features:
- e.g. price per squared area;
- distance metric with x and y.
- keeping the fractional part of a price to see how it affects purchases
Categorical and ordinal features
ordinal - ordered categorical feature
Label encoding: LabelEncoder (alphabetical), pd.factorize (order of appearance), frequency encoding - tree-based models
OHE - non-tree based models
If target is based on two categorical features you can concat the string then OHE.
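The encodings in this section can be sketched with pandas. The DataFrame and its column names (`city`, `device`) are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy data with two categorical features.
df = pd.DataFrame({
    "city": ["paris", "oslo", "paris", "rome", "oslo", "paris"],
    "device": ["phone", "pc", "pc", "phone", "pc", "phone"],
})

# Label encoding by order of appearance.
df["city_label"] = pd.factorize(df["city"])[0]

# Frequency encoding: map each category to its share of rows.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# Interaction of two categorical features, then one-hot encoding.
df["city_device"] = df["city"] + "_" + df["device"]
ohe = pd.get_dummies(df["city_device"])
```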
Datetime and coordinates
Time and time delta.
last purchase date - e.g. churn prediction.
Distance to nearest school etc.
e.g. grid the map and find the most expensive house in a grid and distance from that.
Number of flats around a certain point with a certain radius.
You can rotate the grid by 45° which can help with decision trees
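A minimal sketch of the rotation idea, using made-up coordinates: after rotating by 45°, points on the diagonal line y = x share the same first coordinate, so a tree can separate them with a single axis-aligned split.

```python
import numpy as np

# Made-up (x, y) coordinates; the first two lie on the diagonal y = x.
coords = np.array([[1.0, 1.0], [2.0, 2.0], [0.0, 1.0]])

theta = np.deg2rad(45)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

# Rotate every point counter-clockwise by 45 degrees.
rotated = coords @ rotation.T
```

The rotated coordinates can be added as extra features next to the original ones.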
Handling missing data
Spot hidden missing values (e.g. -1, -999) using a histogram
Add an isnull indicator feature
XGBoost can handle NaN
If a categorical value is not in train but is in test, you can encode it by its frequency computed on train and test together
Additional material
https://scikit-learn.org/stable/modules/preprocessing.html
http://sebastianraschka.com/Articles/2014_about_feature_scaling.html
https://www.quora.com/What-are-some-best-practices-in-Feature-Engineering
Bag of words
CountVectorizer
term frequency: tf = 1 / x.sum(axis=1)[:, None]; x = x * tf
inverse document frequency: idf = np.log(x.shape[0] / (x > 0).sum(0)); x = x * idf. See TfidfVectorizer
N-grams: CountVectorizer parameters ngram_range, analyzer
lemmatization (root form) and stemming (chops off end of word)
stopwords
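The TF and IDF scalings can be checked on a small count matrix. This is a sketch with made-up counts; sklearn's TfidfVectorizer applies a smoothed variant of the IDF by default, so its numbers will differ slightly.

```python
import numpy as np

# Toy document-term count matrix: 3 documents x 4 terms (made-up counts).
x = np.array([[2.0, 1.0, 0.0, 1.0],
              [0.0, 3.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 2.0]])

# Term frequency: normalize each row so the counts of a document sum to 1.
tf = 1 / x.sum(axis=1)[:, None]
x_tf = x * tf

# Inverse document frequency: down-weight terms that occur in many documents.
idf = np.log(x.shape[0] / (x > 0).sum(0))
x_tfidf = x_tf * idf
```

The second term occurs in every document, so its IDF is log(1) = 0 and it is zeroed out.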
Word2vec, CNN
Convert words to vectors of a fixed, small dimension. Words used in the same context will be close. Can do addition and subtraction on the vectors.
Word2vec, GloVe, FastText, Doc2vec (pre-trained models available)
Bag of words:
- very large vectors
- meaning of each value is known
word2vec:
- small vectors
- words with similar meaning have similar embeddings
Images -> vector: CNN
Fine-tuning, e.g. with fastai: replace the last layer of VGG (1x1000) with (1x4), i.e. the number of classes in this comp.
Data augmentation increases the number of images to train on, e.g. rotate by 180°.
Additional material
https://andhint.github.io/machine-learning/nlp/Feature-Extraction-From-Text/
https://www.tensorflow.org/tutorials/representation/word2vec
https://rare-technologies.com/word2vec-tutorial/
http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
https://keras.io/applications/
https://www.kernix.com/blog/image-classification-with-a-pre-trained-deep-neural-network_p11
https://www.tensorflow.org/hub/tutorials/image_retraining
https://flyyufelix.github.io/2016/10/08/fine-tuning-in-keras-part2.html
Module 2: Exploratory Data Analysis
EDA
http://blog.kaggle.com/2015/12/03/dato-winners-interview-1st-place-mad-professors/
Visualization -> Idea
Idea -> Visualization
e.g. guest with number of promos received and when they used a promo
Building intuition about the data
Read around the subject, e.g. to predict an advertiser's cost, read about Google advert data such as number of times viewed (impressions) and number of times clicked. This gives sanity checks, e.g. clicks <= impressions.
Check intuition e.g. age > 125
Predict the advertisement cost for a particular ad notebook
Competition hosted by solutions.se. Dataset was exported from Google AdWords
Every time a user queries a search engine, Google AdWords decides what ad will be shown along with the actual search results.
How much they will pay to Google (column Cost) when the parameters (e.g. keywords) are changed
For each AdGroupId there is a distinct set of possible KeywordId's, but Device and Slot variants are the same for each ad. And the target is to predict what will be the daily cost for using different KeywordId's, Device type, Slot type to advertise ads from AdGroups.
ID is an aggregation index, so for each date the Cost is aggregated for each possible index.
Extend the train set and inject rows with 0 impressions. This change makes the train set very similar to the test set, so the models generalize better.
Exploring anonymized data
e.g. text where the words are replaced by their hash values
Guess meaning of the columns, guess the type of columns
Find relations between pairs, find feature groups
An example to explore a dataset:
- RandomForestClassifier
- Fill NaNs with -999
- Label encode categorical types
- Plot feature importance
- Put standardized data back to original if possible
df.dtypes, df.info(), x.value_counts(), x.isnull()
Visualizations
plt.hist(x)
XGBoost has algorithm to fill in NaNs
See relationship between two features e.g. difference, ratio between the two
Correlation heat map
Dataset cleaning and other things to check
If the feature is constant throughout it is worth removing it.
If there are duplicate features you can do df.T.drop_duplicates()
Sometimes categorical features are duplicates with differently named levels; label encode them first to detect this.
Understand why rows are duplicated.
Check dataset is shuffled (otherwise there could be data leakage)
Additional material
https://scikit-learn.org/stable/auto_examples/bicluster/plot_spectral_biclustering.html
Springleaf Competition EDA I
How many nans in each row?
train.isnull().sum(axis=1).head(15)
A lot of rows with same number of nans in a row. Do same with columns.
Drop columns which only have the same value
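# Number of unique values per column (assumed definition of feats_counts)
feats_counts = train.nunique(dropna=False)
constant_features = feats_counts.loc[feats_counts == 1].index.tolist()
traintest.drop(constant_features, axis=1, inplace=True)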
Remove duplicated columns
train_enc = pd.DataFrame(index=train.index)
for col in tqdm_notebook(traintest.columns):
    # Label encode so identical columns with different level names compare equal
    train_enc[col] = train[col].factorize()[0]
dup_cols = {}
for i, c1 in enumerate(tqdm_notebook(train_enc.columns)):
    for c2 in train_enc.columns[i + 1:]:
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1
traintest.drop(list(dup_cols.keys()), axis=1, inplace=True)
Springleaf Competition EDA II
Histogram of number of unique values in each column.
Columns with a large number of unique values that could be integers.
Can create features off of these such as 1 if equal to a value and 0 if not.
Some values look like NaN e.g. 99999999
Look at value_counts.
Split cols into numeric and other into categorical.
Fraction of elements that are greater in one column than in another column, e.g. if the next feature is always greater than the feature before, it is possibly cumulative.
Histogram of a column. It has some kind of periodicity. Time? Months? Can create another feature as modulus 12.
There is one column for cities. Can generate geo-location features from it.
Look at date features. e.g. difference between two dates.
Numerai Competition EDA
Could get good score by ordering data. LR on 21 original features + 21 features from knn.
Some connection between weekly dataset.
Every week data came in with a little bit of noise.
Validation and overfitting
Don't want to overfit, otherwise the model will not generalize to new data. You can also overfit to the public test dataset instead of the private test dataset.
Validation strategies
- Holdout (ShuffleSplit)
- K-fold - repeated holdout
- Leave-one-out - K-fold where k = number of samples. One object is left out each time. Good if you have little data and a model which is quick to train.
Stratify test and train set.
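The strategies above map onto splitters in sklearn.model_selection. A minimal sketch on toy data (the arrays are made up):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, StratifiedKFold

# Toy data: 10 samples, imbalanced binary target (6 zeros, 4 ones).
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Holdout: a single random train/validation split.
holdout = list(ShuffleSplit(n_splits=1, test_size=3, random_state=0).split(X))

# K-fold: every sample is used for validation exactly once.
kfold = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))

# Leave-one-out: K-fold with k equal to the number of samples.
loo = list(LeaveOneOut().split(X))

# Stratification keeps the class ratio the same in every fold.
strat = list(StratifiedKFold(n_splits=2, shuffle=True, random_state=0).split(X, y))
```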
Data splitting strategies
1. Previous and next target values
2. Time-based trend
Split by:
- Random - may be able to find features such as family should have successful credit.
- Timewise - Avg customers past month.
- By id.
- Combined
Your validation split should mimic the train/test split made by the competition organizers.
Problem occurring during validation
Problems during local validation e.g. different parameters for different folds
Submission stage - scores don't match. The validation split does not mimic the train/test split well.
Do extensive validation - avg scores from diff KFold splits
Try to work out train-test split.
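Leaderboard probing, e.g. if train contains women's heights and test contains men's heights, probe the leaderboard for the test mean and shift the predictions by the difference (e.g. add 7).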
Force validation to match distribution of test.
LB shuffle (different score on public and private)
Additional material
http://www.chioka.in/how-to-select-your-final-models-in-a-kaggle-competitio/
Basic data leaks
If time series comp is not split on time e.g. price next week.
e.g. cat vs. dogs which used different cameras.
sometimes adding row order improves score.
Leaderboard probing and examples of rare data leaks
Distribution was same for public and private test dataset.
e.g. Expedia: distance from person to city. Reverse engineer to get true coords.
Expedia challenge
destination_distance - user_city pair is a leak of the true hotel location.
How many hotels of which group in user city
Use three locations and find fourth location.
Grid the map into cells and compute sum, avg per cell.
Xgboost with 16 hours of training.
Data leak assignment
Module 3: Metrics optimization
Regression metrics review I
MSE = 1/N sum (y - yhat)^2
Best constant? Replace yhat above with alpha. Mean of target value.
RMSE = sqrt (MSE). Scale of error is scale of target.
d RMSE / d yhat = 1 / (2 sqrt(MSE)) * d MSE / d yhat
How much the model is better than the baseline:
R^2 = 1 - MSE / (1/N sum (y - ybar)^2). R^2 = 1 when MSE = 0. When MSE equals that of the constant (mean) model then R^2 = 0.
MAE = 1/N sum |y - yhat|. Less sensitive for outliers.
Best constant? Median of target value.
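The best-constant claims can be checked numerically. A small sketch with made-up targets: scan candidate constants on a grid and see which one minimizes each metric.

```python
import numpy as np

# Made-up target with an outlier: mean is 22, median is 3.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Candidate constants on a grid with step 0.01.
candidates = np.linspace(0, 100, 10001)

# Metric value of each constant prediction alpha.
mse = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - candidates[:, None]).mean(axis=1)

best_mse_const = candidates[mse.argmin()]   # close to np.mean(y)
best_mae_const = candidates[mae.argmin()]   # close to np.median(y)
```

The outlier drags the MSE-optimal constant far from most of the data, while the MAE-optimal constant stays at the median.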
Regression metrics review II
Mean Squared Percentage Error, MSPE = 100% / N * sum ((y - yhat) / y)^2. Best constant: weighted target mean.
Mean Absolute Percentage Error, MAPE = 100% / N * sum |(y - yhat) / y|. Best constant: weighted target median.
Root Mean Square Logarithmic Error, RMSLE = sqrt(1 / N * sum (log(y + 1) - log(yhat + 1))^2) = RMSE(log(y + 1), log(yhat + 1)) = sqrt(MSE(log(y + 1), log(yhat + 1))). Best constant: exp(mean(log(y + 1))) - 1.
Classifications metrics review
Soft classification - e.g. probability belonging to a class
Hard classification - e.g. argmax f_i(x)
Accuracy 0-1. Fraction of correctly classified objects. Soft prediction and apply threshold e.g. > 0.5. Best constant - predict the most frequent class.
Logarithmic loss (logloss), binary: -1/N * sum(y * log(yhat) + (1 - y) * log(1 - yhat))
Multiclass: -1/N * sum_i sum_l y_il * log(yhat_il). In practice predictions are clipped away from 0 and 1 by a small number. Penalizes very wrong predictions. Best constant: set to the frequency of the i-th class.
AUC ROC - sweeps over all thresholds rather than fixing one; depends only on the ordering of the predictions.
AUC = # correctly ordered pairs / total number of pairs. Random prediction AUC = 0.5
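The pair-counting definition of AUC can be computed directly. A sketch with made-up labels and scores; ties are counted as half a correctly ordered pair, which is a common convention.

```python
import numpy as np

# Made-up binary labels and soft predictions.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every (positive, negative) pair of scores.
diff = pos[:, None] - neg[None, :]

# A pair is correctly ordered when the positive object has the higher score.
auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
```

Here 8 of the 9 pairs are ordered correctly, so the AUC is 8/9.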
Cohen's Kappa - my score = 1 - (1 - accuracy) / (1 - baseline). Normalizes accuracy against a baseline.
= 1 - (1 - accuracy) / (1 - p_e). p_e is the accuracy expected on average if we randomly permute our predictions = 1 / N^2 * sum_k (n_k1 * n_k2)
error = 1 - accuracy
weighted error. If 3 classes create error weight matrix 3 x 3 and punish classifications far away.
Confusion matrix
weighted error = 1 / constant sum(confusion matrix * weight matrix).
Use linear or quadratic weights. weighted kappa = 1 - (weighted error / weighted baseline error).
General approaches for metrics optimization
Target metric - what we want to optimize
Optimization loss - what model optimizes
Sometimes model does not optimize to the target metric so may need to adjust output.
Some models optimize - MSE, Logloss
Preprocess train and optimize another metric, e.g. XGBoost cannot optimize RMSLE directly, so transform the target and optimize MSE.
Postprocess predictions e.g. kappa
write custom loss function
Sometimes you can use early stopping when the model starts to overfit.
Regression metrics optimization
MSE ~ L2 loss. Default
MAE ~ L1, median regression. XGBoost cannot optimize it as the second derivative is 0. LightGBM can use it. Called quantile loss in VW (Vowpal Wabbit). Huber loss is a smooth alternative.
MSPE, MAPE (weighted versions of MSE and MAE). Many libraries accept sample weights. Otherwise resample the train set using df.sample(weights=sample_weights) then use MSE. The test set stays as it is. Need to resample many times and average.
RMSLE. Train: transform target zi = log(yi + 1) and fit a model with MSE loss. Test: transform predictions back yhati = exp(zhati) - 1
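A minimal sketch of the RMSLE recipe, assuming a plain least-squares fit stands in for "a model with MSE loss". The data is synthetic and the helper name `rmsle` is only for illustration.

```python
import numpy as np

# Synthetic data whose target is roughly exponential in the feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.expm1(0.5 * X[:, 0] + rng.normal(0, 0.1, size=200))

# Train: transform the target and fit least squares (MSE loss) in log space.
z = np.log1p(y)
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, z, rcond=None)

# Predict: transform the predictions back to the original scale.
z_hat = A @ coef
y_hat = np.expm1(z_hat)

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error."""
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

score = rmsle(y, y_hat)
```

The RMSLE on the original scale equals the RMSE in log space, which is exactly what the least-squares fit minimized.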
Classification metrics optimization I
Logloss - doesn't really work with sklearn's RandomForestClassifier. You can calibrate probabilities e.g. Platt scaling, Isotonic regression, Stacking.
Accuracy - fit a model with any loss and tune the threshold, e.g. 0.5 threshold -> 0.7 threshold. Can do with a for loop/grid search.
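Threshold tuning for accuracy can be a simple grid search over soft predictions. A sketch with made-up validation labels and probabilities:

```python
import numpy as np

# Made-up validation labels and predicted probabilities.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.3, 0.52, 0.62, 0.67, 0.72, 0.8, 0.9])

# Try thresholds from 0.05 to 0.95 and keep the most accurate one.
thresholds = np.linspace(0.05, 0.95, 19)
accuracies = [((y_prob > t).astype(int) == y_true).mean() for t in thresholds]
best_threshold = thresholds[int(np.argmax(accuracies))]

# Accuracy at the default 0.5 threshold, for comparison.
default_accuracy = ((y_prob > 0.5).astype(int) == y_true).mean()
```

The threshold should be tuned on validation predictions, not on the data the model was trained on.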
Classification metrics optimization II
AUC - XGBoost, LightGBM
Quadratic weighted kappa metric - optimize MSE and find right thresholds or custom smooth loss for GBDT or neural nets.
Additional material
http://queirozf.com/entries/evaluation-metrics-for-classification-quick-examples-references
https://www.garysieling.com/blog/sklearn-gini-vs-entropy-criteria
https://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf ; https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf ; https://sourceforge.net/p/lemur/wiki/RankLib/ ; https://wellecks.wordpress.com/2015/01/15/learning-to-rank-overview/
http://nlp.uned.es/docs/amigo2007a.pdf
Concept of mean encoding
Encode categorical features.
Can do mean of target (feature mean) (likelihood). Better than label encoding, whose order is random.
Weight of evidence = ln (count of 1's / count of 0's) * 100
Count = sum(target)
Diff = count of 1's - count of 0's
Regularization
CV loop - estimate the encoding for each fold using only the other folds, e.g. with 5 folds
Smoothing - (mean(target) * nrows + globalmean * alpha) / (nrows + alpha)
Noise - degrades the quality of encoding. Used with Leave-One-Out (LOO)
Expanding mean - cumulative sum / cumulative count. Built in CatBoost
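Two of the regularization schemes above, smoothing and expanding mean, can be sketched with pandas. The DataFrame, the column names and alpha = 5 are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a binary target.
df = pd.DataFrame({
    "cat": list("aabbbcaabc"),
    "target": [1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
})
global_mean = df["target"].mean()

# Smoothing: shrink the category mean toward the global mean.
alpha = 5
stats = df.groupby("cat")["target"].agg(["mean", "count"])
smoothed = (stats["mean"] * stats["count"] + global_mean * alpha) / (stats["count"] + alpha)
df["cat_smooth"] = df["cat"].map(smoothed)

# Expanding mean: each row only sees the targets of earlier rows of its category.
cumsum = df.groupby("cat")["target"].cumsum() - df["target"]
cumcnt = df.groupby("cat").cumcount()
df["cat_expanding"] = (cumsum / cumcnt).fillna(global_mean)
```

The first occurrence of each category has no history, so it falls back to the global mean. The expanding mean depends on row order, so the data is usually shuffled (or sorted by time) first.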
Extensions and generalizations
Many-to-many relationship e.g. one person and multiple purchases. Split the table into long representation
Can use rolling statistics with time data
Can concat categorical features and mean encode them to make them numeric
For local experiments:
- Estimate encodings on train, then map them to train and validation. Regularize them on train, then validate the model on the train/validation split
For submission
- Estimate encodings on the whole train data, map to train and test, regularize them on train and fit on train
Module 4: Hyperparameter optimization
Hyperparameter tuning I
Understand which parameters are the most important.
Understand how the parameters will change the results.
Lots of hyperparameter optimization software: Hyperopt, Scikit-optimize, Spearmint, GPyOpt, RoBO, SMAC3
from hyperopt import fmin, hp, tpe

def xgb_score(param):
    # Run xgboost with parameters: param
    # Must return the validation loss for hyperopt to minimize
    raise NotImplementedError

def xgb_hyperopt():
    space = {
        'eta': 0.01,
        # hp.quniform returns floats; cast max_depth to int inside xgb_score
        'max_depth': hp.quniform('max_depth', 10, 30, 1),
        'min_child_weight': hp.quniform('min_child_weight', 0, 100, 1),
        # subsample must be <= 1, so the upper bound is 1.0
        'subsample': hp.quniform('subsample', 0.1, 1.0, 0.1),
        'gamma': hp.quniform('gamma', 0.0, 30, 0.5),
        'colsample_bytree': hp.quniform('colsample_bytree', 0.1, 1.0, 0.1),
        'objective': 'reg:linear',
        'nthread': 28,
        'silent': 1,
        'num_round': 2500,
        'seed': 2441,
        'early_stopping_rounds': 100
    }
    best = fmin(xgb_score, space, algo=tpe.suggest, max_evals=1000)
    return best
Split parameters into red (reduce overfitting), green (better fit on train set).
Hyperparameter tuning II
Tree-based models: XGBoost, LightGBM, CatBoost, RandomForest, ExtraTrees, FastRGF (regularized greedy forests).
GBDT - Build decision trees one after another to optimize a given metric. Parameters:
XGBoost:
- max_depth - Max depth of tree. Higher gives a better fit to the train set. 1-30. Sometimes better to stop tuning and generate some features. Start around 7.
- subsample - Fraction of objects used to fit each tree. 0-1. Lower values are less prone to overfitting.
- colsample_bytree, colsample_bylevel - Consider a fraction of features. If the model is overfitting you can lower this parameter.
- min_child_weight, lambda, alpha - Regularization parameters. Increase min_child_weight to make the model more conservative. 0, 5, 50, 300. An important parameter to tune.
- eta - Learning rate. Too high and it will not converge. Too small and it will take a while. 0.1, 0.01. Freeze eta while tuning num_round. Use early stopping to monitor the validation loss and stop when it increases. Once fitted, you can multiply num_round by a factor alpha and divide eta by the same factor.
- num_round - How many trees to build.
- seed - Random seed.
LightGBM:
- max_depth / num_leaves (to split data better)
- bagging_fraction
- feature_fraction
- min_data_in_leaf, lambda_l1, lambda_l2
- learning_rate, num_iterations
- *_seed
RandomForest/ExtraTrees - Each tree is independent of the others. Parameters:
- n_estimators - Accuracy plateaus eventually when increasing the number of trees.
- max_depth - Depth of trees. None is unlimited depth. Start around 7.
- max_features - Number of features considered at each split. Lower values train faster.
- min_samples_leaf - Like min_child_weight.
- criterion - Used to evaluate a split e.g. gini, entropy.
- random_state - Random seed.
- n_jobs - Number of cores. -1 to use all.
Hyperparameter tuning III
Neural Nets:
- Number of neurons per layer - Learn more complex decision boundary and overfit faster
- Number of layers
- Optimizers:
- SGD + momentum
- Adam, Adadelta, Adagrad (adaptive). These can be faster but lead to overfitting
- Batch size - large value leads to more overfitting. 32 or 64.
- Learning rate - start at 0.1 and lower it until the network converges. There is a connection between batch size and learning rate: if you increase the batch size by a factor of alpha you can also increase the learning rate by the same factor.
- Regularization:
- L2/L1 for weights
- Dropout/Dropconnect
- Static dropconnect - make the first hidden layer very large but drop 99% of connections from the input layer to the first hidden layer.
Linear model:
- SVC/SVR. SVMs don't require much tuning. Sklearn wraps libLinear and libSVM. Compile these yourself for multi-core support.
- LogisticRegression/LinearRegression + regularizers
- SGDClassifier/SGDRegressor
- Vowpal Wabbit - for out of core. FTRL (Follow the Regularized Leader).
Regularization parameters (C, alpha, lambda). Start very small and increase it. As C increases, SVM training slows down though.
Try L1, L2, L1 + L2 each. L1 provides some sparsity and can be used for feature selection.
Average models. e.g. if there is a good model fit for max_depth = 5 then do 3 GBDT with 4,5,6 and average them.
Quiz feedback
For RandomForest tune n_estimators, max_depth, min_samples_split
https://scikit-learn.org/stable/modules/grid_search.html
http://fastml.com/optimizing-hyperparams-with-hyperopt/
https://www.ntu.edu.sg/home/egbhuang/
Practical guide
The Nature Conservancy, Planet - understanding comps
Parameters - Importance, feasibility, understanding
Save data as hdf5/npy for faster reading.
Can cast to 32-bits to save RAM
Keep it reproducible e.g. fix random seeds
Log everything
KazAnova's competition pipeline, part 1
Understand the problem, EDA, define CV strategy, feature engineering, modelling, ensembling.
Type of problem; how big is the data? hardware? software? metric being tested on? (is there a similar comp to this?)
Plot histograms of variables. Are they similar between train and test?
Plot features versus the target variable and vs time.
Univariate predictability metrics (information value https://medium.com/@sundarstyles89/weight-of-evidence-and-information-value-using-python-6f05072e83eb)
Bin numerical features (to see non-linearity) and look at correlation matrices.
If time is important use time-based validation.
Be aware of features missing in test.
Additional material
https://github.com/Far0n/kaggletils/tree/master/kaggletils
https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/
Statistics and distance based features
Walmart recruitment, Acquire Valued Shoppers, Walmart weather shopping
KazAnova's competition pipeline, part 2
Ensembling - save predictions on internal validation and test.
Different ways to combine different features from averaging to stacking.
Look at correlations between predictions (good if low).
Statistics and distance based features
CTR (click-through rate). Combine user and page type and get min and max price of ads.
Group on neighbors is more difficult. e.g. rental price. Number of houses in 500m, 1000m,
Springleaf competition - Mean encode all variables. For every point find 2000 nearest neighbors using Bray-Curtis metric: sum |u_i - v_i| / sum |u_i + v_i|. Calculate features from those 2000 neighbors, e.g. mean target of nearest 5, mean distance to 10 closest neighbors with target 1.
Matrix factorizations
Movie recommendation. user (rows) and ratings (column). Use to encode something about user.
e.g. bag-of-words ...
feature fusion e.g. Vanilla BOW, BOW+TF-IDF + BOW (bigrams) -> dimensionality reduction -> Tree-based method.
Can use on only some columns and can provide additional diversity (good for ensembles).
SVD, PCA, TruncatedSVD
Non-negative matrix factorization (NMF) - Ensures that all latent factors are non-negative (>= 0), good for count-like data. Makes data more suitable for decision trees. You can also do NMF(log(X + 1)).
Transform all data then select train and test.
Feature interactions
e.g. banner selection on a website. Two categorical features - ad type and website type. Concat these two to create another feature; you can then do OHE on it.
If the values are numeric you can multiply them (or sum, diff, division). This enlarges the feature space and makes fitting easier. You can then do feature selection or dimensionality reduction.
Data -> sums, diffs, dots, divisions -> fit randomforest, get features of importances, select a few of the most important features.
You can look at third order interactions etc.
You can extract features from decision trees e.g. is age > 9.5.
sklearn: tree_model.apply()
- returns index
xgboost: booster.predict(pred_leaf=True)
- returns index
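A sketch of tree-derived features with sklearn: `apply()` returns, for every sample, the index of the leaf it lands in for each tree, and one-hot encoding those indices gives high-order interaction features. The data and model settings are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Synthetic data where the target depends on an interaction of two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X, y)

# Leaf index per sample and per tree, shape (n_samples, n_estimators).
leaf_idx = forest.apply(X)

# One-hot encode the leaf indices, e.g. as input to a linear model.
leaf_ohe = OneHotEncoder(handle_unknown="ignore").fit_transform(leaf_idx)
```

Each row of the encoded matrix has exactly one active column per tree.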
t-SNE
Non-linear dimension reduction - manifold learning.
https://scikit-learn.org/stable/modules/manifold.html
MNIST 784 dimensions (28x28) -> 2 dimensions. 3 is next to 5 and is close to 6 and 8.
Good use for EDA.
Be careful of hyper-parameters (perplexity). Higher more clustered.
https://distill.pub/2016/misread-tsne/
Test several perplexity. Train and test should be projected together. If n features > 500 may want to reduce dimensions before projecting.
https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
https://lvdmaaten.github.io/tsne/
Additional material
https://scikit-learn.org/stable/modules/decomposition.html
https://github.com/DmitryUlyanov/Multicore-TSNE
https://research.fb.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/
https://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html
KNN Assignment
knn for features was important for https://www.kaggle.com/c/otto-group-product-classification-challenge and https://www.kaggle.com/c/springleaf-marketing-response
Introduction to ensemble methods
Combining ML models to get a better prediction
Averaging (or blending), weighted averaging, conditional averaging, bagging, boosting, stacking, stacknet
e.g. predicting age. Model 1 is good for ages < 50, model 2 is good for ages >= 50
average is (model 1 + model 2) / 2
weighted average (model 1 x 0.7 + model 2 x 0.3)
Condition method e.g. < 50 use 1 and >= 50 use another one.
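The three combination schemes can be compared on made-up predictions. In this sketch `is_50_or_older` is a hypothetical flag assumed to be known at prediction time (for example derived from another feature); the numbers are invented so that each model is good on one age range only.

```python
import numpy as np

# Made-up true ages and predictions of two models.
age_true = np.array([20.0, 35.0, 45.0, 55.0, 70.0, 80.0])
model_1 = np.array([21.0, 34.0, 46.0, 70.0, 50.0, 60.0])   # good below 50
model_2 = np.array([40.0, 50.0, 30.0, 56.0, 69.0, 81.0])   # good at 50 and above

# Hypothetical condition known at prediction time.
is_50_or_older = np.array([False, False, False, True, True, True])

simple_avg = (model_1 + model_2) / 2
weighted_avg = 0.7 * model_1 + 0.3 * model_2
conditional = np.where(is_50_or_older, model_2, model_1)

def mae(pred):
    """Mean absolute error against the true ages."""
    return np.abs(age_true - pred).mean()
```

On this toy data the conditional blend has a much lower error than either average, because each model is only used where it is good.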
Bagging
Averaging slightly different versions of the same model e.g. random forest
Two main sources of error bias (underfitting; high bias/low variance), variance (overfitting; low bias/high variance)
Parameters to control bagging:
- Change the random seed
- Row (sub) sampling or bootstrapping
- Shuffling
- Column (sub) sampling
- Model-specific parameters
- Number of models (or bags)
- (Optionally) parallelism
BaggingClassifier and BaggingRegressor from sklearn
Boosting
Form of weighted averaging of models where each model is built sequentially via taking into account the past model performance.
Weight boosting - if trying to classify 1's or 0's you can take pred (probability) and subtract y from pred to get the absolute error. Generate a new column which is the weight (1 + absolute error). Pass this weight to the next model, e.g. if it is 2, pretend this row occurs twice. You can repeat with other weights.
parameters:
- Learning rate: PredN = pred0*eta + pred1*eta + ...
- Number of estimators - (e.g. models. more we add smaller LR need to put).
- Input model (anything that accepts weights)
- Sub boosting type: AdaBoost; LogitBoost
Residual boosting - Use error (to get direction). Then make this the new target variable. Final prediction would be new prediction + old prediction.
parameters:
- Learning rate: PredN = pred0*eta + pred1*eta + ...
- Number of estimators
- Row (sub) sampling; Column (sub) sampling
- Input model - better with trees
- Sub boosting type: Fully gradient based; Dart (uses dropout)
XGBoost, LightGBM, H2O's GBM, CatBoost, Sklearn's GBM
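Residual boosting can be sketched in a few lines with shallow sklearn trees as the input model. This is an illustration on synthetic data, not how the libraries above implement it; eta = 0.1 and 100 estimators are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)

eta = 0.1            # learning rate
n_estimators = 100   # number of trees

# Start from a constant prediction, then repeatedly fit a tree to the residuals.
pred = np.full_like(y, y.mean())
trees = []
for _ in range(n_estimators):
    residual = y - pred                      # the error becomes the new target
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    pred += eta * tree.predict(X)            # add a damped correction
    trees.append(tree)

mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - pred) ** 2)
```

A smaller eta needs more estimators to reach the same training error, which matches the note on learning rate and number of estimators.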
Stacking
Making predictions of a number of models in a hold-out set then using different (meta) model to train on these predictions
Doesn't need to know input data
Wolpert (1992) it involves:
- Splitting the train set into two parts
- Train several base learners on the first part
- Make predictions with the base learners on the second part
- Use last predictions as input to train a higher level learner
Fit on train (A) and save predictions on valid (B) and test (C). Do multiple times. Train algorithm on B1 and make predictions for C1
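The A/B/C scheme can be sketched with sklearn. The data is synthetic and the choice of base models and meta model is only an illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=400)

# A: train for base models, B: holdout for the meta model, C: test.
X_a, y_a = X[:200], y[:200]
X_b, y_b = X[200:300], y[200:300]
X_c, y_c = X[300:], y[300:]

# Fit diverse base learners on A.
base_models = [DecisionTreeRegressor(max_depth=4, random_state=0),
               KNeighborsRegressor(n_neighbors=5)]
for model in base_models:
    model.fit(X_a, y_a)

# Base predictions on B and C become the meta features.
meta_b = np.column_stack([model.predict(X_b) for model in base_models])
meta_c = np.column_stack([model.predict(X_c) for model in base_models])

# Train a simple meta model on B, then predict C.
meta_model = LinearRegression().fit(meta_b, y_b)
stacked_pred = meta_model.predict(meta_c)
```

The meta model never sees the raw input features, only the base model predictions.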
With time sensitive data - respect time
Diversity as important as performance: different algorithms, different input features.
Performance plateaus after N models
Meta model is normally modest e.g. linear regression
StackNet
Scalable meta model method that utilizes stacking to combine multiple models in a NN architecture of multiple levels
4 layer stacking won a Kaggle comp, e.g. the Homesite comp.
In a NN every node is a simple linear model with some non-linear transformation. Instead of linear model we could use any model.
Cannot use back-propagation
Use stacking to link each model/node with the target
If data is limited it is hard to keep splitting train into train and valid. Can do k-fold then average if going to extend to many layers.
Ensembling tips and tricks
1st level tips. Diversity based on algorithms:
- 2-3 gradient boosted trees (lightgbm, xgboost)
- 2-3 NN (keras, pytorch)
- 1-2 ExtraTrees/Random Forest (sklearn)
- 1-2 linear logistic/ridge, svm (sklearn)
- 1-2 knn (sklearn)
- 1 factorization machine (libfm)
- 1 svm with non-linear kernel (sklearn)
Diversity based on input data:
- categorical features: OHE, label encoding, target encoding
- Numerical features: outliers, binning, derivatives (smooth), percentiles, scaling
- Interactions col1 /+*- col2; groupby; unsupervised
Subsequent level tips. Simple (or shallow) algorithms
- gradient boosted trees with small depth (2/3)
- linear models with high regularization
- extra trees
- shallow NN (1 layer)
- knn with BrayCurtis Distance
- Brute force a search for best linear weights based on cv
Feature engineering:
- pairwise differences between meta features
- row-wise statistics like averages or stds
For every 7.5 models in previous layer we add 1 meta model in subsequent layer
Be mindful of target leakage
Stacked ensembles from H2O
https://xcessiv.readthedocs.io/en/stable/
Can run classifiers in regression e.g. predict is age > 50
https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products
Additional material
https://mlwave.com/kaggle-ensembling-guide/
https://github.com/kaz-Anova/StackNet
https://heamy.readthedocs.io/en/latest/usage.html
CatBoost 1
Categorical features:
- OHE (
one_hot_max_size
) - Number of appearances
- Statistics with label usage of a random permutation of the data
- Combines features in a greedy way (only the best combination).
Symmetric decision trees: the same split condition (e.g. weight > 65) is used across all nodes of a tree level.
CatBoost 2
Leaf value is calculated as average gradient on all objects in this leaf.
Ordered boosting
Speed up: rsm (random subspace method) = 0.1; max_ctr_complexity=1
; boosting_type='Plain'
; task_type='GPU'
Overfitting detector
Evaluating custom metrics during training
CatBoost Viewer
Can be visualized using TensorBoard
Nan features support
Training parameters:
- Number of trees + learning rate
- Tree depth
- L2 regularization
- Bagging temperature
- Random strength
Module 5: Competitions go through
Crowdflower Competition
https://www.kaggle.com/c/crowdflower-search-relevance
Relevance of search result. 1-4 of score with 4 being best
Metric: Quadratic weighted kappa (0 random to 1 perfect); may go below 0. Agreement between two ratings
https://www.kaggle.com/aroraaman/quadratic-kappa-metric-explained-in-5-simple-steps
N x N confusion matrix C where N = 4 (scores 1:4). Expectation matrix E = ground truth histogram x predictions histogram (outer product), normalized to the same sum as C.
N x N weights: Wi,j = (i-j)^2 / (N-1)^2
kappa, k = 1 - sum(W_i,j * C_i,j) / sum(W_i,j * E_i,j) where E is expectation matrix.
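The formulas above translate into a short function. This is a sketch, assuming labels are coded 0..N-1 (so ratings 1-4 are shifted by one); the function name is only for illustration.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Quadratic weighted kappa for integer labels in 0..n_classes-1."""
    # Confusion matrix C: rows are true labels, columns are predictions.
    C = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1

    # Quadratic weights: W_ij = (i - j)^2 / (N - 1)^2.
    idx = np.arange(n_classes)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2

    # Expectation matrix E: outer product of the two histograms,
    # normalized to the same total as C.
    E = np.outer(C.sum(axis=1), C.sum(axis=0)) / C.sum()

    return 1 - (W * C).sum() / (W * E).sum()

# Ratings on the 1-4 scale, shifted to 0-3 before scoring.
ratings_true = np.array([1, 2, 3, 4, 4, 2]) - 1
ratings_pred = np.array([1, 2, 3, 4, 3, 2]) - 1
kappa = quadratic_weighted_kappa(ratings_true, ratings_pred, n_classes=4)
```

Perfect agreement gives 1, while predictions that are independent of the truth give a value around 0.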
Text features - query, title and description.
For each (query-title) and (query-description) calculated:
- Number of matching words
- Cosine distance between tf-idf representations
- Distance between the avg word2vec vectors
- Levenshtein distance
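Two of the features above can be sketched in plain Python. This is a minimal illustration assuming whitespace tokenization and case-insensitive matching; the winning solution's exact preprocessing is not given in these notes.

```python
def matching_words(query, text):
    """Number of distinct words shared by the query and the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


n_common = matching_words("led tv", "Samsung LED TV 40 inch")
distance = levenshtein("kitten", "sitting")
```

Each function would be applied to every (query, title) and (query, description) pair to produce one feature column.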
Symbolic n-grams, i.e. at the character level: 1-5 grams, keeping the first 300 components as features.
Extend queries - for each query, the top 10 words associated with a score of 4.
Median and variance of the ratings. Sample weights use the heuristic w = 1 / (1 + var)
Create ensembles by using different combinations of features.
Treated the problem as a regression task.
Springleaf Marketing Response
https://www.kaggle.com/c/springleaf-marketing-response
Stacking scheme: feature engineering -> XGBoost -> meta features -> Meta XGB -> Linear combination
Binary classification with AUC as the metric.
Feature packs - data cleaning; mean-encoded dataset; KNN dataset on mean-encoded
out-of-fold predictions (meta features should be diverse).
Neural Network - inputs transformed by scaling, ranks, power
Microsoft Malware Classification Challenge
https://www.kaggle.com/c/microsoft-malware-prediction
Each file is provided as a HEX dump and as a disassembly.
Multiclass LogLoss
Baseline - size of file and file id
Single-byte counts (257 features)
Extract some features from disassembly.
n-gram
Entropy for a sliding window over byte sequence. Do some stats on this such as mean, median, max and min.
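A minimal sketch of the sliding-window entropy feature, assuming the file content is available as a bytes object. The window and step sizes (256 and 128) are illustrative assumptions, not values from the solution.

```python
import numpy as np


def window_entropy(data, window=256, step=128):
    """Shannon entropy (in bits) of the byte values in each sliding window."""
    arr = np.frombuffer(data, dtype=np.uint8)
    entropies = []
    for start in range(0, max(len(arr) - window, 0) + 1, step):
        counts = np.bincount(arr[start:start + window], minlength=256)
        probs = counts[counts > 0] / counts.sum()
        entropies.append(-(probs * np.log2(probs)).sum())
    return np.array(entropies)


def entropy_stats(data, window=256, step=128):
    """Summary statistics of the per-window entropies, used as features."""
    ent = window_entropy(data, window, step)
    return {"mean": ent.mean(), "median": np.median(ent),
            "max": ent.max(), "min": ent.min()}


# Toy input: a high-entropy region followed by a run of zero bytes.
data = bytes(range(256)) * 2 + bytes(512)
stats = entropy_stats(data)
```

A window containing all 256 byte values once has the maximum entropy of 8 bits, while a window of identical bytes has entropy 0.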
Dimensionality reduction - non-negative matrix factorization (NMF), PCA. Both minimize ||X - SA||_2; NMF adds the constraints S_ij >= 0 and A_ij >= 0. NMF is good if you use counts.
NMF features work better than PCA features in Random Forest.
A log transform changes the NMF objective from MSE to RMSLE: NMF(log(X + 1))
For 4-gram original -> omit rare -> Linear SVM + L1-penalty -> threshold Random Forest importance
For 10-gram original -> Hashing -> Omit rare -> Linear SVM + L1-penalty -> threshold Random Forest importance
Find features that separate the objects with large errors.
Random Forest - needs manual calibration for log-loss
Move to XGBoost.
Bagging works well with boosting.
Use test data for training. Sample label according to predicted distribution or use predicted class.
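The two labelling options can be sketched as follows, assuming proba is an (n_samples, n_classes) array of predicted class probabilities for the test set. The function name and the strategy argument are illustrative.

```python
import numpy as np


def pseudo_labels(proba, strategy="sample", seed=0):
    """Assign labels to test rows from predicted class probabilities."""
    proba = np.asarray(proba, dtype=float)
    if strategy == "argmax":
        # Use the predicted class.
        return proba.argmax(axis=1)
    # Sample each label according to the row's predicted distribution
    # via inverse-CDF sampling on the cumulative probabilities.
    rng = np.random.default_rng(seed)
    cumulative = proba.cumsum(axis=1)
    u = rng.random((proba.shape[0], 1)) * cumulative[:, -1:]
    labels = (u >= cumulative).sum(axis=1)
    return np.minimum(labels, proba.shape[1] - 1)


test_proba = np.array([[0.1, 0.7, 0.2],
                       [0.8, 0.1, 0.1]])
hard = pseudo_labels(test_proba, strategy="argmax")
sampled = pseudo_labels(test_proba, strategy="sample")
```

The pseudo-labelled test rows would then be appended to the training set before refitting.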
Try and predict train set then try and predict test set.
per-class weight mixing
https://github.com/geffy/kaggle-malware
Walmart: Trip Type Classification
https://www.kaggle.com/c/walmart-recruiting-trip-type-classification
Purchases people made during their trip
Group by visit number to see which items people purchased on each trip.
Acquire Valued Shoppers Challenge, Part 1
https://www.kaggle.com/c/acquire-valued-shoppers-challenge
Recommender challenge:
- 310,000 shoppers (160k in train, 150k in test)
- 350,000,000 transactions (for 1+ year for each shopper)
- 37 offers
- No exact product IDs, but a product can be defined by the combination of brand, category and company
Visit -> Visit -> Coupon! -> Redeem -> Again? The goal is to turn buying the item into a customer habit.
Optimize AUC for whether the shopper will buy again.
Most offers appear in either train or test, but not both.
Focus on acquisition. Limited history of customer and offer
Offer propensity varied e.g. 50% for offer2, 20% for offer4.
Created one file per customer, and separate files for every category, brand and company.
Leave-one-out offer. e.g. predict offer16 using info for all other offers.
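A sketch of leave-one-offer-out splitting, assuming each training row carries an offer identifier. The generator name and the toy offer ids are illustrative.

```python
import numpy as np


def leave_one_offer_out(offer_ids):
    """Yield (offer, train_idx, valid_idx), holding out one offer at a time."""
    offer_ids = np.asarray(offer_ids)
    for offer in np.unique(offer_ids):
        holdout = offer_ids == offer
        yield offer, np.flatnonzero(~holdout), np.flatnonzero(holdout)


offers = ["offer2", "offer4", "offer2", "offer16"]
splits = {offer: (train_idx.tolist(), valid_idx.tolist())
          for offer, train_idx, valid_idx in leave_one_offer_out(offers)}
```

This mimics the test situation, where the model has to score offers it has never seen during training.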
Leave-one-out offer + concatenation
Acquire Valued Shoppers Challenge, Part 2
Strategies for recommenders:
- Content-based - the customer likes this product
- Collaborative filtering - how closely a customer resembles other customers who are likely to buy a product
- Hybrid - combination of the above two
Content-based
Product hierarchy versus customer and time.
Defined time intervals: last 30, 30-60, 60-90, 90-120, 120-180 and 180-360 days. Computed for category and brand; category and company; brand and company; category, brand and company.
Features selected through forward selection with cross-validation.
Big values capped. Missing values replaced with -1
Ridge regression on the actual repeat purchase.
Collaborative filtering
Would the customer have bought the product, had they not received the offer?
A model for every offer in train and test.
Target variable: natural logarithm of the number of times a customer bought the product in the 90 days before receiving the coupon.
Features based on users' activity:
- Counts of popular categories, brands, companies
- Restricted Boltzmann Machines to summarize purchase activity on least popular
- Average amount of purchase, total visits, distinct brands, categories, companies
- Total discounts/returns, visits in weekends, spend in weekends
GBM (from sklearn) on the log of counts (log helped cap large values)
Combination
Transform score into ranks and combine scores 50/50.
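A minimal sketch of the rank-based blend, assuming two score vectors over the same customers. Ties are broken by position here, which is a simplification. Because AUC depends only on the ordering of predictions, converting to ranks makes scores on different scales comparable.

```python
import numpy as np


def to_ranks(scores):
    """Map scores to ranks scaled to [0, 1]; ties are broken by position."""
    scores = np.asarray(scores, dtype=float)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores, kind="stable")] = np.arange(len(scores))
    return ranks / max(len(scores) - 1, 1)


def blend_by_rank(a, b, weight=0.5):
    """Weighted average of two rank-transformed score vectors."""
    return weight * to_ranks(a) + (1 - weight) * to_ranks(b)


content_scores = [0.1, 0.9, 0.5]   # e.g. content-based model output
collab_scores = [10, 30, 200]      # e.g. collaborative model output
blended = blend_by_rank(content_scores, collab_scores)
```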
Additional material
http://ndres.me/kaggle-past-solutions/
https://www.kaggle.com/wiki/PastSolutions
Competition
https://www.kaggle.com/c/competitive-data-science-predict-future-sales
Time series with data (item x shop x day) for 18 months, daily data
Test is item x shop for 1 month, monthly data
old comp - https://www.kaggle.com/c/competitive-data-science-final-project
Submit to Kaggle, which gives feedback on the public dataset. Coursera gives feedback on both the public and private datasets.
Start early
Start by submitting sample_submission.csv from the "Data" page on Kaggle, then try submitting different constants.
Predict total sales for every product and store in the next month.
A good exercise is to reproduce previous_value_benchmark. As the name suggests, in this benchmark the prediction for each shop/item pair is just the monthly sales from the previous month, i.e. October 2015.
The most important step in reproducing this score is correctly aggregating the daily data and constructing the monthly sales data frame. You need to get lagged values, fill NaNs with zeros and clip the values into the [0,20] range. If you do it correctly, you'll get precisely 1.16777 on the public leaderboard.
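The steps can be sketched with pandas on a toy frame. The column names follow the competition files (date_block_num, shop_id, item_id, item_cnt_day, ID, item_cnt_month); the function name and the toy numbers are illustrative, and last_block is assumed to be the month index of the last training month.

```python
import pandas as pd


def previous_value_benchmark(sales, test, last_block):
    """Predict each test pair's sales as its total for the last train month."""
    keys = ["shop_id", "item_id"]
    # Aggregate daily sales into monthly totals per shop/item pair.
    monthly = (sales.groupby(["date_block_num"] + keys, as_index=False)
                    ["item_cnt_day"].sum()
                    .rename(columns={"item_cnt_day": "item_cnt_month"}))
    prev = monthly.loc[monthly["date_block_num"] == last_block,
                       keys + ["item_cnt_month"]]
    # Lagged value per test pair; unseen pairs become 0, then clip to [0, 20].
    out = test.merge(prev, on=keys, how="left")
    out["item_cnt_month"] = out["item_cnt_month"].fillna(0).clip(0, 20)
    return out[["ID", "item_cnt_month"]]


sales = pd.DataFrame({"date_block_num": [32, 33, 33, 33, 33],
                      "shop_id": [1, 1, 1, 2, 2],
                      "item_id": [10, 10, 10, 10, 11],
                      "item_cnt_day": [5, 15, 10, -1, 3]})
test = pd.DataFrame({"ID": [0, 1, 2, 3],
                     "shop_id": [1, 2, 2, 3],
                     "item_id": [10, 10, 11, 10]})
submission = previous_value_benchmark(sales, test, last_block=33)
```

In the toy data the first pair sums to 25 and is clipped to 20, the returned item is clipped up to 0, and the pair with no history is filled with 0.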
Generating features like this is a necessary basis for more complex models. Also, if you decide to fit some model, don't forget to clip the target into [0,20] range, it makes a big difference.
You can get a rather good score after creating some lag-based features like in advice from previous week and feeding them into gradient boosted trees model.
Apart from item/shop pair lags you can try adding lagged values of total shop or total item sales (which are essentially mean-encodings). All of that is going to add some new information.
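A sketch of adding a pair-level lag and a lagged shop total, assuming a monthly frame with the columns date_block_num, shop_id, item_id and item_cnt_month. The helper add_lag and the new column names are hypothetical. Only lagged totals are used as features, since the current month's total contains the target.

```python
import pandas as pd


def add_lag(df, keys, value_col, lag, name):
    """Attach value_col from `lag` months earlier, matched on keys."""
    on = ["date_block_num"] + keys
    shifted = df[on + [value_col]].drop_duplicates(on).copy()
    shifted["date_block_num"] += lag  # month m's value lines up with m + lag
    shifted = shifted.rename(columns={value_col: name})
    return df.merge(shifted, on=on, how="left")


monthly = pd.DataFrame({"date_block_num": [0, 0, 1, 1],
                        "shop_id": [1, 1, 1, 1],
                        "item_id": [10, 11, 10, 11],
                        "item_cnt_month": [2.0, 3.0, 4.0, 2.0]})
# Total sales of each shop in each month.
monthly["shop_total"] = (monthly.groupby(["date_block_num", "shop_id"])
                                ["item_cnt_month"].transform("sum"))

features = add_lag(monthly, ["shop_id", "item_id"], "item_cnt_month", 1,
                   "item_cnt_lag_1")
features = add_lag(features, ["shop_id"], "shop_total", 1, "shop_total_lag_1")
features = features.fillna(0)
```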
Try to carefully tune the hyperparameters of your models; there may be a better set of parameters out there. But don't spend too much time on it.
Try ensembling. Start with simple averaging of linear model and gradient boosted trees like in programming assignment notebook. And then try to use stacking.
Explore new features! There is a lot of useful information in the data: text descriptions, item categories, seasonal trends.
Notes
$ conda install -c conda-forge kaggle
Go to the 'Account' tab of your user profile (https://www.kaggle.com/<username>/account) and select 'Create API Token'. This will trigger the download of kaggle.json, a file containing your API credentials. Place this file in the location ~/.kaggle/kaggle.json
$ kaggle competitions download -c competitive-data-science-predict-future-sales
$ gunzip sales_train.csv.gz
$ gunzip sample_submission.csv.gz
$ gunzip test.csv.gz
$ kaggle competitions submit -c competitive-data-science-predict-future-sales -f sample_submission.csv -m "Message"
http://www.blackarbs.com/blog/time-series-analysis-in-python-linear-models-to-garch/11/1/2016#AR=
https://www.kaggle.com/jagangupta/time-series-basics-exploring-traditional-ts
https://www.kaggle.com/dlarionov/feature-engineering-xgboost