How to Win a Data Science Competition

August 2019

Module 1: Introduction

Created a Kaggle account: https://www.kaggle.com/raybellwaves

Course overview

  • Intro to competitions
  • Feature preprocessing and extraction
  • EDA
  • Validation
  • Data leaks
  • Metrics
  • Mean-encodings
  • Advanced features
  • Hyperparameter optimization
  • Ensembles
  • Final solutions
  • Winning solutions

Competition mechanics

Data, Model, Submission, Evaluation, Leaderboard

Here is an example of a Model http://blog.kaggle.com/2016/04/08/homesite-quote-conversion-winners-write-up-1st-place-kazanova-faron-clobber/

Submit only predictions.

The competition specifies which evaluation metric to use.

Analyze data -> fit model -> submit -> see public score -> repeat.

Other platforms: Kaggle, DrivenData, ...

Real World Applications vs Competitions

Understanding of the business problem; problem formalization; data collection; data preprocessing; modelling; a way to evaluate the model in real life; a way to deploy the model

Recap of main ML algorithms

Linear models are good for sparse high dimensional data.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

Quiz

In a RandomForest model we average ~100 similarly performing trees, trained independently. So the order of trees does not matter in RandomForest, and the performance drop from removing any one tree will be very similar on average.

In a GBDT model we have a sequence of trees, each improving the predictions of all the previous ones. So if we drop the first tree, the sum of all the remaining trees will be biased and overall performance should drop. If we drop the last tree, the sum of all the previous trees won't be affected, so performance will change insignificantly (provided we have enough trees).

Each tree in a forest is independent of the others, so two RandomForests with 500 trees each are essentially the same as a single RandomForest model with 1000 trees.

Decision Tree - Decision surface consists of lines parallel to the axis and it is sharp.

Random Forest - Decision surface consists of lines parallel to the axis and its boundaries are smooth

GBM Notebook

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier.staged_decision_function

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_hastie_10_2.html#sklearn.datasets.make_hastie_10_2

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science.html

https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

Software/Hardware requirements

https://aws.amazon.com/ec2/spot/; https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html - Cheapest cloud compute option

https://github.com/VowpalWabbit/vowpal_wabbit - Fast out of core machine learning

http://www.libfm.org/; https://www.csie.ntu.edu.tw/~cjlin/libffm/ - used for sparse data such as click-through rate prediction

https://github.com/RGF-team/rgf/tree/master/FastRGF; https://arxiv.org/pdf/1109.0887.pdf - ensemble trees

Scikit-learn v0.17 includes TSNE algorithms

Pandas Basics assignment

Data from https://www.kaggle.com/c/competitive-data-science-final-project/data

Feature preprocessing and generation with respect to models

Feature preprocessing depends on the model.

Random Forest doesn't need OHE

If you want to predict apple sales every week and there is a linear trend, something like a gradient boosted decision tree will struggle (trees cannot extrapolate beyond the train range).

Numeric features

Tree based models don't care about the scale of the variable.

knn does care about scale

The impact of regularization is proportional to feature scale. Scale also matters for gradient-descent-based methods.

The choice of scaling is effectively another hyperparameter, e.g. MinMaxScaler vs. StandardScaler.

Can clip features values to bounds e.g. 1st and 99th percentile to get rid of outliers.

You can rank numeric features e.g. scipy.stats.rankdata

log transform np.log(1 + x). Raise to the power <1: np.sqrt(x + 2/3). These drive outliers closer to the mean.
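A minimal sketch of these transforms on a made-up array (the clip bounds and offsets follow the notes above):

import numpy as np
import scipy.stats

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # toy feature with an outlier

# Winsorization: clip to the 1st and 99th percentiles to tame outliers
lo, hi = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lo, hi)

# Rank transform (ties get averaged ranks)
x_rank = scipy.stats.rankdata(x)

# Log transform and fractional power: both pull outliers closer to the mean
x_log = np.log(1 + x)
x_pow = np.sqrt(x + 2 / 3)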

Create additional features:

  • e.g. price per squared area;
  • distance metric with x and y.
  • keeping the fractional part of a price to see how it affects purchases

Categorical and ordinal features

ordinal - ordered categorical feature

Label encoding. LabelEncoder (alphabetical, order of appearance (pd.factorize), freq encoding) - tree based models

OHE - non-tree based models

If target is based on two categorical features you can concat the string then OHE.
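A sketch of these encodings on a toy frame (the column names are made up):

import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'Miami', 'NYC', 'Boston'],
                   'sector': ['retail', 'tech', 'tech', 'retail']})

# Label encoding in order of appearance (for tree-based models)
df['city_le'] = pd.factorize(df['city'])[0]

# Frequency encoding
freq = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(freq)

# One-hot encoding (for non-tree-based models)
ohe = pd.get_dummies(df['city'], prefix='city')

# Interaction of two categoricals: concatenate the strings, then OHE
df['city_sector'] = df['city'] + '_' + df['sector']
ohe_interaction = pd.get_dummies(df['city_sector'])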

Datetime and coordinates

Time and time delta.

last purchase date - e.g. churn prediction.

distance to nearest school etc.

e.g. grid the map and find the most expensive house in a grid and distance from that.

Number of flats around a certain point with a certain radius.

You can rotate the grid by 45°, which can help with decision trees

Handling missing data

Spot hidden missing values by looking at histograms of feature values (e.g. a suspicious spike).

Use isnull() to create binary is-missing indicator features.

XGBoost can handle NaN

If a category appears in test but not in train, you can fall back to frequency encoding.

Additional material

https://scikit-learn.org/stable/modules/preprocessing.html

http://sebastianraschka.com/Articles/2014_about_feature_scaling.html

https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/

https://www.quora.com/What-are-some-best-practices-in-Feature-Engineering

Bag of words

CountVectorizer

term frequency: tf = x / x.sum(axis=1)[:, None]

inverse document frequency: idf = np.log(x.shape[0] / (x > 0).sum(0)). TfidfVectorizer combines both.
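A sketch computing TF-IDF by hand from CountVectorizer counts and comparing with TfidfVectorizer (which additionally applies smoothing and L2 normalization by default); the texts are placeholders:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ['the cat sat', 'the cat sat on the mat', 'dogs and cats']

x = CountVectorizer().fit_transform(texts).toarray()

# Term frequency: normalize each row by its total word count
tf = x / x.sum(axis=1)[:, None]

# Inverse document frequency: log of (number of documents / document frequency)
idf = np.log(x.shape[0] / (x > 0).sum(0))

tfidf_manual = tf * idf

# TfidfVectorizer does roughly the same, with smoothing and L2 norm by default
tfidf_sklearn = TfidfVectorizer().fit_transform(texts)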

N-grams: the ngram_range and analyzer parameters of CountVectorizer.

lemmatization (root form) and stemming (chops off end of word)

stopwords

Word2vec, CNN

Converts each word to a vector of modest dimension. Words with the same context will be close. Can do addition and subtraction on the vectors.

Word2vec, GloVe, FastText, Doc2vec (pre-trained versions available)

Bag of words:

  • very large vectors
  • each value is known

word2vec:

  • small vectors
  • words with similar meaning have similar embeddings

Images -> vector: CNN

fine-tuning, e.g. with fastai; e.g. replace the last layer of VGG (1x1000) with (1x4) for this competition

data augmentation to increase the number of images to train on, e.g. rotate by 180°.
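A sketch of the fine-tuning idea in Keras, assuming TensorFlow 2.x and a 4-class problem; the input shape, head size and augmentation settings are placeholders:

import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Pre-trained VGG16 without its 1000-class head
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze convolutional weights, train only the new head

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(4, activation='softmax'),  # 1x4 head instead of the original 1x1000
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Data augmentation, e.g. rotations and flips, to enlarge the training set
augment = tf.keras.preprocessing.image.ImageDataGenerator(rotation_range=180,
                                                           horizontal_flip=True)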

Additional material

https://andhint.github.io/machine-learning/nlp/Feature-Extraction-From-Text/

https://www.tensorflow.org/tutorials/representation/word2vec

https://rare-technologies.com/word2vec-tutorial/

http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/

https://keras.io/applications/

https://www.kernix.com/blog/image-classification-with-a-pre-trained-deep-neural-network_p11

https://www.tensorflow.org/hub/tutorials/image_retraining

https://flyyufelix.github.io/2016/10/08/fine-tuning-in-keras-part2.html

Module 2: Exploratory Data Analysis

EDA

http://blog.kaggle.com/2015/12/03/dato-winners-interview-1st-place-mad-professors/

Visualization -> Idea

Idea -> Visualization

e.g. guest with number of promos received and when they used a promo

Building intuition about the data

Read around the subject, e.g. to predict an advertiser's cost, read about Google advert data: number of times viewed (impressions), number of times clicked. Sanity check: clicks < impressions.

Check intuition e.g. age > 125

Predict the advertisement cost for a particular ad notebook

Competition hosted by solutions.se. Dataset was exported from Google AdWords

Every time a user queries a search engine, Google AdWords decides what ad will be shown along with the actual search results.

How much they will pay to Google (column Cost) when the parameters (e.g. keywords) are changed

For each AdGroupId there is a distinct set of possible KeywordId's, but Device and Slot variants are the same for each ad. And the target is to predict what will be the daily cost for using different KeywordId's, Device type, Slot type to advertise ads from AdGroups.

ID is an aggregation index -- so for each date the Cost is aggregated for each possible index

Extend the train-set and inject rows with 0 impressions. Such change will make train set very similar to the test set and the models will generalize nicely.

Exploring anonymized data

e.g. working with hashed values of the words

Guess meaning of the columns, guess the type of columns

Find relations between pairs, find feature groups

An example to explore a dataset:

  1. RandomForestClassifier
  2. Fill NaNs with -999
  3. Label encode categorical columns
  4. Plot feature importances
  5. Map standardized data back to the original values if possible
df.dtypes, df.info(), x.value_counts(), x.isnull()

Visualizations

plt.hist(x)

XGBoost has algorithm to fill in NaNs

See relationship between two features e.g. difference, ratio between the two

Correlation heat map

Dataset cleaning and other things to check

If the feature is constant throughout it is worth removing it.

If there are duplicate features you can do df.T.drop_duplicates()

Sometimes features are encoded differently but are actually duplicates

Understand why rows are duplicated

Check dataset is shuffled (otherwise there could be data leakage)

Additional material

https://networkx.github.io/

https://scikit-learn.org/stable/auto_examples/bicluster/plot_spectral_biclustering.html

Springleaf Competition EDA I

How many nans in each row?

train.isnull().sum(axis=1).head(15)

A lot of rows with same number of nans in a row. Do same with columns.

Drop columns which only have the same value

constant_features = feats_counts.loc[feats_counts==1].index.tolist()
traintest.drop(constant_features,axis = 1,inplace=True)

Remove duplicated columns

import numpy as np
import pandas as pd
from tqdm import tqdm_notebook

# Factorize every column so differently-encoded duplicates compare equal
train_enc = pd.DataFrame(index=train.index)
for col in tqdm_notebook(traintest.columns):
    train_enc[col] = train[col].factorize()[0]

# A column is a duplicate if it matches an earlier column exactly
dup_cols = {}
for i, c1 in enumerate(tqdm_notebook(train_enc.columns)):
    for c2 in train_enc.columns[i + 1:]:
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1

traintest.drop(list(dup_cols.keys()), axis=1, inplace=True)

Springleaf Competition EDA II

Histogram of number of unique values in each column.

Columns with a large number of unique values could be integers.

Can create features off of these such as 1 if equal to a value and 0 if not.

Some values look like NaN e.g. 99999999

Look at value_counts.

Split cols into numeric and other into categorical.

Fraction of elements in one column that are greater than in another column, e.g. if the next feature is always greater than the one before, the features are possibly cumulative.

Histogram of a column. It has some kind of periodicity. Time? Months. Can create another feature as the value modulo 12.

There is one column for cities. Can generate geo-location features from it.

Look at date features. e.g. difference between two dates.

Numerai Competition EDA

Could get good score by ordering data. LR on 21 original features + 21 features from knn.

Some connection between weekly dataset.

Every week data came in with a little bit of noise.

Validation and overfitting

You don't want to overfit, so the model still works on new data. You can also overfit to the public test dataset at the expense of the private test dataset.

Validation strategies

Leave-one-out: good if you have little data and a model which is quick to train.

Stratify test and train set.

Data splitting strategies

1. Previous and next target values

2. Time-based trend

Split by:

  • Random, row-wise - assumes rows are independent; e.g. members of the same family may land in both train and validation and leak information (a family tends to share credit outcome).
  • Timewise - Avg customers past month.
  • By id.
  • Combined

Your validation split should mimic the train/test split made by the competition organizers.

Problem occurring during validation

Problems during local validation e.g. different parameters for different folds

Submission stage - local and leaderboard scores don't match: likely the local validation split does not mimic the train/test split.

Do extensive validation - avg scores from diff KFold splits

Try to work out train-test split.

Leaderboard probing, e.g. estimate men's height by adding 7 to women's height.

Force validation to match distribution of test.

LB shuffle (different scores on the public and private leaderboards).

Additional material

http://www.chioka.in/how-to-select-your-final-models-in-a-kaggle-competitio/

Basic data leaks

If a time-series competition is not split on time, e.g. you can leak next week's price.

e.g. cat vs. dogs which used different cameras.

sometimes adding row order improves score.

Leaderboard probing and examples of rare data leaks

Distribution was same for public and private test dataset.

e.g. Expedia: distance from person to city. Reverse engineer to get the true coordinates.

Expedia challenge

destination_distance - user_city pair is a leak to the true hotel location.

How many hotels of which group in user city

Use three locations and find fourth location.

Grid the map into cells and compute sums and averages.

Xgboost with 16 hours of training.

Data leak assignment

https://www.kaggle.com/olegtrott/the-perfect-score-script

Module 3: Metrics optimization

Regression metrics review I

MSE = 1/N sum (y - yhat)^2

Best constant? Replace yhat above with alpha. Mean of target value.

RMSE = sqrt (MSE). Scale of error is scale of target.

d RMSE / d yhat = 1 / (2 * sqrt(MSE)) * d MSE / d yhat

How much model is better than baseline:

R^2 = 1 when MSE = 0; R^2 = 0 when MSE equals that of the best constant (mean) model.

MAE = 1/N sum |y - yhat|. Less sensitive for outliers.

Best constant? Median of target value.

Regression metrics review II

Mean Squared Percentage Error, MSPE = 100% / N * sum ((y - yhat) / y)^2. Best constant: weighted target mean.

Mean Absolute Percentage Error, MAPE = 100% / N * sum |(y - yhat) / y|. Best constant: weighted target median.

Root Mean Squared Logarithmic Error, RMSLE = sqrt(1 / N * sum (log(y + 1) - log(yhat + 1))^2) = RMSE(log(y + 1), log(yhat + 1)) = sqrt(MSE(log(y + 1), log(yhat + 1))). Best constant: roughly exp(mean of log(y + 1)) - 1.

Classifications metrics review

Soft classification - e.g. probability belonging to a class

Hard classification - e.g. argmax f_i(x)

Accuracy 0-1. Fraction of correctly classified objects. Soft prediction and apply threshold e.g. > 0.5. Best constant - predict the most frequent class.

Logarithmic loss (logloss), binary: -1/N * sum(y * log(yhat) + (1 - y) * log(1 - yhat))

multiclass: -1/N * sum_i sum_l y_il * log(yhat_il). In practice predictions are clipped a small distance away from 0 and 1. Penalizes confident wrong predictions heavily. Best constant: the vector of class frequencies.
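A small sketch of binary logloss with the clipping mentioned above; it should match sklearn.metrics.log_loss up to the clipping constant:

import numpy as np

def binary_logloss(y_true, y_pred, eps=1e-15):
    # Clip predictions away from 0 and 1 so the log never blows up
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])
print(binary_logloss(y_true, y_pred))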

AUC ROC - considers all thresholds; it depends only on the ordering of the predictions.

AUC = # correctly ordered pairs / total number of pairs. Random prediction AUC = 0.5

Cohen's Kappa = 1 - (1 - accuracy) / (1 - baseline accuracy), i.e. accuracy normalized by a baseline.

= 1 - (1 - accuracy) / (1 - p_e), where p_e is the accuracy obtained if we randomly permute our predictions = 1/N^2 * sum_k (n_k1 * n_k2)

error = 1 - accuracy

weighted error. If 3 classes create error weight matrix 3 x 3 and punish classifications far away.

Confusion matrix

weighted error = 1 / constant sum(confusion matrix * weight matrix).

Use linear or quadratic weights. weighted kappa = 1 - (weighted error / weighted baseline error).
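A sketch of quadratic weighted kappa built from the confusion matrix, weight matrix and expected (baseline) matrix described above; it should agree with sklearn's cohen_kappa_score(..., weights='quadratic'):

import numpy as np
from sklearn.metrics import confusion_matrix

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    C = confusion_matrix(y_true, y_pred, labels=list(range(n_classes)))
    # Expected matrix: outer product of the two label histograms, scaled to the same total
    E = np.outer(C.sum(axis=1), C.sum(axis=0)) / C.sum()
    # Quadratic penalty grows with the distance between predicted and true class
    i, j = np.indices((n_classes, n_classes))
    W = (i - j) ** 2 / (n_classes - 1) ** 2
    return 1 - (W * C).sum() / (W * E).sum()

y_true = [0, 1, 2, 2, 3]
y_pred = [0, 1, 1, 2, 3]
print(quadratic_weighted_kappa(y_true, y_pred, n_classes=4))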

General approaches for metrics optimization

Target metric - what we want to optimize

Optimization loss - what model optimizes

Sometimes model does not optimize to the target metric so may need to adjust output.

Some models optimize - MSE, Logloss

Preprocess the train set and optimize another metric, e.g. RMSLE cannot be optimized directly in XGBoost, so transform the target and optimize RMSE instead.

Postprocess predictions, e.g. for kappa

write custom loss function

Sometimes you can use early stopping when the model starts to overfit.

Regression metrics optimization

MSE ~ L2 loss. Default

MAE ~ L1, median regression. XGBoost cannot optimize as second derivative is 0. LightGBM can use it. Called quantile loss in VW (Vowpal Wabbit). Huber loss.

MSPE, MAPE (weighted versions of MSE; MAE). Many libraries accept sample weights. Alternatively, resample the train set using df.sample(weights=sample_weights) and then use MSE; the test set stays as is. Resample many times and average.

RMSLE. Train: transform target zi = log(yi + 1) and fit a model with MSE loss. Test: transform predictions back yhati = exp(zhati) - 1
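A sketch of this RMSLE recipe with a placeholder model and random data:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical train/test arrays
X_train, y_train = np.random.rand(100, 5), np.random.rand(100) * 100
X_test = np.random.rand(20, 5)

# Train: fit an MSE model on log1p(target), i.e. z = log(y + 1)
model = GradientBoostingRegressor().fit(X_train, np.log1p(y_train))

# Test: transform predictions back with yhat = exp(zhat) - 1
y_pred = np.expm1(model.predict(X_test))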

Classification metrics optimization I

Logloss - doesn't work well out of the box with sklearn.RandomForestClassifier. You can calibrate probabilities, e.g. Platt scaling, Isotonic regression, Stacking.

Accuracy - fit any model and then tune the threshold, e.g. move from a 0.5 threshold to 0.7. Can do a for loop/grid search, as sketched below.
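A sketch of that threshold search on a toy validation set (the labels and probabilities are placeholders):

import numpy as np
from sklearn.metrics import accuracy_score

y_valid = np.array([0, 1, 1, 0, 1])              # true labels on validation
p_valid = np.array([0.2, 0.6, 0.8, 0.4, 0.55])   # predicted probabilities

thresholds = np.linspace(0.1, 0.9, 81)
scores = [accuracy_score(y_valid, (p_valid > t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]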

Classification metrics optimization II

AUC - XGBoost, LightGBM

Quadratic weighted kappa metric - optimize MSE and find right thresholds or custom smooth loss for GBDT or neural nets.

Additional material

http://queirozf.com/entries/evaluation-metrics-for-classification-quick-examples-references

https://www.garysieling.com/blog/sklearn-gini-vs-entropy-criteria

http://www.navan.name/roc/

https://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf ; https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf ; https://sourceforge.net/p/lemur/wiki/RankLib/ ; https://wellecks.wordpress.com/2015/01/15/learning-to-rank-overview/

http://nlp.uned.es/docs/amigo2007a.pdf

Concept of mean encoding

Encode categorical features.

Can encode each category by the mean of the target (a.k.a. likelihood encoding). Better than label encoding, whose ordering is arbitrary.

Weight of evidence = ln (count of 0's/count of 1's) * 100

Count = sum(target)

Diff = count of 1's - count of 0's

Regularization

CV loop - regularize mean encoding by estimating it out-of-fold, e.g. 5 times on different folds.

Smoothing - (mean(target) * nrows + globalmean * alpha) / (nrows + alpha)

Noise - adding noise degrades the quality of the encoding on train; usually used together with Leave-One-Out (LOO)

Expanding mean - cumulative sum / cumulative count. Built in CatBoost
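A sketch of the smoothing and expanding-mean regularizations on a toy frame; alpha and the column names are made up:

import pandas as pd

df = pd.DataFrame({'cat': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'target': [1, 0, 1, 1, 0, 1]})
global_mean = df['target'].mean()
alpha = 10  # regularization strength

# Smoothing: shrink category means toward the global mean for rare categories
stats = df.groupby('cat')['target'].agg(['mean', 'count'])
smoothed = (stats['mean'] * stats['count'] + global_mean * alpha) / (stats['count'] + alpha)
df['cat_mean_smooth'] = df['cat'].map(smoothed)

# Expanding mean: cumulative sum / cumulative count of the *previous* rows in each category
cumsum = df.groupby('cat')['target'].cumsum() - df['target']
cumcnt = df.groupby('cat').cumcount()
df['cat_mean_expanding'] = (cumsum / cumcnt).fillna(global_mean)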

Extensions and generalizations

Many-to-many relationship e.g. one person and multiple purchases. Split the table into long representation

Can use rolling statistics with time data

Can concat categorical features and mean encode them to make them numeric

For local experiments:

  • Estimate encodings on train then map them to train and validation. Regularize them on train then validate the model on train and validation split

For submission

  • Estimate encodings on the whole train data, map them to train and test, regularize them on train and fit the model on train

Module 4: Hyperparameter optimization

Hyperparameter tuning I

Understand which parameters are the most important.

Understand how the parameters will change the results.

Lots of hyperparameter optimization software: Hyperopt, Scikit-optimize, Spearmint, GPyOpt, RoBO, SMAC3

from hyperopt import fmin, tpe, hp

def xgb_score(param):
    # Train xgboost with parameters `param` and return the validation score
    ...

def xgb_hyperopt():
    space = {
        'eta': 0.01,
        'max_depth': hp.quniform('max_depth', 10, 30, 1),
        'min_child_weight': hp.quniform('min_child_weight', 0, 100, 1),
        'subsample': hp.quniform('subsample', 0.1, 1.1, 0.1),
        'gamma': hp.quniform('gamma', 0.0, 30, 0.5),
        'colsample_bytree': hp.quniform('colsample_bytree', 0.1, 1.0, 0.1),

        'objective': 'reg:linear',

        'nthread': 28,
        'silent': 1,
        'num_round': 2500,
        'seed': 2441,
        'early_stopping_rounds': 100,
        }

    best = fmin(xgb_score, space, algo=tpe.suggest, max_evals=1000)
    return best

Split parameters into red (reduce overfitting), green (better fit on train set).

Hyperparameter tuning II

Tree-based models: XGBoost, LightGBM, CatBoost, RandomForest, ExtraTrees, FastRGF (regularized greedy forests).

GBDT - Build decision trees one after another to optimize a given metric. Parameters

XGBoost:

  • max_depth - max depth of tree. Deeper trees fit the train set better. Range roughly 1-30; sometimes it is better to stop tuning depth and generate some features instead. Start around 7.
  • subsample - Fraction of objects to fit to the tree. 0-1. If lower less prone to overfitting.
  • colsample_bytree, colsample_bylevel - Consider a fraction of features. If it is overfitting you can lower this parameter.
  • min_child_weight, lambda, alpha - regularization parameters. Increase min_child_weight to be conservative in model. 0, 5, 50, 300. An important parameter to tune.
  • eta - Eta is the learning rate. Too high and it will not converge; too small and it will take a long time. Try 0.1 or 0.01. Freeze it while tuning num_round. Use early stopping to stop when the validation loss starts increasing. Once a good setup is found, you can halve eta and roughly double num_round.
  • num_round - How many trees to build
  • seed - Random seed.

LightGBM:

  • max_depth/num_leaves (to split data better)
  • bagging_fraction
  • feature_fraction
  • min_data_in_leaf, lambda_l1, lambda_l2
  • learning_rate, num_iterations
  • *_seed (e.g. bagging_seed, feature_fraction_seed)


RandomForest/ExtraTrees - Each tree is independent of the others. Parameters:

  • n_estimators - Accuracy plateaus eventually when increasing number of trees.
  • max_depth - Depth of trees. None is unlimited depth. Start around 7.
  • max_features - If higher then faster training.
  • min_samples_leaf - Like min_child_weight
  • criterion - to evaluate a split e.g. gini, entropy
  • random_state - random seed
  • n_jobs - Number of cores. -1 to use all.

Hyperparameter tuning III

Neural Nets:

  • Number of neurons per layer - Learn more complex decision boundary and overfit faster
  • Number of layers
  • Optimizers:
    • SGD + momentum
    • Adam, Adadelta, Adagrad (adaptive). These can be faster but can lead to overfitting
  • Batch size - large value leads to more overfitting. 32 or 64.
  • Learning rate - start around 0.1 and lower it until training converges. There is a connection between batch size and learning rate: if you increase the batch size by a factor of alpha, you can also increase the LR by the same factor.
  • Regularization:
    • L2/L1 for weights
    • Dropout/Dropconnect
    • Static dropconnect - make the first hidden layer very wide but drop 99% of the connections from the input layer to it.


Linear model:

  • SVC/SVR. SVMs don't require much tuning. Sklearn wraps libLinear and libSVM. Compile these yourself for multi-core support.
  • LogisticRegression/LinearRegression + regularizers
  • SGDClassifier/SGDRegressor
  • Vowpal Wabbit - for out of core. FTRL (Follow the Regularized Leader).

Regularization parameters (C, alpha, lambda). Start very small and increase. As C increases, training slows down though.

Try L1, L2, L1 + L2 each. L1 provides some sparsity and can be used for feature selection.

Average models, e.g. if a good fit is found at max_depth = 5, train 3 GBDTs with max_depth 4, 5 and 6 and average them.

Quiz feedback

For RandomForest, tune n_estimators, max_depth, min_samples_split

https://scikit-learn.org/stable/modules/grid_search.html

http://fastml.com/optimizing-hyperparams-with-hyperopt/

https://www.ntu.edu.sg/home/egbhuang/

https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/

Practical guide

The Nature Conservancy, Planet - understanding comps

Parameters - Importance, feasibility, understanding

Save data as hdf5/npy for faster reading.

Can cast to 32-bits to save RAM

Keep it reproducible, e.g. fix random seeds

Log everything

KazAnova's competition pipeline, part 1

Understand the problem, EDA, define the CV strategy, feature engineering, modelling, ensembling.

Type of problem; how big is the data? hardware? software? metric being tested on? (is there a similar comp to this?)

Plot histograms of variables. Are they similar between train and test?

Plot features versus the target variable and versus time.

Univariate predictability metrics (information value https://medium.com/@sundarstyles89/weight-of-evidence-and-information-value-using-python-6f05072e83eb)

Bin numerical features (to see non-linearity) and correlation matrices.

If time is important use time-based validation.

Be aware of features missing in test.

Additional material

https://github.com/Far0n/kaggletils/tree/master/kaggletils

https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/

KazAnova's competition pipeline, part 2

Ensembling - predictions on internal validation and on test are saved.

Different ways to combine different features from averaging to stacking.

Look at correlations between predictions (good if low).

Statistics and distance based features

CTR (click through rate). Combine user and page type and get min and max price of ads.

Grouping by neighbors is more difficult, e.g. for rental price: number of houses within 500m, 1000m, etc.

Springleaf competition - Mean encode all variables. For every point find its 2000 nearest neighbors using the Bray-Curtis metric: sum |u_i - v_i| / sum |u_i + v_i|. Calculate features from those 2000 neighbors, e.g. mean target of the 5 nearest, mean distance to the 10 closest neighbors with target 1.
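A rough sketch of neighbor-based features with the Bray-Curtis metric (random placeholder data, and only 10 neighbors instead of 2000 to keep it small):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(1000, 20)              # mean-encoded features (placeholder)
y = np.random.randint(0, 2, size=1000)    # binary target (placeholder)

k = 10
nn = NearestNeighbors(n_neighbors=k + 1, metric='braycurtis').fit(X)
dist, idx = nn.kneighbors(X)
dist, idx = dist[:, 1:], idx[:, 1:]       # drop each point itself

# Mean target of the 5 nearest neighbors
mean_target_of_5_nearest = y[idx[:, :5]].mean(axis=1)

# Mean distance to neighbors with target 1 (NaN if no such neighbor among the k)
dist_to_target1 = np.where(y[idx] == 1, dist, np.nan)
mean_dist_to_target1 = np.nanmean(dist_to_target1, axis=1)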

Matrix factorizations

Movie recommendation. user (rows) and ratings (column). Use to encode something about user.

e.g. bag-of-words ...

feature fusion e.g. Vanilla BOW, BOW+TF-IDF + BOW (bigrams) -> dimensionality reduction -> Tree-based method.

Can use on only some columns, and it can provide additional diversity (good for ensembles).

SVD, PCA, TruncatedSVD,

Non-negative matrix factorization (NMF) - Ensures that all latent factors are non-negative (>= 0), good for count-like data. Makes the data more suitable for decision trees. You can also do NMF(log(X + 1)).

Transform all data then select train and test.
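A sketch of NMF(log(X + 1)) fit on train and test together, with placeholder count data:

import numpy as np
from sklearn.decomposition import NMF

X_train = np.random.randint(0, 100, size=(100, 30))  # non-negative count-like features
X_test = np.random.randint(0, 100, size=(50, 30))

# Fit the factorization on train and test together, then split back
X_all = np.vstack([X_train, X_test])
nmf = NMF(n_components=5)
W_all = nmf.fit_transform(np.log1p(X_all))            # NMF(log(X + 1))
W_train, W_test = W_all[:len(X_train)], W_all[len(X_train):]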

Feature interactions

e.g. banner selection on a website with two categorical features - ad type and website type. Concatenate these two to create another feature, then OHE it.

If the values are numeric you can multiply them (or take sums, differences, divisions). This enlarges the feature space and makes fitting easier. You can then do feature selection or dimensionality reduction.

Data -> sums, diffs, dots, divisions -> fit randomforest, get features of importances, select a few of the most important features.

You can look at third order interactions etc.

You can extract features from decision trees, e.g. the split "is age > 9.5".

sklearn: tree_model.apply() - returns index

xgboost: booster.predict(pred_leaf=True) - returns index
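A sketch of extracting leaf indices with sklearn's GradientBoostingClassifier.apply and one-hot encoding them, in the spirit of the sklearn feature-transformation example linked in the additional material below; the data is random:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

X = np.random.rand(200, 5)
y = np.random.randint(0, 2, size=200)

gbm = GradientBoostingClassifier(n_estimators=20).fit(X, y)

# Index of the leaf each sample falls into, one column per tree
leaves = gbm.apply(X)[:, :, 0]  # shape (n_samples, n_estimators) for binary classification

# One-hot encode the leaf indices to get the extracted categorical features
leaf_features = OneHotEncoder().fit_transform(leaves)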

t-SNE

Non-linear dimension reduction - manifold learning.

https://scikit-learn.org/stable/modules/manifold.html

MNIST: 784 dimensions -> 2 dimensions. 3 is next to 5 and is close to 6 and 8.

Good use for EDA.

Be careful of hyper-parameters (perplexity). Higher more clustered.

https://distill.pub/2016/misread-tsne/

Try several perplexities. Train and test should be projected together. If n features > 500 you may want to reduce dimensions before projecting.

https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

https://lvdmaaten.github.io/tsne/
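A sketch of projecting train and test together with t-SNE (placeholder data; try several perplexity values):

import numpy as np
from sklearn.manifold import TSNE

X_train = np.random.rand(300, 20)
X_test = np.random.rand(100, 20)

# Project train and test together so they share the same embedding space
X_all = np.vstack([X_train, X_test])
emb = TSNE(n_components=2, perplexity=30).fit_transform(X_all)
emb_train, emb_test = emb[:len(X_train)], emb[len(X_train):]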

Additional material

https://scikit-learn.org/stable/modules/decomposition.html

https://github.com/DmitryUlyanov/Multicore-TSNE

https://scikit-learn.org/stable/auto_examples/manifold/plot_t_sne_perplexity.html#sphx-glr-auto-examples-manifold-plot-t-sne-perplexity-py

https://research.fb.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/

https://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html

KNN Assignment

knn for features was important for https://www.kaggle.com/c/otto-group-product-classification-challenge and https://www.kaggle.com/c/springleaf-marketing-response

Introduction to ensemble methods

Combing ML models to get a better prediction

Averaging (or blending), weighted averaging, conditional averaging, bagging, boosting, stacking, stacknet

e.g. predicting age: one model is good for <= 50 year olds, another model is good for > 50

average is (model 1 + model 2) / 2

weighted average (model 1 x 0.7 + model 2 x 0.3)

Conditional averaging, e.g. for < 50 use one model and for >= 50 use the other.

Bagging

Averaging slightly different versions of the same model e.g. random forest

Two main sources of error: bias (underfitting; high bias/low variance) and variance (overfitting; low bias/high variance)

Parameters to control bagging:

  • Change the random seed
  • Row (sub) sampling or bootstrapping
  • Shuffling
  • Column (sub) sampling
  • Model-specific parameters
  • Number of models (or bags)
  • (Optionally) parallelism

BaggingClassifier and BaggingRegressor from sklearn
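A sketch of the bagging knobs above using sklearn's BaggingClassifier on toy data (the default base estimator is a decision tree):

from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)

# Bag 10 decision trees, each fit on a random 80% of rows and 80% of columns
bag = BaggingClassifier(
    n_estimators=10,    # number of models (bags)
    max_samples=0.8,    # row subsampling
    max_features=0.8,   # column subsampling
    random_state=0,     # change the seed for a different ensemble
    n_jobs=-1,          # optional parallelism
).fit(X, y)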

Boosting

Form of weighted averaging of models where each model is built sequentially via taking into account the past model performance.

Weight boosting - if trying to classify 1's and 0's, take pred (the predicted probability) and subtract y from pred to get the absolute error. Generate a new column which is a weight (1 + absolute error). Add this weight to the model, e.g. a weight of 2 means pretending this row occurs twice. You can repeat with other weights.

parameters:

  • Learning rate: PredN = pred0*eta + pred1*eta + ...
  • Number of estimators - (i.e. models; the more we add, the smaller the learning rate needs to be)
  • Input model (anything that accepts weights)
  • Sub boosting type: AdaBoost; LogitBoost

Residual boosting - Use error (to get direction). Then make this the new target variable. Final prediction would be new prediction + old prediction.

parameters:

  • Learning rate: PredN = pred0*eta + pred1*eta + ...
  • Number of estimators
  • Row (sub) sampling; Column (sub) sampling
  • Input model - better with trees
  • Sub boosting type: Fully gradient based; Dart (uses dropout)

Xgboost, Lightgbm, H2O's GBM, Catboost, Sklearn's GBM

Stacking

Making predictions with a number of models on a hold-out set, then using a different (meta) model trained on these predictions

Doesn't need to know input data

Wolpert (1992) it involves:

  • Splitting the train set into two parts
  • Train several base learners on the first part
  • Make predictions with the base learners on the second part
  • Use last predictions as input to train a higher level learner

Fit base models on the train part (A) and save their predictions on valid (B) and test (C). Do this with multiple models. Then train a meta algorithm on the B predictions (B1) and make final predictions from the C predictions (C1).

With time sensitive data - respect time

Diversity as important as performance: different algorithms, different input features.

Performance plateaus after N models

Meta model is normally modest e.g. linear regression
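A sketch of this stacking scheme in the A/B/C notation above, with toy data and two arbitrary base models:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Split train into A (for base learners) and B (for the meta model)
X_A, X_B, y_A, y_B = train_test_split(X_train, y_train, random_state=0)

base_models = [RandomForestClassifier(random_state=0),
               GradientBoostingClassifier(random_state=0)]

# Base learners fit on A; their predictions on B and on test C become meta features
B1 = np.column_stack([m.fit(X_A, y_A).predict_proba(X_B)[:, 1] for m in base_models])
C1 = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])

# Modest meta model trained on B1, used to predict from C1
meta = LogisticRegression().fit(B1, y_B)
final_pred = meta.predict_proba(C1)[:, 1]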

StackNet

Scalable meta model method that utilizes stacking to combine multiple models in a NN architecture of multiple levels

A 4-layer stacking architecture won a Kaggle competition, e.g. the Homesite competition.

In a NN every node is a simple linear model with some non-linear transformation. Instead of linear model we could use any model.

Cannot use back-propagation

Use stacking to link each model/node with the target

If data is limited it is hard to keep splitting train into train and valid. Can do k-fold and then average if going to extend to many layers.

Ensembling tips and tricks

1st level tips. Diversity based on algorithms:

  • 2-3 gradient boosted trees (lightgbm, xgboost)
  • 2-3 NN (keras, pytorch)
  • 1-2 ExtraTrees/Random Forest (sklearn)
  • 1-2 linear logistic/ridge, svm (sklearn)
  • 1-2 knn (sklearn)
  • 1 factorization machine (libfm)
  • 1 svm with non-linear kernel (sklearn)

Diversity based on input data:

  • categorical features: OHE, label encoding, target encoding
  • Numerical features: outliers, binning, derivatives (smoothing), percentiles, scaling
  • Interactions col1 /+*- col2; groupby; unsupervised

Subsequent level tips. Simple (or shallow) algorithms

  • gradient boosted trees with small depth (2/3)
  • linear models with high regularization
  • extra trees
  • shallow NN (1 layer)
  • knn with BrayCurtis Distance
  • Brute-force search for the best linear weights based on CV

Feature engineering:

  • pairwise differences between meta features
  • row-wise statistics like averages or stds

For every 7.5 models in previous layer we add 1 meta model in subsequent layer

Be mindful of target leakage

Stacked ensembles from H2O

https://xcessiv.readthedocs.io/en/stable/

Can run classifiers on a regression problem, e.g. predict whether age > 50

https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products

Additional material

https://mlwave.com/kaggle-ensembling-guide/

https://github.com/kaz-Anova/StackNet

https://heamy.readthedocs.io/en/latest/usage.html

CatBoost 1

Categorical features:

  • OHE (one_hot_max_size)
  • Number of appearances
  • Statistics with label usage of a random permutation of the data
  • Combines features in a greedy way (only the best combination).

Symmetric (oblivious) decision trees - the same split condition, e.g. weight > 65, is used for every node at a given level of the tree

CatBoost 2

Leaf value is calculated as average gradient on all objects in this leaf.

Ordered boosting

Speed up: rsm (random subspace method) = 0.1; max_ctr_complexity=1; boosting_type='Plain'; task_type='GPU'

Overfitting detector

Evaluating custom metrics during training

CatBoost Viewer

Can be calculated using TensorBoard

NaN feature support

Training parameters:

  • Number of trees + learning rate
  • Tree depth
  • L2 regularization
  • Bagging temperature
  • Random strength

Module 5: Competitions go through

Crowdflower Competition

https://www.kaggle.com/c/crowdflower-search-relevance

Relevance of search result. 1-4 of score with 4 being best

Metric: Quadratic weighted kappa (0 random to 1 perfect); may go below 0. Agreement between two ratings.

https://www.kaggle.com/aroraaman/quadratic-kappa-metric-explained-in-5-simple-steps

N x N confusion matrix where N = 4. Expected matrix E = ground truth histogram x predictions histogram (outer product).

N x N weights: Wi,j = (i-j)^2 / (N-1)^2

kappa, k = 1 - sum(W_i,j * C_i,j) / sum(W_i,j * E_i,j) where E is expectation matrix.

Text features - query, title and description.

For each (query-title) and (query-description) calculated:

  • Number of matching words
  • Cosine distance between tf-idf representations
  • Distance between the avg word2vec vectors
  • Levenshtein distance

Symbolic n-grams - e.g. at a character level. 1-5 grams. First 300 features as components

Extend queries - top 10 words associated with score 4 of query

Median and variance of weighting. Use heuristic of w = 1 / (1 + var)

Create ensembles by using different combinations of features.

Used a regression task.

Springleaf Marketing Response

https://www.kaggle.com/c/springleaf-marketing-response

Stacking scheme: feature engineering -> XGBoost -> meta features -> Meta XGB -> Linear combination

Binary class with AUC as metric.

Feature packs - data cleaning; mean-encoded dataset; KNN dataset on mean-encoded

out-of-fold predictions (meta features should be diverse).

Neural Network - scale, ranks, power

Microsoft Malware Classification Challenge

https://www.kaggle.com/c/microsoft-malware-prediction

Data is stored in HEX dump or disassembly.

Multiclass LogLoss

Baseline - size of file and file id

Single bytes counts (257 features)

Extract some features from disassembly.

n-gram

Entropy for a sliding window over byte sequence. Do some stats on this such as mean, median, max and min.

Dimensionality reduction - non-negative matrix factorization, PCA. PCA = min||X - SA||_2 and NMF = min||X - SA||_2 where S_ij >= 0 and A_ij >= 0. NMF is good if you use counts.

NMF is better for use in Random Forest.

Can do a log transform, NMF(log(X + 1)), which effectively changes the objective from MSE to RMSLE.

For 4-gram: original -> omit rare -> Linear SVM + L1-penalty -> threshold on Random Forest importance

For 10-gram original -> Hashing -> Omit rare -> Linear SVM + L1-penalty -> threshold Random Forest importance

Find features that would separate large error prone objects.

Random Forest - needs manual calibration for log-loss

Move to XGBoost.

Bagging works well with boosting.

Use test data for training. Sample label according to predicted distribution or use predicted class.

Try and predict train set then try and predict test set.

per-class weight mixing

https://github.com/geffy/kaggle-malware

Walmart: Trip Type Classification

https://www.kaggle.com/c/walmart-recruiting-trip-type-classification

Purchases people made during their trip

group by visit number and see what items people purchased on a trip

Acquire Valued Shoppers Challenge, Part 1

https://www.kaggle.com/c/acquire-valued-shoppers-challenge

Recommender challenge:

  • 310,000 shoppers (160k in train, 150k in test)
  • 350,000,000 transactions (for 1+ year for each shopper)
  • 37 offers
  • No exact products, but a product could be identified from a combination of brand, category, company

Visit -> Visit -> Coupon! -> Redeem -> Again? Make habit of customer being an item.

Optimize AUC for whether the shopper will buy again.

Most offers appear in either train or test

Focus on acquisition. Limited history of customer and offer

Offer propensity varied e.g. 50% for offer2, 20% for offer4.

Created a file for each customer; a different file for every category, brand, company.

Leave-one-out offer. e.g. predict offer16 using info for all other offers.

Leave-one-out offer + concatenation

Acquire Valued Shoppers Challenge, Part 2

Strategies for recommenders:

  • Content-based - the customer likes this product
  • Collaborative filtering - how a customer looks like another customer that is likely to buy a product
  • Hybrid - combination of the above two

Content-based

Product hierarchy versus customer and time.

Define time intervals: last 30, 30-60, 60-90, 90-120, 120-180 and 180-360 days. For category and brand; category and company; brand and company; category, brand and company.

Feature selected through forward cross validation.

Big values capped. Missing values replaced with -1

Ridge regression on the actual repeat purchase.

Collaborative filtering

Would the customer have bought the product, had they not received the offer?

A model for every offer in train and test.

Target variable: natural logarithm of the number of times a customer bought the product in the 90 days before receiving the coupon.

Features based on users' activity:

  • Counts of popular categories, brands, companies
  • Restricted Boltzmann Machines to summarize purchase activity on least popular
  • Average amount of purchase, total visits, distinct brands, categories, companies
  • Total discounts/returns, visits in weekends, spend in weekends

GBM (from sklearn) on the log of counts (log helped cap large values)

Combination

Transform score into ranks and combine scores 50/50.

Additional material

http://ndres.me/kaggle-past-solutions/

https://www.kaggle.com/wiki/PastSolutions

http://www.chioka.in/kaggle-competition-solutions/

https://github.com/ShuaiW/kaggle-classification/

Competition

https://www.kaggle.com/c/competitive-data-science-predict-future-sales

Time series with data (item x shop x day) for 18 months, daily data

Test is item x ship for 1 month, monthly data

old comp - https://www.kaggle.com/c/competitive-data-science-final-project

Submit to Kaggle which will give feedback on the public dataset. Coursera will give feedback for the public and private datasets.

Start early

start with submitting sample_submission.csv from "Data" page on Kaggle and try submitting different constants.

Predict total sales for every product and store in the next month.

A good exercise is to reproduce previous_value_benchmark. As the name suggests, in this benchmark the prediction for each shop/item pair is just the monthly sales from the previous month, i.e. October 2015.

The most important step at reproducing this score is correctly aggregating daily data and constructing monthly sales data frame. You need to get lagged values, fill NaNs with zeros and clip the values into [0,20] range. If you do it correctly, you'll get precisely 1.16777 on the public leaderboard.
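A sketch of that benchmark, assuming the column names from the competition files (date_block_num, shop_id, item_id, item_cnt_day) and that date_block_num 33 corresponds to October 2015:

import pandas as pd

sales = pd.read_csv('sales_train.csv')
test = pd.read_csv('test.csv')

# Aggregate daily sales into monthly sales per shop/item pair
monthly = (sales.groupby(['date_block_num', 'shop_id', 'item_id'])['item_cnt_day']
                .sum().rename('item_cnt_month').reset_index())

# Previous-value benchmark: use October 2015 (date_block_num == 33) as the prediction
prev = monthly[monthly['date_block_num'] == 33]
submission = test.merge(prev, on=['shop_id', 'item_id'], how='left')

# Pairs unseen in October get 0; clip into the [0, 20] range as required
submission['item_cnt_month'] = submission['item_cnt_month'].fillna(0).clip(0, 20)
submission[['ID', 'item_cnt_month']].to_csv('submission.csv', index=False)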

Generating features like this is a necessary basis for more complex models. Also, if you decide to fit some model, don't forget to clip the target into [0,20] range, it makes a big difference.

You can get a rather good score after creating some lag-based features like in advice from previous week and feeding them into gradient boosted trees model.

Apart from item/shop pair lags you can try adding lagged values of total shop or total item sales (which are essentially mean-encodings). All of that is going to add some new information.

Try to carefully tune hyper parameters of your models, maybe there is a better set of parameters for your model out there. But don't spend too much time on it.

Try ensembling. Start with simple averaging of linear model and gradient boosted trees like in programming assignment notebook. And then try to use stacking.

Explore new features! There is a lot of useful information in the data: text descriptions, item categories, seasonal trends.

Notes

$ conda install -c conda-forge kaggle

Go to the 'Account' tab of your user profile (https://www.kaggle.com/<username>/account) and select 'Create API Token'. This will trigger the download of kaggle.json, a file containing your API credentials. Place this file in the location ~/.kaggle/kaggle.json

$ kaggle competitions download -c competitive-data-science-predict-future-sales

$ gunzip sales_train.csv.gz
$ gunzip sample_submission.csv.gz
$ gunzip test.csv.gz

$ kaggle competitions submit -c competitive-data-science-predict-future-sales -f sample_submission.csv -m "Message"

http://www.blackarbs.com/blog/time-series-analysis-in-python-linear-models-to-garch/11/1/2016#AR=

https://www.kaggle.com/jagangupta/time-series-basics-exploring-traditional-ts

https://www.kaggle.com/dlarionov/feature-engineering-xgboost