Blog

Articles about:

  • Benchmarking: comparing different stuff in a similar environment...
  • Machine Learning: mostly supervised machine learning (a lot of xgboost / LightGBM)...
  • Statistics: whatever a data scientist already knows (or should)...
  • Explanations: going from A to Z when the steps are known...
  • Design: problem-solving (e.g. experimental design)...
  • Computers: virtualization, information technology/systems...

... can be found below, in a list ordered by date.

They are organized as follows:

  • Link with title and caption (if available) on the left
  • Description on the right:
    • Major (most important) / Super-Minor (supporting) / Sub-Minor (least important) themes
    • Original date of the post (may not match the actual publication date)
    • Small hard-coded description

Benchmark / Machine Learning

Date: Jun 10, 2017

Are you thinking about using LightGBM on Windows?

If yes, should you choose Visual Studio or MinGW as the compiler? We are checking here the impact of the compiler on the performance of LightGBM!

In addition, a juicy xgboost comparison: xgboost has bridged the gap it had versus LightGBM!

Benchmark / Machine Learning

Date: May 25, 2017

Thinking about Intel vs AMD Ryzen?

What about picking both, and putting them in the ring for a round of xgboost benchmarks? This is what we are doing here!

We are also looking indirectly at Linux vs Windows, and bare-metal vs virtualized servers. It turns out virtualized servers and Windows servers actually hold up very well against Linux and bare-metal servers.

Benchmark / Machine Learning

Date: May 14, 2017

Using xgboost fast histogram?

Ever heard about the old and the new fast histogram?

This is the comparison between the old and the new xgboost fast histogram! Get ready to see... juicy 75% improvements!
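
For reference, here is a minimal sketch of enabling the fast histogram method in the xgboost Python API (the synthetic data is a placeholder; the parameters shown are the documented ones):

    import numpy as np
    import xgboost as xgb

    # Placeholder toy data; replace with your own dataset.
    X = np.random.rand(10000, 20)
    y = np.random.randint(0, 2, 10000)
    dtrain = xgb.DMatrix(X, label=y)

    params = {
        "objective": "binary:logistic",
        "tree_method": "hist",  # fast histogram method
        "max_bin": 256,         # number of histogram bins
    }
    model = xgb.train(params, dtrain, num_boost_round=100)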

Benchmark / Machine Learning

Date: Apr 30, 2017

Remember the comparison between exact and fast histogram xgboost? Here they are both together!

Benchmark / Computers / Machine Learning

Date: Apr 29, 2017

Using fast histogram xgboost? You are going to get served with benchmarks using:

  • an overclocked 5.0 GHz i7-7700K, and...
  • a 20-core / 40-thread 2.7 GHz server!

Best practice to remember: fast histogram xgboost scales very well with frequency (GHz). Using too many cores will heavily degrade your training speed.
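
As a hedged sketch of that takeaway (the 8-thread cap below is a hypothetical value; tune it to your machine):

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(5000, 10)
    y = np.random.randint(0, 2, 5000)
    dtrain = xgb.DMatrix(X, label=y)

    params = {
        "objective": "binary:logistic",
        "tree_method": "hist",
        "nthread": 8,  # cap the thread count: hist favors GHz over core count
    }
    model = xgb.train(params, dtrain, num_boost_round=50)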

Benchmark / Computers / Machine Learning

Date: Apr 27, 2017

Using exact xgboost? You are going to get served with benchmarks using:

  • an overclocked 5.0 GHz i7-7700K, and...
  • a 20-core / 40-thread 2.7 GHz server!

Best practice to remember: exact xgboost scales very well with the number of cores. Frequency is secondary.

Benchmark / Machine Learning / Design

Date: Apr 23, 2017

Using decision trees and using categorical features?

Should you use...:

  • Numeric encoded features?
  • One-hot encoded features?
  • Categorical (raw) features?
  • Binary encoded features?

We will show that one-hot encoding is the worst choice, while raw categorical features are the best, if and only if the supervised machine learning implementation can handle them.
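
As an illustration, a minimal sketch using LightGBM's Python API, which accepts raw categorical features directly (the column names and toy data are hypothetical):

    import pandas as pd
    import lightgbm as lgb

    df = pd.DataFrame({
        "color": pd.Categorical(["red", "blue", "green", "red"] * 250),
        "size": [1.0, 2.5, 3.0, 0.5] * 250,
        "target": [0, 1, 1, 0] * 250,
    })

    # The raw categorical column is handled natively: no one-hot encoding needed.
    train_set = lgb.Dataset(df[["color", "size"]], label=df["target"],
                            categorical_feature=["color"])
    model = lgb.train({"objective": "binary"}, train_set, num_boost_round=50)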

Benchmark / Computers / Design

Date: Apr 16, 2017

When you have a CPU with hyperthreading, make sure you are using all its available performance.

Do not believe the myth "number of threads = number of physical cores" anymore.

We are no longer in the 2000 era, when multithreading was horribly done.
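
A quick sketch of the practical takeaway, using only Python's standard library: use the logical thread count, not half of it.

    import os

    logical_threads = os.cpu_count()  # logical threads, hyperthreading included
    print(f"Use nthread = {logical_threads}, not {logical_threads // 2}")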

Benchmark / Computers / Machine Learning

Date: Jan 10, 2017

Programming practices: is there a noticeable difference between floats and doubles when it comes to speed?
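
A minimal sketch of how such a comparison can be measured with NumPy (the article's exact benchmark may differ; the matrix multiply is just a stand-in workload):

    import time
    import numpy as np

    for dtype in (np.float32, np.float64):
        a = np.random.rand(4000, 4000).astype(dtype)
        start = time.perf_counter()
        a @ a  # dense matrix multiply as a stand-in workload
        print(dtype.__name__, f"{time.perf_counter() - start:.3f}s")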

Benchmark / Machine Learning

Date: Jan 09, 2017

We are comparing here xgboost (exact) and LightGBM.

Computation is 10x faster using LightGBM.
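
A hedged sketch of comparing the two on identical data (synthetic placeholder data; the post's real benchmark settings may differ):

    import time
    import numpy as np
    import xgboost as xgb
    import lightgbm as lgb

    X = np.random.rand(50000, 50)
    y = np.random.randint(0, 2, 50000)

    start = time.perf_counter()
    xgb.train({"objective": "binary:logistic", "tree_method": "exact"},
              xgb.DMatrix(X, label=y), num_boost_round=50)
    print("xgboost exact:", time.perf_counter() - start)

    start = time.perf_counter()
    lgb.train({"objective": "binary"}, lgb.Dataset(X, label=y),
              num_boost_round=50)
    print("LightGBM:", time.perf_counter() - start)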

Explanation / Machine Learning

Date: Dec 07, 2016

Think you don't understand xgboost's gblinear? Think again. That's just a generalized linear model.
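
For instance, a minimal sketch of fitting gblinear with the xgboost Python API (synthetic data): it boosts linear coefficients instead of trees, converging toward a regularized generalized linear model.

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(1000, 5)
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + np.random.randn(1000) * 0.1
    dtrain = xgb.DMatrix(X, label=y)

    params = {"booster": "gblinear", "objective": "reg:squarederror",
              "alpha": 0.0, "lambda": 0.0}  # L1 / L2 regularization terms
    model = xgb.train(params, dtrain, num_boost_round=100)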

Benchmark / Computers / Machine Learning

Date: Nov 25, 2016

Using virtualization? The CPU topology you are passing to your virtual machine matters. But by how much? (xgboost exact)

This time, we are looking at an increasing number of sockets.

Benchmark / Computers / Machine Learning

Date: Nov 14, 2016

Using virtualization? The CPU topology you are passing to your virtual machine matters. But by how much? (xgboost exact)

We will look at the number of cores passed to the virtual machine.

Statistics / Design

Date: Nov 08, 2016

Statistical tests are not statistical tests anymore when using large amounts of data.

They were just not made for that.
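
A quick numeric illustration of the problem, using SciPy: with a large enough sample, a practically negligible effect gets a tiny p-value.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n = 1_000_000
    a = rng.normal(0.00, 1.0, n)
    b = rng.normal(0.01, 1.0, n)  # practically meaningless 0.01 shift

    t, p = stats.ttest_ind(a, b)
    print(f"p-value = {p:.2e}")  # "significant" despite a negligible effect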

Design

Date: Nov 06, 2016

Have a metric which is quadratic?

Then improving it becomes quadratic too; that much is as easy as pie to understand. Explaining the phenomenon is something different.

Design

Date: Oct 15, 2016

Do you have many features?

Are you lost in all these features?

Think you can go through all of them one by one?

A tableplot solves your problem.

Machine Learning / Design

Date: Sep 03, 2016

When you have row ID leakage and a not-too-large sample size (fewer than 100,000 rows), what does machine learning say?

Statistics / Design

Date: Sep 03, 2016

When you have row ID leakage and a not-too-large sample size (fewer than 100,000 rows), what does statistics say?

Explanation / Machine Learning

Date: Aug 26, 2016

Explains why a generalized linear model is a boosted model, and why having many features does not matter for training speed when they are sparse.
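
A minimal sketch of the sparsity point with the xgboost Python API: DMatrix accepts SciPy sparse matrices, so training cost tracks the number of nonzero entries rather than the nominal column count (the dimensions below are arbitrary placeholders).

    import numpy as np
    import scipy.sparse as sp
    import xgboost as xgb

    # 100,000 columns, but only ~0.01% of entries are nonzero.
    X = sp.random(10000, 100000, density=0.0001, format="csr")
    y = np.random.randint(0, 2, 10000)

    dtrain = xgb.DMatrix(X, label=y)  # sparse input handled natively
    model = xgb.train({"objective": "binary:logistic"}, dtrain,
                      num_boost_round=10)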

Explanation / Computers

Date: Aug 03, 2016

We all know xgboost is a nightmare to compile if you are a total beginner. Here is an example of compiling xgboost for both CLI (command line interface) and R!

Explanation / Statistics

Date: May 02, 2016

Why should you not post-process rankings when you are dealing with noisy data?

We are taking Santander Customer Satisfaction as an example to show the irrationality of hard rules.

Explanation / Machine Learning

Date: Apr 18, 2016

Still not understanding the basics of gradient descent shrinkage?

The learning rate in gradient boosted trees is explained here using an analogy with a pedestrian.
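
To make the analogy concrete, a tiny pure-Python sketch of shrinkage: each boosting step only takes a fraction (the learning rate) of the full corrective step, like a pedestrian taking small steps toward a target.

    target = 100.0       # where the pedestrian wants to end up
    position = 0.0
    eta = 0.1            # learning rate (shrinkage)

    for step in range(50):
        residual = target - position  # the full corrective step
        position += eta * residual    # only take a shrunk fraction of it

    print(position)  # close to 100, approached gradually, without overshooting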

Design / Machine Learning

Date: Apr 05, 2016

Did you know you could use t-SNE on features instead of on observations?

Did you ever want to visually map the information relationships between features?

Here you have it: t-SNE on features.
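
In sketch form with scikit-learn (synthetic data; note that perplexity must stay below the number of features, since the features become the t-SNE samples):

    import numpy as np
    from sklearn.manifold import TSNE

    X = np.random.rand(1000, 50)  # 1000 observations, 50 features

    # Transpose: rows become features, so t-SNE embeds the features themselves.
    embedding = TSNE(n_components=2, perplexity=10).fit_transform(X.T)
    print(embedding.shape)  # (50, 2): one 2-D point per feature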

Design / Statistics

Date: Mar 12, 2016

How can you prove whether a machine learning problem requires a linear solution or a non-linear solution?

We will be using BNP Paribas Cardif Claims Management as an example.
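
One hedged way to frame such a proof (not necessarily the article's exact method): cross-validate a linear model against a non-linear one on the same data; a persistent gap argues for non-linearity. The synthetic target below is a hypothetical stand-in.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    X = np.random.rand(2000, 10)
    y = (np.sin(X[:, 0] * 6) + X[:, 1] ** 2 > 1).astype(int)  # non-linear target

    for model in (LogisticRegression(max_iter=1000),
                  GradientBoostingClassifier()):
        score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(type(model).__name__, round(score, 3))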

Explanation / Machine Learning

Date: Mar 06, 2016

Thinking about NAs? Why do NAs matter in tree-based models?

We are using xgboost as an example.
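
A minimal sketch with the xgboost Python API: missing values are passed through as NaN, and each split learns a default direction for them (the 20% missing rate below is an arbitrary placeholder).

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(1000, 5)
    X[np.random.rand(1000, 5) < 0.2] = np.nan  # inject 20% missing values
    y = np.random.randint(0, 2, 1000)

    # NaNs are handled natively: each tree split learns where to send them.
    dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
    model = xgb.train({"objective": "binary:logistic"}, dtrain,
                      num_boost_round=50)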

Explanation / Machine Learning

Date: May 04, 2016

Need to understand why Gamma will help you squeeze even better performance out of xgboost?

Here you are served.
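
As a sketch, gamma (alias min_split_loss) in the xgboost Python API: a split is only kept if it reduces the loss by at least gamma, pruning low-value splits (the value 1.0 below is an arbitrary example, not a recommendation).

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(5000, 10)
    y = np.random.randint(0, 2, 5000)
    dtrain = xgb.DMatrix(X, label=y)

    params = {"objective": "binary:logistic",
              "gamma": 1.0}  # minimum loss reduction required to keep a split
    model = xgb.train(params, dtrain, num_boost_round=50)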