Blog

Articles about:

  • Benchmarking: comparing different stuff in a similar environment...
  • Machine Learning: mostly supervised machine learning (a lot of xgboost / LightGBM)...
  • Statistics: whatever a data scientist already knows (or should)...
  • Explanations: going from A to Z when the steps are known...
  • Design: problem-solving (e.g. experimental design)...
  • Computers: virtualization, information technology/systems...

... can be found below, in a list ordered by date.

They are organized as follows:

  • Link with title and caption (if available) on the left
  • Description on the right:
    • Major (most important) / Super-Minor (supporting) / Sub-Minor (least important) themes
    • Original date of the post (may not match the actual publication date)
    • Small hard-coded description

Benchmark / Machine Learning

Date: Jun 10, 2017

Are you thinking about using LightGBM on Windows?

If yes, should you choose Visual Studio or MinGW as the compiler? We are checking here the impact of the compiler on the performance of LightGBM!

In addition, a juicy xgboost comparison: xgboost has bridged the gap it had versus LightGBM!

Benchmark / Machine Learning

Date: May 25, 2017

Thinking about Intel vs AMD Ryzen?

What about picking both, and putting them in the ring for a round of xgboost benchmarks? This is what we are doing here!

We are also looking indirectly at Linux vs Windows, and bare-metal vs virtualized servers. It turns out virtualized servers and Windows servers actually hold up very well against Linux and bare-metal servers.

Benchmark / Machine Learning

Date: May 14, 2017

Using xgboost fast histogram?

Ever heard about the old and the new fast histogram?

This is the comparison between the old and the new xgboost fast histogram! Get ready to see... juicy 75% improvements!
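
For reference, here is a minimal sketch of enabling the fast histogram method in the xgboost Python API (the synthetic data is a placeholder; the parameters shown are the documented ones):

    import numpy as np
    import xgboost as xgb

    # Placeholder toy data; replace with your own dataset.
    X = np.random.rand(10000, 20)
    y = np.random.randint(0, 2, 10000)
    dtrain = xgb.DMatrix(X, label=y)

    params = {
        "objective": "binary:logistic",
        "tree_method": "hist",  # fast histogram method
        "max_bin": 256,         # number of histogram bins
    }
    model = xgb.train(params, dtrain, num_boost_round=100)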

Benchmark / Machine Learning

Date: Apr 30, 2017

Remember the comparison between exact and fast histogram xgboost? Here they are both together!

Benchmark / Computers / Machine Learning

Date: Apr 29, 2017

Using fast histogram xgboost? You are going to get served with benchmarks using:

  • an overclocked 5.0 GHz i7-7700K, and...
  • a 20-core / 40-thread 2.7 GHz server!

Best practice to remember: fast histogram xgboost scales very well with frequency (GHz). Using too many cores will heavily degrade your training speed.
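
As a hedged sketch of that takeaway (the 8-thread cap below is a hypothetical value; tune it to your machine):

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(5000, 10)
    y = np.random.randint(0, 2, 5000)
    dtrain = xgb.DMatrix(X, label=y)

    params = {
        "objective": "binary:logistic",
        "tree_method": "hist",
        "nthread": 8,  # cap the thread count: hist favors GHz over core count
    }
    model = xgb.train(params, dtrain, num_boost_round=50)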

Benchmark / Computers / Machine Learning

Date: Apr 27, 2017

Using exact xgboost? You are going to get served with benchmarks using:

  • an overclocked 5.0 GHz i7-7700K, and...
  • a 20-core / 40-thread 2.7 GHz server!

Best practice to remember: exact xgboost scales very well with the number of cores. Frequency is secondary.

Benchmark / Machine Learning / Design

Date: Apr 23, 2017

Using decision trees and using categorical features?

Should you use...:

  • Numeric encoded features?
  • One-hot encoded features?
  • Categorical (raw) features?
  • Binary encoded features?

We will show that one-hot encoding is the worst choice, while raw categorical features are the best, if and only if the supervised machine learning implementation can handle them.
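
As an illustration, a minimal sketch using LightGBM's Python API, which accepts raw categorical features directly (the column names and toy data are hypothetical):

    import pandas as pd
    import lightgbm as lgb

    df = pd.DataFrame({
        "color": pd.Categorical(["red", "blue", "green", "red"] * 250),
        "size": [1.0, 2.5, 3.0, 0.5] * 250,
        "target": [0, 1, 1, 0] * 250,
    })

    # The raw categorical column is handled natively: no one-hot encoding needed.
    train_set = lgb.Dataset(df[["color", "size"]], label=df["target"],
                            categorical_feature=["color"])
    model = lgb.train({"objective": "binary"}, train_set, num_boost_round=50)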

Benchmark / Computers / Design

Date: Apr 16, 2017

When you have a CPU with hyperthreading, make sure you are using all its available performance.

Do not believe the myth "number of threads = number of physical cores" anymore.

We are no longer in the 2000 era, when multithreading was horribly done.
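
A quick sketch of the practical takeaway, using only Python's standard library: use the logical thread count, not half of it.

    import os

    logical_threads = os.cpu_count()  # logical threads, hyperthreading included
    print(f"Use nthread = {logical_threads}, not {logical_threads // 2}")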

Benchmark / Computers / Machine Learning

Date: Jan 10, 2017

Programming practices: is there a noticeable difference between floats and doubles when it comes to speed?
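
A minimal sketch of how such a comparison can be measured with NumPy (the article's exact benchmark may differ; the matrix multiply is just a stand-in workload):

    import time
    import numpy as np

    for dtype in (np.float32, np.float64):
        a = np.random.rand(4000, 4000).astype(dtype)
        start = time.perf_counter()
        a @ a  # dense matrix multiply as a stand-in workload
        print(dtype.__name__, f"{time.perf_counter() - start:.3f}s")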

Benchmark / Machine Learning

Date: Jan 09, 2017

We are comparing here xgboost (exact) and LightGBM.

Computation is 10x faster using LightGBM.
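
A hedged sketch of comparing the two on identical data (synthetic placeholder data; the post's real benchmark settings may differ):

    import time
    import numpy as np
    import xgboost as xgb
    import lightgbm as lgb

    X = np.random.rand(50000, 50)
    y = np.random.randint(0, 2, 50000)

    start = time.perf_counter()
    xgb.train({"objective": "binary:logistic", "tree_method": "exact"},
              xgb.DMatrix(X, label=y), num_boost_round=50)
    print("xgboost exact:", time.perf_counter() - start)

    start = time.perf_counter()
    lgb.train({"objective": "binary"}, lgb.Dataset(X, label=y),
              num_boost_round=50)
    print("LightGBM:", time.perf_counter() - start)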

Explanation / Machine Learning

Date: Dec 07, 2016

Think you don't understand xgboost's gblinear? Think again. That's just a generalized linear model.
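
For instance, a minimal sketch of fitting gblinear with the xgboost Python API (synthetic data): it boosts linear coefficients instead of trees, converging toward a regularized generalized linear model.

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(1000, 5)
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + np.random.randn(1000) * 0.1
    dtrain = xgb.DMatrix(X, label=y)

    params = {"booster": "gblinear", "objective": "reg:squarederror",
              "alpha": 0.0, "lambda": 0.0}  # L1 / L2 regularization terms
    model = xgb.train(params, dtrain, num_boost_round=100)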

Benchmark / Computers / Machine Learning

Date: Nov 25, 2016

Using virtualization? The CPU topology you are passing to your virtual machine matters. But by how much? (xgboost exact)

This time, we are looking at an increasing number of sockets.

Benchmark / Computers / Machine Learning

Date: Nov 14, 2016

Using virtualization? The CPU topology you are passing to your virtual machine matters. But by how much? (xgboost exact)

We will look at the number of cores passed to the virtual machine.

Statistics / Design

Date: Nov 08, 2016

Statistical tests are not statistical tests anymore when using large amounts of data.

They were just not made for that.
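
A quick numeric illustration of the problem, using SciPy: with a large enough sample, a practically negligible effect gets a tiny p-value.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n = 1_000_000
    a = rng.normal(0.00, 1.0, n)
    b = rng.normal(0.01, 1.0, n)  # practically meaningless 0.01 shift

    t, p = stats.ttest_ind(a, b)
    print(f"p-value = {p:.2e}")  # "significant" despite a negligible effect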

Design

Date: Nov 06, 2016

Have a metric which is quadratic?

Then improving it becomes quadratic too; that much is as easy as pie to understand. Explaining the phenomenon is something different.

Design

Date: Oct 15, 2016

Do you have many features?

Are you lost in all these features?

Think you can go through all of them one by one?

A tableplot solves your problem.

Machine Learning / Design

Date: Sep 03, 2016

When you have row ID leakage and a not-too-large sample size (fewer than 100,000 rows), what does machine learning say?

Statistics / Design

Date: Sep 03, 2016

When you have row ID leakage and a not-too-large sample size (fewer than 100,000 rows), what does statistics say?

Explanation / Machine Learning

Date: Aug 26, 2016

Explains why a generalized linear model is a boosted model, and why having many features does not matter for training speed when they are sparse.
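
A minimal sketch of the sparsity point with the xgboost Python API: DMatrix accepts SciPy sparse matrices, so training cost tracks the number of nonzero entries rather than the nominal column count (the dimensions below are arbitrary placeholders).

    import numpy as np
    import scipy.sparse as sp
    import xgboost as xgb

    # 100,000 columns, but only ~0.01% of entries are nonzero.
    X = sp.random(10000, 100000, density=0.0001, format="csr")
    y = np.random.randint(0, 2, 10000)

    dtrain = xgb.DMatrix(X, label=y)  # sparse input handled natively
    model = xgb.train({"objective": "binary:logistic"}, dtrain,
                      num_boost_round=10)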

Explanation / Computers

Date: Aug 03, 2016

We all know xgboost is a nightmare to compile if you are a total beginner. Here is an example of compiling xgboost for both CLI (command line interface) and R!

Explanation / Statistics

Date: May 02, 2016

Why should you not post-process rankings when you are dealing with noisy data?

We are taking Santander Customer Satisfaction as an example to show the irrationality of hard rules.

Explanation / Machine Learning

Date: Apr 18, 2016

Still not understanding the basics of gradient descent shrinkage?

The learning rate in gradient boosted trees is explained here using an analogy with a pedestrian.
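
To make the analogy concrete, a tiny pure-Python sketch of shrinkage: each boosting step only takes a fraction (the learning rate) of the full corrective step, like a pedestrian taking small steps toward a target.

    target = 100.0       # where the pedestrian wants to end up
    position = 0.0
    eta = 0.1            # learning rate (shrinkage)

    for step in range(50):
        residual = target - position  # the full corrective step
        position += eta * residual    # only take a shrunk fraction of it

    print(position)  # close to 100, approached gradually, without overshooting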

Design / Machine Learning

Date: Apr 05, 2016

Did you know you could use t-SNE on features instead of on observations?

Did you ever want to visually map the information relationships between features?

Here you have it: t-SNE on features.
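
In sketch form with scikit-learn (synthetic data; note that perplexity must stay below the number of features, since the features become the t-SNE samples):

    import numpy as np
    from sklearn.manifold import TSNE

    X = np.random.rand(1000, 50)  # 1000 observations, 50 features

    # Transpose: rows become features, so t-SNE embeds the features themselves.
    embedding = TSNE(n_components=2, perplexity=10).fit_transform(X.T)
    print(embedding.shape)  # (50, 2): one 2-D point per feature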

Design / Statistics

Date: Mar 12, 2016

How can you prove whether a machine learning problem requires a linear solution or a non-linear solution?

We will be using BNP Paribas Cardif Claims Management as an example.
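
One hedged way to frame such a proof (not necessarily the article's exact method): cross-validate a linear model against a non-linear one on the same data; a persistent gap argues for non-linearity. The synthetic target below is a hypothetical stand-in.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    X = np.random.rand(2000, 10)
    y = (np.sin(X[:, 0] * 6) + X[:, 1] ** 2 > 1).astype(int)  # non-linear target

    for model in (LogisticRegression(max_iter=1000),
                  GradientBoostingClassifier()):
        score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(type(model).__name__, round(score, 3))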

Explanation / Machine Learning

Date: Mar 06, 2016

Thinking about NAs? Why do NAs matter in tree-based models?

We are using xgboost as an example.
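
A minimal sketch with the xgboost Python API: missing values are passed through as NaN, and each split learns a default direction for them (the 20% missing rate below is an arbitrary placeholder).

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(1000, 5)
    X[np.random.rand(1000, 5) < 0.2] = np.nan  # inject 20% missing values
    y = np.random.randint(0, 2, 1000)

    # NaNs are handled natively: each tree split learns where to send them.
    dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
    model = xgb.train({"objective": "binary:logistic"}, dtrain,
                      num_boost_round=50)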

Explanation / Machine Learning

Date: May 04, 2016

Need to understand why Gamma will help you squeeze even better performance out of xgboost?

Here you are served.
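
As a sketch, gamma (alias min_split_loss) in the xgboost Python API: a split is only kept if it reduces the loss by at least gamma, pruning low-value splits (the value 1.0 below is an arbitrary example, not a recommendation).

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(5000, 10)
    y = np.random.randint(0, 2, 5000)
    dtrain = xgb.DMatrix(X, label=y)

    params = {"objective": "binary:logistic",
              "gamma": 1.0}  # minimum loss reduction required to keep a split
    model = xgb.train(params, dtrain, num_boost_round=50)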