Blog
Articles about:
- Benchmarking: comparing different stuff in a similar environment...
- Machine Learning: mostly supervised machine learning (a lot of xgboost / LightGBM)...
- Statistics: whatever a data scientist (should) already know...
- Explanations: going from A to Z when the steps are known...
- Design: problem-solving (ex: experimental design)...
- Computers: virtualization, information technology/systems...
... can be found below, in a list ordered by date.
They are organized as follows:
- Link with title and caption (if available) on the left
- Description on the right:
- Major (most important) / Super-Minor (supporting) / Sub-Minor (least important) themes
- Original post date (may not match the actual publication date)
- Small hard-coded description
Benchmark / Machine Learning
Date: Jun 10, 2017
Are you thinking about using LightGBM on Windows?
If yes, should you choose Visual Studio or MinGW as the compiler? We are checking here the impact of the compiler on the performance of LightGBM!
In addition, a juicy xgboost comparison: it has bridged the gap it had versus LightGBM!
Benchmark / Machine Learning
Date: May 25, 2017
Thinking about Intel vs AMD Ryzen?
What about picking both, and pitting them against each other in a ring of xgboost benchmarks? That is what we are doing here!
We are also looking indirectly at Linux vs Windows, and bare metal vs virtualized servers. It turns out virtualized servers and Windows servers actually hold up very well against Linux and bare-metal servers.
Benchmark / Machine Learning
Date: May 14, 2017
Using xgboost fast histogram?
Ever heard about the old and new fast histogram?
This is the comparison between the old and new xgboost fast histogram! Get ready to see... juicy 75% improvements!
Benchmark / Machine Learning
Date: Apr 30, 2017
Remember the comparison between exact and fast histogram xgboost? Here they are both together!
Benchmark / Computers / Machine Learning
Date: Apr 29, 2017
Using fast histogram xgboost? You are going to get served with benchmarks using:
- an overclocked 5.0 GHz i7-7700K and...;
- a 20-core / 40-thread 2.7 GHz server!
Best practice to remember: fast histogram xgboost scales very well with frequency (GHz). Using too many cores severely degrades training speed.
Benchmark / Computers / Machine Learning
Date: Apr 27, 2017
Using exact xgboost? You are going to get served with benchmarks using:
- an overclocked 5.0 GHz i7-7700K and...;
- a 20-core / 40-thread 2.7 GHz server!
Best practice to remember: exact xgboost scales very well with the number of cores. Frequency is secondary.
Benchmark / Machine Learning / Design
Date: Apr 23, 2017
Using decision trees with categorical features?
Should you use...:
- Numeric encoded features?
- One-hot encoded features?
- Categorical (raw) features?
- Binary encoded features?
We will show that one-hot encoding is the worst you can use, while categorical (raw) features are the best you can use, if and only if the supervised machine learning implementation can handle them.
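As a minimal sketch of the four encodings being compared (toy pandas data; the column and category names are illustrative, not from the article's benchmark):

```python
import pandas as pd

# Toy categorical feature (illustrative data only)
df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})

# Numeric (label) encoding: one integer per category
numeric = df["color"].astype("category").cat.codes

# One-hot encoding: one binary column per distinct category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Raw categorical: keep the pandas 'category' dtype so a library
# such as LightGBM can split on categories natively
raw = df["color"].astype("category")

# Binary encoding: write the numeric code in base 2 across a few columns
codes = df["color"].astype("category").cat.codes.to_numpy()
n_bits = int(codes.max()).bit_length()
binary = pd.DataFrame({f"color_bit{b}": (codes >> b) & 1 for b in range(n_bits)})

print(one_hot.shape)  # one column per category
print(binary.shape)   # only ceil(log2(k)) columns
```

Note the width difference: one-hot grows linearly with the number of categories, binary only logarithmically, and raw categorical keeps a single column.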
Benchmark / Computers / Design
Date: Apr 16, 2017
When you have a CPU with hyperthreading, make sure you are using all of its available performance.
Do not believe the myth "number of threads = number of physical cores" anymore.
We are no longer in the 2000 era, when multithreading was poorly implemented.
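As a minimal illustration of the point above (assuming Python; the standard-library call below reports logical processors):

```python
import os

# os.cpu_count() reports LOGICAL processors, hyperthreads included.
# The advice above: benchmark with this full count rather than assuming
# the physical core count is always the right thread limit.
logical = os.cpu_count()
print(f"Logical processors available: {logical}")
```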
Benchmark / Computers / Machine Learning
Date: Jan 10, 2017
Programming practices: is there a noticeable speed difference between floats and doubles?
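A minimal sketch of such a micro-benchmark (illustrative only, not the article's setup): run the same elementwise computation in float32 and float64 and time both.

```python
import time
import numpy as np

# Same computation, two precisions (synthetic data)
n = 2_000_000
x32 = np.random.rand(n).astype(np.float32)
x64 = x32.astype(np.float64)

def bench(x):
    t0 = time.perf_counter()
    for _ in range(10):
        y = np.sqrt(x * x + 1.0)
    return time.perf_counter() - t0, y

t32, y32 = bench(x32)
t64, y64 = bench(x64)

# float32 halves memory traffic (4 vs 8 bytes per value), which often,
# but not always, translates into faster throughput
print(f"float32: {t32:.3f}s  float64: {t64:.3f}s")
```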
Benchmark / Machine Learning
Date: Jan 09, 2017
We are comparing here xgboost (exact) and LightGBM.
Computation is 10x faster with LightGBM.
Explanation / Machine Learning
Date: Jan 07, 2017
xgboost has a new method for boosting, providing excellent performance: fast histogram.
Explanation / Machine Learning
Date: Dec 07, 2016
Think you don't understand xgboost's gblinear? Think again. That's just a generalized linear model.
Benchmark / Computers / Machine Learning
Date: Nov 25, 2016
Using virtualization? The CPU topology you are passing to your virtual machine matters. But by how much? (xgboost exact)
This time, we look at increasing the number of sockets.
Benchmark / Computers / Machine Learning
Date: Nov 14, 2016
Using virtualization? The CPU topology you are passing to your virtual machine matters. But by how much? (xgboost exact)
We will look at the number of cores passed to the virtual machine.
Statistics / Design
Date: Nov 08, 2016
Statistical tests are not statistical tests anymore when using large amounts of data.
They were just not made for that.
Design
Date: Nov 06, 2016
Have a metric which is quadratic?
Then improving it becomes quadratic; that part is as easy as pie to understand. Explaining the phenomenon is something different.
Design
Date: Oct 15, 2016
Do you have many features?
Are you lost in all these features?
Think you can go through all of them one by one?
A tableplot solves your problem.
Machine Learning / Design
Date: Sep 03, 2016
When you have row ID leakage and a not-too-large sample size (fewer than 100,000 rows), what does machine learning say?
Statistics / Design
Date: Sep 03, 2016
When you have row ID leakage and a not-too-large sample size (fewer than 100,000 rows), what does statistics say?
Design / Machine Learning
Date: Sep 01, 2016
What do you have to say about hierarchical supervised machine learning?
The answer: it depends.
Explanation / Machine Learning
Date: Aug 26, 2016
Explains why a generalized linear model is a boosted model, and why having many features does not matter for training speed when they are sparse.
Explanation / Computers
Date: Aug 03, 2016
We all know xgboost is a nightmare to compile if you are a total beginner. Here is an example of compiling xgboost for both CLI (command line interface) and R!
Statistics / Machine Learning / Design
Date: Jun 06, 2016
Do you know how to use PCA? Yes...
But do you know when to use it?
Explanation / Statistics
Date: May 02, 2016
Why should you not post-process rankings when dealing with noisy data?
We take Santander Customer Satisfaction as an example to show the irrationality of hard rules.
Explanation / Machine Learning
Date: Apr 18, 2016
Still not understanding the basics of gradient descent shrinkage?
The learning rate in gradient boosted trees is explained here using an analogy with a pedestrian.
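The pedestrian analogy can be sketched in a few lines (a hypothetical toy model, not the article's code): at each step, the walker covers only a fraction (the learning rate) of the remaining distance to the target.

```python
# Toy model of shrinkage: each step covers learning_rate * (remaining distance)
def walk(target, learning_rate, steps):
    position = 0.0
    for _ in range(steps):
        position += learning_rate * (target - position)
    return position

# A large rate closes in on the target quickly; a small rate takes many
# careful steps, which in boosting leaves room for later trees to correct.
print(walk(100.0, 0.5, 10))   # close to 100
print(walk(100.0, 0.05, 10))  # still far from 100
```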
Design / Machine Learning
Date: Apr 05, 2016
Did you know you could use t-SNE on features instead of on observations?
Did you ever want to visually map the information relationship between features?
Here you have it: t-SNE on features.
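A minimal sketch of the idea, assuming scikit-learn and synthetic data: transpose the matrix so t-SNE treats each feature, rather than each observation, as a point.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic data: 200 observations, 20 features (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))

# Transpose: the 20 "samples" fed to t-SNE are now the features,
# each described by its 200 observed values. Nearby points in the
# embedding are features carrying similar information.
embedding = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X.T)

print(embedding.shape)  # one 2D point per feature
```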
Design / Statistics
Date: Mar 12, 2016
How can you prove whether a machine learning problem requires a linear solution or a non-linear solution?
We will be using BNP Paribas Cardif Claims Management as an example.
Explanation / Machine Learning
Date: Mar 06, 2016
Thinking about NAs? Why do NAs matter in tree-based models?
We are using xgboost as an example.
Explanation / Machine Learning
Date: May 04, 2016
Need to understand why Gamma helps you squeeze even better performance out of xgboost?
Here you are served.