Let’s say you have a data set with a million or more training points (“rows”). What’s a reasonable way to implement supervised learning?

One approach, of course, is to only use a subset of the rows. This has its merits, but there may be various reasons why you want to use the entire available data. What then?

Andy Müller created an excellent cheat sheet, thumbnailed below, showing which machine learning techniques are likely to work best in different situations (clickable version here). It’s obviously not meant to be a rigid rule, but it’s still a good place to start answering the question above, or most similar questions.


What we see from the above is that our situation points us towards Stochastic Gradient Descent (SGD) regression or classification.

Why SGD? The problem with standard batch implementations of regression/classification, support vector machines (SVMs), random forests, etc. is that they do not scale well to the data sizes we are talking about, because they need to load all the data into memory at once and/or have computation time that grows faster than linearly with the number of rows. SGD, however, can deal with large data sets effectively by breaking up the data into chunks and processing them sequentially, as we will see shortly; this is often called minibatch learning. The fact that we only need to load one chunk into memory at a time makes it useful for large-scale data, and the fact that it works iteratively allows it to be used for online learning as well. SGD can be used for regression or classification with a range of regularization schemes (ridge, lasso, elastic net) and loss functions (squared loss, logistic loss, hinge loss, etc.).
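To make the chunk-at-a-time idea concrete, here is a minimal pure-Python sketch of minibatch gradient descent for a linear model with squared loss. It is only an illustration, not how sklearn implements its SGD estimators; the function names, the synthetic data source (y = 3x + 1 plus noise) and the learning rate are all assumptions made for this example.

```python
import random

def minibatch_sgd(chunks, n_features, learning_rate=0.01):
    """Fit a linear model with squared loss, seeing one chunk at a time."""
    weights = [0.0] * n_features
    bias = 0.0
    for X_chunk, y_chunk in chunks:
        # Accumulate the gradient over this chunk only
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(X_chunk, y_chunk):
            error = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # One update per chunk, using the chunk-averaged gradient
        n = len(y_chunk)
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def make_chunks(n_chunks, chunk_size, seed=0):
    """Hypothetical data source: y = 3*x + 1 plus a little noise."""
    rng = random.Random(seed)
    for _ in range(n_chunks):
        X = [[rng.uniform(-1, 1)] for _ in range(chunk_size)]
        y = [3.0 * x[0] + 1.0 + rng.gauss(0, 0.01) for x in X]
        yield X, y

# Only one chunk of 20 rows exists in memory at any moment
weights, bias = minibatch_sgd(make_chunks(2000, 20), n_features=1, learning_rate=0.1)
```

Because make_chunks() is a generator, each chunk is created, used for one update and then discarded, which is the property that keeps memory use bounded.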

What is SGD? It’s been explained very nicely by Andrew Ng in his Coursera class (Week 10: Large Scale Machine Learning), and Léon Bottou has a somewhat more in-depth tutorial on it. Their explanations are excellent, and there’s no point in my duplicating them, so I’ll move on to implementation using Python and the scikit-learn (sklearn) library.

The key feature of sklearn’s SGDRegressor and SGDClassifier classes that we’re interested in is the partial_fit() method; this is what supports minibatch learning. Whereas other estimators need to receive the entire training data in one go, there is no such necessity with the SGD estimators. One can, for instance, break up a data set of a million rows into a thousand chunks, then successively execute partial_fit() on each chunk. Each time one chunk is complete, it can be thrown out of memory and the next one loaded in, so memory needs are limited to the size of one chunk, not the entire data set.
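As a small sketch of that pattern with a classifier, the snippet below feeds synthetic data to SGDClassifier in chunks of 1,000 rows. The data, the chunk size and the variable names are assumptions for illustration; here the full array happens to sit in memory, whereas in a real out-of-core setting each chunk would be loaded from its source. One detail worth knowing: for classifiers, partial_fit() needs the full set of class labels via the classes argument on the first call, since an individual chunk may not contain every label.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic, linearly separable data (an assumption for this example)
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = SGDClassifier(random_state=0)
all_classes = np.unique(y)  # in practice these must be known up front

for start in range(0, len(X), 1000):
    X_chunk = X[start:start + 1000]
    y_chunk = y[start:start + 1000]
    # classes is required on the first call; passing it every time is harmless
    model.partial_fit(X_chunk, y_chunk, classes=all_classes)

accuracy = model.score(X, y)
```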

(It’s worth mentioning that the SGD estimators are not the only ones in sklearn that support minibatch learning; a variety of others are listed here. One can use this approach with any of them.)

Finally, the use of a generator in Python makes this easy to implement.

Below is a piece of simplified Python code for instructional purposes showing how to do this. It uses a generator function, iter_minibatches(), to yield chunks one at a time; the resulting generator object, ‘batcherator’, is then looped over, training on each chunk with partial_fit() as described above.

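from sklearn.linear_model import SGDRegressor

def iter_minibatches(chunksize, numtrainingpoints):
    # Provide chunks one by one
    chunkstartmarker = 0
    while chunkstartmarker < numtrainingpoints:
        # Clamp the final chunk so we never request rows past the end
        chunkend = min(chunkstartmarker + chunksize, numtrainingpoints)
        chunkrows = range(chunkstartmarker, chunkend)
        # getrows() is a placeholder: it depends on where the data lives
        X_chunk, y_chunk = getrows(chunkrows)
        yield X_chunk, y_chunk
        chunkstartmarker += chunksize

def main(numtrainingpoints, X_test):
    batcherator = iter_minibatches(chunksize=1000, numtrainingpoints=numtrainingpoints)
    model = SGDRegressor()
    # Train model, one chunk at a time
    for X_chunk, y_chunk in batcherator:
        model.partial_fit(X_chunk, y_chunk)
    # Now make predictions with trained model
    y_predicted = model.predict(X_test)
    return y_predicted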

We haven’t said anything about the getrows() function in the code above, since it depends on the specifics of where the data resides. Common situations include data stored on disk, stored in a distributed fashion, or obtained through some interface.
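For the on-disk case, one possible shape is a generator that reads a CSV file a few rows at a time using only the standard library. This is a sketch under assumptions: the name iter_csv_minibatches, the header row, the all-numeric columns and the target sitting in the last column are all hypothetical choices, and it reads sequentially rather than by row index as getrows() does.

```python
import csv
import os
import tempfile
from itertools import islice

def iter_csv_minibatches(path, chunksize):
    """Yield (X_chunk, y_chunk) from a CSV file without loading it whole.

    Assumes a header row, numeric columns, and the target in the last column.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if any
        while True:
            rows = list(islice(reader, chunksize))  # at most chunksize rows
            if not rows:
                break
            X_chunk = [[float(v) for v in row[:-1]] for row in rows]
            y_chunk = [float(row[-1]) for row in rows]
            yield X_chunk, y_chunk

# Tiny demonstration on a throwaway file: 5 data rows, read in chunks of 2
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["x1", "x2", "y"])
    for i in range(5):
        writer.writerow([i, 2 * i, 3 * i])
    demo_path = tmp.name

chunks = list(iter_csv_minibatches(demo_path, chunksize=2))
chunk_sizes = [len(y_chunk) for _, y_chunk in chunks]
os.remove(demo_path)
```

The (X_chunk, y_chunk) pairs it yields are plain lists of floats, which partial_fit() accepts directly, so a generator like this could stand in for the batcherator in the training loop.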

Also, while this simplistic code calls SGDRegressor with default arguments, this may not be the best thing to do. It is best to carry out careful cross-validation to determine the best hyperparameters to use, especially for regularization. There is a bunch more practical info on using sklearn’s SGD estimators here.
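As a rough starting point, the sketch below compares a few values of the regularization strength alpha while still training chunk by chunk. Note that it uses a single held-out validation set rather than full cross-validation, purely to keep the example short. The synthetic data, the candidate alpha values and the helper make_chunk() are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_coef = np.array([2.0, -1.0, 0.5])

def make_chunk(n):
    """Hypothetical data source: a noisy linear function of 3 features."""
    X = rng.normal(size=(n, 3))
    y = X @ true_coef + rng.normal(scale=0.1, size=n)
    return X, y

# One model per candidate regularization strength
alphas = [1e-5, 1e-3, 1e-1, 10.0]
models = {a: SGDRegressor(alpha=a, random_state=0) for a in alphas}

# Held-out chunk that is never used for training
X_val, y_val = make_chunk(2000)

# Each training chunk is loaded once and shown to every candidate model
for _ in range(50):
    X_chunk, y_chunk = make_chunk(1000)
    for model in models.values():
        model.partial_fit(X_chunk, y_chunk)

# Mean squared error on the validation chunk for each alpha
val_mse = {a: float(np.mean((m.predict(X_val) - y_val) ** 2))
           for a, m in models.items()}
best_alpha = min(val_mse, key=val_mse.get)
```

Training all candidates in the same pass means each chunk only has to be read from its source once, which matters when loading the data is the slow part.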

Hopefully this post, and the links within, give you enough info to get started. Happy large-scale learning!