Accelerating Data Science with GPUs

05/24/20

Blog overview

In this blog post I will compare data science workflows using different tools suited to different hardware. The packages are listed below, along with what they are good at and a brief comment about each:

  • pandas - one core on the CPU - easiest to use, very mature and has a huge user base.
  • dask - all cores on the CPU - medium difficulty, similar API and seamless use with pandas. Can scale across machines.
  • scikit-learn - machine learning using all cores on the CPU - easy to use. Can scale across machines via joblib.parallel_backend("dask") (see the sketch after this list).
  • rapids - the GPU - medium difficulty, the most nascent. Can scale across GPUs with dask_cuda.
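
As a quick illustration of the scikit-learn bullet above, here is a minimal sketch of dispatching the training work to dask workers via joblib. The local Client and the toy dataset are just for illustration, not part of the benchmarks in this post.

from dask.distributed import Client
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

client = Client()  # local cluster here; point this at a remote scheduler to scale out

X, y = make_classification(n_samples=10_000, n_features=20)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
with parallel_backend("dask"):  # joblib sends the per-tree fits to the dask workers
    clf.fit(X, y)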

I will benchmark a group by operation with pandas, dask and rapids. In addition, I will benchmark fitting a machine learning model with scikit-learn and rapids.

Results

TL;DR:

dask was faster than pandas with a 2.3 x speed up; rapids was faster than pandas with a 23.8 x speed up.

cuML RandomForestClassifier was faster than scikit-learn RandomForestClassifier with a 26.2 x speed up. However, with no hyperparameter tuning, the accuracy was 74 % with cuML compared to 96 % with scikit-learn.

What are GPUs?

GPU stands for Graphics Processing Unit [1]. GPUs are somewhat similar to CPUs (Central Processing Units) [2]. The CPU or GPU does the 'computation' in your electronic device. That is, it handles logic, control and input/output (I/O) as instructed by software [3].

GPUs are custom designed to be very efficient at handling computer graphics and image processing, hence the name. Here is a video demonstrating common usage of GPUs by NVIDIA, a company that makes GPUs:

How do GPUs speed up analysis?

CPUs handle computations serially, meaning the logic is handled in one stream: the next task starts only once the previous task has finished. CPUs can execute tasks in parallel across cores, but most computer CPUs tend to have only two, four or six cores.

In comparison, GPUs have hundreds of 'cores'. This massively parallel architecture is what gives the GPU its high compute performance [4; 5].
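
To make this concrete, here is a small illustration of the same elementwise operation on the CPU with NumPy and on the GPU with CuPy. It assumes CuPy and a CUDA-capable GPU are available, and the array size is arbitrary.

import numpy as np
import cupy as cp

x_cpu = np.random.random((10_000, 10_000))   # array in host (CPU) memory
x_gpu = cp.random.random((10_000, 10_000))   # array in GPU memory

y_cpu = np.sqrt(x_cpu)                       # computed on the CPU cores
y_gpu = cp.sqrt(x_gpu)                       # computed across the GPU's many cores
cp.cuda.Stream.null.synchronize()            # wait for the GPU kernel to finish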

Where to find GPUs?

Most devices have a GPU, but whether it can be used for data science is another question. I've put a star next to the machines that have GPUs which can (easily) be used for data science.

  • Personal computers:
    • Windows laptops tend to have small GPUs.
    • Apple Mac laptops mostly have Intel GPUs.
    • Home-built PCs often have a GPU, especially if used for gaming or Artificial Intelligence (AI) research *.
  • Cloud (free)
  • Cloud (paid)


In this blog post I will be using the GTX 1080 GPU in my personal PC.

Install GPU software

History

Traditionally, it has been difficult to do data science analysis on the GPU without knowing how to write lower-level languages like C to target CUDA, NVIDIA's GPU computing toolkit. However, NVIDIA has created the RAPIDS team to build open source Python packages that allow traditional data science to be written for the GPU. Note: most deep learning packages, such as TensorFlow and PyTorch, are already designed to use GPUs where available.

Tool from RAPIDS

To get everything we need we'll use the NVIDIA Data Science Stack. This makes it easy to manage the software stacks for GPU accelerated Data Science.

First clone the git repo:

$ git clone https://github.com/NVIDIA/data-science-stack.git
$ cd data-science-stack

Install toolkits

The first step is to install toolkits including CUDA, Docker and Conda:

$ ./data-science-stack setup-system
$ ./data-science-stack setup-user

Build the Container

Next, install the software in a Docker container. This will install RAPIDS as well as some Jupyter Lab extensions, e.g. dask-labextension and nvdashboard. It will also install TensorFlow, PyTorch, XGBoost and scikit-learn. A complete list of the packages is given here.

$ ./data-science-stack build-container

Once the container has finished building, you can run it:

$ ./data-science-stack run-container 

Lastly, open http://localhost:8888/ in your browser to access Jupyter Lab.

Benchmarking a group by

First, let's grab a big dataset: the NYC taxi dataset. The version I'm using has 359,892,261 records and 14 columns, and is stored as a partitioned parquet file.

import gcsfs                       # needed to read from Google Cloud Storage
import dask.dataframe as dd
from dask.distributed import Client
import cudf

client = Client()                  # start a local dask cluster using all CPU cores

# Lazily read the partitioned parquet file from the public GCS bucket
d_df = dd.read_parquet("gs://dask-nyc-taxi/yellowtrip.parquet",
                       engine="fastparquet",
                       storage_options={"token": "anon"})

The whole dataset is too big for my RAM and therefore not usable for the pandas comparison, so I will work with a subset of it which has 22,448,693 records:

d_df = d_df.partitions[0:50]

Pandas

Convert the dask.DataFrame to a pandas.DataFrame for the first benchmark:

pd_df = d_df.compute()

This is now in memory and computations will use one core.

Time how long it takes to do a group by on passenger count and count how many records are in each group:

%time pd_df.groupby('passenger_count').count()

This took 3.58 seconds.

Dask

Next, do the same thing but using all the cores. First, put the dask.DataFrame into memory using persist, then do the group by operation:

d_df = d_df.persist()
%time d_df.groupby('passenger_count').count().compute()

This took 1.53 seconds, a 2.3 x speed up.

cuDF

Lastly, convert the pandas.DataFrame to a cudf.DataFrame so we can utilize the GPU.

import cudf

cu_df = cudf.DataFrame.from_pandas(pd_df)

and repeat the same operation:

%time cu_df.groupby('passenger_count').count()

This took 0.15 seconds, a 23.8 x speed up.

dask_cuda

It's not covered here, but you can use dask_cuda together with dask_cudf to work with a dask DataFrame backed by cuDF, which lets you analyse datasets bigger than the GPU memory and use multiple GPUs (if available).
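
For completeness, here is a rough sketch of what that could look like. The cluster setup and dask_cudf calls follow the standard dask_cuda/dask_cudf pattern rather than anything benchmarked in this post.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

cluster = LocalCUDACluster()   # one dask worker per visible GPU
client = Client(cluster)

# Read the parquet data straight into a GPU-backed dask DataFrame
cu_ddf = dask_cudf.read_parquet("gs://dask-nyc-taxi/yellowtrip.parquet",
                                storage_options={"token": "anon"})

cu_ddf.groupby('passenger_count').count().compute()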

Benchmarking machine learning

First, shrink the data to make it manageable during the machine learning process:

pd_df = d_df.partitions[0:3].compute()

This gives us 1,055,567 records.

Keep features of interest:

features = [
    'passenger_count', 'trip_distance',
    'pickup_longitude', 'pickup_latitude',
    'dropoff_longitude', 'dropoff_latitude',
    'fare_amount', 'tolls_amount', 'total_amount',
]

X = pd_df[features]
y = (pd_df['tip_amount'] > 0).astype(int)

scikit-learn

Predict if the passenger tipped or not using scikit-learn:

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, n_jobs=-1)
)

Time the fit:

%time pipeline.fit(X_train, y_train)

That took 47.3 seconds.

Check the accuracy:

pipeline.score(X_test, y_test)

Gives 96 % accuracy.

cuML

Repeat the pipeline using cuML.

cu_X_train = cudf.DataFrame.from_pandas(X_train)
cu_X_test = cudf.DataFrame.from_pandas(X_test)
cu_y_train = cudf.Series(y_train)
cu_y_test = cudf.Series(y_test)

cuML doesn't have a GPU accelerated StandardScaler yet, but I can recreate it using CuPy. It also doesn't have pipelines yet, so I'll do the process in steps.

import cupy as cp
from cuml.ensemble import RandomForestClassifier

# Standardise the features on the GPU (cuDF's .values returns a CuPy array)
cu_X_train_scaled = (cu_X_train.values - cp.mean(cu_X_train.values, axis=0)) / cp.std(cu_X_train.values, axis=0)

rf_clf = RandomForestClassifier(n_estimators=100)

Time the fit (the casting to float32 and int32 here is required by RAPIDS):

%time rf_clf.fit(cu_X_train_scaled.astype('float32'), cu_y_train.astype('int32'))

This took 1.8 seconds, a 26.2 x speed up.

Check the accuracy:

cu_X_test_scaled = (cu_X_test.values - cp.mean(cu_X_test.values, axis=0)) / cp.std(cu_X_test.values, axis=0)

rf_clf.score(cu_X_test_scaled.astype('float32'), cu_y_test.astype('int32'))

Gives 74 % accuracy. It's fast, but the out-of-the-box accuracy is much worse than scikit-learn's.

Are GPUs the be-all and end-all?

GPUs can certainly speed up your analysis, and it would be very time-consuming to do deep learning (image, text, voice) without them.

However, GPUs are not cheap to buy or to rent. Depending on the size of the data and the complexity of the problem you may be best sticking with CPUs.

See also

NVIDIA GTC 2020 keynote summary: