Accelerating Data Science with GPUs (if available)
In this blog post I will compare data science workflows using different tools suited to different hardware: pandas and scikit-learn on the CPU, dask for parallel CPU work (scikit-learn can also be parallelized with dask via joblib.parallel_backend("dask")), and RAPIDS (cuDF and cuML) on the GPU. I will benchmark a group by operation with pandas, dask and RAPIDS. In addition, I will benchmark fitting a machine learning model with scikit-learn and RAPIDS.
TL;DR:
dask was faster than pandas with a 2.3x speed up; RAPIDS was faster than pandas with a 23.8x speed up.
cuML's RandomForestClassifier was faster than scikit-learn's RandomForestClassifier with a 26.2x speed up, although the accuracy, with no hyperparameter tuning, was 74% with cuML compared to 96% with scikit-learn.
GPU stands for Graphics Processing Unit [1]. GPUs are somewhat similar to CPUs (Central Processing Units) [2]. The CPU or GPU does the 'computation' in your electronic device: it handles logic, control and input/output (I/O) as instructed by software [3].
GPUs are custom designed to be very efficient at handling computer graphics and image processing, hence the name. Here is a video demonstrating common usage of GPUs by NVIDIA, a company that makes GPUs:
CPUs handle computations serially, meaning the logic is handled in one stream: the next task starts only when the previous task has finished. CPUs can also execute tasks in parallel across cores, but most computer CPUs only have two, four or six cores.
In comparison, GPUs have hundreds of 'cores'. This massively parallel architecture is what gives the GPU its high compute performance [4; 5].
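To make the contrast concrete, here is a minimal sketch (not part of the benchmarks below, and assuming CuPy is installed alongside NumPy) of the same array operation running once on the CPU and once on the GPU:

import numpy as np
import cupy as cp  # GPU array library; assumes a CUDA-capable GPU

x_cpu = np.random.rand(10_000_000)
x_gpu = cp.asarray(x_cpu)          # copy the data into GPU memory

y_cpu = np.sqrt(x_cpu) + 1.0       # runs on the CPU
y_gpu = cp.sqrt(x_gpu) + 1.0       # runs across thousands of GPU threads
cp.cuda.Stream.null.synchronize()  # wait for the GPU kernel to finish

CuPy's API is deliberately NumPy-like; the main difference is where the work happens.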
Most devices have a GPU, but whether it can be used for data science is a different question. I've put a star next to the machines that have GPUs which can (easily) be used for data science.
In this blog post I will be using the GTX 1080 GPU in my personal PC.
Traditionally, it's been difficult to do data science analysis on the GPU without knowing how to write lower-level languages like C that target CUDA, NVIDIA's GPU computing toolkit. However, NVIDIA has created the RAPIDS team to build open source Python packages which allow traditional data science to be written for the GPU. Note: most deep learning packages, such as TensorFlow and PyTorch, are already designed to use GPUs where available.
To get everything we need we'll use the NVIDIA Data Science Stack. This makes it easy to manage the software stacks for GPU accelerated Data Science.
First clone the git repo:
$ git clone https://github.com/NVIDIA/data-science-stack.git
$ cd data-science-stack
Next, install the toolkits, including CUDA, Docker and Conda:
$ ./data-science-stack setup-system
$ ./data-science-stack setup-user
Next, build the software in a Docker container. This will install RAPIDS as well as some Jupyter Lab extensions, e.g. dask-labextension and nvdashboard. It will also install TensorFlow, PyTorch, XGBoost and scikit-learn. A complete list of the packages is given here.
$ ./data-science-stack build-container
Once the container has finished building, you can run it:
$ ./data-science-stack run-container
Lastly, open http://localhost:8888/ in your browser to access Jupyter Lab.
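Once Jupyter Lab is running, it's worth checking that the container can actually see the GPU. A minimal sanity check (nvidia-smi ships with the NVIDIA driver, and cudf is part of RAPIDS):

# Run in a notebook cell
!nvidia-smi            # confirms the GPU and driver are visible inside the container
import cudf
print(cudf.__version__)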
First, let's grab a big dataset: the NYC taxi dataset. The version I'm using has 359,892,261 records and 14 columns, and is stored as a partitioned Parquet file.
import gcsfs  # lets dask read directly from Google Cloud Storage
import dask.dataframe as dd
from dask.distributed import Client
import cudf

# Start a local dask cluster that uses all CPU cores
client = Client()

# Lazily read the partitioned Parquet file from a public GCS bucket
d_df = dd.read_parquet("gs://dask-nyc-taxi/yellowtrip.parquet",
                       engine="fastparquet",
                       storage_options={"token": "anon"})
The whole dataset is too big for my RAM and therefore not usable for the pandas comparison, so I will work with a subset of it which has 22,448,693 records:
d_df = d_df.partitions[0:50]
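To confirm the record count of the subset (a quick check, not part of the timings):

len(d_df)  # triggers a computation and returns the number of rows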
Convert the dask DataFrame to a pandas DataFrame for the first benchmark:
pd_df = d_df.compute()
This is now in memory and computations will use one core.
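If you want to see how much RAM that takes, pandas can report the DataFrame's memory footprint (a quick check, not part of the benchmark):

pd_df.info(memory_usage="deep")  # per-column dtypes and the total memory usage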
Time how long it takes to do a group by on passenger count and count how many records are in that group:
%time pd_df.groupby('passenger_count').count()
This took 3.58 seconds.
Next, do the same thing but using all the cores. First, put the dask DataFrame into memory using persist, then do the group by operation:
d_df = d_df.persist()
%time d_df.groupby('passenger_count').count().compute()
This took 1.53 seconds, a 2.3 x speed up.
Lastly, convert the data to a cudf.DataFrame so we can utilize the GPU:
import cudf
cu_df = cudf.DataFrame.from_pandas(pd_df)
and repeat the same operation:
%time cu_df.groupby('passenger_count').count()
This took 0.15 seconds, a 23.8 x speed up.
It's not covered here, but you can also convert a dask DataFrame to a distributed GPU DataFrame with dask_cudf (together with dask-cuda) to analyse datasets bigger than the GPU memory and to use multiple GPUs (if available).
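A minimal sketch of what that could look like (assuming dask-cuda and dask_cudf are installed; this is not benchmarked here):

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

# One dask worker per available GPU
cluster = LocalCUDACluster()
client = Client(cluster)

# Convert the CPU-backed dask DataFrame into a GPU-backed one
d_cu_df = dask_cudf.from_dask_dataframe(d_df)
d_cu_df.groupby('passenger_count').count().compute()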
For the machine learning benchmark, first shrink the data to make it manageable:
pd_df = d_df.partitions[0:3].compute()
This gives us 1,055,567 records.
Keep features of interest:
features = [
    'passenger_count', 'trip_distance',
    'pickup_longitude', 'pickup_latitude',
    'dropoff_longitude', 'dropoff_latitude',
    'fare_amount', 'tolls_amount', 'total_amount',
]
X = pd_df[features]
y = (pd_df['tip_amount'] > 0).astype(int)
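Before reading too much into accuracy scores later on, it's worth checking the class balance; a quick check (not from the original benchmark):

y.value_counts(normalize=True)  # fraction of trips with and without a tip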
Predict if the passenger tipped or not using scikit-learn:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, n_jobs=-1)
)
Time the fit:
%time pipeline.fit(X_train, y_train)
That took 47.3 seconds.
Check the accuracy:
pipeline.score(X_test, y_test)
This gives 96% accuracy.
Repeat the pipeline using cuML.
cu_X_train = cudf.DataFrame.from_pandas(X_train)
cu_X_test = cudf.DataFrame.from_pandas(X_test)
cu_y_train = cudf.Series(y_train)
cu_y_test = cudf.Series(y_test)
cuML doesn't have a GPU accelerated StandardScaler yet, but I can recreate it using CuPy. It also doesn't have pipelines yet, so I'll do the process in steps.
import cupy as cp
from cuml.ensemble import RandomForestClassifier
# cudf's .values returns a CuPy array, so this scaling runs on the GPU;
# keep the training mean/std so the test set is scaled the same way
train_mean, train_std = cp.mean(cu_X_train.values, axis=0), cp.std(cu_X_train.values, axis=0)
X_train_scaled = (cu_X_train.values - train_mean) / train_std
rf_clf = RandomForestClassifier(n_estimators=100)
Time the fit (RAPIDS suggests, and in practice requires, casting the features to float32 and the labels to int32):
%time rf_clf.fit(X_train_scaled.astype('float32'), cu_y_train.astype('int32'))
This took 1.8 seconds, a 26.2 x speed up.
Check the accuracy:
X_test_scaled = (cu_X_test.values - train_mean) / train_std  # reuse the training statistics
rf_clf.score(X_test_scaled.astype('float32'), cu_y_test.astype('int32'))
This gives 74% accuracy. It's fast, but the out-of-the-box accuracy is much worse than scikit-learn's.
GPUs can certainly speed up your analysis, and it would be very time-consuming to do deep learning (image, text, voice) without them.
However, GPUs are not cheap to buy or to rent. Depending on the size of the data and the complexity of the problem you may be best sticking with CPUs.
NVIDIA GTC 2020 keynote summary: