GPU Utilization: The fraction of the step interval during which the GPU was executing a workload. The higher the utilization %, the better.
Drawback of GPU Utilization: Too high-level and coarse to identify performance bottlenecks, and it says nothing about how many SMs are in use. High GPU utilization might not translate to efficient usage of the GPU: a kernel with a single thread running continuously will still report a GPU utilization of 100%.
Est. Streaming Multiprocessor Efficiency: A finer-grained metric. It tells us what % of SMs are in use at any point in the trace, by reporting the % of time during which there is at least one active warp on an SM.
Limitation: SM efficiency doesn't tell us how busy each SM is. It could have just a single thread running or, even worse, it could be stalling while waiting on the result of a memory load.
Est. Achieved Occupancy: A layer deeper than Est. SM Efficiency and GPU Utilization for diagnosing performance issues. It indicates how many warps are active at once per SM. Having a sufficient number of active warps is usually key to achieving good throughput. This metric reports the value averaged over all warp schedulers for the kernel execution period.
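A minimal sketch of how these metrics can be collected: the snippet below runs the PyTorch profiler with the TensorBoard trace handler, after which the Overview page of the TensorBoard plugin reports GPU Utilization, Est. SM Efficiency, and Est. Achieved Occupancy. The names `train_loader` and `train_step` are placeholders for your own data loader and training step.

```python
import torch
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

# Profile a few training steps and dump a trace that the TensorBoard
# PyTorch Profiler plugin can read (tensorboard --logdir ./log/profiler).
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./log/profiler"),
    record_shapes=True,
) as prof:
    for step, batch in enumerate(train_loader):   # train_loader: your DataLoader
        if step >= 5:
            break
        train_step(batch)   # train_step: your forward/backward/optimizer step
        prof.step()         # tell the profiler that one step has finished
```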
This graph shows the distribution of the total time spent by the GPU (device)
As expected, given the architecture of the proposed solution, the majority of the time is spent in convolution operations
This graph shows the distribution of the total time spent by the CPU (host)
The majority of the time is spent in data movement (copying data) from the CPU to the GPU and vice versa
The kernel view gives the actual distribution of the total time spent by the kernels during one particular iteration
As is evident from the graph (and the points above), convolution and matrix multiplication kernels form a major chunk of the total time distribution
This view gives a brief summary of the different kernels (including their properties and inputs) that are executed in each iteration
If one looks at the last column (Mean Est. Achieved Occupancy), one will realize that there is still some scope for running these convolution operations with more parallelism
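A similar per-operator time breakdown (operator names rather than raw kernel names) can also be printed directly in Python, without the TensorBoard plugin. This is a sketch assuming `prof` is the finished profiler object from a run like the one above.

```python
# Sort the averaged profiler records by total CUDA time to see which
# operators (e.g. convolutions, matrix multiplications) dominate.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```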
First things first! ATen is a tensor library that contains the implementations of all the tensor operations. The PyTorch library is built on top of it, and any operation that we perform on tensors uses the aten:: operators under the hood.
Now that we have some idea about what these operators are, let's break down the metrics at the top (each column). Allocation Count is the number of times memory is allocated for a particular operator. The prefix "Self" exists because an operator might call child operators within its lifespan (although no such case is present in the snippet above). In that case, the plain Allocation Count gives the sum over the operator itself and its child operators, whereas Self Allocation Count counts only the allocations made by the operator itself. The same logic applies to "Self" for the other metrics as well.
The metric Allocation Size tells us the total amount of memory that was allocated by a given operator, without subtracting the amount of memory that was released. On the other hand, the Size Increase metric also takes the released memory into account. So, if 100 KB of memory was allocated at some point and 40 KB was released for a given operator, then Allocation Size will show 100 KB and Size Increase will show 60 KB. As is evident in the snippet above, no memory is being freed up.
Each operator in the ATen tensor library calls aten::empty to allocate memory for that particular operation (aten::add, aten::mul, etc.). This is one of the reasons why the allocation size and allocation count of that operator are so high!
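The allocation metrics discussed above require memory profiling to be enabled. The sketch below uses a hypothetical toy model purely for illustration; the printed table uses the autograd-profiler column names (e.g. "Self CUDA Mem"), which differ slightly from the TensorBoard memory view, but the "Self" semantics are the same.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical toy model and input, purely for illustration.
model = torch.nn.Linear(512, 512).cuda()
x = torch.randn(64, 512, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,    # record memory allocated/released per operator
    record_shapes=True,
) as prof:
    model(x)

# Each row is an aten:: operator; the "Self" columns exclude memory
# allocated by child operators such as aten::empty.
print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))
```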
Indeed, we can! If you compare these numbers with the ones at the top of this page, you'll see that the total time was reduced from 1014 ms to 621 ms. So, what changes led to this improvement?
As we noticed earlier, the major chunk of the time was being spent on heavy operations like convolutions and matrix multiplications. For this particular profiling run, we used a Tesla V100 GPU, which has a special kind of core, called a Tensor Core, that accelerates kernel operations from NVIDIA's cuDNN library, including matrix multiplications and convolutions. To get a better understanding of how they work, one can watch the video below.
Moving to a different GPU alone won't improve performance. Code changes are also required to take full advantage of the architecture of these special cores and their compute power. Attached below are the performance gains of the new code + Tensor Cores over the baseline of the original code running on plain CUDA cores.
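The exact code changes aren't listed here, but a common way to get convolutions and matrix multiplications onto the V100's Tensor Cores is automatic mixed precision. The sketch below shows a typical AMP training step with placeholder names (model, optimizer, criterion, inputs, targets); it is an illustration of the idea, not necessarily the change used in this run.

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss to keep FP16 gradients stable

optimizer.zero_grad()
with torch.cuda.amp.autocast():       # run eligible ops (convs, matmuls) in FP16
    outputs = model(inputs)
    loss = criterion(outputs, targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```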