Profiling CUDA Calls | Learning Deep Learning

Profiling the CUDA calls that occur behind the scenes helps when comparing different approaches and hunting for performance bottlenecks in GPU-accelerated applications. NVIDIA ships two tools for this: nvprof, a command-line profiler, and the NVIDIA Visual Profiler (nvvp), an easy-to-use GUI that visualizes the execution of CUDA kernels and API calls on a timeline.

Running a sample CUDA code

First, let’s write a basic CUDA program:

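#include <stdio.h>
#include <cuda_runtime.h>

// Kernel: runs on the GPU and writes a + b into device memory.
__global__ void add(int a, int b, int *c)
{
    *c = a + b;
}

int main()
{
    int a = 1, b = 2, c;
    int *dev_c;

    // Allocate room for a single int on the device.
    cudaMalloc(&dev_c, sizeof(int));

    // Launch one block with one thread, then copy the result back to the host.
    add<<<1, 1>>>(a, b, dev_c);
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d + %d is %d\n", a, b, c);

    // Release the device allocation.
    cudaFree(dev_c);

    return 0;
}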

Compiling it:

$ nvcc sample.cu -o sample

Profiling calls with nvprof

We can track the GPU activity of our small test program by simply running it with nvprof:

$ nvprof ./sample
==72071== NVPROF is profiling process 72071, command: ./sample
1 + 2 is 3
==72071== Profiling application: ./sample
==72071== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   56.32%  3.4240us         1  3.4240us  3.4240us  3.4240us  add(int, int, int*)
                   43.68%  2.6560us         1  2.6560us  2.6560us  2.6560us  [CUDA memcpy DtoH]
      API calls:   99.08%  165.86ms         1  165.86ms  165.86ms  165.86ms  cudaMalloc
                    0.82%  1.3800ms       101  13.663us      80ns  732.85us  cuDeviceGetAttribute
                    0.05%  81.893us         1  81.893us  81.893us  81.893us  cudaFree
                    0.02%  38.522us         1  38.522us  38.522us  38.522us  cudaMemcpy
                    0.01%  20.007us         1  20.007us  20.007us  20.007us  cudaLaunchKernel
                    0.01%  9.0370us         1  9.0370us  9.0370us  9.0370us  cuDeviceGetName
                    0.00%  6.5120us         1  6.5120us  6.5120us  6.5120us  cuDeviceGetPCIBusId
                    0.00%  1.4030us         3     467ns     281ns     831ns  cuDeviceGetCount
                    0.00%     572ns         2     286ns     121ns     451ns  cuDeviceGet
                    0.00%     220ns         1     220ns     220ns     220ns  cuDeviceTotalMem
                    0.00%     160ns         1     160ns     160ns     160ns  cuDeviceGetUuid

Here the summary of GPU activities and API calls is printed directly in the terminal. Let’s run it again, this time writing the results to a file that we can open in the Visual Profiler GUI:

$ nvprof -o sample.nvvp ./sample
==74063== NVPROF is profiling process 74063, command: ./sample
1 + 2 is 3
==74063== Generated result file: sample.nvvp
$ nvvp sample.nvvp

When inspecting the profile visually, the roughly 99% of API time spent in cudaMalloc stands out immediately.
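A likely explanation for that figure is that cudaMalloc is the first CUDA runtime call in the program, so it also pays for the lazy creation of the CUDA context. The sketch below is one way to check this, not a definitive benchmark: it assumes that a harmless cudaFree(0) placed before the allocation triggers the context creation, so that under nvprof the large cost should move from cudaMalloc to that first cudaFree call.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void add(int a, int b, int *c)
{
    *c = a + b;
}

int main()
{
    int a = 1, b = 2, c = 0;
    int *dev_c = NULL;

    // Assumption: this no-op free is the first runtime call, so the
    // one-time CUDA context creation is charged here instead of to cudaMalloc.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // With the context already in place, the allocation itself should be cheap.
    err = cudaMalloc(&dev_c, sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    add<<<1, 1>>>(a, b, dev_c);

    // cudaMemcpy waits for the kernel to finish before copying the result.
    err = cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        cudaFree(dev_c);
        return 1;
    }

    printf("%d + %d is %d\n", a, b, c);

    cudaFree(dev_c);
    return 0;
}
```

Compile and profile it the same way as the original sample and compare the cudaMalloc row between the two runs. The total startup time does not go away; it is only attributed to a different call.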

PyTorch

Out of curiosity, let’s check the profiling on a PyTorch MNIST example:

It’s interesting to observe the symmetry of operations on both the left and right sides. Each begins with a couple of ‘host to device’ memory copies, which I believe correspond to the digit image and its label, respectively.
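To make the ‘host to device’ copies concrete, here is a minimal CUDA sketch of what such a pair of transfers could look like. It is not PyTorch’s actual code: the batch size of 64, the float pixel type and the 64-bit label type are assumptions chosen for illustration. Run under nvprof, it should report two calls in the [CUDA memcpy HtoD] row, one for the images and one for the labels.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define BATCH  64          /* hypothetical batch size */
#define PIXELS (28 * 28)   /* one MNIST digit is 28x28 pixels */

int main()
{
    size_t image_bytes = (size_t)BATCH * PIXELS * sizeof(float);
    size_t label_bytes = (size_t)BATCH * sizeof(long long);

    /* Zero-filled host buffers stand in for a real batch of digits and labels. */
    float *h_images = (float *)calloc((size_t)BATCH * PIXELS, sizeof(float));
    long long *h_labels = (long long *)calloc(BATCH, sizeof(long long));
    if (h_images == NULL || h_labels == NULL) {
        fprintf(stderr, "host allocation failed\n");
        return 1;
    }

    float *d_images = NULL;
    long long *d_labels = NULL;
    if (cudaMalloc(&d_images, image_bytes) != cudaSuccess ||
        cudaMalloc(&d_labels, label_bytes) != cudaSuccess) {
        fprintf(stderr, "device allocation failed\n");
        return 1;
    }

    /* First host-to-device copy: the image batch. */
    cudaError_t err = cudaMemcpy(d_images, h_images, image_bytes,
                                 cudaMemcpyHostToDevice);
    /* Second host-to-device copy: the matching labels. */
    if (err == cudaSuccess)
        err = cudaMemcpy(d_labels, h_labels, label_bytes,
                         cudaMemcpyHostToDevice);
    if (err != cudaSuccess)
        fprintf(stderr, "copy failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_images);
    cudaFree(d_labels);
    free(h_images);
    free(h_labels);
    return err == cudaSuccess ? 0 : 1;
}
```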