CUDA Blur Filter – Something with a bit more meat…

The last post was relatively short, showing only a bit of skeleton code that called a simple kernel function from the host side via a wrapper function. This time I would like to post a snippet which actually does something: an example showing how to implement a basic blur filter using the CUDA programming environment. In addition to the former snippet, this example also contains the missing parts, showing how to allocate/deallocate device memory and how to transfer data from the host to the device and vice versa.

I figured that good old Lena would be a suitable test subject.

The first component is a .cpp file containing the main function plus the prototype of the wrapper function. It loads the input image and allocates space for the result image (as you can see, the 3D API of my trust is OpenSceneGraph, but any other API which lets you access the image data in a similar way will do fine). After calling the kernel wrapper, which gets passed pointers to the image data, the filtered image is saved to disk.

#include <osgDB/ReadFile>
#include <osgDB/WriteFile>
#include <osg/Image>

using namespace osg;

// wrapper function
extern "C" void
blur(unsigned char* h_src, unsigned char* h_res,
     const int width, const int height,
     const int numBytes, const int maskSize);

// entry point
int main(int argc, char* argv[])
{
  // size of filter mask
  const int MASK_SIZE = 3;
  int numBytes = 1;

  // the source image
  Image* srcImg = osgDB::readImageFile("lena.tga");

  // determine pixel format of source image
  switch(srcImg->getPixelFormat())
  {
  case GL_LUMINANCE:
  case GL_INTENSITY:
    numBytes = 1;
    break;
  case GL_LUMINANCE_ALPHA:
    numBytes = 2;
    break;
  case GL_RGB:
    numBytes = 3;
    break;
  case GL_RGBA:
    numBytes = 4;
    break;
  }

  // allocate space for result image
  Image* resImg = new Image;
  resImg->allocateImage(srcImg->s(), srcImg->t(), srcImg->r(),
                        srcImg->getPixelFormat(), srcImg->getDataType());

  // call the wrapper and pass the data of both images plus
  // the dimensions and the desired filter mask size
  blur(srcImg->data(), resImg->data(),
       resImg->s(), resImg->t(), numBytes, MASK_SIZE);

  // save the filtered image
  osgDB::writeImageFile(*resImg, "lena_low.tga");

  // that's it, we're done
  return EXIT_SUCCESS;
}

The main tasks of the wrapper function are the following. Firstly we need to allocate space in device memory for the input image, but also for the result of the filtering process. To determine how much space is needed, we multiply the image dimensions by the number of channels times the number of bytes needed to store one channel. In this case I assume a simple 8-bit (unsigned char) encoded source image. After memory allocation the image data is transferred into device memory.

Before we can launch the kernel, we need to tell CUDA how many threads we want. Since we want to blur every single pixel of the source image, we need one thread per pixel, i.e. 512×512 threads for Lena. If we choose a thread block size of 32×32 (1024 threads per block), we need a grid of 16×16 = 256 blocks to cover the whole input image.

After launching the kernel the remaining steps in this phase are transferring the filtered image back from device to host memory and deallocating the memory on the card.

// device kernel forward declaration
__global__ void
d_blur(unsigned char* d_src, unsigned char* d_res,
       const int width, const int height,
       const int numBytes, const int maskSize);

// wrapper function
extern "C" void
blur(unsigned char* h_src, unsigned char* h_res,
     const int width, const int height,
     const int numBytes, const int maskSize)
{
  // allocate device mem
  const size_t size = width * height * numBytes * sizeof(unsigned char);
  unsigned char* d_src;
  unsigned char* d_res;
  cudaMalloc((void**) &d_src, size);
  cudaMalloc((void**) &d_res, size);

  // transfer the source image from host to device
  // (the result buffer is fully written by the kernel, so no upload is needed)
  cudaMemcpy(d_src, h_src, size, cudaMemcpyHostToDevice);

  // grid configuration: one thread per byte of the image,
  // rounded up so partially covered tiles still get a block
  const dim3 block(32, 32);
  const dim3 grid((width * numBytes + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);

  // launch the kernel
  d_blur<<<grid, block>>>(d_src, d_res,
                          width, height,
                          numBytes, maskSize);

  // copy the filtered data back from device to host
  cudaMemcpy(h_res, d_res, size, cudaMemcpyDeviceToHost);

  // free device memory
  cudaFree(d_src);
  cudaFree(d_res);
}

Last but not least, this is the actual kernel function that gets executed by all threads.

// device kernel function
__global__ void
d_blur(unsigned char* d_src, unsigned char* d_res,
       const int width, const int height,
       const int numBytes, const int maskSize)
{
  // get thread location (one thread per byte of the image)
  const int x = blockIdx.x * blockDim.x + threadIdx.x;
  const int y = blockIdx.y * blockDim.y + threadIdx.y;

  // one row of the image is width * numBytes bytes wide
  const int pitch = width * numBytes;

  // threads of partially covered blocks may fall outside the image
  if(x >= pitch || y >= height)
    return;

  unsigned int res = 0;
  unsigned int samples = 0;

  for(int j = -maskSize; j <= maskSize; j++)
  {
    for(int i = -maskSize; i <= maskSize; i++)
    {
      // step horizontally in whole pixels so we stay in the same channel
      const int sx = x + i * numBytes;
      const int sy = y + j;

      // ignore samples beyond the image borders
      if(sx < 0 || sx >= pitch || sy < 0 || sy >= height)
        continue;

      res += d_src[pitch * sy + sx];
      ++samples;
    }
  }
  // apply weighting
  res /= samples;

  // write result into global memory
  d_res[pitch * y + x] = res;
}

This is a very basic implementation and contains a lot of redundant global memory accesses. A much more efficient approach is to utilize shared memory and load tiles of the input image into on-chip memory, with the loading spread across several threads in order to exploit memory-level parallelism. I leave this open for another post...

And finally a couple of filtered images with an increasing size of the filter mask.
