GPU Computing

FFT speed issue.

I tested the 1D FFT of a 512 × 800 array using:

1. LVGPU on an OEM NVIDIA GT 440 (144 CUDA cores, 1.5 GB 192-bit GDDR3, core clock 594 MHz, shader clock 1189 MHz); the processing time is 60 ms.

and

2. FFT.vi in a parallel for loop on an Intel i5-3570K CPU (4 cores, 3.4 GHz); the processing time is 30 ms.

So is it that the GT 440 is a low-end GPU that is not fast enough, is there some overhead time in LVGPU, or am I doing it the wrong way?

I need to improve the processing time to 10 ms. Any suggestions? Maybe a GTX GPU?

Best,

Miao

Message 1 of 5

I can't speak to your CPU benchmarks and use of a parallel for loop. However, I can give you insight into your GPU performance if your example is using CSG (single-precision complex) data.

Let me try to summarize the issues w/ your current comparison:

  1. The GPU is fairly low-end (low core count and low clock frequency).
  2. The CPU is a higher-end processor comparatively. The i5 is a mid-range desktop processor, and you're using a version with one of the fastest clock frequencies.
  3. One of the FFT dimensions in use is not optimal for GPUs. I don't know if you are doing 800 1D FFTs of size 512 or the other way around. The problem is with the 800 (see below).
  4. GPU benchmarks are impacted by more than just the GPU hardware. Your CPU, its memory throughput, the speed of system memory, and any processes that compete for these resources during benchmarking all play a role.
  5. Are you synchronizing the GPU at each iteration before taking your timings? If not, you can get a wide range of timings due to the asynchronous behavior of GPU functions called from a LabVIEW diagram. Most often, though, this makes timings look better than they really are (see the timing sketch after this list).
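
As a rough illustration of point 5, here is how the same bracketing looks in plain CUDA C. This is a sketch only, not toolkit code; on a LabVIEW diagram the Synchronize Device poly VI plays the role of the synchronization call shown here.

    #include <cuda_runtime.h>
    #include <cufft.h>
    #include <stdio.h>

    /* Times one FFT execution. Without the synchronization, the elapsed time
       would be read while the FFT is still running, which is exactly the
       "timings look better than they really are" effect described above. */
    static void time_fft(cufftHandle plan, cufftComplex *d_data)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  /* launches asynchronously */
        cudaEventRecord(stop, 0);

        cudaEventSynchronize(stop);   /* wait until the FFT has actually finished */

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("FFT time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }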

What should you do?

  • Consider upgrading your GPU hardware. Based on your application goals and problem information, I can tell you that I'm able to do a 1K x 1K FFT (1024 FFTs of size 1K) in <7 ms on a GTX 460 (1 GB) and <6 ms on a Tesla C1060. This was run on my home system w/ a higher-end CPU (Core i7-2600K 3.4 GHz / 16 GB RAM / nothing overclocked) under Windows 7 x64 using LV 2012 64-bit and the Multi-channel FFT example that ships with the toolkit.
  • Change your computing sizes to be multiples of 32 (or at least 16) to be optimal for GPU processing. This includes the FFT sizes as well as the number of FFTs to perform. Most functions will not auto-pad for you, so you have to do it in your implementation (see the batched FFT sketch after this list).
  • Compare your current GPU results to a more appropriate class of CPU - perhaps a mid-range Core i3 (2.8-3.0 GHz).
  • If you have access to a system with a faster CPU or memory, rerun your benchmarks on that system to see if that impacts GPU performance. NOTE: If you are not timing data transfers, this won't make a difference.
  • If you haven't already done so, add the Synchronize Device poly VI to your code after the FFT call but before getting the system time. This ensures the time properly reflects the GPU processing.
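
For reference, here is what a padded, batched configuration looks like in plain CUDA C using CUFFT. This is a sketch under the assumption of 800 FFTs of length 512; the toolkit's FFT VIs wrap the same library, so the VI-level code will look different, but the sizing and plan-reuse ideas carry over.

    #include <cuda_runtime.h>
    #include <cufft.h>

    #define FFT_SIZE  512                       /* already a multiple of 32 */
    #define NUM_FFTS  800
    #define PAD32(n)  ((((n) + 31) / 32) * 32)  /* round up to a multiple of 32 */

    int main(void)
    {
        const int batch = PAD32(NUM_FFTS);      /* 800 -> 832 */
        const size_t bytes = sizeof(cufftComplex) * FFT_SIZE * batch;

        cufftComplex *d_data = NULL;
        cudaMalloc((void **)&d_data, bytes);
        cudaMemset(d_data, 0, bytes);           /* the padding rows stay zero */

        /* Create the plan once, outside any timed loop; plan creation is slow. */
        cufftHandle plan;
        cufftPlan1d(&plan, FFT_SIZE, CUFFT_C2C, batch);

        /* ... upload the 800 real rows here ... */

        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cudaDeviceSynchronize();                /* make sure the work is done */

        /* ... download/display results here ... */

        cufftDestroy(plan);
        cudaFree(d_data);
        return 0;
    }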

Your goal of 10ms is reasonable so I would keep that requirement and continue refining your solution.

Message 2 of 5

Thank you so much for your detailed instructions. I have asked my advisor to get a better GPU, and I will post here when I get the final test results.

There is another question that confuses me; I wonder if you might help.

Can the GPU toolkit handle memory allocated by cudaMalloc(), or should all GPU memory be allocated using the VIs the GPU toolkit provides?

Is it possible to use shared memory or texture memory in my customized functions?

Miao

Message 3 of 5

Hi, MathGuy,

I found that the actual FFT speed is much faster than I previously posted. In my last timing, I mistakenly included the time to set the device and the time to create the FFT plan. So even though my current GPU is low end, it is still better than the CPU for doing the FFT. The GPU Analysis Toolkit is just fantastic!

But there is still some time cost in transferring data from device memory to the host for display. Is there any way I can display the data in device memory directly, using LabVIEW?

Miao

Message 4 of 5

You'll find that transferring data back to the host has to overcome two hurdles:

  1. The GPU device (and driver) are optimized for downloads (host-to-device transfers). That's a natural optimization, as they are first and foremost designed to display data. As a result, they are not as fast at uploading data back to the host (although the Tesla series is pretty good based on benchmarks).
  2. Once the data is returned to the host, it must be read back into the process handling the display of the data (e.g., a LabVIEW graph), which, behind the scenes, may load the data into the CPU cache. Even if it doesn't hit the cache, it must be copied back down to the device via the display device's driver process.
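
One mitigation not mentioned above, if a host copy is unavoidable, is to stage the readback through page-locked (pinned) host memory. This is a plain CUDA C sketch, not toolkit code, and the function and variable names are illustrative only:

    #include <cuda_runtime.h>

    /* d_result points to 'bytes' of device memory produced by the FFT. */
    void copy_result_to_host(const void *d_result, size_t bytes)
    {
        void *h_pinned = NULL;
        cudaMallocHost(&h_pinned, bytes);       /* page-locked host buffer */

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        /* A copy into a pinned buffer is usually faster than a copy into
           ordinary pageable memory and can overlap other GPU work. */
        cudaMemcpyAsync(h_pinned, d_result, bytes,
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);          /* h_pinned is valid after this */

        /* ... hand h_pinned to the code that updates the display ... */

        cudaFreeHost(h_pinned);
        cudaStreamDestroy(stream);
    }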

To render data on the device w/out transferring it back requires external code using an API such as OpenGL. If memory serves, examples which share data between CUDA and OpenGL ship with the CUDA SDK (which is now part of the toolkit installation). If not, you can find coding examples posted online by searching for 'CUDA and OpenGL'.
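
For orientation, the sharing pattern those examples use looks roughly like this in CUDA C. It is a sketch only; it assumes an OpenGL buffer object named pbo already exists and a GL context is current.

    #include <cuda_runtime.h>
    #include <cuda_gl_interop.h>

    static cudaGraphicsResource_t cuda_pbo = NULL;

    void register_gl_buffer(unsigned int pbo)
    {
        /* Register once; the buffer is then shared between CUDA and OpenGL. */
        cudaGraphicsGLRegisterBuffer(&cuda_pbo, pbo, cudaGraphicsMapFlagsWriteDiscard);
    }

    void write_results_for_display(void)
    {
        float *d_ptr = NULL;
        size_t bytes = 0;

        cudaGraphicsMapResources(1, &cuda_pbo, 0);
        cudaGraphicsResourceGetMappedPointer((void **)&d_ptr, &bytes, cuda_pbo);

        /* ... copy or compute results directly into d_ptr with a kernel ... */

        cudaGraphicsUnmapResources(1, &cuda_pbo, 0);
        /* OpenGL can now draw from the buffer with no device-to-host round trip. */
    }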

Unfortunately, the GPU toolkit doesn't help create these OpenGL implementations. However, if you wanted to invoke one or more custom render functions based on OpenGL as part of GPU computing, that could be done from LabVIEW using the toolkit (and probably some custom components based on the toolkit SDK).

You mentioned support for textures. Performing rendering via OpenGL is a separate but related issue. The toolkit does not ship with support for the texture data type in CUDA. The primary reason is that textures (and CUDA arrays) do not support double precision data - the most common numeric type used in LabVIEW.

While the texture type is not present, the toolkit SDK is capable of supporting it. Even though I have not created an example yet, I architected the SDK so that it could.
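
To illustrate the limitation, here is what binding a buffer to a texture looks like in plain CUDA C, assuming the texture-object API introduced in CUDA 5.0 (older releases would use texture references instead). Note the channel descriptor exists for float but there is no double-precision channel format.

    #include <cuda_runtime.h>
    #include <string.h>

    /* Bind an existing single-precision device buffer to a texture object. */
    cudaTextureObject_t make_texture(float *d_data, size_t n)
    {
        cudaResourceDesc resDesc;
        memset(&resDesc, 0, sizeof(resDesc));
        resDesc.resType                = cudaResourceTypeLinear;
        resDesc.res.linear.devPtr      = d_data;
        resDesc.res.linear.desc        = cudaCreateChannelDesc<float>();
        resDesc.res.linear.sizeInBytes = n * sizeof(float);

        cudaTextureDesc texDesc;
        memset(&texDesc, 0, sizeof(texDesc));
        texDesc.readMode = cudaReadModeElementType;

        cudaTextureObject_t tex = 0;
        cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
        return tex;   /* device code reads it with tex1Dfetch<float>(tex, i) */
    }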

You may find that textures aren't required. It's possible that some OpenGL functions may consume or copy data from a CUDA data buffer as-is. The examples you find should address this out of necessity.

Lastly, you asked about what memory the toolkit handles. The toolkit works natively with memory allocated by cudaMalloc(). In the Driver API's C interface, the same pointer appears as the CUdeviceptr type. According to documentation, this type is consumable by any CUDA function based on the Runtime or Driver APIs and is used internally by the matrix and vector types exported by CUBLAS.
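
As a sketch of that interoperability in plain CUDA C (not toolkit code), a single cudaMalloc'd buffer can be handed directly to both CUFFT and CUBLAS:

    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <cufft.h>

    int main(void)
    {
        const int n = 512;
        cufftComplex *d_buf = NULL;
        cudaMalloc((void **)&d_buf, n * sizeof(cufftComplex));

        /* ... fill d_buf (upload or kernel) ... */

        /* CUFFT consumes the raw device pointer as-is. */
        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, 1);
        cufftExecC2C(plan, d_buf, d_buf, CUFFT_FORWARD);

        /* So does CUBLAS; here the same buffer is scaled in place. */
        cublasHandle_t blas;
        cublasCreate(&blas);
        const cuComplex alpha = make_cuComplex(0.5f, 0.0f);
        cublasCscal(blas, n, &alpha, (cuComplex *)d_buf, 1);

        cudaDeviceSynchronize();

        cublasDestroy(blas);
        cufftDestroy(plan);
        cudaFree(d_buf);
        return 0;
    }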

Functions exist to 'convert' this type to other special CUDA types such as CUDA arrays and textures, but there are certain limitations to each conversion. The documentation does a good job of explaining the trade-offs.

Message 5 of 5