04-26-2012 06:25 AM
Hello,
I am trying to write a DLL function that allocates CUDA memory and returns a pointer to the CUDA (device) memory.
A second function should accept this pointer and do the calculation.
I want these operations to be separate because I need to do many calculations on the same data, and I am trying to avoid repeatedly copying the same data to GPU memory (it takes a lot of time).
Q1: What should I change in my DLL to get it to work? I am not able to get the pointer to "i_d" out of the "cuda_Malloc" function in LabVIEW.
Q2: How do I configure the Call Library Node when passing the i_d pointer to "kernel_cuda_calculation"?
My DLL:
main.cpp:
extern "C" __declspec(dllexport) int cuda_Malloc ( float *i, void *i_d, int N ){
for( float x=0; x<N; x++ )
i
kernel_cuda_Malloc( i, i_d, N );
return 0;
}
extern "C" __declspec(dllexport) int cuda_Calculation( void *i_d, float *result, int N ) {
kernel_cuda_calculation( i_d, result, N );
return 0;
}
simple.cu:
__global__ void kernelTest( float *i, int N ){
    unsigned int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if ( tid < N )
        i[tid] += 10;
}

int kernel_cuda_Malloc( float *i, void *i_d, int N ){
    cudaMalloc( (void**)&i_d, N*sizeof( float ) );
    cudaMemcpy( i_d, i, N*sizeof( float ), cudaMemcpyHostToDevice );
    return 0;
}

void kernel_cuda_calculation( float *i_d, float *result, int N ){
    dim3 threads; threads.x = 240;
    dim3 blocks;  blocks.x = ( N/threads.x ) + 1;
    kernelTest<<< blocks, threads >>>( i_d, N );   // launch syntax is <<< blocks, threads >>>
    cudaMemcpy( result, i_d, N*sizeof( float ), cudaMemcpyDeviceToHost );
}
This code is a modification of https://decibel.ni.com/content/docs/DOC-20353
Here is a similar question, but the poster managed to solve it himself, so there is no accepted answer: http://forums.ni.com/t5/LabVIEW/CUDA-device-memory-allocation-and-returns-the-device-ptr/m-p/1963837...
04-26-2012 10:08 AM
You're running into multiple issues. Let's address an architectural issue first before answering your specific questions.
Let's suppose you fix the issues with your DLL so that you can retrieve the device pointer and pass it to your kernel. The kernel call will still fail periodically, because the CUDA runtime has execution requirements that LabVIEW's Call Library Node does not meet without special handling.
The problem lies in CUDA's requirement (in v4.0 and later) that a function's parameters must have been allocated in the same CUDA context. To satisfy it, you have to create a context (or set a device, which implicitly creates a context) and then 'push' or 'set' that context on the current host thread before you call cudaMalloc() and your kernel(s).
This is not easy to do because LabVIEW uses whatever thread is available to call functions from your DLL. Unless you set the context on the calling thread (or make every call from the exact same host thread), you'll get spurious runtime errors from CUDA.
The example code from DOC-20353 works around this by configuring the Call Library Node to run in the UI thread. That's the easiest way to force all calls to happen from the same host thread. However, it has an unexpected consequence: data transfers to the GPU perform erratically. This only shows up after many iterations (25-100 at minimum). Although the exact cause is not known, using the UI thread is known to be the reason for the excessive jitter.
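To illustrate the push/set approach, here is a minimal sketch (my own example, not code from DOC-20353; the function names cuda_Init and cuda_Malloc_Ctx are hypothetical) that brackets each exported call with cuCtxPushCurrent/cuCtxPopCurrent so the context follows whatever host thread LabVIEW calls from. The same bracket would go around the kernel launch:

#include <cuda.h>          // driver API: cuCtxCreate, cuCtxPushCurrent, ...
#include <cuda_runtime.h>  // runtime API: cudaMalloc, ...

static CUcontext g_ctx = NULL;   // created once, shared by every exported call

extern "C" __declspec(dllexport) int cuda_Init( int device ){
    cuInit( 0 );
    CUdevice dev;
    cuDeviceGet( &dev, device );
    cuCtxCreate( &g_ctx, 0, dev );        // cuCtxCreate makes the context current here...
    return (int)cuCtxPopCurrent( NULL );  // ...so pop it, leaving it floating for later pushes
}

extern "C" __declspec(dllexport) int cuda_Malloc_Ctx( void **i_d, int N ){
    cuCtxPushCurrent( g_ctx );            // bind the context to the calling thread
    cudaError_t err = cudaMalloc( i_d, N * sizeof( float ) );
    cuCtxPopCurrent( NULL );              // unbind so a different thread can push it next time
    return (int)err;
}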
What are your options?
The NI Labs LabVIEW GPU Computing library was originally created to help work around issues like these, but it is out of date. We're in the process of releasing the GPU Analysis Toolkit, which is designed to address exactly what you're doing. You can request the beta at www.ni.com/beta. It includes CUDA function wrappers for device management, resource allocation, FFT and level-3 BLAS functions, as well as a GPU SDK for calling custom GPU kernels.
Now let's tackle your questions:
Q1: I don't see anything wrong with your code, so it's possible the configuration of the Call Library Node is the culprit. The CUDA pointer parameter should be configured as an Unsigned Pointer-sized Integer and passed (a) by Value when you intend to use the raw pointer value in a function, or (b) by Pointer to Value when you are setting its value.
By configuring pointer parameters as pointer-sized integers, you maintain compatibility with both 32- and 64-bit versions of LabVIEW.
Q2: This is answered as part of Q1 above.
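To make the two configurations concrete, here is a sketch of matching C prototypes (my own illustration, not your exact code): the allocator takes the device pointer by address, so its parameter is configured as Unsigned Pointer-sized Integer passed by Pointer to Value, while the consumer takes the same value by Value:

#include <cuda_runtime.h>

void kernel_cuda_calculation( float *i_d, float *result, int N );  // defined in the .cu file

// 'Pointer to Value' side: cudaMalloc writes the device address into *i_d,
// so LabVIEW can read the pointer back from the output terminal.
extern "C" __declspec(dllexport) int cuda_Malloc( float *i, void **i_d, int N ){
    cudaMalloc( i_d, N * sizeof( float ) );
    cudaMemcpy( *i_d, i, N * sizeof( float ), cudaMemcpyHostToDevice );
    return 0;
}

// 'by Value' side: the raw device address is only consumed, never modified.
extern "C" __declspec(dllexport) int cuda_Calculation( void *i_d, float *result, int N ){
    kernel_cuda_calculation( (float *)i_d, result, N );
    return 0;
}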
04-27-2012 05:27 AM
Hello MathGuy,
thanks for the reply.
I have downloaded the GPU Analysis Toolkit beta as you suggested.
Following your advice, this is my idea:
1. use "Initialize Device.vi" to get CUDA Context
2. wire CUDA Context to "Allocate Memory.vi"
3. set number of elements ( N ) in "Allocate Memory.vi"
4. choose a type of allocated data in "Allocate Memory.vi" ( possibilities: U8, U16, U32, I8, I16, I32, SGL, DBL, CSG, CDB)
5. wire CUDA U8 Device Ptr to call library node (no.1) as Unsigned Pointer-size Integer and pass it by Value as i_d
6. wire array of my data to call library node (no.1) as Array Data Pointer
7. DLL number one (see the combined sketch after this list):
int kernel_cuda_Copy( float *i, void *i_d, int N );   // forward declaration
extern "C" __declspec(dllexport) int cuda_Copy( float *i, void *i_d, int N ){   // not sure whether to use "void i_d" or "void *i_d" here
    kernel_cuda_Copy( i, i_d, N );
    return 0;
}
int kernel_cuda_Copy( float *i, void *i_d, int N ){   // same doubt here
    cudaMemcpy( i_d, i, N*sizeof( float ), cudaMemcpyHostToDevice );
    return 0;
}
8. wire the i_d output terminal of Call Library Node (no. 1), where "DLL number one" is loaded, to another Call Library Node (no. 2), where "DLL number two" is loaded
9. wire an initialized array for the results to Call Library Node (no. 2)
10. configuration of Call Library Node (no. 2), where "DLL number two" is loaded:
i_d ... Unsigned Pointer-sized Integer, passed by Value
results ... Array Data Pointer
11. DLL number two:
void kernel_cuda_calculation( float *i_d, float *result, int N );   // forward declaration
extern "C" __declspec(dllexport) int cuda_Calculation( void *i_d, float *result, int N ){   // same doubt: "void i_d" or "void *i_d"
    kernel_cuda_calculation( (float *)i_d, result, N );
    return 0;
}
__global__ void kernelTest( float *i, int N ){
    unsigned int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if ( tid < N )
        i[tid] += 10;
}
void kernel_cuda_calculation( float *i_d, float *result, int N ){
    dim3 threads; threads.x = 240;
    dim3 blocks;  blocks.x = ( N/threads.x ) + 1;
    kernelTest<<< blocks, threads >>>( i_d, N );
    cudaMemcpy( result, i_d, N*sizeof( float ), cudaMemcpyDeviceToHost );
}
12. for deallocation, should I use a third DLL function with "cudaFree( i_d );", or rather use "Free Memory.vi"?
Of course, all of the functions can be in the same DLL; I just divided them to keep the text clear. (The combined sketch below puts steps 7, 11 and 12 together.)
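Collected into one sketch, written throughout with "void *i_d" (my working guess, since "void i_d" would not compile and cudaMemcpy expects an address), the three pieces would look like this:

#include <cuda_runtime.h>

// step 7: copy host data into an already-allocated device buffer
extern "C" __declspec(dllexport) int cuda_Copy( float *i, void *i_d, int N ){
    cudaMemcpy( i_d, i, N * sizeof( float ), cudaMemcpyHostToDevice );
    return 0;
}

// step 11: run the kernel on the device buffer and copy the result back
__global__ void kernelTest( float *i, int N ){
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ( tid < N )
        i[tid] += 10;
}

extern "C" __declspec(dllexport) int cuda_Calculation( void *i_d, float *result, int N ){
    dim3 threads; threads.x = 240;
    dim3 blocks;  blocks.x = ( N / threads.x ) + 1;
    kernelTest<<< blocks, threads >>>( (float *)i_d, N );
    cudaMemcpy( result, i_d, N * sizeof( float ), cudaMemcpyDeviceToHost );
    return 0;
}

// step 12: free the device buffer (only if Free Memory.vi is not managing it)
extern "C" __declspec(dllexport) int cuda_Free( void *i_d ){
    return (int)cudaFree( i_d );
}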
I added my block diagram, but I failed because the CUDA U8 Device Ptr from "Allocate Memory.vi" is not compatible with the Unsigned Pointer-sized Integer in the Call Library Node.
Am I at least close to the right solution?
04-27-2012 10:40 AM
Well, you have the right idea, but you're mixing two very different solutions. When working with the GPU Analysis Toolkit, GPU resources like a device pointer are encapsulated in LabVIEW classes (.lvclass). This packaging of the GPU data adds the protections necessary to use the raw GPU device pointer from LabVIEW.
Getting at the U8 device pointer inside the class, and then using it safely in a call to a GPU kernel, requires a specific programming pattern in G. You can look at Initialize Memory in the toolkit to see how the device pointer is accessed. In fact, all of the toolkit VIs are examples of how to get, set and use GPU resources.
It's difficult to detail the specifics in an email; Initialize Memory itself is the best high-level description of the steps involved.
The process is rather involved, but it is straightforward once you know the steps and become familiar with the SDK components. I am in the process of writing a detailed step-by-step guide to calling a custom GPU kernel. The document will be online by the time the toolkit ships, but I hope to have an earlier version available before then.
Until it's available, the best approach is to find function wrappers in the toolkit with inputs and outputs similar to yours, and use that code as a template for your own custom wrappers. I'll do my best to help if/when you get stuck.
You may also find it helpful to start from the FFT example, remove the data you don't need, and replace the FFT computation in the middle. This also simplifies your external library, because functions like cudaMemcpy are already provided (i.e. Download Data and Upload Data). You won't need to call them from your custom cuda_* code.
04-28-2012 10:26 AM
Thank you for your comprehensive answer. I really appreciate it.
With your advice I managed to run my code without problems.
I am really looking forward to the detailed step-by-step guide for calling a custom GPU kernel.
Is a release date for the guide or the toolkit known?
Regards,
Brano.