Creating a DLL to work on 2D Arrays

rolfk · ‎05-22-2024

It does come at a cost. You need the special Intel compiler and add a dependency on the oneAPI library. In addition, the highly optimized code is specific to your machine's hardware.

So yes, if you are really performance constrained AND have complete control of the target hardware AND are sure not to change hardware at any point, it is a valid optimization. In any other situation it is likely to cause more trouble than benefit.

Rolf Kalbermatter

Andrey_Dmitriev · ‎05-22-2024

@rolfk wrote:

It does come at a cost. You need the special Intel compiler and add a dependency on the oneAPI library. In addition, the highly optimized code is specific to your machine's hardware.

So yes, if you are really performance constrained AND have complete control of the target hardware AND are sure not to change hardware at any point, it is a valid optimization. In any other situation it is likely to cause more trouble than benefit.

Yes, our work is that hard. As Jonathan Edwards wrote, "programming is a desperate losing battle against the unconquerable complexity of code, and the treachery of requirements". As a pragmatic programmer I always try to minimize dependencies on specific compilers, libraries, and hardware, but that is not always possible. From an optimization point of view, intrinsics are also available here (in NI CVI as well), and AVX is available on almost every CPU.

To be honest, I perform such optimization rounds pretty rarely. Usually high-performance libraries like Intel IPP or NI VDM do the job well enough, so there is no need to reinvent the wheel.

But LabVIEW does not always meet my performance requirements, because I work on real-time image-processing systems, usually installed inline on industrial conveyor belts, where the processing time is strictly dictated; otherwise the conveyor will wait, fully loaded with parts. And the managers say: you will get the best hardware money can buy, but on average you have only so many seconds. The Intel compiler helps me a lot in such cases. Most of my cases are very customized solutions, so I have no problems with backward compatibility, and when we upgrade to more recent CPUs, the software still works.

My most recent project was a very interesting one: four high-speed 1K 16-bit cameras running synchronously (synced over cRIO + FPGA) at 12800 FPS, collecting up to 640 gigabytes of data within a few seconds. The data was then loaded into the PC via 4x 10G network interfaces, and processing included stitching, geometrical lens correction, two-stage flat-field correction, and encoding into a RAW multipage large TIFF plus an additional MPEG video. Everything was done in just a few minutes (yes, it was a really powerful dual Xeon Gold PC with SSDs in RAID and 1 TB RAM). With pure LabVIEW arrays the post-processing would probably take hours.

And the results from modern LabVIEW are also not so bad. I'm impressed with the overall performance, because I fully understand the huge amount of work going on under the hood.

Andrey_Dmitriev · ‎05-22-2024

@rolfk wrote:

It does come at a cost. You need the special Intel compiler and add a dependency on the oneAPI library...

It's me again. I would just like to drop in another approach here, one that is independent of Intel's tools (well, dependent on AVX2):

Whole DLL source:

 EUROASM CPU=X64, SIMD=AVX2
 IncImgAsm PROGRAM FORMAT=DLL, Entry=DllEntry, MODEL=FLAT, WIDTH=64

        align 16
EXPORT fnIncImg
; rcx: pointer to source array
; rdx: pointer to destination array
; r8: number of elements in the arrays (multiple of 64)
fnIncImg PROC
    shr r8d, 6 ; 64 elements (128 bytes) per iteration; the count is a 32-bit int, and the 32-bit shift zero-extends into r8
    
    vpcmpeqd ymm0, ymm0, ymm0
    vpsrlw ymm0, ymm0, 15 	; Load the increment constant (1) -> ymm0

.loop:
    vmovdqa ymm1, [rcx]  	; Load 16 elements from the src    
    vmovdqa ymm2, [rcx+32]  ; Load next 16 elements from the src    
    vmovdqa ymm3, [rcx+64]  ; and so on    
    vmovdqa ymm4, [rcx+96]  ; unrolled 4 times    
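    vpaddw ymm1, ymm1, ymm0  ; Increment the elements; vpaddw wraps on
    vpaddw ymm2, ymm2, ymm0  ; overflow like an unsigned 16-bit add,
    vpaddw ymm3, ymm3, ymm0  ; whereas vpaddsw would saturate the
    vpaddw ymm4, ymm4, ymm0  ; values as signed words at 0x7FFF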
    vmovdqa [rdx], ymm1 ; Store the result in the destination array
    vmovdqa [rdx+32], ymm2
    vmovdqa [rdx+64], ymm3
    vmovdqa [rdx+96], ymm4
    
    add rcx, 128 ; Update pointers and loop counter
    add rdx, 128
    dec r8
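    jnz .loop
    vzeroupper ; Clear the upper YMM halves before returning to non-AVX code
    ret        ; Return explicitly instead of falling through into DllEntry
ENDP fnIncImg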

DllEntry PROC                      
	mov rax, 1
	ret
ENDPROC DllEntry

ENDPROGRAM IncImgAsm

Unfortunately no syntax highlight for Assembly.

The loop is unrolled 4 times and runs in a single thread. There are only two kinds of instructions, move and add, so the code is hopefully clear. The image must be properly aligned in memory, of course. I know that for more complicated algorithms this approach quickly turns into hell.

EuroAssembler was used, but with minor changes this code can be assembled by any syntax-compatible assembler: MASM, NASM, YASM, FASM, etc.
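
For anyone who wants to sanity-check the routine from C, here is a minimal sketch of the preconditions the loop relies on, plus a scalar reference to compare against. The helper names (`is_aligned`, `inc_img_args_ok`, `inc_img_ref`) are made up for illustration and are not part of the attached project. The reference wraps on overflow, which is an assumption about what an unsigned 16-bit increment should do; a saturating SIMD add would give different results at the limits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p is aligned to `alignment` bytes
   (alignment must be a power of two). */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helper: nonzero if the arguments satisfy what the
   unrolled AVX2 loop assumes. */
static int inc_img_args_ok(const uint16_t *src, const uint16_t *dst,
                           size_t count)
{
    return count > 0
        && count % 64 == 0      /* 4 x 16 elements per loop iteration */
        && is_aligned(src, 32)  /* aligned 256-bit loads need 32 bytes */
        && is_aligned(dst, 32); /* same for the aligned stores */
}

/* Scalar reference: add 1 to every element, wrapping 65535 -> 0. */
static void inc_img_ref(const uint16_t *src, uint16_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        dst[i] = (uint16_t)(src[i] + 1u);
    }
}
```

A caller could check `inc_img_args_ok(src, dst, n)` before calling `fnIncImg`, then run `inc_img_ref` on the same input and compare the two outputs element by element.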

The benchmark is almost the same:


//==============================================================================
//
// Title:		Assembly vs LabVIEW Benchmark
// Created on:	22.05.2024 at 20:02:29 by AD.
//
//==============================================================================

#include <Windows.h>
#include <inttypes.h>
#include <stdio.h>
#include <malloc.h>

#include "include/SharedLibLabVIEW.h"
#define WIDTH 4096
#define HEIGHT 4096

#define BEGIN_MEASURE QueryPerformanceCounter(&StartTime); \
	for(int i = 0; i < 100; i++) //amount of repetitions

#define END_MEASURE(Message) 	QueryPerformanceCounter(&EndTime); \
	ElapsedMicroseconds.QuadPart = EndTime.QuadPart - StartTime.QuadPart; \
	ElapsedMicroseconds.QuadPart *= 1000000; \
	ElapsedMicroseconds.QuadPart /= Frequency.QuadPart; \
	ElapsedTime = (double)(ElapsedMicroseconds.QuadPart)/100000.0; \
	printf(#Message " is %.3f \xE6s\n", ElapsedTime);

extern "C" void fnIncImg(uint16_t * src, uint16_t * dst, int LENGTH);

int main(int argc, char* argv[])
{
	uint16_t* src, * dst;
	Uint16Array srcImage, dstImage;
	int32 dimSizeArr[2] = { HEIGHT, WIDTH }; //rows, cols

	LARGE_INTEGER StartTime, EndTime, ElapsedMicroseconds, Frequency;
	double ElapsedTime;

	printf("Assembly vs LabVIEW Benchmark for image %d x %d\n", WIDTH, HEIGHT);
	QueryPerformanceFrequency(&Frequency);

	src=(uint16_t*)_aligned_malloc(WIDTH * HEIGHT * sizeof(uint16_t), 4096);
	dst = (uint16_t*)_aligned_malloc(WIDTH * HEIGHT * sizeof(uint16_t), 4096);
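	srcImage = AllocateUint16Array(dimSizeArr);
	dstImage = AllocateUint16Array(dimSizeArr);

	//fill both inputs with the same test pattern so the check below compares like with like
	for (int i = 0; i < WIDTH * HEIGHT; i++) {
		src[i] = (*srcImage)->Numeric[i] = (uint16_t)(i % 1000);
	}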

	//warm up
	fnIncImg(src, dst, HEIGHT * WIDTH);
	LabVIEWIncImage(&srcImage, &dstImage);

	BEGIN_MEASURE 	//ASM Benchmark
		fnIncImg(src, dst, HEIGHT * WIDTH);
	END_MEASURE(Assembly)

	BEGIN_MEASURE //LabVIEW Benchmark
		LabVIEWIncImage(&srcImage, &dstImage);
	END_MEASURE(LabVIEW)

	for (int i = 0; i < WIDTH * HEIGHT; i++) { //Check it
		if (dst && (dst[i] != (*dstImage)->Numeric[i])) {
			printf("FAILED! at %d\n", i);
			break;
		}
	}

	_aligned_free(src);
	_aligned_free(dst);
	DeAllocateUint16Array(&srcImage);
	DeAllocateUint16Array(&dstImage);

	return 0;
}


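And it is still faster on my old i7-7700 home PC (the printed values are really milliseconds per call: the macro divides the total microseconds by 100 repetitions and by 1000, so the µs label is off):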

C:\Users\Andrey\Desktop\IncImgAsm\Release_x64>Benchmark.exe
Assembly vs LabVIEW Benchmark for image 4096 x 4096
Assembly is 3.485 µs
LabVIEW is 10.270 µs
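
To put these numbers in perspective, here is a rough sketch of the arithmetic behind them. It assumes the values are milliseconds per call (total time divided by the 100 repetitions) and that every call reads and writes the whole 4096 x 4096 U16 image once. The function names and the 10 MHz tick frequency in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a performance-counter tick delta to milliseconds per repetition. */
static double ms_per_rep(int64_t ticks, int64_t ticks_per_second, int reps)
{
    double total_seconds = (double)ticks / (double)ticks_per_second;
    return total_seconds * 1000.0 / (double)reps;
}

/* Effective memory traffic in GB/s (10^9 bytes per second). */
static double gbytes_per_second(double bytes_moved, double milliseconds)
{
    return bytes_moved / (milliseconds * 1e-3) / 1e9;
}

/* Bytes read plus bytes written for one pass over a U16 image. */
static double bytes_per_pass(int width, int height)
{
    return 2.0 * (double)width * (double)height * (double)sizeof(uint16_t);
}
```

With an example tick frequency of 10 MHz, a delta of 3485000 ticks over 100 repetitions gives `ms_per_rep(3485000, 10000000, 100)`, about 3.485 ms. Under the assumptions above, `gbytes_per_second(bytes_per_pass(4096, 4096), 3.485)` comes out at roughly 19.3 GB/s for the assembly version, and the same call with 10.270 gives roughly 6.5 GB/s for LabVIEW.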

The source code and EuroAssembler are attached as well.

Now this topic is more or less complete.

rolfk · ‎05-22-2024

I wouldn't consider the use of assembly an advantage over the use of a special C compiler 😀

And what if I want to use gas? 😁

Anyways, I agree that this topic can probably be considered closed and stressed to the max.

Rolf Kalbermatter

Andrey_Dmitriev · ‎05-22-2024

@rolfk wrote:

I wouldn't consider the use of assembly an advantage over the use of a special C compiler 😀

And what if I want to use gas? 😁

Oh, stay away from GAS and its ugly syntax (personally, I really dislike it, but that is just my own subjective judgment). The best assembler I've ever seen was the one for the PDP-11; I still remember the RT11-FB OS.

In my humble opinion, assembler is still good for education, and I strongly recommend that every software engineer do some exercises in assembly, especially those working with LabVIEW (and Python), because very often they unfortunately have no idea how a PC and its OS (cache, memory, CPU, cdecl, stdcall, etc.) work at all.

rolfk · ‎05-23-2024

@Andrey_Dmitriev wrote:

The best assembler I've ever seen was the one for the PDP-11; I still remember the RT11-FB OS.

Wow, PDP-11! I never worked with them, but they used to have them in the research departments at the company where I did my vocational training back in, well, let's just say quite a while back in the last century. 😀

The whole company ERP system was running on an IBM 3031 mainframe and the business logic was based on something called CICS.

Rolf Kalbermatter

Martin_Henz · ‎05-23-2024

All we need in this thread now is the solution for purists. A HEX editor and documentation of the opcodes and the DLL file format should also be sufficient. 🙄
