05-22-2024 09:17 AM
It does come at a cost. You need the special Intel compiler and add a dependency on the oneAPI library. In addition, the highly optimized code is specific to your machine's hardware.
So yes, if you are really performance constrained AND have complete control of the target hardware AND are sure not to change hardware at any point, it is a valid optimization. In any other situation it is likely to cause more trouble than benefit.
05-22-2024 10:01 AM - edited 05-22-2024 10:16 AM
@rolfk wrote:
It does come at a cost. You need the special Intel compiler and add a dependency on the oneAPI library. In addition, the highly optimized code is specific to your machine's hardware.
So yes, if you are really performance constrained AND have complete control of the target hardware AND are sure not to change hardware at any point, it is a valid optimization. In any other situation it is likely to cause more trouble than benefit.
Yes, our work is that hard. As Jonathan Edwards wrote, "programming is a desperate losing battle against the unconquerable complexity of code, and the treachery of requirements". As a pragmatic programmer I always try to minimize dependencies on specific compilers, libraries and hardware, but it is not always possible. From the optimization point of view, intrinsics are also available here (in NI CVI as well), and AVX is available on almost every CPU.
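Since hand-optimized code ties the binary to specific CPU features, one common mitigation is a runtime check before the fast path is selected. A minimal sketch, assuming a GCC/Clang or MSVC build for x86; `cpu_has_avx2` and `select_inc_kernel` are hypothetical names used only for illustration, not part of any project in this thread:

```cpp
#include <cassert>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>     // __cpuid, __cpuidex
#include <immintrin.h>  // _xgetbv
#endif

// Hypothetical helper: true when AVX2 can be used on this machine.
bool cpu_has_avx2()
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // GCC and Clang builtin for querying CPU features at run time.
    return __builtin_cpu_supports("avx2") != 0;
#elif defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4];
    __cpuid(regs, 0);
    if (regs[0] < 7) return false;                // CPUID leaf 7 not available
    __cpuid(regs, 1);
    const bool osxsave = (regs[2] & (1 << 27)) != 0;
    const bool avx     = (regs[2] & (1 << 28)) != 0;
    if (!osxsave || !avx) return false;
    if ((_xgetbv(0) & 0x6) != 0x6) return false;  // OS must save XMM and YMM state
    __cpuidex(regs, 7, 0);
    return (regs[1] & (1 << 5)) != 0;             // CPUID.(7,0):EBX bit 5 = AVX2
#else
    return false;  // unknown compiler or non-x86 target: take the safe path
#endif
}

// Hypothetical dispatch: name of the code path a caller would pick.
const char* select_inc_kernel()
{
    return cpu_has_avx2() ? "avx2" : "scalar";
}
```

A caller would evaluate this once at startup and fall back to a plain C loop when the check fails, which keeps the binary usable after a hardware change.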
To be honest, I perform such optimization rounds pretty rarely. Usually high-performance libraries like Intel IPP or NI VDM do the job pretty well, so there is no need to reinvent the wheel.
But LabVIEW does not always meet my performance requirements, because I work on real-time image-processing systems, usually installed inline on industrial conveyor belts, where the processing time is strictly dictated; otherwise the conveyor will wait, fully loaded with parts. And the managers say: you will get the best hardware money can buy, but on average you have only so many seconds. The Intel compiler helps me a lot in such cases. Most of my cases are very customized solutions, so I have no problems with backward compatibility, and after an upgrade to the most recent CPUs the software still works.
My recent project was a very interesting one: four high-speed 1K 16-bit cameras running synchronously (sync over cRIO+FPGA) at 12800 FPS, collecting up to 640 gigabytes of data within a few seconds. The data was then loaded into the PC via 4x 10G network interfaces, and the processing included stitching, geometrical lens correction, two-stage flat-field correction and encoding into a large multipage RAW TIFF plus an additional MPEG video, all done in just a few minutes (yes, it was a really powerful dual Xeon Gold PC with SSDs in RAID and 1 TB RAM). With pure LabVIEW arrays the post-processing would probably take hours.
And the results from modern LabVIEW are also not so bad. I'm impressed with the overall performance, because I fully understand the huge amount of work going on under the hood.
05-22-2024 02:18 PM - edited 05-22-2024 02:24 PM
@rolfk wrote:
It does come at a cost. You need the special Intel compiler and add a dependency on the oneAPI library...
It's me again. I would just like to drop another approach here, which is independent of Intel (well, dependent on AVX2).
Whole DLL source:
EUROASM CPU=X64, SIMD=AVX2
IncImgAsm PROGRAM FORMAT=DLL, Entry=DllEntry, MODEL=FLAT, WIDTH=64
align 16
EXPORT fnIncImg
; rcx: pointer to source array (32-byte aligned)
; rdx: pointer to destination array (32-byte aligned)
; r8: number of elements in the arrays (nonzero multiple of 64)
fnIncImg PROC
    shr r8, 6                   ; 64 elements (128 bytes) per iteration
    vpcmpeqd ymm0, ymm0, ymm0   ; Set all bits in ymm0
    vpsrlw ymm0, ymm0, 15       ; Each 16-bit lane = 1, the increment constant
.loop:
    vmovdqa ymm1, [rcx]         ; Load 16 elements from the src
    vmovdqa ymm2, [rcx+32]      ; Load next 16 elements from the src
    vmovdqa ymm3, [rcx+64]      ; and so on
    vmovdqa ymm4, [rcx+96]      ; unrolled 4 times
    vpaddsw ymm1, ymm1, ymm0    ;
    vpaddsw ymm2, ymm2, ymm0    ; Increment the elements
    vpaddsw ymm3, ymm3, ymm0    ; (vpaddsw is a signed saturating add)
    vpaddsw ymm4, ymm4, ymm0    ;
    vmovdqa [rdx], ymm1         ; Store the result in the destination array
    vmovdqa [rdx+32], ymm2
    vmovdqa [rdx+64], ymm3
    vmovdqa [rdx+96], ymm4
    add rcx, 128                ; Update pointers and loop counter
    add rdx, 128
    dec r8
    jnz .loop
    vzeroupper                  ; Avoid AVX/SSE transition penalties in the caller
    ret                         ; Return explicitly instead of falling through into DllEntry
ENDP fnIncImg
DllEntry PROC
    mov rax, 1
    ret
ENDPROC DllEntry
ENDPROGRAM IncImgAsm
Unfortunately, there is no syntax highlighting for assembly.
The loop is unrolled 4 times and runs in a single thread. There are only two kinds of instructions, move and add, so the code is hopefully clear. The image must be properly aligned in memory, of course (vmovdqa on ymm registers requires 32-byte alignment). I know that for more complicated algorithms this approach quickly turns into hell.
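For comparison, the same move-and-add loop can be written with compiler intrinsics, which was mentioned earlier in the thread as an alternative. This is only a sketch under assumptions: `inc_image_sketch` is a hypothetical name, unaligned loads and stores are used so it works on any buffer, the vector path is compiled only when AVX2 is enabled (for example `-mavx2` or `/arch:AVX2`), and it is not unrolled like the assembly version:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Adds 1 to every 16-bit element with signed saturation, which is what
// vpaddsw does when the second operand is a vector of ones.
void inc_image_sketch(const std::uint16_t* src, std::uint16_t* dst, std::size_t count)
{
    assert(src != nullptr && dst != nullptr);
    std::size_t i = 0;
#if defined(__AVX2__)
    const __m256i one = _mm256_set1_epi16(1);
    for (; i + 16 <= count; i += 16) {
        // 16 elements per 256-bit register; loadu/storeu need no alignment
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src + i));
        v = _mm256_adds_epi16(v, one);  // compiles to vpaddsw
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + i), v);
    }
#endif
    // Scalar path: the tail, or everything when AVX2 is not enabled at build time.
    for (; i < count; ++i) {
        // 0x7FFF is the largest signed 16-bit value, so it stays unchanged
        dst[i] = (src[i] == 0x7FFF) ? src[i]
                                    : static_cast<std::uint16_t>(src[i] + 1);
    }
}
```

Note the saturation: an input of 0x7FFF stays 0x7FFF. If unsigned pixel data should wrap like ordinary integer arithmetic, `_mm256_add_epi16` (vpaddw) would be the matching intrinsic, and `_mm256_adds_epu16` (vpaddusw) the one for unsigned saturation.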
EuroAssembler was used, but with minor changes this code can be assembled with any syntax-compatible assembler: MASM, NASM, YASM, FASM, etc.
The benchmark is almost the same:
//==============================================================================
//
// Title: Assembly vs LabVIEW Benchmark
// Created on: 22.05.2024 at 20:02:29 by AD.
//
//==============================================================================
#include <Windows.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>
#include <malloc.h>
#include "include/SharedLibLabVIEW.h"

#define WIDTH 4096
#define HEIGHT 4096
#define BEGIN_MEASURE QueryPerformanceCounter(&StartTime); \
    for (int i = 0; i < 100; i++) // number of repetitions
#define END_MEASURE(Message) QueryPerformanceCounter(&EndTime); \
    ElapsedMicroseconds.QuadPart = EndTime.QuadPart - StartTime.QuadPart; \
    ElapsedMicroseconds.QuadPart *= 1000000; \
    ElapsedMicroseconds.QuadPart /= Frequency.QuadPart; \
    /* total microseconds / 100 repetitions / 1000 = milliseconds per call */ \
    ElapsedTime = (double)(ElapsedMicroseconds.QuadPart) / 100000.0; \
    printf(#Message " is %.3f ms\n", ElapsedTime);

// The assembly reads the element count from the full 64-bit r8 register,
// so the count is declared as size_t rather than int.
extern "C" void fnIncImg(uint16_t* src, uint16_t* dst, size_t length);

int main(int argc, char* argv[])
{
    uint16_t* src, * dst;
    Uint16Array srcImage, dstImage;
    int32 dimSizeArr[2] = { HEIGHT, WIDTH }; // rows, cols
    LARGE_INTEGER StartTime, EndTime, ElapsedMicroseconds, Frequency;
    double ElapsedTime;

    printf("Assembly vs LabVIEW Benchmark for image %d x %d\n", WIDTH, HEIGHT);
    QueryPerformanceFrequency(&Frequency);

    // Page-aligned buffers; vmovdqa in the DLL needs at least 32-byte alignment
    src = (uint16_t*)_aligned_malloc(WIDTH * HEIGHT * sizeof(uint16_t), 4096);
    dst = (uint16_t*)_aligned_malloc(WIDTH * HEIGHT * sizeof(uint16_t), 4096);
    srcImage = AllocateUint16Array(dimSizeArr);
    dstImage = AllocateUint16Array(dimSizeArr);

    // warm up
    fnIncImg(src, dst, (size_t)HEIGHT * WIDTH);
    LabVIEWIncImage(&srcImage, &dstImage);

    BEGIN_MEASURE // ASM benchmark
        fnIncImg(src, dst, (size_t)HEIGHT * WIDTH);
    END_MEASURE(Assembly)

    BEGIN_MEASURE // LabVIEW benchmark
        LabVIEWIncImage(&srcImage, &dstImage);
    END_MEASURE(LabVIEW)

    for (int i = 0; i < WIDTH * HEIGHT; i++) { // Check it
        if (dst && (dst[i] != (*dstImage)->Numeric[i])) {
            printf("FAILED! at %d\n", i);
            break;
        }
    }

    _aligned_free(src);
    _aligned_free(dst);
    DeAllocateUint16Array(&srcImage);
    DeAllocateUint16Array(&dstImage);
    return 0;
}
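And it is still faster on my old i7-7700 home PC (the figures are the average time per call over 100 repetitions; dividing the total microseconds by 100000 yields milliseconds, so the µs label in this output is a mislabel):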
C:\Users\Andrey\Desktop\IncImgAsm\Release_x64>Benchmark.exe
Assembly vs LabVIEW Benchmark for image 4096 x 4096
Assembly is 3.485 µs
LabVIEW is 10.270 µs
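To put these numbers in perspective, the effective memory throughput can be estimated from the image size. This is a rough sketch that assumes the printed values are milliseconds per call (the macro divides the total microseconds of 100 calls by 100000) and that each pass reads and writes every pixel exactly once; `bytes_per_pass` and `effective_gb_per_s` are hypothetical helper names:

```cpp
#include <cassert>
#include <cstdint>

// Bytes moved by one pass: every 16-bit pixel is read once and written once.
constexpr std::uint64_t bytes_per_pass(std::uint64_t width, std::uint64_t height)
{
    return width * height * sizeof(std::uint16_t) * 2;
}

// Effective throughput in GB/s (1 GB = 1e9 bytes) for a pass taking `ms` milliseconds.
double effective_gb_per_s(std::uint64_t bytes, double ms)
{
    assert(ms > 0.0);
    return static_cast<double>(bytes) / (ms * 1e-3) / 1e9;
}
```

For the 4096 x 4096 image, one pass moves 67,108,864 bytes, so under these assumptions the assembly version runs at roughly 19.3 GB/s and the LabVIEW version at roughly 6.5 GB/s, a ratio of about 2.9.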
The source code and EuroAssembler are included.
With that, this topic is more or less complete.
05-22-2024 02:29 PM
I wouldn't consider the use of assembly an advantage over the use of a special C compiler 😀
And what if I want to use gas? 😁
Anyways, I agree that this topic can probably be considered closed and stressed to the max.
05-22-2024 02:48 PM - edited 05-22-2024 02:48 PM
@rolfk wrote:
I wouldn't consider the use of assembly an advantage over the use of a special C compiler 😀
And what if I want to use gas? 😁
Oh, stay away from GAS and its ugly syntax (personally, I really dislike it, but that's just my own subjective judgment). The best assembler I've ever seen was the one for the PDP-11; I still remember the RT11-FB OS.
In my humble opinion, assembler is still good for education, and I strongly recommend that every software engineer do some exercises in assembly, especially those working with LabVIEW (and Python), because very often they unfortunately have no idea how a PC and OS (cache, memory, CPU, cdecl, stdcall, etc.) work at all.
05-23-2024 03:13 AM
@Andrey_Dmitriev wrote:
The best assembler I've ever seen was the one for the PDP-11; I still remember the RT11-FB OS.
Wow, PDP-11! I never worked with them, but they used to have them in the research departments of the company where I did my vocational education back in, well, let's just say quite a while back in the last century. 😀
The whole company ERP system was running on an IBM 3031 mainframe and the business logic was based on something called CICS.
05-23-2024 03:40 AM
All we need in this thread now is the solution for purists. A HEX editor and documentation of the opcodes and the DLL file format should also be sufficient. 🙄