10-20-2012 02:32 AM
Well, yes. Arrays in LabVIEW are contiguous in memory, so having large data structures can lead to out-of-memory errors due to memory fragmentation, especially if memory is constantly reallocated and large arrays are shoveled around. The number of allocation dots (for arrays) when showing buffer allocations is one telltale sign that this is happening, and my code has significantly fewer such dots. This was my only criterion when I wrote it.
If operating on large arrays, you have few other choices. How do you want to keep the "data small" if it is not? Everything else being equal, keeping a large contiguous array in a fixed memory position (e.g. a DVR or a simple shift register), never reallocating or copying it, and doing all operations in-place (maybe with a few scratch arrays of much smaller size, e.g. one row or column) will most likely beat anything that has the same data scattered over many smaller data structures. Do you have an example to the contrary?
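To make the "one buffer, small scratch array" idea concrete outside of a block diagram, here is a minimal text-language sketch. It is not the attached VI: the function name `rotate_rows_left_in_place` and the flat row-major layout are assumptions chosen for illustration. Each row of the single contiguous buffer is rotated left by `k` positions, and the only extra storage is a scratch list of at most one row.

```python
from array import array


def rotate_rows_left_in_place(data, n_rows, n_cols, k):
    """Rotate every row of a flat, row-major 2D buffer left by k, in place.

    Hypothetical helper for illustration; `data` is mutated and never
    reallocated. The scratch list holds k elements (at most one row).
    """
    if n_rows * n_cols != len(data):
        raise ValueError("buffer size does not match n_rows * n_cols")
    if n_cols == 0:
        return
    k %= n_cols
    if k == 0:
        return
    scratch = [0] * k  # allocated once, reused for every row
    for r in range(n_rows):
        start = r * n_cols
        # Save the k leading elements that would otherwise be overwritten.
        for i in range(k):
            scratch[i] = data[start + i]
        # Shift the remaining elements of the row toward the front.
        for i in range(n_cols - k):
            data[start + i] = data[start + i + k]
        # Put the saved elements at the end of the row.
        for i in range(k):
            data[start + n_cols - k + i] = scratch[i]


# Usage: a 2 x 4 array stored in one contiguous buffer of doubles.
buf = array("d", [0, 1, 2, 3, 4, 5, 6, 7])
rotate_rows_left_in_place(buf, n_rows=2, n_cols=4, k=1)
print(list(buf))
```

The point of the sketch is only that the large buffer keeps its identity and size throughout; whether this beats a copy-based version in a given environment still has to be measured.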
I mentioned that my code can also be fully parallelized, potentially gaining a few factors of two depending on hardware. It is possible that the penalty of the original code is less than expected because of better use of SIMD by the built-in array functions. It might be interesting to try it on a much older LabVIEW version (pre SSE2) to see how the compiler has changed. 😉
Yes, my code probably still has a lot of slack left. Feel free to improve it. As I said it was just a quick draft.
10-20-2012 03:07 AM - edited 10-20-2012 03:27 AM
Parallelizing code to run on several processors creates memory traffic and synchronization overhead, and there is no clear line between where it helps and where it slows you down, especially when you compute your way through the data in tight/fast loops. Finding this crossover point can be tricky. Benchmarking your code helps.
This is why many modern games actually run faster on a single-core CPU. Optimizing the code to fit in the CPU cache is often a better way forward than trying to parallelize it, especially on those Xeons with an insane amount of cache. Heck, even this Core i7 has up to 8 MB of cache. Now, that's a bloody large array or picture.
Yes, it is contiguous, but then you do a transpose on your array, or iterate on a column basis, and suddenly you find the runtime pointer-bouncing all around memory, eventually thrashing the cache (fetching from RAM), and it runs slow(er).
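A small sketch of why column-wise iteration jumps around in memory, assuming the 2D array is stored row-major in one contiguous block. The helper names `flat_offsets` and `steps` are made up for this illustration. It lists the step between consecutively visited elements for both traversal orders.

```python
def flat_offsets(n_rows, n_cols, by_column=False):
    """Yield the flat element offsets visited in a row-major 2D array."""
    if by_column:
        # Outer loop over columns: consecutive accesses are n_cols apart.
        for c in range(n_cols):
            for r in range(n_rows):
                yield r * n_cols + c
    else:
        # Outer loop over rows: consecutive accesses are adjacent.
        for r in range(n_rows):
            for c in range(n_cols):
                yield r * n_cols + c


def steps(offsets):
    """Return the distance between each pair of consecutive offsets."""
    offsets = list(offsets)
    return [b - a for a, b in zip(offsets, offsets[1:])]


row_steps = steps(flat_offsets(3, 4))
col_steps = steps(flat_offsets(3, 4, by_column=True))
print("row-wise steps:   ", row_steps)
print("column-wise steps:", col_steps)
```

For a 3 x 4 array the row-wise walk always moves by one element, while the column-wise walk moves by a full row length (4) and then jumps back. With rows of thousands of elements, each of those strides can land on a different cache line.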
Fragmenting large 2D/3D arrays into smaller row/column based chunks isn't what I meant. This is a blindingly obvious case for using multidimensional arrays as DVRs, just as in your example. I was thinking more in terms of general program structure.
After a context switch, your fast threads/loops/modules/classes/arrays should be "blitted" into the cache if they have been trashed by previous operations, and then execute there without any excessive thrashing, which can be a problem with convoluted/complex programs. Having aligned memory allocations helps the OS/processor determine what to prefetch, etc.
Br,
/Roger
10-20-2012 03:51 AM
A good read about cache handling that for sure can be applied to LV programs.
Br,
/Roger
10-20-2012 11:45 AM
Hi all,
Thanks a lot for the answers.
I had already figured out that using empty arrays in the "in-place element" structures was not a good idea because of memory reallocation 😞 At least I've learned something today about inplaceness 😉 As I'm not involved in computational work, I'm not used to huge amounts of data; actually, I've never coded with "in-place element" before...
On my 8-core machine, the best performance is achieved with For loops without parallelization for left & right rotations.
For up & down rotations, even the initial code is faster than the one with non-parallelized For loops! And the gain from parallelization is not clear compared to the in-place element structure (cf. attached test VI). But I'm also not used to parallelization problems...
Best regards,
HL
10-20-2012 12:00 PM
Yes, a good general idea is to keep the code as simple as possible and avoid fancy stuff such as inplaceness and parallelism unless you really need them from a CPU performance or memory requirements perspective. These techniques are usually the final touches to "cool the hot spots" once your overall architecture is lean and the dataflow is optimal.
Br,
/Roger
10-20-2012 12:04 PM
The compiler has become so sophisticated that it is really impossible to tell without doing the actual benchmarking. For example, just changing the array to DBL will change the ranking slightly.
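As a rough illustration of "measure, don't guess" in a text language, here is a small benchmark sketch. The two rotation variants (`rotate_concat`, `rotate_slice_assign`) and the helper `best_time` are hypothetical stand-ins, not the VIs from this thread. It times both variants for an integer and a double-precision buffer and reports the best of several runs.

```python
import timeit
from array import array


def rotate_concat(buf, k):
    """Rotate left by k by building a new buffer (allocates a copy)."""
    if len(buf) == 0:
        return buf
    k %= len(buf)
    return buf[k:] + buf[:k]


def rotate_slice_assign(buf, k):
    """Rotate left by k by writing back into the same buffer object."""
    n = len(buf)
    if n == 0:
        return buf
    k %= n
    head = buf[:k]         # temporary copy of the leading k elements
    buf[:n - k] = buf[k:]  # same-length slice assignment, size unchanged
    buf[n - k:] = head
    return buf


def best_time(func, typecode, n=10_000, k=3, repeat=5, number=20):
    """Return the best per-call time in seconds over `repeat` runs."""
    buf = array(typecode, range(n))
    timings = timeit.repeat(lambda: func(buf, k), repeat=repeat, number=number)
    return min(timings) / number


for typecode in ("l", "d"):  # signed integer vs. double-precision elements
    for func in (rotate_concat, rotate_slice_assign):
        print(typecode, func.__name__, best_time(func, typecode))
```

The absolute numbers depend on the machine and interpreter, and the ranking may differ between element types, which is exactly why the measurement is needed.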
That being said, I don't claim that my code is especially optimized. I am sure there are more efficient ways possible. 😄
10-20-2012 12:58 PM - edited 10-20-2012 12:59 PM
altenbach wrote: That being said, I don't claim that my code is especially optimized. I am sure there are more efficient ways possible. 😄
OK, here's an UP version of the "for loop" variety that is about 4x faster. I am sure it can be further optimized.
Also the other transformations could probably be rewritten similarly. Try it! 😄
10-20-2012 01:09 PM
Is the code even correct? Before we can assess the performance, the code must work correctly.
I didn't get it to work for some cases. Maybe I just didn't do it right?
I leave it for you to test yourself. Attached is a VI.
Br,
/Roger
10-20-2012 01:27 PM
Which code and which cases?
(I would eliminate the outer shift register to keep the result static for better comparison).
10-20-2012 01:33 PM
@altenbach wrote:
Which code and which cases?
(I would eliminate the outer shift register to keep the result static for better comparison).
Based on yours, I presume? I unhid the 2D array and put in a loop delay to watch the rotation.
The "without structure" rotate left & right. Try them; there aren't that many cases.
I attached it.
Br,
/Roger