10-18-2012 11:28 AM - edited 10-18-2012 11:29 AM
@Herlag wrote:
I don't know if it's really useful for your problem, but here is a very simple piece of code (LabVIEW 2011) showing how to rotate 2D arrays.
This code seems to be very inefficient, because it involves a huge number of memory allocations due to the constant resizing and building of large 2D arrays. If you do a "Tools...Profile...Show Buffer Allocations", here's what you'll see:
I would recommend implementing a solution that operates in place. Since input and output have the same dimensions, that should easily be possible. Try it, then run some benchmarks on large 2D arrays.
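LabVIEW block diagrams can't be shown inline in text, but the in-place idea translates directly. A minimal Python sketch (an illustration of the general technique, not LabVIEW code) rotates a square 2D array 90° clockwise by moving elements through four-way cycles, so no second full-size buffer is allocated:

```python
def rotate90_inplace(a):
    """Rotate a square 2D list 90 degrees clockwise, in place.

    Elements move through four-way cycles, layer by layer, so no
    second full-size buffer is ever allocated.
    """
    n = len(a)
    for layer in range(n // 2):
        first, last = layer, n - 1 - layer
        for i in range(first, last):
            off = i - first
            top = a[first][i]                           # save top edge
            a[first][i] = a[last - off][first]          # left   -> top
            a[last - off][first] = a[last][last - off]  # bottom -> left
            a[last][last - off] = a[i][last]            # right  -> bottom
            a[i][last] = top                            # top    -> right
    return a
```

For example, `rotate90_inplace([[1, 2], [3, 4]])` yields `[[3, 1], [4, 2]]`. The same four-way swap can be expressed in LabVIEW with indexing and replace operations, though as the rest of this thread shows, what the compiler actually allocates is best checked with Show Buffer Allocations.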
10-18-2012 11:34 AM
Sure, you're right.
But does it really matter if the size of the array is quite small? 😉
... Of course, for an image, the size may be large 😞
HL
10-19-2012 02:06 PM
In a previous message, I proposed a very simple piece of code to rotate 2D arrays.
Altenbach noticed that the code seemed to be very inefficient because of multiple buffer allocations, and suggested using in-place elements.
Just for fun, I tried to do so and wrote the attached "2D_In_Place_Rotate" VI.
Of course, there is no difference from the initial code for the small Boolean array used in the example.
But I tried with an array that may represent a typical image, i.e. a 2000×2000 U32 array... and I got a surprise!
The code that operates in place takes much more time (~885 ms on my machine) than the initial one (~105 ms)!
Any idea why?
The test VIs are also attached to this message: is there something wrong with them?
Best regards,
HL
10-19-2012 02:28 PM - edited 10-19-2012 02:30 PM
@Herlag wrote:
In a previous message, I proposed a very simple piece of code to rotate 2D arrays.
Altenbach noticed that the code seemed to be very inefficient because of multiple buffer allocations, and suggested using in-place elements.
Just for fun, I tried to do so and wrote the attached "2D_In_Place_Rotate" VI.
Of course, there is no difference from the initial code for the small Boolean array used in the example.
But I tried with an array that may represent a typical image, i.e. a 2000×2000 U32 array... and I got a surprise!
The code that operates in place takes much more time (~885 ms on my machine) than the initial one (~105 ms)!
Any idea why?
The test VIs are also attached to this message: is there something wrong with them?
Best regards,
HL
It is more important to align the array accesses so that you get a low number of cache misses. If those buffers reside in the cache, it's usually no problem. If the in-place structure for some reason causes cache misses, then you will see performance suffer.
Not sure if it is that, though.
It could also just be some LabVIEW inefficiency in handling in-placeness.
Br,
/Roger
10-19-2012 02:30 PM
Coming back to the beginning of the thread: as far as I understand, you are trying to create a 3D array where every 2D sub-array represents one image? This is quite inefficient, because all 2D sub-arrays must have the same size, so you have to resize ALL of them.
Instead, I recommend using a cluster containing a 2D array, and creating a 1D array of this cluster. This effectively gives you a 3D array, but you can resize every 2D sub-image independently.
For comparing the images pixel by pixel, another option would be to put two more numeric controls into the cluster, representing the shift-x and shift-y. With this, you don't have to shift your arrays; you just keep track of the calculated shift, which you can then use in your analysis function (meaning you don't access position (x, y) of your image but position (x + shift_x, y + shift_y)). This also avoids large memory operations in your code...
Hope this helps,
best regards,
Christian
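The suggested cluster of {2D array, shift-x, shift-y} can be sketched in Python as a small record type (the names here are illustrative, not LabVIEW API). The pixels are never physically shifted; the analysis just reads through the stored offset:

```python
from dataclasses import dataclass

@dataclass
class ShiftedImage:
    """Analog of a cluster holding a 2D array plus shift-x/shift-y.

    The pixel data is never physically moved; every access applies
    the logical offset instead. Illustrative Python, not LabVIEW API.
    """
    pixels: list      # 2D list of rows
    shift_x: int = 0
    shift_y: int = 0

    def at(self, x, y):
        # Read position (x + shift_x, y + shift_y) instead of (x, y):
        # the "shift" costs two additions, not a large memory operation.
        return self.pixels[y + self.shift_y][x + self.shift_x]
```

For example, `ShiftedImage([[1, 2], [3, 4]], shift_x=1).at(0, 0)` reads the pixel one column to the right of the origin.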
10-19-2012 02:31 PM
Hi Roger,
What do you mean by "to align the array"?
HL
10-19-2012 02:35 PM - edited 10-19-2012 02:45 PM
@Herlag wrote:
Hi Roger,
What do you mean by "to align the array"?
HL
You can choose to iterate over the rows or the columns; all data in memory is laid out on a per-row or per-column basis (I don't remember which). Try testing which is faster for a really large array, for example. Some iteration orders may allow a more optimal processing path from a caching perspective. Modern processors can also prefetch data based on access patterns and thereby improve the cache hit rate.
(If I recall correctly. Don't take my word for it.) Test it.
Br,
/Roger
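The row-versus-column access patterns Roger describes can be sketched in Python. A caveat on the hedging: pure-Python lists of lists won't reproduce the cache effect faithfully (each row is a separate heap object), but the two traversal orders are the point; in a compiled, contiguous row-major array the second version is the one that misses the cache:

```python
def sum_by_rows(a):
    # Inner loop walks along a row: for a row-major layout these are
    # consecutive memory addresses, which caches and prefetchers like.
    total = 0
    for row in a:
        for v in row:
            total += v
    return total

def sum_by_columns(a):
    # Inner loop walks down a column: every read jumps a full row
    # stride, which is where cache misses show up in compiled code.
    rows, cols = len(a), len(a[0])
    total = 0
    for j in range(cols):
        for i in range(rows):
            total += a[i][j]
    return total
```

Both functions return the same sum; only the memory access order differs, which is exactly what a benchmark on a really large array would expose.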
10-19-2012 02:47 PM
@User002 wrote:
@Herlag wrote:
Hi Roger,
What do you mean by "to align the array"?
HL
You can choose to iterate over the rows or the columns; all data in memory is laid out on a per-row or per-column basis (I don't remember which). Try testing which is faster for a really large array, for example. Some iteration orders may allow a more optimal processing path from a caching perspective. Modern processors can also prefetch data based on access patterns and thereby improve the cache hit rate.
(If I recall correctly. Don't take my word for it.) Test it.
Br,
/Roger
Tom's Hardware:
http://www.tomshardware.com/reviews/Intel-i7-nehalem-cpu,2041-12.html
Br,
/Roger
10-20-2012 01:32 AM - edited 10-20-2012 01:34 AM
@Herlag wrote:
The test VIs are also attached to this message : is there something wrong with them ?
Yes, there is plenty wrong with your implementation. Just using an In Place Element structure ("weird in-place") does not magically execute everything in place. In fact, the way you are using it, you now get even more memory allocations. You can speed it up to about the speed of your original code ("original") by wiring the upper split function across instead of wiring an empty array from the inside ("better weird in-place"). I don't understand why you would wire an empty array; that simply makes no sense! How can the upper part possibly execute in place if you force the output to a different size?
An in-place solution does not necessarily imply use of the In Place Element structure. All you need to avoid buffer allocations is, for example, a FOR loop with a shift register. A quick and dirty draft is included under "my in-place" in the attached benchmark (show buffer allocations and count the black dots!). I am sure it can be further improved. Of course, in a typical application only one of the transformations runs at a time; it is probably not such a good idea to run all transformations in parallel, so the outcome might not be typical.
The difference is not super dramatic (about 25% on my dual-core laptop). Most likely the compiler is able to be relatively smart about the memory allocations, but I haven't studied it in detail. This is just a very rough draft to get you started. You will also notice that your code always makes a spike when the size is changed (e.g. from 1000 to 2000), while my code does not. Most likely this involves new allocations that are then reused in subsequent iterations of the same test.
Also, all my inner FOR loops can be parallelized, probably speeding things up by the number of available CPU cores (benchmarking only one transformation at a time, of course). I have a 16-core dual E5 Xeon at work; I could try it next week. 😄
It would be interesting to look at overall memory use for the various implementations. This could be important for large images.
Just run it and watch the chart as you switch algorithms.
(Note that I replaced the constant with a control and added indicators to avoid constant folding and dead code elimination. Without this change, your benchmark code would complete in nanoseconds once debugging is disabled.)
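The shift-register idea of reusing one output buffer across calls, rather than allocating a fresh array each time, has a simple textual analog. This Python sketch (illustrative only; LabVIEW's actual allocation behavior must be checked with Show Buffer Allocations) writes the rotation into a caller-owned buffer:

```python
def rotate90_into(src, dst):
    """Clockwise 90-degree rotation written into a caller-owned buffer.

    Analog of carrying the output array in a shift register: `dst` is
    allocated once by the caller and reused on every call, so repeated
    rotations cause no new full-size allocations.
    """
    n = len(src)
    for i in range(n):
        for j in range(n):
            # Clockwise rotation maps element (i, j) to (j, n-1-i).
            dst[j][n - 1 - i] = src[i][j]
    return dst

# Allocate the buffer once, then reuse it across benchmark iterations.
buf = [[0] * 2 for _ in range(2)]
rotate90_into([[1, 2], [3, 4]], buf)
```

After the call, `buf` holds `[[3, 1], [4, 2]]`. Resizing the benchmark (e.g. 1000 to 2000) would force one new allocation of `buf`, matching the one-time spike described above.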
10-20-2012 01:47 AM
@altenbach wrote:
The difference is not super dramatic (about 25% on my dual-core laptop). Most likely the compiler is able to be relatively smart about the memory allocations, but I haven't studied it in detail. This is just a very rough draft to get you started. You will also notice that your code always makes a spike when the size is changed (e.g. from 1000 to 2000), while my code does not. Most likely this involves new allocations that are then reused in subsequent iterations of the same test.
Don't forget our hardware wizards at Intel.
The processor/OS can also be smart about handling memory and parallelizing instructions.
Never sacrifice code simplicity for the gains of lower resource usage, unless you absolutely need it or it is in a performance-critical path of your program.
Large "monolithic" LabVIEW clusters containing huge arrays, etc., are extremely slow due to the buffer handling and memory traffic.
Always keep your data small and well structured in several classes/clusters.
Then, if you run into problems with slow buffer handling, you can quite painlessly migrate your performance-critical code to DVRs, etc.
Br,
/Roger