
How does a parallel for loop affect the iteration terminal?


@Quiztus2 wrote:

 

It turned out that indexing works on chunks as I expected. Every instance starts with the expected index, just as it would in a plain sequential loop. The ambiguities came from my combination of test data and processing.

 

I have a region of interest for my 3D array. I use the index values instead of Array Subset + auto-indexing. I hope that I save some memory this way, since my 3D input array is several gigabytes in size.


Sorry, I cannot look at your snippet, and we also don't have any of your subVIs. Are they from a toolkit or home-made? What do they do? Are all the slow ones reentrant, or even inlined?

For a 3D array, it probably also depends on how you "slice it" during processing. It helps if, e.g., the processed planes are adjacent in memory (see comment below).

 

(For example, if you index into a just-transposed 2D array to process columns instead of rows, the compiler might decide to just swap the indices instead of doing an actual transpose first, making the elements non-adjacent. You might get a significant speedup if you do a real transpose, e.g. with an "always copy" or by using the transpose function from the linear algebra palette instead of "transpose array".)
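(LabVIEW diagrams don't paste as text, so here is the lazy-vs-real transpose idea as a rough NumPy sketch; the size is made up:)

import numpy as np

a = np.ones((5000, 5000))

# "Lazy" transpose: only the strides are swapped, no data moves.
# The elements of each row of the view are far apart in memory.
t_view = a.T

# Real transpose: forces a contiguous copy, so each row of the
# result is adjacent in memory and traverses cache-friendly.
t_copy = np.ascontiguousarray(a.T)

row_slow = t_view[0]   # non-contiguous elements, slow to walk
row_fast = t_copy[0]   # contiguous elements, fast to walk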

 

Message 11 of 25

The max number of parallel loop iterations will affect the behavior.

 

IIRC, it's 32 by default, as it is for some reason linked to a fictional max number of logical cores that seemed reasonable at the time it was made up.

 

In LabVIEW.ini:

ParallelLoop.MaxNumLoopInstances=500000

 

There doesn't need to be a limit...

 

Although it makes some sense to link the number of parallel executions to the number of logical cores (as an attempt to distribute the load evenly across them), there's no guarantee that this will happen. There's other stuff (like Windows) utilizing the cores.

 

Also, spreading CPU load is not the only use case for parallel loop execution.

 

 

Message 12 of 25

I attached a simplified VI. I have to work on each specific pixel along a set of consecutive shots. Since my first coordinate is the shot index (z) and not the pixel coordinates (x, y), I can't use auto-indexing well here.

simplified.png
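(As a rough NumPy sketch of the layout problem; the shape is hypothetical and kept small, the real data is gigabytes:)

import numpy as np

shots = np.zeros((5000, 64, 64), dtype=np.uint16)   # (z, y, x)

# Auto-indexing iterates the FIRST dimension, handing you one
# full 2D shot per iteration:
for page in shots:            # page.shape == (64, 64)
    pass

# Per-pixel processing needs the trace ALONG the shot axis
# instead, i.e. a slice across the first dimension:
trace = shots[:, 10, 20]      # pixel (10, 20) over all 5000 shots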

Message 13 of 25

You may be able to save some memory and improve speed if the "whack the mole dots" I am seeing are correct.

 

  1. After you extract your data points from the 3D array, DO NOT reshape the array.
  2. DO NOT change the array to doubles, let the multiplication handle it.
  3. The summation works on multidimensional arrays, so no need to reshape.

This may improve speed and reduce memory usage.

 

mcduff_0-1696522343135.png
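(A rough NumPy rendering of the same three points; the sizes are illustrative:)

import numpy as np

block = np.random.randint(0, 1000, size=(5000, 32, 32), dtype=np.uint16)

# 1. No reshape: the extracted subset keeps its shape.
# 2. No explicit conversion to double: multiplying by a float
#    scalar coerces in the same pass.
# 3. Sum works on the multidimensional array directly.
result = (block * 0.5).sum()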

 

Message 14 of 25

@Quiztus2 wrote:

I attached a simplified VI. I have to work on each specific pixel along a set of consecutive shots. Since my first coordinate is the shot index (z) and not the pixel coordinates (x, y), I can't use auto-indexing well here.


 

All your cluster elements are zero by default. Does the ROI correspond to "everything" or a subset of the full array? I would probably take the 3D ROI subset before the loop, then use "index array" to get data columns directly as 1D arrays.
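(Roughly, in NumPy terms; names, sizes, and ROI bounds are made up, and in LabVIEW the subset would be taken once with Array Subset:)

import numpy as np

data = np.zeros((5000, 64, 64), dtype=np.uint16)

# Take the 3D ROI subset ONCE, before the loop ...
roi = data[:, 10:30, 20:40]

# ... then pull each pixel's trace out as a 1D array inside the loop.
nz, ny, nx = roi.shape
for y in range(ny):
    for x in range(nx):
        trace = roi[:, y, x]   # 1D column along the shot axis
        # ... heavy lifting on 'trace' ...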

 

Not sure why you even enter the loop stack if there is an error. Is there anything (in the full code) in the inner loop that can generate an error?

 

Message 15 of 25

@mcduff wrote:

You may be able to save some memory and improve speed if the "whack the mole dots" I am seeing are correct.

 

  1. After you extract your data points from the 3D array, DO NOT reshape the array.

My impression was that the "some heavy lifting" code is very different in the real code and the summing is just a simplified stand-in.

 

altenbach_0-1696523559996.png

 

Still, if that's the real code, we should take the sum first and scale it later (one multiplication instead of N multiplications!). Of course, the compiler might figure it out either way. 😄

 

altenbach_0-1696524327737.png
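(The same point as a quick NumPy check; apart from last-bit rounding, both give the same result:)

import numpy as np

col = np.random.rand(5000)
c = 0.123

slow = (col * c).sum()   # N multiplications plus a temporary array
fast = col.sum() * c     # one multiplication, no temporary

assert np.isclose(slow, fast)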

 

Message 16 of 25

@altenbach wrote:

@mcduff wrote:

You may be able to save some memory and improve speed if the "whack the mole dots" I am seeing are correct.

 

  1. After you extract your data points from the 3D array, DO NOT reshape the array.

My impression was that the "some heavy lifting" code is very different in the real code and the summing is just a simplified stand-in.

 

altenbach_0-1696523559996.png

 

Still, if that's the real code, we should take the sum first and scale it later (one multiplication instead of N multiplications!). Of course, the compiler might figure it out either way. 😄

 


Just like Ivory soap, 99.99%, you are correct. But in this case, if the code is real, the pixels are in U16. You would need to convert to double to avoid potential overflow, but that still may be faster than a multi-million-point multiplication. That is why I suggested letting the multiplication do the conversion for you; allocate the memory and do the math operation at the same time.
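(The "convert and multiply in one pass" idea, sketched in NumPy; the sizes are made up:)

import numpy as np

pix = np.full(5000, 60000, dtype=np.uint16)

# Two-step version: allocate a DBL copy, then multiply it.
out1 = pix.astype(np.float64) * 0.5

# One-step version: multiplying by a float scalar converts and
# scales in a single pass, with no separate conversion step.
out2 = pix * 0.5

assert np.array_equal(out1, out2)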

Message 17 of 25

Never mind, @altenbach mentioned this. I'll keep my comment; maybe two different explanations help.


Spoiler
@mcduff wrote:

You may be able to save some memory and improve speed if the "whack the mole dots" I am seeing are correct.

 

  1. After you extract your data points from the 3D array, DO NOT reshape the array.
  2. DO NOT change the array to doubles, let the multiplication handle it.
  3. The summation works on multidimensional arrays, so no need to reshape.

This may improve speed and reduce memory usage.

 

mcduff_0-1696522343135.png

 


It is also redundant to multiply the array, especially twice.

 

Simply add all elements and then multiply the result.

 

This works because A*c + B*c = (A + B)*c...

 

So, sum the array, then multiply by the scalar 2X.

 

Message 18 of 25

@mcduff wrote:
You would need to convert to double to avoid potential overflow, but that still may be faster than a multi-million-point multiplication. That is why I suggested letting the multiplication do the conversion for you; allocate the memory and do the math operation at the same time.

Either the explicit conversion to DBL or your coercion will need to allocate the full-size DBL array to be summed later (unless the compiler can do some real magic).

 

According to the diagram comments, we typically have 5000 pages, so that's the maximum number of column elements. The worst-case sum (5000 × 65535 ≈ 3.3E8) fits comfortably in U32 (max ~4.3E9), so I would probably convert to U32, do the integer sum, then scale with a DBL.
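(A sketch of that, assuming U16 pixels and 5000 pages; the scale factor is made up:)

import numpy as np

col = np.full(5000, 65535, dtype=np.uint16)   # worst-case column

# Integer sum in U32: 5000 * 65535 = 327,675,000, far below the
# U32 maximum of 4,294,967,295, so it cannot overflow.
total = col.sum(dtype=np.uint32)

result = float(total) * 0.123   # scale once, in DBL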

 

(In fact, I would probably do it a couple of different ways and compare. You can never be sure what's best. 😄)

Message 19 of 25

I think there's a (less obvious) optimization.

 

As the OP is iterating the sum over a varying range of subsets (ROIs), depending on the numbers it might be worth making a 2D array of cumulative sums (aka the integral image, or summed-area table).

 

You'd be able to get the sum of any rectangular subset from four lookups in the integral image, if I'm not mistaken: bottom-right minus top-right minus bottom-left plus top-left (inclusion–exclusion).

 

This is a bit of work to do, but once you have the 2D integral array, getting the sums of ROIs will be blazingly fast.
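(A minimal sketch of the integral-image idea in NumPy; the image size and ROI are made up:)

import numpy as np

img = np.random.rand(512, 512)

# Build the 2D cumulative sum (integral image) once: O(N) work.
# A zero row/column of padding avoids edge cases in the lookup.
I = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
I[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)

def roi_sum(y0, y1, x0, x1):
    # Sum of img[y0:y1, x0:x1] from four corner lookups, O(1).
    return I[y1, x1] - I[y0, x1] - I[y1, x0] + I[y0, x0]

assert np.isclose(roi_sum(10, 50, 20, 80), img[10:50, 20:80].sum())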

Message 20 of 25