01-30-2025 12:54 PM
I have been working on a data analysis program that uses a parallel loop to process multiple rows of data at a time. I noticed that the average iteration execution time increases significantly with each parallel instance. It is pretty complex code where a single iteration without parallelization takes about 40 seconds to execute. Increasing to even two parallel instances raises the iteration execution time to about 70 seconds.
I also get this result with simple code. I attached a sample program that gave me similar results. Using this program, I recorded data on average iteration time and total execution time vs the number of parallel instances and attached the results. I am getting 20-50% increases in iteration execution time for each parallel instance added, and the total execution time actually increases after more than 4 parallel instances on an 8 core machine. I've also tried this on a 32 core machine and get maximum performance at 8 parallel instances.
I even simplified it further to just running a basic VI in a parallel loop and still see ~50% increases in iteration time with the number of parallel instances.
I would expect some minor slowdown with adding parallel instances but this seems excessive. Any advice you have would be greatly appreciated.
01-30-2025 02:50 PM - edited 01-30-2025 02:51 PM
You did not include the function. Also, make sure to do a "save for previous, 2020 or below" before attaching so more people can have a look.
For a tight loop that does very little, the cost of splitting and reassembling the data can easily exceed the advantage of parallelization. You also did not disable debugging, and of course having an XY graph (or any indicator) inside a parallel FOR loop is just plain madness.
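The same trade-off is easy to demonstrate outside LabVIEW. Here is a rough Python sketch (an analogy only, with made-up work functions, not the poster's code): for trivial per-item work, a parallel map pays splitting, inter-process, and reassembly costs that a plain serial loop never does, so adding workers can make it slower.

```python
import multiprocessing as mp
import time

def tiny(x):
    """Almost no work per item: per-task overhead dominates."""
    return x + 1

def serial_map(fn, data):
    t0 = time.perf_counter()
    out = [fn(x) for x in data]
    return out, time.perf_counter() - t0

def parallel_map(fn, data, workers=4):
    t0 = time.perf_counter()
    # Splitting the data, shipping it to workers, and reassembling
    # the results all cost time that the serial loop avoids.
    with mp.Pool(workers) as pool:
        out = pool.map(fn, data)
    return out, time.perf_counter() - t0

if __name__ == "__main__":
    data = list(range(10_000))
    serial_out, serial_t = serial_map(tiny, data)
    parallel_out, parallel_t = parallel_map(tiny, data)
    assert serial_out == parallel_out  # identical results either way
    print(f"serial {serial_t:.4f}s  parallel {parallel_t:.4f}s")
```

With work this small, the parallel version is typically slower despite using four workers; only when each item is expensive does the split pay off.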
01-30-2025 03:07 PM - edited 01-30-2025 03:09 PM
@mrtoad wrote:
I also get this result with simple code. I attached a sample program that gave me similar results. Using this program, I recorded data on average iteration time and total execution time vs the number of parallel instances and attached the results. I am getting 20-50% increases in iteration execution time for each parallel instance added, and the total execution time actually increases after more than 4 parallel instances on an 8 core machine. I've also tried this on a 32 core machine and get maximum performance at 8 parallel instances.
Also please give the exact CPU models. Especially with newer Intel processors, we have P and E cores (some details).
I also strongly recommend using the high-resolution relative seconds. Your average is near or even below 1 ms, so a millisecond ticker is highly quantized in the individual measurements.
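To see why the millisecond ticker matters at these scales, here is a small Python illustration (a stand-in model, not LabVIEW's actual timer): truncating to whole milliseconds destroys sub-millisecond measurements, and averaging the quantized readings does not recover the truth.

```python
def ms_tick(dt_seconds):
    """Model a millisecond tick counter: truncates to whole milliseconds."""
    return int(dt_seconds * 1000)

# An iteration that really takes 0.4 ms reads as 0 ms;
# one that takes 1.4 ms reads as 1 ms (a ~29% error).
assert ms_tick(0.0004) == 0
assert ms_tick(0.0014) == 1

# Averaging many quantized readings does not help:
samples = [0.0004] * 100                      # true mean is 0.4 ms
avg = sum(ms_tick(s) for s in samples) / len(samples)
assert avg == 0.0                             # the ms ticker reports 0 on average
```

A high-resolution timer (in Python, `time.perf_counter()`) avoids this because its resolution is far below the durations being measured.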
01-30-2025 04:27 PM
Thank you for the reply. I guess I didn't put much care into making that sample program. I've attached a revised version with your recommended changes. The function was just a LabVIEW default example, but I included it as well, also saved for a previous version.
As for core specs, the 8-core machine is an Intel i7-8550U: 4 P cores and 4 virtual cores.
The 32-core machine is an Intel i9-14900: 16 E cores, 8 P cores, and 8 virtual cores.
I am still seeing the very large increases in iteration time. Also, this program times the VI execution within the loop, so it should exclude the additional overhead, which would be reflected only in the total execution time. I do not think this is just an overhead problem, because in my larger program each iteration is very computationally intensive, taking at least 40 seconds, and I still see >50% increases in iteration time with each parallel instance.
01-30-2025 06:51 PM
Sorry, I get a broken wire...
I will try to investigate later....
Virtual cores don't really do much heavy lifting, so a minimum at four cores seems about right.
You say it takes 40 seconds per iteration, but your Y axis is labeled in fractions of milliseconds. Did you get the units right?
01-30-2025 07:57 PM
Hm, that is strange... It looks fine on my end. It should just be a reference to the Quadrature Integrand VI example function. You can find another copy of it in the example folder under examples > mathematics > integration and differentiation > subVIs.
Fair enough on the 8-core machine, but shouldn't the optimum for the 32-core machine be higher than 8?
The units are correct. To clarify, I am writing a MUCH larger program that actually does heavy computing inside the parallel loop. Each iteration of the loop computes a nonlinear fit of a function containing an integral, and I use the quadrature VI to evaluate the integral. Each individual iteration can take up to 40 seconds. However, increasing the number of parallel instances to 2 increases the per-iteration execution time to about 70 seconds on the 32-core computer. This gets significantly worse as the number of parallel instances increases, and the optimum value of P for this computer is ~8.
To troubleshoot this, I made this sample program using just the quadrature VI in a parallel loop, to narrow down the problem and get some quick and easy iteration timings. These sample-program iterations are much quicker, and the attached plots were made using this program on the 8-core computer. I realize that overhead for parallel loops is more noticeable for fast loops, but this sample program shows the same behavior as the larger program: individual VI run time within the loop increases significantly (20-50%) with each parallel instance.
01-31-2025 12:20 AM - edited 01-31-2025 12:22 AM
As my first idea over breakfast coffee, I would recommend avoiding calls by reference in this loop, something like this:
Also, I replaced the math subVIs with their contents (which is very bad from an architectural point of view). Inlining should do the same, but anyway, this change gave me a performance boost (while keeping the same result, of course).
Single thread:
Two threads:
Four threads, faster by a factor of 12x:
And now the overall time decreased with more threads, as expected. Just an idea...
Project and benchmark in the attachment.
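As a loose textual analogy for the subVI-inlining point above (Python here, purely to show the mechanism; the function names are made up, not from the attached project): calling a separate function every iteration adds a fixed per-call cost that disappears when the body is pasted into the loop, while the result stays identical.

```python
import timeit

def integrand(x):
    """Stand-in for a small math subVI called once per iteration."""
    return x * x + 1.0

def sum_with_calls(n):
    """One function call per iteration, like a non-inlined subVI."""
    s = 0.0
    for i in range(n):
        s += integrand(i)
    return s

def sum_inlined(n):
    """Same math with the body inlined into the loop: no call overhead."""
    s = 0.0
    for i in range(n):
        s += i * i + 1.0
    return s

assert sum_with_calls(1000) == sum_inlined(1000)  # identical result
t_call = timeit.timeit(lambda: sum_with_calls(10_000), number=50)
t_inline = timeit.timeit(lambda: sum_inlined(10_000), number=50)
print(f"with calls {t_call:.3f}s  inlined {t_inline:.3f}s")
```

In LabVIEW the cleaner equivalent is marking the subVI as inline (Execution properties) rather than literally copying its diagram, which keeps the architecture intact.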
01-31-2025 01:53 PM
Wow, thank you so much for this! You managed to solve the problem AND boost the base speed by 2x. I am still seeing some increases in average iteration time, but now total execution time no longer increases with parallel instances at any point, even when using all cores. I will apply these changes to the larger program and let you know how it goes, but I will mark this as solved.
Out of curiosity, do you know why call by reference causes problems in parallel even when the reference is reentrant?
01-31-2025 02:13 PM
@mrtoad wrote:
Out of curiosity, do you know why call by reference causes problems in parallel even when the reference is reentrant?
Likely due to the "root loop", a single main thread that handles things such as opening VI references.
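A toy model of that serialization in Python (everything here is hypothetical; LabVIEW's actual dispatch is internal and not exposed like this): if every call-by-reference must pass through one shared dispatcher, parallel callers queue up at that point even though the callee itself is reentrant.

```python
import threading

root_lock = threading.Lock()   # stands in for the single "root loop" thread
in_dispatch = 0
max_concurrent = 0

def call_by_reference():
    """Model: the dispatch step is funneled through one shared lock."""
    global in_dispatch, max_concurrent
    with root_lock:
        in_dispatch += 1
        max_concurrent = max(max_concurrent, in_dispatch)
        # ...call setup would happen here, one caller at a time...
        in_dispatch -= 1
    # the reentrant body itself could then run in parallel

threads = [threading.Thread(target=call_by_reference) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# No matter how many parallel callers, the dispatch never overlaps:
assert max_concurrent == 1
```

Eight callers all complete, but at most one is ever inside the dispatch section, so the per-call cost scales with the number of parallel instances instead of being hidden by them.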
01-31-2025 02:40 PM
I thought root loop was only a problem with opening the reference. Is it still a problem even when the reference is opened outside the loop?