01-30-2025 12:54 PM
I have been working on a data analysis program that uses a parallel loop to process multiple rows of data at a time. I noticed that the average iteration execution time increases significantly with each parallel instance. It is pretty complex code where a single iteration without parallelization takes about 40 seconds to execute. Increasing to even two parallel instances raises the iteration execution time to about 70 seconds.
I also get this result with simple code. I attached a sample program that gave me similar results. Using this program, I recorded data on average iteration time and total execution time vs the number of parallel instances and attached the results. I am getting 20-50% increases in iteration execution time for each parallel instance added, and the total execution time actually increases after more than 4 parallel instances on an 8 core machine. I've also tried this on a 32 core machine and get maximum performance at 8 parallel instances.
I even simplified it further to just running a basic VI in a parallel loop and still see ~50% increases in iteration time with the number of parallel instances.
I would expect some minor slowdown with adding parallel instances but this seems excessive. Any advice you have would be greatly appreciated.
01-30-2025 02:50 PM - edited 01-30-2025 02:51 PM
You did not include the function. Also, make sure to do a "save for previous, 2020 or below" before attaching so more people can have a look.
For a tight loop that does very little, the cost of splitting and reassembling the data can easily exceed the advantage of parallelization. You also did not disable debugging, and of course having an XY graph (or any indicator) inside a parallel FOR loop is just plain madness.
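The same trade-off is easy to demonstrate outside LabVIEW. Here is a rough Python sketch (an analogy only, with made-up work functions, not the poster's code): for trivial per-item work, a parallel map pays splitting, inter-process, and reassembly costs that a plain serial loop never does, so adding workers can make it slower.

```python
import multiprocessing as mp
import time

def tiny(x):
    """Almost no work per item: per-task overhead dominates."""
    return x + 1

def serial_map(fn, data):
    t0 = time.perf_counter()
    out = [fn(x) for x in data]
    return out, time.perf_counter() - t0

def parallel_map(fn, data, workers=4):
    t0 = time.perf_counter()
    # Splitting the data, shipping it to workers, and reassembling
    # the results all cost time that the serial loop avoids.
    with mp.Pool(workers) as pool:
        out = pool.map(fn, data)
    return out, time.perf_counter() - t0

if __name__ == "__main__":
    data = list(range(10_000))
    serial_out, serial_t = serial_map(tiny, data)
    parallel_out, parallel_t = parallel_map(tiny, data)
    assert serial_out == parallel_out  # identical results either way
    print(f"serial {serial_t:.4f}s  parallel {parallel_t:.4f}s")
```

With work this small, the parallel version is typically slower despite using four workers; only when each item is expensive does the split pay off.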
01-30-2025 03:07 PM - edited 01-30-2025 03:09 PM
@mrtoad wrote:
I also get this result with simple code. I attached a sample program that gave me similar results. Using this program, I recorded data on average iteration time and total execution time vs the number of parallel instances and attached the results. I am getting 20-50% increases in iteration execution time for each parallel instance added, and the total execution time actually increases after more than 4 parallel instances on an 8 core machine. I've also tried this on a 32 core machine and get maximum performance at 8 parallel instances.
Also please give the exact CPU models. Especially with newer Intel processors, we have P and E cores (some details).
I also strongly recommend using the high-resolution relative seconds. Your average is near or even below 1 ms, so a millisecond ticker is highly quantized in the individual measurements.
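To see why the millisecond ticker matters at these scales, here is a small Python illustration (a stand-in model, not LabVIEW's actual timer): truncating to whole milliseconds destroys sub-millisecond measurements, and averaging the quantized readings does not recover the truth.

```python
def ms_tick(dt_seconds):
    """Model a millisecond tick counter: truncates to whole milliseconds."""
    return int(dt_seconds * 1000)

# An iteration that really takes 0.4 ms reads as 0 ms;
# one that takes 1.4 ms reads as 1 ms (a ~29% error).
assert ms_tick(0.0004) == 0
assert ms_tick(0.0014) == 1

# Averaging many quantized readings does not help:
samples = [0.0004] * 100                      # true mean is 0.4 ms
avg = sum(ms_tick(s) for s in samples) / len(samples)
assert avg == 0.0                             # the ms ticker reports 0 on average
```

A high-resolution timer (in Python, `time.perf_counter()`) avoids this because its resolution is far below the durations being measured.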
01-30-2025 04:27 PM
Thank you for the reply. I guess I didn't put much care into making that sample program. I've attached a revised version with your recommended changes. The function was just a LabVIEW default example, but I included it as well, also saved for a previous version.
As for core specs, the 8-core machine is an Intel i7-8550U: 4 P cores and 4 virtual cores.
The 32-core machine is an Intel i9-14900: 16 E cores, 8 P cores, and 8 virtual cores.
I am still seeing the very large increases in iteration time. Also, this program times the VI execution within the loop, so it should exclude the additional overhead, which would be reflected only in the total execution time. I do not think this is just an overhead problem, because in my larger program each iteration is very computationally intensive, taking at least 40 seconds, and I still see >50% increases in iteration time with each parallel instance.
01-30-2025 06:51 PM
Sorry, I get a broken wire...
I will try to investigate later....
Virtual cores don't really do much heavy lifting, so a minimum at four cores seems about right.
You say it takes 40 seconds per iteration, but your Y axis is labeled in fractions of milliseconds. Did you get the units right?
01-30-2025 07:57 PM
Hm, that is strange... It looks fine on my end. It should just be a reference to the Quadrature Integrand VI example function. You can find another copy of it in the example folder under examples > mathematics > integration and differentiation > subVIs.
Fair enough on the 8-core machine, but shouldn't the optimum for the 32-core machine be higher than 8?
The units are correct. To clarify, I am writing a MUCH larger program that actually does heavy computing inside the parallel loop. Each iteration of the loop computes a nonlinear fit of a function containing an integral, and I use the quadrature VI to evaluate the integral. Each individual iteration can take up to 40 seconds. However, increasing the number of parallel instances to 2 increases the per-iteration execution time to about 70 seconds on the 32-core computer. This gets significantly worse as the number of parallel instances increases, and the optimum value of P for this computer is ~8.
To troubleshoot this, I made this sample program using just the quadrature VI in a parallel loop, to narrow down the problem and get some quick and easy iteration timings. These sample-program iterations are much quicker, and the attached plots were made using this program on the 8-core computer. I realize that overhead for parallel loops is more noticeable for fast loops, but this sample program shows the same behavior as the larger program: individual VI run time within the loop increases significantly (20-50%) with each parallel instance.
01-31-2025 12:20 AM - edited 01-31-2025 12:22 AM
As my first idea over breakfast coffee, I would recommend avoiding calls by reference in this loop, something like this:
Also, I replaced the math subVIs with their contents (which is very bad from an architectural point of view). Inlining should do the same, but anyway, this change gave me a performance boost (while keeping the same result, of course).
Single thread:
Two threads:
Four threads, faster by a factor of 12x:
And now the overall time decreased with more threads, as expected. Just an idea...
Project and benchmark in the attachment.
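As a loose textual analogy for the subVI-inlining point above (Python here, purely to show the mechanism; the function names are made up, not from the attached project): calling a separate function every iteration adds a fixed per-call cost that disappears when the body is pasted into the loop, while the result stays identical.

```python
import timeit

def integrand(x):
    """Stand-in for a small math subVI called once per iteration."""
    return x * x + 1.0

def sum_with_calls(n):
    """One function call per iteration, like a non-inlined subVI."""
    s = 0.0
    for i in range(n):
        s += integrand(i)
    return s

def sum_inlined(n):
    """Same math with the body inlined into the loop: no call overhead."""
    s = 0.0
    for i in range(n):
        s += i * i + 1.0
    return s

assert sum_with_calls(1000) == sum_inlined(1000)  # identical result
t_call = timeit.timeit(lambda: sum_with_calls(10_000), number=50)
t_inline = timeit.timeit(lambda: sum_inlined(10_000), number=50)
print(f"with calls {t_call:.3f}s  inlined {t_inline:.3f}s")
```

In LabVIEW the cleaner equivalent is marking the subVI as inline (Execution properties) rather than literally copying its diagram, which keeps the architecture intact.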
01-31-2025 01:53 PM
Wow, thank you so much for this! You managed to solve the problem AND boost the base speed by 2x. I am still seeing some increases in average iteration time, but now total execution time no longer increases with parallel instances at any point, even when using all cores. I will apply these changes to the larger program and let you know how it goes, but I will mark this as solved.
Out of curiosity, do you know why call by reference causes problems in parallel even when the reference is reentrant?
01-31-2025 02:13 PM
@mrtoad wrote:
Out of curiosity, do you know why call by reference causes problems in parallel even when the reference is reentrant?
Likely due to the "root loop", a single main thread that handles things such as opening VI references.
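A toy model of that serialization in Python (everything here is hypothetical; LabVIEW's actual dispatch is internal and not exposed like this): if every call-by-reference must pass through one shared dispatcher, parallel callers queue up at that point even though the callee itself is reentrant.

```python
import threading

root_lock = threading.Lock()   # stands in for the single "root loop" thread
in_dispatch = 0
max_concurrent = 0

def call_by_reference():
    """Model: the dispatch step is funneled through one shared lock."""
    global in_dispatch, max_concurrent
    with root_lock:
        in_dispatch += 1
        max_concurrent = max(max_concurrent, in_dispatch)
        # ...call setup would happen here, one caller at a time...
        in_dispatch -= 1
    # the reentrant body itself could then run in parallel

threads = [threading.Thread(target=call_by_reference) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# No matter how many parallel callers, the dispatch never overlaps:
assert max_concurrent == 1
```

Eight callers all complete, but at most one is ever inside the dispatch section, so the per-call cost scales with the number of parallel instances instead of being hidden by them.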
01-31-2025 02:40 PM
I thought root loop was only a problem with opening the reference. Is it still a problem even when the reference is opened outside the loop?