06-09-2024 12:07 PM
Hello folks,
I am using LabVIEW for physics simulations. I typically have small data sets, but large numbers of complex calculations to solve a set of differential equations for dynamics. There is parallelism in the calculations, in that different sets of calculations are repeated on different pieces of data, and the modified data is then sent to another set of calculations. These calculations are involved, non-linear and far from trivial, but they are localized in the data space.
I would like to have multiple instances of these calculations assigned to different cores while grouped together inside a WHILE loop that synchronizes the data transfers. How does the LabVIEW G language assign cores to different blocks of code? Where can I find the rules that would let me implement this?
I have one sim that used all 32 cores of my i9 with over 50% utilization of the CPU, but I haven't been able to replicate that feat in other sims with smaller data sets. I have a new machine with an AMD Threadripper that has 128 logical cores, and I would like to use most of them in my sims, so knowing how LabVIEW does this is critical.
thanks
glenn
06-09-2024 05:15 PM
If these code instances don't have data dependencies, LabVIEW will run them in parallel as much as possible; the compiler does the hard work for you. There is no need to assign cores. Make sure to keep all UI things out of the inner loops (no indicators, controls, property nodes, local variables, etc.).
You can also parallelize FOR loops, of course.
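Since G is graphical there is no way to paste a wire diagram here, so here is only a rough text analogy in Python (the function name, array sizes and FFT workload are invented for illustration): a parallelized FOR loop behaves like mapping an independent, state-free calculation over pieces of data across a pool of workers, roughly one per core.

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def heavy_step(chunk):
    # Stand-in for one iteration's non-linear, localized calculation.
    return np.fft.fft2(chunk).real.sum()

if __name__ == "__main__":
    data = [np.random.rand(256, 256) for _ in range(32)]
    with ProcessPoolExecutor() as pool:   # roughly one worker per logical core by default
        results = list(pool.map(heavy_step, data))
    print(sum(results))

The key property is that heavy_step takes its piece of data in and returns a result, with no state carried from one iteration to the next; that is what lets the iterations spread across cores.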
@gtbennett wrote: There is parallelism in the calculations, in that different sets of calculations are repeated on different pieces of data, and the modified data is then sent to another set of calculations. These calculations are involved, non-linear and far from trivial, but they are localized in the data space.
How many different pieces of data do you have? Are all instances similar in execution time? Of course, if the "other set of calculations" requires all previous results, there is only so much that can run in parallel.
Can you be a bit more specific, or even attach some code? Are you sure your code is optimized?
There are many tricks. For example, if you transpose a 2D array and then autoindex it on a FOR loop to iterate over columns, the compiler might not do the transpose in memory, but instead mark it as transposed and just swap indices. A hard transpose (i.e. placing an Always Copy after the transpose) can, under certain conditions, speed up the FOR loop because the elements are then adjacent in memory. And so on.
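For anyone who wants to see why the memory layout matters, here is a language-neutral illustration in Python/NumPy rather than G (the array size is arbitrary); the same adjacency argument applies on the LabVIEW diagram:

import numpy as np
import timeit

a = np.random.rand(4000, 4000)            # C order: rows are contiguous in memory
lazy = a.T                                # a view: no data moved, only strides swapped
hard = np.ascontiguousarray(a.T)          # a real copy: former columns now contiguous

col_sum_lazy = lambda: [lazy[i].sum() for i in range(lazy.shape[0])]
col_sum_hard = lambda: [hard[i].sum() for i in range(hard.shape[0])]

print("strided view:", timeit.timeit(col_sum_lazy, number=5))
print("hard copy   :", timeit.timeit(col_sum_hard, number=5))

The "hard copy" version walks memory sequentially and uses the CPU cache well; the strided view jumps across memory on every element.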
You need to carefully test and see where the bottlenecks really are.
06-15-2024 03:14 PM
Thanks, a couple of questions
What exactly is a UI?
I did try to "parallelize a FOR loop, but it went 4x slower than what the compiler did by itself.
I have made most of the sub-calculations pretty equal in processor time, but many times the G compiler just throws everything onto one or two cores (probably for the overall WHILE loops); those are maxed out while I'm only using 7-9% of the CPU. I know everyone tells me to let the compiler "do the hard work", but the results are not consistent between sims, and I would love to know what exactly it is in the diagram that shows the compiler that something can be handled on a new core.
Knowing that, and knowing what forces the compiler to add more cores or not, would be extremely helpful. Right now it is very hit or miss, with a lot of misses. Does it just focus on loops? How does a shift register inside a loop play into this? And why are all of these internal rules not written down in any of the books and literature I've looked in? Also, where is the complete set of formal rules for optimization, like the one you mentioned for the transpose?
P.S.
My sims do many 2D complex FFTs, but very little other matrix manipulation.
The sims are calculators not solvers.
thanks
glenn
06-17-2024 06:30 AM
@gtbennett wrote:
Thanks, a couple of questions
What exactly is a UI?
I did try to "parallelize a FOR loop, but it went 4x slower than what the compiler did by itself.
I have made most of the sub-calculations pretty equal in processor time, but many times the G compiler just throws everything onto one or two cores (probably for the overall WHILE loops); those are maxed out while I'm only using 7-9% of the CPU. I know everyone tells me to let the compiler "do the hard work", but the results are not consistent between sims, and I would love to know what exactly it is in the diagram that shows the compiler that something can be handled on a new core. Knowing that, and knowing what forces the compiler to add more cores or not, would be extremely helpful. Right now it is very hit or miss, with a lot of misses. Does it just focus on loops? How does a shift register inside a loop play into this? And why are all of these internal rules not written down in any of the books and literature I've looked in? Also, where is the complete set of formal rules for optimization, like the one you mentioned for the transpose?
UI = User Interface, typically the Front Panel
A shift register hinders parallelism since it creates a dependency on the previous loop iteration.
If the internal calculation is small, the overhead and memory access of a parallel process make it slower.
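Two quick Python analogies of these points (not LabVIEW code; the constants and the tiny() function are arbitrary examples):

# 1) A shift register is a loop-carried dependency: iteration i needs the
#    result of iteration i-1, so iterations cannot run side by side.
state = 0.0
for x in range(1000):
    state = 0.5 * state + x        # serial by construction

# 2) If each iteration is tiny, the cost of handing it to another core
#    (scheduling, copying data, collecting results) exceeds the work itself.
from concurrent.futures import ProcessPoolExecutor

def tiny(x):
    return x * x                   # far cheaper than the dispatch overhead

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        squares = list(pool.map(tiny, range(10000)))   # typically slower than a plain loop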
Yes, there's a bunch of trial and error to optimize, plus some experience. Also, it changes a little with new versions of LabVIEW.
06-17-2024 12:01 PM
Please help me sort this out
1) A shift register hinders parallelism since it creates a dependency on the previous loop iteration.
Why does that happen, since the shift register only cares about which memory location it is talking to, not what's inside it? Or are you talking about the shift registers in a linear series of loops?
2) If the internal calculation is small, the overhead and memory access of a parallel process make it slower.
What is getting shifted in and out, given that the data sets are small? If I have enough cores for all of the subroutines and their clones, then shouldn't it just be data memory access? How does LabVIEW select cache and other short-term memory?
3) Yes, there's a bunch of trial and error to optimize, plus some experience. Also, it changes a little with new versions of LabVIEW.
I am stuck with my copy of LabVIEW 2020 SP1; I am not on a subscription. Getting through NI to someone who knows the compiler seems impossible ("We're sorry, you didn't pay for that service"). I like LabVIEW for physics sims, but how does anyone really optimize for a large number of cores (>64) without knowing what the compiler wants to see? Where is its decision structure for parallelism written down? The trial and error says I can't rely on the compiler to parse my calculation in the most efficient way. I would like to correct that.
Who can I talk to to get this class of answer?
thanks
glenn
06-17-2024 12:54 PM
I posted a thread a few years back that might be relevant to you:
What it seems to boil down to is that LabVIEW's compiler attempts to "clump" things together to run on a single thread, and it doesn't always get it right. If it doesn't get it right, there are workarounds, such as in the example where just putting code in separate 1-iteration FOR loops magically allows for parallel execution.
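As a very loose analogy only (Python, not how LabVIEW's clumper actually works internally, and branch_a/branch_b are invented placeholders): the 1-iteration FOR loop trick amounts to forcing two independent calculations to be scheduled as their own units of work instead of one serial clump.

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def branch_a(d):
    return float(np.abs(np.fft.fft2(d)).sum())

def branch_b(d):
    return float(np.sin(d).sum())

if __name__ == "__main__":
    data = np.random.rand(2048, 2048)
    with ProcessPoolExecutor() as pool:
        fa = pool.submit(branch_a, data)   # analogous to its own 1-iteration FOR loop
        fb = pool.submit(branch_b, data)
        print(fa.result(), fb.result())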
06-17-2024 07:18 PM - edited 06-17-2024 07:21 PM
There are no formal rules, but here are some other guidelines.
We could give very specific advice once you show us a simplified version of your code.
Always remember, it is not important to keep all cores busy. The upper bound on speedup is only proportional to the number of cores; sometimes more efficient code can gain you much more than that.
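To put a rough number on that upper bound, here is a quick Amdahl's-law estimate (the 10% serial fraction is just an assumed example, not a measurement of your sims):

def speedup(serial_fraction, cores):
    # Amdahl's law: 1 / (s + (1 - s) / n)
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for n in (8, 32, 128):
    print(n, "cores ->", round(speedup(0.10, n), 1), "x at best (assuming 10% serial)")

# Even 128 cores top out below 10x when 10% of the work is serial, so making
# the serial part faster often buys more than adding cores.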
06-18-2024 05:03 AM
I verified parallel loops last year with a simple VI.
Your CPU should show 100%.
06-18-2024 05:32 AM
You may also be affected by this bug: https://forums.ni.com/t5/LabVIEW/Question-about-Implicit-Multithreading/td-p/4334860
Sprinkle a few Wait (ms) functions with 0 ms around to see if the situation improves.