06-09-2024 12:07 PM
Hello folks,
I am using LabVIEW for physics simulations. I typically have small data sets, but large numbers of complex calculations to solve a set of differential equations for dynamics. There is parallelism in the calculations, in that different sets of calculations are repeated on different pieces of data, and the modified data is then sent to another set of calculations. These calculations are involved, non-linear and far from trivial, but they are localized in the data space.
I would like to have multiple instances of these calculations assigned to different cores while grouped together inside a WHILE loop that synchronizes the data transfers. How does the LabVIEW G language assign cores to different blocks of code? Where can I find the rules that would let me implement this?
I have one sim that used all 32 cores of my i9 with over 50% utilization of the CPU, but I haven't been able to replicate that feat in other sims with smaller data sets. I have a new machine with an AMD Threadripper that has 128 logical cores, and I would like to use most of them in my sims, so knowing how LabVIEW does this is critical.
thanks
glenn
06-09-2024 05:15 PM
If these code instances don't have data dependencies, LabVIEW will run them in parallel as much as possible; the compiler does the hard work for you. There is no need to assign cores. Make sure to keep all UI things out of the inner loops (no indicators, controls, property nodes, local variables, etc.).
You can also parallelize FOR loops, of course.
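Since G is graphical there is no way to paste a wire diagram here, so here is only a rough text analogy in Python (the function name, array sizes and FFT workload are invented for illustration): a parallelized FOR loop behaves like mapping an independent, state-free calculation over pieces of data across a pool of workers, roughly one per core.

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def heavy_step(chunk):
    # Stand-in for one iteration's non-linear, localized calculation.
    return np.fft.fft2(chunk).real.sum()

if __name__ == "__main__":
    data = [np.random.rand(256, 256) for _ in range(32)]
    with ProcessPoolExecutor() as pool:   # roughly one worker per logical core by default
        results = list(pool.map(heavy_step, data))
    print(sum(results))

The key property is that heavy_step takes its piece of data in and returns a result, with no state carried from one iteration to the next; that is what lets the iterations spread across cores.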
@gtbennett wrote: There is parallelism in the calculations, in that different sets of calculations are repeated on different pieces of data, and the modified data is then sent to another set of calculations. These calculations are involved, non-linear and far from trivial, but they are localized in the data space.
How many different pieces of data do you have? Are all instances similar in execution time? Of course, if the "other set of calculations" requires all previous results, there is only so much that can run in parallel.
Can you be a bit more specific, or even attach some code? Are you sure your code is optimized?
There are many tricks. For example, if you transpose a 2D array and then autoindex it on a FOR loop to iterate over columns, the compiler might not do the transpose in memory, but instead mark it as transposed and just swap indices. A hard transpose (i.e. placing an Always Copy after the transpose) can, under certain conditions, speed up the FOR loop because the elements are then adjacent in memory. And so on.
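For anyone who wants to see why the memory layout matters, here is a language-neutral illustration in Python/NumPy rather than G (the array size is arbitrary); the same adjacency argument applies on the LabVIEW diagram:

import numpy as np
import timeit

a = np.random.rand(4000, 4000)            # C order: rows are contiguous in memory
lazy = a.T                                # a view: no data moved, only strides swapped
hard = np.ascontiguousarray(a.T)          # a real copy: former columns now contiguous

col_sum_lazy = lambda: [lazy[i].sum() for i in range(lazy.shape[0])]
col_sum_hard = lambda: [hard[i].sum() for i in range(hard.shape[0])]

print("strided view:", timeit.timeit(col_sum_lazy, number=5))
print("hard copy   :", timeit.timeit(col_sum_hard, number=5))

The "hard copy" version walks memory sequentially and uses the CPU cache well; the strided view jumps across memory on every element.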
You need to carefully test and see where the bottlenecks really are.
06-15-2024 03:14 PM
Thanks, a couple of questions
What exactly is a UI?
I did try to "parallelize a FOR loop, but it went 4x slower than what the compiler did by itself.
I have made most of the sub-calculations pretty equal in processor time, but many times the G compiler just throws everything onto one or two cores (probably for the overall WHILE loops); those are maxed out while I'm only using 7-9% of the CPU. I know everyone tells me to let the compiler "do the hard work", but the results are not consistent between sims, and I would love to know what exactly it is in the diagram that shows the compiler that something can be handled on a new core.
Knowing that, and knowing what forces the compiler to add more cores or not, would be extremely helpful. Right now it is very hit or miss, with a lot of misses. Does it just focus on loops? How does a shift register inside a loop play into this? And why are all of these internal rules not written down in any of the books and literature I've looked in? Also, where is the complete set of formal rules for optimization, like the one you mentioned for the transpose?
P.S.
My sims do many 2D complex FFTs, but very little other matrix manipulation.
The sims are calculators not solvers.
thanks
glenn
06-17-2024 06:30 AM
@gtbennett wrote:
Thanks, a couple of questions
What exactly is a UI?
I did try to "parallelize a FOR loop, but it went 4x slower than what the compiler did by itself.
I have made most of the sub-calculations pretty equal in processor time, but many times the G compiler just throws everything onto one or two cores (probably for the overall WHILE loops); those are maxed out while I'm only using 7-9% of the CPU. I know everyone tells me to let the compiler "do the hard work", but the results are not consistent between sims, and I would love to know what exactly it is in the diagram that shows the compiler that something can be handled on a new core. Knowing that, and knowing what forces the compiler to add more cores or not, would be extremely helpful. Right now it is very hit or miss, with a lot of misses. Does it just focus on loops? How does a shift register inside a loop play into this? And why are all of these internal rules not written down in any of the books and literature I've looked in? Also, where is the complete set of formal rules for optimization, like the one you mentioned for the transpose?
UI = User Interface, typically the Front Panel
A shift register hinders parallelism since it creates a dependency on the previous loop iteration.
If the internal calculation is small, the overhead and memory access of a parallel process make it slower.
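Two quick Python analogies of these points (not LabVIEW code; the constants and the tiny() function are arbitrary examples):

# 1) A shift register is a loop-carried dependency: iteration i needs the
#    result of iteration i-1, so iterations cannot run side by side.
state = 0.0
for x in range(1000):
    state = 0.5 * state + x        # serial by construction

# 2) If each iteration is tiny, the cost of handing it to another core
#    (scheduling, copying data, collecting results) exceeds the work itself.
from concurrent.futures import ProcessPoolExecutor

def tiny(x):
    return x * x                   # far cheaper than the dispatch overhead

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        squares = list(pool.map(tiny, range(10000)))   # typically slower than a plain loop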
Yes, there's a bunch of trial and error to optimize, plus some experience. Also, it changes a little with new versions of LabVIEW.
06-17-2024 12:01 PM
Please help me sort this out
1) A shift register hinders parallelism since it creates a dependency on the previous loop iteration.
Why does that happen, since the shift register only cares about which memory location it is talking to, not what's inside it? Or are you talking about the shift registers in a linear series of loops?
2) If the internal calculation is small, the overhead and memory access of a parallel process make it slower.
What is getting shifted in and out, given that the data sets are small? If I have enough cores for all of the subroutines and their clones, then shouldn't it just be data memory access? How does LabVIEW select cache and other short-term memory?
3) Yes, there's a bunch of trial and error to optimize, plus some experience. Also, it changes a little with new versions of LabVIEW.
I am stuck with my copy of LabVIEW 2020 SP1; I am not on a subscription. Getting through NI to someone who knows the compiler seems impossible ("We're sorry, you didn't pay for that service"). I like LabVIEW for physics sims, but how does anyone really optimize for a large number of cores (>64) without knowing what the compiler wants to see? Where is its decision structure for parallelism written down? The trial and error says I can't rely on the compiler to parse my calculation in the most efficient way. I would like to correct that.
Who can I talk to to get this class of answer?
thanks
glenn
06-17-2024 12:54 PM
I posted a thread a few years back that might be relevant to you:
What it seems to boil down to is that LabVIEW's compiler attempts to "clump" things together to run on a single thread, and it doesn't always get it right. If it doesn't get it right, there are workarounds, such as in the example where just putting code in separate 1-iteration FOR loops magically allows for parallel execution.
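As a very loose analogy only (Python, not how LabVIEW's clumper actually works internally, and branch_a/branch_b are invented placeholders): the 1-iteration FOR loop trick amounts to forcing two independent calculations to be scheduled as their own units of work instead of one serial clump.

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def branch_a(d):
    return float(np.abs(np.fft.fft2(d)).sum())

def branch_b(d):
    return float(np.sin(d).sum())

if __name__ == "__main__":
    data = np.random.rand(2048, 2048)
    with ProcessPoolExecutor() as pool:
        fa = pool.submit(branch_a, data)   # analogous to its own 1-iteration FOR loop
        fb = pool.submit(branch_b, data)
        print(fa.result(), fb.result())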
06-17-2024 07:18 PM - edited 06-17-2024 07:21 PM
There are no formal rules, but here are some other guidelines.
We could give very specific advice once you show us a simplified version of your code.
Always remember, it is not important to keep all cores busy. The upper bound on speedup is only proportional to the number of cores; sometimes more efficient code can gain you much more than that.
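To put a rough number on that upper bound, here is a quick Amdahl's-law estimate (the 10% serial fraction is just an assumed example, not a measurement of your sims):

def speedup(serial_fraction, cores):
    # Amdahl's law: 1 / (s + (1 - s) / n)
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for n in (8, 32, 128):
    print(n, "cores ->", round(speedup(0.10, n), 1), "x at best (assuming 10% serial)")

# Even 128 cores top out below 10x when 10% of the work is serial, so making
# the serial part faster often buys more than adding cores.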
06-18-2024 05:03 AM
I verified parallel loops last year with a simple VI.
Your CPU should show 100%.
06-18-2024 05:32 AM
You may also be affected by this bug: https://forums.ni.com/t5/LabVIEW/Question-about-Implicit-Multithreading/td-p/4334860
Sprinkle a few Wait (ms) functions with 0 ms around to see if the situation improves.