04-24-2013 02:16 AM - last edited on 08-14-2024 09:32 AM by Content Cleaner
Dear altenbach, I expect all eight logical processors (= threads) to be engaged whenever the algorithm allows for it. That is how I understand hyper-threading: 4 physical cores, each with 2 threads = 8 virtual processing units. That is also what the http://www.ni.com/white-paper/3558/en#toc7 article states in its very first paragraph.
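For what it's worth, this is also what the operating system reports. A minimal C sketch (assuming a POSIX-like system; _SC_NPROCESSORS_ONLN is a common extension in glibc and macOS, not strict POSIX):

```c
/* Minimal sketch: ask the OS how many logical processors it sees.
 * On 4 physical cores with hyper-threading enabled, this prints 8. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long logical = sysconf(_SC_NPROCESSORS_ONLN);  /* logical CPUs online */
    printf("Logical processors visible to the OS: %ld\n", logical);
    return 0;
}
```

G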
04-24-2013 02:44 AM
@altenbach wrote:
If you turn off parallelization, how much slower is it?
I tried to switch off parallelisation (inspired by http://www-w2k.gsi.de/controls/CS/How-To/cs_multithreading.htm, see the "How and why to turn multithreading off" section), but I could not find any such option in LabVIEW 2012. I then decided to enable only a single core in the BIOS options, but the start-up of Win7 64-bit took more than four times longer than with all 4 cores enabled. I reverted the BIOS setting after watching the "please wait..." boot screen for 20 minutes. Any hint on where to switch off parallelisation in LabVIEW only, please? G
04-24-2013 04:47 AM
altenbach wrote:
If you turn off parallelization, how much slower is it?
I think altenbach was referring to turning off the parallelism on your "for loops".
Go to: Tools>>Profile>>Find Parallelizable Loops...
You will get a list of all "for loops" and their parallelized status.
Unfortunately you can't change the settings here.
Double-click a "for loop" in the list to go to its block diagram and change the setting there.
There is a refresh button to update the list status.
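For comparison, text languages expose the same per-loop switch. A hedged OpenMP sketch in C (not LabVIEW; the harmonic-series loop is just dummy work):

```c
/* Toggle parallelism per loop, analogous to LabVIEW's per-loop
 * Iteration Parallelism dialog. Build parallel: gcc -fopenmp demo.c
 * Build serial: omit -fopenmp (the pragma is then simply ignored). */
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    /* delete the pragma, or write num_threads(1), to turn
       parallelization of this one loop off */
    #pragma omp parallel for reduction(+:sum) num_threads(4)
    for (long i = 1; i <= 100000000L; i++)
        sum += 1.0 / (double)i;     /* dummy work: harmonic series */
    printf("sum = %f\n", sum);
    return 0;
}
```

steve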
steve
04-24-2013 07:46 AM
@ghighuphu wrote:
I expect that all the eight logical processors (= threads) will be engaged, when the algorithm allows for it.
Hi,
I guess your algorithm does not utilize all eight cores. You should adapt it to a multicore PC.
Do a pretty simple test - put 8 while loops without any delays and see how many cores will be utilized. Then you should see something like this (in my case I have 2 CPUs, each with 6 cores, with hyper-threading enabled - therefore 24 while loops):
As you can see - overall 100% CPU load.
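If you want the same test outside LabVIEW, a rough C analogue (a sketch of mine, not a VI) looks like this:

```c
/* One free-running busy loop per logical processor; the CPU meter should
 * climb to ~100% on all cores. Build: gcc -pthread busy.c  Stop: Ctrl+C. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_LOOPS 8   /* one per logical processor; 24 on my machine */

static void *busy(void *arg)
{
    (void)arg;
    volatile unsigned long x = 0;  /* volatile: loop is not optimized away */
    for (;;) x++;                  /* no delay, so the thread pins a core */
}

int main(void)
{
    pthread_t t[NUM_LOOPS];
    for (int i = 0; i < NUM_LOOPS; i++)
        pthread_create(&t[i], NULL, busy, NULL);
    printf("Running %d busy loops; watch the CPU meter.\n", NUM_LOOPS);
    pause();                       /* main sleeps; the loops keep spinning */
    return 0;
}
```

With NUM_LOOPS at or above your logical processor count, every core stays pegged until you kill the process.
Andrey.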
Andrey.
04-24-2013 08:38 AM - edited 04-24-2013 08:45 AM
Thank you Andrey! It was a very easy check of my PC's capabilities. All eight virtual cores are alive and running 🙂 OK, so I have to rethink the algorithm and its possibilities. (I was sure it was supposed to use all possible cores. Obviously, it does not...)
04-24-2013 08:38 AM - edited 04-24-2013 08:39 AM
@ghighuphu wrote:
Next I ran the "4 Calculate N Digits of Pi.vi" with N set to 10000 (ten thousand). The result was that only four of the eight cores were engaged. Do you get similar or different results, please? KR, M
OK, I had a glance at the code and it is NOT optimized for multiple processors at all. What makes you think it is???
There is a place where 4 reentrant "series" subVIs are called in parallel, so there will be some mild parallelization with an upper limit of 4x.
Some simple profiling shows that significant effort is spent in the "powers of two" subVI. Once you disable debugging and inline it, this code will be constant-folded and take essentially zero time, speeding up the calculation dramatically. There are quite a few other places where significant optimization is possible. Please try. I have not.
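The same folding effect exists in any compiled text language. A loose C analogue (the helper is a hypothetical stand-in, not the actual "powers of two" subVI):

```c
/* Loose analogue of "inline it and it gets folded": with the helper
 * visible as static inline and called with a constant argument, the
 * optimizer folds the whole call to the constant 1024 at compile time
 * (check with gcc -O2 -S: the loop is gone from the assembly). */
#include <stdio.h>

static inline unsigned long power_of_two(unsigned int n)
{
    unsigned long p = 1;
    for (unsigned int i = 0; i < n; i++)
        p *= 2;                    /* deliberately naive loop */
    return p;
}

int main(void)
{
    printf("2^10 = %lu\n", power_of_two(10));  /* folds to a constant at -O2 */
    return 0;
}
```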
04-24-2013 08:43 AM
@altenbach wrote:
OK, I had a glance at the code and it is NOT optimized for multiple processors at all. What makes you think it is???
There is a place where 4 reentrant "series" subVIs are called in parallel, so there will be some mild parallelization with an upper limit of 4x.
I had not thought it through that far. It was a coincidence that the four re-entrant VIs engaged four of my processors.
04-24-2013 09:09 AM
OK, I did a benchmark on my old non-hyperthreaded 4-core machine (Intel Q9300) before and after my 2-minute modifications mentioned above, and here are the results for 10000 digits:
stock: 304 seconds, ~75% CPU utilization
my modification: 43 seconds, ~88% CPU utilization
As you can see, a few trivial changes can speed things up by more than a factor of 7, and this is only the tip of the iceberg salad! We can tell that in the stock implementation, a huge percentage of the CPU is just pumping hot air (thread swapping and other overhead) instead of doing real work. There is no telling what's possible if the entire thing is rearchitected from scratch for multiprocessor use. There is no upper limit, because the problem can be split into an arbitrarily large number of separate calculations. Try it!
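To make the splitting point concrete in a text language, here is a hedged C sketch (the Leibniz series for pi/4 is my stand-in, not the VI's actual algorithm): the series is cut into one independent slice per logical processor, so the same code scales to any core count.

```c
/* Split a series into one independent slice per logical processor.
 * Build: gcc -O2 -pthread slices.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define TERMS 100000000UL          /* total series terms */

typedef struct { unsigned long first, step; double sum; } Slice;

static void *slice_sum(void *arg)
{
    Slice *s = (Slice *)arg;
    for (unsigned long k = s->first; k < TERMS; k += s->step)
        s->sum += (k % 2 ? -1.0 : 1.0) / (2.0 * k + 1.0);  /* Leibniz term */
    return NULL;
}

int main(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);   /* one slice per logical CPU */
    pthread_t *t = malloc((size_t)n * sizeof *t);
    Slice *s = malloc((size_t)n * sizeof *s);
    double total = 0.0;

    for (long i = 0; i < n; i++) {            /* interleave terms across slices */
        s[i] = (Slice){ .first = (unsigned long)i, .step = (unsigned long)n, .sum = 0.0 };
        pthread_create(&t[i], NULL, slice_sum, &s[i]);
    }
    for (long i = 0; i < n; i++) {
        pthread_join(t[i], NULL);
        total += s[i].sum;
    }
    printf("%ld slices, pi ~= %.10f\n", n, 4.0 * total);
    free(t); free(s);
    return 0;
}
```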
04-24-2013 09:13 AM
@Andrey_Dmitriev wrote:
Do a pretty simple test - put 8 while loops without any delays and see how many cores will be utilized. Then you should see something like this (in my case I have 2 CPUs, each with 6 cores, with hyper-threading enabled - therefore 24 while loops):
... and don't forget to keep the fire extinguisher nearby. 😄
04-24-2013 10:21 AM - edited 04-24-2013 10:41 AM
altenbach wrote:
stock: 304 seconds, ~75% CPU utilization
my modification: 43 seconds, ~88% CPU utilization
This also tells you that core utilization is NOT a useful tool to assess the quality of the code. The goal should be to finish the task quickest with the least CPU effort. The only measure that counts is the elapsed time to achieve the task. If you have inefficient parallel code that burns 100% of all CPUs, and a better serial algorithm that gets the same result in 10% of the time on a single core, you should go with the latter.
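A minimal way to measure that number in C (a sketch assuming POSIX clock_gettime; the dummy loop stands in for whichever version you are benchmarking):

```c
/* The number that counts: wall-clock elapsed time, not CPU-meter percent. */
#include <stdio.h>
#include <time.h>

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);   /* monotonic wall clock */
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

int main(void)
{
    double t0 = now_seconds();

    volatile double sink = 0.0;            /* ...task under test goes here... */
    for (long i = 0; i < 100000000L; i++)
        sink += 1.0;

    printf("elapsed: %.3f s\n", now_seconds() - t0);
    return 0;
}
```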
Maxing out all cores at 100% is never the primary goal. It also has disadvantages, such as placing high demands on the computer's thermal management and impacting everything else you are trying to do on the computer at the same time. For a quick test, run Andrey's code above, then try to browse the web or watch a YouTube clip. 😮