12-20-2023 03:58 AM
Hello NI Community,
I hope this message finds you in good health and spirits!
Goal: The main objective is to reduce the overall execution time by processing in parallel.
Attached are three snapshots, one for each step of the complete process.
Step 1: Reading Binary File + En-queuing
In this step, I read 16,640 samples of I16 data and feed them into Queue Q1 using the Enqueue Element function.
The file contains 8,272,276,480 I16 samples in total (roughly a 16 GB binary file), and my read speed is around 7000 MB/s, so reading the complete file by itself takes roughly 2.4 seconds. This runs in While Loop 1.
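As a rough illustration outside LabVIEW, a minimal Python sketch of this producer stage might look like the following (the file name is a placeholder, and queue.Queue stands in for Queue Q1 and the Enqueue Element function):

```python
import numpy as np
import queue
import threading

CHUNK_SAMPLES = 16640                 # samples per read, as described above
q1 = queue.Queue()                    # stands in for Queue Q1

def reader(path="capture.i16"):       # placeholder file name
    with open(path, "rb") as f:
        while True:
            chunk = np.fromfile(f, dtype=np.int16, count=CHUNK_SAMPLES)
            if chunk.size == 0:       # end of file
                break
            q1.put(chunk)             # "Enqueue Element"
    q1.put(None)                      # sentinel: producer finished

threading.Thread(target=reader, daemon=True).start()
```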
Step 2: Parallel Digital Down-conversion
In this step, Queue Q1 is dequeued and decimation by 2 is performed. Complex data is formed from the decimated I and Q samples. From these 16,640 samples I form four data sets of 2,080 complex samples each, discard 1,040 samples from each data set, and perform digital down-conversion. Initially I used MT Digital Downconvert Passband (Complex) from the LabVIEW Modulation Toolkit with its filter disabled; however, it cannot be run in parallel because it uses a Call Library Function Node. I therefore created a subVI implementing the DDC equation below and placed four such subVIs in parallel. Afterwards, I concatenate the outputs of all four DDCs and feed them to a downsampler with a downsample factor of 1040. The output of the downsampler is enqueued into Queue Q2. This runs in While Loop 2.
DDC equation: (I + jQ) * [cos(2*pi*Fc*t) + j*sin(2*pi*Fc*t)]
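For illustration, here is a minimal numpy sketch of this equation, assuming "decimation by 2" means deinterleaving the alternating I/Q samples (which matches the description above); the carrier offset Fc and the time base are placeholders, not values taken from the attached VIs:

```python
import numpy as np

FS = 49.92e6          # input sample rate from the post
FC = 1.0e6            # carrier offset: placeholder, not from the post

def ddc(iq_interleaved, n0=0):
    i = iq_interleaved[0::2].astype(np.float64)   # even samples -> I
    q = iq_interleaved[1::2].astype(np.float64)   # odd samples  -> Q
    x = i + 1j * q                                # form complex data
    n = n0 + np.arange(x.size)
    t = n * 2.0 / FS                  # complex-sample period assumed 2/FS
    # (I + jQ) * [cos(2*pi*Fc*t) + j*sin(2*pi*Fc*t)] == x * exp(+j*2*pi*Fc*t)
    return x * np.exp(1j * 2 * np.pi * FC * t)

# downsampling by 1040 afterwards is then just: y = ddc(chunk)[::1040]
```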
Step 3: FM Demodulation + Sound File Write
In this step, Queue Q2 is dequeued and its output is fed to MT Demodulate FM from the LabVIEW Modulation Toolkit. The demodulated output is fed to Sound File Write. This runs in While Loop 3.
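Since the internals of MT Demodulate FM are not public, a generic stand-in is the textbook phase-difference discriminator; a Python sketch might look like this (the deviation value is an assumed scaling factor, not taken from the toolkit):

```python
import numpy as np

def fm_demod(z, fs=48e3, deviation=75e3):
    # instantaneous frequency from the phase step between consecutive samples
    dphi = np.angle(z[1:] * np.conj(z[:-1]))
    return dphi * fs / (2 * np.pi * deviation)   # normalize roughly to [-1, 1]
```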
The problem is that the execution time to pass the complete file through these steps comes out to around 15 seconds. The objective is to reduce this to near 2 seconds. Your kind assistance is requested.
System Specs:
Processor: Intel Core i9-12900K (12th Gen, 24 logical processors)
RAM: 64 GB
Hard Drive: Samsung 980 Pro (7000 MB/s Seq. Read Speed and 5100 MB/s Seq. Write Speed)
Additional Information:
Input Sample Rate: 49.92 MS/s
Output Sample Rate: 48 kS/s
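(For reference, 49.92 MS/s / 48 kS/s = 1040, which is presumably where the downsample factor of 1040 in Step 2 comes from.)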
12-20-2023 12:01 PM
No full answer here, just a tip to help with diagnosis.
Your timing benchmarks for the 2nd and 3rd VIs in the queue chain include all the time spent waiting to dequeue from upstream.
I'd recommend you benchmark with a smaller total dataset that fits in memory all at once. Maybe start with 1 GB and double or halve from there. Put breakpoints (or do some functionally similar thing) on your 2nd and 3rd VIs so you can choose to start their timers *after* their queue has filled up. This may give you better insight into the time requirements of each stage.
Note: this is not a perfect method, just an easy-to-implement first step to see if one of the processing stages requires much more processing time than the others. If so, at least you have better insight where to focus first.
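A rough Python illustration of the same idea (the prefill threshold and sentinel are arbitrary choices, and, as noted, this still counts any dequeue waits that occur after the prefill drains):

```python
import time, queue

def timed_consumer(q, process, prefill=100):
    while q.qsize() < prefill:            # let the upstream stage get ahead
        time.sleep(0.001)
    t0 = time.perf_counter()              # start the stage timer only now
    while True:
        item = q.get()
        if item is None:                  # sentinel: producer finished
            break
        process(item)
    print("stage time:", time.perf_counter() - t0, "s")
```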
-Kevin P
12-20-2023 01:59 PM
A couple of minor suggestions that may help; I cannot see your VIs.
See snippets below.
12-20-2023 02:18 PM
@BabarAly1 wrote:
...
This is an interesting problem.
I would do a bit of profiling to see where the most time is spent and what you can get away with. Based on your numbers (2.4 s to read the file out of 15 s total execution time), there is ~12 s of processing time, and you want the entire thing to run in about the 2.4 s it takes to read the file. You will need to break the reading and processing of the data into pieces the computer can work on in parallel: read the first chunk (C1) of the file and start a CPU process to work on it, then read the next chunk (C2) and start a second CPU process to work on C2, and so on. The trick is finding a balance between how many processes you run and how much data you read at one time and send to each process. This also assumes the computation can be broken into independent chunks; I don't see any t-1/t+1 terms in the equation, so this should be OK.
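A sketch of this chunk-and-dispatch pattern in Python, for illustration only (chunk size and worker count are tuning knobs, and process_chunk is a placeholder for the real DDC math; in LabVIEW the equivalent would be parallel loop instances or a parallel For Loop):

```python
import numpy as np
from multiprocessing import Pool

CHUNK_BYTES = 64 * 1024 * 1024            # 64 MB per chunk: an assumption

def process_chunk(raw):
    samples = np.frombuffer(raw, dtype=np.int16)
    # ... DDC + downsampling would go here; it must not need neighbor chunks
    return samples[::1040]                 # placeholder for the real math

def run(path="capture.i16", workers=8):    # placeholder path
    with open(path, "rb") as f, Pool(workers) as pool:
        chunks = iter(lambda: f.read(CHUNK_BYTES), b"")
        # C1, C2, ... are processed in parallel while the main process reads
        return np.concatenate(list(pool.imap(process_chunk, chunks)))

# on Windows, call run() under an `if __name__ == "__main__":` guard
```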
After processing, what is the final file size?
12-20-2023 09:59 PM
The technique Jay is referring to is called data pipelining, and it will boost performance significantly when used correctly.
In the real world, your IQ samples are downsampled and decoded into audio by the radio hardware; mimicking that same signal chain in software will give you the best performance.
12-21-2023 05:29 AM
Yes, I have tried to benchmark the timings of each loop separately.
The first while loop reads from the binary file; it takes roughly 2.4 seconds to read the whole 16 GB file.
The second while loop performs digital down-conversion, concatenation, and resampling; it processes 16,640 I16 samples for 497,132 iterations in ~15 milliseconds.
The third while loop performs FM demodulation and sound file writing, which takes around 3.1 seconds for 497,132 iterations.
Now, if I run the first while loop (reading from the binary file + enqueuing) together with the second while loop (dequeuing + digital down-conversion + concatenation + downsampling), it takes ~2.8 seconds to process the complete file.
But when I add FM demodulation and Sound File Write, the overall execution slows down drastically.
I was also wondering how to determine the rate at which data is being enqueued and dequeued: how fast are the queues?
12-21-2023 05:36 AM
Yes, I have been trying to take advantage of parallel processing (multicore processing / multithreading), but I am still unable to meet the overall objective. The final size of the file is ~3884 KB.
12-21-2023 06:01 AM
Hi Babar,
@BabarAly1 wrote:
The first while loop reads from the binary file; it takes roughly 2.4 seconds to read the whole 16 GB file.
Quite good for your hardware.
@BabarAly1 wrote:
The second while loop performs digital down-conversion, concatenation, and resampling; it processes 16,640 I16 samples for 497,132 iterations in ~15 milliseconds.
It's fast, but I would reorganize the code to:
@BabarAly1 wrote:
The third while loop performs FM demodulation and sound file writing, which takes around 3.1 seconds for 497,132 iterations.
Sounds reasonable to me, especially when calling that FileWrite function ~500k times!
Maybe you can create the whole waveform first (in memory) and only write once at the end?
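As a sketch of this accumulate-then-write-once idea (Python's stdlib wave module stands in for Sound File Write; 16-bit mono at the 48 kS/s rate from the post is an assumption):

```python
import numpy as np
import wave

def write_once(blocks, path="demod.wav", fs=48000):
    audio = np.concatenate(blocks)                    # all demodulated blocks
    pcm = (np.clip(audio, -1, 1) * 32767).astype(np.int16)
    with wave.open(path, "wb") as w:
        w.setnchannels(1)                             # mono
        w.setsampwidth(2)                             # 16-bit PCM
        w.setframerate(fs)
        w.writeframes(pcm.tobytes())                  # one write, not ~500k
```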
Again I would use the timeout output of the Dequeue to control the case structure…
@BabarAly1 wrote:
I was also wondering how to determine the rate at which data is being enqueued and dequeued: how fast are the queues?
The enqueue rate is determined by your loops.
The dequeue will react as soon as there is an element in the queue…
Queues are fast…
One more suggestion:
Instead of sending 16 GB of data through queues, you could put your data blocks into DVRs and only send the DVR through the queue. The queue itself then only needs to handle a DVR reference instead of large data blocks…
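A rough analogy in Python (Python object queues already pass references, so this mainly illustrates the buffer-reuse pattern that a DVR enables in LabVIEW's by-value dataflow; the pool size and buffer length are placeholders):

```python
import numpy as np
import queue

POOL_SIZE = 32
buffers = [np.empty(16640, dtype=np.int16) for _ in range(POOL_SIZE)]
free = queue.Queue()          # indices of buffers available for reuse
full = queue.Queue()          # indices of buffers holding fresh data
for i in range(POOL_SIZE):
    free.put(i)

# producer: idx = free.get(); read file data into buffers[idx]; full.put(idx)
# consumer: idx = full.get(); process buffers[idx];            free.put(idx)
```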
12-21-2023 09:59 AM
@GerdW wrote:
...
One more suggestion:
Instead of sending 16 GB of data through queues, you could put your data blocks into DVRs and only send the DVR through the queue. The queue itself then only needs to handle a DVR reference instead of large data blocks…
Will LabVIEW let you span processes with DVRs?
12-22-2023 05:06 AM
Thank you for your kind responses. They are really helpful in understanding the problem.
I have tried to benchmark the timings separately for all three loops again, but this time I started with just one DDC.
Experiment#1
In the first while loop, I read 4,160 I16 samples and feed them to the Enqueue Function.
In the second while loop, I dequeue the 4,160 samples, discard half of them, decimate by 2, form a complex array of 1,040 samples from I and Q, and feed it to the digital down-conversion (DDC) subVI. The DDC output is fed to the downsampler (downsample factor of 1040), whose output is only one complex sample.
In this case, the first while loop takes 6833 ms and the second while loop 6934 ms for the complete 16 GB file.
Experiment#2
Now I have added the third while loop. The output of the downsampler from the second while loop goes into an Enqueue Function and is dequeued in the third while loop. I perform no other function in this loop.
Now the results are surprising to me: the first loop takes 19,393 ms, the second while loop 34,176 ms, and the third while loop 34,178 ms.
I am unable to understand this strange behavior.
Specs:
I have set the max queue size of both queues to -1 (unbounded). The subVIs are reentrant. I have disabled debugging in the main VI for the timing benchmarks.
Additionally, I have also tried using DVRs.