Most efficient data transfer between RT and FPGA

Intaris · ‎02-05-2014

This post is related to THIS post about DMA overhead.

I am currently investigating the most efficient way to transfer a set of variables to an FPGA target for our application. We have been using DMA FIFOs for communications in both directions (to and from the FPGA), but I have recently started questioning whether this is the most efficient approach.

Our application must communicate several parameters (around 120 different variables in total) to the FPGA. Approximately 16 of these are critical, meaning that they must be sent every iteration of our RT control loop. The others are also important but can be sent at a slightly slower rate without jeopardising the integrity of our system. Until now we have sent these 16 critical parameters plus ONE non-critical parameter over a DMA to the FPGA card. Each 32-bit value sent incorporates an ID which allows the FPGA to demultiplex to the appropriate global variables on the FPGA. Thus, over time, a complete set of parameters is sent at approx. 200Hz (we run a 20kHz control loop on the RT system). The DMA transfers are currently a relatively large factor in limiting the execution speed of our RT loop. Of the 50us available per time-slot when running at 20kHz, approximately 12-20us are taken up by the DMA transfers to and from the FPGA target. Our FPGA loop is running at 8MHz.
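As a rough illustration of the ID-tagged transfer described above, the sketch below packs an ID and a value into one 32-bit word and estimates how often the non-critical parameters come round. The 8-bit ID / 24-bit value split and the names `pack_word` and `unpack_word` are assumptions made purely for illustration; the post does not state the actual bit layout.

```python
ID_BITS = 8                     # assumed width of the parameter ID field
VALUE_BITS = 32 - ID_BITS       # remaining bits carry the value
VALUE_MASK = (1 << VALUE_BITS) - 1


def pack_word(param_id: int, value: int) -> int:
    """Combine an ID and an unsigned value into one 32-bit DMA element."""
    if not 0 <= param_id < (1 << ID_BITS):
        raise ValueError("parameter ID does not fit in the ID field")
    if not 0 <= value <= VALUE_MASK:
        raise ValueError("value does not fit in the value field")
    return (param_id << VALUE_BITS) | value


def unpack_word(word: int) -> tuple[int, int]:
    """Split a 32-bit element back into (ID, value), as the FPGA demux would."""
    return word >> VALUE_BITS, word & VALUE_MASK


# One non-critical parameter is sent per loop iteration, so the whole
# non-critical set is refreshed once every `non_critical` iterations.
LOOP_RATE_HZ = 20_000
TOTAL_PARAMS = 120              # "around 120" in the post
CRITICAL_PARAMS = 16
non_critical = TOTAL_PARAMS - CRITICAL_PARAMS
full_set_rate_hz = LOOP_RATE_HZ / non_critical

if __name__ == "__main__":
    word = pack_word(17, 0x00ABCD)
    print(hex(word), unpack_word(word))
    print(f"full parameter set refreshed at ~{full_set_rate_hz:.0f} Hz")
```

With 104 non-critical parameters this gives a refresh rate a little under 200Hz, which is consistent with the "approx. 200Hz" quoted above.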

According to NI the most efficient way to transfer data to a FPGA target is via DMA. While this may in general be true, I have found that for SMALL amounts of data, DMA is not terribly efficient in terms of speed. Below is a screenshot of a benchmark program I have been using to test the efficiency of different types of transfer to the FPGA. In the test I create a 32MB data set (Except for the FXP values which are only present for comparison - they have no pertinence to this issue at the moment) which is sent to the FPGA over DMA in differing sized blocks (with the number of DMA writes times the array size being constant). We thus move from a single really large DMA transfer to a multitude of extremely small transfers and monitor the time taken for each mode and data type. The FPGA sends a response to the DMA transfers so that we can be sure that when reading the response DMA that ALL of the data has actually arrived on the FPGA target and is not simply buffered by the system.

We see that the minimum round-trip time for the DMA Write and subsequent DMA read for confirmation is approximately 30us. When sending less than 800 Bytes, this time is essentially constant per packet. Only when we start sending more than 800 Bytes at a time do we see an increase in the time taken per packet. A packet of 1 Byte and a packet of 800 Bytes take approximately the SAME time to transfer. Our application is sending 64 Bytes of critical information to the FPGA target each time, meaning that we are clearly in the "less efficient" region of DMA transfers.

If we compare the times taken when communicating over FP controls, we see that irrespective of how many controls we write at a time, the overall throughput is constant with a timing of 2.7us for 80 Bytes. For a small dedicated set of parameters, the usage of front panel controls seems to be significantly faster than sending per DMA. Once we need to send more than 800 Bytes, the DMA rapidly starts to become more efficient.
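The two timings quoted above can be put into a very simple model to see roughly where the break-even point lies. This is only a sketch: it assumes front-panel time scales linearly with payload size and that the DMA round trip stays flat at 30us, which the post only reports for payloads below about 800 Bytes. Note also that the DMA figure includes the confirmation read, so the comparison is not strictly like-for-like. The function names are made up for this example.

```python
DMA_ROUND_TRIP_US = 30.0        # roughly constant below ~800 bytes (from the post)
FP_US_PER_80_BYTES = 2.7        # front-panel write timing quoted in the post


def fp_time_us(n_bytes: float) -> float:
    """Front-panel transfer time, assuming it scales linearly with size."""
    return FP_US_PER_80_BYTES * n_bytes / 80.0


def dma_time_us(n_bytes: float) -> float:
    """DMA time in the flat region only; not valid above ~800 bytes."""
    return DMA_ROUND_TRIP_US


# Payload size at which both models predict the same time.
break_even_bytes = DMA_ROUND_TRIP_US * 80.0 / FP_US_PER_80_BYTES

if __name__ == "__main__":
    for n in (64, 80, 400, 800):
        print(f"{n:4d} bytes: FP {fp_time_us(n):5.2f} us, DMA {dma_time_us(n):5.2f} us")
    print(f"break-even at ~{break_even_bytes:.0f} bytes")
```

Under these assumptions the break-even lands somewhat below 900 Bytes, in the same region as the 800 Byte threshold observed in the benchmark.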

Intaris · ‎02-05-2014

So to continue:

For small data sets the usage of FP controls may be faster than DMAs. OK. But we're always told that each and every FP control takes up resources, so how much more expensive is the version with FP controls than the DMA?

According to the resource usage guide for the card I'm using (HERE) the following is true:

DMA (1023 Elements, I32, no Arbitration) : 604 Flip-Flops 733 LUT 1 Block RAM

1x I32 FP Control: 52 Flip-Flops 32 LUTs 0 Block RAM

So the comparison would seem to yield the following result (for 16 elements).

DMA : 604 Flip-Flops 733 LUT 1 Block RAM

FP : 832 Flip-Flops 512 LUT 0 Block RAM

The FP version requires more Flip-Flops, fewer LUTs and no Block RAM. It's a swings and roundabouts scenario. Depending on which resources are actually limited on the target, one version or the other may be preferred.

However, upon thinking further I realised something else. When we use the DMA, it is purely a communications channel. Upon arrival, we unpack the values and store them into global variables in order to make the values available within the FPGA program. We also multiplex other values in the DMA, so we can't simply arrange the code to be fed directly from the DMA, which would negate the need for the globals altogether. The FP controls, however, ARE already persistent data storage values and, assuming we pass the values along a wire into subVIs, we don't need additional globals in this scenario. So the burning question is "How expensive are globals?". The PDF linked to above does not explicitly mention the difference in cost between FP controls and globals, so I'll have to assume they're similar. This of course massively changes the conclusion arrived at earlier.

The comparison now becomes:

DMA + Globals : 1436 Flip-Flops 1245 LUTs 1 Block RAM

FP : 832 Flip-Flops 512 LUT 0 Block RAM
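The totals above can be reproduced with a few lines of arithmetic. The sketch below uses only the per-item figures quoted from the resource usage guide and the post's own assumption that a global costs the same as an I32 FP control; the `Resources` class is a helper invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    flip_flops: int
    luts: int
    block_ram: int = 0

    def __add__(self, other: "Resources") -> "Resources":
        return Resources(self.flip_flops + other.flip_flops,
                         self.luts + other.luts,
                         self.block_ram + other.block_ram)

    def times(self, n: int) -> "Resources":
        """Cost of n identical items."""
        return Resources(self.flip_flops * n, self.luts * n, self.block_ram * n)


# Figures quoted from the resource usage guide in the post.
DMA_FIFO = Resources(flip_flops=604, luts=733, block_ram=1)
FP_CONTROL_I32 = Resources(flip_flops=52, luts=32, block_ram=0)
# Assumption made in the post: a global costs about the same as an FP control.
GLOBAL_I32 = FP_CONTROL_I32

N = 16  # number of critical parameters

fp_only = FP_CONTROL_I32.times(N)
dma_plus_globals = DMA_FIFO + GLOBAL_I32.times(N)

if __name__ == "__main__":
    print("FP            :", fp_only)
    print("DMA + Globals :", dma_plus_globals)
```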

This seems very surprising to me. I'm suspicious of my own conclusion here. Can someone with more knowledge of the difference in resource requirements between Globals and FP controls weigh in? If this is really the case, we need to re-think our approach to communications between RT and FPGA, most likely employing a hybrid approach.

Shane.

nathand · ‎02-05-2014

I'm not sure I'm any more familiar with FPGA resource requirements than you are, but your analysis looks completely reasonable to me. I don't ever remember seeing NI state that DMA is always the most efficient way to transfer data to and from the FPGA, especially since "efficient" can be measured in a lot of different ways.

The FPGA is hardware, and it's probably useful to think of it in terms of how other hardware devices operate. Most peripherals that transfer large amounts of data use DMA, but they also have a set of registers for configuration, resulting in the sort of hybrid configuration that you describe. Writing to FPGA front-panel items is equivalent to setting register values. I haven't worked with any peripherals that use DMA for setup and configuration.

Do you actually need one front panel item per variable you want to transfer? Have you considered using a handshaking approach for the variables you don't update as frequently, where you have an enumeration or index indicating which value to transfer, a control for the actual value, and a boolean that tells the FPGA a new value is available? I would think that a global (or a register or memory block) would require less resources than a front panel item, although I don't know by how much.
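The handshaking idea can be sketched as a small host-side simulation. This is not LabVIEW code and not a description of any NI API: the three "registers" stand in for an index control, a value control and a boolean flag, and the FPGA clearing the flag as an acknowledgement is an assumption added for this sketch. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FpgaRegisters:
    """Three front-panel items shared by host and FPGA (hypothetical names)."""
    index: int = 0          # which parameter the value belongs to
    value: int = 0          # the parameter value itself
    new_data: bool = False  # set by host, cleared by FPGA once consumed


@dataclass
class FpgaModel:
    regs: FpgaRegisters = field(default_factory=FpgaRegisters)
    params: dict[int, int] = field(default_factory=dict)

    def tick(self) -> None:
        """One FPGA loop iteration: latch a pending value, then acknowledge."""
        if self.regs.new_data:
            self.params[self.regs.index] = self.regs.value
            self.regs.new_data = False


def host_write(fpga: FpgaModel, index: int, value: int) -> bool:
    """Write one parameter; returns False if the previous one is still pending."""
    if fpga.regs.new_data:
        return False
    fpga.regs.index = index
    fpga.regs.value = value
    fpga.regs.new_data = True   # set the flag last, after index and value
    return True


if __name__ == "__main__":
    fpga = FpgaModel()
    print(host_write(fpga, 42, 1234))   # accepted
    print(host_write(fpga, 43, 5678))   # rejected: 42 not yet consumed
    fpga.tick()
    print(host_write(fpga, 43, 5678))   # accepted now
    fpga.tick()
    print(fpga.params)
```

The point of the design is that only three front panel items are needed regardless of how many slow parameters exist, at the cost of sending one parameter per handshake.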

What are your goals here? Minimize FPGA space? Minimize latency? Minimize the time the main processor spends transferring values to the FPGA (one advantage of DMA)?

T-REX$ · ‎02-05-2014

So, if memory serves me correctly, FP items do use more resources than Globals... sorta.

FP items will take up extra resources if they're larger than 32 bits, because we have to make an extra copy of that data to ensure an "atomic" ( <- not really the right word there) data transfer to/from the host. Additionally, the PCI bus/controller is a shared resource amongst all the FP items, so if you're using a lot of FP items for data transfer, you'll end up adding latency to the system.

Globals on the other hand don't implement any host comm, so you'll never need an extra copy of them (for that). A global can be accessed from multiple clock domains however, so if you do that, you have to add in arbitration logic there.

Of course, a FP item with a local variable tied to it would incur arbitration if you cross clock domains as well.

Cheers!

TJ G

Intaris · ‎02-06-2014

Well, I have started testing the differences between Globals and FP controls.

I created a VI with 80x I32 Controls and Indicators. All the code does is write to an indicator for each control.

I then halved the number of Controls to 40 and observed the change in resources.

I also added 80x Globals between the Controls and Indicators (Control -> Global, Global -> Indicator) and observed the change in resources.

My VI with 80x Controls and Indicators required (All results are from the Synthesis stage of compiling) :

8679 Registers and 9092 LUTs

With 40x Controls and indicators

6116 Registers and 5235 LUTs

Difference = 2563 Registers and 3857 LUTs.

Dividing this by 40 yields an uneven number (!) of 64.075 Registers and 96.425 LUTs per control. 64 Registers are perfectly explainable, as 1x 32-bit Control and 1x 32-bit Indicator require 64 bits. The exact requirement for the LUTs requires more in-depth knowledge of the FPGA implementation than I could possibly guess at.

When we observe the VI version with 80x Controls and 80x Indicators with 80x Globals we see

11239 Registers and 9092 LUTs

Versus the version without globals, this is a change of

2560 Registers and 0 LUTs

Conclusion (Approximate cost):

X-bit FP Controls cost X Registers and 1.5X LUTs

X-bit Globals cost X Registers and 0 LUTs

So Globals are certainly cheaper than FP Controls, but only with regard to LUTs.
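The per-item costs above follow directly from the synthesis figures. The sketch below repeats that arithmetic; it assumes, as the post implicitly does, that a Control and an Indicator of the same width cost the same, so each removed Control/Indicator pair accounts for 64 bits. Variable names are made up for this example.

```python
# (registers, LUTs) reported at the synthesis stage in the post
with_80_pairs = (8679, 9092)
with_40_pairs = (6116, 5235)
with_80_pairs_and_globals = (11239, 9092)

BITS = 32

# Removing 40 Control/Indicator pairs gives the cost of one pair.
pair_regs = (with_80_pairs[0] - with_40_pairs[0]) / 40
pair_luts = (with_80_pairs[1] - with_40_pairs[1]) / 40

# Adding 80 globals gives the cost of one global.
global_regs = (with_80_pairs_and_globals[0] - with_80_pairs[0]) / 80
global_luts = (with_80_pairs_and_globals[1] - with_80_pairs[1]) / 80

# Normalise to cost per bit; a pair holds 2 * BITS bits.
fp_regs_per_bit = pair_regs / (2 * BITS)
fp_luts_per_bit = pair_luts / (2 * BITS)
global_regs_per_bit = global_regs / BITS

if __name__ == "__main__":
    print(f"per pair  : {pair_regs:.3f} registers, {pair_luts:.3f} LUTs")
    print(f"per global: {global_regs:.1f} registers, {global_luts:.1f} LUTs")
    print(f"FP per bit: {fp_regs_per_bit:.2f} registers, {fp_luts_per_bit:.2f} LUTs")
```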

Perhaps most interestingly, the numbers for an I32 do NOT match the PDF documents available on the NI website for approximate usage of FP controls and indicators, and the difference is significant.

PDF: 52 Registers and 32 LUTs

My VI: 32 Registers and 48 LUTs

So my resource comparison for my previous example (FP versus DMA and Globals) for 16x U32 is

DMA + Global = 1116 Registers and 733 LUTs

FP Controls = 512 Registers and 768 LUTs

So the FP Controls require roughly 50% fewer Registers but approx. 5% more LUTs (768 versus 733).
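For anyone wanting to recheck the comparison, the sketch below recomputes it from the approximate per-bit costs measured above (X Registers and 1.5X LUTs per X-bit FP control, X Registers and 0 LUTs per X-bit global) together with the DMA figure from the usage guide. Mixing measured and documented numbers is what the post does; the later posts suggest the DMA figure itself may differ in practice, so treat the output as indicative only.

```python
N = 16                      # number of critical parameters
BITS = 32                   # width of each parameter

DMA_REGS, DMA_LUTS = 604, 733   # DMA FIFO figures from the usage guide

# Approximate measured costs: X registers and 1.5X LUTs per X-bit FP control,
# X registers and 0 LUTs per X-bit global.
fp_regs = N * BITS
fp_luts = int(N * BITS * 1.5)
dma_regs = DMA_REGS + N * BITS
dma_luts = DMA_LUTS + 0

reg_saving = 1 - fp_regs / dma_regs      # fraction of registers saved by FP
lut_increase = fp_luts / dma_luts - 1    # fraction of extra LUTs needed by FP

if __name__ == "__main__":
    print(f"DMA + Global: {dma_regs} registers, {dma_luts} LUTs")
    print(f"FP Controls : {fp_regs} registers, {fp_luts} LUTs")
    print(f"FP saves {reg_saving:.1%} registers, needs {lut_increase:.1%} more LUTs")
```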

Intaris · ‎02-06-2014

Further testing seems to indicate that 4x FIFO (U32 Target to Host no Arbitration 16 elements) cost 1043 Registers and 1191 LUTs each. Again these numbers are quite a way away from the "official" usage statistics available on the web (533 Registers 733 LUT). Granted the stats are for LV 8.6.

Any chance NI could update the usage document with more up-to-date numbers?

Shane.

Intaris · ‎02-06-2014

Oh, and the DMA seems to use 2 Block RAM blocks, not one as claimed in the usage document.

Also, a Target to Host DMA should cost 359 Registers and 295 LUTs, not 533 and 733 as previously claimed.

Should have been 604 Registers and 733 LUTs anyway