LabVIEW


TCP/IP Issue

Hello, I have a small issue with TCP/IP. I am hesitant to post work VIs, so I'll just try to describe the problem.

 

Basically, the two VIs are essentially identical, and work perfectly, sending and receiving data on our two sbRIOs.

 

When you unexpectedly stop and restart one of them, it manages to pick up the connection again.

 

Do the same to the other one, however, and it is unable to reacquire the connection: the VI on the sending sbRIO gets stuck on error 56, and the one reading on the other reports error 60.

 

Yet the reading and writing procedures of both are exactly the same, so I'm stumped as to why this is happening. Another thing: once this happens, I have to close LabVIEW completely to clear all the connections; closing and reopening the VI doesn't work, nor does waiting a whole minute.

 

On the one-minute point: if you close and reopen the VI on the first sbRIO, the other sbRIO picks it up without even waiting 60 seconds. But the code is the same.

 

For reference, the first sbRIO is a 9636, and the second is a 9637.

 

Any tips as to why this might be happening would be appreciated, so I can try to implement a fix.

 

Thank you

Message 1 of 6

This sounds like a problem on the sbRIO side, where your server (the code using TCP Create Listener) doesn't properly detect that the connection was lost and keeps trying to service the broken connection. Your server most likely wasn't written in a way that can serve multiple parallel connections (which isn't necessary either), but when it doesn't detect that the connection was closed and keeps trying to service the now-defunct connection, it can never go back to the listen state to pick up a new incoming connection.

Aside from changing your server code to support parallel connections (most likely a complicated re-architecture of your code, and therefore not something I would recommend at this stage), you need to make sure the server can detect that the connection went stale. The first step is to use any error that you see when your server loop waits for new incoming data on the connection. A timeout error 56 from TCP Read is obviously not catastrophic, because it usually just means there was no data from the other side to retrieve. Any other error, however, is a clear indication that you should close this connection immediately and return to the listen state to pick up a new incoming connection. And as long as listen doesn't return with success, you simply stay in the listen state.
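The server pattern described above (tolerate timeout errors, treat any other error as fatal, then fall back to listening) can't be shown as LabVIEW G code in text, but a rough Python sketch of the same loop may help; the port number and the echo behavior are invented for illustration, not taken from the poster's VIs:

```python
import socket

def serve_one(conn):
    """Service one connection until any non-timeout error occurs."""
    conn.settimeout(0.25)             # like the timeout input of TCP Read
    while True:
        try:
            data = conn.recv(4096)
        except socket.timeout:
            continue                  # error-56 equivalent: no data yet
        except OSError:
            return                    # any other error: abandon the connection
        if not data:
            return                    # peer closed the connection
        conn.sendall(data)            # echo back, standing in for the protocol

def run_server(host="", port=6340):   # the port is an arbitrary assumption
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(1)
    while True:                       # always return to the listen state
        conn, _addr = listener.accept()
        try:
            serve_one(conn)
        finally:
            conn.close()              # close before accepting the next client

# run_server()  # uncomment to actually start serving
```

The key point is the outer loop: as soon as serve_one gives up on a connection for any reason, the connection is closed and the server is back in accept(), ready for the peer to reconnect.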

In case you haven't done so, the same applies to the client side, only usually even more rigorously. Unless you have a (weird) TCP protocol where the server can send unsolicited data that you need to be able to pick up, ANY error from any of the TCP Read and TCP Write functions is an indication to immediately close that connection and attempt to open a new one. That means: send a command (TCP Write), wait for the response (TCP Read), and if there was any error in these two, do a TCP Close followed by a new TCP Open Connection.
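The client-side rule (any write/read error means close, reconnect, retry) could be sketched like this in Python; the transact helper, the back-off delay, and the command/response framing are all assumptions for the sake of the example:

```python
import socket
import time

def transact(sock, command):
    """Send one command (TCP Write) and read one response (TCP Read)."""
    sock.sendall(command)
    reply = sock.recv(4096)
    if not reply:
        raise ConnectionError("peer closed the connection")
    return reply

def client_loop(host, port, commands):
    """Run each command; on ANY error, close, reconnect, and retry."""
    sock, replies = None, []
    for command in commands:
        while True:
            try:
                if sock is None:
                    sock = socket.create_connection((host, port), timeout=2.0)
                replies.append(transact(sock, command))
                break
            except OSError:
                if sock is not None:
                    sock.close()      # TCP Close ...
                sock = None           # ... followed by a fresh TCP Open Connection
                time.sleep(0.2)       # brief back-off before reconnecting
    if sock is not None:
        sock.close()
    return replies
```

Because the reconnect happens inside the per-command retry loop, a server that drops the connection mid-session is picked up again transparently on the next command.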

There is a small chance that, despite doing everything on the server side as explained above, TCP Read never returns an error other than 56. The TCP connection state model can be a bit involved, and the TCP stack on the VxWorks-based targets in particular is sometimes not completely foolproof about picking up those state changes when a connection goes stale because of a removed network cable or similar. In that case the only possible solution is to have your server count the number of times TCP Read returns with a timeout error and, after a certain number of times without any successful transfer, close the connection anyway. If the client is written properly, it will pick up the dropped connection by TCP Write returning error 66 (connection closed by the peer); but any other error from TCP Write or TCP Read is also an indication that something went wrong, and then it needs to close the connection too, and reconnect.
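In Python terms, that timeout-counting watchdog might look like the following; the 40-read limit (about 10 seconds of silence at a 250 ms read timeout) is an arbitrary assumption you would tune to your protocol:

```python
import socket

MAX_IDLE_READS = 40   # assumption: 40 timeouts x 0.25 s = 10 s of silence

def read_with_watchdog(conn):
    """Read from conn, treating too many consecutive timeouts as a dead link.

    Returns the received bytes, or None when the caller should close.
    """
    conn.settimeout(0.25)             # like the timeout input of TCP Read
    idle = 0
    while True:
        try:
            data = conn.recv(4096)
        except socket.timeout:        # the error-56 case: not fatal by itself
            idle += 1
            if idle >= MAX_IDLE_READS:
                return None           # assume the peer is gone; caller closes
            continue
        except OSError:
            return None               # any other error: definitely broken
        if not data:
            return None               # peer closed the connection cleanly
        return data                   # real data arrived
```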

Rolf Kalbermatter
My Blog
Message 2 of 6

I didn't run this into the ground, but I had a similar issue.

 

The "server" (listener) on an sbRIO wouldn't detect the unplugging of the Ethernet cable. I didn't log everything, but it appeared to happily keep the connection open, returning error 56 - the error that I expected and was ignoring.

 

Now I count these errors and, when they reach too many, just close the connection. The client can keep the connection alive with a regular message and/or be written such that a closed connection is detected and the client automatically attempts to reconnect (which is the way my client was written anyway).

 

It is a bummer that unplugging the cable is not detected on the listener - it really seems like it should throw *some* error other than a normal timeout!

 

Thanks for the tip rolfk!

Message 3 of 6

Why should it throw any other type of error? We're not dealing with the hardware layer here. The TCP layer is too high to detect a hardware issue, I think. There's probably a way you can do it with LabVIEW - I haven't needed to research this - but not through TCP.

Bill
CLD
(Mid-Level minion.)
My support system ensures that I don't look totally incompetent.
Proud to say that I've progressed beyond knowing just enough to be dangerous. I now know enough to know that I have no clue about anything at all.
Humble author of the CLAD Nugget.
Message 4 of 6

Well, I don't know enough to say anything too smart here - however, it just seemed to me that the hardware knows when a cable is unplugged, and since at some level the TCP reference I'm using in the LabVIEW layer ties back to the hardware layer, I hoped there would be a notification/error.

Message 5 of 6

There usually is, but not a direct one. There are several driver levels between your network interface controller and the TCP/IP interface in LabVIEW. Each of them abstracts a lot of things away to make it more manageable. Once the data arrives at the TCP/IP level, there is no notion of a specific network interface anymore; it's only a logical interface, and this level has no easy way of determining the actual hardware status of that interface. Yes, it could try to access it, but that would violate the idea of isolating the different layers from each other, and would possibly require this level to implement different ways of determining that status depending on the involved hardware, which breaks the whole idea of separating the driver levels from each other as much as possible.

 

 

The TCP/IP protocol does use internal status messages, such as acknowledges for each segment, resend requests, and quite a few others that you never see at the application level (but that you can observe in network analyzer traces, for instance in Wireshark). However, TCP/IP was designed to work across very low-speed connections of only a few bytes per second (think dial-up) and very high latencies (think satellite links or interplanetary communication), while also allowing high-speed connections, not the other way around. It should also continue to work if a line is interrupted but another route is available that may go all around the world, over multiple satellite links and whatever else. So making these checks wait only a few milliseconds before considering a connection broken is simply not an option. Instead, they use timeouts in the range of 60 or more seconds with multiple retry attempts, so after maybe 5 minutes the network stack will usually consider such a connection broken indeed. That may sound silly nowadays, with 100 Mb/s data links that have 99% uptime even in residential homes, but TCP/IP is an ARPA project that was designed for military applications, where 100 Mb/s was never a design criterion; communication in an environment that was heavily damaged and destroyed was.

 

You can influence some of these parameters with socket options (which LabVIEW does not give you any easy way to access, and which can also vary between platform implementations). The most notorious one is the Nagle algorithm, for which there exist some VIs that can disable it on a network connection. For your problem case, though, Nagle does basically nothing, so that won't help.
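For comparison, in a plain BSD-socket environment (Python here) these options are a setsockopt call away; on LabVIEW targets you would need the special VIs mentioned above. This is only meant to illustrate what such VIs do under the hood, not a fix for the sbRIO issue:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Disable the Nagle algorithm: small writes go out immediately instead of
# being coalesced while waiting for the previous segment's acknowledge.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Enable TCP keepalive probes, so the stack itself eventually notices a
# dead peer; probe timing is tuned by OS-specific options where available.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
```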

Rolf Kalbermatter
My Blog
Message 6 of 6