06-24-2024 03:09 PM
I am having a strange issue with a cRIO that I can't figure out myself. The device locks up at random: sometimes minutes after starting up, sometimes hours, and sometimes days, with no apparent pattern. When the system freezes, the embedded UI becomes completely unresponsive (mouse and keyboard don't work at all). The device is not discoverable in NI MAX, and you cannot connect to it over SSH or access files via SFTP or WebDAV. Despite all of this, the device still responds to ping, so it's not 100% frozen, but most functions stop working entirely. Even with an FPGA watchdog set up, the watchdog never triggers, so the OS is never rebooted when it freezes.
I have monitored memory and CPU usage, and both are consistent and low (average CPU ~8%, and memory does not grow over time). I have gone through all of the code looking for any possible memory leak or source of memory fragmentation (dynamically building arrays/strings, etc.). I then noticed that even removing the RT application and booting to the terminal in safe mode led to the exact same result, so I'm confident it's not an issue with the application itself. The system logs also don't have any helpful information (likely because the device never fully crashes). I've reformatted the disk and installed the latest firmware and software, with the same result.
Through further testing, I've noticed that the issue doesn't appear when the device is disconnected from our corporate network. This led me to take the device home and run it on my home network, where I did not have any issues either, so the problem seems specific to running the device on our corporate network. I've talked to our IT team, and they are confident that nothing on the network could be affecting it and that the device must be a dud (even though it's brand new and works fine outside of our facility or when disconnected from the network).
I am curious if anyone has any ideas as to other things that I can check or things that I can test to help find the true root cause of this issue. I've exhausted pretty much everything that I can possibly think of at this point so any ideas would be helpful.
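Since the device keeps answering ping while SSH stops working, a ping monitor would not notice the freeze, but a TCP probe against the SSH port should. Below is a minimal sketch of such a probe, meant to run on a PC on the same network and record when the cRIO first stops accepting connections. It assumes SSH is enabled on the default port 22; the function names and the polling interval are illustrative, not anything NI provides.

```python
import socket
import time
from datetime import datetime, timezone


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def watch_for_freeze(host, port=22, interval=5.0, max_checks=None, log=print):
    """Poll host:port and return the UTC time of the first failed check.

    Returns None if max_checks probes all succeed. Port 22 assumes SSH is
    enabled on its default port (an assumption, adjust as needed).
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if not tcp_port_open(host, port):
            failed_at = datetime.now(timezone.utc)
            log(f"{failed_at.isoformat()} {host}:{port} stopped answering")
            return failed_at
        checks += 1
        time.sleep(interval)
    return None
```

Calling `watch_for_freeze("<cRIO address>")` and leaving it running would give a timestamp, accurate to within the polling interval, that can later be lined up against any network logs.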
06-24-2024 05:26 PM
Well, since you have a sorta-kinda solution (disconnecting it from the corporate network), maybe look more into that?
You don't say what the device needs to do on the network. I'm assuming it has to be connected to "something" to function, but is it an option to connect it directly to a 2nd network card on a PC instead of directly to the network? Or to put a firewall between it and the network?
Is there a chance you could set up a router to intercept and log network traffic to it? It could be that something on your corporate network is sending it a packet right before it goes inoperative that you could then isolate as the cause, and see if there's a service on the cRIO you could deactivate or move to a different port that would resolve it.
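To make the "what arrived right before it went inoperative" idea concrete, here is a rough sketch of the analysis step. It assumes the captured traffic has already been exported into simple records and that the freeze time is roughly known (for example, from the first failed connection attempt). The `PacketRecord` layout and both function names are hypothetical; a real capture would need to be converted into this shape first.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import NamedTuple


class PacketRecord(NamedTuple):
    """One captured packet, reduced to the fields used here (hypothetical layout)."""
    timestamp: datetime
    src_ip: str
    dst_port: int


def packets_before(records, freeze_time, window_s=30.0):
    """Return the records seen in the window_s seconds up to freeze_time."""
    start = freeze_time - timedelta(seconds=window_s)
    return [r for r in records if start <= r.timestamp <= freeze_time]


def count_by_source_and_port(records):
    """Count packets per (source IP, destination port) pair."""
    return Counter((r.src_ip, r.dst_port) for r in records)
```

Running this for several separate freezes and comparing the counts would show whether the same source or destination port turns up before every one of them, which is the kind of pattern worth taking back to IT.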
06-25-2024 09:18 AM
Sorry, I probably should've explained what we are doing with the device a little better. The cRIO is running thermal cycle testing: it controls a thermal chamber, driving it hot and cold and dwelling at each setpoint for the right amount of time, for a little over 24 hours per batch of units. At each hot and cold setpoint, it runs a series of tests on the units under test. It logs the temperature data to file during the entire dwell and records the results for each unit and each test in a file. It is connected to the network because the device is inside a clean room and isn't quick or easy to access or check on. With it online, we can monitor where it is in the testing process, pull the data files off the device, and start/stop testing remotely if we need to. In theory we could survive with it off the network, but being online makes everything significantly simpler.
I have very little control over the configuration of our devices, as we are operating in a secure environment. We already have the OT devices completely segregated on an OT VLAN that isn't reachable from our enterprise network. Only specific servers have an interface on both. I am having a hard time getting our IT team to help me with testing or to look into anything, as they are confident that nothing on the network could cause this.
I did install tshark on the cRIO yesterday to try monitoring on the device itself. The capture only seems to run for about a minute or so, though, before it returns a "1 packet dropped" message and stops. I'm not sure whether that is related to the issue, but I think it would be very hard to capture the potential bad packet using that method. I am considering setting up iptables on the device so that only the ports and comms that I want are allowed.
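As a starting point for the allowlist idea, here is a minimal sketch that only builds the iptables commands as text so they can be reviewed before anything is applied. The function name and the choice of ports are assumptions: port 22 is the default for SSH/SFTP, but the ports needed for NI MAX discovery, WebDAV, and the RT services are not listed here and would have to be added. It also assumes the kernel on the cRIO has the conntrack match available.

```python
from typing import List, Optional, Sequence


def build_allowlist_rules(tcp_ports: Sequence[int],
                          source_subnet: Optional[str] = None) -> List[List[str]]:
    """Build iptables commands that accept only the given inbound TCP ports.

    The ACCEPT rules come first and the default DROP policy is set last, so
    applying the commands in order does not cut off an existing SSH session.
    """
    for port in tcp_ports:
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid TCP port: {port}")

    rules = [
        # Keep loopback traffic working.
        ["iptables", "-A", "INPUT", "-i", "lo", "-j", "ACCEPT"],
        # Allow replies to connections the device itself opened.
        ["iptables", "-A", "INPUT", "-m", "conntrack",
         "--ctstate", "ESTABLISHED,RELATED", "-j", "ACCEPT"],
        # Keep ping working so basic reachability checks still pass.
        ["iptables", "-A", "INPUT", "-p", "icmp", "-j", "ACCEPT"],
    ]
    for port in tcp_ports:
        rule = ["iptables", "-A", "INPUT", "-p", "tcp"]
        if source_subnet is not None:
            rule += ["-s", source_subnet]  # optionally restrict who may connect
        rule += ["--dport", str(port), "-j", "ACCEPT"]
        rules.append(rule)
    # Everything not accepted above is dropped.
    rules.append(["iptables", "-P", "INPUT", "DROP"])
    return rules


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for rule in build_allowlist_rules([22]):
        print(" ".join(rule))
```

If the freezes stop with a rule set like this in place, re-opening ports one at a time would narrow down which service is being hit.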
I am curious if anyone has seen anything else like this. I'm not sure if virus scanning, vulnerability scanning, or something else could potentially be causing the issue.