08-01-2018 03:12 PM
Hello all. We have an application in the field that has been running fine for 8 months or so. This application consists of one PXI-8880RT communicating with two daisy-chained 9144 chassis over EtherCAT. A week ago the system started experiencing random Scan Engine faults.
Each time, the Scan Engine faults with code -66460 (I/O scan time exceeded). When we reset the fault, the scan engine immediately recovers and is no worse for wear until the next fault. The scan engine period is set to 1ms, and the -66460 error code uses the default "unconfigured" setting (on the Scan Engine page under the properties window).
Typically the scan engine faults every few hours or so, but it can also go days between faulting. CPU and memory usage on the PXI are reasonable (<20% cpu usage on all cores, most cores at 0%) and no memory leaks are seen. Again this application has been running for 8 months without issues. We also tried to reboot the PXI several times after the problem started but that didn't fix the issue.
We have also checked all of the associated EtherCAT cables to verify they are OK.
Does anyone have any ideas or insight? Anything I could probe to help find the cause of this issue?
08-02-2018 04:20 PM
08-02-2018 07:05 PM
Hi there!
That is indeed a very interesting situation. May I ask if there were any changes a week (or more) ago in your system? Perhaps an update in your code, or in the version of software packages on the computer? If there was any variance in your system (however slight), it may give us something to go off of.
08-03-2018 05:04 PM
Trevor, thanks for the reply. Having a variance or change in the system would obviously help troubleshoot this problem, but unfortunately we've racked our brains and just can't think of any difference.
The hardware is all the same (cables, etc). No new equipment was installed nearby. We even checked the incoming power and couldn't find any issues.
The software is also unchanged. The PXI was originally configured with LabVIEW 2017 and associated drivers back when the project was developed and then commissioned, but it hasn't been touched since January 2018. And the system had been running in the 8 months since.
One extra piece of info: when I went to the jobsite and debugged the PXI code via "Operate -> Debug Application or shared library" (code was deployed with debugging enabled), navigating and probing the block diagrams was very slow. We tried rebooting the PXI but the slowness remained. For example when trying to probe a wire that was within a timed loop sync'd to the scan engine it would occasionally hang LabVIEW and require a force close on the development computer. Note the PXI cpu usage was very low at this point seen via Distributed System Manager.
After updating the scan engine period to 10ms the debugging immediately became fast again and we didn't have any further issues. So something related to the scan engine was hanging or stalling at 1ms, but again the PXI cpu usage didn't seem to be outrageous. We do have timers in our code to measure the execution speed and noticed some jitter when updating at 1ms, but the jitter decreased significantly when the period was changed to 10ms. For reference our main timed loop was taking ~0.1ms to execute on average, so we figured a 1ms update rate wasn't pushing anything.
The system has been running for almost two days now since changing to 10ms, so whatever was failing seems to have been fixed by the slower update rate.
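For anyone who wants to quantify that kind of loop jitter offline, here is a minimal sketch in Python. It assumes you have logged the start time of each timed-loop iteration (in ms) to a file and loaded it as a list; the function name, the `tolerance_ms` parameter, and the sample data are all hypothetical and not part of our application or of any NI API.

```python
import statistics

def jitter_stats(timestamps_ms, period_ms, tolerance_ms=0.25):
    """Summarize loop timing from logged iteration start times (ms).

    timestamps_ms: iteration start times, in ascending order (assumed logged
                   by the application; hypothetical input).
    period_ms:     configured scan period, e.g. 1 or 10.
    tolerance_ms:  how far past the period an interval may run before it is
                   counted as late (an arbitrary illustrative threshold).
    """
    # Time between consecutive iterations
    intervals = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    # Deviation of each interval from the configured period
    deviations = [iv - period_ms for iv in intervals]
    return {
        "mean_interval_ms": statistics.mean(intervals),
        "stdev_ms": statistics.pstdev(intervals),
        "max_abs_jitter_ms": max(abs(d) for d in deviations),
        "late_iterations": sum(1 for d in deviations if d > tolerance_ms),
    }

if __name__ == "__main__":
    # Made-up sample: a 1 ms loop with one iteration that ran 0.5 ms late
    sample = [0.0, 1.0, 2.0, 3.5, 4.5]
    print(jitter_stats(sample, period_ms=1.0))
```

Comparing the output for logs taken at 1ms and at 10ms would show whether the worst-case interval, rather than the average, is what approaches the scan period.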
08-05-2018 09:16 PM - last edited on 12-03-2024 05:59 PM by Content Cleaner
After reading all the descriptions, I think the temperature must be the only change to your system. The whole world is having a very hot summer now.
Referring to the manual, can you confirm that the PXIe-8880 controller is still working within the expected "Operating Environment" (0 - 50 °C)? High temperature will cause CPU frequency variation, which will increase the jitter of the RT.
My suggestion:
1. Ensure the sponge of the chassis is in a healthy state.
2. Ensure the chassis is well sealed in the unused slot.
3. Turn the fans on High.
4. In the BIOS, disable the following CPU configuration which will impact the jitter of the RT: Hyper-Threading, Turbo Boost, C-States, and Intel VT-d.
Hope these will help.
08-06-2018 11:17 AM
08-07-2018 05:11 PM - last edited on 12-03-2024 05:59 PM by Content Cleaner
I can confirm that either the slave devices or the master could be the bottleneck for the Scan Engine time needed. There are a couple of benchmarks we have for 9144 performance in this regard.
08-17-2018 10:46 AM
Unfortunately the problem re-occurred last night. It had run successfully for over a week with the slower (10ms) update time, but again the EtherCAT link went down.
We did check the PXI filter but it is very clean. Again the PXI is in an air-conditioned lab (20C) so it has consistent environmental conditions. When the system went down the entire lab was cool (it has cooled off here in the past week, so ambient temps are <22C).
Our EtherCAT cables are wired in the following way:
8880RT (local processor Port configured for EtherCAT) -> 9144 #1 -> 9144 #2
Cables have been re-crimped and checked for tight bends, but everything looks good. The first 9144 chassis is fully populated, the other is half populated so we shouldn't be pushing I/O rates too hard at 10ms. The PXI shows no memory leak or high cpu usage. And the system did run without any issues for 8-9 months before the problems started to occur.
Are there any advanced diagnostics we can examine to figure out why the EtherCAT ring is having problems so intermittently?
08-17-2018 12:54 PM - last edited on 12-03-2024 06:00 PM by Content Cleaner
@shansen1 wrote:
....
Are there any advanced diagnostics we can examine to figure out why the EtherCAT ring is having problems so intermittently?
This KB seems to outline how to use Wireshark to monitor EtherCAT.
Please keep us updated if you go that route and find something. I would be interested in what you find, and what was involved in filtering for the failing condition.
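As a rough idea of what filtering for the failing condition could look like once frames have been captured, here is a small Python sketch that pulls the working counter out of each datagram in a raw EtherCAT frame. It is not taken from the KB. The field layout reflects my understanding of the EtherCAT frame format (EtherType 0x88A4, 2-byte EtherCAT header, 10-byte datagram header, 2-byte working counter after the data) and should be checked against what Wireshark's dissector shows. The function names and the `expected_wkc` value are hypothetical, and VLAN-tagged frames are not handled.

```python
import struct

ETHERTYPE_ECAT = 0x88A4  # EtherType used by EtherCAT frames

def parse_ecat_datagrams(frame: bytes):
    """Return [(command, index, data_len, working_counter), ...] for one frame.

    `frame` is a raw Ethernet frame (no VLAN tag, no FCS needed).
    Non-EtherCAT or malformed frames yield an empty list.
    """
    if len(frame) < 16:
        return []
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_ECAT:
        return []
    # EtherCAT header is little-endian: bits 0-10 length, bits 12-15 type
    (hdr,) = struct.unpack_from("<H", frame, 14)
    total_len = hdr & 0x07FF
    if (hdr >> 12) != 1:  # type 1 = EtherCAT datagrams
        return []
    offset = 16
    end = min(16 + total_len, len(frame))
    datagrams = []
    while offset + 12 <= end:
        # cmd, idx, 32-bit address, length/flags, IRQ (10 bytes total)
        cmd, idx, _addr, len_flags, _irq = struct.unpack_from("<BBIHH", frame, offset)
        data_len = len_flags & 0x07FF
        more_follow = bool(len_flags & 0x8000)
        wkc_offset = offset + 10 + data_len
        if wkc_offset + 2 > end:
            break  # truncated datagram
        (wkc,) = struct.unpack_from("<H", frame, wkc_offset)
        datagrams.append((cmd, idx, data_len, wkc))
        offset = wkc_offset + 2
        if not more_follow:
            break
    return datagrams

def wkc_mismatches(frames, expected_wkc):
    """Yield (frame_number, datagram) where the working counter is unexpected.

    expected_wkc is an assumption supplied by the caller; the correct value
    depends on the network configuration and command type.
    """
    for n, frame in enumerate(frames):
        for dg in parse_ecat_datagrams(frame):
            if dg[3] != expected_wkc:
                yield n, dg
```

The frames would have to be captured on the returning side of the segment for the working counter to be meaningful, and the raw bytes could come from any capture export; that part is left out here.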
May God smile on your efforts (since I cannot help you beyond that link)
Ben
02-03-2020 02:30 PM
I'm seeing this now with the cRIO-9068 and a string of Beckhoff EtherCAT slaves, with a 10 ms scan engine period. If I just set error -66460 to "ignore" in the scan engine config, will the scan engine recover from these intermittent errors?