04-10-2017 01:39 PM
I am looking for some help and suggestions about a critical issue we saw on one of our CompactRIO-9068 platforms running NI Real-Time Linux. The CPU software implementation was done using the C/C++ Eclipse IDE provided by NI.
Background on SW implementation:
Issue seen while running SW:
One of the /var/log/messages files is attached to this message. Part of it is shown below for quick reference:
Apr 1 13:45:01 lc kernel: [1967263.177879] [<c0013f08>] (unwind_backtrace+0x0/0xec) from [<c03a4e64>] (dump_stack+0x20/0x24)
Apr 1 13:45:01 lc kernel: [1967263.177907] [<c03a4e64>] (dump_stack+0x20/0x24) from [<c0076870>] (__rcu_pending+0x98/0x3e8)
Apr 1 13:45:01 lc kernel: [1967263.177933] [<c0076870>] (__rcu_pending+0x98/0x3e8) from [<c00781d8>] (rcu_check_callbacks+0x194/0x24c)
Apr 1 13:45:01 lc kernel: [1967263.177964] [<c00781d8>] (rcu_check_callbacks+0x194/0x24c) from [<c003b138>] (update_process_times+0x4c/0x58)
Apr 1 13:45:01 lc kernel: [1967263.177991] [<c003b138>] (update_process_times+0x4c/0x58) from [<c005b268>] (tick_sched_timer+0xa4/0xdc)
Apr 1 13:45:01 lc kernel: [1967263.178021] [<c005b268>] (tick_sched_timer+0xa4/0xdc) from [<c004f344>] (__run_hrtimer+0x160/0x260)
Apr 1 13:45:01 lc kernel: [1967263.178046] [<c004f344>] (__run_hrtimer+0x160/0x260) from [<c0050328>] (hrtimer_interrupt+0x148/0x2b0)
Apr 1 13:45:01 lc kernel: [1967263.178069] [<c0050328>] (hrtimer_interrupt+0x148/0x2b0) from [<c00134d4>] (twd_handler+0x30/0x38)
Apr 1 13:45:01 lc kernel: [1967263.178094] [<c00134d4>] (twd_handler+0x30/0x38) from [<c0073ccc>] (handle_percpu_devid_irq+0x110/0x1c0)
Apr 1 13:45:01 lc kernel: [1967263.178124] [<c0073ccc>] (handle_percpu_devid_irq+0x110/0x1c0) from [<c006ffc0>] (generic_handle_irq+0x30/0x40)
Apr 1 13:45:01 lc kernel: [1967263.178151] [<c006ffc0>] (generic_handle_irq+0x30/0x40) from [<c000e7c4>] (handle_IRQ+0x88/0xc8)
Apr 1 13:45:01 lc kernel: [1967263.178174] [<c000e7c4>] (handle_IRQ+0x88/0xc8) from [<c0008438>] (asm_do_IRQ+0x18/0x1c)
Apr 1 13:45:01 lc kernel: [1967263.178195] [<c0008438>] (asm_do_IRQ+0x18/0x1c) from [<c000d8bc>] (__irq_svc+0x3c/0xb8)
Apr 1 13:45:01 lc kernel: [1967263.178209] Exception stack(0xdee83e58 to 0xdee83ea0)
Apr 1 13:45:01 lc kernel: [1967263.178222] 3e40: dee82020 dee83ea0
Apr 1 13:45:01 lc kernel: [1967263.178243] 3e60: de1742d8 dee82000 de1742d0 de392ac0 dee83fb0 de174280 8000000f 00000318
Apr 1 13:45:01 lc kernel: [1967263.178263] 3e80: de1742d0 dee83eb4 00000000 dee83ea0 c0022bbc c03a9ed8 a0000113 ffffffff
Apr 1 13:45:01 lc kernel: [1967263.178287] [<c000d8bc>] (__irq_svc+0x3c/0xb8) from [<c03a9ed8>] (rt_mutex_trylock+0x40/0x80)
Apr 1 13:45:01 lc kernel: [1967263.178312] [<c03a9ed8>] (rt_mutex_trylock+0x40/0x80) from [<c005fdb0>] (rt_down_read_trylock+0x3c/0x68)
Apr 1 13:45:01 lc kernel: [1967263.178337] [<c005fdb0>] (rt_down_read_trylock+0x3c/0x68) from [<c00155d4>] (do_page_fault+0x88/0x210)
Apr 1 13:45:01 lc kernel: [1967263.178361] [<c00155d4>] (do_page_fault+0x88/0x210) from [<c00083b8>] (do_PrefetchAbort+0x44/0xa8)
Apr 1 13:45:01 lc kernel: [1967263.178384] [<c00083b8>] (do_PrefetchAbort+0x44/0xa8) from [<c000dd3c>] (ret_from_exception+0x0/0x10)
Apr 1 13:45:01 lc kernel: [1967263.178398] Exception stack(0xdee83fb0 to 0xdee83ff8)
04-14-2017 10:49 AM
Hi WillsG,
Those types of kernel messages are indicative of some higher-priority loop preventing RCU callbacks from being run. Note also the times in the log that you posted: there are distinct 5-minute jumps between the various chunks of log messages. You have a lot running (with FIFO policy) on core 0, and until the upstream PREEMPT_RT patchset gets core isolation to a good place (meaning normal kernel-type work moves off the core as well), you need to allot some time on each core for kernel-type housekeeping; otherwise you'll see the RCU-related warnings that you do.
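As a rough sketch of what I mean by spreading the RT load out (the priority value, core number, and worker function here are just placeholders, not anything from your app), you can give a FIFO thread an explicit priority and pin it off core 0 at creation time:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Placeholder worker; the real application loop goes here. */
static void *worker(void *arg)
{
    (void)arg;
    /* ... bounded, periodic application work ... */
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = 30 };  /* illustrative: below the threaded IRQ handlers */
    cpu_set_t cpus;
    pthread_t tid;
    int rc;

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);

    CPU_ZERO(&cpus);
    CPU_SET(1, &cpus);                                 /* keep this loop off core 0 */
    pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);

    rc = pthread_create(&tid, &attr, worker, NULL);    /* needs privileges to use SCHED_FIFO */
    if (rc != 0)
        fprintf(stderr, "pthread_create: %s\n", strerror(rc));
    else
        pthread_join(tid, NULL);
    return 0;
}

The exact split is up to you; the point is just that whichever core runs your FIFO work still needs idle time for the kernel's own threads.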
It would be helpful to see a fuller log (what you attached is only the noise from after the original issue occurred), as well as to know what version of the kernel you're using on this controller.
04-17-2017 01:33 PM
Thank you, BradM, for providing some insight into the issue.
We are using LabVIEW RT 2013. The following output from the "uname -a" command provides the details about the kernel we are using:
admin@lc:/LC/bin# uname -a
Linux lc 3.2.35-rt52-1.0.0f1 #1 SMP PREEMPT RT Wed Jun 12 09:57:26 CDT 2013 armv7l GNU/Linux
We noticed the issue after the system recovered from the exception, during which all the /var/log/messages files were rotated, so their contents were left with only these after-issue log messages. I believe the 5-minute jumps in the exception logs are due to cron jobs running in the background. I have attached a cron-job log file here for reference. There are two cron jobs that run in the background: one runs at a 5-minute interval and the other at a 15-minute interval. The exception logs show the same timing characteristics.
Does kernel-level housekeeping run with the normal scheduling policy (SCHED_OTHER), or at a lower priority than our listed core-0 threads? In that case, I might have to change the scheduling policy for some of the threads on core 0. Any specific suggestions on this?
04-24-2017 10:04 AM
WillsG,
As I recall, for 3.2 (2013 or SP1), the threaded IRQ model used for almost all hardware runs at half of the max RT priority (50), while some other kernel threads are lower priority (and may not even use the RT scheduling classes).
You can check the priority of the various tasks on the system by mapping some of the kernel's threaded tasks to PIDs and looking them up in the /proc pseudo-filesystem (since the included ps is the fairly limited busybox implementation, you can't simply use ps -eo fields to get formatted output of the tasks running on the system). Fields 40 and 41 of the /proc/$PID/stat file give the RT priority and the scheduling policy being used for a task. On a fairly recent target (the 2016 stack, I think), I see that the rcu_callback task is priority 1, policy SCHED_FIFO. That means if you run a core at 100%, even at RT priority 2, rcu_callback won't get a chance to run, so posted writes to RCU data structures within the kernel don't complete, which produces a warning after some timeout (default is 60 seconds).
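For example, a small helper along these lines (just a sketch, not something shipped with the target) will dump those two fields for a given PID:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print rt_priority (field 40) and policy (field 41) from /proc/<pid>/stat.
 * Everything after the ')' that closes the comm field is plain
 * space-separated, so tokenise from there (comm itself may contain spaces). */
int main(int argc, char **argv)
{
    char path[64], buf[1024];
    FILE *f;
    char *p, *tok;
    int field = 3;              /* first token after the comm is field 3 (state) */
    long rtprio = -1, policy = -1;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/stat", argv[1]);

    f = fopen(path, "r");
    if (!f || !fgets(buf, sizeof(buf), f)) {
        perror(path);
        return 1;
    }
    fclose(f);

    p = strrchr(buf, ')');
    if (!p)
        return 1;
    for (tok = strtok(p + 2, " "); tok; tok = strtok(NULL, " "), field++) {
        if (field == 40) rtprio = atol(tok);
        if (field == 41) policy = atol(tok);
    }

    /* policy: 0 = SCHED_OTHER, 1 = SCHED_FIFO, 2 = SCHED_RR */
    printf("pid %s: rt_priority=%ld policy=%ld\n", argv[1], rtprio, policy);
    return 0;
}

(sched_getscheduler() and sched_getparam() will give you the same information programmatically; this just avoids needing a full-featured ps on the target.)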
Putting this all together, it makes sense that you're seeing the kernel spamming at 5-minute intervals with an offset of 1 minute: cron runs on the core that actually has some breathing room, it seems to hit an RCU-protected variable on the :05, :10, …, and a minute later the RCU stall detection times out, spamming the kernel log.
Since it seems this occurs when the app gets hung up or otherwise isn't working correctly, I'd take a nice, hard look at the loops within your C++ app to see whether one (or more) of them can hard-loop on a core. That thread (at a higher-than-rcu_callback priority) would consume all of that core's spare time, preventing the app from making progress, preventing rcu_callback from running, and starving some other important kernel threads (e.g. I think it would also keep page cache flushing from happening, meaning posted writes to disk held up in the VFS won't make it to disk).
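The usual shape of a fix is to make each RT loop sleep to an absolute deadline every cycle instead of spinning while it waits on data; roughly along these lines (the period and clock choice are just placeholders):

#define _POSIX_C_SOURCE 200809L
#include <time.h>

#define PERIOD_NS (10L * 1000 * 1000)   /* illustrative 10 ms cycle */

static void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    while (t->tv_nsec >= 1000000000L) {
        t->tv_nsec -= 1000000000L;
        t->tv_sec += 1;
    }
}

/* One FIFO worker loop: do a bounded amount of work, then sleep until the
 * next absolute deadline so lower-priority tasks (rcu_callback, flushers,
 * etc.) still get time on this core. */
void *rt_loop(void *arg)
{
    struct timespec next;
    (void)arg;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (;;) {
        /* ... one cycle of application work; no unbounded retry loops ... */

        timespec_add_ns(&next, PERIOD_NS);
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
    return NULL;     /* not reached */
}

Sleeping to an absolute deadline (TIMER_ABSTIME) also avoids the drift you'd accumulate with relative sleeps; if a loop can legitimately fall behind, handle that case explicitly rather than spinning to catch up.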
Taking another step back, this is why RT is not easy: you'll need to get your hands dirty and dig into the system, figuring out RT-safe ways to debug what's going on. Or, you know, use LabVIEW RT 🙂
04-28-2017 05:35 PM
Thank you, BradM, for the insights.
I will take another look at the threads with RT priority in my application.