Overview
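This article covers the resolution for an unexpected shutdown caused by an issue in the Tigon3 (tg3) NIC driver. This driver is used by the network interface cards in some Exinda appliances.
When the issue occurs, the logs show many tg3 messages: a dump of register and status values, ending with a transmit timeout and a reset of the interface.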
kernel: [24713281.496825] tg3 0000:01:00.1: eth2: 0x00007010: 0x1a5876c2, 0x01c08073, 0x00d70081, 0x03008200
kernel: [24713281.506949] tg3 0000:01:00.1: eth2: 0x00007020: 0x00000000, 0x00000000, 0x00000406, 0x10004000
kernel: [24713281.517057] tg3 0000:01:00.1: eth2: 0x00007030: 0x000e0000, 0x0000486c, 0x00170030, 0x00000000
kernel: [24713281.527179] tg3 0000:01:00.1: eth2: 0: Host status block [00000005:00000003:(0000:0000:0000):(0000:0000)]
kernel: [24713281.538372] tg3 0000:01:00.1: eth2: 0: NAPI info [00000003:00000003:(0000:0000:01ff):0000:(00c8:0000:0000:0000)]
kernel: [24713281.550233] tg3 0000:01:00.1: eth2: 1: Host status block [00000001:00000001:(0000:0000:0000):(0000:0000)]
kernel: [24713281.561426] tg3 0000:01:00.1: eth2: 1: NAPI info [00000001:00000001:(0000:0000:01ff):0000:(0000:0000:0000:0000)]
kernel: [24713281.573289] tg3 0000:01:00.1: eth2: 2: Host status block [00000001:00000001:(0000:0000:0000):(0000:0000)]
kernel: [24713281.584483] tg3 0000:01:00.1: eth2: 2: NAPI info [00000001:00000001:(0000:0000:01ff):0000:(0000:0000:0000:0000)]
kernel: [24713281.596357] tg3 0000:01:00.1: eth2: 3: Host status block [00000001:00000001:(0000:0000:0000):(0000:0000)]
kernel: [24713281.607549] tg3 0000:01:00.1: eth2: 3: NAPI info [00000001:00000001:(0000:0000:01ff):0000:(0000:0000:0000:0000)]
kernel: [24713281.619413] tg3 0000:01:00.1: eth2: 4: Host status block [00000001:00000001:(0000:0000:0000):(0000:0000)]
kernel: [24713281.630597] tg3 0000:01:00.1: eth2: 4: NAPI info [00000001:00000001:(0000:0000:01ff):0000:(0000:0000:0000:0000)]
kernel: [24713291.298767] tg3 0000:01:00.1: eth2: transmit timed out, resetting
The tg3 fault keeps the processor core that handles this traffic busy, so the core misses the checks performed by the bypassd process. bypassd then logs that one of the processor cores took longer than the configured timeout to respond. Because the tg3 driver issue persists, the condition is eventually detected as a kernel lock. When a kernel lock is detected, bypassd reboots the device.
bypassd[23594]: TID 139700058126080: [bypassd.ERR]: Processor core 21 took more than 4 seconds to respond
bypassd[23594]: TID 139700058126080: [bypassd.ERR]: Kernel lock detected, waiting for kdump then rebooting
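To check an exported log for this pattern, the messages quoted above can be matched with a small script. The following is only a sketch: it assumes the log is available as plain text lines in the same format as the excerpts in this article, and the function names (summarize, matches_tg3_reboot_pattern) are hypothetical, not part of any Exinda tooling.

```python
import re

# Patterns are derived from the log excerpts in this article; other firmware
# versions may word these messages differently (assumption).
TG3_TIMEOUT = re.compile(
    r"tg3 (?P<pci>[0-9a-f:.]+): (?P<iface>\w+): transmit timed out"
)
CORE_SLOW = re.compile(
    r"\[bypassd\.ERR\]: Processor core (?P<core>\d+) "
    r"took more than (?P<secs>\d+) seconds to respond"
)
KERNEL_LOCK = "[bypassd.ERR]: Kernel lock detected"


def summarize(lines):
    """Collect tg3 transmit timeouts and bypassd errors from log lines."""
    summary = {"tg3_timeouts": [], "slow_cores": [], "kernel_lock": False}
    for line in lines:
        tg3 = TG3_TIMEOUT.search(line)
        if tg3:
            # Record the interface that timed out, e.g. "eth2".
            summary["tg3_timeouts"].append(tg3.group("iface"))
            continue
        slow = CORE_SLOW.search(line)
        if slow:
            # Record (core number, timeout in seconds), e.g. (21, 4).
            summary["slow_cores"].append(
                (int(slow.group("core")), int(slow.group("secs")))
            )
            continue
        if KERNEL_LOCK in line:
            summary["kernel_lock"] = True
    return summary


def matches_tg3_reboot_pattern(summary):
    """True when both a tg3 timeout and a bypassd kernel lock were logged."""
    return bool(summary["tg3_timeouts"]) and summary["kernel_lock"]
```

Passing the lines of a log file to summarize and then calling matches_tg3_reboot_pattern on the result indicates whether the log is consistent with the behavior described here. A match does not prove that the tg3 driver caused the reboot; it only shows that both symptoms appear in the same log.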
Resolution
To resolve this issue, upgrade the device to the latest Exinda firmware. The problem has been fixed in newer tg3 drivers, which are included in the latest firmware.
Priyanka Bhotika