Summary
The error 'failed to keep watchdog alive' appears in the logs, accompanying a NIC flap or other outage on the system.
Overview
An Exinda appliance keeps a detailed log of system events and writes as much useful information about its behaviour as possible, so that administrators can diagnose problems and gather information for investigations. One such problem is the NICs on the device departing from their standard behaviour, either briefly (a NIC flap) or by going into bypass on their own for an extended period. After one of these events, checking the logs for the reason is the first troubleshooting step.
It is possible to see the following error message in the log at the time of the outage:
bypassed: TID 140158059296544: [bypassd.ERR] (watchdog) failed to keep watchdog alive
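To find the message in a log that has been exported as plain text, a short script can filter for it. This is a minimal sketch: the function name `find_watchdog_errors` is hypothetical, and the second sample line is an invented placeholder rather than real Exinda output.

```python
# The message text quoted in this article.
WATCHDOG_ERROR = "failed to keep watchdog alive"


def find_watchdog_errors(lines):
    """Return (line_number, text) pairs for lines containing the watchdog error."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if WATCHDOG_ERROR in line
    ]


# Sample input: the first line is the message from this article,
# the second is a hypothetical unrelated line.
sample_log = [
    "bypassed: TID 140158059296544: [bypassd.ERR] (watchdog) failed to keep watchdog alive",
    "example: unrelated line (hypothetical)",
]

hits = find_watchdog_errors(sample_log)
for number, text in hits:
    print(f"line {number}: {text}")
```

In practice the list of lines would come from reading the exported log file, for example with `open(path).readlines()`.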
Cause
The bypass process on the Exinda ('bypassd') looks after the NICs and determines whether they should go into bypass mode while the system is active. It knows the state the NICs should be in from the Configuration > System > Networks page, and it continually monitors the health of the system. If the device gets into an unstable state, it switches the NICs into bypass mode in order to prevent an outage.
The mechanism through which it does this is the System Watchdog. The watchdog expects a response from the system every second. If it does not receive one, it waits for a total of 8 seconds before triggering. If no response arrives in that time, it preemptively either switches the NICs into bypass or reboots the device (depending on the system settings), because the silence indicates that the system is failing to acknowledge the watchdog for one of several reasons:
- The device is too busy to handle the load on it (in the middle of an attack, the traffic load is too high, the RAM use on the box is too much, etc.)
- The device has locked up or frozen
- The device is in a state where it is unresponsive
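The timing described above can be illustrated with a small simulation. This is only a sketch of the general watchdog pattern, not the Exinda implementation: the function name is hypothetical, and it assumes the timer starts at t=0, resets on every response, and triggers once 8 seconds pass with no response.

```python
WATCHDOG_TIMEOUT_S = 8  # from the text: triggers after 8 seconds without a response


def watchdog_trigger_time(heartbeat_times, observe_until):
    """Return the time (seconds) at which the watchdog would trigger, or None.

    heartbeat_times: sorted timestamps at which the system responded.
    observe_until:   end of the observation window.
    Assumption: the timer starts at t=0 and resets on every response.
    """
    last_seen = 0
    for t in heartbeat_times:
        if t - last_seen >= WATCHDOG_TIMEOUT_S:
            # The gap before this response was already too long.
            return last_seen + WATCHDOG_TIMEOUT_S
        last_seen = t
    if observe_until - last_seen >= WATCHDOG_TIMEOUT_S:
        return last_seen + WATCHDOG_TIMEOUT_S
    return None


# A healthy system responds every second; the watchdog never triggers.
healthy = watchdog_trigger_time([1, 2, 3, 4, 5], observe_until=6)

# The system stops responding after t=3; the watchdog triggers at t=11.
stalled = watchdog_trigger_time([1, 2, 3], observe_until=20)

print(healthy, stalled)
```

A short stall (for example, a 5-second gap between responses) does not trigger the watchdog in this model, which is consistent with the 8-second allowance described above.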
Workaround
Preemptively rebooting the device returns the system to a clean state, so that bypassd can again contact the watchdog and keep it running as expected.
Resolution
To keep the watchdog from timing out, ensure the following on the system:
- The number of connections on the device is within the system's specifications
- The RAM use on the device is not extremely high (90-100%) and the swap is not under heavy use
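The memory checks can be expressed as a simple threshold test. This is a minimal sketch with assumptions: the function name and parameters are hypothetical, the values would have to be read from the appliance's own monitoring pages, and the 50% figure for "heavy" swap use is an arbitrary placeholder because this article does not define one.

```python
RAM_HIGH_PERCENT = 90.0  # from the text: 90-100% RAM use is considered extremely high


def resource_warnings(ram_used_mb, ram_total_mb, swap_used_mb, swap_total_mb,
                      swap_heavy_percent=50.0):
    """Return a list of warning strings for high RAM or swap use.

    swap_heavy_percent is an assumed threshold, not a documented value.
    """
    warnings = []
    ram_percent = 100.0 * ram_used_mb / ram_total_mb
    if ram_percent >= RAM_HIGH_PERCENT:
        warnings.append(f"RAM use is {ram_percent:.0f}%")
    if swap_total_mb > 0:
        swap_percent = 100.0 * swap_used_mb / swap_total_mb
        if swap_percent >= swap_heavy_percent:
            warnings.append(f"swap use is {swap_percent:.0f}%")
    return warnings


# Hypothetical readings: 950 MB of 1000 MB RAM in use, swap untouched.
print(resource_warnings(950, 1000, 0, 2000))
```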