Watchdog Code

Brought to you by: meskes, paulcrawford
[434bcf]: / watchdog.8 Maximize Restore History

435 lines (419 with data), 15.7 kB

.TH WATCHDOG 8 "February 2019"
.UC 4
.SH NAME
watchdog \- a software watchdog daemon
.SH SYNOPSIS
.B watchdog
.RB [ \-F | \-\-foreground ]
.RB [ \-f | \-\-force ]
.RB [ \-c " \fIfilename\fR|" \-\-config\-file " \fIfilename\fR]"
.RB [ \-v | \-\-verbose ]
.RB [ \-s | \-\-sync ]
.RB [ \-b | \-\-softboot ]
.RB [ \-q | \-\-no\-action ]
.SH DESCRIPTION
The Linux kernel can reset the system if serious problems are detected.
This can be implemented via special watchdog hardware, or via a slightly
less reliable software-only watchdog inside the kernel. Either way, there
needs to be a daemon that tells the kernel the system is working fine. If the
daemon stops doing that, the system is reset.
.PP
.B watchdog
is such a daemon. It opens
.IR /dev/watchdog ,
and keeps writing to it often enough to keep the kernel from resetting,
at least once per minute. Each write delays the reboot
time another minute. After a minute of inactivity the watchdog hardware will
cause the reset. In the case of the software watchdog the ability to
reboot will depend on the state of the machines and interrupts.
.PP
The watchdog daemon can be stopped without causing a reboot if the device
.I /dev/watchdog
is closed correctly, unless your kernel is compiled with the
.I CONFIG_WATCHDOG_NOWAYOUT
option enabled.
.SH TESTS
The watchdog daemon does several tests to check the system status:
.IP \(bu 3
Is the process table full?
.IP \(bu 3
Is there enough free memory?
.IP \(bu 3
Is there enough allocatable memory?
.IP \(bu 3
Are some files accessible?
.IP \(bu 3
Have some files changed within a given interval?
.IP \(bu 3
Is the average work load too high?
.IP \(bu 3
Has a file table overflow occurred?
.IP \(bu 3
Is a process still running? The process is specified by a pid file.
.IP \(bu 3
Do some IP addresses answer to ping?
.IP \(bu 3
Do network interfaces receive traffic?
.IP \(bu 3
Is the temperature too high? (Temperature data not always available.)
.IP \(bu 3
Execute a user defined command to do arbitrary tests.
.IP \(bu 3
Execute one or more test/repair commands found in /etc/watchdog.d.  These commands are called with the argument \fBtest\fP or \fBrepair\fP.
.PP
If any of these checks fail watchdog will cause a shutdown. Should any of
these tests except the user defined binary last longer than one minute the
machine will be rebooted, too.
.PP
.SH OPTIONS
Available command line options are the following:
.TP
.BR \-v ", " \-\-verbose
Set verbose mode. Only implemented if compiled with
.I SYSLOG
feature. This
mode will log each several infos in
.I LOG_DAEMON
with priority
.IR LOG_DEBUG.
This is useful if you want to see exactly what happened until the watchdog rebooted
the system. Currently it logs the temperature (if available), the load
average, the change date of the files it checks and how often it went to sleep. You
can use this twice to enable some more verbose debug message for testing.
.TP
.BR \-s ", " \-\-sync
Try to synchronize the filesystem every time the process is awake. Note that
the system is rebooted if for any reason the synchronizing lasts longer
than a minute.
.TP
.BR \-b ", " \-\-softboot
Soft-boot the system if an error occurs during the main loop, e.g. if a
given file is not accessible via the
.BR stat (2)
call. Note that
this does not apply to the opening of
.I /dev/watchdog
and
.IR /proc/loadavg ,
which are opened before the main loop starts. Now this is implemented by disabling the
error re-try timer.
.TP
.BR \-F ", " \-\-foreground
Run in foreground mode, useful for running under systemd (for example).
.TP
.BR \-f ", " \-\-force
Force the usage of the interval given or the maximal load average given
in the config file. Without this option these values are sanity checked.
.TP
.BR \-c " \fIconfig-file\fR, " \-\-config\-file " \fIconfig-file"
Use
.I config-file
as the configuration file instead of the default
.IR /etc/watchdog.conf .
.TP
.BR \-q ", " \-\-no\-action
Do not reboot or halt the machine. This is for testing purposes. All checks
are executed and the results are logged as usual, but no action is taken.
Also your hardware card or the kernel software watchdog driver is not
enabled. NOTE: This still allows 'repair' actions to run, but the daemon
itself will not attempt a reboot.
.TP
.BR \-X " \fInum\fR, " \-\-loop\-exit " \fInum"
Run for 'num' loops then exit as if SIGTERM was received. Intended for test/debug (e.g. using
.B valgrind
for checking memory access). If the daemon exits on a loop counter and you have the
.I CONFIG_WATCHDOG_NOWAYOUT
option compiled for the kernel or device-driver then an unplanned reboot will follow - be warned!
.SH FUNCTION
After
.B watchdog
starts, it puts itself into the background and then tries all checks
specified in its configuration file in turn. Between each two tests it will
write to the kernel device to prevent a reset. After finishing all tests
watchdog goes to sleep for some time. The kernel drivers expects a write to the
watchdog device every minute. Otherwise the system will be reset.
.B watchdog
will sleep for a configure interval that defaults to 1 second to make sure it
triggers the device early enough.
.PP
Under high system load
.B watchdog
might be swapped out of memory and may fail
to make it back in in time. Under these circumstances the Linux kernel will
reset the machine. To make sure you won't get unnecessary reboots make
sure you have the variable
.I realtime
set to
.I yes
in the configuration file
.IR watchdog.conf .
This adds real time support to
.BR watchdog :
it will lock itself into memory and there should  be no problem even under the
highest of loads.
.PP
On system running out of memory the kernel will try to free enough memory by killing process. The
.B watchdog
daemon itself is exempted from this so-called out-of-memory killer.
.PP
Also you can specify a maximal allowed load average. Once this load average
is reached the system is rebooted. You may specify maximal load averages for
1 minute, 5 minutes or 15 minutes. The default values is to disable this
test. Be careful not to set this parameter too low. To set a value less then
the predefined minimal value of 2, you have to use the
.B -f
option.
.PP
You can also specify a minimal amount of virtual memory you want to have
available as free. As soon as more virtual memory is used action is taken by
.BR watchdog .
Note, however, that watchdog does not distinguish between
different types of memory usage. It just checks for free virtual memory.
.PP
If you have a machine with temperature sensor(s) you can specify the maximal
allowed temperature. Once this temperature is reached on any sensor the system
is powered off. The default value is 90 C. Typically the temperature information
is provided by the
.B sensors
package as files in the virtual filesystem /sys/device and can be found
using, for example, the command

    find /sys -name 'temp*input' -print

These files hold the temperature in milli-Celsius. You can have multiple sensors
used in the config file. For example to change to 75C maximum and to check two
virtual files for the system temperature you might have this:

    max-temperature = 75
    temperature-sensor = /sys/class/hwmon/hwmon0/device/temp1_input
    temperature-sensor = /sys/class/hwmon/hwmon0/device/temp2_input

The
.B watchdog
will issue warnings once the temperature increases 90%, 95% and 98% of
the configured maximum temperature.
.PP
When using file mode
.B watchdog
will try to
.BR stat (2)
the given files. Errors returned
by stat will
.B not
cause a reboot. For a reboot the stat call has to last at least the re-try
time-out value (default 1 minute).
This may happen if the file is located on an NFS mounted filesystem. If your
system relies on an NFS mounted filesystem you might try this option.
However, in such a case the
.I sync
option may not work if the NFS server is
not answering.
.PP
.B watchdog
can read the pid from a pid file and
see whether the process still exists. If not, action is taken
by
.BR watchdog .
So you can for instance restart the server from your
.IR repair-binary .
.PP
.B watchdog
will try periodically to fork itself to see whether the process
table is full. This process will leave a zombie process until watchdog wakes
up again and catches it; this is harmless, don't worry about it.
.PP
In ping mode
.B watchdog
tries to ping the given IPv4 addresses. These addresses do
not have to be a single machine. It is possible to ping to a broadcast
address instead to see if at least one machine in a subnet is still living.
.PP
.B Do not use this broadcast ping unless your MIS person a) knows about it and
.B b) has given you explicit permission to use it!
.PP
.B watchdog
will send out three ping packages and wait up to <interval> seconds
for the reply with <interval> being the time it goes to sleep between two
times triggering the watchdog device. Thus a unreachable network will not
cause a hard reset but a soft reboot.
.PP
You can also test passively for an unreachable network by just monitoring
a given interface for traffic. If no traffic arrives the network is
considered unreachable causing a soft reboot or action from the
repair binary.
.PP
.B watchdog
can run an external command for user-defined tests. A return code not equal 0
means an error occurred and watchdog should react. If the external command is
killed by an uncaught signal this is considered an error by watchdog too.
The command may take longer than the time slice defined for the kernel device
without a problem. However, error messages are
generated into the syslog facility. If you have enabled softboot on error
the machine will be rebooted if the binary doesn't exit in half the time
.B watchdog
sleeps between two tries triggering the kernel device.
.PP
If you specify a repair binary it will be started instead of shutting down
the system. If this binary is not able to fix the problem
.B watchdog
will still cause a reboot afterwards.
.PP
If the machine is halted an email is sent to notify a human that
the machine is going down. Starting with version 4.4
.B watchdog
will also notify the human in charge if the machine is rebooted.
.PP
The re-try timer applies to most errors, except reset/reboot calls and too hot.
It allows a given error source to recover, and treats most tests in this way.
Exceptions are file handle test, load averages, and system memory. If set to
the minimum time of 1 second it will still allow a single re-try at any polling
interval of the system.
.SH "SOFT REBOOT"
A soft reboot (i.e. controlled shutdown and reboot) is initiated for every
error that is found. Since there might be no more processes available,
watchdog does it all by himself. That means:
.IP 1. 4
Kill all processes with SIGTERM.
.IP 2. 4
After a short pause kill all remaining processes with SIGKILL.
.IP 3. 4
Record a shutdown entry in wtmp.
.IP 4. 4
Save the random seed from
.IR /dev/urandom .
If the device is non-existant or
there is no filename for saving this step is skipped.
.IP 5. 4
Turn off accounting.
.IP 6. 4
Turn off quota and swap.
.IP 7. 4
Unmount all partitions
.IP 8. 4
Finally reboot.
.SH "CHECK BINARY"
If the return code of the check binary is not zero
.B watchdog
will assume an
error and reboot the system. Be careful with this if you are using the
real-time properties of watchdog since
.B watchdog
will wait for the return of
this binary before proceeding. An exit code smaller than 245 is interpreted as an
system error code (see
.I errno.h
for details). Values of 245 or larger than are special to
.BR watchdog :
.TP
255
(based on \-1 as unsigned 8\-bit number)
Reboot the system. This is not exactly an error message but a command to
.BR watchdog .
If the return code is this the
.B watchdog
will not try to run a shutdown
script instead.
.TP
254
Reset the system. This is not exactly an error message but a command to
.BR watchdog .
If the return code is this the
.B watchdog
will attempt to hard-reset the machine without attempting any sort of orderly
stopping of process, unmounting of file systems, etc.
.TP
253
Maximum load average exceeded.
.TP
252
The temperature inside is too high.
.TP
251
.I /proc/loadavg
contains no (or not enough) data.
.TP
250
The given file was not changed in the given interval.
.TP
249
.I /proc/meminfo
contains invalid data.
.TP
248
Child process was killed by a signal.
.TP
247
Child process did not return in time.
.TP
246
Free for personal watchdog-specific use (was \-10 as an unsigned 8\-bit
number).
.TP
245
Reserved for an unknown result, for example a slow background test that is
still running so neither a success nor an error.
.SH "REPAIR BINARY"
The repair binary is started with one parameter: the error number that
caused
.B watchdog
to initiate the boot process. After trying to repair the
system the binary should exit with 0 if the system was successfully repaired
and thus there is no need to boot anymore. A return value not equal 0 tells
.B watchdog
to reboot. The return code of the repair binary should be the error
number of the error causing
.B watchdog
to reboot. Be careful with this if you
are using the real-time properties since
.B watchdog
will wait for
the return of this binary before proceeding.

The configuration file parameter
.B
repair-maximum
controls the number of successive repair attempts that report 0 (i.e. success) but
fail to clear the tested fault. If this is exceeded then a reboot takes place. If set
to zero then a reboot can always be blocked by the repair program reporting success.
.SH "TEST DIRECTORY"
Executables placed in the test directory are discovered by watchdog on
startup and are automatically executed.  They are bounded time-wise by
the test-timeout directive in watchdog.conf.

These executables are called with either "test" as the first argument
(if a test is being performed) or "repair" as the first argument (if a
repair for a previously-failed "test" operation on is being performed).

As with test binaries and repair binaries, expected exit codes for
a successful test or repair operation is always zero.

If an executable's test operation fails, the same executable is automatically
called with the "repair" argument as well as the return code of the
previously-failed test operation.

For example, if the following execution returns 42:

    /etc/watchdog.d/my-test test

The watchdog daemon will attempt to repair the problem by calling:

    /etc/watchdog.d/my-test repair 42

This enables administrators and application developers to make intelligent
test/repair commands.  If the "repair" operation is not required (or is
not likely to succeed), it is important that the author of the command
return a non-zero value so the machine will still reboot as expected.

Note that the watchdog daemon may interpret and act upon any of the reserved
return codes noted in the Check Binary section prior to calling a given
command in "repair" mode.

As for the repair binary, the configuration parameter
.B
repair-maximum
also controls the number of successive repair attempts that report success
(return 0) but fail to clear the fault.
.SH BUGS
None known so far.
.SH AUTHORS
The original code is an example written by Alan Cox
<alan@lxorguk.ukuu.org.uk>, the author of the kernel driver. All
additions were written by Michael Meskes <meskes@debian.org>. Johnie Ingram
<johnie@netgod.net> had the idea of testing the load average. He also took
over the Debian specific work. Dave Cinege <dcinege@psychosis.com> brought
up some hardware watchdog issues and helped testing this stuff.
.SH FILES
.TP
.I /dev/watchdog
The watchdog device.
.TP
.I /var/run/watchdog.pid
The pid file of the running
.BR watchdog .
.SH "SEE ALSO"
.BR watchdog.conf (5)
Watchdog Code

Branches

Tags

[434bcf]: / watchdog.8 Maximize Restore History

435 lines (419 with data), 15.7 kB