How to fix and diagnose "ERROR: There are # expired checks" on check_distribution Service

If the check_distribution service reports an error, or if “mon node status” shows expired checks, there are several possible causes.

Example of service error:

ERROR: There are 11 expired checks

Possible cause: The responsible node is experiencing high load

Before doing anything else, use a process monitoring tool such as top, htop, or iotop to check system load. This rules out the simplest explanation: the checks are expiring because the node cannot keep up with the number of checks assigned to it. If the node is experiencing high load, consider adding more capacity or redistributing your checks across your cluster. See the OP5 documentation for details on how to do this. If you need assistance, please contact support.
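
As a quick first look before opening top, the load average can be compared with the number of CPUs. The following is a minimal sketch, not an OP5 tool: it assumes a Linux node with /proc/loadavg and nproc available, the function name is_overloaded is made up for illustration, and "load above CPU count" is only a common rule of thumb, not an OP5 threshold.

```shell
#!/bin/sh
# is_overloaded LOAD CPUS
# Succeeds (exit status 0) when LOAD is greater than CPUS.
# Assumption: "load above CPU count" is a rule of thumb, not an OP5 limit.
is_overloaded() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { exit !(load + 0 > cpus + 0) }'
}

# Linux only: /proc/loadavg starts with the 1-, 5- and 15-minute load averages.
if [ -r /proc/loadavg ]; then
    read -r load1 load5 load15 rest < /proc/loadavg
    cpus=$(nproc)
    if is_overloaded "$load5" "$cpus"; then
        echo "5-minute load $load5 exceeds $cpus CPUs: the node may be overloaded"
    else
        echo "5-minute load $load5 is within $cpus CPUs"
    fi
fi
```

A sustained high value is only a hint. Use top, htop, or iotop to see which processes or disks are responsible.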

Possible cause: active checks may be disabled on some nodes

When running a distributed solution, expired checks can result from diverging values of the attribute “active_checks_enabled”. The configuration should be identical across the cluster, but there are also runtime settings that decide whether checks should run, and these can diverge between nodes. To investigate whether this is the cause, download and run the script mon_node_output_parse_diff.pl (Author: Jonatan Sundeen).

Running the script without arguments prints its usage text:

perl mon_node_output_parse_diff.pl
Usage:
my-program <input-file-name>
Use case to check services active_checks_enabled
Get data for service checks
## mon node ctrl --self --all "mon query ls services -c host_name,description,active_checks_enabled" > mon_node_services.txt
Parse data with scrip
perl mon_node_output_parse_diff.pl mon_node_services.txt
Use case to check hosts active_checks_enabled
Get data for host checks
## mon node ctrl --self --all "mon query ls hosts -c name,active_checks_enabled" > mon_node_hosts.txt
Parse data with scrip
perl mon_node_output_parse_diff.pl mon_node_hosts.txt

The script takes one input file per run. In the following example it is run twice, once on mon_node_services.txt and once on mon_node_hosts.txt, each of which contains the output of the corresponding mon node ctrl command. For each file, it outputs the checks whose settings differ and the nodes on which they differ.

If there is no output, the settings do not differ between nodes.

The following one-liner collects data on services and hosts across all nodes in your cluster, then runs the script on each of the two text files containing the collected data. In this example the perl script is expected to be in /tmp.

# cd /tmp && mon node ctrl --self --all "mon query ls services -c host_name,description,active_checks_enabled" > mon_node_services.txt && mon node ctrl --self --all "mon query ls hosts -c name,active_checks_enabled" > mon_node_hosts.txt && perl /tmp/mon_node_output_parse_diff.pl mon_node_services.txt && perl /tmp/mon_node_output_parse_diff.pl mon_node_hosts.txt

Review the services that have active checks disabled. Passive checks, including business services, are expected to have active checks disabled.

The following command lists all services on the local node that have active_checks_enabled set to 0:

## mon query ls services -c host_name,description,active_checks_enabled | grep "0\$"

And this command for the hosts (local node only):

## mon query ls hosts -c name,active_checks_enabled | grep "0\$"
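
On a large installation the list of disabled checks can be long. The sketch below summarises it as a count per host. It assumes only what the commands above already rely on: rows are semicolon-separated and active_checks_enabled is the last column. The function name count_disabled_by_host and the sample rows are made up for illustration.

```shell
#!/bin/sh
# count_disabled_by_host: read semicolon-separated rows on stdin and print
# "host;count" for every host with at least one row whose last column is 0.
count_disabled_by_host() {
    awk -F';' '$NF == "0" { n[$1]++ } END { for (h in n) print h ";" n[h] }' | sort
}

# Made-up sample rows in the "host_name;description;active_checks_enabled" layout.
printf '%s\n' 'web01;PING;1' 'web01;HTTP;0' 'web01;SSH;0' 'db01;Disk;0' |
    count_disabled_by_host
# prints:
# db01;1
# web01;2
```

On a node, the same function can be fed from the real query, for example: mon query ls services -c host_name,description,active_checks_enabled | count_disabled_by_host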

Inspecting and fixing the discrepancies manually may be worthwhile, since it helps you understand what happened. If you would rather correct them automatically, the next step is to choose the node whose settings will act as the master data for updating the others.

You can run the one-liners below on the chosen master server to propagate its settings to the other masters and pollers in the cluster via external commands. Each one-liner runs in the background and writes its output to the text file named at the end of the command. To see the current status, tail that file.

Propagate settings for active service checks

(IFS=$'\n'; for line in $(mon query ls services -c host_name,description,active_checks_enabled); do varHost=$(echo "$line" | cut -d ";" -f1); varService=$(echo "$line" | cut -d ";" -f2); varActive=$(echo "$line" | cut -d ";" -f3); varSubmitCommand=""; if [ "$varActive" = "1" ] && [ -n "$varService" ]; then varSubmitCommand="ENABLE_SVC_CHECK"; elif [ "$varActive" = "0" ] && [ -n "$varService" ]; then varSubmitCommand="DISABLE_SVC_CHECK"; fi; [ -n "$varSubmitCommand" ] && mon ecmd submit "$varSubmitCommand" "$varHost;$varService"; done) > propagate_active_service_checks.txt &

Propagate settings for active host checks

(IFS=$'\n'; for line in $(mon query ls hosts -c name,active_checks_enabled); do varHost=$(echo "$line" | cut -d ";" -f1); varActive=$(echo "$line" | cut -d ";" -f2); varSubmitCommand=""; if [ "$varActive" = "1" ]; then varSubmitCommand="ENABLE_HOST_CHECK"; elif [ "$varActive" = "0" ]; then varSubmitCommand="DISABLE_HOST_CHECK"; fi; [ -n "$varSubmitCommand" ] && [ -n "$varHost" ] && mon ecmd submit "$varSubmitCommand" "$varHost"; done) > propagate_active_host_checks.txt &
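
Because the one-liners above submit external commands immediately, it may help to preview what would be sent first. The following is a dry-run sketch for the service case only: it prints the mon ecmd submit lines instead of running them. The function name preview_service_commands and the sample rows are hypothetical. The row layout (host_name;description;active_checks_enabled) is taken from the query used above.

```shell
#!/bin/sh
# preview_service_commands: read "host;description;active" rows on stdin and
# print the external command that would be submitted for each row.
# Nothing is submitted; this is a dry run.
preview_service_commands() {
    while IFS=';' read -r host service active; do
        # Skip incomplete rows (missing host or service description).
        [ -n "$host" ] && [ -n "$service" ] || continue
        case "$active" in
            1) cmd=ENABLE_SVC_CHECK ;;
            0) cmd=DISABLE_SVC_CHECK ;;
            *) continue ;;   # unexpected value: skip rather than guess
        esac
        printf 'mon ecmd submit %s "%s;%s"\n' "$cmd" "$host" "$service"
    done
}

# Made-up sample rows; on a node, pipe the real query output in instead.
printf '%s\n' 'web01;HTTP;0' 'web01;PING;1' | preview_service_commands
# prints:
# mon ecmd submit DISABLE_SVC_CHECK "web01;HTTP"
# mon ecmd submit ENABLE_SVC_CHECK "web01;PING"
```

A row whose service description itself contains a semicolon is skipped by this sketch, because the third field is then no longer a plain 0 or 1.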