Collector Cluster Health

The Configuration > Monitoring Collectors page shows details on the “Health” of both individual Collector nodes and each Collector Cluster.

Please see the Collector Offline document for initial troubleshooting when investigating why a collector cluster may be offline.

Clusters Tab

The Status column shows the current state of the cluster. Possible values are:

Cluster Health Alarms

The table below describes the possible alarms that will be shown when users hover over the status of a DEGRADED cluster. These alarms refer to conditions of the following Opsview components:

Alarm: All [Components Name] components are unavailable
e.g. All opsview-executor components are unavailable

Description: The Master/Orchestrator server can’t communicate with any [Components Name] components on the collector cluster. This may be because of a network/communications issue, or because no [Components Name] components are running on the cluster.

Note: This alarm only triggers when all [Components Name] components on the collector cluster are unavailable, since a cluster may be configured to run these components on only a subset of the collectors. Furthermore, the cluster may be able to continue monitoring with some (though not all) of the [Components Name] components stopped.

Suggestions / Actions: Ensure that the master/orchestrator server can communicate with the collector cluster (i.e. resolve any network issues) and that at least one [Components Name] component is running, e.g. SSH to a collector and run /opt/opsview/watchdog/bin/opsview-monit start [Component Name]

Alarm: Not enough messages received ([Components Name 1] → [Components Name 2]): [Time Period] [Percentage Messages Received]%.
e.g. Not enough messages received (opsview-scheduler → opsview-executor):[15m] 0%.

Description: Less than 70% of the messages sent by [Components Name 1] have been received by [Components Name 2] within the time period. This could indicate a communication problem between the components on the collector cluster, or that [Components Name 2] is overloaded and is struggling to process the messages it receives in a timely fashion. e.g. 0% of messages sent by the scheduler have been received by the executor within a 15-minute period.

Suggestions / Actions:

If 0% of the messages sent have been received by [Components Name 2] and no other alarms are present, this may imply a communications failure on the cluster. To resolve this, ensure that the collectors in the cluster can all communicate on all ports (see https://knowledge.opsview.com/docs/ports#collector-clusters) and that opsview-messagequeue is running on all the collectors without errors.

Alternatively, this may indicate that not all the required components are running on the collectors in the cluster. Run /opt/opsview/watchdog/bin/opsview-monit summary on each collector to check that all the components are in a running state. If any are stopped, run /opt/opsview/watchdog/bin/opsview-monit start [component name] to start them.

If more than 0% of the messages sent have been received by [Components Name 2], this likely implies a performance issue in the cluster. To address this you can:

Reduce the load on the cluster, e.g.
- Reduce the number of objects monitored by that cluster
- Reduce the number of checks being performed on each object in the cluster (i.e. remove host templates/service checks)
- Increase the check interval for monitored hosts

Increase the resources in the cluster:
- Add additional collectors to the cluster
- Improve the hardware/resources of each collector in the cluster (i.e. investigate the bottleneck by inspecting self-monitoring statistics and allocate additional CPU/memory resources as needed)
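
The guidance above for the “Not enough messages received” alarm can be sketched as a small shell helper. This is an illustrative sketch only: the function name triage_alarm, its arguments, and its output labels are hypothetical and are not part of Opsview. It assumes the alarm text ends with the percentage, as in the example alarm above, and it treats “other alarms are present” as a case to investigate first, which is an interpretation rather than documented behaviour.

```shell
#!/bin/sh
# Hypothetical helper (not part of Opsview): suggest a first line of
# investigation for a "Not enough messages received" alarm.
# Usage: triage_alarm "<alarm text>" [other_alarms_present: yes|no]
triage_alarm() {
  alarm=$1
  other_alarms=${2:-no}

  # The alarm text must contain a percentage such as "0%".
  case $alarm in
    *%*) ;;
    *) echo "unparsed"; return 1 ;;
  esac

  pct=${alarm%\%*}        # drop the last "%" and anything after it
  pct=${pct##*[!0-9]}     # keep only the digits immediately before it
  if [ -z "$pct" ]; then
    echo "unparsed"
    return 1
  fi

  if [ "$pct" -ge 70 ]; then
    echo "ok"                  # at or above the 70% threshold
  elif [ "$pct" -gt 0 ]; then
    echo "performance"         # some messages arrive: likely an overloaded cluster
  elif [ "$other_alarms" = "no" ]; then
    echo "communications"      # nothing arrives and no other alarms are present
  else
    echo "check-other-alarms"  # assumption: resolve the other alarms first
  fi
}

# Example: the alarm text from the table, with no other alarms present.
triage_alarm 'Not enough messages received (opsview-scheduler → opsview-executor):[15m] 0%.' no
```

For the example alarm this prints communications, which corresponds to checking ports, opsview-messagequeue, and the output of opsview-monit summary on each collector. A result of performance corresponds to reducing load on the cluster or adding resources to it.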

Note: For a fresh collector/cluster which has just been set up or which has minimal activity, the “Not enough messages received” alarm will be suppressed to avoid unnecessary admin/user concerns. This does not impact the “All [Components Name] components are unavailable” alarm, which will still be raised for an offline collector.

