Alert Governance Identification Template
One of the most common problems when configuring Geneos (or any monitoring tool, for that matter) is ensuring that users are told when things are going wrong and action is required, that alerts are not missed and, perhaps more importantly, that false alerts, or alerts which are not actionable, are minimized or removed completely.

“One of the most common mistakes when monitoring is to alert on too many things. Once the number of alerts exceeds what is manageable, you are essentially not monitoring at all.”

While on the face of it this seems obvious, it is hard to achieve, and requires constant tweaking to keep the monitoring tuned to a changing environment.

In a sister article we discussed what kinds of events and configurations can lead to an unmanageable number of alerts, and some simple configuration changes that might help in specific situations. Here we will try to build systems that highlight the source of the alerts, allowing us as administrators to review those parts of our systems and actively reduce alert levels.

NOTE: this page is a work in progress; we will add to it over time as we identify more configuration that can help identify high alert levels.

ALERT GOVERNANCE - Detection Template

Prerequisites

You will need the following to deploy the Alert Governance Identification Template:
- Your gateway is logging its events to a database.
- You have a probe installed on which you can run an SQL toolkit to query the gateway's database.
- The template assumes a MySQL database. If you have another database type, please review the specific section below for that type.

The following are the steps you will need to deploy the Alert Governance Identification Template:
- Add the following include file to your gateway: alert_governance.xml. Set its priority to low (by setting a high number in its priority field).
- In the ‘Alert Governance’ managed entity, select the probe that has sufficient access and network routes to run an SQL toolkit against the gateway database.
- In the ‘Samplers –> Alert Governance’ section, enable the ‘Alerts for review’ sampler that maps to the database type your gateway uses. Review the ‘Description’ field of the samplers to see the types.
- In the ‘Environments’ section of the include file:
  - Set the database connection settings for the database type that your gateway uses. The other database connection settings can be left blank.

Types of false alert

Within the Alert Governance part we identified a number of situations which have the capacity to generate high or false alert levels. They are listed below.

| # | Possible false alert | Description |
|---|---|---|
| 1 | The alert is already being looked at | An event has occurred but a member of the team is working on it. In theory this may reduce the severity for the remainder of the team, or even negate the situation completely. |
| 2 | The specific alert occurs very frequently | In this case the team may become desensitized to the alert and simply ignore it. |
| 3 | It has occurred outside hours | The alert has occurred outside a given time window and requires no action. It may self-correct before the relevant time window occurs. Predictable maintenance windows would be an example of this. |
| 4 | It is a consequence of another fault and is not the root problem | Another system or function has failed and this is an inevitable consequence of that failure. For example, a disk has filled up and the application relying on that disk has failed. The disk being full may be critical (it is the root problem) while the app failing may be just a warning, so the action is not on the app but on the server. |
| 5 | The rule is too generic | A blanket rule has been applied, and triggers alerts on systems where that behavior is acceptable. For example, a rule which dictates CPU should be critical if > 95% may not be applicable on a mainframe which is expected to run close to 100% most of the time, or an FKM is configured to look for the word ‘Error’ in a log file, but that word occurs far too often to be useful. |
| 6 | The situation is temporary or transient | The alerting situation occurs for a period but then normal operation resumes without human intervention. Applications may be busy for periods, for example, and monitoring may be set to detect a busy application without any leeway for that busy period to end under normal operation. |
| 7 | The alerts are on secondary systems, such as a UAT or development environment | If teams are simultaneously monitoring UAT and production environments and their chosen alerting essentially merges these alerts (for example, they are using the notifier in the Active Console), then they will receive noise from secondary systems which may obscure actual alerts. |
| 8 | Alerts on situations which can be automatically recovered | For example, a process goes down, the situation is detected and a script is automatically run to restart the process; if all this works as planned there may be no need for an alert. Alerts may be set if the script fails to restart the process, or the process terminates frequently, both of which may require operator investigation and intervention. |
| 9 | The monitored items have been modified or removed and the monitoring has not been updated | Or, to put it another way, the team responsible for changing the monitoring is not keeping up with change in the systems that are monitored. This may be because they have insufficient access, are under constant time pressure, or are not aware that the system has been changed. |
| 10 | Misconfigured sampler / rule | |
| 11 | Inappropriate severity | |

Inappropriate Severity: Even if the frequency of alerts seems workable, if the team is getting dozens of critical alerts a week then this may be indicative of some very unstable systems that are having a significant impact on the business, or of poorly configured monitoring. The simplest way to think about what severity level a given alert should be at is how long you will tolerate that situation before acting. In rough terms:

| Severity level | Time to act |
|---|---|
| Critical | Minutes |
| Warning | Hours or a day or two |
| Ok and Undefined | No limit |

In all cases an action is defined as something which reduces the severity level, and therefore buys you more time, or resolves the situation completely.

Below we try to define systems that can be used to detect these situations.
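The severity rule of thumb above can be written as a small lookup. This is a minimal sketch, not part of the template: the function name and the exact cut-offs (one hour and two days) are assumptions chosen to illustrate the rough bands "minutes" and "hours or a day or two".

```python
from datetime import timedelta
from typing import Optional

# Illustrative cut-offs only; the article gives rough bands, not exact figures.
CRITICAL_WITHIN = timedelta(hours=1)  # stands in for "minutes"
WARNING_WITHIN = timedelta(days=2)    # stands in for "hours or a day or two"


def suggested_severity(time_to_act: Optional[timedelta]) -> str:
    """Map how long a situation can be tolerated to a severity level.

    None means there is no limit on how long the situation may persist.
    """
    if time_to_act is None:
        return "ok"
    if time_to_act < CRITICAL_WITHIN:
        return "critical"
    if time_to_act <= WARNING_WITHIN:
        return "warning"
    return "ok"
```

Asking "how long would we tolerate this before acting?" for each rule, and comparing the answer with the severity the rule currently sets, is one way to spot type 11 alerts during a review.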
DEALING WITH HIGH FREQUENCY ALERTS (TYPE 2 ALERTS)

The Theory

In this section we deal with the specific case where the same alert occurs very frequently. Specifically, we are interested in:

Situations where the same alert fires X times in Y seconds at Z severity or above

The theory is that the same alert being fired too frequently within a short window of time essentially creates noise in the monitoring system, i.e. we get 100 alerts in 5 minutes where 1 would have done. Note this is not an attempt to solve the more complex case of root cause analysis (i.e. one thing occurring causes multiple other alerts and events both upstream and downstream).

The most significant question to consider in this scenario is one of equivalence: on what condition is one alert considered equivalent to another? The most precise answer is that the same cell switches severity frequently (OK –> Critical –> OK –> Critical), but in reality, when we investigated these kinds of scenarios, this was almost never the case. Instead we found cases where equivalence could be justified at the row, column or data view level. For example, a column of data generated X alerts at the same time, where X was the number of rows and they all turned critical together, or a data view had aggressive rules which meant that a large percentage of the alerts on a given gateway could be attributed to that view. In both cases the events tables for these gateways were being populated with thousands of alerts.

We found it was therefore practical to consider two alerts equivalent if they came from the same data view within a given time window. We can therefore modify our high frequency alert description to be:

Situations where the same data view fires X alerts in Y seconds at Z severity or above

The Implementation

Assuming you have deployed the template, you will see the following view:

Alert Governance (Entity) –> Alert for review –> Frequency (SQL Toolkit)
Each row shows a data view that has generated X or more alerts at the selected severity within a Y second window during the look-back period. The defaults as shipped are 5 or more alerts at critical, all within a 120 second window, within the last 24 hours. You can modify the parameters in the ‘Environments –> Alerts parameters’ section of the include file:
- ALERTS_FREQUENCY_NUM_ALERT_THRESHOLD - The number of alerts which have to happen within the time window. For example, a value of 10 means that 10 cells or headlines have to become critical within the time bracket for this view to be worthy of review.
- ALERTS_FREQUENCY_SERVERITY_THRESHOLD - The severity at or above which we are interested. 1 = warning, 2 = critical.
- ALERTS_FREQUENCY_TIME_WINDOW_SEC - The period in seconds that the events have to occur within. For example, a value of 120 means that the alerts all have to happen within 2 minutes of each other to be of interest.
- ALERTS_FREQUENCY_GO_BACK_X_HOURS - The period of time, in hours, to look back for events. For example, a value of 24 means that we will look at all the events in the last 24 hours to find high frequency alerts.
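To make the interaction of the four parameters concrete, the sketch below reproduces the selection logic in Python over an in-memory list of events. It is an illustration only: the template itself runs an SQL Toolkit query against the gateway database, which is not shown here, and the `Event` fields and the function name are hypothetical rather than the gateway's actual schema. Only the parameter names and defaults come from the text above.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    """One alert event. Field names are hypothetical, not the gateway schema."""
    dataview: str
    time: datetime
    severity: int  # 1 = warning, 2 = critical


def views_for_review(
    events: Iterable[Event],
    now: datetime,
    num_alert_threshold: int = 5,  # ALERTS_FREQUENCY_NUM_ALERT_THRESHOLD
    severity_threshold: int = 2,   # ALERTS_FREQUENCY_SERVERITY_THRESHOLD
    time_window_sec: int = 120,    # ALERTS_FREQUENCY_TIME_WINDOW_SEC
    go_back_x_hours: int = 24,     # ALERTS_FREQUENCY_GO_BACK_X_HOURS
) -> dict[str, int]:
    """Return {data view: peak alert count in any one time window}
    for every data view whose peak reaches the alert threshold."""
    earliest = now - timedelta(hours=go_back_x_hours)
    window = timedelta(seconds=time_window_sec)

    # Keep only events at or above the severity, inside the look-back period,
    # grouped by data view (the chosen unit of equivalence).
    times_by_view: dict[str, list[datetime]] = defaultdict(list)
    for event in events:
        if event.severity >= severity_threshold and earliest <= event.time <= now:
            times_by_view[event.dataview].append(event.time)

    flagged: dict[str, int] = {}
    for view, times in times_by_view.items():
        times.sort()
        start = 0
        peak = 0
        # Sliding window: times[start..end] always spans at most `window`.
        for end, current in enumerate(times):
            while current - times[start] > window:
                start += 1
            peak = max(peak, end - start + 1)
        if peak >= num_alert_threshold:
            flagged[view] = peak
    return flagged
```

With the shipped defaults, a data view that turns five cells critical within two minutes is returned, while a view that raises five critical alerts spread evenly across a day is not. Lowering the severity threshold to 1 also brings warning-level bursts into scope.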
In addition there will be a historical view of the number of alerts at the selected severity level that have occurred on the gateway.

Each row is a day, and displays the number of alerts that have occurred at the selected severity level or higher (defined via the ALERTS_FREQUENCY_SERVERITY_THRESHOLD attribute). The table will go back X months, where X is defined by the ALERTS_HISTORY_FREQUENCY_MONTHS environment variable. The data can be copied and pasted into Excel to produce a chart of alert levels over time.

Correcting Frequency based alerts

While the template will highlight data views that are generating excessive alerts within selected time brackets, it cannot advise on whether they need to be fixed and, if so, what to do. This takes expert judgement and knowledge of why the monitoring is set up as it is. In principle the objective is to reduce the number of alerts that are generated within a small time scale. Methods include:
- Create a headline which detects one or more alert situations (or select just one cell), and set the rule at the selected severity level only on the headline, active for as long as one or more of these events is ongoing. The cells of interest can be set at a lower severity level so they can still be identified.