Duplicate or out of time email alerts
This page covers email alerts that are delivered more than once or at unexpected times.
Problem
- A user receives duplicate email alerts at exactly the same time. For example, a user receives two instances of EmailAlert01 at 10:00 AM.
- A user receives duplicate email alerts at different times. For example:
  - The user receives one instance of EmailAlert01 at 10:00 AM.
  - The user receives another instance of EmailAlert01 at 2:00 PM.
- A user receives an unexpected email alert, for example one that arrives while the Sampler is inactive or shortly before the Cell is snoozed.
Possible causes
- A Rule-Action pair and an Alerting-Effect pair are configured to monitor the same data items.
- The email script (or an external email application such as sendmail) that is configured in the Action or Effect ran more than once.
- An error keyword from an old file was detected again by the FKM sampler (or a State Tracker sampler) after the old file was renamed. The FKM sampler treats the renamed file as a new file and reads it again. This happens because of the following combination of events:
  - The FKM sampler’s files > file > source > filename setting has a wildcard. For example: /opt/application/log/application*.log. See FKM configuration.
  - The FKM sampler’s wildcardMonitorAllMatches setting is disabled.
  - The old file’s filename is /opt/application/log/application.log.
  - The FKM sampler detected an error in /opt/application/log/application.log.
  - The FKM error trigger key was cleared using the FKM’s Clear This Trigger command (or any similar FKM command).
  - The old file’s filename was changed from /opt/application/log/application.log to /opt/application/log/application_old.log. The latter still matches the value configured in files > file > source > filename.
  - The FKM sampler treats /opt/application/log/application_old.log as a new file and monitors it.
  - The FKM sampler detected the errors again in /opt/application/log/application_old.log.
- An error keyword from an old file was detected again by the FKM sampler (or a State Tracker sampler) after the old file’s contents were updated. The FKM sampler treats the old file as a new file and reads it again. This happens because of the following combination of events:
  - The FKM sampler’s files > file > source > filename setting has a wildcard. For example: /opt/application/log/application*.log. See FKM configuration.
  - The FKM sampler’s wildcardMonitorAllMatches setting is disabled.
  - The old file’s filename is /opt/application/log/application_01.log.
  - The FKM sampler detected an error in /opt/application/log/application_01.log.
  - The FKM error trigger key was cleared using the FKM’s Clear This Trigger command (or any similar FKM command).
  - A new file was created. The filename is /opt/application/log/application_02.log.
  - The FKM sampler treats /opt/application/log/application_02.log as a new file and monitors it.
  - The old file, /opt/application/log/application_01.log, was updated.
  - The FKM sampler treats /opt/application/log/application_01.log as a new file and monitors it.
  - The FKM sampler detected the errors again in /opt/application/log/application_01.log.
- A Cell value changes before the Sampler becomes inactive and the Rule does not have any Active Time defined on it. In the following Gateway log entries, the Rule still applied the severity and fired the email alert even though the Sampler was inactive:
2021-08-06 22:30:39.241+0800 INFO: ActionManager Action DataItem 'send email alert' generated (variable=/geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="toolkit for alerts and actions")][(@type="")]/dataview[(@name="toolkit for alerts and actions")]/rows/row[(@name="pugo")]/cell[(@column="status")])
2021-08-06 22:30:39.241+0800 INFO: ActionManager Firing action 'send email alert'
2021-08-06 22:30:40.150+0800 INFO: ActionManager Finished executing '/home/MNL/rgonzales/scripts/scripts/print_env.bash' with arguments ''.
2021-08-06 22:30:40.150+0800 INFO: ActionManager Completed action 'send email alert', Exit code: 0
- The Cell was snoozed after the email alert was fired. To confirm this, search the Gateway log file for the strings ActionManager (or AlertManager if the email alert was produced by an Effect) and CommandManager. The resulting log entries are as follows:
2021-08-06 22:54:41.100+0800 INFO: ActionManager Action DataItem 'send email alert' generated (variable=/geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="toolkit for alerts and actions")][(@type="")]/dataview[(@name="toolkit for alerts and actions")]/rows/row[(@name="pugo")]/cell[(@column="status")])
2021-08-06 22:54:41.100+0800 INFO: ActionManager Firing action 'send email alert'
2021-08-06 22:54:42.109+0800 INFO: ActionManager Finished executing '/home/MNL/rgonzales/scripts/scripts/print_env.bash' with arguments ''.
2021-08-06 22:54:42.109+0800 INFO: ActionManager Completed action 'send email alert', Exit code: 0
2021-08-06 23:06:47.886+0800 INFO: GatewayControl _commandExec: /SNOOZE:manualAllMe [ImXSSIo] requestId=1
2021-08-06 23:06:47.887+0800 INFO: CommandManager Executing command [/SNOOZE:manualAllMe] with id [54] for request id [1]
2021-08-06 23:06:47.887+0800 INFO: CommandManager Executing command '/SNOOZE:manualAllMe' on DataItem '/geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="toolkit for alerts and actions")][(@type="")]/dataview[(@name="toolkit for alerts and actions")]/rows/row[(@name="pugo")]/cell[(@column="status")]', issued by user 'MNL\rgonzales' on '192.168.200.6'
The log entries show that the Snooze command was executed on the cell at 23:06:47, about 12 minutes after the email alert fired at 22:54:41.
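The ordering check described above can be sketched in Python. This is a minimal sketch that assumes the Gateway log lines have the same layout as the excerpts on this page and appear in chronological order. The function name late_snoozes and the regular expressions are illustrative and are not part of Geneos.

```python
import re
from datetime import datetime

# Timestamp layout assumed from the log excerpts on this page.
TS = r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+[+-]\d{4})"

# "... ActionManager Action DataItem '<name>' generated (variable=<xpath>)"
FIRED = re.compile(
    TS + r" INFO: (?:ActionManager|AlertManager) (?:Action|Alert) DataItem "
    r"'.*?' generated \(variable=(?P<xpath>.*)\)\s*$"
)
# "... CommandManager Executing command '/SNOOZE:...' on DataItem '<xpath>', issued by ..."
SNOOZED = re.compile(
    TS + r" INFO: CommandManager Executing command '/SNOOZE:[^']*' "
    r"on DataItem '(?P<xpath>.*?)', issued by"
)


def _when(text):
    """Parse a Gateway log timestamp such as 2021-08-06 22:54:41.100+0800."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f%z")


def late_snoozes(lines):
    """Return (xpath, alert_time, snooze_time) for snoozes that follow an alert."""
    last_alert = {}  # xpath -> time of the most recent alert on that data item
    late = []
    for line in lines:
        fired = FIRED.match(line)
        if fired:
            last_alert[fired["xpath"]] = _when(fired["ts"])
            continue
        snoozed = SNOOZED.match(line)
        if snoozed and snoozed["xpath"] in last_alert:
            xpath = snoozed["xpath"]
            late.append((xpath, last_alert[xpath], _when(snoozed["ts"])))
    return late
```

Any tuple returned by late_snoozes identifies a data item whose alert was generated before the snooze command ran, which is the situation shown in the log entries above.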
- Specific to the FKM sampler, the Active Time is defined at the Sampler level and not at the File level. The two levels behave differently:
  - Sampler level: The FKM’s file pointer stops when the sampler becomes inactive. When the sampler becomes active again, the file pointer resumes from the line where it stopped, so entries written during the inactive period are still read.
  - File level: The FKM’s file pointer continues to scan the log entries but does not detect the configured error keywords when the sampler is inactive. The file pointer starts to detect error keywords again when the sampler becomes active.
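The difference between the two levels can be illustrated with a toy model in Python. This is only a sketch of the behaviour described above, not FKM code: the function detected, the level labels, and the sample entries are all hypothetical, and the model assumes the sampler is active again by the time the last line is read.

```python
def detected(entries, keyword, level):
    """Return the lines that raise a keyword match under a toy Active Time model.

    entries: (text, written_while_active) pairs in file order.
    level:   "sampler" -> the file pointer pauses while inactive and later
                          reads the backlog with keyword detection enabled.
             "file"    -> the file pointer keeps moving while inactive, so
                          lines written in that period are skipped unmatched.
    """
    hits = []
    for text, written_while_active in entries:
        if level == "file" and not written_while_active:
            continue  # pointer moved past this line without matching
        if keyword in text:
            hits.append(text)
    return hits


entries = [
    ("ERROR a", True),
    ("ERROR b", False),  # written while inactive
    ("INFO c", False),   # written while inactive
    ("ERROR d", True),
]
print(detected(entries, "ERROR", "sampler"))
print(detected(entries, "ERROR", "file"))
```

In this model the Sampler-level run reports the error written during the inactive period once scanning resumes, which is the out-of-time alert described in this cause, while the File-level run does not.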
Possible solutions
- Access the Gateway log file and review the AlertManager and ActionManager log entries:
  - Locate the Gateway log file.
  - Open the Gateway log file in any text viewer (for example, Notepad on Windows or Vim on Linux).
  - Search for the string “ActionManager”.
  - Copy and paste the output to a text file.
  - Search for the string “AlertManager”.
  - Copy and paste the output to a text file.
  - Review the text file.
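The manual search above can also be scripted. The following Python sketch assumes a plain-text Gateway log. The function names and the file names in the usage note are placeholders, not Geneos defaults.

```python
KEYWORDS = ("ActionManager", "AlertManager")


def extract_entries(log_lines, keywords=KEYWORDS):
    """Return {keyword: [matching lines]}, preserving the order in the log."""
    found = {keyword: [] for keyword in keywords}
    for line in log_lines:
        for keyword in keywords:
            if keyword in line:
                found[keyword].append(line.rstrip("\n"))
    return found


def save_entries(log_path, out_path, keywords=KEYWORDS):
    """Read the Gateway log and write the matching entries to a text file."""
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        found = extract_entries(handle, keywords)
    with open(out_path, "w", encoding="utf-8") as out:
        for keyword, lines in found.items():
            out.write(f"== {keyword} ==\n")
            out.writelines(line + "\n" for line in lines)
    return found
```

For example, save_entries("gateway.log", "alert_entries.txt") would collect both sets of entries into one file for review, where both paths are hypothetical.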
After reviewing the log entries, continue with the following steps:
- Look for ActionManager and AlertManager entries that have the same (or nearly the same) timestamp and the same target XPath. Below are example log entries:
2021-08-06 15:17:59.263+0800 INFO: ActionManager Action DataItem 'send email alert' generated (variable=/geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="Toolkit for Alerts and Actions")][(@type="")]/dataview[(@name="Toolkit for Alerts and Actions")]/rows/row[(@name="pugo")]/cell[(@column="status")])
2021-08-06 15:17:59.264+0800 INFO: ActionManager Firing action 'send email alert'
2021-08-06 15:17:59.433+0800 INFO: AlertManager Alert DataItem 'alerting for rowName / pugo / CRITICAL / 0' generated (variable=/geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="Toolkit for Alerts and Actions")][(@type="")]/dataview[(@name="Toolkit for Alerts and Actions")]/rows/row[(@name="pugo")]/cell[(@column="status")])
2021-08-06 15:17:59.434+0800 INFO: AlertManager Alert: 'alerting for rowName / pugo / CRITICAL / 0'; Effect: 'fire email alert'; TO: ; CC: ; BCC: ; DataItem: /geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="Toolkit for Alerts and Actions")][(@type="")]/dataview[(@name="Toolkit for Alerts and Actions")]/rows/row[(@name="pugo")]/cell[(@column="status")]
2021-08-06 15:17:59.434+0800 INFO: EffectManager Firing effect 'fire email alert'
2021-08-06 15:18:00.143+0800 INFO: EffectManager Finished executing '/home/MNL/rgonzales/scripts/scripts/print_env.bash' with arguments ''.
2021-08-06 15:18:00.143+0800 INFO: ActionManager Completed effect 'fire email alert' for alert 'alerting for rowName / pugo / CRITICAL / 0', Exit code: 0
2021-08-06 15:18:00.143+0800 INFO: ActionManager Finished executing '/home/MNL/rgonzales/scripts/scripts/print_env.bash' with arguments ''.
2021-08-06 15:18:00.143+0800 INFO: ActionManager Completed action 'send email alert', Exit code: 0
From the above:
The send email alert Action fired at 2021-08-06 15:17:59.264+0800 and completed at 15:18:00.143+0800. It was triggered by the following data item:
2021-08-06 15:17:59.263+0800 INFO: ActionManager Action DataItem 'send email alert' generated (variable=/geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="Toolkit for Alerts and Actions")][(@type="")]/dataview[(@name="Toolkit for Alerts and Actions")]/rows/row[(@name="pugo")]/cell[(@column="status")])
The fire email alert Effect fired at 2021-08-06 15:17:59.434+0800 and completed at 15:18:00.143+0800. It was triggered by the following data item:
2021-08-06 15:17:59.433+0800 INFO: AlertManager Alert DataItem 'alerting for rowName / pugo / CRITICAL / 0' generated (variable=/geneos/gateway[(@name="MNL_MAYA_GATEWAY_9370")]/directory/probe[(@name="PUGO_6370")]/managedEntity[(@name="FAQ ALERTS AND ACTIONS")]/sampler[(@name="Toolkit for Alerts and Actions")][(@type="")]/dataview[(@name="Toolkit for Alerts and Actions")]/rows/row[(@name="pugo")]/cell[(@column="status")])
- Identify the Rule that uses the Action.
- Identify the Alerting hierarchy that uses the Effect.
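- Review both configurations and confirm whether they target the same data items.
- Implement the necessary changes so that only one of them sends the email alert for a given data item.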
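Finding an Action and an Alert that were generated for the same target XPath at almost the same time can be sketched in Python as follows. The log layout is assumed from the example entries above. The function names and the one-second window are illustrative choices, not Geneos settings.

```python
import re
from datetime import datetime, timedelta

# Matches the "... DataItem '<name>' generated (variable=<xpath>)" entries.
GENERATED = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+[+-]\d{4}) INFO: "
    r"(?P<manager>ActionManager|AlertManager) (?:Action|Alert) DataItem "
    r"'(?P<name>.*?)' generated \(variable=(?P<xpath>.*)\)\s*$"
)


def parse_generated(lines):
    """Return (time, manager, name, xpath) for every 'generated' entry."""
    events = []
    for line in lines:
        match = GENERATED.match(line)
        if match:
            when = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f%z")
            events.append((when, match["manager"], match["name"], match["xpath"]))
    return events


def find_overlaps(lines, window=timedelta(seconds=1)):
    """Pair Actions and Alerts generated for the same XPath within the window."""
    events = parse_generated(lines)
    actions = [event for event in events if event[1] == "ActionManager"]
    alerts = [event for event in events if event[1] == "AlertManager"]
    overlaps = []
    for action_time, _, action_name, action_xpath in actions:
        for alert_time, _, alert_name, alert_xpath in alerts:
            same_target = action_xpath == alert_xpath
            if same_target and abs(action_time - alert_time) <= window:
                overlaps.append((action_name, alert_name, action_xpath))
    return overlaps
```

Each returned tuple names an Action and an Alert that both reacted to the same data item, which points to the Rule and the Alerting hierarchy to compare.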
- Access the Gateway log file and check whether the same AlertManager or ActionManager entry appears more than once. If it appears only once, the Gateway fired the alert once and the duplicate was produced by the email script or application:
  - Check the log file of the external email script or email application.
  - Implement the necessary changes on the external email script or email application.
- An error keyword from an old file was detected again by the FKM sampler (or a State Tracker sampler) when the old file changed its filename.
  - Solution 1: Rename the rolled-over (old) file so that its new filename does not match the value of the FKM sampler’s files > file > source > filename setting.
  - Solution 2: Move the old file to a different directory.
  - Solution 3: If the problem happens to a State Tracker sampler, move the old file to a different directory.
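Before choosing a rollover name or an archive location, you can check whether a candidate path would still match the configured wildcard. The Python sketch below uses the standard fnmatch module as a stand-in for the FKM wildcard. This is an assumption, because FKM’s own matching rules may differ in detail. The second and third candidate paths are hypothetical.

```python
from fnmatch import fnmatchcase

# Wildcard from the FKM example on this page.
PATTERN = "/opt/application/log/application*.log"


def still_monitored(filename, pattern=PATTERN):
    """Return True if the filename matches the wildcard (case-sensitive)."""
    return fnmatchcase(filename, pattern)


candidates = [
    "/opt/application/log/application_old.log",  # rename from the cause described earlier
    "/opt/application/log/application.log.old",  # hypothetical alternative name
    "/opt/application/archive/application.log",  # hypothetical move to another directory
]
for name in candidates:
    print(name, "->", "matches" if still_monitored(name) else "no match")
```

Only the first candidate matches the pattern, so under this assumption the other two names would not be picked up again.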
- An error keyword from an old file was detected again by the FKM sampler (or a State Tracker sampler) when the old file’s contents were updated.
  - Solution 1: Enable the FKM sampler’s wildcardMonitorAllMatches setting so that every matching file is monitored separately.
  - Solution 2: Move the old file to a different directory.
  - Solution 3: If the problem happens to a State Tracker sampler, move the old file to a different directory.
- A Cell value changes before the Sampler becomes inactive and the Rule does not have any Active Time defined on it.
  - Solution 1: Use the Sampler’s Active Time on the Rule so that the Rule also becomes inactive when the Sampler is inactive.
  - Solution 2: If the Rule uses the delay function, either remove it or use “delay X samples” instead of “delay X seconds”. The “delay X samples” function relies on the Sampler taking samples. Since an inactive Sampler does not sample, the function’s counter also stops.
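The reasoning behind Solution 2 can be shown with a toy timeline in Python. This is a simplified model, not the Gateway’s implementation: the function names, the 20-second sample interval, and the exact counting rule for samples are all assumptions made for illustration.

```python
def fires_with_delay_seconds(change_time, delay_seconds):
    """A wall-clock delay elapses whether or not the sampler is sampling."""
    return change_time + delay_seconds


def fires_with_delay_samples(sample_times, change_time, delay_samples):
    """Count samples taken after the change; return the fire time or None."""
    later = [t for t in sample_times if t > change_time]
    if len(later) >= delay_samples:
        return later[delay_samples - 1]
    return None  # not enough samples yet, so the alert has not fired


# The sampler samples every 20 seconds and becomes inactive after t=100.
sample_times = list(range(0, 101, 20))
change_time = 90  # the Cell value changes shortly before the sampler goes inactive

print(fires_with_delay_seconds(change_time, 60))
print(fires_with_delay_samples(sample_times, change_time, 3))
```

In this model the 60-second delay expires at t=150, after the sampler has gone inactive, while the 3-sample delay sees only one further sample and does not fire during the inactive period.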
- The Cell was snoozed after the email alert was fired.
  - Ensure that the Snooze command (whether manual or via a Scheduled Command) is executed before the email alert is fired.
- Specific to the FKM sampler, the Active Time is defined on the Sampler level and not on the File level.
  - Define the Active Time on the appropriate setting:
    - Sampler level
    - File level
For more information, see Gateway Rules, Actions and Alerts and File Keyword Monitor (FKM).