Terracotta Universal Messaging Solution

Introduction

As technology evolves, web and mobile platforms are becoming increasingly essential to clients and businesses. The Universal Messaging Solution enhances productivity and provides real-time alerts to support staff. Although the status of Terracotta Universal Messaging can be viewed through the supplied Enterprise Manager tool, that tool provides only a static view of the Cluster, with no automated alerts and no integration with other monitoring tools. As more global FX e-platforms and banks rely on Terracotta Universal Messaging for streaming, this plug-in lets them see more and resolve issues faster than before.

Glossary

Cluster - a collection of at least 3 Universal Messaging Realms. The Realms work in conjunction with each other to supply a messaging service in a fault-tolerant, load-balanced way. If a Realm fails for whatever reason, the other Realms in the cluster continue to work as if it hadn’t failed. To achieve this, a single Realm is identified as the Master Node, which keeps the status of the other Realms. If the master Realm fails, the other Realms vote to select a new master Realm.
Channel - a generic name for a Topic or Queue, as defined by the JMS standard.
Connection - a physical connection over which a message is sent.
Realm - identifies an individual instance of a Universal Messaging Server.
RNAME - the key to all information in the Universal Messaging API. This uniquely identifies a Universal Messaging instance or Realm. The RNAME is made up of three parts:

<protocol>://<hostname>:<port>

<protocol> - can be one of the four available native communications protocol identifiers: nsp (socket), nhp (HTTP), nsps (SSL), and nhps (HTTPS).
<hostname> - the hostname or IP address of the machine that the Universal Messaging Realm is running on.
<port> - the TCP port on the hostname that the Universal Messaging Realm is bound to, using the same wire protocol.
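As an illustration of the three-part RNAME format described above, the following is a minimal Python sketch that splits an RNAME into its protocol, hostname, and port. The function name `parse_rname`, the `RName` type, and the example port 9000 are hypothetical; they are not part of the Universal Messaging API.

```python
from typing import NamedTuple

# Protocol identifiers listed in the glossary above.
PROTOCOLS = {"nsp": "socket", "nhp": "HTTP", "nsps": "SSL", "nhps": "HTTPS"}


class RName(NamedTuple):
    protocol: str
    hostname: str
    port: int


def parse_rname(rname: str) -> RName:
    """Split '<protocol>://<hostname>:<port>' into its three parts."""
    protocol, sep, rest = rname.partition("://")
    if not sep or protocol not in PROTOCOLS:
        raise ValueError(f"unknown or missing protocol in {rname!r}")
    hostname, sep, port = rest.rpartition(":")
    if not sep or not hostname or not port.isdecimal():
        raise ValueError(f"expected <hostname>:<port> in {rname!r}")
    return RName(protocol, hostname, int(port))


# Hypothetical example value; substitute the RNAME of your own Realm.
print(parse_rname("nsp://localhost:9000"))
```

A check like this can be useful for validating an RNAME in a configuration file before the plug-in tries to connect.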

Technology

The Universal Messaging plug-in makes extensive use of the Terracotta Universal Messaging admin API to provide detailed, real-time statistics about the performance of Universal Messaging Clusters. Using this information in conjunction with Geneos FKM (File Keyword Monitoring) against the Universal Messaging log provides a comprehensive view of the health of your Universal Messaging Cluster. Rules can be set up to provide the support team with automated alerts, which frees up resources and enables swift resolution to issues.

Many people successfully use Universal Messaging as a middleware delivery mechanism for low latency applications; however, the technology is a blind spot for them from a monitoring point of view, and they need to gather more sophisticated metrics from Terracotta Universal Messaging. The following sections briefly describe the views that are available.

Prerequisites

Instrumentation XML-RPC API.
Universal Messaging Solution package with dependent libs (these are included in the lib subdirectory).
Universal Messaging plug-in licence: UMMonitor.lic. Please contact ITRS Support for a trial licence.

Java requirements

You must have Java installed on the machine running the Netprobe. For information on supported Java versions, see Java support in 5.x Compatibility Matrix.

Installation

Sampler

Set up a sampler. This is set up as an API plug-in.

Set the name to “Cluster”. If you wish to change the name, make sure that this value is used in the UMMonitor.properties file.
Set the plugin type to API.

<sampler name="Cluster">
<plugin>
<api></api>
</plugin>
</sampler>

Netprobe

Select a netprobe, preferably on the machine where you will be running the plug-in code. In this example, it’s called “UM probe”.

Managed Entity

Set up a managed entity that joins the probe and the sampler.

Set the name to “Universal Messaging”. If you wish to change the name, make sure that this value is used in the UMMonitor.properties file.
Set Options to probe, and select the probe you set up in Netprobe.
Reference the sampler you set up in Sampler.

<managedEntity name="Universal Messaging">
<probe ref="UM probe"></probe>
<sampler ref="Cluster"></sampler>
</managedEntity>

Universal Messaging Permissions

Ensure that the user has full access permissions. You can do this in the Enterprise Manager or with the command line tools:

./naddrealmacl <user> <server where plugin is running> full

Universal Messaging Plug-in with Dependent Libs

Create a directory on the server where you are running the netprobe you want to use to monitor Universal Messaging. Copy the contents of the tar file to this location.

You should see the following:

UMMonitor/
    UMMonitor.jar
    lib/
        log4j-1.2.16.jar
        vim25.jar
        ws-commons-util-1.0.2.jar
        xmlrpc-client-3.1.3.jar
        xmlrpc-common-3.1.3.jar
        xmlrpc-server-3.1.3.jar

Plug-in Configuration

By default, the plug-in uses a config file called UMMonitor.properties. You can specify a different config file on the command line, or use Java properties on the command line to override properties in the config file. If no config file is found, the plug-in generates a default one the first time it runs. Confirm that the UMMonitor.properties file has the correct settings, especially:

netprobeServer=localhost
netprobePort=7036
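Before starting the plug-in, it can help to sanity-check these two settings. The following is a minimal Python sketch, assuming the file uses plain `key=value` lines; the helper names `parse_properties` and `check_netprobe_settings` are hypothetical and are not part of the plug-in.

```python
def parse_properties(text: str) -> dict[str, str]:
    """Read simple key=value lines; lines starting with '#' or '!' are comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_netprobe_settings(props: dict[str, str]) -> list[str]:
    """Return a list of problems found in the two Netprobe settings."""
    problems = []
    if not props.get("netprobeServer"):
        problems.append("netprobeServer is missing or empty")
    port = props.get("netprobePort", "")
    if not port.isdecimal() or not 1 <= int(port) <= 65535:
        problems.append(f"netprobePort is not a valid TCP port: {port!r}")
    return problems


sample = "netprobeServer=localhost\nnetprobePort=7036\n"
print(check_netprobe_settings(parse_properties(sample)))
```

To check a real file, pass its contents to `parse_properties`, for example `parse_properties(open("UMMonitor.properties").read())`. An empty list means both settings look plausible; it does not confirm that a Netprobe is actually listening on that port.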

Logging Configuration

The logging is configured using log4j. By default, it is configured to log to the console and a log file (UMMonitor.log) that will roll twice a day (AM and PM).

Initialisation

To run the UMMonitor.jar file:

java -jar UMMonitor.jar

Views

Cluster Monitor

The Cluster Monitor dataview contains the state, master, online, and can be master columns.

Realm Monitor

../ImportedGeneosImages/RealmMonitor1.png

Interface Monitor

../ImportedGeneosImages/InterfaceMonitor1.png

Channel Monitor

../ImportedGeneosImages/ChannelMonitor1.png

Connection Monitor

../ImportedGeneosImages/ConnectionMonitor1.png

Thread Monitor

../ImportedGeneosImages/ThreadMonitor1.jpg

Default Rules

Once the Universal Messaging monitoring is in place, the real value that Geneos brings is to allow you to create alerts that identify a Realm in trouble. This allows your support teams to identify a problem before it affects your business service. To create the alerts, you need to set up rules based on the information that is available in the data views.

Cluster Monitoring

These values are available from the Cluster Monitor view. It’s important to build a rule that works across both the Master Realm and each Realm, as the two views may differ.

Rule Description	Alert Level
Highlight if less than 51% of the Realms are available – this means that Quorum cannot be reached	Red
Highlight if the Master node’s view differs from the view of an individual node. For example, if the Master loses its connection to a Realm, that Realm appears as offline in the Master’s view, even though the Realm itself may still be running and only the connection is broken.	Amber
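The logic of the two cluster rules above can be sketched in Python as follows. This only illustrates the decision logic; in Geneos the rules themselves are configured against the Cluster Monitor dataview. The function name and the dictionary representation of a view (Realm name mapped to an online flag) are assumptions made for this example.

```python
def cluster_severity(master_view: dict[str, bool], node_view: dict[str, bool]) -> str:
    """Map the two Cluster Monitor rules to an alert level.

    Each view maps a Realm name to True when that Realm is seen as online.
    """
    total = len(master_view)
    online = sum(master_view.values())
    # Red: fewer than 51% of the Realms are available, so no quorum.
    if total == 0 or 100 * online / total < 51:
        return "Red"
    # Amber: the Master's view and the individual node's view disagree.
    if master_view != node_view:
        return "Amber"
    return "OK"


# Hypothetical three-Realm cluster where one node disagrees with the Master.
master = {"realm1": True, "realm2": True, "realm3": False}
node = {"realm1": True, "realm2": True, "realm3": True}
print(cluster_severity(master, node))
```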

Realm Monitoring

On each individual Realm, it is worth setting up monitoring for the following fields.

Rule Description	Alert Level
If getUsedMemory is greater than Warning memory level (default 85%)	Amber
If getUsedMemory is greater than Error memory level (default 95%)	Red
If getUsedMemory increases rapidly in value over three consecutive sample periods	Red
Consumed per sec should not be 0	Amber
The number of messages Consumed should have a near linear relationship with the Published value. You can use Breach Predictor to calculate this relationship.	Amber
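The three memory rules above can be sketched in Python as follows. This is an illustration of the logic only, under two assumptions that do not come from the plug-in: used memory is expressed as a percentage of the maximum, and "increases rapidly" means each of the last three sample-to-sample increases exceeds a hypothetical `rapid_step` of 5 percentage points.

```python
def memory_severity(used_pct: list[float], warning: float = 85.0,
                    error: float = 95.0, rapid_step: float = 5.0) -> str:
    """Return an alert level from a history of used-memory samples (oldest first)."""
    if not used_pct:
        return "OK"
    latest = used_pct[-1]
    # Four samples give three consecutive sample periods to compare.
    recent = used_pct[-4:]
    rapid = len(recent) == 4 and all(
        later - earlier > rapid_step for earlier, later in zip(recent, recent[1:])
    )
    if latest > error or rapid:
        return "Red"
    if latest > warning:
        return "Amber"
    return "OK"


print(memory_severity([40.0, 50.0, 60.0, 70.0]))  # rapid growth over three periods
```

Tune `rapid_step` to the normal memory behaviour of your own Realms; a value that is too low will raise alerts on ordinary garbage collection cycles.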

Channel Monitoring

The following rules relate to the Channel views.

Rule Description	Alert Level
Consumed per sec should not be 0	Amber
The number of messages Consumed should have a near linear relationship with the Published value. For example, if the Consumed value increased 10%, the Published value should also increase. You can use Breach Predictor to calculate this relationship.	Amber
If Published Rate or Consumed Rate increases dramatically (over 10%)	Amber
If getUsedSpace > 1GB, the storage may be persisting incorrectly. Universal Messaging supports 5 channel types: nChannelAttributes.RELIABLE_TYPE, nChannelAttributes.SIMPLE_TYPE, nChannelAttributes.TRANSIENT_TYPE, nChannelAttributes.MIXED_TYPE, and nChannelAttributes.PERSISTENT_TYPE. The first 3 are memory-based channels with no disk usage, so getUsedSpace relates to memory and disk space will be 0. For persistent channels, getUsedSpace is the disk usage; memory usage can be taken as roughly the same, although a memory cache of events with a replacement policy means that memory and disk may not match exactly. Mixed channels support both persistent and in-memory events, so they can be treated the same as persistent channels.	Red
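As a rough illustration of the channel rules above, the following Python sketch combines the zero-consumption, rate-increase, and used-space checks. The function and parameter names are hypothetical, and treating 1GB as 1024³ bytes is an assumption; the linear-relationship rule is left to Breach Predictor and is not covered here.

```python
ONE_GB = 1024 ** 3  # assumption: 1GB interpreted as 2^30 bytes


def channel_severity(prev_rate: float, curr_rate: float,
                     consumed_per_sec: float, used_space_bytes: int,
                     max_increase_pct: float = 10.0) -> str:
    """Return an alert level for one channel from two consecutive samples."""
    # Red: used space above 1GB suggests the storage is persisting incorrectly.
    if used_space_bytes > ONE_GB:
        return "Red"
    # Amber: nothing is being consumed.
    if consumed_per_sec == 0:
        return "Amber"
    # Amber: the Published or Consumed rate jumped by more than the threshold.
    if prev_rate > 0 and curr_rate > prev_rate * (1 + max_increase_pct / 100):
        return "Amber"
    return "OK"


print(channel_severity(prev_rate=100.0, curr_rate=120.0,
                       consumed_per_sec=50.0, used_space_bytes=0))
```

Call the function once for the Published Rate and once for the Consumed Rate if you want both covered.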

Connection Monitoring

The following rules relate to the Connection views.

Rule Description	Alert Level
Alert if the getQueueSize value is always increasing	Amber
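One way to interpret "always increasing" is that every sample in a recent window is larger than the one before it. The following Python sketch shows that interpretation; the function name and the window of 5 samples are assumptions chosen for the example.

```python
def queue_always_increasing(queue_sizes: list[int], window: int = 5) -> bool:
    """True when the last `window` queue-size samples are strictly increasing."""
    recent = queue_sizes[-window:]
    if len(recent) < window:
        return False  # not enough history to decide
    return all(earlier < later for earlier, later in zip(recent, recent[1:]))


print(queue_always_increasing([3, 1, 2, 4, 7, 9]))
```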

Log Monitoring

This uses the File Keyword Monitor to examine the logs and raise alerts based on the following information. These are the suggested Red and Amber FKM Keys to watch.

Grep	Description	Test	Alert Level
`Disconnected\s+from`	This means a Realm is disconnected and should be picked up by the Cluster Monitor.	Shut down realm.	Amber
`Inactive\s+drivers\s+bound\s+to`	This occurs when the network stack does not correctly report a connection closure.	Block traffic using firewall.	Amber
`Driver\s+inactive\s+on\s+adapter`	This occurs when the network stack does not correctly report a connection closure.	Block traffic using firewall.	Amber
`Fatal`	This is a serious error.	This is hard to reproduce. The next level is security, which is easy to check for: create a channel from one host, remove all permissions in the ACL panel for the `@` acl entry, and run the `npubchan` example program on the channel from another host. You will see a security log message.	Red
`Logged\s+Out\s+using\s+(nsp\|nhp\|nsps\|nhps)\s+session\s+established\s+for\s[0-5]?[0-9]\s+Seconds`	Matches the text “Logged Out using `<PROTOCOL>` session established for `<x>` Seconds”. Alert if `<x>` is less than 60.	Start up an `APP`, and shut it down before 60 seconds is up.	Amber
`ThreadPool:<“+myName+”>(“+myIdleThreadCount+”) has been active for over 60 seconds`	Alert if a thread pool has been active for over 60 seconds.	This will be hard to recreate.	Red
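Before adding a key to FKM, it can be worth trying the pattern against a few lines. The following Python sketch does this for a variant of the short-session pattern. In this variant the protocol alternation lists the four identifiers from the glossary, the seconds group is limited to 0 to 59, and the alternation bars are written unescaped, as Python's `re` module expects. The sample lines are made up to fit the pattern and are not real Universal Messaging log output.

```python
import re

PROTOCOLS = "nsp|nhp|nsps|nhps"
SHORT_SESSION = re.compile(
    rf"Logged\s+Out\s+using\s+({PROTOCOLS})\s+session\s+established"
    rf"\s+for\s+([0-5]?[0-9])\s+Seconds"
)


def is_short_session(line: str) -> bool:
    """True when the line reports a session that lasted less than 60 seconds."""
    return SHORT_SESSION.search(line) is not None


# Hypothetical log lines, shaped only to exercise the pattern.
samples = [
    "Logged Out using nsp session established for 12 Seconds",
    "Logged Out using nhps session established for 60 Seconds",
    "Logged Out using nsps session established for 120 Seconds",
]
for line in samples:
    print(is_short_session(line), line)
```

Check the pattern against lines from your own Universal Messaging log before relying on it, as the exact wording may differ between versions.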