Gateway Hub
Gateway Hub self monitoring is now a required part of a Gateway Hub system. You must configure self monitoring to receive full support.
Overview
You can monitor Gateway Hub via Gateway by using the metrics collected from the internal Netprobes running on each Gateway Hub node.
Configuring Gateway Hub self monitoring allows you to:
- View Gateway Hub metrics in your Active Console or Web Console.
- Investigate historical Gateway Hub metrics in your Active Console or Web Console.
- Set Gateway alerting rules on Gateway Hub metrics.
You can also use the Gateway Hub data Gateway plugin to monitor the status of the connection between Gateway and Gateway Hub. For more information see Gateway Hub data in Gateway Plug-Ins.
Prerequisites
The following requirements must be met prior to the installation and setup of this integration:
- Gateway version 5.5.x or newer.
- Gateway Hub version 2.3.0 or newer.
You must have a valid Gateway license to use this feature outside of demo mode. Any licensing errors are reported by the Health plugin.
Gateway Hub self monitoring
Caution
Gateway Hub self monitoring relies on the new simplified mappings for Dynamic Entities pilot feature. This feature is subject to change.
Each Gateway Hub node runs an internal Netprobe and Collection Agent to collect metrics on its own performance. You can connect to the internal Netprobe to monitor Gateway Hub from your Active Console. You must connect to each node individually.
Gateway Hub datapoints must be processed by configuring Dynamic Entities. To simplify this process, you can download an include file that provides this configuration. The Gateway Hub self monitoring include file uses the new simplified Dynamic Entities configuration options; this is a pilot feature and is subject to change.
Note
In Gateway Hub version 2.4.0, the Collection Agent plugin used to collect self-monitoring metrics was updated from linux-infra to system. This requires an updated include file with the correct plugin name and other changes. Ensure you have the correct include file for the version of Gateway Hub you are using.
Configure default self monitoring
To enable Gateway Hub self monitoring in Active Console:
1. Download the Gateway Hub Integration from Downloads. This should contain the geneos-integration-gateway-hub-<version>.xml include file; save this file to a location accessible to your Gateway.
2. Open the Gateway Setup Editor.
3. Right-click the Includes top level section, then select New Include.
4. Set the following options:
   - Priority — Any value above 1.
   - Location — Specify the path to the location of the geneos-integration-gateway-hub-<version>.xml include file.
5. Right-click the include file in the State Tree and select Expand all.
6. Select Click to load. The new include file will load.
7. Right-click the Probes top-level section, then select New Probe.
8. Set the following options in the Basic tab (a connectivity check for these settings is sketched after this procedure):
   - Name — Specify the name that will appear in Active Console, for example Gateway_Hub_Self_Monitoring.
   - Hostname — Specify the hostname of your Gateway Hub node.
   - Port — 7036.
   - Secure — Enabled.
9. Set the following option in the Dynamic Entities tab:
   - Mapping type — Hub.
10. Repeat steps 7 to 9 for each Gateway Hub node you wish to monitor.
11. Click Save current document.
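The internal Netprobe listens on port 7036 with TLS enabled, so a quick way to confirm that a node is reachable before adding it as a probe is to attempt a TLS handshake against that port. This is a minimal sketch only, not part of the integration: the host name is a placeholder, and certificate verification is disabled because the check is purely about reachability.

```python
import socket
import ssl

def netprobe_reachable(host: str, port: int = 7036, timeout: float = 5.0) -> bool:
    """Return True if a TLS handshake succeeds against the internal Netprobe port."""
    context = ssl.create_default_context()
    # Verification is relaxed here; this only tests that the secure port is reachable.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except (OSError, ssl.SSLError):
        return False

if __name__ == "__main__":
    # "hub-node-1.example.com" is a placeholder for one of your Gateway Hub nodes.
    print(netprobe_reachable("hub-node-1.example.com"))
```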
In your Active Console, Dynamic Entities are created for metrics from Gateway Hub self monitoring.
Dataviews are automatically populated from available metrics; in general, an entity is created for each Gateway Hub component. Components that use Java will also include metrics on JVM performance.
Depending on how you have configured your Active Console, you may see repeated entities in the State Tree. This is because there are multiple Gateway Hub nodes each running the same components. To organise your State Tree by node, perform the following steps:
1. Open your Active Console.
2. Navigate to Tools > Settings > General.
3. Set the Viewpath option to hostname.
4. Click Apply.
Configure log file monitoring
You can additionally monitor Gateway Hub log files using the FKM plugin. To do this you must configure a new Managed Entity using the hub-logs sampler provided as part of the geneos-integration-gateway-hub-<version>.xml include file.
To enable Gateway Hub log monitoring:
1. Open the Gateway Setup Editor.
2. Right-click the Managed Entities top level section, then select New Managed Entity.
3. Set the following options:
   - Name — Specify the name that will appear in Active Console, for example Hub_Log_Monitoring.
   - Options > Probe — Specify the name of the internal Gateway Hub probe.
   - Sampler — hub-logs.
4. Repeat steps 2 to 3 for each Gateway Hub node you wish to monitor.
5. Click Save current document.
In your Active Console an additional Managed Entity, with the name you specified, will show the status of Gateway Hub’s log files on that node.
If you have configured Gateway Hub to store log files in a directory other than the default, you must direct the sampler to your logs directory. To do this, specify the hub-logs-dir variable in the Advanced tab of your Managed Entity. For more information about setting variables, see managedEntities > managedEntity > var in Managed Entities and Managed Entity Groups.
Important metrics
If you have configured your Gateway to receive Gateway Hub self monitoring metrics, you may want to set up alerts for changes in the most important metrics. This section outlines the key metrics for the major Gateway Hub components and provides advice to help create meaningful alerts.
JVM memory
You can observe JVM memory with the following metrics:
Observing high heap usage, where jvm_memory_pool_heap_used is nearing jvm_memory_heap_max, can be an indicator that the heap memory allocation may not be sufficient.
Note
While there might be a temptation to increase the memory allocation, the Gateway Hub installer calculates the ideal Gateway Hub memory settings based on the size of the machine being used. It is important not to over-allocate memory to any Gateway Hub component, as it may result in an over-commitment that can produce unexpected behaviours including failures or swapping.
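For alerting purposes, heap pressure can be expressed as the ratio of jvm_memory_pool_heap_used to jvm_memory_heap_max. A minimal sketch, assuming you already have the two metric values in hand; the 0.9 threshold is illustrative, not a documented default.

```python
def heap_pressure(used_bytes: float, max_bytes: float, threshold: float = 0.9) -> bool:
    """Return True when heap usage is close enough to the maximum to warrant an alert."""
    if max_bytes <= 0:  # jvm_memory_heap_max may be undefined
        return False
    return used_bytes / max_bytes >= threshold

# Example: 3.7 GiB used out of a 4 GiB maximum heap is above the 90% threshold.
print(heap_pressure(3.7 * 2**30, 4 * 2**30))  # True
```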
Metric | Source | Description |
---|---|---|
jvm_memory_heap_committed | StatsD | Amount of memory (in bytes) that is guaranteed to be available for use by the JVM. The amount of committed memory may change over time (increase or decrease). The value of committed memory is always greater than or equal to the value of used memory. |
jvm_memory_heap_max | StatsD | Maximum amount of memory (in bytes) that can be used for memory management. Its value may be undefined. The maximum amount of memory may change over time if defined. The value of used and committed memory is always less than or equal to the maximum, if the maximum is defined. Memory allocation may fail if it attempts to increase the used memory such that the used memory is greater than the committed memory, even if the used memory would still be less than or equal to the maximum memory (for example, when the system is low on virtual memory). |
jvm_memory_pool_heap_used | StatsD | Amount of memory currently used (in bytes) by the JVM. |
jvm_memory_gc_collection_count | StatsD | Number of garbage collections that have occurred in the JVM life cycle. |
jvm_memory_gc_collection_time | StatsD | Time spent in garbage collection. |
JVM garbage collection
You can observe JVM garbage collection with the following metrics:
When creating alerts, note the following:
- Long pauses due to garbage collection will negatively impact any JVM-based process, particularly if it is latency sensitive. A sketch for estimating the share of time spent in garbage collection between samples follows this list.
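Because jvm_memory_gc_collection_time is cumulative, the useful signal is its rate of change, that is, the share of wall-clock time spent in garbage collection between two samples. A minimal sketch, assuming the metric is sampled at a fixed interval.

```python
def gc_time_fraction(prev_gc_ms: float, curr_gc_ms: float, interval_s: float) -> float:
    """Fraction of the sampling interval spent in garbage collection."""
    return max(curr_gc_ms - prev_gc_ms, 0.0) / (interval_s * 1000.0)

# Example: 450 ms of additional GC time over a 60 s sampling interval is 0.75% in GC.
print(f"{gc_time_fraction(12_000, 12_450, 60):.4f}")  # 0.0075
```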
Metric | Source | Description |
---|---|---|
jvm_memory_gc_collection_count | StatsD | Total number of collections that have occurred. |
jvm_memory_gc_collection_time | StatsD | Approximate accumulated collection elapsed time in milliseconds. |
Kafka consumer metrics
Several Gateway Hub daemons contain Kafka consumers that consume messages from Kafka topic partitions and process them.
Key consumer metrics
You can observe Kafka consumers with the following metrics:
When creating alerts, note the following:
- The kafka_consumer_bytes_consumed_rate metric is a measure of network bandwidth. This should stay largely constant. If it does not, then this may indicate network problems.
- The kafka_consumer_records_consumed_total metric is a measure of actual records consumed. This may fluctuate depending on the message size, and may not correlate with bytes consumed. In a healthy application, you would expect this metric to stay fairly constant. If this measure drops to zero it may indicate a consumer failure (see the sketch after this list).
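One way to act on the second point is to compare successive samples of kafka_consumer_records_consumed_total: a cumulative counter that stops increasing while the system is under load suggests a stalled consumer. A minimal sketch under that assumption; the sample window is illustrative.

```python
from typing import Sequence

def consumer_stalled(samples: Sequence[int]) -> bool:
    """Return True if the cumulative records-consumed counter has stopped increasing
    across the supplied consecutive samples."""
    return len(samples) >= 2 and samples[-1] <= samples[0]

# Example: the counter is flat across the last three samples, so the consumer is likely stalled.
print(consumer_stalled([104_230, 104_230, 104_230]))  # True
```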
Metric | Source | Dimensions | Description |
---|---|---|---|
kafka_consumer_bytes_consumed_rate | StatsD | app, client-id, hostname, topic | Average bytes consumed per topic, per second. |
kafka_consumer_bytes_consumed_rate | StatsD | app, client-id, hostname | Average bytes consumed per second. |
kafka_consumer_bytes_consumed_total | StatsD | app, client-id, hostname | Total bytes consumed. |
kafka_consumer_bytes_consumed_total | StatsD | app, client-id, hostname, topic | Total bytes consumed by topic. |
kafka_consumer_records_consumed_rate | StatsD | app, client-id, hostname, topic | Average number of records consumed per topic, per second. |
kafka_consumer_records_consumed_rate | StatsD | app, client-id, hostname | Average number of records consumed per second. |
kafka_consumer_records_consumed_total | StatsD | app, client-id, hostname, topic | Total records consumed. |
Consumer lag
If a Kafka topic fills faster than the topic is consumed, then we get what is known as "lag". High lag means that your system is not keeping up with messages. Near-zero lag means that it is keeping up. High lag, in operational terms, means that there may be significant latency between a message being ingested into Gateway Hub and when that message is reflected to a user (via a query or some other means). You should try to ensure that lag is close to zero; increasing lag is a problem.
You can observe lag with the following metrics:
These Kafka consumer metrics work at the consumer level and not at the consumer group level. To get a complete picture you should watch the equivalent metrics across all nodes in the cluster.
When creating alerts, note the following:
- The kafka_consumer_records_lag metric is the actual lag between the specific consumer in the daemon and the producer for the specified topic/partition. You should monitor this metric closely as it is a key indicator that the system may not be processing records quickly enough.
- If lag for the same topic across all nodes is roughly 0, there is no problem.
- If lag is significantly higher for the same topics on different nodes, then a problem is likely present on specific nodes.
- If lag is high across all nodes, then it may be an indicator that the Gateway Hub is overloaded across all nodes, possibly because the load is higher than the node hardware is rated for. These three scenarios are sketched after this list.
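The three scenarios above can be expressed as a single classification over the per-node kafka_consumer_records_lag values for a topic. A minimal sketch, with illustrative thresholds that should be tuned to your ingestion rate.

```python
from typing import Mapping

def classify_lag(lag_by_node: Mapping[str, int],
                 low: int = 100, high: int = 10_000) -> str:
    """Classify per-node consumer lag for one topic. Thresholds are illustrative."""
    values = list(lag_by_node.values())
    if all(v <= low for v in values):
        return "healthy"                 # lag is roughly zero on every node
    if all(v >= high for v in values):
        return "cluster overloaded"      # lag is high everywhere
    return "node-specific problem"       # lag differs significantly between nodes

# Example: one node lags far behind the others.
print(classify_lag({"hub1": 12, "hub2": 25_000, "hub3": 40}))  # node-specific problem
```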
Metric | Source | Dimensions | Description |
---|---|---|---|
kafka_consumer_records_lag | StatsD | app, client-id, hostname, partition, topic | Number of messages the consumer is behind the producer on this partition. |
kafka_consumer_records_lag_avg | StatsD | app, client-id, hostname, partition, topic | Average number of messages the consumer is behind the producer on this partition. |
kafka_consumer_records_lag_max | StatsD | app, client-id, hostname, partition, topic | Maximum number of messages the consumer is behind the producer on this partition. |
Fetch rate
You can observe fetch rates with the following metrics:
When creating alerts, note the following:
- The kafka_consumer_fetch_rate metric is an indicator that the consumer is performing fetches, and this should be fairly constant for a healthy consumer. If it drops, then this could mean there is a problem in the consumer.
Metric | Source | Dimensions | Description |
---|---|---|---|
kafka_consumer_fetch_rate | StatsD | app, client-id, hostname | Number of fetch requests per second. |
kafka_consumer_fetch_size_avg | StatsD | app, client-id, hostname | Average number of bytes fetched per request. |
kafka_consumer_fetch_size_avg | StatsD | app, client-id, hostname, topic | Average number of bytes fetched per request, per topic. |
kafka_consumer_fetch_size_max | StatsD | app, client-id, hostname, topic | Maximum number of bytes fetched per request. |
Kafka producer metrics
Kafka producers publish records to Kafka topics. When producers publish messages in a reliable system, they must be sure that messages have been received (unless explicitly configured not to care). To do this, publishers receive acknowledgements from brokers. In some configurations, a producer does not require acknowledgements from all brokers; it merely needs to receive a minimum number (to achieve a quorum). In other configurations, it may need acknowledgements from all brokers. In either case, the act of receiving acknowledgements is somewhat latency-sensitive and will impact how fast a producer can push messages.
When configuring Kafka producers, consider the following:
- Producers can send messages in batches. This will generally be more efficient than sending individual messages, as it means the conversation with the broker is less extensive and fewer acknowledgements are required.
- Producers can compress messages. Compression makes messages smaller, which requires less network bandwidth; however, it means more CPU power is needed. An illustrative producer configuration is sketched after this list.
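As an illustration of this trade-off, the sketch below configures a producer using the standard Kafka client settings for batch size, linger time, and compression, via the kafka-python library. The values and broker address are placeholders shown for context only; they are not Gateway Hub tuning recommendations, and Gateway Hub configures its own internal producers.

```python
from kafka import KafkaProducer  # pip install kafka-python

# Illustrative values only: larger batches and a short linger improve throughput
# at the cost of per-message latency; compression trades CPU for network bandwidth.
producer = KafkaProducer(
    bootstrap_servers="hub-node-1.example.com:9092",  # placeholder broker address
    batch_size=64 * 1024,      # maximum bytes per batch sent to a partition
    linger_ms=20,              # wait up to 20 ms to fill a batch before sending
    compression_type="lz4",    # compress batches to reduce network bandwidth
    acks="all",                # wait for in-sync replicas to acknowledge
)
producer.send("example-topic", b"payload")
producer.flush()
```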
You can observe Kafka producer behaviour with the following metrics:
When creating alerts, note the following:
- The kafka_producer_batch_size_avg metric indicates the average size of batches sent to the broker. Large batches are preferred, since small batches do not compress well and need to be sent more often, thus requiring more network traffic. This value should not vary greatly under a reasonably constant load.
- If the kafka_producer_node_response_rate metric is low, this may indicate that the producer is falling behind and that data cannot be consumed at an ideal rate. This value should not vary greatly under a reasonably constant load.
- If the kafka_producer_request_rate is low under high load, this could indicate an issue with the producer. An extremely high rate could also indicate a problem, as it may mean the consumer struggles to keep up (which could require throttling).
- Generally, batches should be large. However, large batches may also increase the kafka_producer_node_request_latency_avg metric. This is because a producer may wait until it builds up a big enough batch before it initiates a send operation (this behaviour is controlled by the linger.ms Kafka setting). You should prefer throughput over latency; however, this is a trade-off and too much latency can also be problematic. Large batches are most likely to cause a problem in high load scenarios.
- If the kafka_producer_io_wait_time_ns_avg metric is high, this means that the producer is spending a lot of time waiting on network resources while the CPU is essentially idle. This may point to resource saturation, a slow network, or similar problems.
Metric | Source | Dimensions | Description |
---|---|---|---|
kafka_producer_batch_size_avg | StatsD | app, client-id, hostname | Average number of bytes sent per partition, per request. |
kafka_producer_compression_rate_avg | StatsD | app, client-id, hostname | Average compression rate of record batches. |
kafka_producer_node_response_rate | StatsD | app, client-id, hostname, node-id | Average number of responses received per second from the broker. |
kafka_producer_request_rate | StatsD | app, client-id, hostname | Average number of requests sent per second to the broker. |
kafka_producer_node_request_latency_avg | StatsD | app, client-id, hostname, node-id | Average request latency in milliseconds for a node. |
kafka_producer_io_wait_time_ns_avg | StatsD | app, client-id, hostname | Average length of time the I/O thread spends waiting for a socket ready for reads or writes, in nanoseconds. |
kafka_producer_outgoing_byte_rate | StatsD | app, client-id, hostname | Average number of bytes sent per second to the broker. |
Kafka broker metrics
All Kafka messages pass through a broker, so if the broker is encountering problems this can have a wider impact on performance and reliability.
Note that Kafka broker metrics are collected via the Kafka plugin, which uses JMX. The consumer and producer metrics listed above are gathered by the StatsD plugin for each specific process.
You can observe Kafka broker behaviour with the following metrics:
When creating alerts, note the following:
- The server_replica_under_replicated_partitions metric should never be greater than zero.
- The server_replica_isr_expands_per_sec_<attribute> metric should not vary significantly.
- The controller_active_controller_count metric must be equal to one. There should be exactly one controller per cluster.
- The controller_offline_partitions_count metric should never be greater than zero.
- A high rate of leader elections, as indicated by the controller_leader_election_rate_and_time_ms metric, suggests brokers are fluctuating between offline and online statuses. Additionally, taking too long to elect a leader will result in partitions being inaccessible for long periods.
- The controller_unclean_leader_elections_per_sec_count metric should never be greater than zero. A sketch of these checks follows this list.
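A minimal sketch of these broker checks, assuming the metric values have already been collected into a dictionary keyed by the names above (attribute suffixes omitted for brevity).

```python
def broker_alerts(m: dict) -> list:
    """Evaluate the broker rules described above and return a list of alert messages."""
    alerts = []
    if m.get("server_replica_under_replicated_partitions", 0) > 0:
        alerts.append("under-replicated partitions present")
    if m.get("controller_active_controller_count", 1) != 1:
        alerts.append("cluster does not have exactly one active controller")
    if m.get("controller_offline_partitions_count", 0) > 0:
        alerts.append("offline partitions present")
    if m.get("controller_unclean_leader_elections_per_sec_count", 0) > 0:
        alerts.append("unclean leader elections occurred")
    return alerts

print(broker_alerts({"server_replica_under_replicated_partitions": 2}))
```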
Metric | Source | Dimensions | Description |
---|---|---|---|
server_replica_under_replicated_partitions | Kafka | app, broker_id, cluster_id, hostname | Number of under-replicated partitions. Each Kafka partition may be replicated in order to provide reliability guarantees. In the set of replicas for a given partition, one is chosen as the leader. The leader is always considered in sync. The remaining replicas are also considered in sync, provided they are not too far behind the leader. Synchronised replicas form the ISR (In Sync Replica) pool. If a partition lags too far behind, it is removed from the ISR pool. Producers may require a minimum number of ISRs in order to operate reliably. When the ISR pool shrinks, you will see an increase in under-replicated partitions. |
server_replica_isr_expands_per_sec_<attribute> | Kafka | app, broker_id, cluster_id, hostname | Rate at which the pool of in-sync replicas (ISRs) expands. |
server_replica_isr_shrinks_per_sec_<attribute> | Kafka | app, broker_id, cluster_id, hostname | Rate at which the pool of in-sync replicas (ISRs) shrinks. |
controller_active_controller_count | Kafka | app, broker_id, cluster_id, hostname | Number of active controllers in the cluster. |
controller_offline_partitions_count | Kafka | app, broker_id, cluster_id, hostname | Number of partitions that do not have an active leader and are hence not writable or readable. |
controller_leader_election_rate_and_time_ms | Kafka | app, broker_id, cluster_id, hostname | Rate of leader elections per second and the overall duration the cluster went without a leader. |
controller_unclean_leader_elections_per_sec_count | Kafka | app, broker_id, cluster_id, hostname | Unclean leader election rate. If a broker goes offline, some partitions will be leaderless and Kafka will elect a new leader from the ISR pool. Gateway Hub does not allow unclean elections, hence the new leader must come from the ISR pool. |
Zookeeper metrics
Kafka uses Zookeeper to store metadata about topics and brokers. It plays a critical role in ensuring Kafka’s performance and stability. If Zookeeper is not available Kafka cannot function.
Zookeeper is very sensitive to IO latency. In particular, disk latency can have severe impacts on Zookeeper, because quorum operations must be completed quickly.
You can observe Zookeeper with the following metrics:
When creating alerts, note the following:
- The zookeeper_outstanding_requests metric should be low. Confluent suggests that this value should be below 10.
- The zookeeper_avg_request_latency metric should be as low as possible (typical values should be less than 10ms), and ideally fairly constant. You should investigate if this number spikes or shows variability.
- If the zookeeper_fsync_threshold_exceed_count metric increases steadily, there may be a problem with disk latency. Ideally, this metric should be zero or static (in the case of recovery after a latency problem). A sketch of these checks follows this list.
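These rules can be expressed as simple threshold checks. A minimal sketch, using the limits suggested above (10 outstanding requests, 10 ms average latency) and assuming the fsync-exceeded count is compared against the previous sample.

```python
def zookeeper_alerts(outstanding_requests: int,
                     avg_request_latency_ms: float,
                     fsync_exceed_count: int,
                     prev_fsync_exceed_count: int) -> list:
    """Apply the Zookeeper alerting guidance described above."""
    alerts = []
    if outstanding_requests >= 10:
        alerts.append("outstanding requests at or above 10")
    if avg_request_latency_ms >= 10:
        alerts.append("average request latency at or above 10 ms")
    if fsync_exceed_count > prev_fsync_exceed_count:
        alerts.append("fsync threshold exceed count is increasing (possible disk latency)")
    return alerts

print(zookeeper_alerts(2, 3.5, 5, 5))  # [] means healthy
```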
Metric | Source | Dimensions | Description |
---|---|---|---|
zookeeper_outstanding_requests | Zookeeper | app, hostname, port | Number of requests from followers that have yet to be acknowledged. |
zookeeper_avg_request_latency | Zookeeper | app, hostname, port | Average time to respond to a client request. |
zookeeper_max_client_cnxns_per_host | Zookeeper | app, hostname, port | Maximum number of concurrent client connections allowed per host. |
zookeeper_num_alive_connections | Zookeeper | app, hostname, port | Number of connections currently open. Should be well under the configured maximum connections for safety. |
zookeeper_fsync_threshold_exceed_count | Zookeeper | app, hostname, port | Count of instances where fsync time has exceeded the warning threshold. |
etcd metrics
etcd is used by Gateway Hub as a key-value store. Disk latency is the most important etcd metric, but CPU starvation can also cause problems.
Many etcd metrics are provided as histograms composed of several gauges. Etcd histogram buckets are cumulative. See below for an example:
disk_wal_fsync_duration_seconds_bucket_0.001 = 2325
disk_wal_fsync_duration_seconds_bucket_0.002 = 4642
disk_wal_fsync_duration_seconds_bucket_0.004 = 5097
disk_wal_fsync_duration_seconds_bucket_0.008 = 5187
disk_wal_fsync_duration_seconds_bucket_0.016 = 5248
disk_wal_fsync_duration_seconds_bucket_0.032 = 5253
disk_wal_fsync_duration_seconds_bucket_0.064 = 5254
disk_wal_fsync_duration_seconds_bucket_0.128 = 5254
disk_wal_fsync_duration_seconds_bucket_0.256 = 5254
disk_wal_fsync_duration_seconds_bucket_0.512 = 5254
disk_wal_fsync_duration_seconds_bucket_1.024 = 5254
disk_wal_fsync_duration_seconds_bucket_2.048 = 5254
disk_wal_fsync_duration_seconds_bucket_4.096 = 5254
disk_wal_fsync_duration_seconds_bucket_8.192 = 5254
disk_wal_fsync_duration_seconds_sum = 7.362459756999995
disk_wal_fsync_duration_seconds_count = 5254
The value of the disk_wal_fsync_duration_seconds_bucket_<x.y> metric is the cumulative count of observations that completed within the duration given by the <x.y> suffix. In this example, the value increases up to the 0.064 bucket and then remains static, meaning every recorded fsync completed in 0.064 seconds or less.
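Because the buckets are cumulative, a latency percentile can be estimated by finding the first bucket whose cumulative count covers the required share of the total count. A minimal sketch using the sample values above; the same approach applies to the other etcd histograms discussed below.

```python
def percentile_from_buckets(buckets: dict, total: int, q: float) -> float:
    """Estimate the q-quantile (0 < q < 1) from cumulative histogram buckets.

    `buckets` maps the upper bound of each bucket (in seconds) to its cumulative count.
    Returns the upper bound of the first bucket covering the requested share.
    """
    target = q * total
    for upper_bound in sorted(buckets):
        if buckets[upper_bound] >= target:
            return upper_bound
    return float("inf")  # more than (1 - q) of the observations exceeded every bucket

buckets = {0.001: 2325, 0.002: 4642, 0.004: 5097, 0.008: 5187,
           0.016: 5248, 0.032: 5253, 0.064: 5254}
print(percentile_from_buckets(buckets, total=5254, q=0.99))  # 0.016
```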
Key metrics
You can observe etcd with the following metrics:
When creating alerts, note the following:
- The rate of server health failures should be low.
- If the server_heartbeat_send_failures_total metric is increasing, this may indicate a slow disk.
Metric | Source | Description |
---|---|---|
server_health_failures | prometheus-target | Total number of failed health checks. |
server_heartbeat_send_failures_total | prometheus-target | Total number of leader heartbeat send failures (likely overloaded from slow disk). If non-zero and increasing, this could be due to a slow disk and can be a prelude to a cluster failover. |
Etcd latency metrics
Most issues with etcd relate to slow disks. High disk latency can cause long pauses that will lead to missed heartbeats, and potentially failovers in the etcd cluster. Disk latency will also contribute to high request latency.
You can observe etcd latency with the following metrics:
When creating alerts, note the following:
- The 99th percentile of the disk_backend_commit_duration_seconds_bucket_<bucket> metric should be less than 25 ms.
- The disk_wal_fsync_duration_seconds_bucket_<bucket> metric should be fairly constant and, ideally, as low as possible.
Metric | Source | Description |
---|---|---|
disk_backend_commit_duration_seconds_bucket_<bucket> | prometheus-target | Presented as a histogram. The latency distribution of commits called by the backend. |
disk_wal_fsync_duration_seconds_bucket_<bucket> | prometheus-target | Presented as a histogram. The latency distribution of fsync calls made by the Write Ahead Log (WAL). |
PostgreSQL metrics
Gateway Hub stores metrics in a PostgreSQL database.
You can observe PostgreSQL performance with the following metrics:
When creating alerts, note the following:
- Ensure that excess CPU usage by PostgreSQL is monitored. For example, a runaway query could consume excessive CPU, which can affect the whole Gateway Hub node. Other causes of excessive CPU usage include: expensive background workers, expensive queries, high ingestion rates, misconfiguration of background worker counts, and more.
- Do not manually configure PostgreSQL after installation.
- At no time should PostgreSQL or any other process be swapping; the memory budget should ensure that overall memory allocation does not exceed the total memory.
- It is possible that PostgreSQL may run at or near to its upper memory allocation. Storage systems will buffer pages into memory for caching and other efficiency reasons, so it is normal to see high but fairly constant memory usage.
- Disk usage should not exceed 90 percent of disk_total for any volume; that is, the disk_free metric should not fall below 10 percent of disk_total.
for any volume. - PostgreSQL sets an upper limit on concurrent connections. If this limit is exceeded, database clients (including application code), may be blocked while waiting to be allocated a connection. A blocked client can result in a timeout which can significantly impact performance or throughput. During Gateway Hub installation PostgreSQL is configured with a maximum connection limit based on the available hardware.
- The likely causes of a connection limit problem are connection leaks in application code or manually accessing the PostgreSQL database. Both should be avoided.
- The active_connections metric should not exceed 80 percent of max_connections. A sketch of these checks follows this list.
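The disk and connection guidance above reduces to two ratio checks. A minimal sketch, assuming the four metric values are available for a node; the thresholds are the ones quoted above.

```python
def postgres_alerts(disk_free: float, disk_total: float,
                    active_connections: int, max_connections: int) -> list:
    """Apply the disk-usage and connection-count guidance described above."""
    alerts = []
    used_fraction = (disk_total - disk_free) / disk_total
    if used_fraction > 0.90:
        alerts.append("volume usage above 90% of disk_total")
    if active_connections > 0.80 * max_connections:
        alerts.append("active connections above 80% of max_connections")
    return alerts

# Example: 50 GiB free of 400 GiB, 150 of 300 connections in use -> no alerts.
print(postgres_alerts(50, 400, 150, 300))  # []
```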
Metric | Source | Dimensions | Description |
---|---|---|---|
processes_cpu_load | System | comm, hostname, pid | CPU usage for a specific process, identified by the pid dimension. |
memory_used_swap_percent | System | hostname | Percentage of swap memory used on the host. |
processes_virtual_size | System | hostname, process_id, process_name | Sum of all mapped memory used by a specific process, including swap. |
processes_resident_set_size | System | hostname, process_id, process_name | Sum of all physical memory used by a specific process, excluding swap. Known as the resident set. |
disk_free | System | hostname, volume | Total free space on the specified volume. |
disk_total | System | hostname, volume | Total space on the specified volume. |
active_connections | PostgreSQL | app, hostname | Number of connections currently active on the server. |
app_connections | PostgreSQL | app, hostname, username | Number of connections currently active on the server, grouped by application. |
max_connections | PostgreSQL | app, hostname | Maximum number of connections for the PostgreSQL server. |