Gateway Hub
Gateway Hub self monitoring is now a required part of a Gateway Hub system. You must configure self monitoring to receive full support.
Overview
You can monitor Gateway Hub via Gateway by using the metrics collected from the internal Netprobes running on each Gateway Hub node.
Configuring Gateway Hub self monitoring allows you to:
- View Gateway Hub metrics in your Active Console or Web Console.
- Investigate historical Gateway Hub metrics in your Active Console or Web Console.
- Set Gateway alerting rules on Gateway Hub metrics.
You can also use the Gateway Hub data Gateway plugin to monitor the status of the connection between Gateway and Gateway Hub. For more information see Gateway Hub data in Gateway Plug-Ins.
Prerequisites
The following requirements must be met prior to the installation and setup of this integration:
- Gateway version 5.5.x or newer.
- Gateway Hub version 2.3.0 or newer.
You must have a valid Gateway license to use this feature outside of demo mode. Any licensing errors are reported by the Health plugin.
Gateway Hub self monitoring
Caution
Gateway Hub self monitoring relies on the new simplified mappings for Dynamic Entities pilot feature. This feature is subject to change.
Each Gateway Hub node runs an internal Netprobe and Collection Agent to collect metrics on its own performance. You can connect to the internal Netprobe to monitor Gateway Hub from your Active Console. You must connect to each node individually.
Gateway Hub datapoints must be processed by configuring Dynamic Entities. To simplify this process, you can download an include file that provides this configuration. The Gateway Hub self monitoring include file uses the new simplified Dynamic Entities configuration options; this is a pilot feature and is subject to change.
Note
In Gateway Hub version 2.4.0, the Collection Agent plugin used to collect self-monitoring metrics was updated from linux-infra to system. This requires an updated include file with the correct plugin name and other changes. Ensure you have the correct include file for the version of Gateway Hub you are using.
Configure default self monitoring
To enable Gateway Hub self monitoring in Active Console:
1. Download the Gateway Hub Integration from Downloads. This should contain the geneos-integration-gateway-hub-<version>.xml include file; save this file to a location accessible to your Gateway.
2. Open the Gateway Setup Editor.
3. Right-click the Includes top level section, then select New Include.
4. Set the following options:
   - Priority — Any value above 1.
   - Location — Specify the path to the location of the geneos-integration-gateway-hub-<version>.xml include file.
5. Right-click the include file in the State Tree and select Expand all.
6. Select Click to load. The new include file will load.
7. Right-click the Probes top-level section, then select New Probe.
8. Set the following options in the Basic tab (a connectivity check for these settings is sketched after this procedure):
   - Name — Specify the name that will appear in Active Console, for example Gateway_Hub_Self_Monitoring.
   - Hostname — Specify the hostname of your Gateway Hub node.
   - Port — 7036.
   - Secure — Enabled.
9. Set the following option in the Dynamic Entities tab:
   - Mapping type — Hub.
10. Repeat steps 7 to 9 for each Gateway Hub node you wish to monitor.
11. Click Save current document.
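The internal Netprobe listens on port 7036 with TLS enabled, so a quick way to confirm that a node is reachable before adding it as a probe is to attempt a TLS handshake against that port. This is a minimal sketch only, not part of the integration: the host name is a placeholder, and certificate verification is disabled because the check is purely about reachability.

```python
import socket
import ssl

def netprobe_reachable(host: str, port: int = 7036, timeout: float = 5.0) -> bool:
    """Return True if a TLS handshake succeeds against the internal Netprobe port."""
    context = ssl.create_default_context()
    # Verification is relaxed here; this only tests that the secure port is reachable.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except (OSError, ssl.SSLError):
        return False

if __name__ == "__main__":
    # "hub-node-1.example.com" is a placeholder for one of your Gateway Hub nodes.
    print(netprobe_reachable("hub-node-1.example.com"))
```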
In your Active Console, Dynamic Entities are created for metrics from Gateway Hub self monitoring.
Dataviews are automatically populated from available metrics; in general, an entity is created for each Gateway Hub component. Components that use Java will also include metrics on JVM performance.
Depending on how you have configured your Active Console, you may see repeated entities in the State Tree. This is because there are multiple Gateway Hub nodes each running the same components. To organise your State Tree by node, perform the following steps:
1. Open your Active Console.
2. Navigate to Tools > Settings > General.
3. Set the Viewpath option to hostname.
4. Click Apply.
Configure log file monitoring
You can additionally monitor Gateway Hub log files using the FKM plugin. To do this you must configure a new Managed Entity using the hub-logs sampler provided as part of the geneos-integration-gateway-hub-<version>.xml include file.
To enable Gateway Hub log monitoring:
1. Open the Gateway Setup Editor.
2. Right-click the Managed Entities top level section, then select New Managed Entity.
3. Set the following options:
   - Name — Specify the name that will appear in Active Console, for example Hub_Log_Monitoring.
   - Options > Probe — Specify the name of the internal Gateway Hub probe.
   - Sampler — hub-logs.
4. Repeat steps 2 to 3 for each Gateway Hub node you wish to monitor.
5. Click Save current document.
In your Active Console an additional Managed Entity, with the name you specified, will show the status of Gateway Hub’s log files on that node.
If you have configured Gateway Hub to store log files in a directory other than the default, you must direct the sampler to your logs directory. To do this, specify the hub-logs-dir variable in the Advanced tab of your Managed Entity. For more information about setting variables, see managedEntities > managedEntity > var in Managed Entities and Managed Entity Groups.
Important metrics
If you have configured your Gateway to receive Gateway Hub self monitoring metrics, you may want to set up alerts for changes in the most important metrics. This section outlines the key metrics for the major Gateway Hub components and provides advice to help create meaningful alerts.
JVM memory
You can observe JVM memory with the following metrics:
Observing high heap usage, where jvm_memory_pool_heap_used is nearing jvm_memory_heap_max, can be an indicator that the heap memory allocation may not be sufficient.
Note
While there might be a temptation to increase the memory allocation, the Gateway Hub installer calculates the ideal Gateway Hub memory settings based on the size of the machine being used. It is important not to over-allocate memory to any Gateway Hub component, as it may result in an over-commitment that can produce unexpected behaviours including failures or swapping.
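For alerting purposes, heap pressure can be expressed as the ratio of jvm_memory_pool_heap_used to jvm_memory_heap_max. A minimal sketch, assuming you already have the two metric values in hand; the 0.9 threshold is illustrative, not a documented default.

```python
def heap_pressure(used_bytes: float, max_bytes: float, threshold: float = 0.9) -> bool:
    """Return True when heap usage is close enough to the maximum to warrant an alert."""
    if max_bytes <= 0:  # jvm_memory_heap_max may be undefined
        return False
    return used_bytes / max_bytes >= threshold

# Example: 3.7 GiB used out of a 4 GiB maximum heap is above the 90% threshold.
print(heap_pressure(3.7 * 2**30, 4 * 2**30))  # True
```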
Metric | Source | Description |
---|---|---|
jvm_memory_heap_committed | StatsD | Amount of memory (in bytes) that is guaranteed to be available for use by the JVM. The amount of committed memory may change over time (increase or decrease). The value of committed memory is always greater than or equal to the value of used memory. |
jvm_memory_heap_max | StatsD | Maximum amount of memory (in bytes) that can be used for memory management. Its value may be undefined. The maximum amount of memory may change over time if defined. The value of used and committed memory is always less than or equal to the maximum, if the maximum is defined. Memory allocation may fail if it attempts to increase the used memory such that the used memory is greater than the committed memory, even if the used memory would still be less than or equal to the maximum memory (for example, when the system is low on virtual memory). |
jvm_memory_pool_heap_used | StatsD | Amount of memory currently used (in bytes) by the JVM. |
jvm_memory_gc_collection_count | StatsD | Number of garbage collections that have occurred in the JVM life cycle. |
jvm_memory_gc_collection_time | StatsD | Time spent in garbage collection. |
JVM garbage collection
You can observe JVM garbage collection with the following metrics:
When creating alerts, note the following:
- Long pauses due to garbage collection will negatively impact any JVM-based process, particularly if it is latency sensitive. A sketch for estimating the share of time spent in garbage collection between samples follows this list.
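Because jvm_memory_gc_collection_time is cumulative, the useful signal is its rate of change, that is, the share of wall-clock time spent in garbage collection between two samples. A minimal sketch, assuming the metric is sampled at a fixed interval.

```python
def gc_time_fraction(prev_gc_ms: float, curr_gc_ms: float, interval_s: float) -> float:
    """Fraction of the sampling interval spent in garbage collection."""
    return max(curr_gc_ms - prev_gc_ms, 0.0) / (interval_s * 1000.0)

# Example: 450 ms of additional GC time over a 60 s sampling interval is 0.75% in GC.
print(f"{gc_time_fraction(12_000, 12_450, 60):.4f}")  # 0.0075
```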
Metric | Source | Description |
---|---|---|
jvm_memory_gc_collection_count | StatsD | Total number of collections that have occurred. |
jvm_memory_gc_collection_time | StatsD | Approximate accumulated collection elapsed time in milliseconds. |
Kafka consumer metrics
Several Gateway Hub daemons contain Kafka consumers that consume messages from Kafka topic partitions and process them.
Key consumer metrics
You can observe Kafka consumers with the following metrics:
When creating alerts, note the following:
- The kafka_consumer_bytes_consumed_rate metric is a measure of network bandwidth. This should stay largely constant. If it does not, then this may indicate network problems.
- The kafka_consumer_records_consumed_total metric is a measure of actual records consumed. This may fluctuate depending on the message size, and may not correlate with bytes consumed. In a healthy application, you would expect this metric to stay fairly constant. If this measure drops to zero it may indicate a consumer failure (see the sketch after this list).
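One way to act on the second point is to compare successive samples of kafka_consumer_records_consumed_total: a cumulative counter that stops increasing while the system is under load suggests a stalled consumer. A minimal sketch under that assumption; the sample window is illustrative.

```python
from typing import Sequence

def consumer_stalled(samples: Sequence[int]) -> bool:
    """Return True if the cumulative records-consumed counter has stopped increasing
    across the supplied consecutive samples."""
    return len(samples) >= 2 and samples[-1] <= samples[0]

# Example: the counter is flat across the last three samples, so the consumer is likely stalled.
print(consumer_stalled([104_230, 104_230, 104_230]))  # True
```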
Metric | Source | Dimensions | Description |
---|---|---|---|
kafka_consumer_bytes_consumed_rate | StatsD | app, client-id, hostname, topic | Average bytes consumed per topic, per second. |
kafka_consumer_bytes_consumed_rate | StatsD | app, client-id, hostname | Average bytes consumed per second. |
kafka_consumer_bytes_consumed_total | StatsD | app, client-id, hostname | Total bytes consumed. |
kafka_consumer_bytes_consumed_total | StatsD | app, client-id, hostname, topic | Total bytes consumed by topic. |
kafka_consumer_records_consumed_rate | StatsD | app, client-id, hostname, topic | Average number of records consumed per topic, per second. |
kafka_consumer_records_consumed_rate | StatsD | app, client-id, hostname | Average number of records consumed per second. |
kafka_consumer_records_consumed_total | StatsD | app, client-id, hostname, topic | Total records consumed. |
Consumer lag
If a Kafka topic fills faster than the topic is consumed, then we get what is known as "lag". High lag means that your system is not keeping up with messages. Near-zero lag means that it is keeping up. High lag, in operational terms, means that there may be significant latency between a message being ingested into Gateway Hub and when that message is reflected to a user (via a query or some other means). You should try to ensure that lag is close to zero; increasing lag is a problem.
You can observe lag with the following metrics:
These Kafka consumer metrics work at the consumer level and not at the consumer group level. To get a complete picture you should watch the equivalent metrics across all nodes in the cluster.
When creating alerts, note the following:
- The kafka_consumer_records_lag metric is the actual lag between the specific consumer in the daemon and the producer for the specified topic/partition. You should monitor this metric closely as it is a key indicator that the system may not be processing records quickly enough.
- If lag for the same topic across all nodes is roughly 0, there is no problem.
- If lag is significantly higher for the same topics on different nodes, then a problem is likely present on specific nodes.
- If lag is high across all nodes, then it may be an indicator that the Gateway Hub is overloaded across all nodes, possibly because the load is higher than the node hardware is rated for. These three scenarios are sketched after this list.
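The three scenarios above can be expressed as a single classification over the per-node kafka_consumer_records_lag values for a topic. A minimal sketch, with illustrative thresholds that should be tuned to your ingestion rate.

```python
from typing import Mapping

def classify_lag(lag_by_node: Mapping[str, int],
                 low: int = 100, high: int = 10_000) -> str:
    """Classify per-node consumer lag for one topic. Thresholds are illustrative."""
    values = list(lag_by_node.values())
    if all(v <= low for v in values):
        return "healthy"                 # lag is roughly zero on every node
    if all(v >= high for v in values):
        return "cluster overloaded"      # lag is high everywhere
    return "node-specific problem"       # lag differs significantly between nodes

# Example: one node lags far behind the others.
print(classify_lag({"hub1": 12, "hub2": 25_000, "hub3": 40}))  # node-specific problem
```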
Metric | Source | Dimensions | Description |
---|---|---|---|
kafka_consumer_records_lag | StatsD | app, client-id, hostname, partition, topic | Number of messages the consumer is behind the producer on this partition. |
kafka_consumer_records_lag_avg | StatsD | app, client-id, hostname, partition, topic | Average number of messages the consumer is behind the producer on this partition. |
kafka_consumer_records_lag_max | StatsD | app, client-id, hostname, partition, topic | Maximum number of messages the consumer is behind the producer on this partition. |
Fetch rate
You can observe fetch rates with the following metrics:
When creating alerts, note the following:
- The kafka_consumer_fetch_rate metric is an indicator that the consumer is performing fetches, and this should be fairly constant for a healthy consumer. If it drops, then this could mean there is a problem in the consumer.
Metric | Source | Dimensions | Description |
---|---|---|---|
kafka_consumer_fetch_rate | StatsD | app, client-id, hostname | Number of fetch requests per second. |
kafka_consumer_fetch_size_avg | StatsD | app, client-id, hostname | Average number of bytes fetched per request. |
kafka_consumer_fetch_size_avg | StatsD | app, client-id, hostname, topic | Average number of bytes fetched per request, per topic. |
kafka_consumer_fetch_size_max | StatsD | app, client-id, hostname, topic | Maximum number of bytes fetched per request. |
Kafka producer metrics
Kafka producers publish records to Kafka topics. When producers publish messages in a reliable system, they must be sure that messages have been received (unless explicitly configured not to care). To do this, publishers receive acknowledgements from brokers. In some configurations, a producer does not require acknowledgements from all brokers; it merely needs to receive a minimum number (to achieve a quorum). In other configurations, it may need acknowledgements from all brokers. In either case, the act of receiving acknowledgements is somewhat latency-sensitive and will impact how fast a producer can push messages.
When configuring Kafka producers, consider the following:
- Producers can send messages in batches. This will generally be more efficient than sending individual messages, as it means the conversation with the broker is less extensive and fewer acknowledgements are required.
- Producers can compress messages. Compression makes messages smaller, which requires less network bandwidth; however, it means more CPU power is needed. An illustrative producer configuration is sketched after this list.
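As an illustration of this trade-off, the sketch below configures a producer using the standard Kafka client settings for batch size, linger time, and compression, via the kafka-python library. The values and broker address are placeholders shown for context only; they are not Gateway Hub tuning recommendations, and Gateway Hub configures its own internal producers.

```python
from kafka import KafkaProducer  # pip install kafka-python

# Illustrative values only: larger batches and a short linger improve throughput
# at the cost of per-message latency; compression trades CPU for network bandwidth.
producer = KafkaProducer(
    bootstrap_servers="hub-node-1.example.com:9092",  # placeholder broker address
    batch_size=64 * 1024,      # maximum bytes per batch sent to a partition
    linger_ms=20,              # wait up to 20 ms to fill a batch before sending
    compression_type="lz4",    # compress batches to reduce network bandwidth
    acks="all",                # wait for in-sync replicas to acknowledge
)
producer.send("example-topic", b"payload")
producer.flush()
```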
You can observe Kafka producer behaviour with the following metrics:
When creating alerts, note the following:
- The kafka_producer_batch_size_avg metric indicates the average size of batches sent to the broker. Large batches are preferred, since small batches do not compress well and need to be sent more often, thus requiring more network traffic. This value should not vary greatly under a reasonably constant load.
- If the kafka_producer_node_response_rate metric is low, this may indicate that the producer is falling behind and that data cannot be consumed at an ideal rate. This value should not vary greatly under a reasonably constant load.
- If the kafka_producer_request_rate is low under high load, this could indicate an issue with the producer. An extremely high rate could also indicate a problem, as it may mean the consumer struggles to keep up (which could require throttling).
- Generally, batches should be large. However, large batches may also increase the kafka_producer_node_request_latency_avg metric. This is because a producer may wait until it builds up a big enough batch before it initiates a send operation (this behaviour is controlled by the linger.ms Kafka setting). You should prefer throughput over latency; however, this is a trade-off and too much latency can also be problematic. Large batches are most likely to cause a problem in high load scenarios.
- If the kafka_producer_io_wait_time_ns_avg metric is high, this means that the producer is spending a lot of time waiting on network resources while the CPU is essentially idle. This may point to resource saturation, a slow network, or similar problems.
Metric | Source | Dimensions | Description |
---|---|---|---|
kafka_producer_batch_size_avg | StatsD | app, client-id, hostname | Average number of bytes sent per partition, per request. |
kafka_producer_compression_rate_avg | StatsD | app, client-id, hostname | Average compression rate of record batches. |
kafka_producer_node_response_rate | StatsD | app, client-id, hostname, node-id | Average number of responses received per second from the broker. |
kafka_producer_request_rate | StatsD | app, client-id, hostname | Average number of requests sent per second to the broker. |
kafka_producer_node_request_latency_avg | StatsD | app, client-id, hostname, node-id | Average request latency in milliseconds for a node. |
kafka_producer_io_wait_time_ns_avg | StatsD | app, client-id, hostname | Average length of time the I/O thread spends waiting for a socket ready for reads or writes, in nanoseconds. |
kafka_producer_outgoing_byte_rate | StatsD | app, client-id, hostname | Average number of bytes sent per second to the broker. |
Kafka broker metrics
All Kafka messages pass through a broker, so if the broker is encountering problems this can have a wider impact on performance and reliability.
Note that Kafka broker metrics are collected via the Kafka plugin, which uses JMX. The consumer and producer metrics listed above are gathered by the StatsD plugin for each specific process.
You can observe Kafka broker behaviour with the following metrics:
When creating alerts, note the following:
- The server_replica_under_replicated_partitions metric should never be greater than zero.
- The server_replica_isr_expands_per_sec_<attribute> metric should not vary significantly.
- The controller_active_controller_count metric must be equal to one. There should be exactly one controller per cluster.
- The controller_offline_partitions_count metric should never be greater than zero.
- A high rate of leader elections, as indicated by the controller_leader_election_rate_and_time_ms metric, suggests brokers are fluctuating between offline and online statuses. Additionally, taking too long to elect a leader will result in partitions being inaccessible for long periods.
- The controller_unclean_leader_elections_per_sec_count metric should never be greater than zero. A sketch of these checks follows this list.
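A minimal sketch of these broker checks, assuming the metric values have already been collected into a dictionary keyed by the names above (attribute suffixes omitted for brevity).

```python
def broker_alerts(m: dict) -> list:
    """Evaluate the broker rules described above and return a list of alert messages."""
    alerts = []
    if m.get("server_replica_under_replicated_partitions", 0) > 0:
        alerts.append("under-replicated partitions present")
    if m.get("controller_active_controller_count", 1) != 1:
        alerts.append("cluster does not have exactly one active controller")
    if m.get("controller_offline_partitions_count", 0) > 0:
        alerts.append("offline partitions present")
    if m.get("controller_unclean_leader_elections_per_sec_count", 0) > 0:
        alerts.append("unclean leader elections occurred")
    return alerts

print(broker_alerts({"server_replica_under_replicated_partitions": 2}))
```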
Metric | Source | Dimensions | Description |
---|---|---|---|
server_replica_under_replicated_partitions | Kafka | app, broker_id, cluster_id, hostname | Number of under-replicated partitions. Each Kafka partition may be replicated in order to provide reliability guarantees. In the set of replicas for a given partition, one is chosen as the leader. The leader is always considered in sync. The remaining replicas are also considered in sync, provided they are not too far behind the leader. Synchronised replicas form the ISR (In Sync Replica) pool. If a partition lags too far behind, it is removed from the ISR pool. Producers may require a minimum number of ISRs in order to operate reliably. When the ISR pool shrinks, you will see an increase in under-replicated partitions. |
server_replica_isr_expands_per_sec_<attribute> | Kafka | app, broker_id, cluster_id, hostname | Rate at which the pool of in-sync replicas (ISRs) expands. |
server_replica_isr_shrinks_per_sec_<attribute> | Kafka | app, broker_id, cluster_id, hostname | Rate at which the pool of in-sync replicas (ISRs) shrinks. |
controller_active_controller_count | Kafka | app, broker_id, cluster_id, hostname | Number of active controllers in the cluster. |
controller_offline_partitions_count | Kafka | app, broker_id, cluster_id, hostname | Number of partitions that do not have an active leader and are hence not writable or readable. |
controller_leader_election_rate_and_time_ms | Kafka | app, broker_id, cluster_id, hostname | Rate of leader elections per second and the overall duration the cluster went without a leader. |
controller_unclean_leader_elections_per_sec_count | Kafka | app, broker_id, cluster_id, hostname | Unclean leader election rate. If a broker goes offline, some partitions will be leaderless and Kafka will elect a new leader from the ISR pool. Gateway Hub does not allow unclean elections, hence the new leader must come from the ISR pool. |
Zookeeper metrics
Kafka uses Zookeeper to store metadata about topics and brokers. It plays a critical role in ensuring Kafka’s performance and stability. If Zookeeper is not available Kafka cannot function.
Zookeeper is very sensitive to IO latency. In particular, disk latency can have severe impacts on Zookeeper, because quorum operations must be completed quickly.
You can observe Zookeeper with the following metrics:
When creating alerts, note the following:
- The zookeeper_outstanding_requests metric should be low. Confluent suggests that this value should be below 10.
- The zookeeper_avg_request_latency metric should be as low as possible (typical values should be less than 10ms), and ideally fairly constant. You should investigate if this number spikes or shows variability.
- If the zookeeper_fsync_threshold_exceed_count metric increases steadily, there may be a problem with disk latency. Ideally, this metric should be zero or static (in the case of recovery after a latency problem). A sketch of these checks follows this list.
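These rules can be expressed as simple threshold checks. A minimal sketch, using the limits suggested above (10 outstanding requests, 10 ms average latency) and assuming the fsync-exceeded count is compared against the previous sample.

```python
def zookeeper_alerts(outstanding_requests: int,
                     avg_request_latency_ms: float,
                     fsync_exceed_count: int,
                     prev_fsync_exceed_count: int) -> list:
    """Apply the Zookeeper alerting guidance described above."""
    alerts = []
    if outstanding_requests >= 10:
        alerts.append("outstanding requests at or above 10")
    if avg_request_latency_ms >= 10:
        alerts.append("average request latency at or above 10 ms")
    if fsync_exceed_count > prev_fsync_exceed_count:
        alerts.append("fsync threshold exceed count is increasing (possible disk latency)")
    return alerts

print(zookeeper_alerts(2, 3.5, 5, 5))  # [] means healthy
```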
Metric | Source | Dimensions | Description |
---|---|---|---|
zookeeper_outstanding_requests | Zookeeper | app, hostname, port | Number of requests from followers that have yet to be acknowledged. |
zookeeper_avg_request_latency | Zookeeper | app, hostname, port | Average time to respond to a client request. |
zookeeper_max_client_cnxns_per_host | Zookeeper | app, hostname, port | Maximum number of concurrent client connections allowed per host. |
zookeeper_num_alive_connections | Zookeeper | app, hostname, port | Number of connections currently open. Should be well under the configured maximum connections for safety. |
zookeeper_fsync_threshold_exceed_count | Zookeeper | app, hostname, port | Count of instances where fsync time has exceeded the warning threshold. |
etcd metrics
etcd is used by Gateway Hub as a key-value store. Disk latency is the most important etcd metric, but CPU starvation can also cause problems.
Many etcd metrics are provided as histograms composed of several gauges. Etcd histogram buckets are cumulative. See below for an example:
disk_wal_fsync_duration_seconds_bucket_0.001 = 2325
disk_wal_fsync_duration_seconds_bucket_0.002 = 4642
disk_wal_fsync_duration_seconds_bucket_0.004 = 5097
disk_wal_fsync_duration_seconds_bucket_0.008 = 5187
disk_wal_fsync_duration_seconds_bucket_0.016 = 5248
disk_wal_fsync_duration_seconds_bucket_0.032 = 5253
disk_wal_fsync_duration_seconds_bucket_0.064 = 5254
disk_wal_fsync_duration_seconds_bucket_0.128 = 5254
disk_wal_fsync_duration_seconds_bucket_0.256 = 5254
disk_wal_fsync_duration_seconds_bucket_0.512 = 5254
disk_wal_fsync_duration_seconds_bucket_1.024 = 5254
disk_wal_fsync_duration_seconds_bucket_2.048 = 5254
disk_wal_fsync_duration_seconds_bucket_4.096 = 5254
disk_wal_fsync_duration_seconds_bucket_8.192 = 5254
disk_wal_fsync_duration_seconds_sum = 7.362459756999995
disk_wal_fsync_duration_seconds_count = 5254
The value of the disk_wal_fsync_duration_seconds_bucket_<x.y> metric is the cumulative count of observations that completed within the duration given by the <x.y> suffix. In this example, the value increases up to the 0.064 bucket and then remains static, meaning every recorded fsync completed in 0.064 seconds or less.
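Because the buckets are cumulative, a latency percentile can be estimated by finding the first bucket whose cumulative count covers the required share of the total count. A minimal sketch using the sample values above; the same approach applies to the other etcd histograms discussed below.

```python
def percentile_from_buckets(buckets: dict, total: int, q: float) -> float:
    """Estimate the q-quantile (0 < q < 1) from cumulative histogram buckets.

    `buckets` maps the upper bound of each bucket (in seconds) to its cumulative count.
    Returns the upper bound of the first bucket covering the requested share.
    """
    target = q * total
    for upper_bound in sorted(buckets):
        if buckets[upper_bound] >= target:
            return upper_bound
    return float("inf")  # more than (1 - q) of the observations exceeded every bucket

buckets = {0.001: 2325, 0.002: 4642, 0.004: 5097, 0.008: 5187,
           0.016: 5248, 0.032: 5253, 0.064: 5254}
print(percentile_from_buckets(buckets, total=5254, q=0.99))  # 0.016
```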
Key metrics
You can observe etcd with the following metrics:
When creating alerts, note the following:
- The rate of server health failures should be low.
- If the server_heartbeat_send_failures_total metric is increasing, this may indicate a slow disk.
Metric | Source | Description |
---|---|---|
server_health_failures | prometheus-target | Total number of failed health checks. |
server_heartbeat_send_failures_total | prometheus-target | Total number of leader heartbeat send failures (likely overloaded from slow disk). If non-zero and increasing, this could be due to a slow disk and can be a prelude to a cluster failover. |
Etcd latency metrics
Most issues with etcd relate to slow disks. High disk latency can cause long pauses that will lead to missed heartbeats, and potentially failovers in the etcd cluster. Disk latency will also contribute to high request latency.
You can observe etcd latency with the following metrics:
When creating alerts, note the following:
- The 99th percentile of the disk_backend_commit_duration_seconds_bucket_<bucket> metric should be less than 25 ms.
- The disk_wal_fsync_duration_seconds_bucket_<bucket> metric should be fairly constant and, ideally, as low as possible.
Metric | Source | Description |
---|---|---|
disk_backend_commit_duration_seconds_bucket_<bucket> | prometheus-target | Presented as a histogram. The latency distribution of commits called by the backend. |
disk_wal_fsync_duration_seconds_bucket_<bucket> | prometheus-target | Presented as a histogram. The latency distribution of fsync calls made by the Write Ahead Log (WAL). |
PostgreSQL metrics
Gateway Hub stores metrics in a PostgreSQL database.
You can observe PostgreSQL performance with the following metrics:
When creating alerts, note the following:
- Ensure that excess CPU usage by PostgreSQL is monitored. For example, a runaway query could consume excessive CPU, which can affect the whole Gateway Hub node. Other causes of excessive CPU usage include: expensive background workers, expensive queries, high ingestion rates, misconfiguration of background worker counts, and more.
- Do not manually configure PostgreSQL after installation.
- At no time should PostgreSQL or any other process be swapping; the memory budget should ensure that overall memory allocation does not exceed the total memory.
- It is possible that PostgreSQL may run at or near to its upper memory allocation. Storage systems will buffer pages into memory for caching and other efficiency reasons, so it is normal to see high but fairly constant memory usage.
- Disk usage should not exceed 90 percent of disk_total for any volume; that is, the disk_free metric should not fall below 10 percent of disk_total.
for any volume. - PostgreSQL sets an upper limit on concurrent connections. If this limit is exceeded, database clients (including application code), may be blocked while waiting to be allocated a connection. A blocked client can result in a timeout which can significantly impact performance or throughput. During Gateway Hub installation PostgreSQL is configured with a maximum connection limit based on the available hardware.
- The likely causes of a connection limit problem are connection leaks in application code or manually accessing the PostgreSQL database. Both should be avoided.
- The active_connections metric should not exceed 80 percent of max_connections. A sketch of these checks follows this list.
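The disk and connection guidance above reduces to two ratio checks. A minimal sketch, assuming the four metric values are available for a node; the thresholds are the ones quoted above.

```python
def postgres_alerts(disk_free: float, disk_total: float,
                    active_connections: int, max_connections: int) -> list:
    """Apply the disk-usage and connection-count guidance described above."""
    alerts = []
    used_fraction = (disk_total - disk_free) / disk_total
    if used_fraction > 0.90:
        alerts.append("volume usage above 90% of disk_total")
    if active_connections > 0.80 * max_connections:
        alerts.append("active connections above 80% of max_connections")
    return alerts

# Example: 50 GiB free of 400 GiB, 150 of 300 connections in use -> no alerts.
print(postgres_alerts(50, 400, 150, 300))  # []
```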
Metric | Source | Dimensions | Description |
---|---|---|---|
processes_cpu_load | System | comm, hostname, pid | CPU usage for a specific process, identified by the pid dimension. |
memory_used_swap_percent | System | hostname | Percentage of swap memory used on the host. |
processes_virtual_size | System | hostname, process_id, process_name | Sum of all mapped memory used by a specific process, including swap. |
processes_resident_set_size | System | hostname, process_id, process_name | Sum of all physical memory used by a specific process, excluding swap. Known as the resident set. |
disk_free | System | hostname, volume | Total free space on the specified volume. |
disk_total | System | hostname, volume | Total space on the specified volume. |
active_connections | PostgreSQL | app, hostname | Number of connections currently active on the server. |
app_connections | PostgreSQL | app, hostname, username | Number of connections currently active on the server, grouped by application. |
max_connections | PostgreSQL | app, hostname | Maximum number of connections for the PostgreSQL server. |