Gateway Hub

Gateway Hub self monitoring is now a required part of a Gateway Hub system. You must configure self monitoring to receive full support.

Overview

You can monitor Gateway Hub via Gateway by using the metrics collected from the internal Netprobes running on each Gateway Hub node.

Configuring Gateway Hub self monitoring allows you to:

You can also use the Gateway Hub data Gateway plugin to monitor the status of the connection between Gateway and Gateway Hub. For more information, see Gateway Hub data in Gateway Plug-Ins.

Prerequisites

The following requirements must be met prior to the installation and setup of this integration:

You must have a valid Gateway license to use this feature outside of demo mode. Any licensing errors are reported by the Health plugin.

Gateway Hub self monitoring

Caution

Gateway Hub self monitoring relies on the new simplified mappings for Dynamic Entities, which is a pilot feature. This feature is subject to change.

Each Gateway Hub node runs an internal Netprobe and Collection Agent to collect metrics on its own performance. You can connect to the internal Netprobe to monitor Gateway Hub from your Active Console. You must connect to each node individually.

Gateway Hub datapoints must be processed by configuring Dynamic Entities. To simplify this process, you can download an include file that provides this configuration. The Gateway Hub self monitoring include file uses the new simplified Dynamic Entities configuration options; this is a pilot feature and is subject to change.

Note

In Gateway Hub version 2.4.0, the Collection Agent plugin used to collect self-monitoring metrics was updated from linux-infra to system. This requires an updated include file with the correct plugin name and other changes. Ensure you have the correct include file for the version of Gateway Hub you are using.

Configure default self monitoring

To enable Gateway Hub self monitoring in Active Console:

  1. Download the Gateway Hub Integration from Downloads. This contains the geneos-integration-gateway-hub-<version>.xml include file; save this file to a location accessible to your Gateway.
  2. Open the Gateway Setup Editor.
  3. Right-click the Includes top level section, then select New Include.
  4. Set the following options:
    • Priority — Any value above 1.
    • Location — Specify the path to the location of the geneos-integration-gateway-hub-<version>.xml include file.
  5. Right-click the include file in the State Tree and select Expand all.
  6. Select Click to load. The new include file will load.
  7. Right-click the Probes top-level section, then select New Probe.
  8. Set the following options in the Basic tab:
    • Name — Specify the name that will appear in Active Console, for example Gateway_Hub_Self_Monitoring.
    • Hostname — Specify the hostname of your Gateway Hub node.
    • Port — 7036.
    • Secure — Enabled.
  9. Set the following options in the Dynamic Entities tab:
  10. Repeat steps 7 to 9 for each Gateway Hub node you wish to monitor.
  11. Click Save current document.
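
Before relying on the new probes, you may want to confirm that the Gateway host can reach the internal Netprobe on each Gateway Hub node using the port and TLS settings from step 8. The following Python sketch performs a basic connectivity check; the hostnames are hypothetical placeholders and the check does not validate certificates.

import socket
import ssl

HUB_NODES = ["hub-node-1.example.com", "hub-node-2.example.com"]  # hypothetical placeholders
PORT = 7036

context = ssl.create_default_context()
context.check_hostname = False       # connectivity check only
context.verify_mode = ssl.CERT_NONE  # do not reuse this context where authentication matters

for node in HUB_NODES:
    try:
        with socket.create_connection((node, PORT), timeout=5) as sock:
            with context.wrap_socket(sock, server_hostname=node):
                print(f"{node}:{PORT} reachable over TLS")
    except OSError as err:
        print(f"{node}:{PORT} NOT reachable: {err}")

If a node is not reachable, check firewalls and confirm that the internal Netprobe is running on that node before troubleshooting the Gateway configuration.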

In your Active Console, Dynamic Entities are created for metrics from Gateway Hub self monitoring.

Dataviews are automatically populated from the available metrics. In general, an entity is created for each Gateway Hub component. Components that use Java also include metrics on JVM performance.

Depending on how you have configured your Active Console, you may see repeated entities in the State Tree. This is because there are multiple Gateway Hub nodes, each running the same components. To organise your State Tree by node, perform the following steps:

  1. Open your Active Console.
  2. Navigate to Tools > Settings > General.
  3. Set the Viewpath option to hostname.
  4. Click Apply.

Configure log file monitoring

You can additionally monitor Gateway Hub log files using the FKM plugin. To do this you must configure a new Managed Entity using the hub-logs sampler provided as part of the geneos-integration-gateway-hub-<version>.xml include file.

To enable Gateway Hub log monitoring:

  1. Open the Gateway Setup Editor.
  2. Right-click the Managed Entities top level section, then select New Managed Entity.
  3. Set the following options:
    • Name — Specify the name that will appear in Active Console, for example Hub_Log_Monitoring.
    • Options > Probe — Specify the name of the internal Gateway Hub probe.
    • Sampler — hub-logs.
  4. Repeat steps 2 to 3 for each Gateway Hub node you wish to monitor.
  5. Click Save current document.

In your Active Console an additional Managed Entity, with the name you specified, will show the status of Gateway Hub’s log files on that node.

If you have configured Gateway Hub to store log files in a directory other than the default, you must direct the sampler to your logs directory. To do this, specify the hub-logs-dir variable from the Advanced tab of your Managed Entity. For more information about setting variables, see managedEntities > managedEntity > var in Managed Entities and Managed Entity Groups.

Important metrics

If you have configured your Gateway to receive Gateway Hub self monitoring metrics, you may want to set up alerts for changes in the most important metrics. This section outlines the key metrics for the major Gateway Hub components and provides advice to help create meaningful alerts.

JVM memory

You can observe JVM memory with the following metrics:

Observing high heap usage, where jvm_memory_pool_heap_used is nearing jvm_memory_heap_max, can be an indicator that the heap memory allocation may not be sufficient.

Note

While there might be a temptation to increase the memory allocation, the Gateway Hub installer calculates the ideal Gateway Hub memory settings based on the size of the machine being used. It is important not to over-allocate memory to any Gateway Hub component, as this may result in an over-commitment that can produce unexpected behaviours, including failures or swapping.
Metric Source Description
jvm_memory_heap_committed StatsD

Amount of memory (in bytes) that is guaranteed to be available for use by the JVM.

The amount of committed memory may change over time (increase or decrease). The value of jvm_memory_heap_committed may be less than jvm_memory_heap_max but will always be greater than jvm_memory_pool_heap_used.

jvm_memory_heap_max StatsD

Maximum amount of memory (in bytes) that can be used for memory management. Its value may be undefined.

The maximum amount of memory may change over time if defined. The value of jvm_memory_pool_heap_used and jvm_memory_heap_committed will always be less than or equal to jvm_memory_heap_max if defined.

Memory allocation may fail if it attempts to increase the used memory such that the used memory is greater than the committed memory, even if the used memory would still be less than or equal to the maximum memory (for example, when the system is low on virtual memory).

jvm_memory_pool_heap_used StatsD Amount of memory currently used (in bytes) by the JVM.
jvm_memory_gc_collection_count StatsD Number of garbage collections that have occurred in the JVM life cycle.
jvm_memory_gc_collection_time StatsD Time spent in garbage collection.
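
As a rough illustration of the high heap usage check described above, the following Python sketch computes heap usage as a percentage of the maximum heap. The sample values and the 85% threshold are hypothetical, not recommended settings.

def heap_usage_percent(used_bytes, max_bytes):
    """Heap usage as a percentage of the maximum heap; 0 if the maximum is undefined."""
    if max_bytes <= 0:  # jvm_memory_heap_max may be undefined
        return 0.0
    return 100.0 * used_bytes / max_bytes

# Hypothetical samples taken from the StatsD dataviews.
jvm_memory_pool_heap_used = 1_650_000_000
jvm_memory_heap_max = 2_000_000_000

usage = heap_usage_percent(jvm_memory_pool_heap_used, jvm_memory_heap_max)
if usage > 85.0:  # example threshold only; tune to your environment
    print(f"WARNING: heap usage at {usage:.1f}% of maximum")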

JVM garbage collection

You can observe JVM garbage collection with the following metrics:

When creating alerts, note the following:

Metric Source Description
jvm_memory_gc_collection_count StatsD Total number of collections that have occurred.
jvm_memory_gc_collection_time StatsD Approximate accumulated collection elapsed time in milliseconds.
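
Because jvm_memory_gc_collection_time is an accumulated counter, alerts are usually more meaningful on its rate of change than on its absolute value. The following Python sketch derives GC overhead for a sampling interval from two consecutive samples; the values and the 60-second interval are hypothetical.

def gc_overhead_percent(prev_time_ms, curr_time_ms, interval_seconds):
    """Percentage of wall-clock time spent in garbage collection over the interval."""
    delta_ms = max(curr_time_ms - prev_time_ms, 0.0)
    return 100.0 * delta_ms / (interval_seconds * 1000.0)

# Two consecutive jvm_memory_gc_collection_time samples taken 60 seconds apart.
overhead = gc_overhead_percent(prev_time_ms=48_200, curr_time_ms=49_400, interval_seconds=60)
print(f"GC overhead over the last interval: {overhead:.1f}%")  # prints 2.0%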

Kafka consumer metrics

Several Gateway Hub daemons contain Kafka consumers that consume messages from Kafka topic partitions and process them.

Key consumer metrics

You can observe Kafka consumers with the following metrics:

When creating alerts, note the following:

Metric Source Dimensions Description
kafka_consumer_bytes_consumed_rate StatsD app, client-id, hostname, topic

Average bytes consumed per topic, per second.

Example: kafka_consumer_bytes_consumed_rate Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_bytes_consumed_rate StatsD app, client-id, hostname

Average bytes consumed per second.

Example: kafka_consumer_bytes_consumed_rate Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_bytes_consumed_total StatsD app, client-id, hostname

Total bytes consumed.

Example: kafka_consumer_bytes_consumed_total Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_bytes_consumed_total StatsD app, client-id, hostname, topic

Total bytes consumed by topic.

Example: kafka_consumer_bytes_consumed_total Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_records_consumed_rate StatsD app, client-id, hostname, topic

The average number of records consumed per second.

Example: kafka_consumer_records_consumed_rate Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_records_consumed_rate StatsD app, client-id, hostname

The average number of records consumed per second.

Example: kafka_consumer_records_consumed_rate Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_records_consumed_total StatsD app, client-id, hostname, topic

Total records consumed.

Example: kafka_consumer_records_consumed_total Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, client-id=consumer-1

Consumer lag

If a Kafka topic fills faster than it is consumed, the consumer falls behind; this is known as “lag”. High lag means that your system is not keeping up with messages; near-zero lag means that it is keeping up. In operational terms, high lag means that there may be significant latency between a message being ingested into Gateway Hub and that message being reflected to a user (via a query or some other means). You should try to ensure that lag is close to zero; increasing lag is a problem.

You can observe lag with the following metrics:

These Kafka consumer metrics work at the consumer level and not at the consumer group level. To get a complete picture you should watch the equivalent metrics across all nodes in the cluster.

When creating alerts, note the following:

Metric Source Dimensions Description
kafka_consumer_records_lag StatsD app, client-id, hostname, partition, topic

Number of messages the consumer is behind the producer on this partition.

Example: kafka_consumer_records_lag Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, partition=1, client-id=consumer-1

kafka_consumer_records_lag_avg StatsD app, client-id, hostname, partition, topic

Average number of messages the consumer is behind the producer on this partition.

Example: kafka_consumer_records_lag_avg Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, partition=1, client-id=consumer-1

kafka_consumer_records_lag_max StatsD app, client-id, hostname, partition, topic

Max number of messages the consumer is behind the producer on this partition.

Example: kafka_consumer_records_lag_max Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, partition=1, client-id=consumer-1
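
Because lag fluctuates naturally, a useful alert usually looks for sustained growth rather than a single high sample. The following Python sketch flags a partition whose kafka_consumer_records_lag has increased between every recent sample; the sample history and evaluation logic are hypothetical illustrations only.

def lag_is_growing(samples):
    """True if lag increased between every consecutive pair of samples."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))

# Hypothetical per-minute kafka_consumer_records_lag samples for one partition.
recent_lag = [120, 340, 610, 980]
if lag_is_growing(recent_lag):
    print("WARNING: consumer lag is steadily increasing on this partition")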

Fetch rate

You can observe fetch rates with the following metrics:

When creating alerts, note the following:

Metric Source Dimensions Description
kafka_consumer_fetch_rate StatsD app, client-id, hostname

Number of fetch requests per second.

Example: kafka_consumer_fetch_rate Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_fetch_size_avg StatsD app, client-id, hostname

Average number of bytes fetched per request.

Example: kafka_consumer_fetch_size_avg Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_fetch_size_avg StatsD app, client-id, hostname, topic

Average number of bytes fetched per request.

Example: kafka_consumer_fetch_size_avg Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_fetch_size_max StatsD app, client-id, hostname, topic

Max number of bytes fetched per request.

Example: kafka_consumer_fetch_size_max Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1

Kafka producer metrics

Kafka producers publish records to Kafka topics. In a reliable system, producers must be sure that their messages have been received (unless explicitly configured not to care). To do this, producers receive acknowledgements from brokers. In some configurations, a producer does not require acknowledgements from all brokers; it merely needs to receive a minimum number (to achieve a quorum). In other configurations, it may need acknowledgements from all brokers. In either case, the act of receiving acknowledgements is latency-sensitive and affects how fast a producer can push messages.

When configuring Kafka producers, consider the following:

You can observe Kafka producer behaviour with the following metrics:

When creating alerts, note the following:

Metric Source Dimensions Description
kafka_producer_batch_size_avg StatsD app, client-id, hostname

Average number of bytes sent per partition, per request.

Example: kafka_producer_batch_size_avg Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_compression_rate_avg StatsD app, client-id, hostname

Average compression rate of record batches.

Example: kafka_producer_compression_rate_avg Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_node_response_rate StatsD app, client-id, hostname, node-id

Average number of responses received per second from the broker.

Example: kafka_producer_node_response_rate Dimensions = app=snapshotd, node-id=node--1, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_request_rate StatsD app, client-id, hostname

Average number of requests sent per second.

Example: kafka_producer_request_rate Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_node_request_latency_avg StatsD app, client-id, hostname, node-id

Average request latency in milliseconds for a node.

Example: kafka_producer_node_request_latency_avg Dimensions = app=snapshotd, node-id=node--1, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_io_wait_time_ns_avg StatsD app, client-id, hostname

Average length of time, in nanoseconds, that the I/O thread spends waiting for a socket to be ready for reads or writes.

Example: kafka_producer_io_wait_time_ns_avg Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_outgoing_byte_rate StatsD app, client-id, hostname

Average number of bytes sent per second to the broker.

Example: kafka_producer_outgoing_byte_rate Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=producer-1
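
When reviewing producer metrics, remember that kafka_producer_io_wait_time_ns_avg is reported in nanoseconds while kafka_producer_node_request_latency_avg is in milliseconds. The following Python sketch normalises the units and applies example thresholds; the sample values and thresholds are hypothetical.

NS_PER_MS = 1_000_000

kafka_producer_node_request_latency_avg = 35.0      # milliseconds (hypothetical sample)
kafka_producer_io_wait_time_ns_avg = 220_000_000.0  # nanoseconds (hypothetical sample)

io_wait_ms = kafka_producer_io_wait_time_ns_avg / NS_PER_MS
if kafka_producer_node_request_latency_avg > 100.0 or io_wait_ms > 500.0:  # example thresholds only
    print(f"WARNING: slow producer path (request latency "
          f"{kafka_producer_node_request_latency_avg:.0f} ms, I/O wait {io_wait_ms:.0f} ms)")
else:
    print(f"Producer healthy: request latency "
          f"{kafka_producer_node_request_latency_avg:.0f} ms, I/O wait {io_wait_ms:.0f} ms")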

Kafka broker metrics

All Kafka messages pass through a broker, so if the broker is encountering problems this can have a wider impact on performance and reliability.

Note that Kafka broker metrics are collected via the Kafka plugin, which uses JMX. The consumer and producer metrics listed above are gathered by the StatsD plugin for each specific process.

You can observe Kafka broker behaviour with the following metrics:

When creating alerts, note the following:

Metric Source Dimensions Description
server_replica_under_replicated_partitions Kafka app, broker_id, cluster_id, hostname

Number of under-replicated partitions.

Each Kafka partition may be replicated in order to provide reliability guarantees. In the set of replicas for a given partition, one is chosen as the leader. The leader is always considered in sync. The remaining replicas will also be considered in sync, providing they are not too far behind the leader. Synchronised replicas form the ISR (In Sync Replica) pool. If a replica lags too far behind, it is removed from the ISR pool. Producers may require a minimum number of ISRs in order to operate reliably. When the ISR pool shrinks, you will typically see an increase in under-replicated partitions.

Example: server_replica_under_replicated_partitions Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka

server_replica_isr_expands_per_sec_<attribute> Kafka app, broker_id, cluster_id, hostname

Rate at which the pool of in-sync replicas (ISRs) expands.

Example: server_replica_isr_expands_per_sec_count Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka

server_replica_isr_shrinks_per_sec_<attribute> Kafka app, broker_id, cluster_id, hostname

Rate at which the pool of in-sync replicas (ISRs) shrinks.

Example: server_replica_isr_shrinks_per_sec_count Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka

controller_active_controller_count Kafka app, broker_id, cluster_id, hostname

Number of active controllers in the cluster.

Example: controller_active_controller_count Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka

controller_offline_partitions_count Kafka app, broker_id, cluster_id, hostname

Number of partitions that do not have an active leader and are hence not writable or readable.

Example: controller_offline_partitions_count Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka

controller_leader_election_rate_and_time_ms Kafka app, broker_id, cluster_id, hostname Rate of leader elections per second and the overall duration the cluster went without a leader.
controller_unclean_leader_elections_per_sec_count Kafka app, broker_id, cluster_id, hostname

Unclean leader election rate.

If a broker goes offline, some partitions will be leaderless and Kafka will elect a new leader for each of them. Gateway Hub does not allow unclean elections, so the new leader must come from the ISR pool.

Example: controller_unclean_leader_elections_per_sec_count Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka

Zookeeper metrics

Kafka uses Zookeeper to store metadata about topics and brokers. It plays a critical role in ensuring Kafka’s performance and stability. If Zookeeper is not available Kafka cannot function.

Zookeeper is very sensitive to I/O latency. In particular, disk latency can have a severe impact on Zookeeper because quorum operations must be completed quickly.
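
If you suspect that disk latency is affecting Zookeeper, one quick sanity check is to time a small synchronous write on the volume Zookeeper uses. The following Python sketch is a rough, illustrative test only; the data directory path is a hypothetical placeholder.

import os
import tempfile
import time

def fsync_latency_ms(directory, payload=b"x" * 4096):
    """Write a small payload to a temporary file in the directory, fsync it, and return the elapsed time in milliseconds."""
    with tempfile.NamedTemporaryFile(dir=directory) as tmp:
        start = time.perf_counter()
        tmp.write(payload)
        tmp.flush()
        os.fsync(tmp.fileno())
        return (time.perf_counter() - start) * 1000.0

# Hypothetical placeholder; point this at the volume Zookeeper writes to.
print(f"fsync latency: {fsync_latency_ms('/var/lib/zookeeper'):.2f} ms")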

You can observe Zookeeper with the following metrics:

When creating alerts, note the following:

Metric Source Dimensions Description
zookeeper_outstanding_requests Zookeeper app, hostname, port

Number of requests from followers that have yet to be acknowledged.

Example: zookeeper_outstanding_requests Dimensions = hostname=hub-vm.internal, port=5181, app=zookeeper

zookeeper_avg_request_latency Zookeeper app, hostname, port

Average time to respond to a client request.

Example: zookeeper_avg_request_latency Dimensions = hostname=hub-vm.internal, port=5181, app=zookeeper

zookeeper_max_client_cnxns_per_host Zookeeper app, hostname, port

Maximum number of concurrent connections that a single client, identified by IP address, may make to a member of the Zookeeper ensemble.

Example: zookeeper_max_client_cnxns_per_host Dimensions = hostname=hub-vm.internal, port=5181, app=zookeeper

zookeeper_num_alive_connections Zookeeper app, hostname, port

Number of connections currently open. Should be well under the configured maximum connections for safety.

Example: zookeeper_num_alive_connections Dimensions = hostname=hub-vm.internal, port=5181, app=zookeeper

zookeeper_fsync_threshold_exceed_count Zookeeper app, hostname, port

Number of times the fsync time has exceeded the warning threshold.

Example: zookeeper_fsync_threshold_exceed_count Dimensions = hostname=hub-vm.internal, port=5181, app=zookeeper

etcd metrics

etcd is used by Gateway Hub as a key-value store. Disk latency is the most important etcd metric, but CPU starvation can also cause problems.

Many etcd metrics are provided as histograms composed of several gauges. Etcd histogram buckets are cumulative. See below for an example:

disk_wal_fsync_duration_seconds_bucket_0.001 = 2325
disk_wal_fsync_duration_seconds_bucket_0.002 = 4642
disk_wal_fsync_duration_seconds_bucket_0.004 = 5097
disk_wal_fsync_duration_seconds_bucket_0.008 = 5187
disk_wal_fsync_duration_seconds_bucket_0.016 = 5248
disk_wal_fsync_duration_seconds_bucket_0.032 = 5253
disk_wal_fsync_duration_seconds_bucket_0.064 = 5254
disk_wal_fsync_duration_seconds_bucket_0.128 = 5254
disk_wal_fsync_duration_seconds_bucket_0.256 = 5254
disk_wal_fsync_duration_seconds_bucket_0.512 = 5254
disk_wal_fsync_duration_seconds_bucket_1.024 = 5254
disk_wal_fsync_duration_seconds_bucket_2.048 = 5254
disk_wal_fsync_duration_seconds_bucket_4.096 = 5254
disk_wal_fsync_duration_seconds_bucket_8.192 = 5254
disk_wal_fsync_duration_seconds_sum = 7.362459756999995
disk_wal_fsync_duration_seconds_count = 5254

The value of the disk_wal_fsync_duration_seconds_bucket_<x.y> metric is the cumulative count of all observations up to and including the bucket specified by the <x.y> suffix. In this example, the value increases with each successive bucket up to 0.064 seconds and then remains constant, meaning no fsync took longer than 0.064 seconds.
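
Because the buckets are cumulative, you can recover per-bucket counts by subtracting adjacent values, and estimate what fraction of fsyncs completed within a target latency. The following Python sketch uses the example values shown above and is illustrative only.

# Upper bound in seconds -> cumulative count, from the example above.
buckets = {
    0.001: 2325, 0.002: 4642, 0.004: 5097, 0.008: 5187,
    0.016: 5248, 0.032: 5253, 0.064: 5254,
}
total = 5254  # disk_wal_fsync_duration_seconds_count

previous = 0
for upper_bound, cumulative in sorted(buckets.items()):
    per_bucket = cumulative - previous  # observations that fell in this bucket alone
    print(f"<= {upper_bound}s: {per_bucket} fsyncs")
    previous = cumulative

# Fraction of fsyncs that completed within 8 ms.
print(f"within 8 ms: {100.0 * buckets[0.008] / total:.1f}%")  # about 98.7%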

Key metrics

You can observe etcd with the following metrics:

When creating alerts, note the following:

Metric Source Description
server_health_failures prometheus-target Total number of failed health checks.
server_heartbeat_send_failures_total prometheus-target Total number of leader heartbeat send failures. If non-zero and increasing, this is likely due to a slow disk and can be a prelude to a cluster failover.

Etcd latency metrics

Most issues with etcd relate to slow disks. High disk latency can cause long pauses that lead to missed heartbeats, and potentially failovers in the etcd cluster. Disk latency also contributes to high request latency.

You can observe etcd latency with the following metrics:

When creating alerts, note the following:

Metric Source Description
disk_backend_commit_duration_seconds_bucket_<bucket> prometheus-target Presented as a histogram. The latency distribution of commit operations called by the backend.
disk_wal_fsync_duration_seconds_bucket_<bucket> prometheus-target Presented as a histogram. The latency distribution of fsync operations called by the Write Ahead Log (WAL).

PostgreSQL metrics

Gateway Hub stores metrics in a PostgreSQL database.

You can observe PostgreSQL performance with the following metrics:

When creating alerts, note the following:

Metric Source Dimensions Description
processes_cpu_load System comm, hostname, pid CPU usage for a specific process identified by the pid dimension.
memory_used_swap_percent System hostname Percentage of swap memory in use on the host.
processes_virtual_size System hostname, process_id, process_name Sum of all mapped memory used by a specific process, including swap.
processes_resident_set_size System hostname, process_id, process_name Sum of all physical memory used by a specific process, excluding swap. Known as the resident set.
disk_free System hostname, volume Total free space on the specified volume.
disk_total System hostname, volume Total space on the specified volume.
active_connections PostgreSQL app, hostname Number of connections currently active on the server.
app_connections PostgreSQL app, hostname, username Number of connections currently active on the server, grouped by application.
max_connections PostgreSQL app, hostname Maximum number of connections for the PostgreSQL server.
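
A common use of these metrics is to alert when the volume holding PostgreSQL data is running low on space, which you can derive from disk_free and disk_total. The following Python sketch is illustrative; the sample values and the 15% threshold are hypothetical.

def free_space_percent(disk_free, disk_total):
    """Free space on the volume as a percentage of its total size."""
    return 100.0 * disk_free / disk_total if disk_total else 0.0

# Hypothetical samples for the volume holding the PostgreSQL data directory.
free = free_space_percent(disk_free=42_000_000_000, disk_total=500_000_000_000)
if free < 15.0:  # example threshold only
    print(f"WARNING: only {free:.1f}% free space remaining on the data volume")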
["Geneos"] ["Geneos > Netprobe"] ["User Guide"]

Was this topic helpful?