Troubleshoot Maintenance
Overview
This guide is intended to help you troubleshoot an existing Gateway Hub instance. If you encounter problems when installing Gateway Hub, consult Troubleshoot Installation.
Common errors are often caused by a failure to meet Gateway Hub's requirements. You must ensure your environment meets all of these requirements before proceeding.
To do this, consult the following pages:
You can check many, but not all, requirements by running the hubcheck tool, and you should resolve any errors before proceeding. For more information, see Validate environment.
The main tool used to perform troubleshooting steps is hubctl. To use it correctly, you will need the original configuration file used to install Gateway Hub. For more information, see hubctl tool.
Maintenance
This section covers topics related to troubleshooting a running Gateway Hub instance. In general, troubleshooting should proceed as follows:
1. Generate and review diagnostics using the hubctl diagnostics <config_file> command.
2. Consult this document, support sites, and the release notes [link here] for information about how to resolve any known issues.
3. Raise a ticket with ITRS Client Services and provide them with a description of the problem, the business objective, and the full logs you generated in step 1.
Diagnostics
You can create a comprehensive diagnostics file that packages the Gateway Hub log files from each node in the cluster, as well as system information about the cluster and attached storage. You will need these logs to investigate any problems that occur on a running Gateway Hub system.
To obtain a diagnostic file from the command line, run:
hubctl diagnostics <config_file>
This creates a temporary file on each node and downloads all of these files to your local machine. The location of each file is printed to stdout.
Reviewing logs
When reviewing logs you should first look at the orchestrationd logs. Orchestrationd is the service that keeps the rest of the Gateway Hub services running; if one of the components has an issue it will be shown here.
Services in Gateway Hub are arranged by Run Levels, where components in each successive Run Level depend on components in previous Run Levels. The lower the Run Level, the more critical a component failure: an app can usually fail gracefully, but the failure of a low-level platform component (for example, etcd, the key-value store, which operates at Run Level 0) is critical.
Orchestrationd periodically checks each component via a liveness check, using live-probe.sh, to determine whether each component is running. During startup you will see some number of ERROR log lines as each component is brought online and its liveness check fails. This is normal and will look something like:
2020-09-29 16:35:13.823Z [OrchestrationDaemon/SystemService-HealthCheck-0] INFO com.itrsgroup.hub.orchestration.SystemService - Service 'zookeeper' is PENDING
2020-09-29 16:35:14.168Z [OrchestrationDaemon/SystemService-HealthCheck-0] ERROR com.itrsgroup.hub.orchestration.SystemService - Execution of '/app/hub/hub-2.2.0-GA/services/zookeeper-3.6.1/live-probe.sh' failed - exit code '1'
2020-09-29 16:35:16.028Z [OrchestrationDaemon/SystemService-HealthCheck-0] ERROR com.itrsgroup.hub.orchestration.SystemService - Execution of '/app/hub/hub-2.2.0-GA/services/zookeeper-3.6.1/live-probe.sh' failed - exit code '1'
2020-09-29 16:35:17.459Z [OrchestrationDaemon/SystemService-HealthCheck-0] ERROR com.itrsgroup.hub.orchestration.SystemService - Execution of '/app/hub/hub-2.2.0-GA/services/zookeeper-3.6.1/live-probe.sh' failed - exit code '1'
2020-09-29 16:35:18.845Z [OrchestrationDaemon/SystemService-HealthCheck-0] ERROR com.itrsgroup.hub.orchestration.SystemService - Execution of '/app/hub/hub-2.2.0-GA/services/zookeeper-3.6.1/live-probe.sh' failed - exit code '1'
2020-09-29 16:35:19.983Z [OrchestrationDaemon/SystemService-HealthCheck-0] ERROR com.itrsgroup.hub.orchestration.SystemService - Execution of '/app/hub/hub-2.2.0-GA/services/zookeeper-3.6.1/live-probe.sh' failed - exit code '1'
2020-09-29 16:35:21.387Z [OrchestrationDaemon/SystemService-HealthCheck-0] INFO com.itrsgroup.hub.orchestration.SystemService - Service 'zookeeper' is STARTED
In this example, we can see that Zookeeper was started, but there were 5 failed liveness checks before Zookeeper was brought online. This is expected behaviour for each component during start up, although some will take longer than others to start. If any component is unable to start after a few minutes, then Orchestrationd will bring the entire Gateway Hub down. If this happens, consult the failed component log for more specific information. You can use the modified date on each log to quickly identify those with new information.
Once Gateway Hub has reached a stable state, where all services are running, any additional errors in the orchestrationd log are unexpected and should be investigated by checking the relevant component log.
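When triaging a large orchestrationd log, it can help to count failed liveness checks per component. The following is a minimal sketch of that approach; the log excerpt is a shortened sample embedded so the script is self-contained, and in practice you would point LOG at the orchestrationd log from your diagnostics bundle.

```shell
#!/bin/sh
# Summarise failed liveness checks per component in an orchestrationd-style log.
# The embedded log is a sample; replace with the real orchestrationd log path.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2020-09-29 16:35:13.823Z [OrchestrationDaemon/SystemService-HealthCheck-0] INFO com.itrsgroup.hub.orchestration.SystemService - Service 'zookeeper' is PENDING
2020-09-29 16:35:14.168Z [OrchestrationDaemon/SystemService-HealthCheck-0] ERROR com.itrsgroup.hub.orchestration.SystemService - Execution of '/app/hub/hub-2.2.0-GA/services/zookeeper-3.6.1/live-probe.sh' failed - exit code '1'
2020-09-29 16:35:16.028Z [OrchestrationDaemon/SystemService-HealthCheck-0] ERROR com.itrsgroup.hub.orchestration.SystemService - Execution of '/app/hub/hub-2.2.0-GA/services/zookeeper-3.6.1/live-probe.sh' failed - exit code '1'
2020-09-29 16:35:21.387Z [OrchestrationDaemon/SystemService-HealthCheck-0] INFO com.itrsgroup.hub.orchestration.SystemService - Service 'zookeeper' is STARTED
EOF
# Count ERROR lines per live-probe.sh path: repeated failures after startup
# has settled point at the component log to check first.
SUMMARY=$(grep ' ERROR ' "$LOG" | sed -n "s/.*Execution of '\([^']*\)'.*/\1/p" | sort | uniq -c)
echo "$SUMMARY"
rm -f "$LOG"
```

Components with an unusually high failure count, or failures long after startup, are the ones whose own logs deserve attention.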
Startup process
To start Gateway Hub, under normal conditions, run:
hubctl start <config_file>
The following steps will occur as part of the startup process:
- thing one
- thing two
- and so on
Explanation of run levels.
You can restart Gateway Hub by running:
hubctl restart <config_file>
You can stop Gateway Hub by running:
hubctl stop <config_file>
The following steps will occur as part of an orderly shutdown process:
- thing one
- thing two
- and so on
Performance
Attempting to run Gateway Hub on insufficient hardware can cause many issues. Several components, such as etcd, rely on disk IO performance and therefore inadequate disk performance can cause serious problems. For more information, see Hardware requirements.
To identify performance issues, it is important to configure Gateway Hub self-monitoring [link here].
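As a first-pass check of disk throughput on the volume holding Gateway Hub data, a crude write test can be run before reaching for a full benchmark. This is only a sketch: the target directory is an assumption (substitute your actual data disk), and for real measurement of fsync latency, which etcd is sensitive to, a proper tool such as fio is more appropriate.

```shell
#!/bin/sh
# Crude write-throughput smoke test. Point TARGET at the data disk in practice;
# mktemp -d is used here only so the script runs anywhere.
TARGET="$(mktemp -d)/hub-io-test"
START=$(date +%s)
dd if=/dev/zero of="$TARGET" bs=1048576 count=64 2>/dev/null  # write 64 MiB
sync                                                          # flush to disk
END=$(date +%s)
SIZE=$(wc -c < "$TARGET")
echo "wrote $SIZE bytes in $((END - START))s"
rm -f "$TARGET"
```

If writing 64 MiB takes more than a few seconds, the disk is unlikely to meet Gateway Hub's IO requirements and a real benchmark is warranted.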
Resolve ingestion errors
Gateway Hub administrators should aim to have zero ingestion errors at all times. If an ingestion error occurs, you should treat resolving it as a high priority, because a sufficiently large number of ingestion errors may cause Gateway Hub to stop functioning correctly.
For more information, see Resolve an ingestion error.
Remove ingestion errors
Gateway Hub deletes old ingestion errors after seven days.
However, if you accumulate stale ingestion errors you may want to manually remove them.
To remove ingestion errors, perform the following steps on each node:
1. Navigate to Gateway Hub's built-in PostgreSQL database. By default, this is located at /opt/hub/hub-<hub_version>/services/postgres-timescale-<version>.
2. Start a SQL prompt:
   ./run-psql.sh
3. Connect to the database hub as the user postgres:
   \c hub
4. Delete the ingestion errors:
   DELETE FROM errors;
   DELETE 0
5. Exit the SQL prompt:
   exit
Note: Since you cannot perform SQL operations simultaneously on all nodes, some inconsistencies may occur.
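Because these steps must be repeated on every node, it can be convenient to capture the SQL in a file. The sketch below only generates that file; whether run-psql.sh accepts SQL on stdin is an assumption you should verify against your Gateway Hub version.

```shell
#!/bin/sh
# Write the cleanup SQL from the steps above to a file so the procedure can be
# scripted per node rather than typed at an interactive prompt.
SQL_FILE=remove_ingestion_errors.sql
cat > "$SQL_FILE" <<'EOF'
\c hub
DELETE FROM errors;
EOF
echo "wrote $SQL_FILE"
# Hypothetical usage on each node: ./run-psql.sh < remove_ingestion_errors.sql
```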
Gateway cannot publish
Error message: GatewayHubPublishing: Failed Sending: Local: Unknown error
When configuring a Gateway to publish data to Gateway Hub you may encounter this error. This occurs when a Gateway cannot find the required topics or cannot connect to any brokers.
To diagnose the cause of the error, use the kafkacat tool to test the connection to Gateway Hub and fetch a list of metadata, including topic names.
Run the kafkacat command, specifying as options each of the Additional Settings required by the Gateway Setup Editor. These options have the form -X setting.name=value, where the setting.name matches the corresponding Additional Setting omitting the kafka prefix.
You should provide the same credentials used when configuring the connection in the Gateway Setup Editor. For more information, see Connect a Gateway.
kafkacat -X security.protocol=ssl -X ssl.ca.location=<hub_CA_certificate> -b <hostname>:9092 -L
If kafkacat returns a list of topics that does not include geneos-events or geneos-metrics-v1, the Gateway will not be able to publish metrics to Gateway Hub. You should check the Gateway Hub configuration.
If kafkacat cannot connect to Gateway Hub, the Gateway will also be unable to connect. You should check the network connection.
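The topic check above can be scripted. In this sketch, METADATA holds a canned sample of a metadata listing so the script runs standalone; in practice you would populate it from the kafkacat command shown earlier (the exact listing format varies by kafkacat version).

```shell
#!/bin/sh
# Check a kafkacat metadata listing for the topics a Gateway needs.
# In practice, populate METADATA with something like:
#   METADATA=$(kafkacat -X security.protocol=ssl -X ssl.ca.location=<hub_CA_certificate> -b <hostname>:9092 -L)
METADATA='Metadata for all topics (from broker 0: hub1:9092/0):
 3 brokers:
  topic "geneos-events" with 3 partitions
  topic "geneos-metrics-v1" with 3 partitions'
MISSING=0
for topic in geneos-events geneos-metrics-v1; do
  if printf '%s\n' "$METADATA" | grep -q "\"$topic\""; then
    echo "found $topic"
  else
    echo "MISSING $topic"
    MISSING=1
  fi
done
```

A non-zero MISSING flag means the Gateway Hub configuration should be checked before investigating the Gateway side.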
Kafkacat
The kafkacat tool is an open source utility written and maintained by the author of the librdkafka library used by Geneos. This utility is shipped with Linux 64-bit Gateways to ease the testing of connections to your Kafka infrastructure. For more information about kafkacat, see the kafkacat GitHub repository.
To ensure that kafkacat uses the same Kafka and SSL libraries as the Gateway, kafkacat must be run with the following environment variable:
- LD_LIBRARY_PATH — this must point at the lib64 library supplied as part of the Gateway bundle.
Kafka message size exceeds maximum request size
Error message: org.apache.kafka.common.errors.RecordTooLargeException: The message is 6747120 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.
When running a Gateway using central configuration, setup files are validated by Gateway Hub.
Gateway Hub uses Kafka messages to distribute Gateway setup files to a dedicated daemon that validates them. If a setup file is large, it may exceed the default Kafka message size limit of 1 MB. In this case, Gateway Hub cannot validate the files and the Gateway setup files cannot be saved.
To resolve this issue, you must increase the maximum Kafka message size for Gateway setup validation.
However, the following additional memory limitations still apply:
- HTTP server request size (
8 MiB
) - etcd message size (
2 MiB
) - gRPC message size (
4 MiB
)
If any of these are exceeded, Gateway setup files cannot be saved.
Caution: Kafka messages over 8 MiB in size will also breach the maximum HTTP server request size of the API Daemon.
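Before raising the Kafka limit, it can be useful to compare a setup file's actual size against each of the limits above. The following sketch does this with a synthetic 3 MiB file standing in for a real Gateway setup file; substitute your own path for SETUP_FILE.

```shell
#!/bin/sh
# Compare a setup file's size against the message-size limits listed above.
# The 3 MiB file created here is synthetic -- point SETUP_FILE at a real
# Gateway setup file in practice.
SETUP_FILE=$(mktemp)
dd if=/dev/zero of="$SETUP_FILE" bs=1024 count=3072 2>/dev/null  # 3 MiB sample
SIZE=$(wc -c < "$SETUP_FILE")
check() {
  if [ "$SIZE" -gt "$2" ]; then
    echo "$1: exceeded ($SIZE > $2 bytes)"
  else
    echo "$1: ok"
  fi
}
check "Kafka default message size (1 MB)" 1048576
check "etcd message size (2 MiB)"         2097152
check "gRPC message size (4 MiB)"         4194304
check "HTTP server request size (8 MiB)"  8388608
rm -f "$SETUP_FILE"
```

For a 3 MiB file, only the Kafka default is exceeded, so increasing max.request.size as described below would be sufficient; a file over 8 MiB could not be saved at all.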
Diagnose Kafka message size errors
The Kafka message size may be too small if both of the following behaviours are occurring:
- Gateway validation in the Gateway Setup Editor is unable to complete.
- The following appears in the API Daemon logs:
2020-09-28 11:47:04.479Z ERROR [default-akka.actor.default-dispatcher-3] [ValidationQueryRequesterImpl] - Unhandled Kafka Exception publishing to topic 'unknown'
org.apache.kafka.common.errors.RecordTooLargeException: The message is 6747120 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.
2020-09-28 11:47:04.479Z ERROR [default-akka.actor.default-dispatcher-3] [ValidationQueryRequesterImpl] - Failed to publish validation query
org.apache.kafka.common.errors.RecordTooLargeException: The message is 6747120 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.
Increase message size
To change the maximum Kafka message size for Gateway setup validation, perform the following steps:
- Open the API Daemon's Kafka configuration in your default text editor:
  ./hubctl/hubctl config edit -n apid -c apid.yaml installation-descriptor.yml
- In the kafkaProducer > properties section, specify the max.request.size in bytes. For example:
  # Kafka producer for publishing to the Hub Kafka cluster
  # https://kafka.apache.org/documentation/#producerconfigs
  kafkaProducer:
    properties:
      max.request.size: 5242880
      bootstrap.servers: localhost:9092
      acks: all
      key.serializer: org.apache.kafka.common.serialization.StringSerializer
      value.serializer: org.apache.kafka.common.serialization.ByteArraySerializer
- Open the Gateway Configuration Daemon's Kafka configuration in your default text editor:
  ./hubctl/hubctl config edit -n gateway-configd -c gateway-configd.yaml installation-descriptor.yml
- In the kafkaProducer > properties section, specify the max.request.size in bytes, and in the kafkaConsumer > properties section, specify the fetch.max.bytes in bytes. The configuration file should look similar to:
  # Kafka producer
  # https://kafka.apache.org/documentation/#producerconfigs
  kafkaProducer:
    properties:
      max.request.size: 5242880
      bootstrap.servers: localhost:9092
      security.protocol: SSL
  # Kafka consumer
  # https://kafka.apache.org/documentation/#consumerconfigs
  kafkaConsumer:
    properties:
      fetch.max.bytes: 5242880
      bootstrap.servers: localhost:9092
      security.protocol: SSL
- Open the Kafka server configuration in your default text editor:
  ./hubctl/hubctl config edit -n kafka -c server.properties installation-descriptor.yml
- In the Replication Settings section, specify the replica.fetch.max.bytes in bytes. The configuration file should look similar to:
  #### Replication Settings ####
  min.insync.replicas=3
  replica.fetch.max.bytes=5242880
- Update the Kafka topic configuration. To do this, run the following on a Gateway Hub node:
  <hub_root>/hub-current/services/kafka-2.12-2.5.0/kafka_2.12-2.5.0/bin/kafka-configs.sh --zookeeper localhost:5181 --entity-type topics --entity-name hub-gateways-validations-requests --alter --add-config max.message.bytes=5242880
  <hub_root>/hub-current/services/kafka-2.12-2.5.0/kafka_2.12-2.5.0/bin/kafka-configs.sh --zookeeper localhost:5181 --entity-type topics --entity-name hub-gateways-validations-queries --alter --add-config max.message.bytes=5242880