
Troubleshoot Maintenance

Overview

This guide is intended to help you troubleshoot an existing Gateway Hub instance. If you encounter problems when installing Gateway Hub, consult Troubleshoot Installation.

Common errors are often caused by a failure to meet Gateway Hub's requirements. Before proceeding, consult the relevant requirements pages and ensure your environment meets all of them.

You can check many, but not all, of these requirements by running the hubcheck tool; you should resolve any errors it reports. For more information, see Validate environment.

The main tool used to perform troubleshooting steps is hubctl. To use it correctly, you will need the original configuration file used to install Gateway Hub. For more information, see hubctl tool.

Maintenance

This section covers topics related to troubleshooting a running Gateway Hub instance. In general, troubleshooting should proceed as follows:

  1. Generate and review diagnostics using the hubctl diagnostics <config_file> command.

  2. Consult this document, the support sites, and the release notes [link here] for information about how to resolve any known issues.

  3. Raise a ticket with ITRS Client Services and provide them with a description of the problem, the business objective and the full logs you generated in step 1.

Diagnostics

You can create a comprehensive diagnostics file that packages the Gateway Hub log files from each node in the cluster, as well as system information about the cluster and attached storage. You will need these logs to investigate any problems that occur on a running Gateway Hub system.

To obtain a diagnostic file from the command line, run:

hubctl diagnostics <config_file>

This creates a temporary file on each node and downloads all of these files to your local machine. The location of the downloaded file is printed to stdout.

Reviewing logs

When reviewing logs you should first look at the orchestrationd logs. Orchestrationd is the service that keeps the rest of the Gateway Hub services running; if one of the components has an issue it will be shown here.

Services in Gateway Hub are arranged by Run Levels, where components in each successive Run Level depend on components in previous Run Levels. The lower the Run Level, the more critical a failure is: an app can likely fail gracefully, but the failure of a low-level platform component (for example, etcd or the key-value store, which operate at Run Level 0) is critical.

Orchestrationd periodically checks each component via a liveness check, using live-probe.sh, to determine whether the component is running. During startup you will see a number of ERROR log lines as each component is brought online and its liveness check initially fails. This is normal and will look something like:

2020-09-29 16:35:13.823Z [OrchestrationDaemon/SystemService-HealthCheck-0] INFO  com.itrsgroup.hub.orchestration.SystemService - Service 'zookeeper' is PENDING
2020-09-29 16:35:14.168Z [OrchestrationDaemon/SystemService-HealthCheck-0] ERROR com.itrsgroup.hub.orchestration.SystemService - Execution of '/app/hub/hub-2.2.0-GA/services/zookeeper-3.6.1/live-probe.sh' failed - exit code '1'
2020-09-29 16:35:16.028Z [OrchestrationDaemon/SystemService-HealthCheck-0] ERROR com.itrsgroup.hub.orchestration.SystemService - Execution of '/app/hub/hub-2.2.0-GA/services/zookeeper-3.6.1/live-probe.sh' failed - exit code '1'
2020-09-29 16:35:17.459Z [OrchestrationDaemon/SystemService-HealthCheck-0] ERROR com.itrsgroup.hub.orchestration.SystemService - Execution of '/app/hub/hub-2.2.0-GA/services/zookeeper-3.6.1/live-probe.sh' failed - exit code '1'
2020-09-29 16:35:18.845Z [OrchestrationDaemon/SystemService-HealthCheck-0] ERROR com.itrsgroup.hub.orchestration.SystemService - Execution of '/app/hub/hub-2.2.0-GA/services/zookeeper-3.6.1/live-probe.sh' failed - exit code '1'
2020-09-29 16:35:19.983Z [OrchestrationDaemon/SystemService-HealthCheck-0] ERROR com.itrsgroup.hub.orchestration.SystemService - Execution of '/app/hub/hub-2.2.0-GA/services/zookeeper-3.6.1/live-probe.sh' failed - exit code '1'
2020-09-29 16:35:21.387Z [OrchestrationDaemon/SystemService-HealthCheck-0] INFO  com.itrsgroup.hub.orchestration.SystemService - Service 'zookeeper' is STARTED

In this example, we can see that Zookeeper was started, but there were 5 failed liveness checks before it was brought online. This is expected behaviour for each component during startup, although some components take longer than others to start. If any component is unable to start after a few minutes, Orchestrationd will bring the entire Gateway Hub down. If this happens, consult the failed component's log for more specific information. You can use the modified date on each log to quickly identify those with new information.

Once Gateway Hub has reached a stable state, where all services are running, any additional errors in the orchestrationd log are unexpected and should be investigated by checking the relevant component log.
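As a quick first pass over the diagnostics, you can count the ERROR lines in the orchestrationd log and list logs by modification time. This is a minimal sketch; the file name orchestrationd.log and the log excerpt are illustrative, and real log names and locations depend on your installation directory:

```shell
# Illustrative orchestrationd log excerpt (real paths and messages vary).
cat > orchestrationd.log <<'EOF'
2020-09-29 16:35:13.823Z INFO  Service 'zookeeper' is PENDING
2020-09-29 16:35:14.168Z ERROR Execution of 'live-probe.sh' failed - exit code '1'
2020-09-29 16:35:21.387Z INFO  Service 'zookeeper' is STARTED
EOF

# Count ERROR lines; some are expected during startup, but errors after
# all services report STARTED should be investigated.
grep -c 'ERROR' orchestrationd.log   # prints: 1

# List logs newest-first to find those with new information.
ls -lt *.log
```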

Startup process

To start Gateway Hub under normal conditions, run:

hubctl start <config_file>

The following steps will occur as part of the startup process:

  1. thing one

  2. thing two

  3. and so on

Explanation of run levels.

You can restart Gateway Hub by running:

hubctl restart <config_file>

You can stop Gateway Hub by running:

hubctl stop <config_file>

The following steps will occur as part of an orderly shutdown process:

  1. thing one

  2. thing two

  3. and so on

Disk space

Licensing

SSO

Performance

Attempting to run Gateway Hub on insufficient hardware can cause many issues. Several components, such as etcd, rely on disk I/O performance, so inadequate disk performance can cause serious problems. For more information, see Hardware requirements.

To identify performance issues, it is important to configure Gateway Hub self-monitoring [link here].

Resolve ingestion errors

Gateway Hub administrators should aim to have zero ingestion errors at all times. If an ingestion error occurs, you should treat resolving it as a high priority, because a sufficiently large number of ingestion errors may cause Gateway Hub to stop functioning correctly.

For more information, see Resolve an ingestion error.

Remove ingestion errors

Gateway Hub deletes old ingestion errors after seven days.

However, if you accumulate stale ingestion errors you may want to manually remove them.

To remove ingestion errors, perform the following steps on each node:

  1. Navigate to the directory of Gateway Hub's built-in PostgreSQL database. By default, this is located at /opt/hub/hub-<hub_version>/services/postgres-timescale-<version>.

  2. Start a SQL prompt:

    ./run-psql.sh 
  3. Connect to the database hub as the user postgres:

    \c hub
  4. Delete the ingestion errors:

    DELETE FROM errors;              
    DELETE 0
  5. Exit the SQL prompt:

    exit

Note: Since you cannot perform SQL operations simultaneously on all nodes, some inconsistencies may occur.

Gateway cannot publish

Error message: GatewayHubPublishing: Failed Sending: Local: Unknown error

When configuring a Gateway to publish data to Gateway Hub you may encounter this error. This occurs when a Gateway cannot find the required topics or cannot connect to any brokers.

To diagnose the cause of the error use the kafkacat tool to test the connection to Gateway Hub and fetch a list of metadata, including topic names.

Run the kafkacat command, specifying as options each of the Additional Settings required by the Gateway Setup Editor. These options have the form -X setting.name=value where the setting.name matches the corresponding Additional Setting omitting the kafka prefix.

You should provide the same credentials used when configuring the connection in the Gateway Setup Editor. For more information, see Connect a Gateway.

kafkacat -X security.protocol=ssl -X ssl.ca.location=<hub_CA_certificate> -b <hostname>:9092 -L

If kafkacat returns a list of topics that does not include geneos-events or geneos-metrics-v1, the Gateway will not be able to publish metrics to Gateway Hub. You should check the Gateway Hub configuration.

If kafkacat cannot connect to Gateway Hub, the Gateway will also be unable to connect. You should check the network connection.
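You can also script the check for the required topics against a saved metadata listing. The file name metadata.txt and the listing below are illustrative; in practice, capture the output of the kafkacat command above:

```shell
# In practice, save the metadata listing first, e.g.:
#   kafkacat -X security.protocol=ssl ... -b <hostname>:9092 -L > metadata.txt
# An illustrative listing is used here so the check itself can be shown.
cat > metadata.txt <<'EOF'
Metadata for all topics (from broker 0: hub1:9092/0):
 2 topics:
  topic "geneos-events" with 3 partitions:
  topic "geneos-metrics-v1" with 3 partitions:
EOF

# The Gateway needs both of these topics in order to publish.
for topic in geneos-events geneos-metrics-v1; do
  if grep -q "\"$topic\"" metadata.txt; then
    echo "$topic: present"
  else
    echo "$topic: MISSING"
  fi
done
```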

Kafkacat

The kafkacat tool is an open source utility written and maintained by the author of the librdkafka library used by Geneos. This utility is shipped with Linux 64-bit Gateways to ease the testing of connecting to your Kafka infrastructure. For more information about kafkacat, see kafkacat Github.

To ensure that kafkacat uses the same Kafka and SSL libraries as the Gateway, kafkacat must be run with the following environment variable:

  • LD_LIBRARY_PATH — this must point at the lib64 library supplied as part of the Gateway bundle.
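For example, you can scope the variable to a single invocation rather than exporting it for the whole session. The lib64 path below is a placeholder for the directory shipped with your Gateway bundle:

```shell
# Set LD_LIBRARY_PATH for one command only; /path/to/gateway/lib64 is a
# placeholder for the Gateway bundle's lib64 directory. In practice you
# would run ./kafkacat with its usual options in place of the echo.
LD_LIBRARY_PATH=/path/to/gateway/lib64 sh -c 'echo "using libraries in $LD_LIBRARY_PATH"'
# prints: using libraries in /path/to/gateway/lib64
```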

Kafka message size exceeds maximum request size

Error message: org.apache.kafka.common.errors.RecordTooLargeException: The message is 6747120 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.

When running a Gateway using central configuration, setup files are validated by Gateway Hub.

Gateway Hub uses Kafka messages to distribute Gateway setup files to a dedicated daemon that validates them. If the setup file size is large, it may exceed the default Kafka message limit of 1 MB. In this case, Gateway Hub is unable to validate files and the Gateway setup files cannot be saved.

To resolve this issue, you must increase the maximum Kafka message size for Gateway setup validation.

However, the following additional message size limits still apply:

  • HTTP server request size (8 MiB)
  • etcd message size (2 MiB)
  • gRPC message size (4 MiB)

If any of these are exceeded, Gateway setup files cannot be saved.
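As a rough sanity check, you can compare a setup file's serialized size against each limit. The sketch uses the 6747120-byte figure from the error message above; the byte values are conversions of the sizes listed here, taking the 1 MB Kafka default as 1048576 bytes (an approximation):

```shell
# Compare a message size (bytes) against the documented limits:
# Kafka default ~1 MB, etcd 2 MiB, gRPC 4 MiB, HTTP server 8 MiB.
size=6747120   # example size from the RecordTooLargeException above
for entry in "kafka-default:1048576" "etcd:2097152" "grpc:4194304" "http:8388608"; do
  name=${entry%%:*}
  limit=${entry##*:}
  if [ "$size" -gt "$limit" ]; then
    echo "$name: limit exceeded"
  else
    echo "$name: ok"
  fi
done
```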

Caution: Kafka messages over 8 MB in size will also breach the maximum HTTP server request size of the API Daemon.

Diagnose Kafka message size errors

The Kafka message size may be too small if both of the following behaviours are occurring:

  • Gateway validation in the Gateway Setup Editor is unable to complete.
  • The following appears in the API Daemon logs:
    2020-09-28 11:47:04.479Z ERROR [default-akka.actor.default-dispatcher-3] [ValidationQueryRequesterImpl] - Unhandled Kafka Exception publishing to topic 'unknown'
    org.apache.kafka.common.errors.RecordTooLargeException: The message is 6747120 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.
    2020-09-28 11:47:04.479Z ERROR [default-akka.actor.default-dispatcher-3] [ValidationQueryRequesterImpl] - Failed to publish validation query
    org.apache.kafka.common.errors.RecordTooLargeException: The message is 6747120 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.

Increase message size

To change the maximum Kafka message size for Gateway setup validation, perform the following steps:

  1. Open the API Daemon's Kafka configuration in your default text editor:
    ./hubctl/hubctl config edit -n apid -c apid.yaml installation-descriptor.yml
  2. In the kafkaProducer > properties section, specify the max.request.size in bytes. For example:
    # Kafka producer for publishing to the Hub Kafka cluster
    # https://kafka.apache.org/documentation/#producerconfigs
    kafkaProducer:
      properties:
        max.request.size: 5242880
        bootstrap.servers: localhost:9092
        acks: all
        key.serializer: org.apache.kafka.common.serialization.StringSerializer
        value.serializer: org.apache.kafka.common.serialization.ByteArraySerializer
  3. Open the Gateway configuration Daemon's Kafka configuration in your default text editor:
    ./hubctl/hubctl config edit -n gateway-configd -c gateway-configd.yaml installation-descriptor.yml
  4. In the kafkaProducer > properties section specify the max.request.size in bytes, and in the kafkaConsumer > properties section specify the fetch.max.bytes in bytes. The configuration file should look similar to:
    # Kafka producer
    # https://kafka.apache.org/documentation/#producerconfigs
    kafkaProducer:
      properties:
        max.request.size: 5242880
        bootstrap.servers: localhost:9092
        security.protocol: SSL
    
     
    # Kafka consumer
    # https://kafka.apache.org/documentation/#consumerconfigs
    kafkaConsumer:
      properties:
        fetch.max.bytes: 5242880
        bootstrap.servers: localhost:9092
        security.protocol: SSL
  5. Open the Kafka server configuration in your default text editor:
    ./hubctl/hubctl config edit -n kafka -c server.properties installation-descriptor.yml
  6. In the Replication Settings section, specify the replica.fetch.max.bytes in bytes. The configuration file should look similar to:
    #### Replication Settings  ####
     
    min.insync.replicas=3
    replica.fetch.max.bytes=5242880
  7. Update the Kafka topic configuration. To do this, run the following on a Gateway Hub node:
    <hub_root>/hub-current/services/kafka-2.12-2.5.0/kafka_2.12-2.5.0/bin/kafka-configs.sh --zookeeper localhost:5181 --entity-type topics --entity-name hub-gateways-validations-requests --alter --add-config max.message.bytes=5242880
    <hub_root>/hub-current/services/kafka-2.12-2.5.0/kafka_2.12-2.5.0/bin/kafka-configs.sh --zookeeper localhost:5181 --entity-type topics --entity-name hub-gateways-validations-queries --alter --add-config max.message.bytes=5242880
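To confirm that a topic change took effect, the same kafka-configs.sh tool supports a describe mode (shown here for one of the two topics altered above):

```shell
<hub_root>/hub-current/services/kafka-2.12-2.5.0/kafka_2.12-2.5.0/bin/kafka-configs.sh --zookeeper localhost:5181 --entity-type topics --entity-name hub-gateways-validations-requests --describe
```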