Gateway Hub - how to recover the apid service if it cannot catch up after a service failure

Gateway Hub can generally recover on its own after a server or service failure. During startup, it automatically applies any unconsumed records from the Kafka component.

Self Recovery

If the preceding maintenance or downtime lasted for an extended period, the apid service can take longer to recover. In that case, check the ready_timeout parameter for apid in orchestrationd.yml. The value is in seconds and defaults to 5 minutes (300 seconds); consider increasing it to 1 hour (3600 seconds) or more, depending on the data volume and the situation.

Check the apid log files to confirm whether the startup succeeded. Problems are more likely on a slow or heavily contended disk, where several restart attempts may be needed before apid resumes. The following messages are symptoms of such conditions:

WARN [Thread-43] [RocksDBCache/hub-keyValueStore-entitySnapshots] - [lumn_family.cc:872] [default] Stalling writes because we have 15 immutable memtables (waiting for flush), max_write_buffer_number is set to 16 rate 16777216
WARN [Thread-44] [RocksDBCache/hub-keyValueStore-entitySnapshots] - [lumn_family.cc:872] [default] Stalling writes because we have 15 immutable memtables (waiting for flush), max_write_buffer_number is set to 16 rate 16777216
WARN [main] [RocksDBCache/hub-keyValueStore-entitySnapshots] - [lumn_family.cc:872] [default] Stalling writes because we have 15 immutable memtables (waiting for flush), max_write_buffer_number is set to 16 rate 16777216
WARN [main] [RocksDBCache/hub-keyValueStore-entitySnapshots] - [lumn_family.cc:836] [default] Stopping writes because we have 16 immutable memtables (waiting for flush), max_write_buffer_number is set to 16
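
To gauge how often these warnings occur, a small shell sketch like the following can count them in a log file. The function name count_rocksdb_stalls and the log path passed to it are assumptions for illustration; they are not part of Gateway Hub.

```shell
# Count RocksDB "Stalling writes" and "Stopping writes" warnings in a log file.
# The log file location is installation-specific; pass it as the first argument.
count_rocksdb_stalls() {
    log_file="$1"
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    stalls=$(grep -c 'Stalling writes because' "$log_file" || true)
    stops=$(grep -c 'Stopping writes because' "$log_file" || true)
    echo "stalling=$stalls stopping=$stops"
}
```

For example, count_rocksdb_stalls /path/to/apid.log prints both counts on one line. A count that keeps growing across restart attempts suggests that the disk is still the bottleneck.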

Manual Recovery

If apid still cannot recover after the timeout has been increased, consider deleting and recreating the entity snapshots Kafka topic. This topic only holds the latest value for every metric, so it is repopulated over time as new data is consumed.

Please follow the instructions below:

Step 0: Environment preparation. On all nodes (adjust the paths to match your installation):

export HUB_HOME=/opt/hub/hub-2.5.1
export KAFKA_HOME=$HUB_HOME/services/kafka-2.12-2.8.1-log4j-patched

Step 1: Mask the systemd service and manually start the Hub in run-level 2. On all nodes:

sudo systemctl mask hub-orchestration
sudo systemctl stop hub-orchestration
$HUB_HOME/bin/hub.sh start
$HUB_HOME/bin/hub-admin run-level 2

Step 2: On all nodes, wait for apid to stop (check with $HUB_HOME/bin/hub-admin), then delete the directory /opt/hub/tmp/apid-kafka-kvs/hub-keyValueStore-entitySnapshots

Step 3: On one node only, check whether the file $KAFKA_HOME/../conf/server.properties contains delete.topic.enable=true. If it does not, add the line to the end of the file and restart the Hub with $HUB_HOME/bin/hub.sh restart
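
The check in Step 3 could be scripted roughly as follows. This is a minimal sketch: the function name ensure_delete_topic_enabled is hypothetical, and the restart is left as a manual follow-up because the function only reports whether the file was changed.

```shell
# Append delete.topic.enable=true to a properties file unless it is already set.
# Prints "already enabled" or "added" so the caller knows whether a restart is needed.
ensure_delete_topic_enabled() {
    props="$1"
    if grep -Eq '^[[:space:]]*delete\.topic\.enable[[:space:]]*=[[:space:]]*true[[:space:]]*$' "$props"; then
        echo "already enabled"
        return 0
    fi
    # The leading newline guards against a file that lacks a trailing newline.
    printf '\ndelete.topic.enable=true\n' >> "$props"
    echo "added"
}
```

It would be called as ensure_delete_topic_enabled "$KAFKA_HOME/../conf/server.properties". If it prints "added", restart the Hub as described in Step 3.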

Step 4: Ensure that the Kafka client can access details about topic hub-keyValueStore-entitySnapshots

$KAFKA_HOME/bin/kafka-topics.sh --command-config $KAFKA_HOME/../conf/client.properties --zookeeper localhost:5181 --topic hub-keyValueStore-entitySnapshots --describe

Step 5: Delete topic hub-keyValueStore-entitySnapshots

$KAFKA_HOME/bin/kafka-topics.sh --command-config $KAFKA_HOME/../conf/client.properties --zookeeper localhost:5181 --topic hub-keyValueStore-entitySnapshots --delete

Step 6: Wait until topic hub-keyValueStore-entitySnapshots no longer appears in the topic list

$KAFKA_HOME/bin/kafka-topics.sh --command-config $KAFKA_HOME/../conf/client.properties --zookeeper localhost:5181 --topic hub-keyValueStore-entitySnapshots --list
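
Rather than re-running the list command by hand, the wait could be wrapped in a polling loop. The sketch below assumes KAFKA_HOME is set as in Step 0 and reuses the same list command; the function names, retry count, and delay are illustrative choices, not Gateway Hub defaults.

```shell
TOPIC=hub-keyValueStore-entitySnapshots

# Runs the same list command as Step 6 (assumes KAFKA_HOME is exported).
list_topics() {
    "$KAFKA_HOME/bin/kafka-topics.sh" \
        --command-config "$KAFKA_HOME/../conf/client.properties" \
        --zookeeper localhost:5181 --topic "$TOPIC" --list
}

# Poll until the topic no longer appears in the list output.
# Arguments: maximum number of attempts (default 30), delay in seconds (default 10).
wait_for_topic_gone() {
    max_tries="${1:-30}"
    delay="${2:-10}"
    i=0
    while [ "$i" -lt "$max_tries" ]; do
        # Only trust the result when the list command itself succeeded.
        if out=$(list_topics); then
            # Substring match, so a line that still mentions the topic
            # (with any suffix) counts as "present".
            if ! printf '%s\n' "$out" | grep -Fq "$TOPIC"; then
                echo "topic gone"
                return 0
            fi
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "topic still present" >&2
    return 1
}
```

Calling wait_for_topic_gone 60 5 would check every 5 seconds for up to 5 minutes. A failed list command is treated as "not yet confirmed" rather than as success, so a connection problem does not cause Step 7 to run too early.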

Step 7: Re-create topic hub-keyValueStore-entitySnapshots

$KAFKA_HOME/bin/kafka-topics.sh --command-config $KAFKA_HOME/../conf/client.properties --zookeeper localhost:5181 --topic hub-keyValueStore-entitySnapshots --create --config cleanup.policy=compact --config min.insync.replicas=1 --config segment.bytes=536870912 --partitions 10 --replication-factor 1

Step 8: Wait until topic hub-keyValueStore-entitySnapshots becomes available

$KAFKA_HOME/bin/kafka-topics.sh --command-config $KAFKA_HOME/../conf/client.properties --zookeeper localhost:5181 --topic hub-keyValueStore-entitySnapshots --describe

Step 9: On all nodes, return to run-level 4.

$HUB_HOME/bin/hub-admin run-level 4

Step 10: Final checks. Check that everything is working as expected. In the Web Console, some Dataview cells take a while to become available because the data snapshot is being rebuilt from incoming data.

Step 11: Unmask the systemd service. On all nodes:

$HUB_HOME/bin/hub.sh stop
sudo systemctl unmask hub-orchestration
sudo systemctl start hub-orchestration