Pollers or peers are disconnected
Related to
mon node status shows the status of one or more pollers and/or peers as inactive or disconnected.
Problem
- Pollers or peers are not connected or are in an inactive state.
Possible Cause(s)
- Different OP5 versions within the cluster.
- SSH keys not propagated within the cluster.
- SSH keys have changed or are no longer valid.
- Invalid Naemon configuration.
The ist diagnose command can quickly check for most of the causes listed above.
Possible Solution(s)
Basic troubleshooting
First, run mon restart on all nodes. If the restart does not fix the issue, continue with the steps below.
Ensure that the nodes can communicate with each other. Tools such as ssh, ping, or nc can verify that a connection can be established.
An example using nc is shown below. Merlin runs on port 15551 by default.
[root@mon9-mas01 ~]# nc -zv mon9-mas02peer 15551
Connection to mc-rocky-mon9-mas02peer (xx.xx.xx.xx) 15551 port [tcp/*] succeeded!
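To repeat this check against every node in the cluster, the nc test can be wrapped in a small loop. The following is a minimal sketch, not part of OP5: the function names check_merlin_port and check_all_nodes are illustrative, and it assumes the installed nc supports the -z and -w options.

```shell
#!/bin/sh
# Sketch: report whether the Merlin port is reachable on each given node.
# MERLIN_PORT defaults to 15551, the default Merlin port.
MERLIN_PORT="${MERLIN_PORT:-15551}"

check_merlin_port() {
    # $1 = hostname or IP address of a peer or poller
    if nc -z -w 3 "$1" "$MERLIN_PORT" >/dev/null 2>&1; then
        echo "$1: port $MERLIN_PORT reachable"
    else
        echo "$1: port $MERLIN_PORT NOT reachable"
        return 1
    fi
}

check_all_nodes() {
    # Check every node passed as an argument; fail if any is unreachable.
    status=0
    for node in "$@"; do
        check_merlin_port "$node" || status=1
    done
    return "$status"
}
```

For example, check_all_nodes mon9-mas02peer <hostname2> prints one line per node and returns a non-zero status if any node cannot be reached.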
Verify OS and OP5 versions
Clustering in OP5 requires all nodes to run the same OS and OP5 versions. Run this command on every node and confirm that the reported version is identical:
cat /etc/op5-monitor-release
The output identifies the installed OP5 version. If the versions differ between nodes, upgrade them so that every node runs the same version.
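When the cluster has many nodes, the comparison can be scripted. The sketch below is illustrative only: compare_versions is a hypothetical helper, it makes no assumption about the format of the release file, and the commented usage assumes that ssh to each node already works.

```shell
#!/bin/sh
# Sketch: compare release information collected from several nodes.
# Input on stdin: one line per node, "<node> <release info>".
# Prints the distinct release strings and fails if there is more than one.
compare_versions() {
    # Drop the node name (first field), then reduce to unique release strings.
    distinct=$(cut -d' ' -f2- | sort -u)
    # Count the non-empty lines; "|| true" keeps an empty result from aborting.
    count=$(printf '%s\n' "$distinct" | grep -c . || true)
    printf '%s\n' "$distinct"
    [ "$count" -le 1 ]
}

# Possible usage (node names are placeholders):
#   for node in <hostname1> <hostname2>; do
#       printf '%s %s\n' "$node" \
#           "$(ssh "$node" cat /etc/op5-monitor-release | tr '\n' ' ')"
#   done | compare_versions || echo "Version mismatch in cluster"
```

If more than one distinct string is printed, at least one node is running a different release.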
Troubleshooting SSH issues
Check /var/log/secure for errors pertaining to SSH. If there are any, run the commands below on the server that has the issue, once for each of the other servers in the cluster. For example:
mon sshkey push <hostname1>
asmonitor mon sshkey push <hostname1>
mon sshkey push <hostname2>
asmonitor mon sshkey push <hostname2>
These commands push the SSH keys to the other servers in the cluster. OP5 uses password-less SSH connections for some communications, so the keys must be present on every server.
Also check the Merlin log file /var/log/op5/merlin/neb.log. In some instances, it contains errors such as the following:
[1676376045] 4: stdout: Offending RSA key in /opt/monitor/.ssh/known_hosts:1
[1676376045] 4: stdout: RSA host key for monitor1_peer has changed and you have requested strict checking.
[1676376045] 4: stdout: Host key verification failed.
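To find such entries without reading the whole log, the affected host names can be extracted with sed. This is a minimal sketch: changed_host_keys is a hypothetical helper, and it assumes the log lines have the same wording as the excerpt above.

```shell
#!/bin/sh
# Sketch: list hosts whose SSH host key is reported as changed in a log file.
# $1 = path to the log file (for example /var/log/op5/merlin/neb.log)
changed_host_keys() {
    # Match lines such as:
    #   "... RSA host key for monitor1_peer has changed and you have ..."
    # and print only the host name, once per host.
    sed -n 's/.*host key for \([^ ]*\) has changed.*/\1/p' "$1" | sort -u
}
```

Running changed_host_keys /var/log/op5/merlin/neb.log prints each host whose known_hosts entry needs to be removed, one per line.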
If an IP address or hostname has changed, first remove the known_hosts entry of the affected node before running the mon sshkey push commands. On all affected nodes, run the command below to remove the stale entry from the monitor user's known_hosts file, replacing hostname with the name of the affected node:
runuser -l monitor -c 'ssh-keygen -R hostname'
After removing the known_hosts entry and re-running mon sshkey push, restart Merlin on all nodes:
systemctl restart merlind
Then observe the status with mon node status.
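The sequence for a single changed node can be collected into one helper. The sketch below only strings together the commands from this article; the names run and refresh_node_keys are illustrative, and the dry-run default is an assumption added so that the commands can be reviewed before they are executed as root.

```shell
#!/bin/sh
# Sketch: refresh the SSH trust for one changed node, using the commands
# from this article. With DRY_RUN=1 (the default here) the commands are
# only printed, without their original quoting, and nothing is executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

refresh_node_keys() {
    # $1 = hostname of the node whose address or host key changed
    # Remove the stale known_hosts entry for the monitor user.
    run runuser -l monitor -c "ssh-keygen -R $1"
    # Push the keys again, as root and as the monitor user.
    run mon sshkey push "$1"
    run asmonitor mon sshkey push "$1"
    # Restart Merlin on this node; repeat the restart on the other nodes.
    run systemctl restart merlind
}
```

Calling refresh_node_keys <hostname1> prints the four commands in order. Setting DRY_RUN=0 executes them instead.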