One of the issues that can occur when managing VMware Hosts through vCenter is experiencing random or intermittent VMware Host disconnects from the vCenter Server. This behaviour happens when the vSphere ESXi Hosts stop sending heartbeat messages back to the vCenter. If the VMware vCenter does not receive these heartbeats at all, or does not receive them within a specific polling time, then the vCenter Server assumes those VMware Hosts are down or unreachable.
These heartbeat messages are sent as UDP packets from the VMware Hosts to the vCenter Server on port 902. By default the VMware ESXi Hosts send a heartbeat message every 10 seconds, while the vCenter uses a 60-second window as a sort of polling time. Should no heartbeat messages be received within these 60 seconds, the vCenter triggers an alarm as per the “Host connection and power state” default alarm configuration.
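The timing described above can be sketched with a few lines of Python (a simulation for illustration only; the constants mirror the defaults mentioned in the text, not vCenter’s internal implementation):

```python
POLLING_WINDOW = 60  # seconds: how long vCenter waits for a heartbeat before alarming

def host_is_responding(last_heartbeat, now, window=POLLING_WINDOW):
    """Return True if the last heartbeat arrived within the polling window."""
    return (now - last_heartbeat) <= window

# Simulated timestamps in seconds since an arbitrary epoch
now = 1000
assert host_is_responding(now - 30, now)      # heartbeat 30s ago: still connected
assert not host_is_responding(now - 61, now)  # heartbeat 61s ago: alarm fires
```

In other words, a single lost heartbeat is harmless; only a whole polling window without any heartbeat trips the “Host connection and power state” alarm.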
There are a number of reasons why this might happen: typically network ports that still need to be opened on the firewall, or a symptom of a congested network. It is always a good idea to review the best practices for VMware vSphere networking.
In the environment I set up in my home lab I was receiving random VMware Host disconnects. Although the VMware ESXi Hosts appeared disconnected, I was still able to use the virtual machines associated with those Hosts.
When looking at the issues reported in the Monitor section, the triggered alarm was “Host connection and power state”.
After some troubleshooting and digging for more information, it looks like this behaviour can also happen with a perfectly good network configuration. In this case everything boils down to how the vCenter receives the heartbeat messages from the VMware ESXi Hosts. Luckily there’s a setting to configure this behaviour, and this article will cover it in more detail.
Review VMware Host disconnects from vCenter
When the vCenter is not receiving the heartbeat messages from the VMware ESXi Hosts, it automatically places those Hosts in “Disconnected” status. Chances are that either the messages are not getting through or the timeout for these messages is simply too low.
In addition, when reviewing the issues in the Monitor section we can see triggered alarms for “Host connection and power state”. The screenshot below shows an example of one of these random VMware Host disconnects.
Surely we can try to reconnect the VMware ESXi Host from the Actions menu, or even restart the Host. Chances are, though, this behaviour will happen again.
Assuming we can rule out misconfiguration, we can actually change the default polling time. All we need to do is access the Advanced Settings in the vCenter Server:
Manage > Settings > Advanced Settings
Let’s look for the following parameter:
If this parameter does not exist we can create it manually on the fly with the “Edit” option.
In my home lab I’m using the latest version of vSphere, 6.0u3, and could not find this parameter, so I will create a new one and assign it the value “120”.
This means the default polling time for the vCenter to check for heartbeat messages is now set to 2 minutes.
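With heartbeats still arriving every 10 seconds, the new value also changes how many consecutive heartbeats may be lost before the alarm fires; a quick sanity check (plain arithmetic, for illustration):

```python
heartbeat_interval = 10  # seconds between ESXi heartbeats (default)
old_window = 60          # default vCenter polling time, seconds
new_window = 120         # the value we just configured, seconds

# Consecutive heartbeats that can be lost before the alarm triggers
print(old_window // heartbeat_interval)  # 6
print(new_window // heartbeat_interval)  # 12
```

So the new setting tolerates twice as many lost heartbeats before the Host is flagged as disconnected.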
Let’s click “Add” and “OK” to apply the changes.
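For completeness, the same change can also be scripted instead of using the UI. Below is a minimal pyVmomi sketch, assuming the advanced setting key is config.vpxd.heartbeat.notRespondingTimeout (the key commonly documented for this heartbeat timeout) and using placeholder connection details:

```python
# Requires: pip install pyvmomi
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()       # lab only: skip certificate validation
si = SmartConnect(host="vcenter.lab.local",  # hypothetical vCenter FQDN/credentials
                  user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)

option_manager = si.content.setting          # vCenter advanced settings (OptionManager)
option_manager.UpdateOptions(changedValue=[
    vim.option.OptionValue(
        key="config.vpxd.heartbeat.notRespondingTimeout",  # assumed key
        value=120)                                         # seconds, as set in the UI
])

Disconnect(si)
```

The host name, credentials, and key above are placeholders; adjust them for your own environment before running anything like this.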
We should now get something similar to the screenshot below.
To make the new changes effective we need to restart the VMware vCenter Server service. In this environment I’m using the vCenter installed on Windows. As usual we can do this from the Windows Services MMC panel, or simply by issuing:
As soon as we try to restart the VMware vCenter Server service, the MMC will detect the relevant dependencies and restart them accordingly.
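The same restart could also be scripted; a sketch from an elevated prompt, where the service name vpxd (“VMware VirtualCenter Server”) is an assumption based on a default Windows vCenter install and may differ in your environment:

```python
import subprocess

# Windows only; run elevated. "vpxd" is the assumed service name of the
# VMware vCenter Server service on a default Windows installation.
for cmd in (["net", "stop", "vpxd"], ["net", "start", "vpxd"]):
    subprocess.run(cmd, check=True)
```

Unlike the MMC, this does not resolve dependent services automatically, so the Services panel remains the safer option.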
Once all services are restarted we can access the VMware vSphere console again. Of course, we need to be patient while all the VMware services allocate memory on the machine.
After a few seconds we can see that the “Host connection and power state” alarm is now resolved.
Of course I would suggest using configurations as close as possible to the minimum requirements. Running such configurations in a home lab is great, as it gives us the option to learn more and test different aspects of the deployments.