VMware HA: Isolation Response

A friend of mine asked me to blog about VMware HA, specifically isolation response. Much of my knowledge comes from Duncan Epping’s books on HA/DRS and his website Yellow Bricks. This post is targeted at vSphere 4.1 environments. At the end of this post I will include links to his blog, his books, and other HA resources, along with information regarding vSphere 5.

Isolation response is the action a host in an HA-enabled cluster takes when it determines it has been isolated. A vSphere 4 environment offers three isolation responses, and I will explore each of them in more detail below.

  • Leave Powered On (default)
  • Power Off
  • Shut down

This setting can be changed in the cluster properties, as shown in the screenshot below.

Leave Powered On: Leaves the VMs powered on during an HA event. This is the default option. It helps mitigate false positives, such as a network failure in which the host and its datastores are otherwise unaffected.

Power Off: Initiates a hard stop of the guests, powering them off immediately.

Shut Down: Initiates a graceful shutdown of the guests on the host during an HA event. VMware Tools must be installed in the guest for this to work, and the shutdown can take some time depending on the state of the VMs. If the shutdown has not completed after 5 minutes, a Power Off is initiated.
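To make the differences between the three settings concrete, here is a small, hypothetical Python model of what happens to a single VM under each isolation response. It is not a VMware API; the names `IsolationResponse`, `apply_isolation_response` and `SHUTDOWN_TIMEOUT_SECONDS` are made up for illustration. The 300-second value reflects the 5-minute fallback described above. Treating a guest without VMware Tools as falling through to the power-off fallback is an assumption of this sketch.

```python
from enum import Enum


class IsolationResponse(Enum):
    # The three vSphere 4.x isolation response settings described above
    LEAVE_POWERED_ON = "leave_powered_on"
    POWER_OFF = "power_off"
    SHUT_DOWN = "shut_down"


# A graceful shutdown that takes longer than this (5 minutes) falls back to Power Off
SHUTDOWN_TIMEOUT_SECONDS = 300


def apply_isolation_response(
    response: IsolationResponse,
    has_vmware_tools: bool,
    shutdown_duration_seconds: float,
) -> str:
    """Hypothetical model of the final state of one VM on an isolated host."""
    if response is IsolationResponse.LEAVE_POWERED_ON:
        return "left powered on"
    if response is IsolationResponse.POWER_OFF:
        return "powered off (hard stop)"
    # SHUT_DOWN needs VMware Tools in the guest. Assumption: without Tools,
    # or if the guest takes too long, the host falls back to a hard power off.
    if has_vmware_tools and shutdown_duration_seconds <= SHUTDOWN_TIMEOUT_SECONDS:
        return "shut down gracefully"
    return "powered off after shutdown timeout"


print(apply_isolation_response(IsolationResponse.SHUT_DOWN, True, 120))
print(apply_isolation_response(IsolationResponse.SHUT_DOWN, True, 400))
```

A guest that shuts down within 2 minutes ends up cleanly stopped. One that is still busy after 400 seconds is hard powered off, just as if Power Off had been selected.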

In all of these cases HA will attempt to restart the VMs on another host in the cluster. However, if the host is only isolated from the network and the VMs are left powered on, their files remain locked on the datastore, so the VMs cannot be restarted elsewhere.
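The locking behaviour can be illustrated with a deliberately simplified, hypothetical model. It assumes the isolated host still has access to storage, so it keeps holding the lock on a running VM's files. It also assumes that powering off or shutting down the VM releases that lock. Real VMFS/NFS locking is more involved. `VmFiles`, `handle_isolation` and `try_restart` are illustrative names, not part of any VMware product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VmFiles:
    name: str
    lock_owner: Optional[str]  # host currently holding the lock on the VM's files


def handle_isolation(vm: VmFiles, response: str) -> None:
    """Hypothetical: stopping the VM releases its lock; leaving it on keeps the lock."""
    if response in ("power_off", "shut_down"):
        vm.lock_owner = None
    # "leave_powered_on": the VM keeps running on the isolated host, lock unchanged


def try_restart(vm: VmFiles, target_host: str) -> bool:
    """Hypothetical: a restart succeeds only if no other host holds the lock."""
    if vm.lock_owner is not None and vm.lock_owner != target_host:
        return False
    vm.lock_owner = target_host
    return True


vm = VmFiles("web01", lock_owner="esx01")
handle_isolation(vm, "leave_powered_on")
print(try_restart(vm, "esx02"))  # restart blocked: esx01 still holds the lock
handle_isolation(vm, "power_off")
print(try_restart(vm, "esx02"))  # restart succeeds once the lock is released
```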

Below is an excerpt from Duncan’s blog on design that may help you make your decision.

Basic design principle 1: Isolation response should be chosen based on the version of ESX used. For pre-vSphere 4 Update 2 environment with iSCSI/NFS Storage I recommend to set the isolation response to “Power off” to avoid a possible split brain scenario. I also recommend to have a secondary service console running on the same vSwitch as the iSCSI network to detect an iSCSI outage and avoid false positives.

Basic design principle 2: Base your isolation response on your SLA. If your SLA dictates that hosts with degraded hardware should not be used, make sure to select shutdown or power off.
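The two design principles can be read as a simple decision procedure. The hypothetical helper below encodes them. The parameter names are made up for illustration. Principle 2 allows either Shut down or Power off, and this sketch picks Shut down. Falling back to Leave powered on when neither principle applies is an assumption based on the default described earlier, not part of Duncan's guidance.

```python
def recommend_isolation_response(
    before_vsphere4_u2: bool,
    uses_ip_storage: bool,  # iSCSI or NFS
    sla_forbids_degraded_hosts: bool,
) -> str:
    """Hypothetical helper encoding the two basic design principles above."""
    # Principle 1: pre-vSphere 4 Update 2 with iSCSI/NFS -> "Power off" to avoid split brain
    if before_vsphere4_u2 and uses_ip_storage:
        return "Power off"
    # Principle 2: SLA says degraded hosts must not be used -> shut down (or power off)
    if sla_forbids_degraded_hosts:
        return "Shut down"
    # Assumption: otherwise keep the default setting
    return "Leave powered on"


print(recommend_isolation_response(True, True, False))
print(recommend_isolation_response(False, True, True))
```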

Please see the following links for more detailed information on HA and isolation response:

I would also highly recommend picking up Duncan’s books on HA/DRS for both vSphere 4/5 if you haven’t already.