Permanent Device Loss (PDL) enhancements


These PDL enhancements are useful in both “stretched storage cluster” and non-stretched environments.

Consider this scenario for a stretched storage cluster:
PDL is probably most common in non-uniform stretched solutions like EMC VPLEX. With VPLEX, site affinity is defined per LUN. If your VM resides in Datacenter-A while the LUN it is stored on has affinity to Datacenter-B, then in case of a failure this VM could lose access to the LUN. These new PDL enhancements will ensure the VM is killed and restarted on the other side.

Consider this scenario for a non-stretched cluster:
PDL occurs when, for instance, the storage admin makes a mistake and removes a specific host's access to a LUN.

Note:
Action will only be taken when a PDL sense code is issued. When your storage fails completely, for instance, the PDL condition cannot be reached, because the array can no longer communicate with the ESXi host; the host will identify this state as an All Paths Down (APD) condition instead. APD is the more common scenario in most environments. If you are testing these enhancements, please check the log files to validate which condition has been identified.

PDL enhancements:

Two advanced settings make this possible. The first is configured at the host level: “disk.terminateVMOnPDLDefault=TRUE” should be added to /etc/vmware/settings. This setting ensures that when a datastore enters a PDL state, the corresponding virtual machine is killed. The virtual machine is killed as soon as it initiates disk I/O on a datastore which is in a PDL condition, provided all of the virtual machine files reside on this datastore. Note that if a virtual machine does not initiate any I/O it will not be killed!
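As a quick sanity check, the host-level setting can be verified with a few lines of Python. This is only a minimal sketch: the helper name pdl_kill_enabled is hypothetical, and it assumes the settings file holds simple key=value lines (optionally quoted, with "#" comment lines), which may not match every ESXi build.

```python
def pdl_kill_enabled(settings_text, key="disk.terminateVMOnPDLDefault"):
    """Return True if the given settings text sets `key` to TRUE.

    Assumption: the file consists of simple key=value lines; blank lines
    and lines starting with '#' are skipped.
    """
    for raw in settings_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip().lower() == key.lower():
            # Tolerate surrounding whitespace, optional quotes, and any casing.
            return value.strip().strip('"').lower() == "true"
    return False


# Example input using the exact line quoted in this post.
example = "disk.terminateVMOnPDLDefault=TRUE\n"
print(pdl_kill_enabled(example))
```

On a host you would pass in the contents of /etc/vmware/settings rather than the example string.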


The second setting is a vSphere HA advanced setting called das.maskCleanShutdownEnabled. This setting is also not enabled by default in vSphere 5.0 Update 1, but it is enabled by default from vSphere 5.1 onwards. It allows HA to trigger a restart response for a virtual machine which has been killed automatically due to a PDL condition, because it lets HA differentiate between a virtual machine which was killed due to the PDL state and one which was powered off by an administrator.

As soon as “disaster strikes” and the PDL sense code is sent, you will see the following in vmkernel.log, indicating the PDL condition and the kill of the VM:


2012-03-14T13:39:25.085Z cpu7:4499)WARNING: VSCSI: 4055: handle 8198(vscsi4:0):opened by wid 4499 (vmm0:fri-iscsi-02) has Permanent Device Loss. Killing world group leader 4491
2012-03-14T13:39:25.085Z cpu7:4499)WARNING: World: vm 4491: 3173: VMMWorld group leader = 4499, members = 1
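When testing, it can help to scan vmkernel.log for these entries rather than reading it by eye. The following is a rough sketch only: the regular expression is modeled on the single sample line above, so other ESXi builds may word the message differently, and the function name find_pdl_kills is made up for illustration.

```python
import re

# Pattern modeled on the sample vmkernel.log line shown in this post.
# Assumption: other ESXi versions may format this message differently.
PDL_KILL = re.compile(
    r"^(?P<timestamp>\S+)\s+cpu\d+:\d+\)WARNING: VSCSI: \d+: "
    r"handle \d+\((?P<device>[^)]+)\):opened by wid \d+ "
    r"\((?P<world>[^)]+)\) has Permanent Device Loss\. "
    r"Killing world group leader (?P<leader>\d+)"
)


def find_pdl_kills(lines):
    """Return one dict per log line that reports a VM killed due to PDL."""
    events = []
    for line in lines:
        match = PDL_KILL.search(line.strip())
        if match:
            events.append(match.groupdict())
    return events


# The two log lines from this post; only the first one is a PDL kill message.
sample = [
    "2012-03-14T13:39:25.085Z cpu7:4499)WARNING: VSCSI: 4055: handle 8198(vscsi4:0):"
    "opened by wid 4499 (vmm0:fri-iscsi-02) has Permanent Device Loss. "
    "Killing world group leader 4491",
    "2012-03-14T13:39:25.085Z cpu7:4499)WARNING: World: vm 4491: 3173: "
    "VMMWorld group leader = 4499, members = 1",
]

events = find_pdl_kills(sample)
for event in events:
    print(event["timestamp"], event["world"], "killed, group leader", event["leader"])
```

If the scan finds nothing while VMs have lost storage, that is a hint the host is treating the failure as APD rather than PDL.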


Thanks to Duncan Epping for his posts
