Yes, that’s a lot of information in the title, but it is exactly what this PowerCLI script does: set the PerenniallyReserved flag on RDM storage devices that are shared between multiple server nodes. This is the typical situation with a Microsoft Failover Cluster across boxes using shared RDMs.

I returned from my vacation and one of the first questions I got, after a team member tried to reboot a vSphere 5.1 host coming out of maintenance, was: why does this boot take so long? While discussing the situation and monitoring the VMkernel log during boot (Alt-F12 on the console), I learned that a VM with a physical bus-sharing SCSI controller and physical RDMs had been shut down before the maintenance (online vMotion is not allowed in this configuration), and that the boot was hanging on storage device discovery. It continues to the next device after a few retries, and hangs again when it encounters the same sort of storage device. I remembered this situation from a previous encounter and pointed the customer to Knowledge Base article http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1016106&src=vmw_so_vex_pheld_277 and the vSphere MSCS Setup Checklist at http://pubs.vmware.com/vsphere-55/index.jsp?src=vmw_so_vex_pheld_277&topic=%2Fcom.vmware.vsphere.mscs.doc%2FGUID-E794B860-E9AB-43B9-A6D0-F7DE222695A1.html. And yes, the storage devices were part of an MSCS cluster and all the symptoms matched.

The Symptoms

In a scenario where you have to reboot or rescan a host that is participating in a Microsoft cluster (MSCS or Microsoft Failover Cluster), you will notice slow boot or rescan times on that host. This applies to all shared RDMs on which an active node holds a SCSI reservation, so clusters other than MSCS can be affected as well.
The delay exists because the active node (the hypervisor host running the cluster node VM that owns the active cluster resources) still holds SCSI reservations on the RDM LUNs. This inadvertently slows down the “other” hypervisor host (the one with the passive node) during boot or rescans, as it tries to interrogate each device presented to it, including the MSCS RDM LUNs (which are actually in use by the other node), during storage path claiming and device discovery. Access to the reserved/active RDMs fails and is retried until ESXi gives up and moves along.
With hosts that have a lot of RDMs this can take a while to finish; I have seen anything from around 10 minutes up to 2 hours.

The Solution

From Knowledge Base article 1016106 and the vSphere MSCS checklist we learn that an option needs to be set on the MSCS LUNs: the perennially reserved flag on the shared storage device(s). This flag is false (unselected) by default.

You will have to identify the RDM LUNs that are part of the MSCS cluster(s) and take note of the NAA IDs of the shared RDM devices.
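If you want to gather these with PowerCLI, a minimal sketch could look like the one below; the VM names are hypothetical placeholders for your own MSCS node VMs, and the cmdlets are the same ones the full script uses later on.

# Minimal sketch: list the NAA IDs of the RDMs attached to the MSCS node VMs.
# The VM names are placeholders; replace them with your own cluster node VMs.
Get-VM -Name "MSCS-Node1","MSCS-Node2" |
    Get-HardDisk -DiskType "RawPhysical","RawVirtual" |
    Select-Object -Property Parent, Name, ScsiCanonicalName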

In vSphere 5.1 (and 5.5) we have several options to set this flag:

  • Host Profiles: use Storage Configuration – Pluggable Storage Architecture (PSA) configuration – PSA Device Setting – <Storage Device NAA id> – Setting – Configuration Details – and flag the device’s perennially reserved status. Of course, this environment also included editions without Host Profiles.


  • Use the esxcli command on all participating hosts for each of the shared RDM NAA IDs: esxcli storage core device setconfig -d naa.id --perennially-reserved=true.
  • Use PowerCLI together with Get-EsxCli to run storage.core.device.setconfig on the involved NAA IDs (a one-off example follows below).
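For a single device, a minimal PowerCLI sketch of that last option could look like this; the host name and naa.id are placeholders for your own environment:

# One-off sketch for a single host and a single device; host name and NAA ID are placeholders.
$esxcli = Get-EsxCli -VMHost "esx01.example.local"
# Arguments: detached ($false), device (the NAA ID), perennially reserved ($true).
$esxcli.storage.core.device.setconfig($false, "naa.600508b1001c9876543210fedcba0000", $true)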

Putting it together in a PowerCLI script

I have created a PowerCLI script that connects to vCenter and to a target cluster. A vSphere cluster is used as the target because the MSCS clusters are set up to support HA on the VMs: all hosts in the cluster have the RDMs visible, and the vSphere cluster contains several clustered VMs. HA is supported for MSCS cluster node VMs, though whether it is needed for VMs that participate in application/service clustering is open for discussion. The vCenter server and cluster are defined in the $vCenter and $Cluster variables; set these to match your environment.

Next the script takes the cluster location and scans for VMs whose SCSI controller BusSharingMode equals Physical (physical mode is required for MSCS deployments across boxes, and this also avoids picking up non-shared RDMs in the environment), using the Get-ScsiController cmdlet and Where-Object filtering on Physical BusSharingMode. Each VM found with one or more shared SCSI controllers is processed in a foreach loop. The VM name is captured and the VM is scanned for virtual or physical RDMs with Get-HardDisk -DiskType. I use this twice: once to save the information to a CSV (Export-CSV cmdlet) for documentation purposes, and once to get the ScsiCanonicalName (the NAA ID). If you don’t need the export, or it takes too long because there are a lot of RDMs, remove the line with the Export-CSV cmdlet.

The script then loops through each host in the vSphere cluster and connects to it with the Get-EsxCli cmdlet. Next it loops through the RDM naa.ids and uses esxcli storage.core.device.setconfig on each of them to set the flag to true. To mark a device as perennially reserved, the command to run is $esxcli.storage.core.device.setconfig($false, “naa.id”, $true), which in the script becomes $esxcli.storage.core.device.setconfig($false, ($RDM.ScsiCanonicalName), $true), where $RDM is a disk object of DiskType RawPhysical/RawVirtual selected on ScsiCanonicalName.

You can check whether the flag is set by opening PowerCLI and connecting to vCenter. Use $esxcli = Get-EsxCli -VMHost <participating host> and list the device with $esxcli.storage.core.device.list(“RDM naa.id”). The IsPerenniallyReserved property should now read: true.
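A short verification sketch, again with a placeholder host name and naa.id:

# Verification sketch; host name and NAA ID are placeholders for your environment.
$esxcli = Get-EsxCli -VMHost "esx01.example.local"
# The listed device should show IsPerenniallyReserved : true.
$esxcli.storage.core.device.list("naa.600508b1001c9876543210fedcba0000")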

Lastly, the VM name is saved in the $OldVM variable so an if statement can check for duplicate VM names; the loop is driven by the SCSI controller check, and VMs with multiple shared SCSI controllers would otherwise be scanned twice.

After running this script, all subsequent rescans and boots will be at normal speed.

# This script sets the PerenniallyReserved flag for RDM devices that are part of a Microsoft cluster.
# The main reason is to lower the boot and rescan times of the hosts, as these wait on the shared devices while booting/rescanning.

# In a scenario where you have to reboot a host that is participating in a Microsoft cluster, you will notice slow boot or rescan times for the host.
# This is because the active node still holds SCSI reservations on the RDM LUNs, which inadvertently causes the hypervisor to slow down during boot or rescans as it tries
# to interrogate each of the devices presented to it, including the MSCS RDM LUNs (which are actually active on the other node) during storage discovery. Access to the reserved/active RDMs fails and is retried until ESXi gives up and moves along.
# With hosts that have a lot of RDMs this can take a while to finish; times up to 2 hours have been seen.

# This script connects to the defined cluster, gets the VMs configured with a shared physical SCSI controller and those VMs' raw disk canonical names.
# It then connects to all hosts in that cluster and sets those canonical devices to PerenniallyReserved (running the MSCS node VMs cluster-wide is debatable, but that is how this environment is configured: DRS no, HA is up for discussion).

# See KB 1016106 for details (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1016106&src=vmw_so_vex_pheld_277)

# Settings
$vCenter = "<vcenter fqdn>" # vCenter server name; the user running the script needs access to the vCenter instance.

$Cluster = "<clustername as shown in inventory>" # Cluster to connect to

# Connect to vCenter
Connect-VIServer -Server $vCenter

# Get the hosts from the cluster
$VMhosts = Get-Cluster $Cluster | Get-VMHost

# We need VMs with RDMs on a SCSI controller with Bus Sharing set to Physical (as per the vSphere MSCS guide/checklist).
# The check against $OldVM filters out VMs with more than one shared vSCSI controller, so each VM is processed only once.
Get-VM -Location $Cluster | Get-ScsiController | Where-Object {$_.BusSharingMode -eq "Physical"} | ForEach {
    $VMs = $_.Parent.Name
    # Get the RDM disks of these VMs.
    # RDMs can be either Virtual or Physical, so look for both types.
    # The disk information is exported for documentation; comment out the Export-CSV line when you don't need it or it takes too long.
    If ($VMs -ne $OldVM) {
        Get-VM -Name $VMs | Get-HardDisk -DiskType "RawPhysical","RawVirtual" | Select-Object -Property Parent, Name, DiskType, FileName, CapacityGB, ScsiCanonicalName | Export-CSV D:\Scripts\MSCSRDMs\RDM-List-$VMs.csv
        $RDMs = Get-VM -Name $VMs | Get-HardDisk -DiskType "RawPhysical","RawVirtual" | Select-Object -Property ScsiCanonicalName

        # Get EsxCli for each host in the cluster
        ForEach ($hostName in $VMhosts) {
            $esxcli = Get-EsxCli -VMHost $hostName
            # And set each RDM disk found to PerenniallyReserved
            ForEach ($RDM in $RDMs) {
                # Set the configuration to perennially reserved (detached = $false, device = NAA ID, perennially reserved = $true).
                $esxcli.storage.core.device.setconfig($false, ($RDM.ScsiCanonicalName), $true)
            }
        }
    }
    # Remember the VM name so a VM with multiple shared controllers is not processed twice.
    $OldVM = $VMs
}

# Disconnect all the connection objects as we are finished.
Disconnect-VIServer * -Confirm:$false

– Hopefully this will help you too.
