I am currently working on a project where we are optimizing a virtual infrastructure that consists of vSphere and XenServer hypervisors. In the project we want to measure and confirm some of the performance-related counters. We have several standard tools at the infrastructure components to see what the environment is capable of and to check whether there are bottlenecks in IO flow and processing.
With any analysis it is important to plan (or know) what to measure at which layer, so that the measurement is repeatable when you want to check what certain changes do to your environment. This check can be done with some of the available tools, such as VMware View Planner, which I wrote about in an earlier blog post (see http://pascalswereld.nl/post/66369941380/vmware-view-planner), or by repeating your own plan (which can then be automated/orchestrated). Your measuring tools need to have similar counters/metrics throughout the chain, or at least show what you are putting in or requesting at the start and what arrives at the end (but if there is an offset you get little grey spots in the chain).
A correctly working time service (NTP) is necessary not only for things like clustering and logging, but also for monitoring, so that you get the right values at the right intervals. A clock that is slightly off will in some cases give you negative values, or values that are simply off, at some components.
Some basics about measuring
You have to know what the metrics are at each measuring point. Some are integers, some are floating point, some are averages over a period or amounts used, and some need a calculation to convert them to a human-readable or comparable metric (Kb at one level and bytes at the other; some of these are not that easy). A value that looks high at first sight, but consists of several components and is an average over a certain period, could be normal when divided by the number of worlds.
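As a minimal sketch of the two conversions mentioned above, the Python below brings readings from two layers to the same unit and splits an aggregated value over its worlds. The function names, the sample numbers and the choice of binary multiples (1 KB = 1024 bytes) are assumptions for illustration only; check the documentation of each counter for its real unit.

```python
def to_bytes(value, unit):
    """Convert a counter value to bytes.

    Assumption: binary multiples (1 KB = 1024 bytes). Some tools use
    decimal multiples or bits instead, so verify this per counter.
    """
    factors = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}
    return value * factors[unit]


def per_world(total, worlds):
    """Split an aggregated value evenly over the worlds that contributed to it."""
    if worlds <= 0:
        raise ValueError("worlds must be positive")
    return total / worlds


# Hypothetical readings: one layer reports KB/s, the other bytes/s.
host_layer = to_bytes(2048, "KB")
guest_layer = to_bytes(2097152, "B")
print(host_layer == guest_layer)   # True: both are 2,097,152 bytes/s

# A summed value of 360 spread over 8 worlds is 45 per world.
print(per_world(360, 8))           # 45.0
```

Comparing values only after they are in the same unit and on the same scale avoids chasing a "high" number that is just a sum or a different unit.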
Next, know or decide on your measuring period and data collection intervals. If you measure every second you will probably get a lot of information and be a busy man (or woman) trying to analyze all that data. Measuring in December gives a less representative workload than measuring in a company’s peak February period (and for Santa it is the other way around ;-)). And measure the complete process cycle: try to cover a period of four weeks or a month to get the month opening and closing processes in there (depending on the workload of course).
Most important is that you know what your workloads are, what their IO needs are and what your facilitating networking and storage components are capable of. If you don’t know what the VD image for a certain group of users is built from and what is required for it, how will you know whether a VD from this group requesting 45 IOPS is good or bad? On the other hand, if you put all your management, infrastructure and VDs on the same storage, how are you going to separate the specific workload from the cumulative counters?
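To get a feeling for how the interval drives the amount of data, here is a small sketch that counts the samples one counter produces over a four-week period. The intervals of 1, 20 and 300 seconds are just illustrative picks, and `sample_count` is a made-up helper name.

```python
def sample_count(period_days, interval_seconds):
    """Number of samples one counter yields over the period at a fixed interval."""
    return period_days * 24 * 3600 // interval_seconds


# Four weeks of measuring at three illustrative intervals.
for interval in (1, 20, 300):
    print(interval, sample_count(28, interval))
# 1 2419200
# 20 120960
# 300 8064
```

Multiply that by the number of counters and objects you collect, and it becomes clear why the interval is worth deciding up front.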
Hey, you said something about vSphere in the title; let’s see what is available as standard at the vSphere level.
– VM monitoring. In-guest Windows Perfmon counters or Linux guest statistics. The latter depends heavily on what you put in your distribution, but think of top, htop, atop, vmstat, mpstat et al.
Windows Perfmon counters are supplemented with some VM insights by VMware Tools. There are a lot of counters available, so know what you want to measure. Use Data Collector Sets to group them and keep them as reference/repeatable sets (including scheduling of the data collection).
– Host level; esxtop or vscsiStats. Esxtop is a great tool for performance analysis of all types. Duncan Epping has an excellent post about esxtop metrics and usage (http://www.yellow-bricks.com/esxtop/). Esxtop can be used in interactive or batch mode. With batch mode you can load your output file in Windows Perfmon or in esxplot (http://labs.vmware.com/flings/esxplot). Use VisualESXtop (http://labs.vmware.com/flings/visualesxtop) for enhancements to the esxtop command line and a nice GUI. On the vMA you can use resxtop to get the esxtop stats remotely. vscsiStats is used when you want SCSI collections or storage information that esxtop is not capable of showing. And of course PowerCLI can be an enormous help here.
– vCenter level; statistics collection, which depends on your statistics level. Graphs can be shown for several components in the vSphere Web Client, and the counters can be read via the vSphere API or, again, extracted with PowerCLI. For an overview of the metrics at each level, please read this document: http://pubs.vmware.com/vsphere-55/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-55-monitoring-performance-guide.pdf or check the documentation center for your version.
– vCenter™ Operations Management Suite (vCOPS). Well, standard: you still have the option not to include Operations in your environment. But then you are missing out on some of the automated (interactive/proactive) performance monitoring, reporting and insight into your environment. Root cause analysis is part of the suite, so it does not come down to your own understanding and analytic skills alone. If you are working at the previous levels, your life could be simpler with the vCOPS suite.
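For the host-level batch mode mentioned in the list above, a capture made with something like `esxtop -b -d 5 -n 60 > capture.csv` is a Perfmon-style CSV: a timestamp column followed by one column per counter. As a rough sketch, the Python below averages every column whose header contains a given text. The function name `column_averages`, the host name and the counter name in the sample are made up for illustration; real esxtop output has far more columns and different names.

```python
import csv
import io
from statistics import mean


def column_averages(csv_text, needle):
    """Average every column whose header contains `needle`."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if needle in name]
    samples = {header[i]: [] for i in wanted}
    for row in reader:
        for i in wanted:
            # Skip short rows and empty cells instead of failing on them.
            if i < len(row) and row[i] != "":
                samples[header[i]].append(float(row[i]))
    return {name: mean(values) for name, values in samples.items() if values}


# Tiny hand-made sample in the same shape; names are hypothetical.
SAMPLE = (
    '"(PDH-CSV 4.0) (UTC)(0)","\\\\host01\\Disk(example)\\Commands/sec"\n'
    '"01/01/2014 10:00:00","100.0"\n'
    '"01/01/2014 10:00:05","140.0"\n'
)

for name, average in column_averages(SAMPLE, "Commands/sec").items():
    print(name, average)
```

For a real capture you would read the file contents into `csv_text` first. Keeping the filter as plain text matching makes it easy to rerun the same selection after each change to the environment.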
These standard tools need to be supplemented with specific application, networking (hops and other components passed) and storage counters (what are the storage processors up to, is latency building up in the device itself).
– Happy measuring!