Design for failure – but what about the failure in designs in the big bad world?

This post is a random thought post, not quite technical but in my opinion very important. The idea formed after some subjects and discussions at last week’s NL VMUG. This blog post’s main goal is to create a discussion, so why don’t you post a comment with your opinion … Here it goes…

Murphy, hardware failures and engineers tripping over cables in the data center: us tech gals and guys all know them and have probably experienced them. Disaster happens every day. But what about a state of the art application that ticks all the boxes for functional and technical requirements, but that users are not able to use, because they lack knowledge in this field, or because they have no clue why the business has created this thingy (why and how this application or data is supposed to help the information flow of business processes)? Failure is a constant and needs to be handled accordingly, and from all angles.

Techies are used to looking at the environment from the bottom up. We design complete infrastructures with failure in mind and have the technology and knowledge to perfectly execute disaster avoidance or disaster recovery (forget the theoretical RTO/RPO of 0’s here). We can do this at a lower cost (CAPEX) than ever before, and there are more benefits (OPEX and minimized downtime for business processes) than before. But subsequently, we should ask ourselves this: what about failing applications, or data that is generated but does not reach the required business processes (the people that are operating or using these processes)?
Designs need to tackle this problem, using design based on the complete business view and connecting strategy, technical possibility and users!

And how will we do this then?

Well, first of all, the business needs to have full knowledge of its required processes and information flows, the ones that support, or process the incoming and outgoing data for, the services supporting the business strategy. Very important. And to be honest, only a few companies have figured out this part. Most experience difficulties. And they give up. Commitment from the business and the people in the business is of utmost importance. Be a strategic partner (to the management). Start with asking why certain choices are made, and explain the why a little more often than just the how, what and when!

Describe why and how information and data are collected, organized and distributed (in a fail-safe and secure way) and what information systems are used. Describe the applications (and their ROI, services, processes and buses), how the information is presented and how it flows back into the business (via the people or automated systems). How does your solution let the business grow and flourish? Keep clear of too much technical detail – present your story in a way that lets the manager understand the added value and know which team members (future users) to delegate to project meetings.

Next up IT, or ICT here in the Netherlands: Information and Communication Technology. I really like the Communication part for this post; businesses must do that a little more often. Start looking at the business from different points of view, and make sure you understand the functional parts and what is required to operate. To prevent people working on their own without a common goal or reason, internal communication is essential. Know the ins and outs, and describe why and how the desired result is achieved. Connect the different business layers. For this, a great part of business IT departments needs to refocus its 1984 vision to the now and the future. IT is not about infrastructure alone, it is a working part within the business, a facilitator, a placeholder (for lack of other words in my current vocabulary). IT needs to be about aligning the business services with applications and data, the tools and services that support and provide for the business. That is why IT is there in the first place; the business is not there (connected or not) for IT. IT’s business. Start listening, and start writing first on a global level (what does the business mean by working from everywhere and every place), then map possibilities to logical components (think from the information: why is it there, where does it come from and where does it go, and think about the apps and the users), and then, when you have defined the logical components, you can add the physical components (insert the providers, vendors, hardware building blocks).

Sounds familiar? There are frameworks out there to use. Use your Google-Fu: Enterprise Architecture. Is this for enterprise size organizations only? No, any size company must know the why and why and why. And do something about it. And a simplified version will work for SMB size companies. Below is an example of a simplified model and what layers of attention this architectural framework brings to your organization.

Design for Failure

And…in addition to this, start using the following as a basis to include in your designs:

The best way to avoid failure is to fail constantly

Not my own, but from Netflix. This cannot be closer to the truth. No disaster recovery plan testing in iterations of half a year or a year. Do it constantly and see if your environment and business are up to the task of not being influenced by any applications that go down. Sure, there will be influences, for example services not running at 100% warp speed, but users still being able to do things with the services is better than nothing at all. And knowing that your service operates with a failure is the important part here. Now you can do something about not reaching full speed, for example scale out so that a service failure is allowed without degrading service speed. Or know which of your services can actually go down without influencing business services for a certain time frame. This is valuable feedback that needs to go back to the business. Is going down acceptable for the business, or should we try and handle this part so it does not go down at all? Just don’t use it at the infrastructure level only; include the data, application and information layers as well.
Big words here: trust and commitment. Trust the environment in place and test if it succeeds in providing the services needed even when hell freezes over (or when some other unexpected thing happens). Trust that your environment can handle failure. Trust that the people can do something with or about the failures.
Commitment of the organization not to give up when hitting a brick wall over and over, but to keep going until you are all satisfied. And trust that your people can fail too. Let them be familiar with the procedures and let a broader range of people handle each procedure (not just the current user names mapped to the processes; within defined roles mapped to services, multiple people can operate and analyze the information). Treat this just like technical testing: your people are not operating 24x7x365, they like to go on leave and sometimes they get ill.

Back to Netflix. For its failure generation Netflix uses Chaos Monkey. With that name another Monkey comes to mind, Monkey Lives: http://www.folklore.org/StoryView.py?project=Macintosh&story=Monkey_Lives.txt. Not sure where the idea came from, but such a service and name cannot be a coincidence only (if you believe coincidence exists in the first place). But that is not what this paragraph is about.
The Chaos Monkey’s job is to automatically and randomly kill instances and services within the Netflix infrastructure architecture. When working with Chaos Monkey you will quickly learn that everything happens for a reason. And you will have to do something about it. Pretty awesome. And the engineers even shared Chaos Monkey on GitHub:
https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey
It must not stop at the battle plan of randomly killing services; fill up the environment with random events where services get into some not-okay state (as opposed to a dead service) and see how the environment reacts to this.
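
To make the idea a bit more concrete, here is a minimal sketch in Python of random failure injection. This is not how Chaos Monkey itself is implemented; the service names, the three states and the helpers `inject_failure` and `business_still_served` are all made up for illustration. It only shows the principle: push a random service into a dead or degraded state, then check whether the services the business depends on are still available.

```python
import random

# Hypothetical service states for this sketch; not taken from Chaos Monkey.
HEALTHY, DEGRADED, DEAD = "healthy", "degraded", "dead"


def inject_failure(services, rng):
    """Pick one random service and push it into a random bad state."""
    victim = rng.choice(sorted(services))
    services[victim] = rng.choice([DEGRADED, DEAD])
    return victim


def business_still_served(services, required):
    """The business is served as long as no required service is dead."""
    return all(services[name] != DEAD for name in required)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a test run is repeatable
    services = {"web": HEALTHY, "orders": HEALTHY, "reporting": HEALTHY}
    victim = inject_failure(services, rng)
    print(f"{victim} is now {services[victim]}")
    print("business served:", business_still_served(services, ["web", "orders"]))
```

In a real environment the check would of course be a health probe against the actual service, and the outcome is exactly the feedback that has to go back to the business.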

 

VMware Utility Belt must have tools – RVTools 3.7 released

In March 2015, RVTools version 3.7 was released.

This, in my opinion, is the tool each VMware consultant must have in their VMware utility belt, together with the other standard tools presented. At this time RVTools is still free, so budget is no constraint on using this tool. More importantly, it is lightweight, very simple to use, and shows much wanted information in an ordered overview or lets you export the information to Excel format to analyse it offline.

Before using this tool, it is important to understand that it makes a point in time snapshot of the infrastructure configuration items in place: in short, what is configured and what the current operational state is. No more, no less. The information can then be used in, for example, operational health checks or as the as-is starting point in the analysis/inventory phase of projects (consolidation or refresh projects). See more use cases further below, and I am sure there are more examples out there.

There is no trending or what-if analysis, for example; that is something you will have to do yourself or with other solutions/tools available for the software defined data center. VMware has some other excellent tools for SDDC management and insights into your virtual environment (for example vRealize Operations and Infrastructure Navigator). But that is a completely different story.

What is RVTools?

RVTools is a Windows .NET application which uses the VI SDK (which is updated to 5.5 in this release) to display information about your VMware infrastructure.
An inventory connection can be made to vCenter or a single host, to get as-is information about hosts, VMs, VM Tools, datastores, clusters, networking, CPU, health and more. This information is displayed in a tabpage view. Each tab represents a specific type of information, for example hosts or datastores.

RVTools can currently interact with Virtual Center 2.5, ESX Server 3.5, ESX Server 3i, Virtual Center 4.x, ESX(i) Server 4.x, Virtual Center 5.0, Virtual Center Appliance, ESXi Server 5.0, Virtual Center 5.1, ESXi Server 5.1, Virtual Center 5.5, ESXi Server 5.5 (no official 6.0 in this version).

RVTools can export the inventory to Excel and CSV for further analysis. The same tabs as in the GUI will be visible in Excel.

There is also a command line option to (for example) schedule an inventory and have the results sent via e-mail to an administrative address.

Use Cases?

– On site Assessment / Analysis; Get a simple and fast overview of a VMware infrastructure. The presented information is easy to browse through, where in the vSphere Web Client you would go clicking through screens. When there is something interesting in the presented data you can go deeper with the standard vSphere and ESXi tools. Perfect for fast analysis and health checks.

– Off site Assessment / Analysis; Get the information and save the Excel or CSV dump to get a fast overview and dump for later analysis. You will have the complete dump (a point in time reference that is) which you can easily browse through when writing up an analysis/health check report.

– Documentation; The dumped information can be used online or offline to write up documentation. Excel tabs are easily copied into the documentation.

– (Administrator) reporting; Via the command line tool, get a daily overview of your VMware infrastructure. Compare your status of today with the point in time overview of the day before or last week (depending on your schedule and/or retention). Use this information in the daily tasks of adding/changing documentation, analysis, reporting and such.
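
The day-to-day comparison from the reporting use case can be scripted once you have two CSV exports. Below is a minimal sketch in Python, not something that ships with RVTools. The column names `VM` and `Powerstate` and the sample rows are assumptions for illustration; check the header row of your own export and use the column that uniquely identifies a row.

```python
import csv
import io


def load_rows(handle, key):
    """Index the rows of a CSV export by the value in the key column."""
    return {row[key]: row for row in csv.DictReader(handle)}


def diff_exports(old, new):
    """Return (added, removed, changed) keys between two indexed exports."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return added, removed, changed


if __name__ == "__main__":
    # In-memory stand-ins for yesterday's and today's export files.
    yesterday = io.StringIO("VM,Powerstate\napp01,poweredOn\ndb01,poweredOn\n")
    today = io.StringIO("VM,Powerstate\napp01,poweredOff\nweb01,poweredOn\n")
    old = load_rows(yesterday, "VM")
    new = load_rows(today, "VM")
    print(diff_exports(old, new))
```

With real exports you would replace the `io.StringIO` objects with `open(path, newline="")` on the two files you want to compare.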

Release 3.7 Notes

For version 3.7 the following has been added:

  • VI SDK reference changed from 5.0 to 5.5
  • Extended the timeout value from 10 to 20 minutes for really big environments
  • New field VM Folder on vCPU, vMemory, vDisk, vPartition, vNetwork, vFloppy, vCD, vSnapshot and vTools tabpages
  • On vDisk tabpage new Storage IO Allocation Information
  • On vHost tabpage new fields: service tag (serial #) and OEM specific string
  • On vNic tabpage new field: Name of (distributed) virtual switch
  • On vMultipath tabpage added multipath info for path 5, 6, 7 and 8
  • On vHealth tabpage new health check: Multipath operational state
  • On vHealth tabpage new health check: Virtual machine consolidation needed check
  • On vInfo tabpage new fields: boot options, firmware and Scheduled Hardware Upgrade Info
  • On statusbar last refresh date time stamp
  • On vHealth tabpage: Search datastore errors are now visible as health messages
  • You can now export the csv files separately from the command line interface (just like the xls export)
  • You can now set an auto refresh data interval in the preferences dialog box
  • All datetime columns are now formatted as yyyy/mm/dd hh:mm:ss
  • The export dir / filenames now have a formatted datetime stamp yyyy-mm-dd_hh:mm:ss
  • Bug fix: on dvPort tabpage not all networks are displayed
  • Overall improved debug information
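
The uniform datetime format from the list above makes exported columns easy to process in a script. A small Python sketch, assuming a cell value in the yyyy/mm/dd hh:mm:ss format; the helper names and the idea of calculating an age in days (for a snapshot, for example) are my own illustration and not part of RVTools.

```python
from datetime import datetime

# strptime pattern for the yyyy/mm/dd hh:mm:ss column format listed above.
COLUMN_FORMAT = "%Y/%m/%d %H:%M:%S"


def parse_column(value):
    """Parse one exported datetime cell into a datetime object."""
    return datetime.strptime(value, COLUMN_FORMAT)


def age_in_days(value, now):
    """Whole days between an exported timestamp and a reference time."""
    return (now - parse_column(value)).days


if __name__ == "__main__":
    # Made-up sample value in the documented format.
    print(age_in_days("2015/03/01 08:30:00", datetime(2015, 3, 15, 12, 0, 0)))
```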

Who?

RVTools is written by Rob de Veij aka Robware. You can find Rob on Twitter (@rvtools) and via his website http://robware.net.
Big thanks to Rob for unleashing yet another version of this great tool!

As the tool is currently free, please donate if you find the application useful, to help and support Rob in further developing and maintaining RVTools.

Let’s get ready to cast your vote: vBlog 2015

Like in the years before, Eric Siebert of vSphere-Land.com is opening the annual vBlog voting for 2015 (http://vsphere-land.com/news/voting-now-open-for-the-2015-top-vmware-virtualization-blogs.html). This year Infinio is the sponsor and the top 50 will receive a special custom commemorative coin. All the blogs that are listed on vLaunchpad are on the ballot for the general voting. The top vBlog voting contest helps rank the most popular vBlogs based on the votes of the community (you), and the outcome determines the ranking that is announced on the 19-03 live show (and published on the vLaunchpad website).

Pascalswereld.nl is included on the voting ballot, but please keep in mind there are a lot of better blogs out there. As Eric states: keep in mind quality, frequency, longevity and length of the blogs out there when voting.
And of course your personal preferences ;-)

Ready to participate?

You can place your vote at: http://www.surveygizmo.com/s3/2032977/TopvBlog2015.

Good luck to all the great bloggers out there!

Sources: http://vsphere-land.com, http://info.infinio.com/topvblog2015