EUC Layers: Display protocols and graphics – the stars look very different today

In my previous EUC Layer post I discussed the importance of putting insights on screens, in this post I want to discuss the EUC Layer of putting something on the screen of the end user.

Display Protocols

In short, a display protocol transfers the mouse, keyboard and screen (ever wondered about vSphere MKS error if that popped up) input and output from a (virtual) desktop to the physical client endpoint device and vice versa. Display protocols usually will optimize this transfer with encoding, compressing, deduplicating and performing other magical operations to minimize the amount of data transferred between the client endpoint device and the desktop. Minimize data equals less chance of interference equals better user experience at the client device. Yes, the one the end user is using.

For this blog post I will stick to the display protocols VMware Horizon has under its hood. VMware Horizon supports four ways of using a display protocol: PCoIP via the Horizon Client, Blast Extreme/BEAT via the Horizon Client, RDP via Horizon Client or MS Terminal Client, and any HTML5 compatible browser for HTML Blast connections.

The performance and experience of all the display protocols are influenced by the client endpoint device – everything in between – desktop agent and the road back to the client. : for example virtual desktop Horizon Agent. USB Redirected Mass storage device to your application, good-bye performance. Network filtering and poof black screen. Bad WiFi coverage and good-bye session when moving from office cubicle to meeting room.

poof-its-gone

RDP

Who? What? Skip this one when you are serious about display protocols. The only reason it is around in this list, is for troubleshooting when every other method fails. And yes the Horizon Agent default uses RDP as an installation dependency.

Blast Extreme

Just Beat it PCoIP. Not the official statement of VMware. VMware ensures it’s customers that Blast Extreme is not a replacement but an additional display protocol. But yeah…..sure…

With Horizon 7.1 VMware introduced BEAT to the Blast Extreme protocol. BEAT stands for Blast Extreme Adaptive Transport— UDP-based adaptive transport as part of the Blast Extreme protocol. BEAT is designed to ensure user experience stays crisp across quality varying network conditions. You know them, those with low bandwidth, high latency and high packet loss, jitter and so on. Great news for mobile and remote workers. And for spaghetti incident local networks……..

Blast uses standardized encoding schemes such as default H.264 for graphical encoding, and Opus as audio codec. If it can’t do H.264 it will fallback to JPG/PNG, so always use H.264 and check the conditions you have that might cause a fallback. JPG/PNG is more a codec for static agraphics or at least not something larger than an animated gif. H.264 the other way around is more a video codec but also very good in encoding static images, will compress them better than JPG/PNG. Plus 90% of the client devices are already equipped with a capability to decode H.264. Blast Extreme is network friendlier by using TCP by default, easier for configuration and performance under congestion and drops. It is effecient in not using up all the client resources, so that for example mobile device batteries are not drained because of the device using a lot of power feeding these resources.
Default protocol Blast Extreme selected.

PCoIP

PC-over-IP or PCoIP is a display protocol developed by Teradici. PcoIP is available in hardware, like Zero Clients, and in software. VMware and Amazon are licensed to use the PCoIP protocol in VMware Horizon and AWS Amazon Workspaces. For VMware Horizon PCoIP is an option with the Horizon Client or PCoIP optimized Zero Clients.
PCoIP is mainly a UDP based protocol, it does use TCP but only in the initial phase (TCP/UDP4172). PcoIP is rendered, multi-codec and can dynamically adapt itself based on available bandwidth. In low bandwidth environments it utilizes a lossy compression technique  where a highly compressed image is quickly delivered followed by additional data to refine that image. This process is termed “build to perceptually lossless”. The default protocol behaviour is to use lossless compression when there is minimal network congestion expected. Or explicitly disable as might be required for use cases where image quality is more important than bandwidth for example in medical imaging.
Images rendered on the server are captured as pixels, compressed and encoded and then sent to the client where decryption and decompression happens. Depending on the display, different codecs are used to encode the pixels sent since techniques to compress video images can be different in effectiveness compared to those more effective for text.

 

HTML

Blast Extreme without the Horizon client dependency. Client is a HTML5 compatible browser. HTML access needs to be installed and enabled on the datacenter side.
HTML uses the Blast Extreme display protocol with the JPG/PNG codec. HTML is not feature par with the Horizon Client that’s why I am putting it up as a separate display protocol option. As not all features can be used it not a best fit in must production environments, but it will be very sufficient for enough to use for remote or external use cases.

Protocol Selection

Depending how the pool is configured in Horizon, the end user has either the option to change the display protocol from the Horizon Client or the protocol is set on the pool with the setting that a user cannot change it’s protocol. The latter is has to be selected when using GPU, but it depends a bit on the work force and use case if you would like to leave all the options available to the user.

horizon-client-protocol

Display Protocol Optimizations

Unlike what some might think, display protocol optimization will benefit user experience in all situations. Either from an end user point of view or from IT having some control over what can and will be sent over the network. Network optimizations in the form of QoS for example. PCoIP and Blast Extreme can also be optimized via policy. You can add the policy items to your template, use Smart Policies and User Environment Management (highly recommended) to apply on specific conditions or use GPO’s. IMHO use UEM, and then template or GPO are the order to work from.

uem-smart-policy-example

For both protocols you can configure the image quality level and frame rate used during periods of network congestion. This works well for static screen content that does not need to be updated or in situations where only a portion of the display needs to be refreshed.

With regard to the amount of bandwidth a session eats up, you can configure the maximum bandwidth, in kilobits per second. Try to correspond these settings to the type of network connection, such an interconnect or a Internet connection, that are available in your environment.For example a higher FPS is fluent motion, but more used network bandwidth. Lower is less fluent but a less network bandwidth cost. Keep in mind that the network bandwidth includes all the imaging, audio, virtual channel, USB, and PCoIP or Blast control traffic.

You can also configure a lower limit for the bandwidth that is always reserved for the session. With this option set an user does not have to wait for bandwidth to become available.

For more information, see the “PCoIP General Settings” and the “VMware Blast Policy Settings” sections in Setting Up Desktop and Application Pools in View on documentation center (https://pubs.vmware.com/horizon-7-view/index.jsp#com.vmware.horizon-view.desktops.doc/GUID-34EA8D54-2E41-4B71-8B7D-F7A03613FB5A.html).

If you are changing these values, do it one setting at a time. Check what the result of your change is and if it fits your end users need. Yes, again use real users. Make a note of the setting and result, and move on to the next. Some values have to be redone to find the sweet spot that works best. Most values will be applied when disconnecting and reconnecting to the session where you are changing the values.

Another optimization can be done by optimizing the virtual desktops so less is transferred or resources can be dedicated to encoding and not for example defragmenting non persistent desktops during work. VMware OS Optimization Tool (OSOT) Fling to the rescue, get it here.

Monitoring of the display protocols is essential. Use vROPS for Horizon to get insights of your display protocol performance. Blast Extreme and PCoIP are included in vROPS. The only downside is that these session details are only available when the session is active. There is no history or trending for session information.

Graphic Acceleration

There are other options to help the display protocols on the server-side by offloading some of the graphics rendering and coding to specialized components. Software acceleration uses a lot of vCPU resources and just don’t cut it in playing 1080p full screen video’s. Not even 720p full screen for that matter. Higher clock speed of processor will help graphical applications a lot, but a the cost that those processor types have lower core count. Lower core count and a low overcommitment and physical to virtual ratio will lower the amount of desktops on your desktop hosts. Specialized engineering, medical or map layering software requires graphic capabilities that are not offered by software acceleration. Or require hardware acceleration as a de facto. Here we need offloading to specialized hardware for VDI and/or Published applications and desktops. Nvidia for example.

gpu-oprah-meme

What will those applications be using? How many frame buffers? Will the engineers be using these application mostly or just for a few moments and are afterwards doing work in office to write their reports. For this Nvidia supports all kinds of GPU profiles. Need more screens and framebuffers, choose a profile for this use case. A board can support multiple profiles if it has multiple GPU cores. But per core there only one type of profile can be used, multiple times if you not out of memory (buffers) yet. How to find the right profile for your work force? Assessment and PoC testing. GPU monitoring can be a little hard as not all monitoring application have the metrics up there.

And don’t forget that some applications need to be set to use hardware acceleration to be used by GPU or applications that don’t support or run worse on hardware acceleration because their main resource request is CPU (Apex maybe).

Engineers only? What about Office Workers?

Windows 10, Office 2016, browsers, and streaming video are used all over the offices. These applications can benefit from graphics acceleration. The number of applications that support and use hardware graphics acceleration has doubled over the past years. That’s why you see that the hardware vendors also changed their focus. NVidias’ M10 is targeted at consolidation while its brother M60 is targetted to performance, however reaching higher consolidation ratio’s then the older K generation. But cost a little bit more.

vGPU and one of the 0B/1B profiles and a vGPU for everyone. The Q’s can be saved for engineering. Set the profiles on the VM’s and for usage on the desktop pools.

And what can possibly go wrong?

Fast Provisioning – vGPU for instant clones

Yeah. Smashing graphics and depJloying those desktops like crazy… me likes! The first iteration of instant clones did not support any GPU hardware acceleration. With the latest Horizon release instant clones can be used for GPU. Awesomesauce.

– Enjoy looking at the stars!

Sources: vmware.com, wikipedia.org, teradici.com, nvidia.com

Digital Workspace Transformation: information security

Yes…. it has been a while since I posted on this blog, but I’m still alive ;-)

For a 2016 starter (what?!? is it June already), I want to ramble on about information security in the digital workspace. With a growing number of digital workspace transformations going on, information security is more important than ever. With the growing variety of client endpoints and methodes of access in the personal and corporate environments, users are becoming increasingly independent from the physical company locations. Making it interesting how to centrally manage storage of data, passwords, access policies, application settings and network access (just examples, not the complete list). For any place, any device, any information and any application environments for your users (or do we want any user in there), it is not just a couple of clicks of this super-duper secure solution and were done.

encrypt-image300x225
(image source blogs.vmware.com)

Storing data on for example Virtual Desktop servers (hello VMware Horizon!) in the data center is (hopefully) a bit more secure than storing it locally on the user’s endpoint. At the same time, allowing users to access virtual desktops remotely puts your network at a higher risk then local only. But it’s not all virtual desktops. We have mobile users who will like to have the presentations or the applications directly on the tablet or handheld. I for instance, don’t want to have to open a whole virtual desktop for just one application. You ever tried a virtual desktop on a iPhone, it is technical possible yes, but works crappy. Erm forgot my Macbook HDMI USB-C converter for this presentation, well I send it to your gmail or dropbox for access with the native mobile apps at your conference room. And the information is gone out of the company sphere…..(a hypothetical situation of course..)

Data Leak

Great ideas all those ways to be in and out of company information. But but but….. these also pose some challenges to which a lot of companies have not started thinking about. Sounds a bit foolish as it is probably the biggest asset of a company, information. But unfortunately it’s a fact (or maybe it could be just the companies I visit). Sure these companies have IT departments or IT vendors who think a bit about security. And in effect mostly make their users life’s miserable with all sort of technical barriers installed in the infrastructure. In which the users, business and IT (!) users, will find all sorts of ways to pass these installed barriers. Why? First of all to increase their productivity while effectively decreasing security, and secondly they are not informed about the important why. And then those barriers can be just a nuisance.

Break down the wall

IT’s Business

I have covered this earlier in my post (https://pascalswereld.nl/2015/03/31/design-for-failure-but-what-about-the-failure-in-designs-in-the-big-bad-world). The business needs to have full knowledge of their required processes and information flows, that support or process in and out information for the services supporting the business strategy. And the persons that are part of the business and operate the services. And what to do with this information in what different ways, is it allowed for certain users to access the information outside of the data center and such. Compliancy to for example certain local privacy laws. Governance with policies and choices, and risk management do we do this part or not, how do we mitigate some risk if we take approach y, and what are the consequences if we do (or don’t).

Commitment from the business and people in the business is of utmost importance for information security. Start explaining, start educating and start listening.
If scratch is the starting point, start the writing first on a global level. What does the business mean by working from everywhere everyplace, what is this digital workspace and such.  What are the risks, how do we approach IAM, what do we have for data loss protection (DLP), is it allowed for IT to inspect SSL traffic (decrypt, inspect and encrypt) etc. etc.
Not to detailed at first it is not necessary, as it can take a long time to have a version 1.0. We can work on it. And to be fair information security and digital workspace for a fact, is continue evolving and moving. A continual improvement of these processes must be in place. Be sure to check with legal if there are no loops in what has been written in the first iteration.
Then map to logical components (think from the information, why is it there, where does it come from and where does it go, and think for the apps, the users) and then when you have defined the logical components. IT can then add the physical components (insert the providers, vendors, building blocks). Evaluate together, what works, what doesn’t, what’s needed and what is not. And rave and repeat…..

Furthermore, a target for a 100% safe environment all the time will just not cut it. Mission Impossible. Think about and define how to react to information leaks and minimize the surface of a compromise.

Design Considerations

With the above we should have a good starting point for the business requirement phase of a design and deploy of the digital workspace. And there will also be information from IT flowing back to the business for continual improvement.

Within the design of an EUC environment we have several software components were we can take actions to increase (or decrease, but I will leave that part out ;-)) security in the layers of the digital workspace environment. And yes, when software defined is not a option there is always hardware…
And from the previous phase we have some idea what choices can be made in technical ways to conform to the business strategy and policies.

If we think of the VMware portfolio and the technical software layers were we need to think about security, we can go from AirWatch/Workspace ONE, Access Point, Identity Manager, Security Server, Horizon, AppVolumes to User Environment Management. And And….Two-Factor, One Time Password (OTP), Microsoft Security Compliance Manager (SCM) for Windows based components, anti-virus and anti-malware, networking segmentation and access policies with SDDC NSX for Horizon. And what about Business Continuity and disaster recovery plans, and SRM, vDP.
Enterprise Management with vROPS and Log Insight integration to for example SIEM. vRealize for automating and orchestrating to mitigate work arounds or faults in manual steps. And so on and so on. We have all sorts of layers where to implement or help with implementing security and access policies. And how will all these interact? A lot to think about. (It could be that a new blog post series subject is born…)

But the justification should start at the business… Start explaining and start acting! This is probably 80% of the success rate of implementing information security. And the technical components can be made fit, but… after the strategy, policies, information architecture are somewhat clear….

And the people in the business are supporting the need for information security in the workspace. (Am I repeating myself a bit ;-)

Ideas, suggestions, conversation, opinions. Love to hear them.

Design for failure – but what about the failure in designs in the big bad world?

This post is a random thought post, not quite technical but in my opinion very important. The idea formed after some subjects and discussions at last week’s NL VMUG. This blog post’s main goal is to create a discussion, so why don’t you post a comment with your opinion … Here it goes…

Murphy, hardware failures and engineers tripping over cables in the data center, us tech gals and guys all know and probably experienced them. Disaster happens everyday. But what about a state of the art application that ticks all the boxes for functional and technical requirements, but users are not able to use it, because of their lack of knowledge in this field, or because they are clueless why the business has created this thingy (why and how this application or data is supposed to help the information flow of business processes)? Failure is a constant and needs to be handled accordantly, and from all angles.

Techies are used to look at the environment from the bottom up. We design complete infrastructures with failure in our minds and have the technology and knowledge to perfectly execute disaster avoidance or disaster recovery (forget the theoretical RTO/RPO of 0’s here). We can do this at a lower cost (CAPEX) than ever before, and there are more benefits (OPEX and minimized downtime for business processes) than before. But subsequently, we should ask ourselves this: What about failing applications or data which is generated but not reaching the required business processes (the people that are operating or using these processes)?
Designs need to tackle this problem, using design based on the complete business view and connecting strategy, technical possibility and users!

And how will we do this then?

Well, first of all, the business needs to have full knowledge of their required processes and information flows, that support or process in and out data for these services supporting the business strategy. Very important. And to be honest, only a few companies have figured out this part. Most experience difficulties. And they give up. Commitment from the business and people in the business is of utmost importance. Be a strategic partner (to the management). Start with asking why certain choices are made and explain the why a little more often than just the how, what and when!

Describe why and how information and data is collected organized and distributed (in a fail safe and secure method) and what information systems are used. Describe the applications (and their ROI, services, processes and busses), how the information is presented and flows back in the business (via the people or automated systems). How does your solution let the business grow and flourish? Keep clear of too much technical detail – present your story in a way the manager understands the added value, and knows which team members (future users) to delegate to project meetings.

Next up IT, or ICT here in the Netherlands, Information and Communication Technology. I really like the Communication part for this post, businesses must do that a little more often. Start looking at the business from different points of view, and make sure you understand the functional parts and what is required to operate. To prevent people working on their own without a common goal or reason, internal communication is essential. Know the in and outs, describe why and how the desired result is achieved. Connect the different business layers. For this a great part of business IT departments needs to refocus it’s 1984 vision to the now and future. IT is not about infrastructure alone, it is a working part within the business, a facilitator, a placeholder (for lack of other words in my current vocabulary). IT needs to be about aligning the business services with applications and data, the tools and services that support and provides the business. That is why IT is there in the first place, not the business that is (connected or not) there for IT. IT’s business. Start listening, start writing first on a global level (what does the business mean by working from everywhere everyplace), then map possibilities to logical components (think from the information, why is it there, where does it come from and where does it go, and think for the apps, the users) and then when you have defined the logical components, you can add the physical components (insert the providers, vendors, hardware building blocks).

Sounds familiar? There are frameworks out there to use. Use your Google-Fu: Enterprise Architecture. Is this for enterprise size organizations only? No, any size company must know the why and why and why. And do something about it. And a simplified version will work for SMB size companies. Below is an example of a simplified model and what layers of attention this architectural framework brings to your organization.

Design for Failure

And…in addition to this, start using the following as a basis to include in your designs:

The best way to avoid failure is to fail constantly

Not my own, but from Netflix. This cannot be closer than the truth. No test or disaster recovery plan testing in iterations of half year or year. Do it constantly and see if your environment and business is up to the task to not influence any applications that will go down. Sure, there will be influences that for example the services running at 100% warp speed, but your users still able to do things with the services is better than nothing at all. And knowing that your service operates with a failure is the important part here. Now you can do something about not reaching the full speed, for example scale out to allow a service failure but not at a degraded service speed. Or know which of your services can actually go down without influencing business services for a certain time-frame. This is valuable feedback that will need to go back to the business. Is going down sufficient for the business, or should we try and handle this part so it does not go down at all. Just don’t use it at the infrastructure level only, include the data, application and information layers as well.
Big words here: trust and commitment. Trust the environment in place and test if it succeeds to provide the services needed even when hell freezes over (or when some other unexpected thing should happen). Trust that your environment can handle failure. Trust that the people can do something with or about the failures.
Commitment of the organization not to abandon when reaching a brick wall over and over, but to keep going until you are all satisfied. And trust that your people can fail also. Let them be familiar with the procedures and let a broader range of people handle the procedure (not just the current users names mapped to the processes, but within defined and mapped roles to services, multiple people can operate and analyze the information). Just like technical testing, your people are not operating 24x7x365, they like to go on leave and sometimes they tend to get ill.

Back to Netflix. For their failure generating Netflix uses Chaos Monkey. With that name an other Monkey comes to mind, Monkey Lives: http://www.folklore.org/StoryView.py?project=Macintosh&story=Monkey_Lives.txt. Not sure where the idea came from, but such a service and name cannot be a coincidence only (if you believe coincidence exists in the first place). But that is not what this paragraph is about.
The Chaos Monkey’s job is to automatically and randomly kill instances and services within the Netflix Infrastructure architecture. When working with Chaos Monkey you will quickly learn that everything happens for a reason. And you will have to do something about it. Pretty Awesome. And the engineers even shared Chaos Monkey on Github:
https://github.com/Netflix/SimianArmy/wiki/Chaos-MonkeyIt must not stop at the battle plan of randomly killing services; fill up the environment with random events where services will get into some not okay state (unlike a dead service) and see how the environment reacts to this.

 

Exchange DAG Rebuilding steps and a little DAG architecture

At a customers site I was called in to do a Exchange health check and some troubleshooting. As I have not previously added Exchange content to this blog, I thought on doing a note experience and new blog post in once.

Situation
The environment is a two site data center where site A is active/primary and site B is passive/secondary. Therefor a two node Exchange is deployed on Hyper-V. The node in site A is CAS/HT/MBX en the node in site B is CAS/HT/MBX. The mailbox role is DAG’ged, where active is site A and database copies are on site B. There is no CAS array (Microsoft best practice is to set it even if you have just one, but this wasn’t the case here). This is not ideal as a fail-over in CAS doesn’t allow clients to auto connect to another CAS, Exchange uses the CAS Array (with load balancer) for this. The CAS fail-over is manual (as is the HT). But when documented well and small amount of downtime is acceptable for the organisation, this is no big issue.
Site A has a Hyper-V cluster where the exchange node A is a guest hosted on this cluster. Site B has a unclustered Hyper-V host where Exchange node B is a guest. Exchange node A is marked high available. This again is not ideal, yes maybe for the CAS/HT role it can be used (should then be separated from the mailbox role), but for the mailbox role this is application layer clustered already (the DAG) so preferably off. Anyhow these are some of the pointers I could discuss with the organisation. But there is a problem at hand that needs to be solved.

The Issue at hand (and some Exchange architecture)
The issue is that the secondary node is in a failed state and currently not seeding it’s database copies. Furthermore the host is complaining about the witness share. You can check the DAG health with PowerShell Get-MailboxDatabaseCopyStatus and Test-ReplicationHealth.

You can check the DAG settings and members with Get-DatabaseAvailabilityGroup -Identity DAGNAME -Status | fl. Here you can see the setup file witness server.

RunspaceId : 0ffe8535-f78a-4cc1-85fd-ae27934a98e0
Name : DAGNAME
Servers : {Node A, Node B}
WitnessServer : Servername
WitnessDirectory : Directory name on WitnessServer
AlternateWitnessServer : Second Servername
AlternateWitnessDirectory : Directory name on Second Witness Server

(AlternateWitness server is only used with Datacenter Activition (DAC) Mode DAGOnly, here it is off and therefore not used and not needed)

Okay witness share, some Exchange DAG architecture first. Exchange DAG is a Exchange database mirror service build on Fail over cluster service (Microsoft calls it hidden cluster service). You can mirror the databases in a active/passive solution (one node is active to other is only hosting replica’s), or in an active/active solution (both nodes have active and passive databases). In both solutions that is high availability and room for maintenance (in theory that is). The mirror service is done by replicating the databases as database copies between members of the dag. The DAG uses Fail over clustering services where the DAG members participate as cluster nodes. A cluster uses a quorum to tell the cluster which server(s) should be active at any given time (a majority of votes). In case of a failure in heartbeating networking there is a possibility of split brain, that both nodes are active and try to bring up the cluster resources as they are designed to do. Both nodes can serve active databases with the possibility of data mismatch, corruption or other failures. In this case a quorum is used to find out which node has more votes to be active. A shared disk is often used for the cluster quorum. An other option is to use a file share on a server outside the cluster, the so called file witness quorum or file witness share in Exchange.

image

The above model shows the CAS and DAG HA components. With Exchange architecture best practice the File Witness share is to be placed on the HT role, but in the case of mixed roles you should select a server outside the DAG and in this case outside the Exchange organisation. Any file server can be used, preferably a server in the same datacenter as the primary site serving users (important).

So back to the issue. File witness share (FWS) access. I checked if I could see the file share (\servernameDAGFQDN) from the server and checked permissions (Exchange Trusted Subsystem and the DAG$ computer object should full control). The Exchange trusted subsystem must be a Adminstrators local group member. The FWS is placed on a domain controller in this organization. Not ideal again (Exchange server now need domain level administrators group membership as domain controllers don’t have local groups), but working.

I checked the failover service and there the node is in a down state, including it’s networks. But in the Windows guest networks are up and traffic is flowing from and to the both nodes and the FWS. No firewall on or between the nodes, no natting. Okay……Some other items (well a lot) where checked as the where several actions done in the environment. Also checked Hyper-V settings and networking. Nothing blocking found (again some pointers for future actions).

Well, try to remove and add the failed state node to the DAG. This should have no impact on the organization and the state is already failed.

Removing a node from the DAG.

Steps to follow:
1. Depending on the state, suspend database seeding. When failed, suspend via Suspend-MailboxDatabaseCopy -Identity <mailbox database><nodename>. When status is failed and suspended this is not needed.
2. Remove Database copies of mailbox databases on the failed node. Use  Remove-MailboxDatabaseCopy -Identity <mailbox database><nodename>. Repeat when needed for the other copies.
3. Remove Server from DAG.  Remove-DatabaseAvailabilityGroupServer -Identity <DAGName> -MailboxServer <ServerName> -ConfigurationOnly
4. Evict from cluster.
As the cluster is now only one node, the quorum is moved to node majority automatically. The FWS object is removed from the config.

Rebuilding the DAG by adding the removed node back

Steps to follow:

1. Add server to DAG. This will add the node back to the cluster.  Add-DatabaseAvailabilityGroupServer -Identity <DAGName> -MailboxServer <ServerName>. Succes the node is healthy.
2. Add the database copies as preference 2 (the other node is still active). Add-MailboxDatabaseCopy –Identity <Mailbox Database> -MailboxServer <ServerName> -ActivationPreference 2.
3. In my case to time between fail state and returning to the DAG was a bit long. The database came up, but returned to failed state. We have to suspend and manually seed. Suspend-MailboxDatabaseCopy -Identity <mailbox database><nodename>.
4. Update-MailboxDatabaseCopy -Identity “<Mailbox Database><Mailbox Server>” -DeleteExistingFiles. Wait for the bytes are transferred across the line. When finished the suspended state is automaticaly lifted.
Repeat for the other databases.
5. You will now see a good state of the DAG and databases in Exchange Management console. Not yet. The file witness share is not yet back.
6. Add the Witness share from Exchange powershell. Set-DatabaseAvailabilityGroup -WitnessDirectory “<Server Directory” -WitnessServer “<Servername>” -id “<DAG Name>”. When the DAG members are minimal two the FWS is recreated. This is also visible in Failover Cluster.

Root Cause Analyse

Okay this didn’t went so smooth as described above. When trying to add the cluster node back to the cluster this fails with the FWS error again. In cluster node command output it is noticed that on Node A Node B is down. And on Node B Node A is down and Node B is Joining. Hey wait there is a split and the Joining indicates that Node B is trying to bring up it’s own cluster. Good that it is failing. When removing the node from the DAG Kaspersky Virus protection is loosing connection as this is configured to the DAG databases. At the same time Node A has the same errors and something new, RPC Server errors with Kaspersky. Ahhhh Node A networking services not correctly working is the culprit here. The organisations admins could not tell if networking updates and changes had a maintenance restart/reboot. So there probably something is still in memory. So inform the users, check the backup and reboot the active node. The node came up good and low and behold node B could be added to the fail over cluster. At this time I could rebuild the DAG. Health checks are okay, and documented.

– Hope this helps when you have similar actions to perform.

 

Dissecting vSphere – Data protection

An important part of a business continuity and disaster recovery plans are the ways to protect your organisation data. A way to do this is to have a back-up and recovery solution in place. This solution should be able to get your organization back in to production with the set RPO/RTO’s. The solution needs to be able to test your back-ups, preferable in a sandboxed testing environment. I have seen situations at organisations where backup software was reporting green lights on the backup operation, but when a crisis came up they couldn’t get the data out and thus failing recovery. Panicking people all over the place….

Back-up and recovery solution can be (a mix of) commercial products to protect the virtual environment like Veeam or from within guest with agents like Veritas or DPM or from features of the OS (return to previous version with snapshots). Other ways included solutions on the storage infrastructuur. But what if your budget constrained….

Well VMware has the vSphere Data Protection that is included from the Essentials Enterpise Plus kit. This is the standard edition. The vSphere Data Protection Advanced edition is available from the enterprise license.
So there are two flavours, what is standard giving and lacking from advanced?
First the what; like previous stated VDP is the backup and recovery solution from VMware. It is a appliance that is fully integrated with vCenter. It’s easy to be deployed. It performs full virtual machine and File-LevelRestore (FLR) without installing an agent in every virtual machine.It uses data deduplication for all backup jobs, reducing disk space consumption.

image

VDP standard is capped with a 2TB backup data store, where VDP advanced allows dynamic capacity growth. This allows a growth of capacity to 4TB, 6TB or 8TB backup stores. VDP advance also provides agents for specific applications. Agents for SQL Server and Exchange agents can be installed in the VM guest os. These agents provides selecting individual databases or stores for backup or restore actions, application quiescing and advanced options like truncating transaction logs.

image

At VMworld 2013 further capabilities of VDP 5.5 are introduced:

– Replication of backup data to EMC.
– Direct-to-Host Emergency Restore. (without the need for vCenter, so perfect for backing up your vCenter)
– Backup and restore of individual VDMK files.
– Specific schedules for multiple jobs.
– VDP storage management improvements. Selecting separate backup data stores.

Sizing and configuration

The appliance is configured with 4vCPU’s and 4GB RAM. For the available backup stores storage capacity 500GB, 1TB or 2TB they will consume respectivily 850GB, 1,3 TB and 3,1TB of actual storage. There is a 100 VM limit, so after that you would need another VDP appliance (maximum of 10 VDP appliances per vCenter).

After the appliance deployment the appliance need to be configured at the VDP web service. The first time it is in installation mode. Items such as IP, hostname, DNS (if you haven’t added these with the OVF deployment), time and vCenter need to be configured. After completion (and sucessful testing) the appliance needs to be rebooted. A heads up, the initial configuration reboot can take up to 30 minutes to complete so have your coffee machine nearby.

After this you can use the webclient connected to your VDP connected vCenter to create jobs. Let the created jobs run controlled for the first time; the first backup of a virtual machine takes time as all of the data for that virtual machine is being backed up. Subsequent backups of the same virtual machine take less time, here changed block tracking (CBT) and dedup is preformed.

Performance

Well this depends on the kind of storage you are going to use as the backup data store. If you going for low cost storage (let say most of the SMB would want that), your paying in performance (or lacking it most of the time).

Storage Offsite

Most organizations want their backup data stored offsite in some way. vDP does not offer replication (or with VDP5.5 to only EMC), so you want to have some offsite replication or synchronization in place (and a how are you able to restore from this data if your VDP is lost also). vSphere Replication only protects VM’s and not your backup data store. Most SMB’s don’t have a lot of storage able replication devices in place, and when they do, there using it for production and not use that as a backup datastore. Keep this in mind when researching this product for your environment.

– Enjoy data protecting!

Learned Lessons – Nexus 1000V and the manual VEM installation

At an implementation project we implemented 24 ESXi hosts and used the Nexus 1000V to have a consistent network configuration, feature set and provisioning throughout the physical and the virtual infrastructure. When I tried to add the last host to the Nexus it failed on me with the InstallerApp (Cisco provided java installer that adds the VEM and adds the ESXi host to the configured DVS and groups). Another option is to use Update Manager (the Nexus is a appliance that can be updated by update manager), but that one threw an error code at me. I will describe the symptoms a little bit later, first some quick Nexus product architecture so you will have a bit understanding how the components work, where they are and how they interact.

Nexus 1000V Switch Product Architecture

Cisco Nexus 1000V Series Switches have two major components: the Virtual Ethernet Module (VEM), which runs inside the hypervisor (or with other words on the ESXi host), and the external Virtual Supervisor Module (VSM), which manages the VEMs (see figure below). The VSM can be either a virtual appliance or an appliance in the hardware device (for example in the physical Nexus switch).
The Nexus 1000V replaces the VMware virtual switches and adds a Nexus Distributed Switch (Enterprise Plus). Uplink and portgroup definitions are bound to the Cisco ethernet profile configurations. The VEM and VSM use a control and data link to exchange configuration items.

Configuration is performed through the VSM and is automatically propagated to the VEMs. Virtualization admins can pick up these configuration to select the portgroups at VM provisioning.

image

Symptoms to a failing installation 

The problem occurred as follows:

– Tried to install the VEM with the InstallerApp. The installer app finds the host and when the deployment is done, it stops when adding the hosts to the existing Nexus DVS. This happens somewhere from moving existing vSwitches to the DVS. Error presented is: got (vim.fault.PlatformConfigFault) exception.

– Checked the status of the host in update manager and this showed a green compliant Cisco Nexus Appliance. This probably delayed me a bit, because it really wasn’t.

– Tried to manually add a host to the Nexus DVS in the vSphere Webclient. This gave an error in the task. Further investigation let me to the line in the vmkernel.log: invalid Net_Create: class cisco_nexus_1000v not supported. Say what?

– With the Cisco support site and a network engineer I tried some VEM cli on the host. But hey wait vem isn’t there. An esxcli software vib list | grep cisco doesn’t show anything either (duh). While on an VEM installed ESXi host this shows the installed VEM software version. So Update Manager is screwing with me.

Manual Installation

That leaves me with trying to manually install the VEM. With the Cisco Nexus 1000 installation guides the working sollution is as follows:

  • The preferred Update Managers does not work. It fails with an error 99.
  • copy the vib file containing the VEM software from the VSM homepage using the following url: http://Your_VSM_IP_Address/. Check an ESXi host that is installed for the running version (esxcli software vib list | grep cisco). Download this file (save as).
  • Upload this file to a location where the host can access this. On the host or a datastore accessible from the host. I did the latter as the host did not have direct storage. Used WinSCP to transfer the files to a datastore directory ManualCiscoNexus.
  • On the hosts I added this vib by issuing the following command:

esxcli software vib install -v /vmfs/volumes/<datastore>/ManualCiscoNexus/Cisco_bootbank_cisco-vem-v160-esx_4.2.1.2.2.1.0-3.1.1.vib

  • vem status -v now gives output. Look for VEM Agent is running in the output of the vem status command.
  • vemcmd show port vlans only shows the standard switches. Communication with the VSM is not yet there.
  • I added the host manually to the Nexus DVS and success. When migrating the standard vmkernel management port to the DVS groups the hosts is also visible on the VSM. Communication is flowing and the host is part of the Nexus 1000v.

I hope this post will help when you experience the same problem, and also learns you a little about the Nexus 1000V Product Architecture.