Digital Workspace Transformation: information security

Yes…. it has been a while since I posted on this blog, but I’m still alive ;-)

For a 2016 starter (what?!? is it June already), I want to ramble on about information security in the digital workspace. With a growing number of digital workspace transformations going on, information security is more important than ever. With the growing variety of client endpoints and methods of access in personal and corporate environments, users are becoming increasingly independent from physical company locations. That makes it interesting to centrally manage storage of data, passwords, access policies, application settings and network access (just examples, not the complete list). For any-place, any-device, any-information and any-application environments for your users (or do we want any user in there?), it is not just a couple of clicks of some super-duper secure solution and we're done.

(image: encryption illustration, source blogs.vmware.com)

Storing data on, for example, virtual desktop servers (hello VMware Horizon!) in the data center is (hopefully) a bit more secure than storing it locally on the user's endpoint. At the same time, allowing users to access virtual desktops remotely puts your network at a higher risk than local-only access. But it's not all virtual desktops. We have mobile users who would like to have their presentations or applications directly on a tablet or handheld. I, for instance, don't want to open a whole virtual desktop for just one application. Ever tried a virtual desktop on an iPhone? It is technically possible, yes, but it works poorly. Erm, forgot my MacBook HDMI USB-C converter for this presentation? Well, I'll just send it to my Gmail or Dropbox for access with the native mobile apps in your conference room. And the information is gone, out of the company sphere….. (a hypothetical situation of course..)

Data Leak

Great ideas, all those ways to get company information in and out. But, but, but….. these also pose some challenges that a lot of companies have not started thinking about. That sounds a bit foolish, as information is probably the biggest asset of a company. But unfortunately it's a fact (or maybe it's just the companies I visit). Sure, these companies have IT departments or IT vendors who think a bit about security. And in effect they mostly make their users' lives miserable with all sorts of technical barriers installed in the infrastructure. Barriers that users, business and IT (!) users alike, will find all sorts of ways to get around. Why? First of all to increase their productivity (while effectively decreasing security), and secondly because they are not informed about the important why. And then those barriers are just a nuisance.

Break down the wall

IT’s Business

I have covered this earlier in my post (https://pascalswereld.nl/2015/03/31/design-for-failure-but-what-about-the-failure-in-designs-in-the-big-bad-world). The business needs to have full knowledge of its required processes and information flows, the ones that support or process the information going in and out of the services supporting the business strategy, and of the persons that are part of the business and operate those services. What may be done with this information and in what ways: is it allowed for certain users to access the information outside of the data center, and so on. Compliance with, for example, certain local privacy laws. Governance with policies and choices, and risk management: do we do this part or not, how do we mitigate certain risks if we take approach Y, and what are the consequences if we do (or don't)?

Commitment from the business and people in the business is of utmost importance for information security. Start explaining, start educating and start listening.
If scratch is the starting point, start writing at a global level first. What does the business mean by working from anywhere, any place, what is this digital workspace, and so on? What are the risks, how do we approach IAM, what do we have for data loss prevention (DLP), is it allowed for IT to inspect SSL traffic (decrypt, inspect and encrypt), etc. etc.
Don't go too detailed at first; it is not necessary, and it can take a long time to get to a version 1.0. We can work on it. And to be fair, information security and the digital workspace are continually evolving and moving, so continual improvement of these processes must be in place. Be sure to check with legal that there are no loopholes in what has been written in the first iteration.
Then map to logical components (think from the information: why is it there, where does it come from and where does it go; and think of the apps and the users). When you have defined the logical components, IT can add the physical components (insert the providers, vendors, building blocks). Evaluate together: what works, what doesn't, what's needed and what is not. And rinse and repeat…..

Furthermore, targeting a 100% safe environment all the time will just not cut it. Mission impossible. Think about and define how to react to information leaks and how to minimize the impact of a compromise.

Design Considerations

With the above we should have a good starting point for the business requirements phase of a digital workspace design and deployment. And there will also be information from IT flowing back to the business for continual improvement.

Within the design of an EUC environment we have several software components where we can take action to increase (or decrease, but I will leave that part out ;-)) security in the layers of the digital workspace environment. And yes, when software-defined is not an option, there is always hardware…
And from the previous phase we have some idea what technical choices can be made to conform to the business strategy and policies.

If we think of the VMware portfolio and the technical software layers where we need to think about security, we can go from AirWatch/Workspace ONE, Access Point, Identity Manager, Security Server, Horizon and App Volumes to User Environment Management. And then…. two-factor authentication, one-time passwords (OTP), Microsoft Security Compliance Manager (SCM) for the Windows-based components, anti-virus and anti-malware, network segmentation and access policies with SDDC NSX for Horizon. And what about business continuity and disaster recovery plans, SRM, vDP.
Enterprise management with vROps and Log Insight integration to, for example, a SIEM. vRealize for automating and orchestrating to mitigate workarounds or faults in manual steps. And so on and so on. We have all sorts of layers where we can implement, or help implement, security and access policies. And how will all these interact? A lot to think about. (It could be that a new blog post series subject is born…)

But the justification should start at the business… Start explaining and start acting! That is probably 80% of the success of implementing information security. The technical components can be made to fit, but… only after the strategy, policies and information architecture are somewhat clear….

And after the people in the business support the need for information security in the workspace. (Am I repeating myself a bit? ;-))

Ideas, suggestions, conversation, opinions. Love to hear them.

NTP and the hey didn’t I set it up woes

Once in a while I come across a production network with some unexplained issues in the environment. Some (or a lot, depending on the environment) can be caused by time synchronization woes. Mostly because of undocumented or partly unconfigured time services.

Why is this an issue?

Time is inherently important to the functioning of routers, networks, servers, clusters, applications, storage, you name it. Without synchronized time, accurately correlating information between hosts becomes difficult, if not impossible. When it comes to analysis and security, if you cannot successfully compare logs between devices, you will find it very hard to develop a reliable picture of an incident. Some or all of your application clustering will fail. Application services will wait on data packets that never get processed. Locks are not removed in time. If an authentication time stamp coming from a client differs by more than the default 5 minutes from a Domain Controller's time, the Domain Controller will discard the packet as fake (or think what your two-factor solution will do). Storage controllers providing CIFS access need the same access to the directory as in the previous client example. Performance data cannot be accurately interpreted if the time stamps on the data are not synchronized between the managed and management components. And when presenting logs as proof, even if you are able to put the pieces together, unsynchronized times may give an attacker with a good attorney enough wiggle room to escape prosecution. These are just some of the things that can be affected by misconfigured or missing time synchronization.

Often there isn't anything in a standard installation that prompts the engineer to set up NTP. Proper NTP usage only occurs when the engineer is knowledgeable about NTP and its administration, and makes configuring NTP a standard post-installation task. This is often not the case, or it is only partly done and documented.

And if you don’t have a blue box around, Wibbly Wobbly Timey Wimey stuff is not what you want!

But what goes into an NTP architecture?

NTP is designed to synchronize the time on a network of hosts. NTP runs over the User Datagram Protocol (UDP), using port 123 as both the source and destination. 

An NTP network usually gets its time from an authoritative time source, such as a radio clock or an atomic clock attached to a time server (a stratum 1), or from one or more trusted time servers in the hierarchy. NTP then distributes this time across the network. Depending on your network and the available time sources, several NTP infrastructure options are possible. But often you will see a hierarchical structure with an external time source on the Internet (pool.ntp.org for example). Core infrastructure components (core router, firewall or Active Directory Domain Controller) have a client-server relationship with the external time sources (and are allowed through the firewall), the internal servers, devices et al have a client-server relationship with those core components, the internal workstations have their time services in a client-server relationship with a time-synchronized server (via the Windows Time service in this example), and so on down the tree. A hierarchical structure is the preferred technique because it provides consistency, stability and scalability.


Relations and configurations

In the virtual world there will be varying resources. On busy systems, resources may be denied for short periods of time or during high workloads, while other systems may receive more resources. This results in something referred to as time drift: the clock ticks will sometimes run at a faster or slower pace, creating an offset. This really shows the need for time synchronization.

NTP prefers to have access to several sources of time (preferably three) since it can then apply an agreement algorithm. Normally, when all servers are in agreement, NTP chooses the best server in terms of lowest stratum, closeness (in terms of network delay) and claimed precision.

For external NTP sources a path must exist from the trusted device to the external sources; firewall rules must allow this traffic. When using fully qualified domain names in the configuration (not always possible, but preferable), the NTP client relies on the DNS client to resolve them, so DNS must be set up correctly.

In a Windows domain your Windows domain members are automatically configured to use the Windows Time service against their domain controllers (that is, with the NT5DS setting). The domain controller (or the PDC emulator to be precise) needs to be manually changed to type NTP and set up with a peer list.
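
For reference, a minimal sketch of what that PDC emulator configuration could look like with w32tm (run elevated; the pool hosts below are just example peers, use whatever your design prescribes):

  # Point the PDC emulator at a manual peer list and mark it as a reliable time source
  w32tm /config /manualpeerlist:"0.pool.ntp.org 1.pool.ntp.org 2.pool.ntp.org" /syncfromflags:manual /reliable:yes /update

  # Restart the time service and force a resync
  Restart-Service w32time
  w32tm /resync /rediscover

  # Verify the configuration and the current source
  w32tm /query /configuration
  w32tm /query /status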

On other devices it is a configuration tab you can set from the web interface; others have a CLI in Linux or another OS. Linux can be set up to use the NTP daemon.

For VMware VMs, use guest operating system time synchronization such as Windows w32tm or NTP, and do not use VMware Tools periodic time synchronization.
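
A quick way to check (and, if needed, switch off) the VMware Tools periodic sync is via PowerCLI. A minimal sketch, assuming an existing Connect-VIServer session and with the VM name as a placeholder:

  # Report whether VMware Tools periodic time sync is enabled for a VM
  $vm = Get-VM -Name "EXCH01"
  $vm | Select-Object Name, @{N="ToolsTimeSync";E={$_.ExtensionData.Config.Tools.SyncTimeWithHost}}

  # Disable the periodic sync so the guest relies on w32tm/NTP only
  $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
  $spec.Tools = New-Object VMware.Vim.ToolsConfigInfo
  $spec.Tools.SyncTimeWithHost = $false
  $vm.ExtensionData.ReconfigVM($spec)

Keep in mind that Tools still does a one-off sync at events like vMotion or snapshot operations, even with this setting off.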

Be aware that all your devices, hosts, guests, database servers, applications, services et al are working together and need to have the same reliable source of time. As a rule you will have to set up your time service yourself; this is not done for you.

Set it up correctly and document for reference.

– Happy timey wimey!

vSphere Performance monitoring tools standard available

I am currently working on a project where we are optimizing a virtual infrastructure which consists of vSphere and XenServer hypervisors. In the project we want to measure and confirm some of the performance-related counters. We have several standard tools at the infrastructure components to see what the environment is capable of and to check whether there are bottlenecks regarding IO flow and processing.

With any analysis it is important to plan (or know) what to measure at what layer, so that it is repeatable when you want to check what certain changes do to your environment. This check can also be done with some of the tools available, such as written earlier in the blog post about VMware View Planner (to be checked at this url https://pascalswereld.nl/post/66369941380/vmware-view-planner), or it is a repeat of your plan (which can then be automated/orchestrated). Your measuring tools need to have similar counters/metrics throughout the chain, or at least show what you are putting in/requesting at the start and at the end (but if there is an offset you get little grey spots in the chain).
A correctly working time service (NTP) is, next to the correct working of for example clustering and logging, also necessary for monitoring: to get the right values at the right intervals. Being slightly off will in some cases give you negative values, or values that are off, at some components.

Some basics about measuring

You will have to know what the measured metrics are at a given point. Some are integers, some are floating point, some are averages over periods or amounts used, some need an algorithm to convert them to a human-readable or comparable metric (KB at one level and bytes at the other; some of them are not that easy). A value that looks high at first glance but consists of several components and is an average over a certain period could be perfectly normal when divided by the number of worlds.

Next up, know or decide on your measuring period and data collection intervals. If you are measuring every second you will probably get a lot of information and be a busy man (or woman) trying to analyze all that data. Measuring in December gives a less representative workload than measuring in a company's peak February period (and for Santa it is the other way around ;-)). And measure the complete process cycle; try to get a 4 week/month period in order to capture the month opening and closing processes (well, depending on the workload of course).

Most important is that you know what your workloads are, what the need for IO is, and what your facilitating networking and storage components are capable of. If you don't know what your VD image is built of for a certain group of users and what is required for them, how will you know whether a VD from this group requesting 45 IOPS is good or bad? On the other hand, if you put all your management, infrastructure and VDs on the same storage, how are you going to separate the cumulative counters from the specific workload?

Hey, you said something about vSphere in the title. Let's see what is available as standard at the vSphere level.

– VM monitoring; in-guest Windows Perfmon counters or Linux guest statistics. The latter highly depends on what you put in your distribution, but think of top, htop, atop, vmstat, mpstat et al.
Windows Perfmon counters are supplemented with some VM insights by VMware Tools. There are a lot of counters available, so know what you want to measure. And use data collector sets to group them and have them as reference/repeatable sets (scheduling of the data collection). A short Perfmon example follows below.
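
As a hedged example (the counter paths and output path are just common examples, pick the ones that match what you want to measure), a short Perfmon collection from PowerShell could look like this:

  # Sample a few common counters every 5 seconds, 12 samples (one minute)
  $counters = "\Processor(_Total)\% Processor Time",
              "\Memory\Available MBytes",
              "\LogicalDisk(_Total)\Avg. Disk sec/Read",
              "\LogicalDisk(_Total)\Avg. Disk sec/Write"

  Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 12 |
      Export-Counter -Path "C:\Temp\baseline.blg" -FileFormat BLG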

– Host level; esxtop or vscsiStats. Esxtop is a great tool for performance analysis of all types. Duncan Epping has an excellent post about esxtop metrics and usage, you can find it here http://www.yellow-bricks.com/esxtop// Esxtop can be used in interactive or batch mode (a batch mode example follows below). With batch mode you can load your output file in Windows Perfmon or in esxplot (http://labs.vmware.com/flings/esxplot). Use VisualEsxtop (http://labs.vmware.com/flings/visualesxtop) for enhancements to the esxtop command line and a nice GUI. On the vMA you can use resxtop to remotely get the esxtop stats. vscsiStats is used when you want SCSI collections or storage information that esxtop is not capable of showing. And of course PowerCLI can be an enormous help here.
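
For example, a batch mode capture could look something like this (delay, iteration count, output path and host name are placeholders; adjust to your measuring plan):

  # On the ESXi host (SSH/shell): all counters in batch mode,
  # one sample every 5 seconds, 360 iterations (roughly 30 minutes)
  esxtop -b -a -d 5 -n 360 > /vmfs/volumes/datastore1/esxtop-capture.csv

  # Or remotely from the vMA against a specific host
  resxtop --server esxi01.example.local -b -d 5 -n 360 > esxtop-capture.csv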

– vCenter level; statistics collection, which depends on your statistics level. Graphs can be shown for several components in the vSphere Web Client, can be read via the vSphere API, or again use PowerCLI to extract the wanted counters (a small example follows below). To get an overview of metrics at the different levels please read this document http://pubs.vmware.com/vsphere-55/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-55-monitoring-performance-guide.pdf or check the documentation center for your version.
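
A hedged PowerCLI sketch for pulling a few vCenter statistics (assumes a Connect-VIServer session; the VM name, metric names and interval are examples and depend on your statistics level and rollups):

  $vm = Get-VM -Name "VDI-GOLD-01"

  # Pull a week of CPU, memory and disk latency stats at 30-minute rollups
  Get-Stat -Entity $vm -Stat "cpu.usage.average","mem.usage.average","disk.maxTotalLatency.latest" `
      -Start (Get-Date).AddDays(-7) -Finish (Get-Date) -IntervalMins 30 |
      Export-Csv -Path "C:\Temp\vm-stats.csv" -NoTypeInformation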

– vCenter™ Operations Management Suite (vCOps). Well, standard; you still have the option to not include Operations in your environment. But then you're missing out on some of the automated (interactive/proactive) performance monitoring, reporting and insight options for your environment. Root cause analysis is part of the suite, and not down to your own understanding and analytic skills. If you are working at the previous levels, your life could have been simpler with the vCOps suite.

Next up

These standard tools need to be supplemented with specific application, networking (hops and other passed components) and storage counters (what are the storage processors up to, is there latency building up in the device itself).

– Happy measuring!

Managing multi-hypervisor environments, what is out there?

A small part of the virtualization world I visit is in the phase of running multi-hypervisor environments. But I expect more and more organizations to not be single-type only and to be open to using a second line of hypervisors next to their current install base. Some will choose specific features, or product lines for specific workloads, or change strategy towards open source, for example.

Some hypervisor providers have, or are bringing, multi-hypervisor support to their product lines. VMware NSX brings support for multi-hypervisor network environments via the Open vSwitch support in NSX (with a separate product choice that is), while XenServer leverages Open vSwitch as a standard virtual switch option. Appliances are commonly delivered in the OVF format. Several suites are out there that claim single-pane management for multiple hypervisors.

But how easily is this multi-hypervisor environment managed, and from what perspective? Is there support in only a specific management plane? Is multi-hypervisor bound to multiple management products, thus adding extra complexity? Let's try and find out what is currently available for the multi-hypervisor world.

What do we have?

Networking, Open vSwitch; a multi-layer virtual switch licensed under the open source Apache 2.0 license. Open vSwitch is designed to enable network automation through programmatic extension, while still supporting standard management protocols (e.g. NetFlow, sFlow, SPAN, RSPAN, CLI, LACP, 802.1ag). Furthermore it is designed to support distribution across multiple physical servers, similar to VMware's distributed vSwitch concept. It is distributed as standard in many Linux kernels and available for KVM, XenServer (default option), VirtualBox, OpenStack and VMware NSX for multi-hypervisor infrastructures. Hyper-V can use Open vSwitch, but needs a third-party extension (for example via OpenStack). Specifically for networking, but it is a real start for supporting true multi-hypervisor environments.

Transportation, open format OVF/OVA; possibly the oldest of the open standards in the virtual world. Open Virtualization Format (OVF) is an open standard for packaging and distributing virtual appliances or, more generally, software to be run in virtual machines. Used for offline transportation of VMs. Widely used for transporting appliances of all sorts. Supported by multiple hypervisor parties, but sometimes conversions are needed, especially for the disk types. OVFs with a VHD disk need to be converted to VMDK to be used on VMware (and vice versa). Supported by XenServer, VMware, VirtualBox and such. OVF is also supported for Hyper-V, but not all versions of System Center Virtual Machine Manager support the import/export functionality. OVF allows a virtual appliance vendor to add items like a EULA, comments about the virtual machine, boot parameters, minimum requirements and a host of other features to the package. Specifically for offline transportation (a small ovftool sketch follows below).
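
Just as an illustration of the transport part, a hedged ovftool sketch (file names, the appliance name and the vi:// locator are all placeholders, and ovftool is assumed to be installed and in your path):

  # Unpack an OVA into an OVF package (descriptor plus VMDK disks)
  ovftool appliance.ova C:\Temp\appliance\appliance.ovf

  # Deploy the same package straight to a vCenter-managed cluster
  ovftool --acceptAllEulas --name=MyAppliance --datastore=datastore1 appliance.ova "vi://administrator@vcenter.example.local/DC01/host/Cluster01"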

VMware vCenter Multi-Hypervisor Manager; a feature of vCenter to manage other hypervisors next to ESXi hosts from the vCenter management plane. Started as a VMware Labs fling, but now a VMware-supported product (support for the product only; underlying Hyper-V issues are still for the Microsoft corporation) available as a free download with a standard license. Currently at version 1.1. Host management and provisioning actions for third-party hypervisors. Support for hypervisors other than VMware's is limited to Hyper-V. And to be honest, it is not primarily marketed as a management tool but more as a conversion tool to vSphere.

vCloud Automation Center (vCAC); vCloud Automation Center focuses on managing multiple infrastructure pools at the cloud level. You can define endpoints other than vSphere and collect information, or add these computing resources to an enterprise group. For certain tasks (like destroying a VM) manual discovery is still necessary for these endpoints to be updated accordingly, but you can leverage vCAC workflow capabilities to get around this. It uses vCAC agents to support vSphere, XenServer, Hyper-V or KVM hypervisor resource provisioning. Hypervisor management is limited to vSphere and Hyper-V (via SCVMM) only. vCAC does offer integration of different management applications, for example server management (iLO, DRAC, blades, UCS), PowerShell, VDI connection brokers (Citrix/VMware), provisioning (WinPE, PVS, SCCM, kickstart) and cloud platforms from VMware and Amazon (AWS), into one management tool, thus providing a single interface for delivery of infrastructure pools. Support and management are limited, as the product is focused on workflows and automation for provisioning, not management per se, but I am interested to see what the future holds for this product. Not primarily for organisations that manage their own infrastructure and service only themselves. Specifically for automated delivery of multi-tenant infrastructure pools, but limited.

System Center Virtual Machine Manager (SCVMM); a management tool with the ability to manage VMware vSphere and Citrix XenServer hosts in addition to those running Hyper-V. But just as the product title says, it is primarily about the management of your virtual machines. SCVMM is able to read and understand configurations, and can do VM migrations leveraging vMotion. But to do management tasks on networking, datastores, resource pools, VM templates (SCVMM only imports metadata into its library), host profile compliance (and more), or to fully use distributed cluster features, you will need to switch to or rely on vCenter for those tasks. Some actions can be done by extending SCVMM with a vCenter system, but that again is limited to managing VM tasks. Interesting that there is support for more than one other hypervisor, with both vSphere and XenServer supported. And leveraging the System Center suite gives you a broader data center management suite, but that is out of scope for this subject. Specifically for virtual machine management, and another attempt to get you to convert to the primary hypervisor (in this case Hyper-V).

Other options?; Yes, automation! Not a single management solution, but more a way to close the gap between management tasks and the support of the management suites. Use automation and orchestration tools together with scripting extensions to solve these management task gaps. Yes, you still have multiple management tools, but you can automate repetitive tasks (if you can repeat it, automate it) across them. PowerShell/CLI for example is a great way to script tasks in your vSphere, Hyper-V and XenServer environments (a small sketch follows below). Use an interface like WebCommander (read a previous blog post at https://pascalswereld.nl/post/65524940391/webcommander) to present a single management interface to your users. But yes, here some work and effort is expected to solve the complexity issue.
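
As a minimal sketch of that idea (server names are placeholders, and depending on your PowerCLI version it loads as a snap-in instead of a module), a single script that inventories VMs from both vSphere and Hyper-V could look like this:

  # PowerCLI and the Hyper-V module both ship a Get-VM cmdlet;
  # import Hyper-V with a prefix so the two don't collide
  Import-Module VMware.VimAutomation.Core   # or Add-PSSnapin on older PowerCLI versions
  Import-Module Hyper-V -Prefix Hv

  Connect-VIServer -Server vcenter.example.local | Out-Null

  $vsphereVMs = Get-VM | Select-Object Name, @{N="State";E={$_.PowerState}}, @{N="Platform";E={"vSphere"}}
  $hypervVMs  = Get-HvVM -ComputerName hyperv01 | Select-Object Name, @{N="State";E={$_.State}}, @{N="Platform";E={"Hyper-V"}}

  # One combined (if basic) inventory from a single script
  @($vsphereVMs) + @($hypervVMs) | Sort-Object Platform, Name | Format-Table -AutoSize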

Third parties?; Are there any out there? Yes. They provide ways to manage multi-hypervisor environments as add-ons/extensions that use the management already in place. For example, HotLink SuperVISOR adds management of Hyper-V, XenServer and KVM hosts from a single vCenter inventory. And HotLink Hybrid Express adds Amazon cloud support to SCVMM or vCenter. The big advantage is that HotLink uses the tools in place and integrates with them, so there is just a minimal learning curve to worry about. But why choose a third party when the hypervisor vendors are moving their products to the same open scope? Will an add-on add extra troubleshooting complexity? How is support when using multiple products from multiple vendors, where does one end and the other start? Well, it's up to you whether these are pros or cons. And the maturity of the product of course.

Conclusion

With a growing number of organisations adopting a multi-hypervisor environment, these organisations still rely on multiple management interfaces/applications, and thus bring extra complexity to the management of their virtual environments. Complexity adds extra time and extra costs, and that isn't what the large portion of organisations want. At this time, simply don't expect a true single-management experience if you bring in different hypervisors, or be prepared to close the gaps yourself (the community can be of great help here) or use third-party products like HotLink.
We are getting closer with the adoption of open standards, hybrid clouds and growing support for multiple hypervisors in the management suites of the hypervisor players. But one step at a time. Let's see when we get there, at true single management of multi-hypervisor environments.

Interested in sharing your opinion, have an idea or a party I missed? Leave a comment. I'm always interested in the view of the community.

– Happy (or happily) managing your environment!

Exchange DAG Rebuilding steps and a little DAG architecture

At a customer site I was called in to do an Exchange health check and some troubleshooting. As I have not previously added Exchange content to this blog, I thought I would turn the experience notes into a new blog post in one go.

Situation
The environment is a two-site data center where site A is active/primary and site B is passive/secondary. A two-node Exchange setup is deployed on Hyper-V: the node in site A is CAS/HT/MBX and the node in site B is CAS/HT/MBX. The mailbox role is DAG'ed, where site A is active and the database copies are on site B. There is no CAS array (Microsoft best practice is to set one up even if you have just one CAS, but this wasn't the case here). This is not ideal, as on a CAS fail-over clients cannot automatically connect to another CAS; Exchange uses the CAS array (with a load balancer) for this. The CAS fail-over is manual (as is the HT). But when documented well, and a small amount of downtime is acceptable for the organisation, this is no big issue.
Site A has a Hyper-V cluster where Exchange node A is a guest hosted on this cluster. Site B has an unclustered Hyper-V host where Exchange node B is a guest. Exchange node A is marked highly available. This again is not ideal; yes, maybe for the CAS/HT roles it can be used (they should then be separated from the mailbox role), but the mailbox role is already clustered at the application layer (the DAG), so preferably turn it off. Anyhow, these are some of the pointers I could discuss with the organisation. But there is a problem at hand that needs to be solved.

The Issue at hand (and some Exchange architecture)
The issue is that the secondary node is in a failed state and is currently not seeding its database copies. Furthermore, the node is complaining about the witness share. You can check the DAG health with the PowerShell cmdlets Get-MailboxDatabaseCopyStatus and Test-ReplicationHealth, for example as shown below.
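
For example (the server name is a placeholder, run from the Exchange Management Shell):

  # Copy status for every database copy on the suspect DAG member
  Get-MailboxDatabaseCopyStatus -Server "NODEB" | Format-Table Name, Status, CopyQueueLength, ReplayQueueLength, ContentIndexState -AutoSize

  # Replication health checks (cluster service, quorum, replay service and so on)
  Test-ReplicationHealth -Identity "NODEB" | Format-Table Check, Result, Error -AutoSize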

You can check the DAG settings and members with Get-DatabaseAvailabilityGroup -Identity DAGNAME -Status | fl. Here you can see the configured file witness server.

RunspaceId : 0ffe8535-f78a-4cc1-85fd-ae27934a98e0
Name : DAGNAME
Servers : {Node A, Node B}
WitnessServer : Servername
WitnessDirectory : Directory name on WitnessServer
AlternateWitnessServer : Second Servername
AlternateWitnessDirectory : Directory name on Second Witness Server

(The AlternateWitnessServer is only used with Datacenter Activation Coordination (DAC) mode set to DagOnly; here it is off and therefore not used and not needed.)

Okay, witness share; some Exchange DAG architecture first. An Exchange DAG is an Exchange database mirroring service built on the failover cluster service (Microsoft calls it a hidden cluster service). You can mirror the databases in an active/passive setup (one node is active, the other only hosts replicas), or in an active/active setup (both nodes have active and passive databases). Both setups give high availability and room for maintenance (in theory, that is). The mirroring is done by replicating the databases as database copies between members of the DAG. The DAG uses failover clustering services, where the DAG members participate as cluster nodes. A cluster uses a quorum to tell the cluster which server(s) should be active at any given time (a majority of votes). In case of a failure in the heartbeat network there is a possibility of split brain: both nodes are active and try to bring up the cluster resources, as they are designed to do. Both nodes could then serve active databases, with the possibility of data mismatch, corruption or other failures. In that case the quorum is used to find out which node has the majority of votes to be active. A shared disk is often used for the cluster quorum. Another option is to use a file share on a server outside the cluster, the so-called file share witness, or file witness share in Exchange terms.

(image: CAS and DAG high availability components)

The model above shows the CAS and DAG HA components. Exchange architecture best practice is to place the file witness share on the HT role, but in the case of mixed roles you should select a server outside the DAG, and in this case outside the Exchange organisation. Any file server can be used, preferably a server in the same datacenter as the primary site serving the users (important).

So back to the issue: file witness share (FWS) access. I checked whether I could see the file share (\\servername\DAGFQDN) from the server and checked permissions (Exchange Trusted Subsystem and the DAG$ computer object should have full control). The Exchange Trusted Subsystem must be a member of the local Administrators group. The FWS is placed on a domain controller in this organisation. Not ideal again (the Exchange servers now need domain-level Administrators group membership, as domain controllers don't have local groups), but working.

I checked the failover cluster service, and there the node is in a down state, including its networks. But in the Windows guest the networks are up and traffic is flowing from and to both nodes and the FWS. No firewall on or between the nodes, no NATting. Okay…… Some other items (well, a lot) were checked, as several actions had been done in the environment. Also checked Hyper-V settings and networking. Nothing blocking found (again some pointers for future actions).

Well, let's try to remove the failed node from the DAG and add it back. This should have no impact on the organisation, and the state is already failed.

Removing a node from the DAG.

Steps to follow:
1. Depending on the state, suspend database seeding. When failed, suspend via Suspend-MailboxDatabaseCopy -Identity <mailbox database>\<nodename>. When the status is FailedAndSuspended this is not needed.
2. Remove the database copies of the mailbox databases on the failed node. Use Remove-MailboxDatabaseCopy -Identity <mailbox database>\<nodename>. Repeat as needed for the other copies.
3. Remove the server from the DAG: Remove-DatabaseAvailabilityGroupServer -Identity <DAGName> -MailboxServer <ServerName> -ConfigurationOnly.
4. Evict the node from the cluster.
As the cluster now has only one node, the quorum is moved to node majority automatically. The FWS object is removed from the configuration.

Rebuilding the DAG by adding the removed node back

Steps to follow (a consolidated example with placeholder names follows after the list):

1. Add the server to the DAG. This will add the node back to the cluster: Add-DatabaseAvailabilityGroupServer -Identity <DAGName> -MailboxServer <ServerName>. Success, the node is healthy.
2. Add the database copies with activation preference 2 (the other node is still active): Add-MailboxDatabaseCopy -Identity <Mailbox Database> -MailboxServer <ServerName> -ActivationPreference 2.
3. In my case the time between the failed state and returning to the DAG was a bit long. The database came up, but returned to a failed state. We have to suspend and manually seed: Suspend-MailboxDatabaseCopy -Identity <mailbox database>\<nodename>.
4. Update-MailboxDatabaseCopy -Identity "<Mailbox Database>\<Mailbox Server>" -DeleteExistingFiles. Wait until all the bytes are transferred across the line. When finished, the suspended state is automatically lifted.
Repeat for the other databases.
5. You would now expect to see a good state of the DAG and databases in the Exchange Management Console. Not yet: the file witness share is not back yet.
6. Add the witness share from the Exchange Management Shell: Set-DatabaseAvailabilityGroup -Identity "<DAG Name>" -WitnessServer "<Servername>" -WitnessDirectory "<Server Directory>". When the DAG has at least two members the FWS is recreated. This is also visible in Failover Cluster Manager.
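
Pulled together, the re-add could look something like the sketch below (DAG01, NODEB, DB01, FILESRV01 and the witness directory are placeholder names; run from the Exchange Management Shell and repeat the copy cmdlets per database):

  # 1. Add the node back to the DAG (and thus to the cluster)
  Add-DatabaseAvailabilityGroupServer -Identity "DAG01" -MailboxServer "NODEB"

  # 2. Re-create the database copy with activation preference 2
  Add-MailboxDatabaseCopy -Identity "DB01" -MailboxServer "NODEB" -ActivationPreference 2

  # 3/4. If the copy drops back to Failed, suspend it and reseed from scratch
  Suspend-MailboxDatabaseCopy -Identity "DB01\NODEB" -Confirm:$false
  Update-MailboxDatabaseCopy -Identity "DB01\NODEB" -DeleteExistingFiles

  # 6. Re-register the file witness share once the DAG has two members again
  Set-DatabaseAvailabilityGroup -Identity "DAG01" -WitnessServer "FILESRV01" -WitnessDirectory "C:\DAG01-FSW"

  # Verify
  Get-MailboxDatabaseCopyStatus -Server "NODEB"
  Test-ReplicationHealth -Identity "NODEB"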

Root Cause Analysis

Okay, this didn't go as smoothly as described above. When trying to add the cluster node back to the cluster, it failed with the FWS error again. In the output of the cluster node command it is noticeable that on node A, node B is reported down, while on node B, node A is reported down and node B is Joining. Hey, wait: there is a split, and the Joining state indicates that node B is trying to bring up its own cluster. Good that it is failing. When removing the node from the DAG, Kaspersky virus protection loses its connection, as it is configured against the DAG databases. At the same time node A shows the same errors and something new: RPC server errors with Kaspersky. Ahhhh, node A's networking services not working correctly is the culprit here. The organisation's admins could not tell whether the networking updates and changes had been followed by a maintenance restart/reboot. So something was probably still lingering in memory. So: inform the users, check the backup and reboot the active node. The node came up fine and, lo and behold, node B could be added to the failover cluster. At that point I could rebuild the DAG. Health checks are okay, and documented.

– Hope this helps when you have similar actions to perform.

 

Dissecting vSphere – Data protection

An important part of business continuity and disaster recovery plans is the way you protect your organisation's data. A way to do this is to have a backup and recovery solution in place. This solution should be able to get your organisation back into production within the set RPOs/RTOs. The solution needs to be able to test your backups, preferably in a sandboxed testing environment. I have seen situations at organisations where the backup software was reporting green lights on the backup operation, but when a crisis came up they couldn't get the data out, and thus recovery failed. Panicking people all over the place….

A backup and recovery solution can be (a mix of) commercial products that protect the virtual environment, like Veeam, or in-guest agents like Veritas or DPM, or features of the OS (return to a previous version with snapshots). Other ways include solutions on the storage infrastructure. But what if you are budget constrained….

Well, VMware has vSphere Data Protection (VDP), which is included from the Essentials Plus kit onwards. This is the standard edition. The vSphere Data Protection Advanced edition is available from the Enterprise license onwards.
So there are two flavours; what does standard give, and what does it lack compared to Advanced?
First the what: as previously stated, VDP is the backup and recovery solution from VMware. It is an appliance that is fully integrated with vCenter and easy to deploy. It performs full virtual machine and file-level restore (FLR) without installing an agent in every virtual machine. It uses data deduplication for all backup jobs, reducing disk space consumption.


VDP standard is capped at a 2TB backup data store, whereas VDP Advanced allows dynamic capacity growth to 4TB, 6TB or 8TB backup stores. VDP Advanced also provides agents for specific applications: agents for SQL Server and Exchange can be installed in the VM guest OS. These agents allow selecting individual databases or stores for backup or restore actions, application quiescing, and advanced options like truncating transaction logs.


At VMworld 2013 further capabilities of VDP 5.5 were introduced:

– Replication of backup data to EMC.
– Direct-to-host emergency restore (without the need for vCenter, so perfect for backing up your vCenter).
– Backup and restore of individual VMDK files.
– Specific schedules for multiple jobs.
– VDP storage management improvements, such as selecting separate backup data stores.

Sizing and configuration

The appliance is configured with 4 vCPUs and 4GB RAM. The available backup store capacities of 500GB, 1TB and 2TB will consume respectively 850GB, 1.3TB and 3.1TB of actual storage. There is a 100 VM limit, so after that you would need another VDP appliance (with a maximum of 10 VDP appliances per vCenter).

After deployment, the appliance needs to be configured via the VDP web service. The first time, it is in installation mode. Items such as IP, hostname, DNS (if you haven't added these with the OVF deployment), time and vCenter need to be configured. After completion (and successful testing) the appliance needs to be rebooted. A heads-up: the initial configuration reboot can take up to 30 minutes to complete, so have your coffee machine nearby.

After this you can use the Web Client connected to your VDP-connected vCenter to create jobs. Let the created jobs run controlled for the first time; the first backup of a virtual machine takes time, as all of the data for that virtual machine is being backed up. Subsequent backups of the same virtual machine take less time, because changed block tracking (CBT) and deduplication are performed.
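
If you want to check whether CBT is actually enabled on your VMs, a hedged PowerCLI one-liner (assuming a Connect-VIServer session) could be:

  # List per VM whether changed block tracking is enabled
  Get-VM | Select-Object Name, @{N="ChangeBlockTracking";E={$_.ExtensionData.Config.ChangeTrackingEnabled}} |
      Sort-Object Name | Format-Table -AutoSize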

Performance

Well, this depends on the kind of storage you are going to use as the backup data store. If you go for low-cost storage (let's say most SMBs would want that), you pay in performance (or the lack of it most of the time).

Storage Offsite

Most organizations want their backup data stored offsite in some way. VDP does not offer replication (or with VDP 5.5 only to EMC), so you want to have some offsite replication or synchronization in place (and a plan for how you would restore from that data if your VDP appliance is lost as well). vSphere Replication only protects VMs and not your backup data store. Most SMBs don't have a lot of replication-capable storage devices in place, and when they do, they use them for production and not as a backup data store. Keep this in mind when researching this product for your environment.

– Enjoy data protecting!

Evaluation – VMware vCenter Log Insight – Part one: the what, why and installation

A few posts back I wrote about the vCenter collector services for centrally collecting logs and dumps. There is also VMware vCenter Log Insight, which is both a collector and an analyzer (with the collectors you have to do the analyzing part yourself). The appliance is an OVF/OVA download that you can add to your environment.

What does it give you:

  • Log file collection and analysis.
  • Alert and events collection and analysis.
  • vCenter and vCenter Operations Management integration.
  • Connect to everything. Everything? Well, everything that's able to generate log data. Several partners have content packs for their logs that you can import, which gives you an additional layer for analysis.

So you were writing about the vCenter collectors; why would we not use those? Well, you can, for your virtual environment. And you will if you are budget constrained. But then you have to do the analysis all by yourself, with your own expertise.

How do I get it?

What does it cost?

VMware vCenter Log Insight is licensed on a per operating system instance (OSI) basis, which is defined as any server, virtual or physical, with an IP address that generates logs, including network devices and storage arrays. You can analyze an unlimited amount of log data per OSI. The vCenter Log Insight price is currently announced as $200 per OSI.

And the total really depends on the number of log-generating devices you want to have collected and analyzed (not only VMware-related).

Installation

Installation of the appliance is straightforward, just like any OVF: right-click the datacenter or cluster and choose Deploy OVF Template. Select your source location, and don't forget to change the file type in the browse window (otherwise it looks for *.ovf by default and not the .ova extension Log Insight has) and select the downloaded version.

Accept the license agreement (oh yes, you want to read it first ;) ), choose your hostname and location, disk layout, datastore location and network. If you want, you can customize by setting a gateway, DNS and IP for Log Insight. Defaults or blanks will give you DHCP. And let it fly.
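
If you prefer PowerCLI over the wizard, a hedged sketch could look like this (file path, names, host, datastore and the network mapping property are placeholders; inspect the object returned by Get-OvfConfiguration to see the exact properties your OVA exposes):

  $ovaPath   = "C:\Downloads\VMware-vCenter-Log-Insight.ova"
  $ovfConfig = Get-OvfConfiguration -Ovf $ovaPath

  # Map the appliance network (the property name depends on the OVA's network label)
  $ovfConfig.NetworkMapping.Network.Value = "VM Network"

  Import-VApp -Source $ovaPath -OvfConfiguration $ovfConfig -Name "loginsight01" `
      -VMHost (Get-VMHost "esxi01.example.local") -Datastore (Get-Datastore "datastore1") -DiskStorageFormat Thin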

Start your engines when it is all finished. Log on to the console and press CTRL ALT F1.


Log in with root and a blank password. You will then be prompted to set a new password for root.

The vCenter Log Insight Web interface is available at http://log_insight-host/. The HTTPS-based secure Web interface is available at https://log_insight-host/.


When you access the vCenter Log Insight Web interface for the first time after the deployment, you must complete the initial configuration steps:

  • set the admin password and optionally an e-mail address.
  • set up a permanent or evaluation license key.
  • type in the e-mail address of the mailbox to receive notifications (some notifications about Log Insight are only sent via e-mail).
  • if you want to participate in the Customer Experience Improvement Program, tick the checkbox of the send weekly option.
  • save and continue.
  • On the time page set up an NTP server; when none is available you can optionally sync with the ESXi host.
  • save and continue.
  • Set up the SMTP server details.
  • save and continue.
  • You can now set up the optional retrieval of vCenter events/tasks/alerts or send alert notifications to vC Ops. (Well, optional: if you want central management, set those options up. Leave out Ops if you don't have it.)
  • save and continue.
  • Set up an optional NFS archive location. You can also add more VMDKs to your system for online data locations. But you will want to have some archiving in the future. For the evaluation I'm skipping this one.
  • save and continue.
  • Restart to complete the initial setup.



After the restart, open a browser and voila, the vCenter Log Insight home screen is shown. That was a smooth install.

This concludes the first part of the vCenter Log Insight evaluation. In the next part we will handle the following:

What to do next?

We need to configure some hosts to send their syslog to vCenter Log Insight. We can use the provided configure-esx script, or we can use PowerCLI to set a syslog host in the ESXi host advanced settings (a sketch follows below). We will use Log Insight to query log messages, set up alert notifications, import content packs and more. When I have my lab set up a bit more (I have a little resource issue) I will follow up in a second post.
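
As a preview, a hedged PowerCLI sketch for the syslog part (the Log Insight hostname is a placeholder; assumes a Connect-VIServer session):

  # Point every host's syslog at Log Insight and open the outbound syslog firewall rule
  $logHost = "udp://loginsight01.example.local:514"

  Get-VMHost | ForEach-Object {
      Get-AdvancedSetting -Entity $_ -Name "Syslog.global.logHost" | Set-AdvancedSetting -Value $logHost -Confirm:$false
      Get-VMHostFirewallException -VMHost $_ -Name "syslog" | Set-VMHostFirewallException -Enabled:$true
  }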

-Enjoy for now.