At a customer's site I was called in to do an Exchange health check and some troubleshooting. As I had not previously added Exchange content to this blog, I thought I'd capture the experience and a new blog post in one.

Situation
The environment is a two-site data center where site A is active/primary and site B is passive/secondary. A two-node Exchange setup is deployed on Hyper-V: the node in site A is CAS/HT/MBX and the node in site B is CAS/HT/MBX. The mailbox role is in a DAG, with the active databases in site A and the database copies in site B. There is no CAS array (Microsoft best practice is to set one up even if you have just one CAS, but that wasn't done here). This is not ideal, as a CAS fail-over doesn't let clients automatically connect to another CAS; Exchange uses the CAS array (with a load balancer) for this. The CAS fail-over is manual (as is the HT). But when documented well, and a small amount of downtime is acceptable for the organisation, this is no big issue.
Site A has a Hyper-V cluster where Exchange node A is a guest hosted on this cluster. Site B has an unclustered Hyper-V host where Exchange node B is a guest. Exchange node A is marked as highly available. This again is not ideal: for the CAS/HT roles it could perhaps be used (those should then be separated from the mailbox role), but the mailbox role is already clustered at the application layer (the DAG), so host-level HA is preferably off. Anyhow, these are some pointers I could discuss with the organisation. But there is a problem at hand that needs to be solved.

The Issue at hand (and some Exchange architecture)
The issue is that the secondary node is in a failed state and currently not seeding its database copies. Furthermore, the host is complaining about the witness share. You can check the DAG health with the PowerShell cmdlets Get-MailboxDatabaseCopyStatus and Test-ReplicationHealth.
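A minimal health check might look like this (EXCH-B is a hypothetical name for the failed node):

# Copy status and queue lengths for all copies on the failed node
Get-MailboxDatabaseCopyStatus -Server EXCH-B | Format-Table Name, Status, CopyQueueLength, ReplayQueueLength
# Replication health of the node itself
Test-ReplicationHealth -Identity EXCH-B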

You can check the DAG settings and members with Get-DatabaseAvailabilityGroup -Identity DAGNAME -Status | fl. Here you can see the configured file witness server.

RunspaceId : 0ffe8535-f78a-4cc1-85fd-ae27934a98e0
Name : DAGNAME
Servers : {Node A, Node B}
WitnessServer : Servername
WitnessDirectory : Directory name on WitnessServer
AlternateWitnessServer : Second Servername
AlternateWitnessDirectory : Directory name on Second Witness Server

(The AlternateWitnessServer is only used with Datacenter Activation Coordination (DAC) mode DagOnly; here DAC mode is off, so it is not used and not needed.)
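A quick way to confirm this, using the DAGNAME identity from above:

# DatacenterActivationMode Off means the alternate witness is ignored
Get-DatabaseAvailabilityGroup -Identity DAGNAME | Format-List DatacenterActivationMode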

Okay, witness share; some Exchange DAG architecture first. An Exchange DAG is an Exchange database mirroring service built on the Failover Clustering service (Microsoft calls it a hidden cluster service). You can mirror the databases in an active/passive solution (one node is active, the other only hosts replicas) or in an active/active solution (both nodes have active and passive databases). Both solutions give high availability and room for maintenance (in theory, that is). The mirroring is done by replicating the databases as database copies between members of the DAG. The DAG uses the Failover Clustering services, where the DAG members participate as cluster nodes.

A cluster uses a quorum to tell the cluster which server(s) should be active at any given time (a majority of votes). In case of a failure in the heartbeat network there is a possibility of split brain: both nodes are active and try to bring up the cluster resources, as they are designed to do. Both nodes can then serve active databases, with the possibility of data mismatch, corruption or other failures. To prevent this, the quorum determines which node has the majority of votes and may stay active. A shared disk is often used for the cluster quorum. Another option is to use a file share on a server outside the cluster, the so-called file share witness quorum, or file witness share in Exchange.
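For a two-node DAG this translates to a Node and File Share Majority quorum. As a sketch, assuming the FailoverClusters PowerShell module is available on a DAG member, you can inspect the hidden cluster's quorum like this:

# Show the quorum model and the witness resource of the hidden cluster
Import-Module FailoverClusters
Get-ClusterQuorum -Cluster DAGNAME | Format-List QuorumType, QuorumResource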

[Image: CAS and DAG high-availability components]

The above model shows the CAS and DAG HA components. Exchange architecture best practice is to place the file witness share on the HT role, but in the case of mixed roles you should select a server outside the DAG, and in this case outside the Exchange organisation. Any file server can be used, preferably a server in the same datacenter as the primary site serving the users (important).

So back to the issue: file witness share (FWS) access. I checked whether I could see the file share (\\servername\DAGFQDN) from the server and checked permissions (the Exchange Trusted Subsystem group and the DAG$ computer object should have Full Control). The Exchange Trusted Subsystem group must also be a member of the local Administrators group on the witness server. The FWS is placed on a domain controller in this organisation. Not ideal again (Exchange Trusted Subsystem now needs domain-level Administrators membership, as domain controllers don't have local groups), but working.
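A quick reachability and ACL check from a DAG member, using the same \\servername\DAGFQDN share path as above:

# Is the witness share reachable at all?
Test-Path "\\servername\DAGFQDN"
# Who has which rights? Expect Exchange Trusted Subsystem and DAG$ with FullControl
(Get-Acl "\\servername\DAGFQDN").Access | Format-Table IdentityReference, FileSystemRights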

I checked the Failover Clustering service, and there the node is in a down state, including its networks. But inside the Windows guest the networks are up and traffic is flowing to and from both nodes and the FWS. No firewall on or between the nodes, no NAT. Okay... Some other items (well, a lot) were checked, as several changes had recently been made in the environment. Also checked Hyper-V settings and networking. Nothing blocking found (again, some pointers for future actions).
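The node state can be compared from both sides with the cluster tooling (cluster.exe is the legacy equivalent of the PowerShell cmdlet):

# PowerShell view of cluster membership, run against the DAG's hidden cluster
Get-ClusterNode -Cluster DAGNAME | Format-Table Name, State
# Legacy command-line view, run locally on each node
cluster node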

Well, let's try to remove the failed node from the DAG and add it back. This should have no impact on the organisation, as the node's state is already failed.

Removing a node from the DAG.

Steps to follow (a combined sketch follows this list):
1. Depending on the state, suspend database seeding. When failed, suspend via Suspend-MailboxDatabaseCopy -Identity <mailbox database>\<nodename>. When the status is FailedAndSuspended this is not needed.
2. Remove the database copies of the mailbox databases on the failed node. Use Remove-MailboxDatabaseCopy -Identity <mailbox database>\<nodename>. Repeat as needed for the other copies.
3. Remove the server from the DAG: Remove-DatabaseAvailabilityGroupServer -Identity <DAGName> -MailboxServer <ServerName> -ConfigurationOnly
4. Evict the node from the cluster.
As the cluster now has only one node, the quorum is moved to Node Majority automatically. The FWS object is removed from the config.
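Put together, the removal looks roughly like this (DB01 and EXCH-B are hypothetical database and node names; Remove-ClusterNode is one way to do the eviction in step 4):

# 1. Suspend seeding of the copy on the failed node
Suspend-MailboxDatabaseCopy -Identity "DB01\EXCH-B" -Confirm:$false
# 2. Remove the database copy (repeat per database)
Remove-MailboxDatabaseCopy -Identity "DB01\EXCH-B" -Confirm:$false
# 3. Remove the server from the DAG configuration only
Remove-DatabaseAvailabilityGroupServer -Identity DAGNAME -MailboxServer EXCH-B -ConfigurationOnly
# 4. Evict the node from the underlying cluster (run on the surviving node)
Remove-ClusterNode -Name EXCH-B -Cluster DAGNAME -Force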

Rebuilding the DAG by adding the removed node back

Steps to follow (a combined sketch follows this list):

1. Add the server to the DAG. This will add the node back to the cluster: Add-DatabaseAvailabilityGroupServer -Identity <DAGName> -MailboxServer <ServerName>. Success: the node is healthy.
2. Add the database copies with activation preference 2 (the other node is still active): Add-MailboxDatabaseCopy -Identity <Mailbox Database> -MailboxServer <ServerName> -ActivationPreference 2.
3. In my case the time between the failed state and returning to the DAG was a bit long. The database copy came up, but returned to a failed state. We have to suspend it and seed it manually: Suspend-MailboxDatabaseCopy -Identity <mailbox database>\<nodename>.
4. Reseed with Update-MailboxDatabaseCopy -Identity "<Mailbox Database>\<Mailbox Server>" -DeleteExistingFiles. Wait until all bytes are transferred across the line. When finished, the suspended state is automatically lifted.
Repeat for the other databases.
5. You would now expect a good state of the DAG and the databases in the Exchange Management Console. Not yet: the file witness share is not back yet.
6. Add the witness share from the Exchange Management Shell: Set-DatabaseAvailabilityGroup -WitnessDirectory "<Server Directory>" -WitnessServer "<Servername>" -Id "<DAG Name>". When the DAG has at least two members the FWS is recreated. This is also visible in Failover Cluster Manager.
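End to end, the rebuild is roughly as follows (same hypothetical DB01 and EXCH-B names; FILESRV and C:\DAGNAME stand in for the real witness server and directory):

# 1. Add the node back to the DAG (and thereby to the cluster)
Add-DatabaseAvailabilityGroupServer -Identity DAGNAME -MailboxServer EXCH-B
# 2. Add the database copy with activation preference 2 (repeat per database)
Add-MailboxDatabaseCopy -Identity DB01 -MailboxServer EXCH-B -ActivationPreference 2
# 3/4. If the copy falls back to Failed, suspend and reseed it
Suspend-MailboxDatabaseCopy -Identity "DB01\EXCH-B" -Confirm:$false
Update-MailboxDatabaseCopy -Identity "DB01\EXCH-B" -DeleteExistingFiles
# 6. Re-register the file witness share
Set-DatabaseAvailabilityGroup -Identity DAGNAME -WitnessServer FILESRV -WitnessDirectory "C:\DAGNAME"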

Root Cause Analysis

Okay, this didn't go as smoothly as described above. Trying to add the cluster node back to the cluster failed with the FWS error again. In the cluster node command output it was noticed that on Node A, Node B is down, and on Node B, Node A is down and Node B is Joining. Hey, wait: there is a split, and the Joining state indicates that Node B is trying to bring up its own cluster. Good that it is failing. When removing the node from the DAG, Kaspersky virus protection lost its connection, as it is configured against the DAG databases. At the same time Node A showed the same errors and something new: RPC server errors with Kaspersky. Ahhhh, Node A's networking services not working correctly is the culprit here. The organisation's admins could not tell whether the networking updates and changes had been followed by a maintenance restart/reboot, so something was probably still lingering in memory. So: inform the users, check the backup and reboot the active node. The node came up fine and, lo and behold, Node B could be added to the failover cluster. At this point I could rebuild the DAG. Health checks are okay, and documented.

– Hope this helps when you have similar actions to perform.

 
