SharePoint Disaster Recovery Failover Techniques

So you’ve got, or are interested in, two SharePoint farms running in parallel and you want to know how to switch users between them when you need to. There are a few ways to do it; choosing the right one largely depends on how quickly you need to be able to fail everyone over, and how much you can invest in an architecture that allows instant failovers.

I say “disaster recovery” but you could just as well call this “passive/active” SharePoint farm failover; either way you have users on one farm and you want them on the other farm for whatever reason. Here are your options.

SharePoint Farm Failover Methods

There are a few ways of performing your failover, depending on how much effort you can spare for setting it all up beforehand.

Content Database Failovers First?

Something you’ll want to think about is the order in which to fail over users & databases, given there’s no clean way of doing both perfectly at the same time.

Generally speaking, for nice & planned failovers in a maintenance window the order for failover should be:

  1. Fail over SQL Server: make the current primary SQL instance the new (readable) secondary, and make the SQL instance behind the previously read-only, passive SharePoint farm the new primary.
    • Once done, the old primary SharePoint farm will be in read-only mode for users. Hopefully they’ll be expecting this.
  2. Fail over users via one of the methods below.
    • Users will start using the secondary farm either immediately or after a delay, depending on which method you use.
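
The ordering above can be sketched as a tiny runbook. This is only an illustration in Python: run_failover, fail_over_sql and fail_over_users are hypothetical names, not SharePoint or SQL Server APIs, and stopping at the first failed step is an assumption of the sketch rather than something the steps above require.

```python
from typing import Callable, List, Tuple

# A step is a label plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step]) -> List[str]:
    """Run the steps in order and return the labels of those that completed.

    Stops at the first failing step (an assumption of this sketch), so users
    are never redirected if the SQL failover did not go through.
    """
    completed: List[str] = []
    for label, action in steps:
        if not action():
            break
        completed.append(label)
    return completed


def fail_over_sql() -> bool:
    # Placeholder: swap the primary and secondary SQL instances here.
    return True


def fail_over_users() -> bool:
    # Placeholder: DNS update, NLB reconfiguration or reverse-proxy change.
    return True


if __name__ == "__main__":
    done = run_failover([
        ("1. fail over SQL Server", fail_over_sql),
        ("2. fail over users", fail_over_users),
    ])
    print(done)
```

In an emergency you’d probably skip the niceties and run both steps regardless; the point here is just that databases go first and users second.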

It’s impossible to guarantee nobody will see a read-only SharePoint until everyone + databases are moved over to the secondary farm, so personally I’d have a “maintenance hour” if this is a planned failover so users can expect a couple of speed-bumps as everything switches.

If it’s an emergency then, well, just fail over ASAP of course; we’re way past caring about perfect user transitions if it’s a firefight.

DNS Updates

This is a pretty standard way of redirecting; you have DNS “A” record www.contoso.com pointing at a network-load-balancer on farm 1. You just update the record to point at farm 2 instead and wait for the clients to notice.


Nothing really that magical here; we just depend on the time-to-live (TTL) value of the “www.contoso.com” A-record. That’s to say, once we update the record the changeover won’t be instant for clients that already have the DNS info cached, which may be an issue.
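
If you want to see when a given machine has picked up the change, something like this rough Python sketch would do. The function names, the IP addresses in the usage note and the polling settings are all made up for illustration; it only reports what the local resolver returns on the machine it runs on, so it says nothing about what other clients still have cached.

```python
import socket
import time
from typing import Callable, Set


def resolve_ips(hostname: str) -> Set[str]:
    """Return the addresses the local resolver currently gives for hostname."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def wait_for_record(
    hostname: str,
    new_ip: str,
    resolve: Callable[[str], Set[str]] = resolve_ips,
    attempts: int = 10,
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until hostname resolves to new_ip; False if attempts run out."""
    for attempt in range(attempts):
        if new_ip in resolve(hostname):
            return True
        # Don't bother sleeping after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

Calling wait_for_record("www.contoso.com", "203.0.113.20") (a documentation-range address standing in for farm 2’s load balancer) would poll every 30 seconds for up to 10 attempts. The resolve and sleep parameters are only there so the function can be exercised without real DNS.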


Pros:

  • Easy to execute.
  • No reliance on fancy architecture.

Cons:

  • Delayed effect: clients with the old record cached won’t switch until its TTL expires.

Network Load Balancer (NLB) Reconfiguration

This is a bit trickier and assumes you have a network-load-balancer for web-front-end traffic in the first place. In short, you reconfigure the NLB to send traffic to the other farm’s web-front-ends instead and that’s it.


Again, the solution isn’t overly exotic but it works with instant effect, with no dependence on clients updating anything. Just make sure the configuration doesn’t at any point send users to both farms as that’ll cause all kinds of havoc (nothing damaging; web-form updates may start to fail for users, for example).

Pros:

  • Instant effect.
  • No reliance on fancy architecture.

Cons:

  • A misconfiguration will give split-farm syndrome – all requests need to go to one or the other farm, without fail.
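
One way to guard against that misconfiguration is to sanity-check the intended pool before applying it. Here’s a minimal Python sketch; the server names and the farm_of inventory are invented for the example, and how you actually read or push the pool depends entirely on your load balancer.

```python
from typing import Dict, List


def validate_pool(pool: List[str], farm_of: Dict[str, str]) -> str:
    """Return the single farm the pool targets, or raise ValueError.

    pool    -- web-front-end names the NLB would send traffic to
    farm_of -- inventory mapping each web-front-end name to its farm
    """
    if not pool:
        raise ValueError("pool is empty")
    unknown = [server for server in pool if server not in farm_of]
    if unknown:
        raise ValueError(f"servers not in inventory: {unknown}")
    farms = {farm_of[server] for server in pool}
    if len(farms) != 1:
        # This is the split-farm case: traffic would hit both farms.
        raise ValueError(f"pool mixes farms: {sorted(farms)}")
    return next(iter(farms))


if __name__ == "__main__":
    inventory = {"wfe1": "farm1", "wfe2": "farm1",
                 "wfe3": "farm2", "wfe4": "farm2"}
    print(validate_pool(["wfe3", "wfe4"], inventory))
```

It only checks the end state you’re about to apply; if your NLB applies member changes one at a time, the window in between is still something to watch for.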

Reverse Proxy Redirect

This is probably the most problem-free option in terms of failover risk, but also the most complicated to set up. It basically just means you have to configure both farms to accept the internal & external URLs (and please, make sure you thoroughly test that this is set up right or bad things can happen).


The nice thing here is that because all the clients are already pointed at the reverse-proxy, you make a single change there, pointing the published application at a new endpoint, and that’s it.
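
Since this approach only works if both farms accept the same internal & external URLs, a quick pre-flight comparison is cheap insurance. The Python sketch below assumes you’ve already exported each farm’s configured URLs into a plain set (from its alternate access mappings, say); the function name, URLs and farm names are all hypothetical.

```python
from typing import Dict, Iterable, Set


def _normalise(url: str) -> str:
    # Assumes root URLs with no path, so lower-casing is safe here.
    return url.rstrip("/").lower()


def missing_urls(
    required: Iterable[str],
    farm_urls: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Map each farm name to the required URLs it isn't configured for.

    Farms that cover every required URL are left out of the result, so an
    empty dict means both farms are ready for the proxy switch.
    """
    wanted = {_normalise(url) for url in required}
    gaps: Dict[str, Set[str]] = {}
    for farm, urls in farm_urls.items():
        configured = {_normalise(url) for url in urls}
        missing = wanted - configured
        if missing:
            gaps[farm] = missing
    return gaps


if __name__ == "__main__":
    required = ["https://www.contoso.com", "http://sharepoint"]
    farms = {
        "farm1": {"https://www.contoso.com/", "http://sharepoint"},
        "farm2": {"https://www.contoso.com"},
    }
    print(missing_urls(required, farms))
```

This only compares configuration lists; it’s no substitute for actually browsing both farms through the proxy on each URL before you rely on it.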

Pros:

  • Instant effect.
  • Simple reconfiguration.

Cons:

  • Complicated to set up.

Things That Can Go Wrong on Failover

Not much really. Item edits may fail if you’re caught mid-failover without realising. Say you opened an edit form on the primary and the failover happened while you were editing; the update would fail when it’s sent to the secondary farm.

Probably the most potentially dangerous is the NLB approach if mishandled, as that could actually send users to both farms. Still though, only one farm is ever writable so nothing too damaging could really happen.

That’s it! Comments welcome.

Cheers,

// Sam Betts