
Live replication failover

Posted by andrewyager on Mon, 04/20/2009

Hi,

I'm toying with an idea for live failover of an MTE install. Basically, clients would connect to the MTE server using a SIP SRV record (only SIP clients).

A secondary server would periodically rsync the client and Asterisk configuration from the primary server, and be listed with a higher priority number in the SRV lookup. (By "higher priority" I mean a bigger priority number, which clients treat less favorably.)
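The SRV ordering described above can be sketched roughly as follows. This is only an illustration: the host names `pbx1.example.com` and `pbx2.example.com`, the priority values, and the helper `failover_order` are all hypothetical, and SRV weights are ignored (a real client following RFC 2782 also uses weight among targets of equal priority).

```python
from typing import List, NamedTuple


class SrvRecord(NamedTuple):
    priority: int  # lower number is tried first
    weight: int    # ignored in this sketch
    port: int
    target: str


def failover_order(records: List[SrvRecord]) -> List[str]:
    """Return targets in the order a client would try them:
    lowest priority number first."""
    return [r.target for r in sorted(records, key=lambda r: r.priority)]


# Hypothetical records: primary at priority 10, secondary at priority 20.
records = [
    SrvRecord(priority=20, weight=0, port=5060, target="pbx2.example.com"),
    SrvRecord(priority=10, weight=0, port=5060, target="pbx1.example.com"),
]

print(failover_order(records))
```

The client only moves on to the second target when it considers the first one unreachable, which is a decision each phone makes on its own.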

If the primary server goes down, theoretically, the device should re-register against the secondary (the server with the higher priority number) and continue to work. The client would lose the ability to modify configuration, but would retain the ability to make calls.

Does this seem reasonable? Are there any pitfalls I might not be aware of? Is anyone else using a similar configuration?

Thanks,
Andrew


Submitted by eeman on Mon, 04/20/2009 Permalink

It won't work; there are a dozen pitfalls, all of which have been discussed in this forum many times. Unless, of course, you are only doing residential, where every SIP extension is a stand-alone registration that doesn't care if other members of its tenant are still registered to the other machine.

Submitted by andrewyager on Mon, 04/20/2009 Permalink

My intention here is total system failure of system 1 to system 2, rather than a partial unavailability. The theory is that any endpoint would re-register with the other server.

With regard to the registration issue, it strikes me that you may be able to avoid the problem with split registrations if you made use of realtime config with the sipregistrations table in Asterisk 1.6 (it looks as though it might be in 1.4, but I haven't poked too hard).

Submitted by eeman on Mon, 04/20/2009 Permalink

They will jump to the other server even without total system failure. Just consider how they go unavailable for a few seconds every now and then. Sometimes a phone will jump to its second registration server and create a 'split brain' condition. Using heartbeat is a much more effective method of failover, as long as you properly fence the other server to avoid split-brain conditions. As for 'rtupdate' in sip.conf, have you considered that you'd have two servers trying to re-up with the same handset? Fencing, fencing, fencing.

Shoot The Other Node In The Head (STONITH)
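The fencing rule can be sketched as a simple guard. This is a minimal illustration, not heartbeat's actual implementation; the function and parameter names are hypothetical. The point is that a silent peer alone is not enough for the standby to take over: the peer must also be confirmed fenced (powered off or otherwise isolated).

```python
def may_take_over(peer_heartbeat_alive: bool, fence_succeeded: bool) -> bool:
    """Decide whether the standby node may start serving.

    The standby takes over only when the peer has stopped answering
    heartbeats AND the fencing action against it has been confirmed.
    Taking over on a missed heartbeat alone risks both nodes serving
    at once (split brain).
    """
    return (not peer_heartbeat_alive) and fence_succeeded


# Peer silent but fencing unconfirmed: stay passive.
print(may_take_over(peer_heartbeat_alive=False, fence_succeeded=False))
# Peer silent and fenced: safe to take over.
print(may_take_over(peer_heartbeat_alive=False, fence_succeeded=True))
```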

Submitted by nocadmin on Tue, 04/28/2009 Permalink

Here is how I'm handling the two-server failover setup with Thirdlane currently. This covers our butt for most major issues; thankfully, it has only kicked in maybe once a year.

I use heartbeat to fail over a floating IP address between two servers.

Asterisk is completely dead on the secondary box until heartbeat starts it up when the first node fails (along with webmin for the users). Some licensing issues with Thirdlane are to be expected because of the MAC address change.

Sync astdb every minute between servers. This includes all the user settings and SIP registrations (user IP/port, important for keeping users registered between failovers).

Ideally, store your Asterisk configs/spools on a central storage unit of your choosing, or rsync often.

When Asterisk starts up on the second box, it binds to the floating IP (set this in sip.conf so it only uses that IP), loads the current SIP registrations, and keeps talking to the phones via the current floating IP. The key thing is keeping the phones registered so you don't have to wait for your expire time or the phones' configured re-registration time.

Calls in progress are dropped during failover, but phones don't need to reregister, NAT timeouts on the remote routers aren't an issue (qualify keeps running), and things pretty much work as before.

-Darryl

Submitted by eeman on Wed, 04/29/2009 Permalink

nocadmin:

Unless you paraphrased, it sounds like your backup unit did not cover replication of webmin data, which means no one could log back into PBX Manager. The same goes for the phone provisioning directory and your MySQL database. Have a look at DRBD; it works a lot better for replication than scheduled rsyncs. One thing not mentioned often is that auto_failback is best left disabled, so that you don't have a second round of dropped calls until a time of the admin's choosing.
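The effect of leaving auto_failback disabled can be sketched as a small decision function. This is an illustration of the behavior only, not heartbeat's code; the node labels and the `next_active` helper are hypothetical.

```python
from typing import Optional


def next_active(current: str, primary_up: bool, secondary_up: bool,
                auto_failback: bool = False) -> Optional[str]:
    """Pick which node should hold the floating IP next.

    current is "primary" or "secondary". Returns None if neither
    node is available.
    """
    if current == "primary":
        if primary_up:
            return "primary"
        return "secondary" if secondary_up else None

    # The secondary is currently active.
    if not secondary_up:
        return "primary" if primary_up else None
    if primary_up and auto_failback:
        # Moving back automatically drops calls a second time.
        return "primary"
    # Stay put until the admin moves service back deliberately.
    return "secondary"


# Primary has recovered after a failover; service stays on the secondary.
print(next_active("secondary", primary_up=True, secondary_up=True))
```

With `auto_failback=True` the same call would hand service straight back to the primary, which is the second round of dropped calls mentioned above.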

P.S. It is now possible to buy a special license that works with two MAC addresses. This allows the secondary box to be a fully serving PBX Manager interface, no different from the primary.

Submitted by nocadmin on Wed, 04/29/2009 Permalink

Yes, I shortened it. Webmin is in there as well.

Phone provisioning is handled on a different cluster of servers for FTP/TFTP/HTTP provisioning (as we have multiple clusters of Thirdlane, but one configuration host for phones).

I looked at DRBD, but will be moving towards a central location/nfs mount of some sort down the road instead as it is more stable overall for our network.

We have a special license. I wasn't sure if that was public knowledge (i.e., whether this was something specially set up for us years ago when we first implemented this), so I didn't want to announce it if not :)

-Darryl

Submitted by gerteizinga on Wed, 05/06/2009 Permalink

Hi

We have a similar setup. We use two boxes running SYN3 (https://www.syn-3.eu). This distribution has its own failover out of the box (based on DRBD).

All Asterisk and webmin files are written to the filesystem running on both boxes. The heartbeat detects the live machine, which uses the physical IP address. All phones register to that IP address.

In case of a failure the 'live' machine will become 'failover' and the 'failover' machine will become 'live'.

In this setup you need two Thirdlane licences, but the failover avoids a lot of headaches.

Regards

Gert

Submitted by thirdlane on Mon, 05/18/2009 Permalink

In order to support failover (and data replication) we recently made a change in PBX Manager that allows 2 MACs in the license file. You still need to purchase 2 licenses but if you need to use them in this environment you can request a license file with 2 MACs so that the same license file can be installed on both servers (and included in replication).