reload of sip.conf causes peers to lag

Posted by jason@demand.co.za on Fri, 09/20/2013

Hey

I have an issue where, after a sip reload, extensions go continuously lagged until I restart the service. I imagine this is because Asterisk is so busy taking requests while it reloads that it locks up when it comes back.

Is this just a limitation of using flat files, with the only solution being to migrate to a MySQL config? Or is there a way around it?

We've got around 2000 peers in sip.conf.


Submitted by jason@demand.co.za on Wed, 10/16/2013

Yeah, we sat under 2000 for ages, but at or around 2000 and over it has started happening more and more...

Often a reload will go by fine and there won't be a problem... but every now and then all SIP peers will just go unreachable/lagged.

Submitted by ericH on Mon, 10/21/2013

What are you using as peers?

You mentioned all SIP peers will go unreachable/lagged - is that an absolute, or does only a certain percentage go down?

Could be unrelated, but I saw something similar after upgrading our PBX version: a certain number of peers went unreachable/lagged for about 30 seconds. I found out that they were the Polycom phones and AudioCodes devices.

Adtran TA9XX, SPA2102, Cisco IP phones, and soft phones didn't exhibit the problem.

Upgrading the firmware on the AudioCodes devices and Polycom phones resolved the issue. It had to do with the nonce and things being out of sequence on the reply.

Submitted by jason@demand.co.za on Wed, 10/23/2013

Okay, "all peers" might have been an exaggeration, but I would say around 50% at the very least. It'll obviously take a while to reach that point, but given enough time it'll get there. The peers are a mixture, ranging from phones run over WAN links (Snoms and Yealinks generally) to trunks with an OpenSIPS box on the same LAN.

Submitted by eeman on Mon, 12/02/2013

Sometimes the biggest delay isn't the ping/pong keepalive sent to the devices, but the toll that SIP subscriptions (aka BLF keys) take on the system during a reload. I have found that the more BLF usage a system has, the fewer handsets it can tolerate. Another lag-inducing issue comes from trying to 'work around' NAT issues by changing the registration period from 3600 seconds (1 hr) down to just 60 seconds. At 2000 handsets you're looking at roughly 33 registrations every second. Every new registration has a different port mapped into the registry, completely different from the previous registration just 60 seconds ago. This forces disk I/O to become part of the equation, at the rate of roughly 33 writes every second to the ASTDB. The net result is that this workaround will ultimately diminish your scaling capacity by a significant amount.

Are you using either of the above two features in any significant volume? If so, then you've likely reached the scaling limit of your design and will need to bring on another server for new clients.
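
For anyone wanting to check the registration-rate arithmetic above against their own peer count, here is a rough Python sketch. It assumes every peer re-registers exactly once per expiry period and that registrations are spread evenly over time; the function name is made up for illustration. Real phones often refresh somewhat before the expiry runs out, so treat the result as a lower bound rather than a measurement.

```python
def registrations_per_second(peers: int, expiry_seconds: int) -> float:
    """Average REGISTER rate, assuming one re-registration per peer per expiry period."""
    if expiry_seconds <= 0:
        raise ValueError("expiry_seconds must be positive")
    return peers / expiry_seconds


if __name__ == "__main__":
    peers = 2000  # peer count mentioned in this thread
    for expiry in (3600, 60):
        rate = registrations_per_second(peers, expiry)
        print(f"{expiry:>4}s expiry: {rate:.1f} registrations/sec")
```

With 2000 peers, a 3600-second expiry works out to well under one registration per second, while a 60-second expiry is about 33 per second, each of which may also mean a write to the ASTDB.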