Skip to main content

Asterisk freeze for short time each 5 min

Posted by xxot on Tue, 05/05/2015

Hello, Everyone.
We are using Thirdlane MT realtime version and found a strange issue when asterisk freezes for some time and stop to process sip packets. Time from time in the asterisk/full log appear messages like these:

[2015-05-05 11:35:37] NOTICE[3199] chan_sip.c: Peer '5002-ww' is now UNREACHABLE! Last qualify: 292
[2015-05-05 11:35:37] NOTICE[3199] chan_sip.c: Peer '2001-ww' is now UNREACHABLE! Last qualify: 268
[2015-05-05 11:35:37] NOTICE[3199] chan_sip.c: Peer '803-aa' is now UNREACHABLE! Last qualify: 27
[2015-05-05 11:35:37] NOTICE[3199] chan_sip.c: Peer '1008-ww' is now UNREACHABLE! Last qualify: 258
[2015-05-05 11:35:38] NOTICE[3199] chan_sip.c: Peer '5002-ee' is now Reachable. (726ms / 2000ms)
[2015-05-05 11:35:38] NOTICE[3199] chan_sip.c: Peer '2001-ww' is now Reachable. (845ms / 2000ms)
[2015-05-05 11:35:41] NOTICE[3199] chan_sip.c: Peer '1006-dd' is now UNREACHABLE! Last qualify: 28
[2015-05-05 11:35:41] NOTICE[3199] chan_sip.c: Peer '1000-dd' is now UNREACHABLE! Last qualify: 28

We noticed that the time, when endpoints become unreachable, very often a multiple of 5 (5, 10, 15, etc) and it happens in the second half of the minute (i.e. 11:30:37-60, but almost never 11:30:00-30). In ten seconds or less all endpoints went back to "reachable". Of course, first we suspected the internet failure on the customer side. However, the problem happens with the different tenants in the same time, which is very strange. So, might be a network issue on our side. However, we have another Thirdlane server (not realtime) in the same subnet (on the same internet connection) and this server doesn't have such issues. So, we dumped traffic and sip logs and found that some packets appears in the dump, but not in the asterisk sip log.
Also, we are using Cisco SPA phones which have keepalive NOTIFY option and very often customers complain that the line buttons become orange, which means "no response from the server".
We dive deeper and found that it might be server io issue, as mysql updating database each time the endpoint trying to register and disk utilization sometimes jumping over 50%. So, we moved sipregs table in the RAM, and it helped a little bit, but not completely. I still can see that endpoints go down time from time. If anyone has similar issue (just run "cat /var/log/asterisk/full | grep -i reach" to see if you have it) share your experience.


Submitted by jakevxobee on Fri, 06/12/2015 Permalink

Unreachable phones are almost always a networking issue...

Are these phones local? What type of router are you using?

If they are remote have you adjusted UDP session timers and/or SIP ALG?

If it consistently happening around a 5, 15 or 30 minute mark I suspect you are running into a session timer issue.

Submitted by xxot on Thu, 06/25/2015 Permalink

It is mulitenant solution and server located in the data center. Phones belong to different customers and are in the different network behind different routers. Another server in the same subnet does not have these issues. The only difference - it is not a realtime architecture.
We spent decent amount of time on troubleshooting, but didn't find where the problem is. One of the most helpful command which can show that the problem in the asterisk:
for k in $(seq 1 10000); do date; netstat -plan | grep 5060; sleep 1; done;
Each 5 minutes network buffers for port 5060 become overfilled and the system wait for asterisk to process these packets. In the same time you can see that phones goes down in the logs. It can be clearly see in the output:

Proto Recv-Q Send-Q Local Address Foreign Address
Thu Jun 25 10:55:28 EDT 2015
udp 0 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:55:29 EDT 2015
udp 0 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:55:30 EDT 2015
udp 0 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:55:31 EDT 2015
udp 0 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:55:32 EDT 2015
udp 0 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:55:33 EDT 2015
udp 9984 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:55:34 EDT 2015
udp 29184 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:55:35 EDT 2015
udp 130560 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:55:36 EDT 2015
udp 229632 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:55:37 EDT 2015
udp 229632 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:55:38 EDT 2015
udp 229632 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:55:39 EDT 2015
udp 229632 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:55:40 EDT 2015
udp 230656 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:55:41 EDT 2015
udp 0 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:55:42 EDT 2015
udp 0 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:55:43 EDT 2015
udp 0 0 8.8.4.9:5060 0.0.0.0:*

Submitted by xxot on Thu, 06/25/2015 Permalink

Just another part of the log. One of the interesting things that it happens only after 30 seconds from the beginning of minute, i.e. 10:00:30, 10:05:32. 10:10:31. Looks like some internal cron running in the asterisk process and doing something each 5 min. There is no network spikes in this time, so something happens inside the asterisk.

Thu Jun 25 10:25:31 EDT 2015
udp 0 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:32 EDT 2015
udp 0 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:33 EDT 2015
udp 0 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:34 EDT 2015
udp 6912 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:35 EDT 2015
udp 2304 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:36 EDT 2015
udp 59904 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:37 EDT 2015
udp 211968 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:38 EDT 2015
udp 230400 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:39 EDT 2015
udp 230400 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:40 EDT 2015
udp 230400 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:41 EDT 2015
udp 230400 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:42 EDT 2015
udp 230400 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:43 EDT 2015
udp 230400 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:44 EDT 2015
udp 230400 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:45 EDT 2015
udp 230400 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:47 EDT 2015
udp 230400 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:48 EDT 2015
udp 230400 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:49 EDT 2015
udp 230400 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:50 EDT 2015
udp 230400 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:51 EDT 2015
udp 230400 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:52 EDT 2015
udp 18432 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:53 EDT 2015
udp 4608 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:54 EDT 2015
udp 88320 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:55 EDT 2015
udp 2304 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:56 EDT 2015
udp 6912 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:57 EDT 2015
udp 4608 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:58 EDT 2015
udp 0 0 8.8.4.9:5060 0.0.0.0:*
Thu Jun 25 10:25:59 EDT 2015
udp 0 0 8.8.4.9:5060 0.0.0.0:*

Submitted by xxot on Mon, 08/31/2015 Permalink

The problem was solved. It happened because of CDR database was located on another server. As soon, as we moved CDRs to the local server the problem disappeared.
PS. remote server which was used for CDRs, located in the same subnet, in the same rack, and there was no delays in network or any other visible issues with the performance neither on asterisk, nor on the database server.

Submitted by jakevxobee on Mon, 10/05/2015 Permalink

I have also seen a CDR with tables that need to be optimized slow a system down.

After optimizing the database the system was fine again.