
HELP! - Server crashing periodically

Posted by devand_cloud on Thu, 01/24/2013

I was hoping there might be someone on this forum that could help me determine why one of our Asterisk servers periodically deadlocks or crashes.

I believe that I have the necessary coredump files to debug the issue, but I don't know my way around them well enough.

Any help would be much appreciated. The resolution of this issue is of the utmost urgency.

Thanks in advance!


Submitted by devand_cloud on Tue, 01/29/2013 Permalink

Thank you both for your replies.

@trinicom: When you say "auto-reboot", are you referring to the Asterisk service or the entire server?

Submitted by eeman on Tue, 01/29/2013 Permalink

I would definitely not recommend that solution. Ask any Unix administrator and they will cite case after case of servers with uptimes of 900+ days without issue. That's generally a Windows-style approach that stems from lots of experience dealing with software written by a greedy mega-corporation that doesn't care one bit that they con people into bloated code written by programmers who don't do adequate memory management, resulting in memory leaks. A Windows admin will often suggest weekly reboots as a workaround for being forced to deal with poorly written software that he has no other control over. Kernel upgrades are the only reason I have ever, in 17 years, had to reboot a Unix box.

Of all the patch notes involving a crash or a deadlock, exactly none have anything to do with memory leaks. They are event-driven results. That means it isn't going to make a blind bit of difference whether you rebooted 15 minutes ago. If you get X event under Y conditions, you are going to crash. The same goes for the deadlocks. The patch notes describe in detail, after the fact, what conditions caused the crash/deadlock and how they changed it. Your only solution to crashes and deadlocks is to get on a version that doesn't cause you problems. That generally means downloading and compiling. Some deadlocks don't even manifest until a certain level of usage, simply because the conditions that need to be met are infrequent. For example, in the 1.8 branch up until 1.8.9 there were deadlocks associated with SIP hint updates during a reload. Well, these become more frequent depending on A) how many BLF buttons you have provisioned on the server, B) how high a call volume you have to those watched subscriptions, and C) how often your admin portal gets used and triggers reloads.
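To make the "event-driven" point concrete, here is a minimal sketch in Python. It is not taken from Asterisk (which is written in C), and the lock names are hypothetical stand-ins for "hint state" and "reload state". Two tasks take the same two locks in opposite order. They only get stuck when both happen to be holding their first lock at the same moment, which is a matter of timing and load, not of how long the process has been up. The timeout exists only so the sketch terminates; a real deadlock would simply hang.

```python
import threading


def would_deadlock(overlap: bool, timeout: float = 0.2) -> bool:
    """Return True if either task failed to get its second lock."""
    hint_lock = threading.Lock()    # hypothetical: protects hint/BLF state
    reload_lock = threading.Lock()  # hypothetical: protects config during reload
    holding = threading.Barrier(2)  # forces both tasks to hold their first lock
    tried = threading.Barrier(2)    # keeps first locks held until both have tried
    results = []

    def task(first, second):
        with first:
            if overlap:
                holding.wait()
            # A real deadlock would block here forever; we give up after a timeout.
            got = second.acquire(timeout=timeout)
            results.append(got)
            if overlap:
                tried.wait()
            if got:
                second.release()

    a = threading.Thread(target=task, args=(hint_lock, reload_lock))
    b = threading.Thread(target=task, args=(reload_lock, hint_lock))
    if overlap:
        # The two events coincide: each task holds what the other needs.
        a.start()
        b.start()
        a.join()
        b.join()
    else:
        # The two events happen one after the other: no conflict.
        a.start()
        a.join()
        b.start()
        b.join()
    return not all(results)


if __name__ == "__main__":
    print("events coincide:", would_deadlock(overlap=True))
    print("events apart:   ", would_deadlock(overlap=False))
```

The more often both kinds of event occur (more BLF subscriptions, more calls, more reloads), the more often they coincide, which matches the A/B/C factors listed above.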

Submitted by trinicom on Tue, 01/29/2013 Permalink

We reboot the whole box, and since then we have almost never had another Asterisk issue.

I agree with Eeman that it is not the best solution, but we have been waiting for 1.8 to mature and for the next version of Thirdlane before going through a whole upgrade. In the meantime it has saved us a lot of hassle.

And yes, we have a bunch of BLF customers.

Submitted by trinicom on Tue, 01/29/2013 Permalink

Oh, FYI: our edge voice box, which runs only Asterisk 1.6.1.12 (no Thirdlane), has been up for 583 days. We are running 1.6.1.11 on the Thirdlane box.

Submitted by eeman on Tue, 01/29/2013 Permalink

The 1.6 branches were considered developmental and not really slated for production by Digium. To illustrate how significant that is, in 1.6.2.11 they introduced a really nasty deadlock related to the BLF example I cited earlier. They didn't fix it until 1.8.8, and even though they hadn't closed out the code in the 1.6.2 branch, they chose never to backport the fix. Once 1.8 became completely stable I encouraged my customers to upgrade onto it. Your edge box is likely working in an environment where, due to its role, the events that are causing your crashes don't occur. If your edge box is doing basic bridging, then bugs related to BLFs, voicemail, or queues most likely won't affect you :-)
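As a rough illustration only, the range quoted above (introduced in 1.6.2.11, fixed in 1.8.8) can be turned into a quick version check. This is a sketch in Python with hypothetical helper names. It assumes every plain release between those two bounds carries the bug, which the release notes would have to confirm, and it only understands plain dotted version numbers, not certified-branch names.

```python
def parse_version(text: str) -> tuple:
    """Turn '1.6.2.11' into (1, 6, 2, 11). Raises ValueError on suffixes like '-cert'."""
    return tuple(int(part) for part in text.split("."))


# Bounds as quoted in this reply; treat them as assumptions, not verified facts.
FIRST_AFFECTED = parse_version("1.6.2.11")
FIRST_FIXED = parse_version("1.8.8")


def in_affected_range(version: str) -> bool:
    """True if the version sorts at or after FIRST_AFFECTED and before FIRST_FIXED."""
    return FIRST_AFFECTED <= parse_version(version) < FIRST_FIXED


if __name__ == "__main__":
    for v in ("1.6.1.11", "1.6.1.12", "1.6.2.11", "1.8.7", "1.8.8", "1.8.15"):
        print(v, in_affected_range(v))
```

Under those assumptions, the 1.6.1.x versions mentioned earlier in the thread fall outside the range, since the bug is described as arriving later, in the 1.6.2 branch.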

Submitted by eeman on Tue, 01/29/2013 Permalink

Aside from one voicemail-related crash that occurred when people actually used the other folders in VM, the 1.8.11-certified branch was OK. They fixed that bug in regular 1.8.14 but never backported it. I don't even know if they knew they fixed it or if they fixed something else and it went away. I never could find a patch note. The latest 1.8 regular branch (19?) hasn't reported any issues, and I moved people off 1.8.11-certified onto the regular branch from around 1.8.15 onward without issue. I recently rolled out the 1.8.15-certified branch that they _just_ launched, and it seems to be doing fine. They fixed the voicemail issue and I haven't heard of any other issues with it. If certified works, I'm going to try to stay with that branch so I can keep using the DPMA on my single-tenant boxes.