
Asterisk 1.8.5.0 released

Posted by eeman on Tue, 07/12/2011

some highlights -

chan_local: 10 lines Fixes chan_local crashes in local_fixup(). Thanks OEJ for tracking down the issue and submitting the patch.

app_mysql.c: Fix a crash in the MySQL() application. This code was not handling channel datastores safely. The channel must be locked.

app_directed_pickup.c, main/features.c: Crash when using directed pickup applications. The directed pickup applications can cause a crash if the pickup was successful because the dialplan keeps executing.

features.c: Crash while transferring a call during DTMF feature timeout. When a call is being attended transferred during the time between AST_FRAME_DTMF_BEGIN and AST_FRAME_DTMF_END, the transferred channel becomes a zombie (so tech data is not available), making ast_dtmf_stream() segfault when it tries to send the DTMF digit (at least with SIP channels).

channel.h, main/channel.c, main/features.c: Give zombies a safe channel driver to use. Recent crashes from zombie channels suggest that they need a safe home to go to. When a masquerade happens, the physical part of the zombie channel is hung up. The hangup normally sets the channel private pointer to NULL. If someone then blindly does a callback to the channel driver, a crash is likely because the private pointer is NULL. The masquerade now sets the channel technology of zombie channels to the kill channel driver.

chan_sip.c: Fix deadlock with attended transfer of SIP call. Call path: sip_set_rtp_peer

chan_sip.c: Remove the need for deadlock avoidance in chan_sip do_monitor. Deadlock avoidance between the sip pvt and the pvt->owner is very difficult. Now that channels are ao2 objects, this complication is no longer necessary.

chan_local.c: chan_local: resolve a deadlock. This patch resolves a fairly complex deadlock that can occur with the combination of chan_local and a dialplan switch, such as dynamic realtime extensions, which pulls autoservice into the picture when doing a dialplan lookup.

chan_sip.c: Resolves a deadlock that occurs during sip_new. This is based on an uncommitted patch by jpeeler for the issue.

chan_sip: fix a deadlock in check_rtp_timeout. Don't block doing silly deadlock avoidance. Just return and try again later. The function gets called often enough that it's fine. Also, this change was already made in trunk.

app_directed_pickup.c, main/features.c: Remove potential deadlock in call pickup race. Deadlock is possible in ast_do_pickup() when holding the target channel lock and trying to get the chan channel lock

chan_sip.c: Unlock the sip channel during fax detection like chan_dahdi does to prevent a deadlock with ast_autoservice_stop
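The check_rtp_timeout fix above replaces a blocking deadlock-avoidance loop with "skip this pass and try again next time". A minimal Python sketch of that idea follows. It is not the chan_sip code; `Session` and `check_timeout` are made-up names used only for illustration.

```python
import threading

class Session:
    """Hypothetical stand-in for a SIP dialog and its lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.checks_done = 0

def check_timeout(session):
    """Run one periodic check; give up immediately if the lock is busy."""
    # Non-blocking acquire: returns False at once instead of waiting.
    if not session.lock.acquire(blocking=False):
        return "retry later"
    try:
        session.checks_done += 1
        return "checked"
    finally:
        session.lock.release()

session = Session()

# Someone else holds the lock: the periodic check backs off.
session.lock.acquire()
busy_result = check_timeout(session)
session.lock.release()

# The lock is free on the next pass: the check runs.
free_result = check_timeout(session)
print(busy_result, free_result, session.checks_done)
```

Because the check is called often, a skipped pass costs little, which is the same reasoning the release note gives.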


Submitted by eeman on Tue, 07/26/2011

I have 4 STE boxes running it without issue, but those STE boxes have a max of around 10 concurrent calls.

Someone else was preaching doom and gloom with crashing due to 'timerfd'. Here is something to note about timerfd:

1) It's not part of the CentOS 5.x kernel tree. You need kernel 2.6.25 or higher to even support it, which means those complaining about the instability are most likely running Debian or a Debian-based distro, or downloaded a stock kernel from kernel.org, making it no longer a CentOS distro.

2) It's the newest of the 3 timing sources and does seem to have issues, but then again so does pthreads in some versions.
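To see which side of the 2.6.25 line a box falls on, the kernel release string can be compared against that minimum. This is only a sketch: the helper names are made up, the 2.6.25 figure is the one quoted above, and the two release strings are example values (an el5 kernel and a squeeze-style kernel).

```python
import re

TIMERFD_MIN_KERNEL = (2, 6, 25)  # minimum version quoted in the post above

def kernel_tuple(release):
    """Turn a release string such as '2.6.18-238.19.1.el5' into (2, 6, 18)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part or 0) for part in match.groups())

def kernel_has_timerfd(release):
    """True if the kernel is at least the minimum version for timerfd."""
    return kernel_tuple(release) >= TIMERFD_MIN_KERNEL

centos5 = kernel_has_timerfd("2.6.18-238.19.1.el5")   # example el5 kernel
squeeze = kernel_has_timerfd("2.6.32-5-amd64")        # example squeeze kernel
print(centos5, squeeze)
```

On a live system, `platform.release()` from the standard library returns the string to feed in.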

Submitted by olekaas on Tue, 08/09/2011

Upgraded from 1.4 - 32bit and 64bit systems

Debian GNU/Linux, kernel 2.6.32 (squeeze, fully updated)

The process has roughly been:

1. download and unpack 1.8 source
2. ./configure
3. make menuselect
4. figure out missing packages; go back to step 2 until all wanted features can be selected.
5. make
6. wipe modules dir (we don't want to auto load 1.4 mods on 1.8 - duh ;)
7. make bininstall
8. restart *
9. update TL from webmin - auto detecting 1.8
10. correct directory structure in sounds - donated from another server with native 1.8 install.
11. check logs for deprecated syntax - do search and replace in config files.

Do some tests to see things are working. Wait for explosions under heavy load. Fix any cause of explosion and redo step 11. Findings so far:

DO NOT use timerfd - Asterisk becomes unresponsive to SIP requests after some time (32bit and 64bit)
DO NOT issue a *full* reload under load (ie 100+ concurrent calls)

The CPU usage seems to be a little bit higher but the load avg (peak) has dropped from 1.5 to 0.2!

/Ole

Submitted by eeman on Tue, 08/09/2011

timerfd is a Debian-only issue; anyone running CentOS 5.x won't have res_timing_timerfd installed.

The biggest syntax fix is in inbound_action.include. Those files in 1.4 will have a SetMusicOnHold command that needs replacing. Using vim you can search/replace with this string:

:s/SetMusicOnHold(/Set(CHANNEL(musicclass)=/

If you don't know how search/replace works in vim (yum install vim-enhanced), run vimtutor until you get to that section. The above command followed by the number of lines to apply it to will cover the entire file (prefixing it with % instead, as in :%s/.../.../, also applies it to every line). Make a backup copy before you start just in case.

One server I helped upgrade had over 500 inbound routes. No way was I going to republish those by hand, so I used vim with its built-in search/replace feature.
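For anyone who would rather script the same substitution than do it in vim, here is a rough Python equivalent. It is a sketch, not a tested migration tool: the function names are made up, the sample dialplan line is invented, and `convert_file` keeps a .bak copy before touching anything.

```python
import re
import shutil
from pathlib import Path

# Same pattern and replacement as the vim command above.
OLD_CALL = re.compile(r"SetMusicOnHold\(")
NEW_CALL = "Set(CHANNEL(musicclass)="

def convert_text(text):
    """Return (new_text, number_of_replacements)."""
    return OLD_CALL.subn(NEW_CALL, text)

def convert_file(path):
    """Rewrite one config file in place, keeping a .bak copy first."""
    path = Path(path)
    new_text, count = convert_text(path.read_text())
    if count:
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        path.write_text(new_text)
    return count

# Invented sample lines, only to show the effect.
sample = (
    "exten => s,1,SetMusicOnHold(tenant1)\n"
    "exten => s,n,Dial(SIP/100)\n"
)
converted, replaced = convert_text(sample)
print(converted, replaced)
```

The closing parenthesis of the old call is left alone, so it ends up closing the new Set() call, exactly as with the vim substitution.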

Are you still getting deadlocks in 1.8.5.0 with a full reload? They removed a ton of deadlock avoidance in this version, so I am hopeful that it resolved that particular problem.

Submitted by olekaas on Tue, 08/09/2011

I have a small cronjob that will check and do a full reload if required (every 10 min). After disabling that, no crashes at all. We had 11 crashes that day until it was disabled around 13:30. Frequency increased with load.

You could use that syntax with sed too. I used emacs though - it reported close to 4000 replacements...

I placed a noload => res_timing_timerfd.so in modules.conf to fix the issue. timerfd is available after kernel 2.6.26 (afair) and I'm currently running 2.6.32. I'm wondering if there are any Asterisk-related benefits of upgrading to a more recent kernel version?

I'm looking forward to the next patch release, where this fix might be included:

https://reviewboard.asterisk.org/r/1313/

Scroll down to the last comment. Priceless.

/Ole

Submitted by eeman on Tue, 08/09/2011

What makes me skeptical about that patch is that they say it's been a problem since 1.4, but these high-load deadlocks didn't start until 1.6.2.12. It sounds like you have the highest probability of reproducing this bug, though, so if you test once 1.8.6-rc1 comes out and confirm the problem disappears, please post that news here. With most of my customers I can do a reload a bunch of times and still not get a deadlock, even under load, but they still experience the problem eventually.

Submitted by justdave on Tue, 08/09/2011

We've been getting the deadlock on reload pretty reliably. When we were running 1.4, my desktop support people could add and remove extensions and do a reload when they were done, and the changes would take effect immediately. Since updating to 1.8.x, we've had to have them tell people it'll take effect overnight, and cron the reload for late evening (when the system is unlikely to be under load if it dies and we have to kick it).

I'm very much looking forward to testing this fix. :)

Submitted by eeman on Fri, 08/12/2011

release note highlights

* channels/chan_sip.c: Optimization to buffer initialization fix.

* main/channel.c: Prevent double masquerading of channels when one
is being hung up and deadlock avoidance is used. There is a race
condition in ast_do_masquerade / ast_hangup (at least). Reported
by me, signed off by schmidts with input from David Vossel.
Review: https://reviewboard.asterisk.org/r/1323/

* channels/chan_sip.c: Fix a SIP transfer deadlock. The locking in
this function is very scary. There are like 6 structs involved.
(closes issue AST-470)

* main/pbx.c: Deadlocks dealing with dialplan hints during reload.
There are two remaining different deadlocks reported dealing with
dialplan hints. The deadlock in ASTERISK-17666 is caused by
invalid locking order in ast_remove_hint(). The hints container
must be locked before the hint object. The deadlock in
ASTERISK-17760 is caused by a catch-22 situation in
handle_statechange(). The deadlock is caused by not having the
conlock before calling the watcher callbacks. Unfortunately,
having that lock causes a different deadlock as reported in
ASTERISK-16961.
  * Fixed ast_remove_hint() locking order.
  * Made handle_statechange() no longer call the watcher callbacks
    holding any locks that matter.
  * Made hint ao2 destructor do the watcher callbacks for extension
    deactivation to guarantee that they get called.
  * Fixed hint reference leak in ast_add_hint() if the callback
    container constructor failed.
  * Fixed hint reference leak in complete_core_show_hint() for every
    hint it found for CLI tab completion.
  * Adjusted locking in ast_merge_contexts_and_delete() for safety.
  * Added context_merge_lock to prevent
    ast_merge_contexts_and_delete() and handle_statechange() from
    interfering with each other.
  * Fixed ast_change_hint() not taking into account that the
    extension is used for the hash key.
(closes issue ASTERISK-17666) Reported by: irroot Tested by: irroot
JIRA SWP-3318
(closes issue ASTERISK-17760) Reported by: Byron Clark Tested by:
irroot JIRA SWP-3393
Review: https://reviewboard.asterisk.org/r/1313/
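Two of the ideas in that pbx.c note are easy to show in miniature: always take the container lock before the object lock, and never run watcher callbacks while holding either. The Python sketch below only illustrates those two rules. `Hint`, `HintTable` and the method names are invented and do not mirror the real pbx.c structures.

```python
import threading

class Hint:
    def __init__(self, exten):
        self.exten = exten
        self.lock = threading.Lock()
        self.watchers = []

class HintTable:
    def __init__(self):
        self.lock = threading.Lock()  # container lock: always taken first
        self.hints = {}

    def add(self, exten, watcher):
        with self.lock:
            hint = self.hints.setdefault(exten, Hint(exten))
            with hint.lock:
                hint.watchers.append(watcher)

    def remove(self, exten):
        with self.lock:                  # container first ...
            hint = self.hints.pop(exten, None)
            if hint is None:
                return False
            with hint.lock:              # ... then the hint itself
                hint.watchers.clear()
            return True

    def state_changed(self, exten, state):
        # Copy the watcher list under the locks, then let go of them.
        with self.lock:
            hint = self.hints.get(exten)
            if hint is None:
                return 0
            with hint.lock:
                watchers = list(hint.watchers)
        # Callbacks run with no locks held, so they may call back in.
        for watcher in watchers:
            watcher(exten, state)
        return len(watchers)

table = HintTable()
seen = []

def watcher(exten, state):
    seen.append((exten, state))
    table.remove(exten)  # re-enters the table; would hang if locks were held

table.add("100", watcher)
notified = table.state_changed("100", "busy")
after_removal = table.state_changed("100", "idle")
print(notified, after_removal, seen)
```

The watcher here deliberately calls back into the table. With the callbacks made under the container lock, that call would block forever, which is the catch-22 the note describes.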

* apps/app_directed_pickup.c: Update PickupChan documentation. The
PickupChan uses the ampersand as the argument separator.
Was documented as:
PickupChan(channel[,channel2[,...][,options]])
Fixed documentation to:
PickupChan(Technology/Resource[&Technology2/Resource2[&...]][,options])
This is a continuation of ASTERISK-17494 for v1.8 and later.
(closes issue ASTERISK-18144) Reported by: Erik Smith
Patches: pickupchan_ducumentation-v2.patch (License #6263) patch
uploaded by Erik Smith
Tested by: Erik Smith

* res/res_timing_timerfd.c: Reverts fix for timerfd locking issue.
jrose discovered a performance issue with this fix that prevents
his analog phones from working when using timerfd as a timing
source. Until it is understood what is causing this performance
problem, this patch is being reverted.

* channels/chan_sip.c: Better way to get chan and pvt lock for
issue ASTERISK-17431. Redoes -r308945 for issue ASTERISK-17431
deadlock fix for sip_set_udptl_peer() and sip_set_rtp_peer().
  * Lock the channels in the defined order and avoid the need for a
    deadlock avoidance loop.
  * Lock the channel before getting the pointer to the private
    structure to be sure that the pointer will not change due to a
    masquerade or channel hangup.
  * To preserve sanity, check that chan and p->owner are the same.
    (Pointer rearrangements should not happen without the protection
    of locks because bad things tend to happen otherwise.)
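The three steps in that last chan_sip.c note (channel lock first, read the private pointer only while the channel is locked, then verify ownership) look roughly like this in Python. This is an illustration only: `Channel`, `Pvt`, `attach` and `set_rtp_peer` are simplified stand-ins rather than the real structures, and the addresses are placeholder values.

```python
import threading

class Channel:
    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()
        self.tech_pvt = None

class Pvt:
    def __init__(self):
        self.lock = threading.Lock()
        self.owner = None
        self.rtp_peer = None

def attach(chan, pvt):
    """Tie a private structure to a channel, using the same lock order."""
    with chan.lock:
        with pvt.lock:
            chan.tech_pvt = pvt
            pvt.owner = chan

def set_rtp_peer(chan, peer):
    with chan.lock:                    # 1. channel lock first
        pvt = chan.tech_pvt            # 2. read the pointer while locked
        if pvt is None:
            return False               # channel already hung up
        with pvt.lock:                 # 3. then the private structure
            if pvt.owner is not chan:  # 4. sanity check on ownership
                return False
            pvt.rtp_peer = peer
            return True

chan = Channel("SIP/100-0001")
pvt = Pvt()
attach(chan, pvt)
ok = set_rtp_peer(chan, "192.0.2.5:12000")

# Simulate a masquerade handing the pvt to another channel.
other = Channel("SIP/100-0002")
attach(other, pvt)
stale = set_rtp_peer(chan, "192.0.2.6:12000")
print(ok, stale, pvt.rtp_peer)
```

Since every path takes the locks in the same order, there is no need for a back-off-and-retry loop, and the ownership check catches the case where the pointer was moved in the meantime.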

Submitted by olekaas on Fri, 08/12/2011

1.8.6.0-rc1 downloaded, compiled and installed.

A lot of full and partial reloads have been issued without deadlock, crash or mishaps. At least one call was active during the reloads. Time will show if it holds under heavy load too...

Another issue: We can't get PickupChan working for some tenants. I was under the impression that there's no 63 pickgroup limit with the pickup app? Note that this has nothing to do with this update to 1.8.6.0-rc1 and might be because res_coffee.so has been exhausted.

/Ole

Submitted by olekaas on Sun, 08/14/2011

caffeine level "normalized" - after getting the feature codes right, the chanpickup stuff works as expected.

Still no crash on reloads. I really hate to search for something that I really don't want to find. Looking for something good, you're glad once you find it AND your search is over.

/Ole

Submitted by olekaas on Mon, 08/15/2011

With an auto-reload cronjob that would reload every 10 minutes if TL had written the reload file, there was at one point a deadlock every 10 minutes. The upgrade from 1.4 to 1.8 was finished around 3:30 am; the first deadlock didn't happen until around 9:30, and the next around half an hour later. From there the frequency increased to every 10 minutes until 12:10, when it ran for 70 minutes without a deadlock. Finally, looking at the times of the alarms, it didn't take long to figure out what to do... Later I tried a "dialplan reload" to push some urgent changes. Even with eyes shut and fingers crossed it ended in a deadlock.

Outside office hours we issued a number of reloads while we were sorting out an issue for a customer. Even though there were only a handful of active calls, it would deadlock on each and every reload. On the other hand, I did issue 2 reloads during office hours to activate some very urgent changes. For some reason they went ok.

My guess is that these criteria need to be met to have a deadlock:

* a change in config (I'm not sure this makes sense, but it seems to be the case)
* some load - a handful of active calls should do
* a huge config

The entire asterisk config dir is 16M with 2.4M inbound.include and 1.4M extensions.include.
We have a server still on 1.8.5.0 (i386 arch) which runs fine with auto reload every 5 mins. Config dir size 6.4M.

I just did a reload at 9:05. It went ok - 100% success so far :p

/Ole

Submitted by rfrantik on Mon, 08/15/2011

Erik:

With the troubles discussed here... what version of Asterisk are you running on your production machines? 1.6.2.11?

We are currently on 1.6.2.14 and have seen the occasional deadlock... no ability to process SIP traffic, but we can still call the IVR via a DID on the PRI. As you say, it seems more related to a special set of circumstances that we can't quite identify. We are currently running about 100 extensions with moderate traffic. Trying to decide if we should step back down to 1.6.2.11, as we are adding more phones every day.

Thanks,
Bob

Submitted by eeman on Mon, 08/15/2011

1.6.2.11 is an option, but 1.8.6 might be another option. Check back in a few days to see if it resolved it. Until then, hold off on reloads until after hours.

Submitted by wiredwaters on Wed, 09/07/2011

Erik-

We are looking to upgrade to Asterisk 1.8.6.0.

The current ISO has 1.8.4.2, and the Thirdlane repository does not have 1.8.6.0; it actually gives a 1.8.3.x revision if you pull it with YUM.

If you add the digium and centos repos, 1.8.6 shows as available with a list of 8 dependencies, as below:

Dependencies Resolved

========================================================================================
Package               Arch  Version                              Repository        Size
========================================================================================
Updating:
asterisk18            i386  1.8.6.0-1_centos5                    asterisk-current  4.0 k
Installing for dependencies:
asterisk18-core       i386  1.8.6.0-1_centos5                    asterisk-current  15 M
asterisk18-dahdi      i386  1.8.6.0-1_centos5                    asterisk-current  1.2 M
asterisk18-doc        i386  1.8.6.0-1_centos5                    asterisk-current  11 k
asterisk18-voicemail  i386  1.8.6.0-1_centos5                    asterisk-current  239 k
kmod-dahdi-linux      i686  2.5.0-1_centos5.2.6.18_238.19.1.el5  asterisk-current  3.7 M
libopenr2             i386  1.2.0-1_centos5                      asterisk-current  162 k
libss7                i386  1.0.2-1_centos5                      asterisk-current  63 k
libtonezone           i386  2.5.0-1_centos5                      asterisk-current  18 k

Transaction Summary
========================================================================================
Install 8 Package(s)
Upgrade 1 Package(s)

Questions -

Will this work and can we expect any gotchas on it?

We could just patch from the source, but where is the source tree located?

Thanks-

Sean

Submitted by eeman on Wed, 09/07/2011

Do you not know how to compile software? Download the source code from downloads.digium.com and compile it. Make sure you apply the pending parking patch so that multi-tenant parking continues to work.

Submitted by rx4change on Fri, 09/09/2011

I have been testing this on SVN branch r334953 (current branch as of 9/8/11) and even with https://reviewboard.asterisk.org/r/1323/ I can confirm that this is still an issue, at least under moderate load.

I have been able to demonstrate crashes at a rate of around 3x / day with fairly frequent reloads. Erik's suspicions would seem to be confirmed. What a mess.

Submitted by eeman on Fri, 09/09/2011

A crash and a deadlock aren't the same. They have fixed the deadlock issue in 1.8.6.0. If you look at the patch notes of 1.8.7-rc1 you'll notice they also made a significant SIP change to something that was causing a lot of performance drain on the SIP stack.

More crashing bugs got squashed, but if you're actually generating core files in /tmp with an automatic restart, you should definitely recompile with the DONT_OPTIMIZE flag and submit backtraces so they can get fixed. I know files with spaces in the name and directories with spaces in the name have caused crashes lately. MP3 music on hold is the crash devil. Convert all your MP3s into 8 kHz WAV files. Aside from avoiding crashes, it will take a huge load off your PBX, which otherwise has to decode MP3s on every single caller channel; WAV files just dump onto the channel with very little transcoding overhead.
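One way to batch the MP3 conversion is to generate an ffmpeg command per file. The sketch below assumes ffmpeg is installed, which nobody in this thread has confirmed for their boxes (sox would do the same job). The helper names and the sample file name are invented. It also swaps spaces in the file name for underscores, given the crashes mentioned above.

```python
from pathlib import Path

def wav_name(mp3_path):
    """Output name: same folder, spaces replaced, .wav extension."""
    mp3_path = Path(mp3_path)
    return mp3_path.with_name(mp3_path.stem.replace(" ", "_") + ".wav")

def ffmpeg_command(mp3_path):
    """Build an ffmpeg command line for 8 kHz mono WAV output."""
    return [
        "ffmpeg", "-i", str(mp3_path),
        "-ar", "8000",   # 8 kHz sample rate
        "-ac", "1",      # mono
        str(wav_name(mp3_path)),
    ]

# Invented file name, only to show the resulting command.
cmd = ffmpeg_command("moh/Smooth Jazz 01.mp3")
print(cmd)
```

Each command list can be handed to `subprocess.run(cmd, check=True)`. Directory names with spaces are not touched here and would need renaming separately.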

Submitted by rx4change on Fri, 09/09/2011

You are correct - I should have been more specific. What I am experiencing is actually a core dump on reload. I have seen it 3x in less than 24 hours, though that represents a fairly small number in relation to total reloads.

I neglected to compile with DONT_OPTIMIZE, but I'll go ahead and do that. In this case, it appears to be fairly random and has nothing to do with load on the box.

A few other 1.8.6+ notes: multiple call parking is sorta functional. Reloads don't seem to refresh settings (specifically parkinglot = on peer) and there are occasional random dumps to the default. It appears that setting the parking lot in a channel variable works better than relying on the peer.

Timerfd is definitely still broken. Stay away from it.

Submitted by rx4change on Wed, 09/14/2011

It appears the core dumps were actually caused by the (supposedly) 1.8-compatible Skype for Asterisk plugin. This is virtually impossible to troubleshoot, so I'd recommend just dumping support for it, or spinning up a separate box for now if you really have to have it.

Just wanted to close the loop on that one. We haven't seen any deadlock on reloads with SVN-branch-1.8-r335064.