
HA Clustering not working on DELL R210

Posted by phoetger on Wed, 03/23/2011

We bought 2 brand new servers (Dell R210).
They have dual onboard NICs (Broadcom).

In our lab environment with some home-built machines the clustering worked as designed, but on the Dell R210 it does not. On the master server the message "bnx2: eth1 NIC Copper Link is Down" keeps repeating, and eventually the Thirdlane GUI fails with "HA CLUSTERING INTERNAL FAILURE". The message only appears on eth1, which is connected to eth1 on the 2nd server with a crossover cable (on the 10.10.10.x subnet).

We looked online to see if there were any solutions and added a line to modprobe.conf (bnx2 disable_msi=1,1,1,1). The error no longer comes up, but clustering still does not work. It keeps failing.
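For reference, the commonly cited form of that workaround is an options line in /etc/modprobe.conf (a sketch; whether the value has to be repeated per interface seems to vary between guides):

  # /etc/modprobe.conf on CentOS 5.x
  # disable Message Signaled Interrupts (MSI) for the Broadcom bnx2 driver
  options bnx2 disable_msi=1

The module has to be reloaded (or the box rebooted) for the option to take effect.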

Any suggestions?

Thanks


Submitted by eeman on Wed, 03/23/2011 Permalink

You don't want to disable your NIC, that's never a good thing.

My suggestion is to not use a crossover cable; that's a schwagg solution. You're better off with a good, managed, gigabit Ethernet switch that you can run jumbo frames across. A couple of ports reallocated to switchport access mode on a separate, private VLAN should do the trick.

For your experiment just use any switch that will give you a good link indication. Once you confirm it resolves the issue you're going to want to use something like a Dell 5424 or 5448 switch. Make sure to turn on jumbo frames on it.
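(Purely as an illustration of that port setup, on an IOS-style managed switch it would look roughly like this; the port names, the VLAN number, and especially the jumbo-frame command are assumptions that differ by vendor and firmware:)

  ! two access ports on a private replication VLAN
  vlan 100
  interface GigabitEthernet0/1
   switchport mode access
   switchport access vlan 100
  interface GigabitEthernet0/2
   switchport mode access
   switchport access vlan 100
  ! jumbo frames are usually a global setting, e.g. "system mtu jumbo 9000" on some
  ! Catalyst models; PowerConnect switches use their own command and need a reload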

Submitted by phoetger on Wed, 03/23/2011 Permalink

Erik,

Thanks for your quick reply...

modprobe bnx2 disable_msi=1,1,1,1 does not disable the NIC; that parameter disables Message Signaled Interrupts (MSI), which cause issues on CentOS 5.3, 5.4 and 5.5.

We've tried hooking it into a switch instead of the crossover cable and we get the same error.
We also tried forcing a 100Mb connection instead of 1000Mb, but it made no difference.
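(For anyone wanting to repeat that test, forcing the link speed on CentOS can be done with ethtool; eth1 as the replication interface is an assumption:)

  # force eth1 to 100 Mbit/s full duplex, autonegotiation off
  ethtool -s eth1 speed 100 duplex full autoneg off
  # verify the resulting link state
  ethtool eth1 | grep -E 'Speed|Duplex|Link detected'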

I guess Thirdlane clustering does not like Broadcom NetXtreme II NICs. It works on Intel and Realtek NICs.

Output on the Broadcom NICs:
Running installation script ...
Please wait ...
----10min later----
Internal error during HA cluster configuration

Has anyone run into this error and found a workaround? As of right now, 2 servers and 2 HA licenses are sitting there unused...

Thanks

Submitted by eeman on Wed, 03/23/2011 Permalink

What you're describing (the link going up and down) isn't a Thirdlane issue but an OS issue. I've never had a single problem with Dell servers using Broadcom NICs. I've done heartbeat+DRBD on R200s, R710s, 2950s, and even 2850s, and I have never had to alter the driver parameters of the NIC. But then again, I don't use an ISO; I do it the real way, so that as each layer is built, I get feedback and confirmation of that specific layer working. It sounds like you could really use that kind of feedback right about now. A link going up and down is layers beneath a post-install bash script. Does the error occur on only one of the two machines (when connected via a switch)? Maybe you just have a bad NIC? Technically it's simply an IP interface and DRBD runs over it. Can the two NICs even ping each other?
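(A couple of quick checks, assuming the replication link uses the 10.10.10.x addresses mentioned above:)

  # from the master, ping the peer over the replication interface
  ping -c 5 -I eth1 10.10.10.2
  # watch the kernel log for the link-down messages while testing
  tail -f /var/log/messages | grep bnx2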

BTW the ISO is supposed to still be BETA.

Submitted by phoetger on Wed, 03/23/2011 Permalink

All interfaces are pingable from all servers without packet loss; the interfaces only do that when the clustering is started.

It's not like we've never set up anything like this. 2 other servers (home builds with Intel and Realtek NICs) work fine. (Same setup, crossover cable between eth1.)

As you said, it might be an OS issue.

For now all we can do is try to tweak the OS and see if that resolves the issue... If we get it working I will post a reply with the solution.

Thanks

Submitted by eeman on Wed, 03/23/2011 Permalink

If the interface is only doing that during data transfer, try scp'ing a decent-sized file across the network. What MTU is the ISO setting on the interfaces?
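(A rough way to run that test over the 10.10.10.x link; the file size and peer address are placeholders:)

  # create a ~500 MB test file and copy it across the replication link
  dd if=/dev/zero of=/tmp/testfile bs=1M count=500
  scp /tmp/testfile root@10.10.10.2:/tmp/
  # check what MTU the ISO configured on the interface
  ip link show eth1 | grep mtu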

Submitted by eeman on Wed, 03/23/2011 Permalink

Check to see if eth1 is getting its MTU set to 9000 (jumbo frames). The R200 NICs (BCM5722) did not support jumbo frames, but the R610 with its BCM5709 has no issues.
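(To check the MTU and, for testing, verify that jumbo frames actually pass end to end; the peer address is a placeholder:)

  cat /sys/class/net/eth1/mtu          # current MTU on eth1
  ifconfig eth1 mtu 9000               # set jumbo frames until the next restart
  ping -c 3 -M do -s 8972 10.10.10.2   # 8972 bytes of payload + 28 header bytes = 9000, no fragmentation allowed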

Submitted by phoetger on Thu, 03/24/2011 Permalink

I've set the MTU to 9000, but it did not make any difference. :-(

Is there a way to tell the Thirdlane software not to use jumbo frames? In our lab environment both NICs on both servers are set to MTU 1500 and we have no problem running the HA clustering.

In the lab the NICs are Intel E100s.
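(Whatever the ISO does, on a stock CentOS 5 box the MTU can be pinned persistently in the interface config file; the address values below are placeholders based on the 10.10.10.x range used here:)

  # /etc/sysconfig/network-scripts/ifcfg-eth1
  DEVICE=eth1
  BOOTPROTO=static
  IPADDR=10.10.10.1
  NETMASK=255.255.255.0
  MTU=1500
  ONBOOT=yes

  # then apply with: service network restart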

Thanks

Submitted by eeman on Thu, 03/24/2011 Permalink

The 9000 MTU was just a theory. I see no reason why you should not be able to run DRBD on eth1 of the R210. My suggestion is to take both boxes and put CentOS 5.5 on a partition that includes only the first 30 GB of your HDD, leaving the rest unpartitioned. Once it's up and running and you have a working backside network, partition the remaining disk space, create the DRBD replicated block device, then format that block device with the mkfs.ext3 command. There are plenty of documents that talk about how to do that much.

Once you have that accomplished, have mounted the block device into your directory structure, and have tested reading/writing to/from the filesystem, you'll know immediately whether the problem is with the OS+hardware combination or with some setting in the process the ISO uses to create your cluster.
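(A minimal sketch of that manual DRBD setup, assuming DRBD 8.x, a spare partition /dev/sda3 on each box, hostnames node1/node2, and the 10.10.10.x replication addresses; every one of those names is a placeholder to adapt:)

  # /etc/drbd.conf (identical on both nodes)
  resource r0 {
    protocol C;
    on node1 {
      device    /dev/drbd0;
      disk      /dev/sda3;
      address   10.10.10.1:7788;
      meta-disk internal;
    }
    on node2 {
      device    /dev/drbd0;
      disk      /dev/sda3;
      address   10.10.10.2:7788;
      meta-disk internal;
    }
  }

  # on both nodes
  drbdadm create-md r0
  service drbd start
  # on the node that should become primary
  drbdadm -- --overwrite-data-of-peer primary r0
  mkfs.ext3 /dev/drbd0
  mount /dev/drbd0 /drbd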

Submitted by eeman on Thu, 03/24/2011 Permalink

I just had a thought... what if your problem is the fakeraid that the R210 uses? I've never been a fan of the R210, purely because of the lack of hot-swap storage. Are you trying to use that S100 RAID controller to mirror your HDDs? That only works with winblows.

Submitted by phoetger on Thu, 03/24/2011 Permalink

Good thought; we ordered the R210s with "SAS 6/iR" hardware RAID controllers, as the S100 and S300 would not work.

Unless that could still be the issue. We might have to try hooking up just one HDD on each server without any RAID and then try the clustering, to eliminate that as a cause.

Submitted by phoetger on Thu, 03/24/2011 Permalink

It looks like we are going to buy a PCI Express dual-port NIC and not use the onboard ones. Any recommendations?

The NC360T PCI Express card supports jumbo frames.

Once we get those and install them I'll let you know what the results are.

Thanks for all your help.

Submitted by phoetger on Thu, 03/31/2011 Permalink

We have tried the following now, but still no luck on the clustering:

Onboard NICs = no go
GB switch in between the servers = no go
Swapped out the network cards for Intel GB NICs (with jumbo frame support) = no go
Hardware RAID = no go
Software RAID = no go
Onboard RAID controller = no go
SAS 6/iR RAID controller = no go

On some home-built desktop PCs the clustering worked like a charm, but they are not powerful enough to handle the load (Celeron 1.7 GHz with 512 MB RAM and Intel E100 NICs).

The only thing we have not tried yet on the R210 is running the clustering on the servers without any RAID (just a single HDD), to rule out that it has something to do with that. (We'll try that to see if it is the RAID configuration that prevents the clustering.)

Will keep you updated.

(For now, don't suggest anyone buy a Dell R210 if they plan on MTE with clustering.) :-)

Thanks

Submitted by eeman on Thu, 03/31/2011 Permalink

Keep me posted...

For the record, I would say "for now, don't suggest anyone buy a Dell R210 if they plan on MTE with clustering using the simplified ISO to do it."

I have a high degree of confidence that I could get DRBD to replicate across the backside network on those boxes by doing it step by step, the way I typically do it.

Submitted by phoetger on Thu, 03/31/2011 Permalink

OK, we tested it without the RAID...
Still the same issue...

When we look at the running processes after starting the HA cluster config, we see miniserv.pl come up and then go defunct, and this repeats for the whole 10 minutes.

Then, when the GUI says "Internal error during HA cluster configuration", miniserv.pl stops popping up as well.
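(One way to watch this while the configuration runs; the Webmin error-log path is an assumption based on a default Webmin install and may differ on the Thirdlane ISO:)

  # watch miniserv.pl spawn and go defunct
  watch -n 1 'ps aux | grep [m]iniserv'
  # tail the Webmin/miniserv error log for clues
  tail -f /var/webmin/miniserv.error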

Would that have anything to do with it?

Thanks

Submitted by eeman on Thu, 03/31/2011 Permalink

Sounds more like a bug in an install script somewhere. I don't know much about the installation scripts used by anaconda/python during install; that was developed by someone else. Do you know enough about DRBD to do it manually? My expertise resides in established systems more so than installer scripts. I build most of my stuff by hand anyway. It's like the difference between building a model plane out of balsa wood and shrink skin, or buying a pre-built plastic plane to fly around. IMO building it is just as satisfying as flying it.

Submitted by phoetger on Wed, 04/06/2011 Permalink

I set up DRBD manually on both servers and the failover works; however, calls on the secondary server do not complete - they just go to dead air. I don't even see in the Asterisk logs that calls are making it to the secondary server when it is 'active'.

What script is it trying to run when it is initiated from the GUI so I can look at the steps it is taking?

Thanks.

Submitted by eeman on Wed, 04/06/2011 Permalink

By 'do not complete' do you mean new calls, or the calls that were in progress when you failed over?

How are you making one server active and the other standby?

Did the new active server take over the 'floating' IP?
Is this the IP you have the phones registering to?
Is this the IP Asterisk is configured to listen on?
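(Quick ways to check the first of those after a failover; the address below is a placeholder for your floating IP:)

  # on the standby node: the shared address should appear alongside the primary one (often as eth0:0)
  ip addr show eth0
  # from another host on the LAN: confirm which MAC now answers for the floating IP
  arping -c 3 192.168.1.100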

Submitted by phoetger on Wed, 04/06/2011 Permalink

Erik,

It looks like the calls never make it to the slave of the cluster, even though it takes over the IP. I'm sure there is something else missing that is not being properly replicated; that's why I asked what script is run when this is initiated through the GUI, so I can see step by step what is being installed, copied, and changed...

We issued a REBOOT command on the MASTER server through the CLI. While it was rebooting, the SLAVE became the active server with the floating IP, and drbd0 of the r0 resource was mounted as /DRBD. The web GUI worked and retained all the settings; the phones, however, were never able to dial out (just dead air), and in "asterisk -rvvv" on the SLAVE (while active) I saw that nothing was coming across to it...

When the MASTER came back up, everything switched back onto the MASTER server and everything seems fine (calls in/out, settings retained, and so on).

- The IP was taken over by the SLAVE (it switched back when the master came back up)
- The IP was the same one the phones are registered to

As to what IP Asterisk is listening on, I have not checked that; that might be the problem... I will check on it. What command do I have to issue to see which IP Asterisk sends and receives calls on? (It should listen on eth0:0, not eth0, correct?)

Thanks

Submitted by thirdlane on Wed, 04/06/2011 Permalink

Asterisk is supposed to start from the pbxportal script, which is supposed to be under HA control - this way it will listen on the shared address once the address is active. Is it possible that Asterisk starts before the address is active? This would happen if you manually reinstalled Asterisk.
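(A quick way to check that theory after a failover - these are generic Asterisk/Linux checks, not Thirdlane-specific commands:)

  # did Asterisk bind to the wildcard or to a specific (wrong) address?
  netstat -lnup | grep 5060
  # what bind address is configured for the SIP channel?
  grep -i bindaddr /etc/asterisk/sip.conf
  # if Asterisk came up before the shared address was active, restarting it via the
  # HA-controlled pbxportal script once the address is up should confirm the theory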