SARK-HA (Red Eye) High Availability Cluster Planning & Set up.

Introduction

SARK-HA is available as a set of additional rpms for a regular SARK/SAIL installation or as a full-blown self-installing .iso. SARK-HA gives near real-time failover between a pair of SARK UCS MVP Servers running as a High Availability Cluster. The pair share a virtual IP address and this is what the phones and lines use to communicate with the currently-available server. In the event of a failure the phones and lines do not have to change IP address to re-register. Failover currently takes about 12 seconds to complete.

In order to make the failover invisible to the phones and VoIP lines, we use three IP addresses for each HA cluster. Each server (the primary and the secondary) has its own local IP address. A third "virtual" IP address is used to communicate with the rest of the subnet and the internet. This third IP address is passed back and forth between the two servers during failover.
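As a rough illustration of the three-address scheme, the sketch below checks whether a node currently holds the virtual IP address by scanning `ip -o -4 addr show` style output. The function name `holds_vip`, the sample addresses and the interface alias are assumptions made for the example; they are not part of SARK-HA.

```shell
#!/bin/sh
# Sketch only: holds_vip and all addresses below are illustrative assumptions.

# Succeeds when the address listing on stdin contains the virtual IP in $1.
holds_vip() {
    # -F treats the pattern as a fixed string; the trailing "/" stops
    # 192.168.1.5 from matching 192.168.1.50/24.
    grep -qF "inet ${1}/"
}

# Sample listing in the style of `ip -o -4 addr show` on the active node.
sample='2: eth0    inet 192.168.1.11/24 brd 192.168.1.255 scope global eth0
2: eth0    inet 192.168.1.50/24 brd 192.168.1.255 scope global secondary eth0:0'

if printf '%s\n' "$sample" | holds_vip 192.168.1.50; then
    vip_state="active"
else
    vip_state="standby"
fi
echo "this node is $vip_state"
```

On a live node you would pipe the real listing in instead of the sample, for example `ip -o -4 addr show | holds_vip <your virtual IP>`.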

In addition to being connected via a heartbeat mechanism, the two machines' PBX images can also optionally be kept in synch by a new daemon written specially for the purpose. The daemon opens an SMB share between the two machines and periodically compares the two PBX images. Should the primary image change (as a result of ongoing maintenance), the daemon will automatically bring the failover image up to the same level.
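The compare-then-update idea can be sketched as follows. This is not the SARK-HA daemon: it works on two local directories rather than an SMB share, and the function names, the file name and the checksum approach are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch only: NOT the SARK-HA daemon. Names and paths are hypothetical.

# Print a single checksum summarising every file under directory $1.
tree_digest() {
    ( cd "$1" && find . -type f | LC_ALL=C sort |
        while IFS= read -r path; do cksum "$path"; done | cksum )
}

# Copy the primary image ($1) over the standby copy ($2) when they differ.
# Note: cp -R does not remove files that were deleted on the primary.
sync_if_changed() {
    if [ "$(tree_digest "$1")" != "$(tree_digest "$2")" ]; then
        cp -R "$1"/. "$2"/
        echo "synchronised"
    else
        echo "unchanged"
    fi
}

# Demonstration on throwaway directories.
primary=$(mktemp -d)
standby=$(mktemp -d)
echo "exten 100" > "$primary/extensions.txt"
first_pass=$(sync_if_changed "$primary" "$standby")
second_pass=$(sync_if_changed "$primary" "$standby")
echo "$first_pass, then $second_pass"
rm -rf "$primary" "$standby"
```

A periodic run of something like `sync_if_changed` corresponds to the LAZY behaviour described later on this page; LOOSE would simply skip it.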

In Release 1.0.0 the HA Cluster will automatically failover if either Asterisk fails on the primary machine or the entire machine fails. Failover can also be triggered manually in the event that, for example, hardware maintenance needs to be carried out on the primary without stopping the phone service.

Control & Setup

In keeping with the rest of SARK, SARK-HA setup is almost entirely automatic. Once the new rpms have been installed, there are four (4) new fields to fill out in globals. This, of course, has to be done on both machines. Once done, the High Availability engines can be started on each machine and the service is enabled. SARK-HA requires SAIL-2.2.4 or higher to operate. It also requires license clearance from the authors. For testing purposes you can receive a limited license by sending an e-mail to admin@selintra.com with the heading "SARK-HA testing" and enclosing the serial number of the SARK/SAIL system you would like us to clear. The serial number for your system can be found on the globals panel.

Restrictions

  • In this first release SARK-HA will only run across a pair of SME-Servers which are running in Server-Only mode with DHCP turned OFF. DHCP services (including option66) need to be provided by a separate system.
  • The two machines need not be symmetrical.
  • The two machines must run SAIL-650 or higher and they should run the same release (although point releases don't matter too much).
  • The two asterisk releases need not be the same.
  • Either or both of the machines may be virtualised, although it is NOT a good idea, except perhaps for testing, to have both virtual images on the same physical platform (for obvious reasons).
  • If you are upgrading an existing site then you will need to make a minor adjustment to your existing extension entries in the server manager. This is not necessary if you are contemplating a new installation or if you are adding new extensions. The change has to do with the fact that the provisioning IP address will change to the "virtual" IP address when the system is running HA. Here is how an existing extension might look on your system...

In preparation for HA you need to replace the previously hard-coded SIP and SIP registrar IP addresses with the symbolic value $localhost, as follows...

A new late-address resolver in Sail-2.2.4-21 will substitute the correct values into the tftp server at startup. In this way, your system can run either in normal mode or in HA mode without having to change the IP addresses.
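The effect of late address resolution can be imitated with a one-line substitution. The sketch below is not the SAIL resolver itself; the function name `resolve_localhost`, the two provisioning lines and the addresses are hypothetical and only show how one symbolic entry can serve both modes.

```shell
#!/bin/sh
# Sketch only: the real resolver is part of SAIL; this imitates its effect.

# Replace every literal $localhost on stdin with the address given as $1.
resolve_localhost() {
    sed 's/\$localhost/'"$1"'/g'
}

# Hypothetical provisioning lines; your extension entries will differ.
template='registrar=$localhost
proxy=$localhost'

# Normal mode: the server's own address. HA mode: the virtual address.
normal=$(printf '%s\n' "$template" | resolve_localhost 192.168.1.11)
ha=$(printf '%s\n' "$template" | resolve_localhost 192.168.1.50)
printf '%s\n' "$normal" "$ha"
```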

Installation and testing.

Upgrade both of your machines to Sail-2.2.4-21 or higher.

As part of your install pack you will have been given a download address for the Linux HA rpms. Download and install all of these rpms onto each machine. The rpms can be installed using rpm -Uvh commands. You will have been given a sequence in which to apply them.

Download the SARK-HA rpm onto each machine and install it with yum localinstall.

Run signal-event post-upgrade on each machine and then reboot them both.
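The installation steps above can be sketched as a small script. It defaults to a dry run that only prints the commands. The order file, the package file names and the helper names `run` and `install_sark_ha` are assumptions; use the rpm names and sequence supplied with your install pack.

```shell
#!/bin/sh
# Sketch only. DRY_RUN defaults to 1, so commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1: text file listing the Linux HA rpms, one per line, in the order
#     given with your install pack (hypothetical file).
# $2: path to the SARK-HA rpm.
install_sark_ha() {
    while IFS= read -r pkg; do
        if [ -n "$pkg" ]; then
            run rpm -Uvh "$pkg"
        fi
    done < "$1"
    run yum localinstall "$2"
    run signal-event post-upgrade
    run reboot
}

# Demonstration with placeholder package names.
order_file=$(mktemp)
printf '%s\n' first-ha-package.rpm second-ha-package.rpm > "$order_file"
plan=$(install_sark_ha "$order_file" sark-ha.rpm)
rm -f "$order_file"
echo "$plan"
```

Setting `DRY_RUN=0` would execute the commands for real, including the reboot, so review the printed plan first and remember that the same steps are needed on both machines.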

Startup

Log in to the server-manager on the system which you have nominated as your primary. The globals panel should look like this...

You will see that your globals panel has some new information at the top telling you the state of your HA Cluster. Initially it will come up with both Asterisk and the HA engine stopped. You must fill out 4 new fields in the panel before you can start the HA service...

HA Synch Mode:

Description - Dropdown.
Permissible values - {LAZY|LOOSE}. LAZY means periodic synchronization of the two system images in the cluster. LOOSE means no synchronization.
Default - LAZY.

HA IP Address:

Description - This is the Virtual Address that the cluster will run at. You should allocate a free static address on your subnet for this.
Permissible values - standard dotted quad
Default - None.

HA Primary Node Name:

Description - Nodename of this server as given by uname -n.
Permissible values - The uname -n node name.
Default - None.

HA Secondary Node Name:

Description - Nodename of the failover server as given by uname -n.
Permissible values - The uname -n node name of the failover server.
Default - None.

IMPORTANT NOTE

These four fields must be IDENTICAL on BOTH HA servers in the cluster.
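Before typing the values into both panels, it may help to sanity-check them on the command line. The sketch below validates a dotted-quad address and the synch mode. The function names are assumptions, and the panel may apply its own, different checks.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative; the panel may validate differently.

# Succeeds when $1 is four dot-separated numbers, each in the range 0-255.
is_dotted_quad() {
    case "$1" in
        ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;
    esac
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into its octets
    IFS=$old_ifs
    [ $# -eq 4 ] || return 1
    for octet in "$@"; do
        [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
    done
    return 0
}

# Succeeds when $1 is one of the two documented synch modes.
is_synch_mode() {
    case "$1" in
        LAZY|LOOSE) return 0 ;;
        *) return 1 ;;
    esac
}

for candidate in 192.168.1.50 192.168.1 300.1.1.1; do
    if is_dotted_quad "$candidate"; then
        echo "$candidate: ok"
    else
        echo "$candidate: rejected"
    fi
done
```

The two node names can be checked by running `uname -n` on each server and comparing the output with what you enter.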

Startup.

Once you have set up both servers and committed your updates, you are ready to start the high-availability engines. On the primary server press the "HA ON" button. After a few moments the button should change to red and it should read "HAOFF". At this point, the asterisk "START" button will have disappeared. However, if you refresh the screen after a few moments, you should see that this has changed to the asterisk RED "STOP" button, indicating that the HA engine has started asterisk. You can now repeat the procedure on the failover server. Asterisk should NOT come up on the failover server; the STOP button should remain invisible.

After startup, here's how your primary server should look...

...And here's how your failover server should look...

Testing

You should now be able to log in to your primary server-manager panel at the virtual IP address. You can now begin using and testing your phone extensions and lines. N.B. if you are not using remote provisioning to set up your phones then you should remember to register them at the virtual IP address of the cluster and NOT the real IP address of either of the servers.

You can test the failover by bringing down either the primary (i.e. active) asterisk or the entire primary server. Be as brutal as you feel comfortable with. The gentlest way to force failover is simply to stop asterisk on your primary server. After a few seconds, asterisk will start on the failover server and when you log in to the virtual address you will be connected with the failover server. You can fail back by stopping asterisk on the failover server.
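To put a number on how long a failover takes at your site, you could time how long a probe of the virtual address keeps failing. The sketch below is one way to do that. `wait_for_service` is a hypothetical helper, and the probe command is whatever you choose to pass in. A plain ping only shows that the address has moved, not that Asterisk is answering.

```shell
#!/bin/sh
# Sketch only: wait_for_service is a hypothetical helper, not part of SARK-HA.

# Usage: wait_for_service TIMEOUT_SECONDS PROBE_COMMAND [ARGS...]
# Re-runs the probe once a second until it succeeds, then prints the
# number of whole seconds waited. Returns 1 if the timeout is reached.
wait_for_service() {
    timeout="$1"
    shift
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "$elapsed"
}

# Demonstration with a probe that always succeeds.
secs=$(wait_for_service 5 true)
echo "service answered after ${secs}s"

# On a real cluster, stop asterisk on the primary and then run, for example:
#   wait_for_service 60 ping -c 1 192.168.1.50   # address is a placeholder
```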

Implementing shared T1/E1 PRI with the Rhino Failover Card

SARK-HA has full on-board support for the Rhino Single port failover card. The card allows ISDN PRI circuits to be shared between HA cluster nodes. Access to the ISDN circuit is passed back and forth between nodes during failover and failback. In keeping with the rest of SARK, operation of the card is entirely automatic. All the user need do is to inform SARK that the card is present in the configuration and wire the card correctly.

Card Installation

The software drivers for the card are shipped with the SARK-HA rpm. However, you will need to make a small modification to /etc/ld.so.conf to reflect the load directory...

Make the following changes on BOTH machines...

[root@hasalpha ~]# cat /etc/ld.so.conf
include ld.so.conf.d/*.conf
/usr/local/bglibs/lib
/usr/local/lib

Add the last line (/usr/local/lib) to /etc/ld.so.conf and run ldconfig.
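Because the change has to be made on both machines, and possibly repeated after an upgrade, it can be convenient to make the edit idempotent. The sketch below appends the line only if it is not already present. It works on a temporary copy here, and `ensure_line` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: ensure_line is a hypothetical helper, shown on a scratch file.

# Append line $2 to file $1 unless an identical line is already there.
ensure_line() {
    grep -qxF -e "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Scratch file with the contents shown before the change.
conf=$(mktemp)
printf '%s\n' 'include ld.so.conf.d/*.conf' '/usr/local/bglibs/lib' > "$conf"

ensure_line "$conf" /usr/local/lib
ensure_line "$conf" /usr/local/lib    # second call changes nothing

matches=$(grep -cxF -e /usr/local/lib "$conf")
last_line=$(tail -n 1 "$conf")
rm -f "$conf"
echo "$matches occurrence(s); last line: $last_line"
```

On the servers themselves the equivalent would be `ensure_line /etc/ld.so.conf /usr/local/lib` followed by `ldconfig`, run as root.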

You must also inform SARK of the card's presence by running the following commands at the Linux console:

db selintra setprop global RHINOSPF YES

db selintra-work setprop global RHINOSPF YES

DON'T FORGET THAT THESE CHANGES NEED TO BE MADE ON BOTH MACHINES IN THE CLUSTER!

Reboot both of your systems and the card is now ready to use.

Cable out the card as follows... With the card's USB port on the left and proceeding left to right...

  1. Connect Socket 1 (marked "IN" on the diagram) to the PTT NTE.
  2. Connect Socket 3 (marked "NO" on the diagram) to the T1/E1 card in the PRIMARY node.
  3. Connect Socket 4 (marked "NC" on the diagram) to the T1/E1 card in the STANDBY node.

Operating Sequence - catastrophic failure

With power OFF, the card will bridge IN and NC. With power ON, SARK will issue the necessary commands to the card to bridge IN and NO. In this way, the system will feed ISDN30 signal to the PRIMARY server when power is on and to the secondary server when power is off. Thus if the PRIMARY node fails (loses power) then the ISDN30 will be transferred automatically to the STANDBY node.
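The relay behaviour described above amounts to a small truth table, sketched below. The function `card_route` and its arguments are assumptions used to model the description. The script does not talk to the card.

```shell
#!/bin/sh
# Sketch only: a model of the description above, not a driver for the card.

# $1: card power, "on" or "off".
# $2: bridge most recently requested by SARK, "NO" or "NC" (default "NO").
# Prints the socket bridged to IN and the node that receives the circuit.
card_route() {
    power="$1"
    requested="${2:-NO}"
    if [ "$power" = "off" ]; then
        echo "NC STANDBY"      # unpowered relay rests on NC
    elif [ "$requested" = "NO" ]; then
        echo "NO PRIMARY"      # normal running
    else
        echo "NC STANDBY"      # powered, but switched over by command
    fi
}

card_route on NO
card_route off NO
card_route on NC
```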

Operating Sequence - Asterisk failure

A watchdog daemon runs on both the PRIMARY and STANDBY nodes. Should Asterisk fail on the PRIMARY node (or indeed the secondary - if it is running there), then the daemon will "force" a failover event. It will also send the necessary commands to the Rhino card to failover the ISDN30 signal.

Modifying ha.cf values

The main control file for the heartbeat mechanism in HA is held in /etc/ha.d/ha.cf. You can read up on all of the ha.cf parameters and what they do here

http://www.linux-ha.org/ha.cf

If you need to modify ha.cf values then they need to be changed in the templates that generate the ha.cf file. You will find the main file here...

/etc/e-smith/templates/etc/ha.d/ha.cf/20-doit

The actual template is very simple...

==============snip=====================>

package esmith;

use strict;
use Errno;
use esmith::config;
use esmith::util;
use esmith::db;

my %selintra;
tie %selintra, 'esmith::config', '/home/e-smith/db/selintra';
my $haprinode = db_get_prop(\%selintra, 'global', "HAPRINODE");
my $hasecnode = db_get_prop(\%selintra, 'global', "HASECNODE");

$OUT .= <<'HERE';
auto_failback off
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 1
deadtime 20
warntime 5
initdead 120
udpport 694
bcast eth0
HERE

$OUT .= "node $haprinode\n";

$OUT .= "node $hasecnode\n";

===============snip==================>

Make your changes, save the file and then issue an expand-template.

expand-template /etc/ha.d/ha.cf

You will probably also need to issue a reload to heartbeat...

/etc/init.d/heartbeat reload
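Before reloading heartbeat it can be worth checking that the expanded file contains what you expect. The sketch below reads directives from an ha.cf style file and applies a few sanity checks: keepalive below warntime, warntime below deadtime, and initdead at least twice deadtime. The helper names and node names are assumptions, and the checks assume plain whole-second values with no "ms" suffix.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; values must be whole seconds.

# Print the first value of directive $2 found in file $1.
hacf_value() {
    awk -v key="$2" '$1 == key { print $2; exit }' "$1"
}

# Succeeds when the timing directives in file $1 are consistently ordered.
check_timings() {
    keepalive=$(hacf_value "$1" keepalive)
    warntime=$(hacf_value "$1" warntime)
    deadtime=$(hacf_value "$1" deadtime)
    initdead=$(hacf_value "$1" initdead)
    [ "$keepalive" -lt "$warntime" ] &&
        [ "$warntime" -lt "$deadtime" ] &&
        [ "$initdead" -ge $((deadtime * 2)) ]
}

# Scratch file using the timing values from the template above;
# the node names are placeholders.
hacf=$(mktemp)
printf '%s\n' 'auto_failback off' 'keepalive 1' 'deadtime 20' 'warntime 5' \
    'initdead 120' 'udpport 694' 'bcast eth0' \
    'node nodea.example' 'node nodeb.example' > "$hacf"

if check_timings "$hacf"; then timing_ok=yes; else timing_ok=no; fi
node_count=$(grep -c '^node ' "$hacf")
dead=$(hacf_value "$hacf" deadtime)
rm -f "$hacf"
echo "timings ok: $timing_ok, nodes: $node_count, deadtime: $dead"
```

On a server you would point `check_timings` at /etc/ha.d/ha.cf after running expand-template.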

Example

Here is an example where we change some of the parameters to incorporate a serial cable and to calm down the failover behaviour (which is a little too fast on the standard install, particularly if you get the occasional drop-out on your network).

==============snip=====================>

package esmith;

use strict;
use Errno;
use esmith::config;
use esmith::util;
use esmith::db;

my %selintra;
tie %selintra, 'esmith::config', '/home/e-smith/db/selintra';
my $haprinode = db_get_prop(\%selintra, 'global', "HAPRINODE");
my $hasecnode = db_get_prop(\%selintra, 'global', "HASECNODE");

$OUT .= <<'HERE';
auto_failback off
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 1
deadtime 20
warntime 5
baud 38400
serial /dev/ttyS0
initdead 120
udpport 694
bcast eth0
crm off
HERE

$OUT .= "node $haprinode\n";
$OUT .= "node $hasecnode\n";

===============snip==================>

expand-template /etc/ha.d/ha.cf

/etc/init.d/heartbeat reload

Topic revision: r7 - 16 Nov 2009 - 20:56:46 - TWikiAdminUser
 
    
