clustering is SO AWESOME

Fri Oct 9 21:41:47 UTC 2009

The last couple of weeks I've been messing around with clustering here 
at the office. I've been banging my head against the wall trying to get 
it working (busted CentOS RPMs aside...).

Anyway, I've got it working now and oh wow! When it works, it is a thing 
of beauty. If I had portable enough hardware I would love to give a talk 
on it.

For example:

I've got a simple 2-node cluster running LVM on DRBD. This acts as a 
base for a set of Xen VMs. I use on-board IPMI as my fence devices and 
CentOS/Red Hat Cluster Suite for the magic.

I've been testing failure and recovery. Just now I decided to bite the 
bullet and kill both nodes (simulated power event). This was where things 
kept falling apart for me up until now. This time though, with the bugs 
squashed, it recovered fine.

It's how it recovered that was so sweet.

So I fire up the first node and set to work on the docs. After about 
five minutes I think "well, it should be up, let's see how bad it is". I 
fire up 'luci' and log in to check the cluster state. It said both nodes 
were up and the cluster was fine.

Now, I think to myself "crap, what went soooo wrong that it thinks the 
cluster is ok??". So I log in and start parsing the log file. Then I see 
this:

-------------------------------------
Oct  9 17:12:26 vsh02 openais[3748]: [CLM  ] got nodejoin message 
10.255.135.2
Oct  9 17:12:26 vsh02 ccsd[3742]: Initial status:: Quorate
Oct  9 17:13:17 vsh02 fenced[3767]: vsh03.canadaequity.com not a cluster 
member after 3 sec post_join_delay
Oct  9 17:13:17 vsh02 fenced[3767]: fencing node "vsh03.canadaequity.com"
Oct  9 17:13:17 vsh02 kernel: tg3: eth1: Link is down.
Oct  9 17:13:23 vsh02 kernel: tg3: eth1: Link is up at 1000 Mbps, full 
duplex.
Oct  9 17:13:23 vsh02 kernel: tg3: eth1: Flow control is off for TX and 
off for RX.
Oct  9 17:13:24 vsh02 fenced[3767]: fence "vsh03.canadaequity.com" success
-------------------------------------

That's right, the first node said "hey, my buddy isn't here! Let me call 
him." and BOOTED THE OTHER NODE. Sure enough, a few minutes later, it 
came online, rejoined the cluster, synced its data over DRBD and all 
was good.
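
For anyone who wants to pull the fence timing out of a log like the one 
above, here is a rough Python sketch. It only knows the two fenced 
message shapes that appear in that excerpt ('fencing node "X"' and 
'fence "X" success'); other versions of fenced may well log differently. 
The function name, the regexes and the year (taken from the date of 
this post, since syslog lines don't carry one) are all my own 
assumptions, not anything that ships with the cluster suite.

```python
import re
from datetime import datetime

# A few lines from the log above, with the mail wrapping undone.
LOG_LINES = [
    "Oct  9 17:13:17 vsh02 fenced[3767]: vsh03.canadaequity.com not a "
    "cluster member after 3 sec post_join_delay",
    "Oct  9 17:13:17 vsh02 fenced[3767]: fencing node "
    '"vsh03.canadaequity.com"',
    "Oct  9 17:13:17 vsh02 kernel: tg3: eth1: Link is down.",
    "Oct  9 17:13:24 vsh02 fenced[3767]: fence "
    '"vsh03.canadaequity.com" success',
]

# Syslog prefix for fenced messages: timestamp, host, "fenced[pid]:".
FENCED = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ fenced\[\d+\]: (.*)$"
)
START = re.compile(r'^fencing node "([^"]+)"$')
DONE = re.compile(r'^fence "([^"]+)" (\w+)$')


def fence_events(lines, year=2009):
    """Return (node, result, seconds) for each completed fence.

    'year' is an assumption: syslog timestamps do not include one.
    """
    started = {}
    events = []
    for line in lines:
        m = FENCED.match(line)
        if not m:
            continue  # not a fenced message, e.g. the kernel lines
        when = datetime.strptime(
            f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S"
        )
        msg = m.group(2)
        if (s := START.match(msg)):
            started[s.group(1)] = when
        elif (d := DONE.match(msg)) and d.group(1) in started:
            node, result = d.groups()
            secs = (when - started.pop(node)).total_seconds()
            events.append((node, result, secs))
    return events


for node, result, secs in fence_events(LOG_LINES):
    print(f"{node}: fence {result} after {secs:.0f}s")
```

On the lines above this reports that the fence of vsh03 came back as a 
success 7 seconds after fenced started it (17:13:17 to 17:13:24).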

That is sooooo coooool.

Time for the weekend!

Madi
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists