clustering is SO AWESOME
Madison Kelly
linux-5ZoueyuiTZhBDgjK7y7TUQ at public.gmane.org
Fri Oct 9 21:41:47 UTC 2009
The last couple of weeks I've been messing around with clustering here
at the office. I've been banging my head against the wall trying to get
it working (busted CentOS RPMs aside...).
Anyway, I've got it working now and oh wow! When it works, it is a thing
of beauty. If I had portable enough hardware I would love to give a talk
on it.
For example:
I've got a simple 2-node cluster running LVM on DRBD. This acts as a
base for a set of Xen VMs. I use on-board IPMI as my fence devices and
CentOS/Red Hat Cluster Suite for the magic.
I've been testing failure and recovery. Just now I decided to bite the
bullet and kill both nodes (simulated power event). This was where things
kept falling apart for me up until now. This time though, with the bugs
squashed, it recovered fine.
It's how it recovered that was so sweet.
So I fire up the first node and set to work on the docs. After about
five minutes I think "well, it should be up, let's see how bad it is". I
fire up 'luci' and log in to check the cluster state. It said both nodes
were up and the cluster was fine.
Now, I think to myself "crap, what went soooo wrong that it thinks the
cluster is ok??". So I log in and start parsing the log file. Then I see
this:
-------------------------------------
Oct 9 17:12:26 vsh02 openais[3748]: [CLM ] got nodejoin message
    10.255.135.2
Oct 9 17:12:26 vsh02 ccsd[3742]: Initial status:: Quorate
Oct 9 17:13:17 vsh02 fenced[3767]: vsh03.canadaequity.com not a cluster
    member after 3 sec post_join_delay
Oct 9 17:13:17 vsh02 fenced[3767]: fencing node "vsh03.canadaequity.com"
Oct 9 17:13:17 vsh02 kernel: tg3: eth1: Link is down.
Oct 9 17:13:23 vsh02 kernel: tg3: eth1: Link is up at 1000 Mbps, full
    duplex.
Oct 9 17:13:23 vsh02 kernel: tg3: eth1: Flow control is off for TX and
    off for RX.
Oct 9 17:13:24 vsh02 fenced[3767]: fence "vsh03.canadaequity.com" success
-------------------------------------
That's right, the first node said "hey, my buddy isn't here! Let me call
him." and BOOTED THE OTHER NODE. Sure enough, a few minutes later, it
came online, rejoined the cluster, synced its data over DRBD and all
was good.
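
If you want to pull the same story out of your own logs, here is a rough
sketch in Python that picks out just the fenced lines from the excerpt
above. The regex is only a guess at the syslog layout based on those few
lines, and the names fence_events, LINE_RE and LOG are made up for this
example, so treat it as a starting point rather than a proper parser.

```python
import re

# Log excerpt from above, with the mail-wrapped lines rejoined.
LOG = """\
Oct 9 17:12:26 vsh02 openais[3748]: [CLM ] got nodejoin message 10.255.135.2
Oct 9 17:12:26 vsh02 ccsd[3742]: Initial status:: Quorate
Oct 9 17:13:17 vsh02 fenced[3767]: vsh03.canadaequity.com not a cluster member after 3 sec post_join_delay
Oct 9 17:13:17 vsh02 fenced[3767]: fencing node "vsh03.canadaequity.com"
Oct 9 17:13:17 vsh02 kernel: tg3: eth1: Link is down.
Oct 9 17:13:23 vsh02 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Oct 9 17:13:23 vsh02 kernel: tg3: eth1: Flow control is off for TX and off for RX.
Oct 9 17:13:24 vsh02 fenced[3767]: fence "vsh03.canadaequity.com" success
"""

# Assumed syslog-style layout: month, day, time, host, "daemon[pid]: message".
# The "[pid]" part is optional because the kernel lines do not carry one.
LINE_RE = re.compile(
    r'^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+'
    r'(?P<daemon>[\w-]+)(?:\[(?P<pid>\d+)\])?:\s+(?P<msg>.*)$'
)


def fence_events(log_text):
    """Return (timestamp, message) for every line logged by fenced."""
    events = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m and m.group("daemon") == "fenced":
            events.append((m.group("stamp"), m.group("msg")))
    return events


if __name__ == "__main__":
    for stamp, msg in fence_events(LOG):
        print(stamp, "|", msg)
```

Run against the excerpt, this leaves only the three fenced entries: the
post_join_delay timeout, the fence request, and the success seven seconds
later.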
That is sooooo coooool.
Time for the weekend!
Madi
--
The Toronto Linux Users Group. Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists