last weekend was not a very good weekend for me. sucked big time! i was on call last week. there were no pages at all during the weekdays, but on Friday my boss told me that there would be a power outage in one of our DCs over the weekend. they wanted to do some work on the power supply & would fail the power over to generators. my thinking: i should not be worried, since the power would still be there. but i was wrong!
on Saturday i got paged as early as 7am. one by one, our servers were rebooting. i logged in to the office's vpn and looked at our monitoring tool. shit! all of our servers had rebooted & a few of them were still down, including 4 out of our 6 cluster servers! i knew something was not right.
i connected to the console to check what was wrong with the servers. some of them had crashed & needed fsck, some of them kept on rebooting with root_mount_not_found, and others showed weird errors that i had never encountered before.
my boss called and told me to go to the DC. he was coming too, of course. i was there till 10pm running fsck on the servers. we managed to recover all but 1 server, and continued on Sunday till 12pm.
there goes my weekend...
but out of it i learned a lot of things, especially recovering root filesystems on solaris disk suite, veritas volume manager & veritas cluster server, as well as the pressure behind it when the big boss keeps asking when the systems will be back online.
what went wrong was that the generator failed!!! fsck /dev/generator
p/s: in total i received over 100 pages