Thursday, November 22, 2007

fsck /dev/oncall

last weekend was not a very good weekend for me. it sucked big time! i was oncall last week. there were no pages at all during the weekdays. but on friday my boss told me that there would be a power outage in one of our DCs that weekend. they wanted to do some work on the power supply & would fail over the power to generators. my thinking was that i should not be worried, since the power would still be there. but i was wrong!

on saturday i got paged as early as 7am. one by one our servers were rebooting. i logged in to the office vpn and looked at our monitoring tool. shit! all of our servers had rebooted & a few of them were still down, including 4 out of our 6 cluster servers! i knew something was not right.
i connected to the console to check what was wrong with the servers. some of them had crashed & needed fsck, some of them kept rebooting with root_mount_not_found and some other weird errors that i had never encountered before.
my boss called and told me to go to the DC. he was coming too, of course. i was there till 10pm running fsck on the servers. we managed to recover all except 1 server, and continued on sunday till 12pm.
there goes my weekend...
but out of it i learned a lot of things, especially recovering root filesystems on solaris disk suite, veritas volume manager & veritas cluster server, as well as the pressure that comes with it when the big boss keeps asking when the systems will be back online.
what went wrong was that the generator failed!!! fsck /dev/generator

p/s: in total i received over 100 pages

3 comments:

Red Mummy said...

the weekend that you missed the webcam with your family

microkernel said...

Haha.. that really was lousy.. i was working the morning shift that time... when i went home at 3pm/2am, Hou, Roxanne and Philip were still working. luckily the next day i went to, hehe, training.. saved :p

cik easy said...

ohhhh when solaris runs veritas cluster and there is an unexpected shutdown... even run> ha start won't work... because the SAN or NAS isn't mounted properly.. ohhhh i've been hit by that before.. couldn't bear it