Friday, December 30, 2005

Christmas break... NOT!

We planned carefully and worked hard before Christmas so we could all have a week's break over Christmas. Over Christmas Eve, Christmas Day and Boxing Day a colleague in Egypt would be 'on call', as they celebrate Christmas in January, not December. The rest of Christmas week we would take turns being 'on call'. We have a system where, if a fault develops, it sends an SMS message to the person on call; if someone needs support, an SMS message is sent in the same way.

Tuesday was one of my other colleagues' birthday, so that day I was 'on call'. There were about 20 alerts from the monitoring system and 6 calls for help. So much for a break. The most serious problem was on the server that handles enquiries from listeners to radio stations or visitors to web sites. Our servers have multiple hard disks in what is called a RAID system. The theory of this is that if one hard disk fails the other takes over... but... this only applies to data, not to the 'system', which cannot be RAIDed without extra expensive hardware. Since it's the data that is critical to us, we thought that this was the best way forward.

There are a number of servers in our office handling the various facilities we host. This is the 'engineer's eye view' of the server rack. It looks somewhat complicated, but the rack contains servers and audio cabling and battery backup systems... everything we need to attempt to have 24/7 service!

Of course, this RAID system of multiple hard disks is the best way forward except during a Christmas break! The data was fine, but the system disk went unreliable. On Tuesday I 'patched it up' attempting to correct errors on the hard disk, with the aim of keeping it going till the following week when we were all back at work. Good theory. Didn't work in practice.

Wednesday... the hard disk failed totally. Peter is 'on call' so I can relax... hmmm... good theory? Other colleagues all round the world phoned me on my mobile, and didn't follow the correct procedure, which would have put them in touch with him. Grrrrr... some un-Christian thoughts crossed my mind! Messy day, and Peter began the process of trying to sort out the mess.

Thursday... yep, my turn again 'on call'. Serious stuff... time for a rebuild of the server. Yes, that's me peering dangerously into the server with screwdriver in hand. We have a 'spare' server, so I ended up gutting that to get the main audience relations server running.

In the process I have re-built the system in a different way. Instead of dual hard drives in a RAID system, which protected the data but left us with rebuilds whenever there were problems with the system disk, I have used the spare server as a 'mirror' for the main one. The theory of this is that if one or other server fails, the other can take over. We shall see!

