Friday, December 30, 2005

Christmas break... NOT!

We planned carefully and worked hard before Christmas so that we could all have a week's break over Christmas. Over Christmas Eve, Christmas Day and Boxing Day a colleague in Egypt would be 'on call', as they celebrate Christmas in January, not December. The rest of Christmas week we would take turns being 'on call'. We have a system where, if a fault develops, it sends an SMS message to the person on call, or if someone needs support then again an SMS message is sent.
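For the technically curious, the 'on call' idea can be sketched roughly like this. It is only an illustration, not the actual system: the rota labels, the function names and the `send_sms` callback are all made up for the example, and the real alerting software is not described in this post.

```python
from datetime import date

# Hypothetical rota for Christmas week 2005. The labels are illustrative only.
ROTA = {
    date(2005, 12, 24): "colleague-in-egypt",
    date(2005, 12, 25): "colleague-in-egypt",
    date(2005, 12, 26): "colleague-in-egypt",
    date(2005, 12, 27): "author",
    date(2005, 12, 28): "peter",
    date(2005, 12, 29): "author",
}


def on_call(day, rota=ROTA):
    """Return whoever is on call for the given day, or None if nobody is."""
    return rota.get(day)


def route_alert(day, message, send_sms, rota=ROTA):
    """Send a fault alert or a support request to the person on call.

    send_sms is an assumed callback taking (person, message); a real system
    would hand the message to an SMS gateway here.
    """
    person = on_call(day, rota)
    if person is None:
        raise LookupError(f"nobody is on call for {day}")
    send_sms(person, message)
    return person
```

The point of routing everything through one function is that callers never need to know who is on call, which is exactly the step that gets skipped when people phone a colleague's mobile directly.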

Tuesday was another colleague's birthday, so that day I was 'on call'. There were about 20 alerts from the monitoring system and 6 calls for help. So much for a break. The most serious was on the server that handles enquiries from listeners to radio stations or visitors to web sites. Our servers have multiple hard disks in what is called a RAID system. The theory of this is that if one hard disk fails the other takes over... but... this only applies to data, not to the 'system', which cannot be RAIDed without extra expensive hardware. Since it's the data that is critical to us we thought that this was the best way forward.

There are a number of servers in our office handling the various facilities we host. This is the 'engineer's eye view' of the server rack. It looks somewhat complicated, but the rack contains servers, audio cabling and battery backup systems... everything we need to attempt to have 24/7 service!

Of course, this RAID system of multiple hard disks is the best way forward except during a Christmas break! The data was fine, but the system disk became unreliable. On Tuesday I 'patched it up', attempting to correct errors on the hard disk, with the aim of keeping it going till the following week when we were all back at work. Good theory. Didn't work in practice.

Wednesday... the hard disk failed totally. Peter is 'on call' so I can relax... hmmm... good theory? Other colleagues all round the world phoned me on my mobile, and didn't follow the correct procedure which would have put them in touch with him. Grrrrr... some un-Christian thoughts crossed my mind! Messy day, and Peter began the process of trying to sort out the mess.

Thursday... yep, my turn again 'on call'. Serious stuff... time for a rebuild of the server. Yes, that's me peering dangerously into the server with screwdriver in hand. We have a 'spare' server, so I ended up gutting that to get the main audience relations server running.

In the process I have re-built the system in a different way. Instead of dual hard drives in a RAID system, which protected the data but left us with rebuilds whenever there were problems with the system disk, I have used the spare server as a 'mirror' for the main one. The theory of this is that if one or other server fails the other can take over. We shall see!
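The 'mirror' theory can be pictured with a small sketch. This is an assumption-laden illustration rather than the real setup: the server names and the `is_healthy` check are hypothetical, and how the two servers are actually kept in step is not covered here.

```python
def pick_active(servers, is_healthy):
    """Return the first server in preference order that passes its health check.

    servers: names in order of preference, e.g. ["main", "mirror"].
    is_healthy: assumed callback that returns True if the named server is up.
    Returns None if no server is healthy.
    """
    for name in servers:
        if is_healthy(name):
            return name
    return None
```

With `["main", "mirror"]` as the list, the main server is used while it is healthy and the mirror takes over only when it is not. The hard part in practice is keeping the mirror's data current, which this sketch does not attempt.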

Saturday, December 10, 2005

A week is a long time...

They say a week is a long time in politics... but it seems that two weeks is even longer in the work I do, since it's two weeks ago that I last wrote in this blog.

A couple of weeks ago the main email server in London had been attacked and had just been rebuilt. Well... nearly rebuilt, as we found out later. It was not that the attacker had left anything behind - we had been worried that he might have left a trojan or two. Trojan programs are the computer equivalent of the Trojan horse in Greek history - they are programs hidden in the system that allow malicious attackers later access, without being obvious as something nasty. There weren't any trojan programs left.

But... the server was totally unreliable for about 10 days. Servers like the ones we run have, as part of their Operating System, a method of self-protection. What that means is that if they start running out of resources they automatically kill off normal programs so that the server keeps going. So, for instance, if it was running short of memory the server would kill off the email service and keep going. This stops us having what Windows users call the 'blue screen of death', but it can be extremely irritating to find you have to restart the email service or whatever regularly. But we couldn't work out why we were running out of memory. The server has inside it two computers [processors] and about four times the amount of memory a home computer has. It shouldn't run out of memory!
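That self-protection can be pictured with a toy model. It is a simplification and not the operating system's real algorithm, which weighs more than just size: the process names, the memory figures and the 'protected' list are all invented for the illustration.

```python
def free_memory(processes, limit_mb, protected=("system",)):
    """Toy model: stop the largest unprotected programs until memory use fits.

    processes: dict of program name -> memory use in MB (made-up figures).
    limit_mb: how much memory is available in total.
    protected: names that are never stopped (hypothetical).
    Returns (killed, running): the names stopped, in order, and what is left.
    """
    running = dict(processes)
    killed = []
    while sum(running.values()) > limit_mb:
        candidates = [name for name in running if name not in protected]
        if not candidates:
            break  # nothing left that may be stopped
        victim = max(candidates, key=lambda name: running[name])
        killed.append(victim)
        del running[victim]
    return killed, running
```

With `{"system": 200, "email": 900, "web": 300}` and a limit of 1000 MB, the email service is the one that gets stopped while everything else carries on, which is the behaviour described above.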

In this trauma we thought the memory was faulty, so we got the leasing company to change the memory. Because we were having memory problems they tested the new memory before installing it, and after they changed it we still had memory problems.

Eventually we traced the fault to the re-install: when the leasing company had re-installed the system they had installed the wrong 'kernel'. The kernel is the heart of the Operating System, and the kernel they had installed was designed for very old processors that could not handle as much memory as we had installed in the server. We remotely installed a new kernel. This worried us, because if the kernel didn't install correctly the server would crash and we would have to request another total install of the Operating System. However, it did install correctly and we now have the server behaving correctly, working faster and not running out of memory.

We took the opportunity of the rebuild of the server to implement some security enhancements that we had been planning to do around now anyhow. But we had been planning to do them on another server to test before installing them on the 'live' one. It also required writing a new user interface for part of the system... all to be done as fast as possible so that 'normal service will be resumed as soon as possible', as it used to say on the TV screens while I was a kid growing up.

In between all this I was co-ordinating a project for a new very large website we will launch in February. We have had a programmer here for a month [he leaves today actually] and there has been the need for a lot of thought about how the underlying structure will work so that it will be expandable in the future.

And... the mobile phone text message system is just about to go into phase two of development. We have proven it is both needed and doable, but we need a more reliable and expandable system that could also be easily installed in other locations around the world. We had a planning meeting about that, leading on to research about equipment we can use in the future. Phase two starts as soon as possible...

Between all this I have been editing a video training series for the organization we grew out of. Eventually we hope it will be an interactive DVD. Training videos are always very difficult to do. The reason for this is that they have to be interesting. I know all films and videos have to be interesting, but training films are somehow more difficult to get to be interesting.