Thursday, October 30, 2008

Most men, so I'm told, are glued to the TV when either the football or the Olympics is being shown. Not me. I'm happily oblivious to both. But now... it's the Volvo Ocean Race... and why do I mention it? Well, my team is currently in the lead. And, to make things even better, they have just broken a world record.
Torben Grael and the crew of Ericsson 4 swept into the history books yesterday as the first monohull to breach the 600-mile barrier in 24 hours. They’ve been chased by men, machines and the elements in the last 48 hours – and nothing has touched them.
They had been lying fourth but battling it out with the leaders - Green Dragon, Puma and Telefonica Black. But it's pretty dreadful weather they are sailing through, as Mark Chisnell puts it:
In their foaming, boiling, 25-knot wake the fleet lies scattered as the devil and the deep blue sea picked off the hindmost one by one – the cold front sweeping over them with a mix of murderous squalls and ugly waves in a pitch black night. We’re almost down to the last man standing.
If you're as gripped as I am you can follow the race online, even through a 3D virtual simulator: the boats' instruments, when they are working, relay everything via satellite to your computer at home... almost in real time. But they don't always work. In fact, Ericsson 4 have an equipment failure right now.

So back to reality for me... over the past couple of weeks we have been battling murderous squalls on the technical front. Three weeks ago I wrote about the DDoS attack. One of the outcomes of reviewing it was a decision to upgrade two or more of the servers. They are three years old now, so replacing them is about due. But it's not just a case of copying the files and off you go... it will take three of us at least a month to move everything over and upgrade all the systems on the new servers. A very big job, which is why we only try to do it every three years!

Having decided to do this, we brought Raed over from Egypt to help and then ordered the new hardware. We lease the servers rather than buy them, leaving the leasing company responsible for the hardware maintenance. On Monday they will hand them over to us with a bare operating system installed, and we will start the task of checking them, installing all the systems and moving the sites across.

In between all this the attacks have continued - like a cold front sweeping over us. We watch the attackers in real time and have defense mechanisms set up to rebuff them. But trying to second-guess their moves is difficult, so we have set up what is called a 'honey trap' [a honeypot] - a decoy to try and lure them into showing their methods. This should give us some indication of how much they know about us and why certain sites are attacked more than others.
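Ours is more elaborate than this, but the basic idea of a honeypot can be as small as a listener on a decoy port that records everything anyone sends to it. A minimal sketch (made-up port and log file, and certainly not our real trap):

```python
# Minimal honeypot sketch [illustration only, not our production setup].
# It listens on an unused port and logs every connection attempt plus the
# first bytes each client sends, so the attackers' methods can be studied.
import socket
import datetime

LISTEN_PORT = 8081          # hypothetical decoy port, assumed unused
LOG_FILE = "honeypot.log"   # hypothetical log file

def log(line):
    with open(LOG_FILE, "a") as f:
        f.write("%s %s\n" % (datetime.datetime.utcnow().isoformat(), line))

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", LISTEN_PORT))
server.listen(5)

while True:
    client, (ip, port) = server.accept()
    client.settimeout(5)
    try:
        data = client.recv(1024)          # capture the first request line(s)
        log("%s:%d sent %r" % (ip, port, data))
    except socket.timeout:
        log("%s:%d connected but sent nothing" % (ip, port))
    finally:
        client.close()
```

Anything that turns up in a log like that is, by definition, not legitimate traffic, which is what makes it useful for watching methods rather than just blocking them.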

One of our partners - with a site for central Asia - was chatting online with me today; they want to increase the facilities and start online broadcasting to their region. Another site - for the Middle East - will have new facilities and a new design before the new year. A further new site - also for the Middle East - should be live before the new year. So it feels like a 'foaming, boiling, 25-knot' race downwind, barely in control of what is happening. I am looking forward to Christmas - which I hope will be the end of this leg of our race, with the sites and new servers all behind me.

Friday, October 03, 2008

DDoS attack

For most of the day yesterday we suffered what is called a 'Distributed Denial of Service' or DDoS attack. This meant that web sites on one server were unavailable at times. The problem would have shown itself either as the server appearing to run slowly or being unavailable, or as problems within the website that looked like a MySQL error.

So what is a DDoS attack? Well, in our case all of these problems were caused by a whole load of computers sending invalid file requests many times per second - or, at their slowest, many, many times per minute. Each request started an extra instance of the web server to respond to it, until the server ran out of resources and failed to deliver. Normally the 'load of computers' are Windows computers infected with viruses that allow them to be controlled from a master computer or robot system [such a network is usually called a botnet]. All automated. Against us.
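To get a feel for the arithmetic (rough, made-up numbers - not measurements from our server): the number of web server workers busy at any moment is roughly the request rate multiplied by how long each request takes to answer, so a flood of junk requests quickly needs more workers than the pool has.

```python
# Back-of-envelope model of how a request flood exhausts a web server's
# worker pool. The numbers are illustrative, not taken from our server.
MAX_WORKERS = 150     # e.g. an Apache MaxClients-style limit [assumed]
SERVICE_TIME = 0.2    # seconds to answer one ordinary request [assumed]

def workers_needed(requests_per_second):
    # Little's law: average busy workers = arrival rate x service time
    return requests_per_second * SERVICE_TIME

for rps in (50, 200, 500, 1000, 2000):
    busy = workers_needed(rps)
    status = "OK" if busy <= MAX_WORKERS else "saturated - requests queue or fail"
    print("%5d req/s -> %6.0f workers needed (%s)" % (rps, busy, status))

# At 50 req/s only ~10 workers are busy; at 1000 req/s the flood needs ~200,
# more than the pool has, so legitimate visitors stop getting answers.
```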

Peter eventually wrote a new rule into our automated response system to stop this happening by blocking users who try the same method of attack. Within seconds they were being blocked.
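Roughly speaking, a rule like that watches for addresses that repeat the same invalid request too often within a short window and blocks them. Here is a toy sketch of the idea (made-up thresholds and a generic firewall command - not Peter's actual rule):

```python
# Toy version of an automated blocking rule [illustration only; the real
# rule, thresholds and block mechanism are different].
import collections
import subprocess
import time

THRESHOLD = 20   # more than 20 matching bad requests...
WINDOW = 10      # ...within 10 seconds triggers a block [assumed values]

recent = collections.defaultdict(list)   # ip -> timestamps of bad requests
blocked = set()

def block(ip):
    # Hypothetical block action [needs root]; ours talks to the firewall
    # through the automated response system rather than calling iptables.
    subprocess.call(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"])
    blocked.add(ip)
    print("blocked", ip)

def handle_bad_request(ip):
    if ip in blocked:
        return
    now = time.time()
    hits = [t for t in recent[ip] if now - t < WINDOW]
    hits.append(now)
    recent[ip] = hits
    if len(hits) > THRESHOLD:
        block(ip)

# handle_bad_request() would be called for every request matching the attack
# signature, e.g. while tailing the web server's access log.
```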

Fortunately it was a relatively minor attack. We recorded only 59 computers attacking us from the time we turned on the rule in the automated response system to block them. Today this has dropped to a trickle of 26 still attacking us in the first 8 hours of the day - all being blocked. Some botnets are huge - for instance, this August the Dutch police shut down a botnet of approximately 100,000 [Windows] computers infected and controlled by two people.

Oh, and the problem on Wednesday turned out to be a faulty cable. How come a faulty cable did all that? Well, the switch connected to a workstation in the office [which, by the way, was turned off] sensed something strange on the cable and decided to keep trying to sort it out, many thousands or millions of times per second. It also decided to tell the entire LAN about the problem [a broadcast message], again many thousands or millions of times per second. These broadcast messages affected the other switches and the server. Cable fixed, fault disappeared!

In case you're thinking that sounds rather like the DoS attack we suffered, it was - it was a type of DoS attack. The difference is that one was accidental, while from the evidence in the logs we can see the other was malicious.

Wednesday, October 01, 2008

Yikes, it's a bridging storm?

Today is a public holiday... so I should be off. I had hoped to go sailing with a friend.

Alert on my phone: All the connections to the FUP system are down... in fact sarah is down [sarah is the name of one of our servers]. This needs urgent attention. So I speed to the office.

It appears that one of the transceivers on one of the routers has failed. So I change it. No difference... but one part of our system starts to sort of work. So I check all the cables... and find that some of the ones I need to identify are not labeled. [We have copious free time for labeling... not!] So I label all the cables, plug in the critical ones, and everything looks fine.

I plug in the rest and... one of the servers has totally locked up. What? Crazy... cannot happen. I spend the next few hours sorting out the server and everything looks fine... for a while... but my notebook cannot get an IP address. Why? So I unplug all the non-critical cables and... my notebook gets an IP address. Everything looks fine.

I plug in the rest again and... one of the servers has totally locked up. What? Crazy... cannot happen.
OK, this time I've learnt my lesson. I leave all the non-critical cables out, reboot, sort out the server and leave for home [dinner time now].

After dinner... I get an alert. One of the servers is not connecting. So I go back to the office... and find that in all my plugging and unplugging one of the cables has come loose. So I fix it, plug it in and go home.

Then the strange bit. I speak to Peter. He explains [I hope I get the jargon right] that we may have a 'bridging storm' going on. Basically it's this... we have more than 50 devices [servers, routers, phones, workstations etc.] in the office on 3 different physical LANs [i.e. networks], connected to about 16 'switches'... connected to 2 Internet connections to the outside world.

Switches are the things that connect all the devices together; they talk to each other and form a tree, with one of them being the 'boss' [I'm sure Peter had a more technical word for that]. And the master switch talks to all the others, telling them where in the tree they are and how to behave. If one of them wants to become the boss, an argument starts, and that can result in a bridging storm where some switches [and thus devices] are cut off. Why so many switches? Well, three reasons: firstly, it's difficult and expensive to cable every point from a central location; secondly, we have many extra points we need for testing and research and development; and finally, manufacturers [including manufacturers of VOIP phones] now add a switch in the back of their devices.
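The 'boss' is, I believe, what the Spanning Tree Protocol calls the root bridge, and the election is simple: every switch advertises an ID and the lowest one wins. A stripped-down sketch of that election (made-up switch names and IDs), including what happens when something new starts claiming it should be the boss:

```python
# Stripped-down sketch of the Spanning Tree Protocol's root-bridge election:
# every switch advertises a bridge ID [priority, MAC address] and the lowest
# one becomes the 'boss' [root bridge]. Real STP exchanges BPDU messages
# continuously; this only shows the comparison rule. IDs below are made up.
switches = {
    "office-core":   (32768, "00:1a:2b:00:00:01"),
    "rack-switch-1": (32768, "00:1a:2b:00:00:09"),
    "voip-phone-12": (32768, "00:1a:2b:00:00:55"),  # phones carry a tiny switch too
}

def elect_root(bridges):
    # Lower (priority, MAC) tuple wins, as in the standard.
    return min(bridges, key=lambda name: bridges[name])

print("root bridge:", elect_root(switches))      # -> office-core

# If a misbehaving device starts advertising a lower ID, every switch has to
# re-run the election and recompute the tree, and while they argue about it
# parts of the network can be cut off - roughly the argument Peter described.
switches["rogue-device"] = (4096, "00:00:00:00:00:01")
print("root bridge now:", elect_root(switches))  # -> rogue-device
```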

So how would this make servers lock up? Well... we have some clever software in there to make sure that either the main server or the backup is up and working. It talks roughly the same language as the switches, and maybe, just maybe, the bridging storm makes it go really crazy. Well... it makes me go crazy anyway.
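For the curious, that 'clever software' is presumably some variation on the usual main-and-backup heartbeat arrangement: the backup watches for a regular heartbeat from the main server and takes over if it goes quiet. A very rough sketch of the idea (made-up addresses and timings - not our actual failover code):

```python
# Rough sketch of a main/backup heartbeat: the backup checks the main server
# regularly and takes over its role if several checks in a row fail.
# The address, interval and miss count are made up for illustration.
import socket
import time

MAIN_ADDR = ("192.0.2.10", 9000)   # documentation-range IP, hypothetical
INTERVAL = 2                       # seconds between heartbeat checks
MISSES_ALLOWED = 3

def heartbeat_ok():
    try:
        s = socket.create_connection(MAIN_ADDR, timeout=1)
        s.close()
        return True
    except OSError:
        return False

def take_over():
    print("main server silent - backup taking over its services")
    # the real software would claim the shared address, start services, etc.

missed = 0
while True:
    if heartbeat_ok():
        missed = 0
    else:
        missed += 1
        if missed >= MISSES_ALLOWED:
            take_over()
            break
    time.sleep(INTERVAL)
```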