Today is a public holiday... so I should be off. I had hoped to go sailing with a friend.
Alert on my phone: All the connections to the FUP system are down... in fact sarah is down [sarah is the name of one of our servers]. This needs urgent attention. So I speed to the office.
It appears that one of the transceivers on one of the routers has failed. So I change it. No difference... but one part of our system starts to sort of work. So I check all the cables... and find that some of the ones I need to identify are not labeled. [We have copious free time for labeling... not!] So I label all the cables, plug in the critical ones and everything looks fine.
I plug in the rest and... one of the servers has totally locked up. What? Crazy... cannot happen. Spend next few hours sorting out the server and everything looks fine... for a while... but my notebook cannot get an IP address. Why? So I unplug all the non-critical cables and... my notebook gets an IP address. Everything looks fine.
I plug in the rest and... one of the servers has totally locked up. What? Crazy... cannot happen.
OK, this time I've learnt my lesson. I leave all the non-critical cables out, reboot and sort out the server and leave for home [dinner time now].
After dinner... I get an alert. One of the servers is not connecting. So I go back to the office... and find that in all my plugging and unplugging one of the cables has come loose. So I fix it, plug it in and go home.
Then the strange bit. I speak to Peter. He explains [hope I get the jargon right] that we may have a 'bridging storm' going on. Basically it's this... we have more than 50 devices [servers, routers, phones, workstations etc] in the office on 3 different physical LANs [ie networks] connected to about 16 'switches'... connected to 2 Internet connections to the outside world.
Switches are the things that connect all the devices together and talk to each other, making a tree with one being the 'boss' [I'm sure Peter had a more technical word for that]. The boss switch talks to all the others, telling them where in the tree they are and how to behave. If one of them wants to become the boss then an argument starts, and this can result in a bridging storm where some switches [and thus devices] are cut off. Why so many switches? Well, three reasons - firstly it's difficult and expensive to cable every point from a central location, secondly we have many extra points we need for testing and research and development, and finally manufacturers [including manufacturers of VOIP phones] now add a switch in the back of their devices.
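For the curious, here is a rough sketch of how I understand the 'who is boss' argument, assuming what Peter described is the spanning tree protocol, where the boss is called the 'root bridge' and the switch with the lowest ID [priority first, then hardware address] wins. The switch names, priorities and addresses below are all made up for illustration; they are not our real kit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Switch:
    name: str       # hypothetical label, not a real device of ours
    priority: int   # lower number = stronger claim to be the boss
    mac: str        # hardware address, used only to break ties

    def bridge_id(self):
        # Compare priority first, then the address as a plain number.
        return (self.priority, int(self.mac.replace(":", ""), 16))


def elect_root(switches):
    """Return the switch that wins the election: the lowest bridge ID."""
    return min(switches, key=Switch.bridge_id)


# Three switches all left on the same (assumed) default priority.
switches = [
    Switch("core", 32768, "00:1a:2b:00:00:10"),
    Switch("lab", 32768, "00:1a:2b:00:00:05"),
    Switch("phone", 32768, "00:1a:2b:00:00:20"),
]
root = elect_root(switches)

# Plug in one more switch with a lower priority and it takes over as boss.
newcomer = Switch("test-bench", 4096, "00:1a:2b:00:00:99")
new_root = elect_root(switches + [newcomer])

print("boss before:", root.name)
print("boss after: ", new_root.name)
```

If that is right, then with everything on the same priority the boss is decided by nothing more than who happens to have the lowest hardware address, so plugging in one more box can change the boss and make the whole tree rearrange itself.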
So how would this make servers lock up? Well... we have some clever software in there to make sure that either the main or the backup is up and working. It talks roughly the same language as the switches do and maybe, just maybe, the bridging storm makes it go really crazy. Well... it makes me go crazy anyway.