So because I was the duty person this week I came in early on Tuesday to turn off all the servers. The duty person is one of the three of us who receive SMS text messages telling us about any problems with our servers. For the week they are on duty, that person is the key person for all maintenance and the other two [hopefully] get an undisturbed week to work on other projects.
I started powering down all the servers at 7:45, and at 7:56 CLUNK... the power went off. Very unusual for Cyprus to be early. The final server was closing down and the battery backup system gave it enough power to close down neatly. Well, there was nothing to do so for the morning I went off to the boat to put a coat of paint on her.
I arrived back after lunch and the power was already on. So I powered up all the servers. One had a hard disk problem which I manually corrected, but the main Internet connection wasn't working. With a colleague I diagnosed it as one of the Cisco routers, which might or might not have been working. This isn't like the router you have at home; it's a much more complex rack-mounted piece of equipment that cost a couple of thousand dollars when new (we bought ours second-hand through eBay). We have two of them, which we need to replace soon.
Cisco routers have a terminal connection that you can connect a 'dumb terminal' to and check their status. So we tried that and found that the programs we used to emulate a dumb terminal didn't seem to work, and so we were then scrabbling around to try and work out what was happening.
Peter arrived an hour later and said it looked as though, when the power had come back on, there was a surge which had blown the router's power supply. We keep spares for some things and did have a spare Cisco power supply, so we then had to take the router out of the rack to change the power supply.
At that point we found a layout logic problem with the rack. All the cables from the switches to the patch panels went across the routers. This meant we had to remove all the cables in the bay before we could remove the router. This meant all the cables needed labeling so they went back in the right place.
Now detail is not my middle name so it took three attempts to label the cables correctly. We then unplugged them all, and changed the power supply and the router started up correctly. Sounds nice and easy... well... we also had to remove the rat droppings from a vermin attack last year that we had not seen before!
We put it back in the bay and some things started working. Well... most things started working. There had been about 40 cables to label and then plug back in the right place. I had misplugged some of them and mislabeled one of them! I told you detail was not my middle name.
Eventually between all three of us (me on duty and the other two who should be doing other things) we got the system back up and running and everything working again.
I say everything... come Wednesday, and totally unconnected with this, there was a problem with email on one of our servers in Germany. Peter had been working on it and thought it was a problem with the security certificate, so he ordered another one. Unfortunately he put a password into the certificate when there should be none, so when the certificate was authenticated and delivered it didn't work. So he tried (with me across it, as I am the authorized representative of the organisation) to get a replacement.
That failed, coming up with a 'security failure' - so then I am trying to phone South Africa to talk to the company that issues the security certificates and find out what is happening. No answer from the company, so I tried sending emails/response forms... still no answer. All the while this means our email and other companies' email is not working!
I'm frustrated. I hate computers. Peter and I ended up talking about the future. I realised I spend my time approximately the following way:
- 80% Technical
- 10% Writing proposals
- 15% Organisational admin
- 25% Partner interaction
- 10% Media
I try to get the technical to be a smooth operation - it's not smooth, we have more work than possible for the team - and that creates a burden for me.
So, why don't we just drop the technical? Or reduce it?
When I say technical, what I mean here is the system maintenance. Maintaining more than 10 servers as a platform to enable the media work to proceed. Many different organisations rely on these servers.
Pete and I looked at the different servers and realised that if we just cut back to our core media project we would still need to maintain almost all of the servers just for that one project. We would save very little time. And... the partner contributions from many different groups actually help us to run those servers which then creates the platform for the main media project we want to achieve. Sort of Catch 22.
Come Thursday and I am still 'Hanging in there', but frustrated.
I get an alert (I'm duty person remember) about the temperature in the server room. The air conditioning in the server room is not working. No problem... obviously didn't restart when the power went back on. Walk round to the server room and press the remote.
Nothing happens.
Oh, must be flat batteries in the remote, since we leave the air conditioning on 24/7 and never change them. Find another remote for the same type of air conditioner, test it to make sure it does work on the other air conditioner... take it to the server room and press the remote start.
Nothing happens.
Obviously something else blew up when the power went back on.
Frustrating.
When Dena (our administrator) is back in on Monday I shall have to get her to arrange an air conditioning technician to come and sort out the unit. This being Cyprus, who knows how long that will take.
Anyway in all these frustrations we realised we had two needs:
- A Finance Director
- About size young people Media/IT literate
4 comments:
Hi Richard
I really enjoyed your post but don't understand the second of your listed needs:
About size young people Media/IT literate
Call me pedantic but could you clarify?
(I'm a lover of the technical and somewhat into detail).
Cheers
Phil Ferris
It was a typo... should have read 'about SIX young people'
Richard
Hi Richard
I don't understand the second paragraph of your post
==========
Seo Servises
Let me try to explain the second paragraph: In the same way that there are doctors 'on call' in case someone gets sick, we have a 'duty person' who is responsible for the servers each week. It's the 'duty person' who gets automatic messages from the servers if anything is going wrong. The 'duty person' is also the person who does things like powering down servers for maintenance.