Some kind of honorary BOFH award needed here...
...don't you think?
A sysadmin last week pleaded guilty to attempting to disrupt the power grid in California by shutting down a data center that managed the state's electricity supply. Lonnie Charles Denison, 33, of South Natomas in California confessed to breaking a glass cover and pushing an emergency power off button at the Independent System …
OK, let's get this straight. A single false alarm costs $14k and takes 20 (that's what the man said, count 'em, TWENTY) computer specialists seven hours to correct, while the state of California faces imminent meltdown.
So that's 140 man ^H^H^H person hours to find the reset switch. And replace the broken glass. Someone's taking the peace out of someone here, but I can't quite see who. Hence the gratuitous Paris emoticon.
1) Get someone else to do it (Preferably some bean counter)
2) Make money while it is done (energy futures?)
3) Get someone else blamed for instigating it.
4) Turn in your most hated person for doing it.
5) Erase all evidence of involvement
6) Have the contract to "restore" the system.
7) Get paid for supplying the entire 20 person team
8) Split the proceeds of the contract with PFY (70/30 at least) for only doing 10 minutes of work between yourselves.
9) Have the whole thing handled while across the street in the pub quaffing a brew.
Obviously you haven't worked in a large IT center.
It's never as simple as "finding the reset switch".
You would need network admins, system admins, DBAs and application support to bring all of the systems back online. Once back online, everything would have to undergo testing to ensure it works as it's supposed to. All of that takes manpower, time and specialists.
Of course, if it's a large installation then they can't just switch the power back on; it would have to be powered back on in stages, so you'd also need a couple of power engineers for that alone.
Seven hours for a large data center is a job well done.
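For the sceptics doing the man-hour arithmetic, here's a minimal sketch (in Python, with the system names, dependency map and health probe all made up for illustration) of what "powered back on in stages" actually means: nothing in a tier starts until everything it depends on reports healthy, and the waiting on those checks is where the seven hours go.

import subprocess
import time

# Hypothetical dependency map: system -> things that must be up before it starts.
DEPENDS_ON = {
    "core-switches":  [],
    "san-fabric":     ["core-switches"],
    "unix-hosts":     ["san-fabric"],
    "databases":      ["unix-hosts"],
    "app-servers":    ["databases"],
    "batch-and-tape": ["databases", "unix-hosts"],
}

def healthy(system: str) -> bool:
    # Stand-in probe; in real life this is fsck results, RAID state,
    # DB consistency checks and application smoke tests per system.
    return subprocess.run(["true"]).returncode == 0

def bring_up(depends_on: dict) -> None:
    started = set()
    remaining = dict(depends_on)
    while remaining:
        # A "stage" is everything whose prerequisites are already up.
        stage = [s for s, deps in remaining.items() if set(deps) <= started]
        if not stage:
            raise RuntimeError("dependency cycle or dead prerequisite")
        for system in stage:
            print("powering on", system, "...")
            while not healthy(system):   # each retry here is where the hours go
                time.sleep(30)
            started.add(system)
            del remaining[system]

bring_up(DEPENDS_ON)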
To everyone who is banging on about 20 people being BS/rubbish/crap etc.: in a large datacentre this is entirely realistic, if on the light side. You can't just turn a datacentre back on after it has had the power yanked from under it. There will be reams of documentation about what has to come up in what order, systems will have to be checked out, filesystems will need to be checked, logs checked. Incident/recovery managers will need to confirm what the initial problem was and resolve issues related to bringing the systems back up. That's not to mention stuff like resetting the air con, the EPO system, etc. etc.
You may be able to just hit the 'on' button on a Wintel/Lintel server after it's lost its power, but you really can't do that with a tape robot that lost power, or with more exotic hardware like your mainframes. As well as all of this, you will almost certainly lose a load of disks and other hardware which will need to be replaced.
At the company I work for, the following support areas were involved in a recent power outage (someone accidentally pressed the EPO while they were working on it - twunts):
UNIX: AIX, HP, SOLARIS
Storage Disk/SAN
Storage Tape/Backup
Messaging (Windows & appliances)
General Server Apps (Windows)
AD Systems
Networks (LAN/WAN)
Mainframe
iSeries
Tandem
Recovery Management
Facilities Management
That's 14 departments, some of which will have several members working on the recovery, and that's not to mention the business analysts required to make sure that everything works before you swing services back from the DR datacentre. That's a whole shitload of man-hours, and it cost my company many thousands of pounds. We now have new electrical contractors.
So, why do these datacentres have an EPO button? I can understand having emergency stops for things with large moving parts (the aircon units), but why have a button that turns off all the power to the servers? The only two situations where I can see you would legitimately have to use it are a) someone getting electrocuted or b) fire.
Now, for a): if your power supplies are correctly set up with appropriate breakers, then if someone gets electrocuted the power will trip out anyway; and if it doesn't, surely a large wooden stick could just be used to push the person away.
For b): if the fire alarm goes off, most people aren't going to be on the data floor anyway; they'll be in the NOC, hear the alarm, and could quite easily go to the power area and switch stuff off there.
I'll admit that the fire argument does give some justification, and at least in this case there was a glass cover to prevent accidental activation (I've read stories of someone pressing one thinking it was the door release switch!), but still, if you look at the risks, I'd personally say they cause more problems than they solve...
I can happily say that of those 20 guys, 3 will be managers running around like headless chickens trying to organise things they know nothing about, 2 will be senior engineers sitting at their desks grumbling about how things were easier in their day while trying to find where they put all their DR info, 8 will be NOC techs scrambling about switching things on, fetching backup tapes and praying that everything comes up, 6 will be off-site phoning in with helpful comments such as "I still can't see the server from here" and "have you tried power cycling the box?", and finally there will be a guy like me, the NOC team leader, sitting in a corner being ever so helpful with comments such as "I don't think you want to do that" and "are you sure you don't want to turn that box on first?" as well as "why haven't you gotten me a coffee yet?". Ah, those were the days........
Many years ago I toiled in the service of a bank. One day a mad Irishman and a pile of tape boxes combined to switch off the entire data centre at around five pm on a weekday. It took half a dozen BOFHs, a shedload of support and development bods and a Field Circus bloke all night and half the next day to get things back to where they'd been before Dave had his attack of brain fade.
The next day the Boss issued an edict that tape boxes were not to be stacked more than two high, and arranged for a mollyguard to be fitted to the Big Red Switch...
Why was the little red button not connected to the unfiltered side of the UPS, so that the UPS would nicely shut down the servers? This would then allow the servers to be turned back on in the set sequence to get the system up and running as quickly as possible.
But then again, I have seen little red buttons that do more than electronically turn off the power: they actually cut through the power cable supplying the server room. That was still on the unfiltered side of the UPS, though.
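For what that graceful path might look like, here's a minimal sketch assuming a NUT-style upsc client is available; the UPS name, host list and shutdown ordering are invented for illustration. The idea is that killing the unfiltered feed puts the UPS on battery, and a watcher notices and shuts the servers down tidily before the batteries give out.

import subprocess
import time

UPS = "dcups@localhost"    # hypothetical UPS name on a hypothetical NUT server
SHUTDOWN_ORDER = ["batch01", "app01", "app02", "db01", "storage-head"]  # least critical first

def on_battery() -> bool:
    # NUT's upsc reports ups.status values such as "OL" (on line) or "OB" (on battery).
    out = subprocess.run(["upsc", UPS, "ups.status"],
                         capture_output=True, text=True).stdout
    return "OB" in out

def graceful_shutdown() -> None:
    for host in SHUTDOWN_ORDER:
        print("shutting down", host)
        subprocess.run(["ssh", host, "shutdown", "-h", "now"])

while True:
    if on_battery():
        graceful_shutdown()
        break
    time.sleep(10)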
I'm loving the thought process:
"Someone's being electrocuted, but it's a real bastard to power off the data centre. Hmmm. Let the f**ker fry. Oh, that will probably set off the smoke detectors... you - minion - poke the electric person with a wooden pole. Top show. You can use the electric Dymo labeller today."
"Why was the little red button not connected to the unfiltered side of the ups". Because it's actually called an emergency power off for good reasons - it's for very rare occasions when all power to all the servers and equipment has to be shut off immediately. It doesn't greatly matter that the mains voltage is providced from the UPS or not - the batteries will be designed to keep the centre running at full power for a couple of minutes before the auxilliaries kick in. It there's a fire or some chassis have gone live due to a wiring fault then there's plenty enough stored energy to cause a great deal of damage.
This thing required the glass to be broken so it's hardly an accidental thing. It's really rather difficult to protect equipment from attack by an insider. The fact that this guy blew his top because his Unix privileges had been withdrawn is interesting. Maybe it had become obvious that he had a volatile temperament. If so, then maybe he shouldn't be anywhere near the insides of a data centre anyway. In our datacentres sys admins do not normally have access to the computer halls.
Finally, if the system really is so essential to keeping the Californian power grid running on an hour-by-hour basis, then you have to wonder why there isn't a backup centre. It doesn't require a disgruntled employee to kill a data centre - there are plenty of other ways of doing it.
What did they expect, allowing someone who's basically been fired into such a sensitive area? (He may not actually have been fired, but if they had revoked his computer rights then they should also have revoked his access rights.)
To the people who are saying 7 hours is a long time: it's clear they haven't had much experience with LARGE data centers. We aren't talking about a single server or two. We are talking about whole racks upon racks of servers. Each rack is normally home to 42 1U servers (sometimes the servers could be bigger, but we don't know the specifics). Any large data center will have hundreds of these racks.
We know nothing about what services/functions these servers were performing, but to even power on this amount of kit, it would have to be done in stages. Once all the servers are powered on, the admins will need to bring the services online in the right order, test each interconnecting system, ensure it's operating correctly and that there has been no data loss or other problems caused by the power cut, and fix any problems that were caused (restoring backups/rebuilding RAID arrays etc). All in all, it's a lot of work to get any large data center back up and running after a total power failure.
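To give a flavour of the "test each interconnecting system" step, a minimal sketch: sweep an inventory of hosts and ports and flag anything that hasn't come back. The host names and ports here are invented; a real list would come out of whatever inventory the site keeps.

import socket

# Invented inventory: host -> port that should be answering after recovery.
INVENTORY = {
    "db01":        5432,
    "app01":       8080,
    "app02":       8080,
    "backup-head": 9102,
}

def port_open(host, port, timeout=3.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

stragglers = [h for h, p in INVENTORY.items() if not port_open(h, p)]
if stragglers:
    print("still down:", ", ".join(stragglers))
else:
    print("everything answering - start the application-level checks")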
...only it wasn't malicious, it was merely stupidity.
Apparently an exec was showing some prospective customers round the computer room and, to demonstrate the effectiveness of the UPS system in place, he decided to flip the mains power switch unannounced.
Unfortunately, someone had bypassed the UPS for some scheduled maintenance. Cue the sudden spooling down of about 50 servers, the mainframe, the sounds of alarms through the room, ops people running about like headless chickens...Doh!
We used to joke that the ops should have locked him in and let him demonstrate the halon system while he was at it. BOFH would have known what to do lol.
You'd think he'd realise that the time of least demand isn't the moment to strike, wouldn't you? I'd be interested to find out:
1/ Why his admin privs had been pulled, but not his building access? Why was someone whose privileges needed revoking allowed into such a sensitive area? Why didn't they remove him to the canteen and then tell him once he was out of harm's way?
2/ What did he think he'd achieve with a false bomb threat, other than making himself look an even bigger tit and making his future prospects even LESS rosy?
At my place of work, we fire our sysadmins properly. That is, we get them off-site, preferably for an hour or so (lunch tends to work for us). When the person exits the rotating door on the way out (i.e. finishes entering the big blue room), the person at the back of the line to go out remembers something, runs back to his desk, and sends a ping to the person doing the main deed.
Access is terminated by the time said individual gets to whatever transport is being used for lunch. This includes both computer access and physical access authority.
When they return from lunch (or whatever outing they were on), there is one or more boxes, containing all of their worldly possessions they had left on-site.
This way, they have no access to any systems, inside or outside the computer room.
(Note: back when we had the blackberry ssh program in testing, the software supporting that service would flake out every so often. Mysteriously, about 5% of the time it did this, someone was fired. Oddly enough, this equated to 100% of the time we were firing someone with sufficient access to be able to do anything with it - not that they could manage anything, with their access revoked, but we didn't want lunch disturbed by them realizing they were fired.)
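A minimal sketch of what that lunchtime "ping" might kick off, with the systems and commands invented for illustration (a real shop would use whatever its directory, VPN and badge systems actually provide): one trigger, every revocation in a single pass, and anything that fails gets flagged for manual follow-up.

import subprocess

# Each entry is an invented revocation command; "USERNAME" is substituted at run time.
REVOCATIONS = [
    ["ssh", "auth-server", "usermod", "-L", "USERNAME"],       # lock the shell account
    ["ssh", "vpn-gateway", "revoke-vpn-access", "USERNAME"],   # hypothetical helper script
    ["ssh", "badge-system", "disable-badge", "USERNAME"],      # hypothetical helper script
]

def fire(username):
    # Run every revocation in one pass and flag anything needing manual follow-up.
    for template in REVOCATIONS:
        cmd = [username if part == "USERNAME" else part for part in template]
        if subprocess.run(cmd).returncode != 0:
            print("manual follow-up needed:", " ".join(cmd))

fire("departing_admin")   # triggered by the lunchtime ping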
I worked in a medium-sized data centre (four big mainframes, four VAXclusters, numerous Windoze servers and a few PDP-11s(!!)) on the ground floor of a three-storey building with a flat roof.
Most of the roof drains got blocked by leaves (that was the official excuse - no trees nearby!) and all the rainwater tried running down one 6"-diameter pipe... the pressure forced an inspection cover partway open, and it was only cos I noticed the sound of running water in the data centre (funnily enough, not a normal feature!) that we discovered the floor void filling with water from the roof.
At the deepest point, the water was about two inches or so below the trunking carrying the 3-phase power cable. For those who don't know, UK 3-phase runs at 415V and lots of amps...
Our EPO cutoff switch killed all power instantly, as it was supposed to.
The same cannot be said for a certain datacentre in Scotland where the EPO was tripped when *they* flooded, but the UPSs kept some of the server and disk cabinets and comms racks in the machine room live for almost twenty minutes... kudos to the person who specced the capacity needed, but just as well the hardware guys told the Data Centre Manager to take a running jump when he told them to go in and start fixing things before the kit ran out of juice...