I recall a day when we had an email server with a similar problem. One morning a perfect storm of a power outage and a failed UPS battery meant that power was lost to the corporate DC before the genny could kick in. The sound of UPS alarms was accompanied by the descending drone of fans and hard disks spinning down and joined seconds later by the usually reassuring sound of the genny firing up. On that particular morning that sound came too late to reassure anyone.
Controlled restarts of servers were carried out, with disk arrays carefully checked before full boots, and all was well. This was while we were actually in the middle of a project to duplicate all servers to a second DC, with real-time data replication for the most important servers and out-of-hours replication for those deemed less critical. For some reason the MS Exchange server was well down the list for this work and hadn't been actioned yet.
The storm for the Exchange server became even more perfect. Firstly, when we came to carry out the controlled restart we found somebody had left it set to boot on the restoration of power rather than waiting for a manual startup. And secondly, the main RAID 5 array was corrupt. A well-meaning technician had noted that there was a flashing light indicating that one of the hot-swappable disks was poorly. So, without consulting anybody, he decided to swap it hot. The problem for him was that while this disk was poorly it hadn't failed completely. If I recall correctly the light sequence was flashing amber = poorly, solid amber = failing, red = fucked. And obviously no light meant that a disk wasn't even getting power. Maybe the unit was stone dead or maybe it just needed reseating. This light was flashing amber. What the technician hadn't noticed was that the lights were completely out on another disk in the array. When he pulled the poorly disk, that was it. The array was shagged. Had he swapped the dead disk first it's possible that the array might have been able to rebuild itself, but not now. A RAID 5 array with two disks out is useless.
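If you want to see why two disks out is fatal, here's a minimal sketch, nothing to do with the actual array in this story, just assumed toy data blocks. RAID 5 keeps one parity block per stripe, which is the XOR of the data blocks, so it can reconstruct exactly one missing block and no more.

```python
# Toy illustration (assumed values, not from the incident) of RAID 5 parity.
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Hypothetical stripe across three data disks plus one parity disk.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])

# One disk lost (d1): rebuild it from the survivors plus parity.
rebuilt_d1 = xor_blocks([d0, d2, parity])
assert rebuilt_d1 == d1

# Two disks lost (d1 and d2): one parity equation, two unknowns.
# xor_blocks([d0, parity]) gives d1 XOR d2, which cannot be separated,
# so the stripe -- and the array -- is unrecoverable.
```

Pull the wrong disk while another one is already dark and you're past that one-disk limit, which is exactly what happened here.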
This is where the storm got really bad. The backup procedure for that server was supposed to be weekly full backups to tape with nightly differential backups. The problem was that the last few full backups had failed. Operations hadn't bothered raising a trouble ticket for the backup failures. And operations were the people who had swapped the wrong disk. Operations were not popular. We needed to firstly build a blank RAID array, restore from a four-week-old backup and then apply about twenty differential backups on top. So here was my plan. Take the server off the network to start restoring it. Spin up a replacement server so people could send and receive emails. And then, when the original server was back, figure out how the hell we were supposed to consolidate the two.
The dinosaur in charge of the IT department decided that this would take too long. Yes, it would take a while: here we were in the late afternoon and I estimated that the full restore would probably run into the following morning, and then feeding it the tapes for the differential backups would take into the following night. But that would mean we had the email server back by the time people returned to the office on the third morning. A Friday, as it happens. We would be operational by the weekend. What, you may ask, about local OST files for users' histories? Unfortunately most users were on Citrix, so they didn't have local OST files.
The dinosaur had already called out hardware and software vendors, and he decided that they could fix it without resorting to backups. I explained that this was almost certainly impossible, but his response was that he paid a fortune for support so he was going to use it. Support engineers duly arrived and sat down to formulate a plan. The dinosaur went home before they'd all arrived. We hung around till about 10pm and, as there was nothing we could do, off we fucked as well. I arrived at 7am the next day to find that the engineers were finalizing their plan. That's right, they'd achieved nothing overnight. The plan arrived at 7:30 and it was a hefty document. We sat through a presentation from the assembled support engineers, the first half of which was an explanation of exactly what damage had been done and how. The second half was how the engineers proposed to repair the damage. It was about 30 seconds into this second part of the presentation that I realised the plan was almost exactly the same as the plan I'd hastily cobbled together the previous afternoon. The only difference being that they knew how to consolidate the two servers at the end. I could see the dinosaur looked distinctly uncomfortable. However, his discomfort increased exponentially when one of the engineers said that we should have implemented a replacement email server before they even arrived.
Dinosaur barked out orders as to who was going to do what, but you could see he was panicking. He had a meeting with the board at nine. Here we were, almost 24 hours after the initial failure, and email was still down. Email of course being just about the only system the board actually used. Not only that, but Dino now had to tell them that it wouldn't be back until Friday evening. It was likely that quite a lot of people would have to work the weekend. And that would be at time and a half.
Dino wanted to take a delegation to the board meeting as backup. The trouble was that anybody from engineering would be likely to explain that Dino had delayed things by over 12 hours by bringing in external engineers who had come to the same conclusion as his own engineers. He decided to take the operations manager and the facilities manager. Not as backup, but as scapegoats. He could blame operations for the failed backups and for swapping the wrong disk. And he could blame facilities for the failed battery. But blame is a dangerous game. Of course ops and FM had some difficult questions to answer, but when you're being blamed the normal tactic is to deflect some of that blame. So it wasn't long before the operations manager dumped all over Dino and explained the delay of well over twelve hours. Not to be outdone, FM jumped in and pointed out that their monthly genny and UPS tests had been cut to six-monthly by Dino because he didn't like paying the bills.