Hmm?
"Sun has so far been unable to respond."
An interesting choice of words.
The combination of Oracle and Sun has apparently scored a massive hit on Microsoft even before the firms consummate their merger. The Sidekick service crash involved an Oracle RAC database and Sun Solaris and Linux servers, according to reports. Oracle RAC (Real Application Clusters) involves a single database running across …
Where was the Physical Standby (preferably a half day behind production)? I'm guessing there were more than 4 sockets in the Oracle RAC, so they'd be running on Enterprise, which has Data Guard as standard.
Why wasn't there SAN replication?
I'm thinking they were scrimping on licenses. Nine years without a big failure is all well and good, but a big blowout will happen eventually.
Roughly Drafted Magazine has a fourth source that points the finger at Microsoft's Roz Ho...
http://www.roughlydrafted.com/2009/10/15/microsofts-pinkdanger-backup-problem-blamed-on-roz-ho/
After personal involvement with rectifying other people's stuff-ups, I would go for management's "We don't need no stinking back-ups" instruction as the likely cause.
What a load of tosh!
RAC is just a clustered, multi-instance Oracle DB, mounted on a clustered filesystem, usually ASM down to raw filesystems. The only thing I can attribute the nonsense statement above to is possibly an Oracle ASM failure under the RAC? The mounted disks under ASM failed to come back, with corrupted headers maybe, and Oracle ASM was unable to re-assemble the disks into the correct disk groups, thus the DB would not mount and start up.
The basic facts still stand. The outfit didn't have a backup that they could restore. It doesn't matter whose hardware or database server was being used. The original, unforgivable sin was to not have a backup. If you have a secondary server that is mirroring the first, it will get exactly the same updates as the primary and will therefore execute the same DELETE, DROP, ALTER and other destructive commands that the primary is fed. That's not a backup.
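To make the mirror-versus-backup point concrete, here is a toy sketch in Python. Everything in it is made up for illustration (the `Database` class, the `replicated_execute` helper, the `contacts` table); it is not how Oracle or any real replication product works, just the logic of the argument.

```python
import copy


class Database:
    """Toy in-memory database: a dict of table name -> list of rows."""

    def __init__(self):
        self.tables = {}

    def execute(self, op, table, rows=None):
        if op == "CREATE":
            self.tables[table] = list(rows or [])
        elif op == "DROP":
            self.tables.pop(table, None)
        else:
            raise ValueError(f"unsupported operation: {op}")


primary = Database()
mirror = Database()


def replicated_execute(op, table, rows=None):
    """Hypothetical synchronous mirroring: every command hits both copies."""
    for db in (primary, mirror):
        db.execute(op, table, rows)


replicated_execute("CREATE", "contacts", ["alice", "bob"])

# A backup is a point-in-time copy that later commands cannot touch.
backup = copy.deepcopy(primary.tables)

# A destructive command is faithfully replayed on the mirror as well.
replicated_execute("DROP", "contacts")

print("primary:", primary.tables)
print("mirror: ", mirror.tables)
print("backup: ", backup)
```

After the DROP, both the primary and the mirror have lost the table; only the separate point-in-time copy still holds the rows.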
From the job advertisement, it sounds like they already knew their architecture wasn't very high quality (which no doubt will be explored fully in any legal case), which makes the lack of a backup even more negligent.
You have to love the marketing speak, "Oracle RAC (Real Application Clusters) involves a single database running across a cluster of servers for fault tolerance, performance and scalability reasons"
Perhaps they should be selling this product instead;
"Horrocle Quack (Duck and the brown stuff will miss you) involves a monolithic, decades old dinosaur of a database being frantically shored up with more and more duplication of everything to try and paper over the designed in problems.
Quack continues to provide the availability you have enjoyed with your overpriced hardware replicating SANs by ensuring that any error that occurs anywhere in the system is automatically propagated across all instances destroying your data and your business in a single stroke.
If you spend enough money on disk snapshots, ouija boards, rabbit's feet and Horrocle licences there is a tiny chance that your database might survive a disk failure, none at all that it will survive a Horrocle upgrade or administrator error"
SAN replication may not have helped if the primary DB was corrupted, since it would also corrupt the replicate. The only way to recover from a corrupted DB is to load an online backup taken before the corruption occurred then replay the transaction logs up until just before the time of first corruption.
The risk here is that you can lose TX made to the DB after the first corruption happened, so you may need to re-enter those TX manually.
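A rough Python sketch of that restore-then-replay idea, for anyone who wants to see the logic spelled out. The log format (sequence number, key, value) and the function name `point_in_time_restore` are invented for illustration and are not any real database's API; a real recovery would use the vendor's own tooling.

```python
def point_in_time_restore(backup_state, backup_seq, log, first_bad_seq):
    """Rebuild state from a backup plus the transaction log.

    backup_state  -- data as it was when the backup was taken
    backup_seq    -- sequence number of the last TX included in the backup
    log           -- list of (seq, key, value) entries in commit order
    first_bad_seq -- sequence number of the first corrupting TX

    Returns (restored_state, lost_seqs). Every TX at or after first_bad_seq
    is skipped, including good ones, which is why some may need re-entering.
    """
    state = dict(backup_state)
    lost = []
    for seq, key, value in log:
        if seq <= backup_seq:
            continue  # already contained in the backup
        if seq < first_bad_seq:
            state[key] = value  # replay a known-good TX
        else:
            lost.append(seq)  # made after corruption began, not replayed
    return state, lost


backup_state = {"a": 1}
log = [
    (9, "a", 1),      # already in the backup
    (11, "b", 2),
    (12, "a", 3),
    (13, "b", -999),  # first corrupting write
    (14, "c", 4),     # a good TX, but made after the corruption
]
restored, lost = point_in_time_restore(backup_state, 10, log, 13)
print(restored, lost)
```

Here TX 14 was perfectly valid but ends up in the lost list, which is the manual re-entry risk mentioned above.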
This is a configuration and design problem. It's easy to point at whoever provided the kit and blame them, but this is clearly not as simple as that. Where was the DR system? Where were the backups? Where were the integrity checks on the live & DR systems that would have picked up something bad happening? Where were the regular restores from backup to ensure you can actually get the system back up from backup? All these seem to have been just missing.
First, let's kick the fanbois! So, we have an ex-Apple crew, and when they are tasked with building a service they don't go for the much-hyped and over-priced Apple servers or storage, they go for a mix of commodity x64, Linux and Sun SPARC. So much for Apple's servers being ready for the enterprise!
And now for the Sunshiners! After their naked glee at what they thought was Big Bad M$ screwing it up, they are faced with the root of the problem probably being a Sun Slowaris issue. Oops, would you like to retract some of the M$ bashing?
And the Oraclistas, are you maybe getting a little worried about that Sun hardware bizz you bought, which you haven't changed, but which you are now pushing to customers as enterprise-ready, resilient and reliable?
I'm just waiting for the first conspiracy theorist to start claiming it's all a big M$ plot to discredit Sun, Oracle and Linux......
;)
"The back-end storage is not known..."
http://www.tgdaily.com/content/view/44313/97/
"While users will be relieved that their information looks likely to be recovered, the episode poses several questions over the competence of Danger’s staff; the technical ability of contractor Hitachi Data Systems; and the inherent stupidity of the Cloud concept."
"While we are unlikely ever to be told the full story, it looks very much as if Hitachi’s attempts to upgrade Danger’s Storage Area Network failed big time and that the data was put at risk not by hardware failure, but by good old-fashioned human error."
Doesn't matter how good your kit is if you have monkeys administrating it. Oooh now time for the slopey shoulders finger pointing and trying to direct the blame elsewhere. hmmmm.
Also I thought this was suspected to be an act of sabotage? Well they get what they deserve for running their server rooms like a zoo and having no backups.
"existing environment has fragile software and is unreliable" -- This statement more than likely refers to the application/database client.
If the organization is as clueless as their jumble of statements would imply, I could easily see them trying to patch the cluster with Oracle still up. That they apparently could not recover with the remaining nodes --- well, I think this is an example of simple stupidity after all. I have done more clusters (RAC and HA) than I care to think about. The vast majority of all problems with RAC are customer related.
M$ was still responsible for this data. If they thought the platform was "unstable", well, they were under a contractual obligation to make it stable. Also, any "critical" IT infrastructure change requires a full backup, especially one like this one. Do they think it was bad, and that the MS Windows platform was better? Then they should've migrated to that. They had more than a year to do it.
If I did this at a bank, I'm pretty sure I would no longer have my job.
I'm also pretty sure any OS or RDBMS will crash and burn if your storage goes poof, which has been the MS argument all along.
MS not being able to manage a kick in a riot
Regardless of WHAT fucking platform was running, some manager made the decision that the existing backup procedures were good enough.
1) Backup/redundancy systems were not up to par
2) System was shoddy to begin with
3) They were penny pinching
4) A clueless manager may or may not have killed a backup prematurely
5) They were too cheap to buy additional space, despite space being as cheap as dirt these days
One big clusterfuck all around, and it's all down to system management, NOT what platform is being used.
That clusterfuck is to be laid squarely at the feet of whatever IT "manager" was in charge.
Said IT manager, should meet with a tragic accident if there was any justice in the world.
I have been an IT tech in such companies. When there was THAT level of penny pinching, when it came to system stability and backups, if I wasn't allowed to fix it, or management didn't see the massive oncoming fuckey that was going to happen... I quit and found a better job.
It's my personal HATE, fixing shit like that. I really feel for the poor techs who are trying to sort that crap pile out. If you're reading this, boys (or girls), I would march into the office and get a commitment in writing to fix the system once this is all done, and if they even hesitate to give it to you, quit on the spot. Once this is fixed, it's only a matter of time before it happens again, unless they do a major re-engineer of the system, which means spending dollars.. lots and lots of dollars.
It's not worth the ulcer, lost nights, stress and bullshit.
Go and work for a company with some fucking brains, and a reasonable idea of what constitutes a backup and recovery plan.
Beer is for you guys.
Best of luck,
CW
Is Microsoft liable for the damage done to T-Mobile and their customers? No doubt about that.
Was it Snoracle who failed to deliver on their stability promises? Fuck yes!
"More than a year" to migrate enterprise software to a completely different HW+SW platform? Well, for any piece of code more complex than a calculator it's hardly enough for decent reqs and designwork.
Imagine yourself and your company inheriting a clusterfuck of "highly reliable" software/hardware, so well-built by industry's finest talents it can't be touched without causing service downtime. Because there's a good chance this is what'd happened, ladies and gentlemen.
I'm not saying Snoracle stack is a disaster waiting to happen. But it only goes to show that "everybody lies", even if they are extremely open-source and highly anti-Microsoft.
Sun sent this e-mail about the story:-
"Sun has a policy of not commenting on specific customer-related matters. What we can say is that Sun takes all customer concerns very seriously and works closely with both its customers and partners to ensure the highest satisfaction rates."
Chris.
He's just sore because Oracle won't use HP's POS kit anymore.....
There, there Pratt, you and Mark Turd can cry in your beer.
Because HP NEVER suffer any failures huh?
http://www.theregister.co.uk/2009/06/17/barclays_gloucester_outage/
So run back to your PHUX buddies (are there any left?) and be afraid because a seachange is coming and HP is directly in the path of the tsunami.
...in the ongoing decline of the reg! ms/danger didn't follow standard protocol in data-center management, fell on their noses in public and now sun/oracle is to blame...?! an article based on a job posting and some uneducated guesses. man you can do better than this!
total fud and I learned nothing here, except that the reg is willing to compromise everything for a few clicks and comments.
Matt Bryant - "they are faced with the root of the problem probably being a Sun Slowaris issue"
Matt's FUD prediction was wrong, as usual. Matt spelled Solaris wrong again, poor illiterate.
http://www.roughlydrafted.com/2009/10/15/microsofts-pinkdanger-backup-problem-blamed-on-roz-ho/
According to the source, the real problem was that a Microsoft manager directed the technicians performing scheduled maintenance to work without a safety net in order to save time and money. The insider reported:
“In preparation for this [SAN] upgrade, they were performing a backup, but it was 2 days into a 6 day backup procedure (it’s a lot of data). Someone from Microsoft (Roz Ho) told them to stop the backup procedure and proceed with the upgrade after assurances from Hitachi that a backup wasn’t necessary. This was done against the objections of Danger engineers.
“Now, they had a backup from a couple of months ago, but they only had the SAN space for a single backup. Because they started a new backup, they had to remove the old one. If they hadn’t done a backup at all, they’d still have the previous backup to fall back on.
“Anyway, after the SAN upgrade, disks started ‘disappearing.’ Logically, Oracle [software] freaked out and started trying to recover, which just made the damage worse.”
The problem with this report is that it places the blame, not on a complex Oracle deployment, not on bad SAN hardware or a firmware glitch, not on a disgruntled employee with inappropriate levels of access to a mission critical service, but squarely upon Microsoft management.
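If the insider's account is accurate (and it is an unverified account), the single-backup-slot detail is the part that did the damage: the old backup was removed before the new one had finished. A minimal Python sketch of the two rotation orders, with entirely hypothetical names (`unsafe_rotate`, `safe_rotate`, `interrupted_backup`):

```python
def interrupted_backup(label):
    """Stand-in for a backup job that gets stopped part-way through."""
    raise RuntimeError(f"backup {label!r} stopped on day 2 of 6")


def unsafe_rotate(existing, new_label, run_backup):
    """Only room for one backup: delete the old one first to free space."""
    remaining = []  # old backup is gone before the new one exists
    try:
        run_backup(new_label)
    except RuntimeError:
        return remaining  # nothing left to restore from
    return [new_label]


def safe_rotate(existing, new_label, run_backup):
    """Room for two: drop the old backup only after the new one completes."""
    try:
        run_backup(new_label)
    except RuntimeError:
        return list(existing)  # the old backup is still there
    return [new_label]


print(unsafe_rotate(["two-months-old"], "pre-upgrade", interrupted_backup))
print(safe_rotate(["two-months-old"], "pre-upgrade", interrupted_backup))
```

The safe ordering costs enough storage to hold two backups at once, which is exactly the space the report says was not available.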
My original post seems to have been dropped (too cutting, was it "Be Kind To Village Idiots" Day?), so let us consider a few of those nasty little things you Sunshiners like to avoid - yes, facts. I'll try and make it as easy as possible for you, we'll do it in tiny little steps so the whole reality thing doesn't scare you too much.
Firstly, we all seem to agree that MS bought the service with the IT structure already in place. Let's just consider that for a moment - MS is not responsible for choosing the mix of software and hardware used in the service, therefore, if there are flaws in the design then they are not MS's fault. No, the design responsibility lies with the ex-Apple people that ran the service initially (you paying attention at the back, fanbois?). So, we should all agree that ex-Apple people are responsible for the faulty design if backups, redundancy, and the possible impact of failed updates were never considered in the design. Then again, it's likely Sun presales or one of their resellers would have been involved in a design involving Sun kit, especially if it involved a StorEdge 9000-series array, so they also must bear a burden of responsibility for a poor design.
Secondly, for all you know, MS may have spent the time between purchase and now unravelling the system and looking at ways to improve it. Then again, maybe MS didn't look at what they acquired and just bought the old Apple hype that Apple people actually know how to design commercial systems rather than just consumer toys. Or it could be they swallowed the Sun hype about how reliable, redundant, etc, the Sun kit was. After all, would it be so surprising that MS people might not know where the problems are with Sun hardware and software, given that most MS people only see the stuff as it is being carried offsite when it is being replaced by x64 servers running Windows. So, it's highly likely the design documentation was poor, or relied on flawed information from Sun. Again, not MS's fault, they probably just wrongly assumed Sun kit and ex-Apple people were up to the job.
Thirdly, you may want to consider that redesigning and replacing an online, 24x7 system is not an easy prospect. It took us three years to junk all the SPARC-Slowaris in our systems, and that was the much easier task of migrating to other UNIX and to Linux, not the big jump of Slowaris to Windows. And when we did our task, the easy steps we did first were replacing the Sun kit serving commercial apps, such as Oracle DBs, as they were the easiest to move. The hardest to migrate were the inhouse apps as they often had poor or no documentation, and consequently needed more than just a bit of recompiling and tweaking to run well on other platforms. By the sounds of it, the Sidekick/Danger setup seems to have more than a chunk of purpose-written code, all of which would have to be examined to make sure it could be ported off Slowaris to a better platform. So, for all you know, MS may simply not have got round to that bit of the project (migration) yet, and was simply trying to shore up the rickety solution that was in-place. I'm sure that, after this fiasco, they'll now be twice as concentrated on looking at how fast they can remove the Sun bits!
Fourthly, trying to pretend the Sun StorEdge 9990 array (a badged HDS Lightning), which I hear was the storage in question, is nothing to do with Sun is childish. Yes, I know all Sun do to the hardware is stick a Sun badge on it, but they also throw out the HDS manuals and supply their own. If the array upgrade failed it may have been due to Sun engineers not doing the upgrade properly or incorrect Sun manuals. After all, we have had upgrades on our hp XP arrays without issue, and they do have the same hardware core as the Sun version (though hp do make changes to the code, produce their own XP software, and are involved in the product development, which is a lot more than Sun's stick-a-badge-on-it process). Seeing as hp insist on keeping a very tight control on any upgrades to XP arrays, I'm assuming Sun do the same for their badged Lightning arrays and therefore were party to or responsible for the failed upgrade. Of course, if Sun don't retain such tight control, maybe their customers should look to hp's XPs for better servicing and support.
So, we've reached the conclusion, and those Sunshiners and fanbois that haven't run off screaming in fear of the unknown, we have established a few facts and we have shown that the core of the problem is not MS's fault, but likely to be Sun's and those ex-Apple employees. Enjoy!
/SP&L
Dear Matt,
I don't know what happened to your previous comment - someone else may have rejected it, or it may have been published without you noticing. I'm not going to go footling around looking for the truth on that one. In any case, you are boring and I am busy and I'm going to reject anything else you post today because you are over your word limit.*
Mwah,
Sarah
*My word limit.
Not really, no. Do you think I have time to actually read and digest and consider every post on a thread like this? Balance schmalance. I don't have any bias, I just feel inclined to keep a lid on the most antagonistic posters from time to time.
Just... try and conduct yourself with a little decorum, OK? That goes for everyone. I can't be expected to pick up the toys *every* time you throw them out of the pram.
... as long as you aren't a sidekick customer.
I'm new to this thread and clearly there are lots of on-going battles between M Bryant and the rest but the last post about 'who is at fault' does raise an interesting question for me:-
If I paid upwards of $100m for a business I would make *damn* sure I knew what the risks inherent in the platform were BEFORE I took on the SLA - so the person at 'fault' is the one who signed the cheque without being confident of the due diligence process.
fix: Find that person, sack them, and move on.
To those that say 'a backup should have been available' - well clearly it was or the data would still be missing - and your simplistic attack ignores the fact that there was no working system to restore to. (And yes I agree there SHOULD have been but we just sacked the guy who made the mistake and now we are MOVING ON)
Microsoft own the problem and the buck stops there - everything else is just marketing froth
Reg
Market drones on this thread aside, it sounds more and more like Roz Ho might have just made a career-limiting move. As previously stated, this sounds like a management charliefoxtrot, which is irrelevant to hardware or software choice. The fact remains that with decent management, proper funding, competent admins and decent engineering up front, you could easily have recovered in minutes to hours with an HP, a Sun, AIX/DB2, hell, even a Windows and SQL Server (well, maybe days and the scalability horror, but I digress) solution. I don't buy for a second that M$ is not to blame for this. As has previously been mentioned, if everything was fubared then someone from M$ didn't do due diligence, if nothing else. Easy to blame a vendor to deflect blame and for marketing purposes, but hardly necessary. After all, right now Oracle and the EU are doing more than M$, IBM, or HP ever could to kill off Sun.