"0.1.0.0.1..." Sorry, can you read that part again..? Was it a 1 or a 0..?
Sounds to me like some storage virtualisation that has gone awry. The data is probably still physically on disk but inaccessible due to the loss of an algorithm.
It appears that the direct cause of the Sidekick data loss may have been storage area network remedial work outsourced to Hitachi Data Systems. A significant outage at Microsoft's Danger subsidiary, which stores data from T-Mobile mobile phone users, caused a huge amount of users' data to be lost, probably irretrievably. If so …
One real easy way to destroy data permanently is to do a data resync of a mirror in the reverse direction while there's an active filesystem against it.
i.e. let's say that before you do an upgrade you mirror the drive as a rollback, so you have a point in time to roll back to. If an admin then does a restore rather than a resync, you'll be rolling the disk back to a date a month ago, live (which is something you can do with arrays), while the filesystem doesn't know you did that and carries on writing new data to it. So you end up with a mix of old data and new files, and a corrupted filesystem. If you have some form of database on top (e.g. Oracle, MS SQL, etc.), which is most likely in this configuration, even filesystem forensics looking for file EOF markers won't work. And if you are replicating the data, it copies the corruption over as well (a rough sketch of this failure mode follows below).
Hardware won't protect you from a bad admin (we had this type of thing happen to us, courtesy of a contractor, 10 years ago).
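Not how Danger's setup actually looked, just a minimal Python sketch of the reverse-resync failure mode described above: a volume rolled back to a month-old mirror while the "filesystem" keeps writing against its in-memory allocation map. All names and block numbers are made up.

```python
# Toy model of the reverse-resync failure mode described above.
# Block device = dict of blocks; the "filesystem" keeps an in-memory
# allocation table that it believes matches what is on disk.

disk = {i: f"old-data-{i}" for i in range(8)}      # state a month ago
snapshot = dict(disk)                              # point-in-time mirror
alloc_table = {"fileA": [0, 1], "fileB": [2, 3]}   # what the FS thinks is there

# A month of normal activity: new file written, table updated.
for blk in (4, 5):
    disk[blk] = f"new-data-{blk}"
alloc_table["fileC"] = [4, 5]

# Admin "restores" the mirror over the live volume instead of
# resyncing live -> mirror. The array happily rolls every block back...
disk = dict(snapshot)

# ...but the mounted filesystem doesn't know, and keeps writing
# with its (now wrong) view of what is allocated where.
disk[6] = "new-data-6"
alloc_table["fileD"] = [6]

# Result: fileC's entry points at month-old blocks, fileD sits on top
# of metadata that predates it. No single consistent state ever existed.
for name, blocks in alloc_table.items():
    print(name, [disk[b] for b in blocks])
```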
What is wrong with Microsoft? This company is utterly inept. Blaming this on Hitachi? The problem has gotta be Microsoft's. The behaviour of Microsoft is bizarre and unbelievable. As the article suggests, how is it even possible that Microsoft was able to lose terabytes of data?
At least Microsoft should come clean about what really happened. We deserve a better explanation than what Microsoft has given so far.
Microsoft's share of the mobile phone market is declining rapidly, and this latest incident will further alienate mobile network providers (e.g. T-Mobile) as well as customers, and hasten Microsoft's mobile decline into irrelevance.
I bet this was the result of a misunderstanding/miscommunication. No way in hell do all three (or four?) companies involved employ total amateurs too dumb to have a backup or five. In my experience, when too many companies with different storage policies have access to one dataset, it's usually conflicting, badly communicated policies that lead to several people overwriting their version of the same data, each thinking that the other is keeping not a backup but "a version we should be able to merge the rest with later", or something to that effect; have this go a few rounds of "uh-no,-do-that-backwards-and-then-the-other-way-round", and suddenly all versions are useless.
The sad thing is that, as far-fetched as it sounds, it's happened more than once where I work.
(Not one of the companies named in the article, fortunately.)
We've been in talks with HDS lately for some kit, but it all went a bit sour; they simply couldn't be bothered to deal with us: not returning calls, hardly any bartering. We were looking at about £2M, reasonably small by most standards, but by contrast EMC couldn't do enough to get the order. HDS left a very sour taste, and it looks like their services division is not much better than their pre-sales!
Well, it depends what went wrong on their SAN. I mean, yeah, the bytes are probably still on the drives; I doubt it went screwy enough to overwrite everything.
BUT... what if the logical mapping is gone? How do you know what belongs to which partition? It's kind of like when your partition table gets overwritten, except you're dealing with hundreds of drives.
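A crude illustration of that point (an entirely hypothetical layout, nothing to do with Danger's actual SAN): the blocks survive, but lose the LUN-to-extent map and they are just an undifferentiated pool.

```python
# Toy SAN: one big pool of blocks, plus a mapping table that says
# which extents belong to which host-visible LUN.
pool = {i: f"block-{i}" for i in range(12)}

lun_map = {
    "lun0_mailstore": [0, 1, 2, 3],
    "lun1_contacts":  [4, 5, 6, 7],
    "lun2_photos":    [8, 9, 10, 11],
}

def read_lun(name):
    """Reassemble a LUN's contents by walking its extent list."""
    return [pool[b] for b in lun_map[name]]

print(read_lun("lun1_contacts"))   # works fine while the map exists

# Now lose only the mapping metadata -- the equivalent of an
# overwritten partition table, scaled up to hundreds of drives.
lun_map.clear()

# Every block is still physically present in `pool`, but there is no
# record of which blocks form which volume, or in what order.
print(len(pool), "blocks survive;", len(lun_map), "volumes recoverable")
```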
It is absolutely trivial to wipe out data in any RAID or SAN/NAS. Simply tell it to consider an empty or foreign drive as part of a set/volume, and it does the destroying for you. Force-load a corrupt or empty configuration. Plug in a drive that spits out "impossible" responses that were never programmed against. I've seen it done manually, seen it done automatically, even done it myself.
It is also customary to store data in an effectively non-recoverable, scattered and compressed format (e.g. Microsoft Outlook PST, although that was done for vendor lock-in).
Company-saving backups are NOT the responsibility of a T&M vendor. Data copied to a different area of the same <whatever>, while useful, is absolutely in no way a backup. The only valid reason I can think of not to have several tested backups is that Ballmer had them destroyed as a favour for one of his T-Mobile-competitor friends.
Seriously, whatever data you have at Google, Amazon's cloud, Microsoft, etc. is probably NOT backed up, merely replicated. Replication can protect against certain hardware and network failures, but your data is always just one programming error or one tired technician away from complete destruction (see the sketch below).
None of the above companies would fail to back up their own company-critical data, but neither T-Mobile, nor their business, nor their customers, nor their lawsuits are any more critical to Microsoft than your Hotmail spam folder.
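A minimal sketch of the replication-versus-backup distinction made above, with entirely made-up users and dates: synchronous replicas faithfully propagate a destructive write to every copy, while a point-in-time copy taken outside the replication chain survives it.

```python
import copy
import datetime

# Toy "cloud": a primary store plus two synchronous replicas.
primary = {"alice": ["contact1", "contact2"], "bob": ["contact3"]}
replicas = [copy.deepcopy(primary) for _ in range(2)]

def write(user, value):
    """Every write is applied to the primary and pushed to all replicas."""
    primary[user] = value
    for r in replicas:
        r[user] = copy.deepcopy(value)

# A nightly point-in-time backup lives somewhere *off* the replication path.
backup = {"taken": datetime.date(2009, 10, 1), "data": copy.deepcopy(primary)}

# One tired technician (or one buggy migration script) issues bad writes...
write("alice", [])          # oops: wipes alice's contacts
write("bob", [])            # ...and bob's

# Replication has faithfully propagated the mistake to every copy.
print("primary: ", primary)
print("replicas:", replicas)

# Only the point-in-time copy, outside the replication chain, still has the data.
print("backup from", backup["taken"], ":", backup["data"])
```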
I assume everything's in some proprietary database. It's probably not stored as plain text/numbers: with that much data, compressing each record saves a lot of space and disk access, so a low-level scan of the disks won't necessarily produce anything usable if the main index has been trashed. RAID and the like are only a defence against physical drive failure, not data corruption. If the nightly (or whatever) backup routine is allowed to run before everything has been fixed, well, there goes the backup too. Some people keep a series of backups, but that requires a lot of disks, and who actually expects such a catastrophic failure?
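Nobody outside Danger knows how their store was actually laid out, but here is a minimal sketch of the general problem described above: records compressed and packed back-to-back, with an index of offsets held separately. The record layout and contents are invented for illustration.

```python
import zlib

records = [b"alice: +1 555 0100", b"bob: +1 555 0101", b"carol: +1 555 0102"]

# Pack: compress each record and append it to one big blob,
# remembering (offset, length) in a separate index.
blob = bytearray()
index = []
for rec in records:
    packed = zlib.compress(rec)
    index.append((len(blob), len(packed)))
    blob += packed

def read_record(i):
    """Reading a record requires the index to find its slice of the blob."""
    offset, length = index[i]
    return zlib.decompress(bytes(blob[offset:offset + length]))

print(read_record(1))          # b'bob: +1 555 0101'

# Trash the index (the "main index has been trashed" case above) and the
# blob is still on disk, but a raw scan sees no plain text and, in a
# proprietary format, no documented record boundaries to latch onto.
index.clear()
print(len(blob), "bytes still on disk,", len(index), "records addressable")
```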
I'm sure someone can come up with a witty comment about the wisdom of acquiring a company called 'Danger', or even naming it that in the first place.
"It appears that the direct cause of the Sidekick data loss may have been storage area network remedial work outsourced to Hitachi Data Systems."
To you perhaps. To me, it appears that Microsoft are looking for someone else to blame for never making an offsite backup. It also appears that El Reg are happy to back Microsoft up on this, based on some astroturfing they did on Engadget. Hey, here's a juicy rumour, let's print it as observed fact. Twats.
You know as well as anyone else that whatever HDS fucked up should have resulted in no more than some downtime while they recovered from backup. The non-existence of that backup is the direct cause of the data loss. But no, it's never management's fault.
How the fuck can you not see right through this story?
Massive problems like this are never *caused* by predictable things such as SAN upgrade fuckups or fuses blowing on plugs. They are *caused* by organizations' refusal (on the basis of cost) to mitigate against events which "can't happen" but invariably do.
Right, let's start with the facts: Er, data was lost.
Reuters reports a server failure damaging two databases; Engadget reports "rumours" and that "the upgrade runs into complications", so there's very little in the way of facts to work with there.
There is no meat to say whether a database was corrupted or portions were dropped with no backup, whether a filesystem with a flat-file store was the "database" and the FS was corrupt, or whether a disk-array-related failure occurred and the data was wiped, ruined, overwritten or otherwise lost by a tired technician. There's not a single mention of array replication technology anywhere, yet HDS get the slagging off.
Who knows why it fell over and why no backups on tape exist (assuming that tape is not out of fashion yet); maybe a disgruntled techie burnt the lot!
I just find it amusing that a company called Danger was purchased by Microsoft, who probably migrated everything onto M$-$QL and M$ Server products and the whole lot fell over, a bit like Microsoft touting the London Stock Exchange on their products as a reference site only for it to suffer total trading outages a few months later.
I'll join in and speculate they should have been on some form of Unix with proper databases as my money says Wintel / MS-SQL problems are hidden in the middle somewhere.
"What kind of server or SAN crash would actually delete terabytes of data on a SAN's disk drives?"
You don't need a crash to do this; cluelessness is more effective any day. One of the apps I was looking after a while ago went spectacularly titsup, in that sort of low-level-error-you've-heard-about-but-never-seen kind of way. The root cause was the storage array, or more specifically the storage array admin. Bright young thing, hot off the training courses. He'd been asked to allocate disk to A.N. Other server on the same SAN loop. Looking at the associated array he found it was a bit pushed for space, but managed to find some vdisks that were not allocated to filesystems although they were attached to a server, so he borrowed them. This led to a short education with the clue stick, as to the fact that some RDBMSs prefer their disk served raw (a sketch of that trap follows below).
Two RAID arrays mirrored to each other locally and synced to an identical pair on another site. Having four copies makes you feel secure, until some clueless wazzock tells the whole shebang to forget the data in question. Having four copies *and* tape backup makes you feel paranoid, until that moment when you're delighted to find umpty-something gigs of "thank fuck for that" in a silo somewhere.
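A minimal sketch of the vdisk trap in the story above, assuming a Linux host and an entirely hypothetical device name: "no filesystem mounted on it" is a dangerously weak test of "nobody is using it", because a raw database volume never appears in the mount table.

```python
# Why "no filesystem on it" is not the same as "nobody is using it".
# /dev/sdq is a hypothetical device; in the story above the borrowed
# vdisks were handed to a database that wanted its disk served raw.

def mounted_devices(mounts_path="/proc/mounts"):
    """Devices that carry a mounted filesystem, per the kernel's mount table."""
    devices = set()
    with open(mounts_path) as mounts:
        for line in mounts:
            source = line.split()[0]
            if source.startswith("/dev/"):
                devices.add(source)
    return devices

candidate = "/dev/sdq"             # looks unused to the storage admin

if candidate not in mounted_devices():
    # The trap: a raw device opened directly by an RDBMS (no filesystem,
    # so no mount entry) passes this test and still holds live data.
    # Checking open handles (e.g. lsof) or the array's own host mappings
    # is the only honest answer.
    print(candidate, "has no mounted filesystem -- but that does NOT make it free")
```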
"Amazing that ALL of your backups are stored in the same set of hardware."
Maybe, Juust Maybe, they had the entire database in some derelict building outside a caravan site in, f.ex., Slough and the local lads & ladettes simply burgled the place for the copper wiring.
This kind of thing Never Happened Before, right??