Cheers to that guy who hit the ENTER key!
I've certainly done some eff-ups, but nothing that comes remotely close to this scale.
The Atlassian outage that began on April 5 is likely to last a bit longer for the several hundred customers affected. In a statement emailed to The Register, a company spokesperson said the reconstruction effort could last another two weeks. The company's spokesperson explained its engineers ran a script to delete legacy data …
"We maintain extensive backup and recovery systems"
Backup maybe but not recovery.
There does not seem to be a system in place to recover and restore this particular set of data. If there had been a well-maintained and tested recovery system in place, the data would have been restored days ago.
It would not have taken a week to restore just 9% of their customers' data.
Or the Backup System is designed to recover from a System level failure - not deleted individual data sets? So they're having to rebuild a complete system from backups before they can pull the data sets?
Suspect this setup is common in cloud land, hence the need for customers to somehow keep backups of their own data sets (read the small print for MS365/Azure, for example).
Given the embarrassment caused by this outage, cloud providers need to look at their customer backup/recovery processes?
Working with Confluence in Jira a good number of years ago, the only way to give someone a copy of one workspace was to backup everything, then restore it all to a new instance, delete everything you didn't want and then export what was left over.
Caught in their own incompetency it seems.
@thondwe “Or the Backup System is designed to recover from a System level failure - not deleted individual data sets? So they're having to rebuild a complete system from backups before they can pull the data sets?”
I think you are very much on the money there. Probably made all the more difficult and time consuming by this @AC
“the only way to give someone a copy of one workspace was to backup everything, then restore it all to a new instance, delete everything you didn't want and then export what was left over.”
It looks like there is no way to export/backup import/restore at the workspace or even the individual customer level. If that is the case, then Jira/Confluence etc is missing a very important disaster recovery tool/feature for both Atlassian’s customers and themselves.
Any ransomware outfit would be delighted to come up with a script that does so much damage for so little effort - the destructive efficiency on show is unprecedented. I hope Atlassian are keeping the script under lock and key, because we would be well and truly f**ked if the Russians got hold of it!
But on a more serious note, I would like to offer heartfelt and genuine sympathy to the person or persons responsible. It could happen to the best of us at some point in our careers. I can only imagine how they have felt the last week or two, chin up guys and/or girls!
"Given the embarrassment caused by this outage, cloud providers need to look at their customer backup/recovery processes?"
They won't. It's just easier to go cheap and write it off. This is clearly a case of an incomplete BCP, which is another area that businesses like to go cheap on. Maybe backup software but no recovery plan at all.
>> It sounds like Atlassian haven't properly tested their backups.
That's a hugely simplistic and disingenuous statement to make!
Imagine your standard database from a mainstream database vendor. You can have in place all the fault tolerance, log shipping, full and incremental backups, offsite storage and the ability to restore that database (relatively) quickly.
Now tell me how any of that will allow you to effectively restore data having had a script which has selectively removed thousands of records from dozens of tables.
That sounds much more like what Atlassian are facing than a simple "restore the backup" scenario.
Not that this is intended to be a defence of Atlassian. When you have a process that may delete copious amounts of data, it's incumbent on you to take whatever time is necessary to ensure it does (only) the right thing. If that means taking three times as long to have it run in a "what if" mode first, then that's what you need to do. Yes, it will take longer to do that, but I would suggest not as long as it's taking Atlassian to recover from this.
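For anyone wondering what a "what if" mode looks like in practice, here is a minimal sketch. Everything in it is made up for illustration (the `purge_records` name, the dict standing in for a datastore, the `"legacy"` and `"site"` kinds); it says nothing about how Atlassian's script actually worked. The idea is only that the default run reports what would go and touches nothing, and the real run refuses if the plan contains anything other than what you meant to delete.

```python
from collections import Counter


def purge_records(store, ids, dry_run=True, allowed_kinds=("legacy",)):
    """Summarise what deleting `ids` would remove; only delete when dry_run is False.

    `store` is a hypothetical in-memory stand-in mapping record id -> kind.
    """
    # Build the full plan first; an unknown id raises KeyError before anything changes.
    plan = {record_id: store[record_id] for record_id in ids}
    summary = Counter(plan.values())

    if dry_run:
        # "What if" mode: report only, never mutate.
        return summary

    unexpected = set(summary) - set(allowed_kinds)
    if unexpected:
        # Refuse the whole run if anything outside the intended kinds is in the plan.
        raise ValueError(f"refusing to delete kinds: {sorted(unexpected)}")

    for record_id in plan:
        del store[record_id]
    return summary


store = {"a": "legacy", "b": "site", "c": "legacy"}
report = purge_records(store, ["a", "b"])  # dry run: nothing is deleted
```

The dry run costs one extra pass over the data, which is the "takes longer" part, but a report that shows a "site" entry in what should be a legacy-only plan is exactly the thing you want to see before the real run.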
Yet another reason why I loathe the drive for cloud: when it falls over (not if), you are fully at the mercy of a cost-optimised, likely understaffed company to hopefully have a working backup and recovery system and to sort it out at their leisure. When you have control over the system, you are at the mercy of your own backup and recovery procedures, which you yourself can influence.
after someone "stops" my PHB, because his brain is permanently wired into accepting vendor wooing/outright kickbacks, and loving shiny-new-trendy things. He gets away with it because he's a "people-person" and he works it like a god. All his higher-ups think he's the greatest thing since toast.
Two incidents reflect the "benefits" of cloud.
1. Had to stand up a Citrix CVAD environment in the cloud service a couple of years back. If it had been on premises, I could have had the basic service set up in a day, then just spent time on tuning and optimising the VDAs. In cloudland it took a week to get the basic service, as whenever I hit an issue I could not do any troubleshooting and had to log a job and wait for them to get back to me. The first ticket I had to log was that the cloud service sign-up portal wasn't working, then it went downhill from there.
2. A client wanted to stand up some extra Azure VMs of a type they were already using. Unfortunately, due to capacity issues, these particular VM types were all reserved for large clients. Our client didn't fit into this category so had to build new servers using a different VM type. Basically "We are only giving these VMs to people we give a shit about, and that doesn't include you matey! Just pay your bill and move along."
"Instead of deleting the legacy data, the script erroneously deleted sites, and all associated products for those sites including connected products, users, and third-party applications."
With an SQL database any sufficiently paranoid DBA will run a SELECT with the same WHERE clause as the intended DELETE to check that what would be deleted is what's intended. I suppose this is all trendy NoSQL stuff. Doesn't that have the same facility?
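A minimal sketch of that SELECT-before-DELETE habit, using Python's built-in `sqlite3` module. The `guarded_delete` helper, the `records` table and its `id`/`kind` columns are all hypothetical names for illustration, and this assumes a plain SQL store rather than whatever Atlassian actually runs. The preview and the delete share the same WHERE clause, and the delete only goes ahead if the preview matches the ids you expected.

```python
import sqlite3


def guarded_delete(conn, table, where, params, expected_ids):
    """Preview with SELECT, then DELETE with the same WHERE only if the preview matches.

    `table` and `where` are assumed to be trusted constants written by the operator,
    never user input; only `params` carries values.
    """
    cur = conn.cursor()
    # Preview: exactly the same WHERE clause as the intended DELETE.
    preview = {row[0] for row in cur.execute(
        f"SELECT id FROM {table} WHERE {where}", params)}
    if preview != set(expected_ids):
        raise ValueError(
            f"preview {sorted(preview)} does not match expected {sorted(expected_ids)}")

    cur.execute(f"DELETE FROM {table} WHERE {where}", params)
    if cur.rowcount != len(preview):
        # Something changed between the preview and the delete: undo it.
        conn.rollback()
        raise RuntimeError("row count changed between SELECT and DELETE")
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, kind TEXT)")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [(1, "legacy"), (2, "site"), (3, "legacy")])
conn.commit()
deleted = guarded_delete(conn, "records", "kind = ?", ("legacy",), expected_ids=[1, 3])
```

Most NoSQL stores let you run the equivalent query before the delete as well; the point is the habit, not the dialect.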
If that was my cloud thing I'd ensure that any wholesale deletes go through a multistage process:
1. Get a list of everything that will be affected by the delete process.
2. Determine a process to undo what you're about to do.
3. Mark each of those entries/files with a will-be-deleted-on-x-date flag.
4. See what is actually still using those entries/files; if they are marked for deletion, no user processes should be accessing them.
5. Remove access to those identified entries/files and see what breaks.
6. If no one has complained in 4 weeks, fully remove those entries/files.
It's expensive, time-consuming and no one will know when it goes really well, which is actually the point.
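A rough sketch of that staged delete, with an in-memory dict standing in for real storage. The `Store` and `Entry` classes and every method name are invented for illustration, and a real system would persist the flags and access counts rather than hold them in memory.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

GRACE = timedelta(weeks=4)  # the four-week wait from the list above


@dataclass
class Entry:
    name: str
    delete_on: Optional[date] = None  # the will-be-deleted-on-x-date flag
    accessible: bool = True
    hits_after_flag: int = 0          # accesses seen after flagging


class Store:
    def __init__(self):
        self.entries = {}

    def add(self, name):
        self.entries[name] = Entry(name)

    def plan(self, predicate):
        """List everything the delete would affect, without changing anything."""
        return sorted(n for n, e in self.entries.items() if predicate(e))

    def flag(self, names, today):
        """Mark entries with their deletion date; `restore` is the undo path."""
        for n in names:
            self.entries[n].delete_on = today + GRACE

    def read(self, name):
        """Serve a read, recording anything that still touches a flagged entry."""
        e = self.entries[name]
        if e.delete_on is not None:
            e.hits_after_flag += 1
        if not e.accessible:
            raise PermissionError(f"{name} is scheduled for deletion")
        return e

    def revoke(self, names):
        """Remove access so that anything still depending on the entries breaks loudly."""
        for n in names:
            self.entries[n].accessible = False

    def restore(self, name):
        """Undo: clear the flag and give access back."""
        e = self.entries[name]
        e.delete_on, e.accessible, e.hits_after_flag = None, True, 0

    def purge(self, today):
        """Fully remove only entries whose date has passed and that nobody touched."""
        doomed = sorted(n for n, e in self.entries.items()
                        if e.delete_on is not None
                        and today >= e.delete_on
                        and e.hits_after_flag == 0)
        for n in doomed:
            del self.entries[n]
        return doomed
```

Anything that was read after being flagged survives the purge and needs a human to look at it, which is the slow and boring part that nobody notices when it works.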
You want to make absolutely certain mistakes won't happen, so you over-engineer things. That's why people buy cloud solutions: they expect the careful dedication to ensuring things go right is being done by the cloud provider and included in whatever the fee is. The cloud provider benefits from their scale by having teams manage all that boring stuff, instead of each customer having all the necessary teams to do that individually.
"they expect the careful dedication to ensuring things go right is being done by the cloud provider"
Give me a moment until I have finished laughing, sorry. The big not-so-secret of these cloudy providers is that they have the same bookkeepers that moved your own company to the cloud in the first place, so I'm afraid you may have to tone down your expectations to "almost good enough" level, because that's what you're going to get. After all, the fog surrounding cloudy things ensures you have zero insight into how the vapour is actually produced.
They too have to satisfy shareholders they have not left a cent/penny/dime on the table.
This is why I'm holding onto my Server instance of Jira as long as I can. At this point it's either selling Data Center to the PHB or finding another platform. As far as I'm concerned, they're pushing a less capable, less available product to try and make the execs more money.
These kinds of screw-ups in the cloud are why I won't trust core services to a cloud provider without multiple layers of redundancy I control. Inevitably someone will screw up and I'll be the one paying for their "learning experience".