Atlassian comes clean on what data-deleting script behind outage actually did

Atlassian has published an account of what went wrong at the company to make the data of 400 customers vanish in a puff of cloudy vapor. And goodness, it makes for knuckle-chewing reading. The restoration of customer data is still ongoing. Atlassian CTO Sri Viswanath wrote that approximately 45 percent of those afflicted had …

  1. EvilGardenGnome
    Pint

    Oof

    This sounds like a series of horrible, dumb, but honest mistakes. Also, I have to commend them on their honesty and directness; most would obfuscate, but this seems like a clear request for forgiveness.

    Icon for all involved, but the victims and people currently fixing the problem get first dibs.

    1. DS999 Silver badge
      Mushroom

      Re: Oof

      An "honest mistake"? Having automated deletion scripts that don't verify the heck out of things, and don't require some sort of special mode to delete entire sites?

      That's not a mistake, that's incompetence.

      1. Yet Another Anonymous coward Silver badge

        Re: Oof

        All mistakes are incompetence - that's rather the definition of competence.

        But for a new deletion script on a hosted environment, you would probably have wanted it to have a "show me what you would do but don't do it" mode.
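
        Something like this, as a minimal sketch (the flag and function names are made up for illustration, not Atlassian's actual tooling):

          import argparse

          def delete_app(app_id: str) -> None:
              # Placeholder for the real deletion call.
              print(f"DELETED {app_id}")

          def main() -> None:
              parser = argparse.ArgumentParser(description="Bulk app deletion (illustrative)")
              parser.add_argument("ids", nargs="+", help="IDs to delete")
              parser.add_argument("--dry-run", action="store_true",
                                  help="show what would be deleted, but delete nothing")
              args = parser.parse_args()
              for app_id in args.ids:
                  if args.dry_run:
                      print(f"[dry-run] would delete {app_id}")
                  else:
                      delete_app(app_id)

          if __name__ == "__main__":
              main()

        Run it with --dry-run first, diff the output against what you expected, and only then run it for real.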

      2. Pascal Monett Silver badge

        There was a special mode.

        The mistake is to have used it.

        RTFA

        1. DS999 Silver badge

          Maybe YOU need to read the article

          The "special mode" was for immediate deletion rather than marking for later deletion, and had nothing to do with deleting entire sites which is what I suggested needed a special mode.

          1. matjaggard

            Re: Maybe YOU need to read the article

            I don't think legal compliance deletion would allow for "special modes" whatever that is. I suspect a full client data deletion is actually quite common to deal with GDPR requests and the like.

      3. ecofeco Silver badge

        Re: Oof

        Incompetence indeed.

        The downvotes are also very disappointing.

      4. An_Old_Dog Silver badge

        meaningless-to-humans GUIDs

        I'm thinking the IDs were GUIDs, which are context-free (same ID format for programs, datasets, sites, directories, user objects, etc.) and are meaningless to humans.

        So a human can't tell they were provided a wrong GUID by just looking at it. That's a safety-check which has been eliminated.

      5. Terje
        Mushroom

        Re: Oof

        I think that the main issue here is that no one seems to have come to the conclusion that having the (supposedly more commonly used) safe and gentle script and the nuke from orbit option in the same script makes it not a question of if but when something goes horribly wrong!

    2. aki009

      Re: Oof

      Assuming that this story is what really happened.

      Given the time to come clean on this, it seems to me that this is more likely to be a fairy tale put together to make something far worse look acceptable. Why else would they not have told the world about it within hours or days of the incident?

      There wasn't a whole lot of trust to begin with, and now at least our company will move to the Atlassian cloud over my dead body.

  2. Doctor Syntax Silver badge

    Measure twice, cut once.

    1. TimMaher Silver badge
      Thumb Up

      Cut once

      ...and always cut away from yourself.

      1. Gene Cash Silver badge

        Re: Cut once

        "Cut toward your chum, not toward your thumb!"

        1. Psmo
          Thumb Up

          Re: Cut once

          Except my chum is the one with the chequebook in their name, and they need thumbs for writing....

          1. John Brown (no body) Silver badge
            Coat

            Re: Cut once

            You still write cheques? How very last century! :-)

            1. Robert Helpmann??
              Happy

              Re: Cut once

              You still write cheques? How very last century! :-)

              Yes. On a cow's back.

            2. Anonymous Coward
              Anonymous Coward

              Re: Cut once

              Having asked for some historical family records from the MOD, you’d be surprised. To be fair, they accept postal orders as well as cheques.

    2. tip pc Silver badge
      Facepalm

      Measure twice, cut once.

      Great advice until you realise that the detail of what to measure, from your colleague, was wrong, so you've accurately cut a wrong measurement.

    3. Blofeld's Cat

      "Measure twice, cut once."

      Measure with a micrometer, mark with chalk, cut with a hatchet.

      1. heyrick Silver badge
        FAIL

        Pretty much sums up the one (and only) time I ever did sewing.

        Kind of hard to appreciate the sorts of things girls make look easy until you try it yourself and it goes horribly wrong. But, then, maybe it was a mistake not to wait for the scissors to be available but instead to try to make do with a large kitchen knife...

        Icon, because I can be honest with myself, I sucked at that.

    4. NXM Silver badge

      I do hardware, my business partner does software.

      He's not allowed any sharp objects in case he hurts himself.

    5. aki009

      Cut cut cut, and maybe measure

      Why measure when one can just cut cut cut cut and throw it in a shredder?

      Henceforth shredders in our office will be known as Atlassian Clouds.

    6. Anonymous Coward
      Anonymous Coward

      A friend was ordering work surfaces for a new kitchen. One had to have a small corner rectangular cut out at one end. When they arrived he found that he had measured the cut-out incorrectly - and the hole was 100mm (4") too long. So he re-measured and ordered a replacement. When it arrived he found he had made the same mistake again. That was a feature until the next kitchen rebuild several decades later.

      A stained glass artist made me two panels for my front door. Came the day she arrived with the two large constructions of glass and lead. Then she found they were 100mm too wide. She went away and soon came back with them trimmed on one side. The interesting thing is that her design looked more aesthetic by leaving the trimmed portion to the imagination to complete. Cramming the whole design into less width would not look as good.

  3. Anonymous Coward
    Anonymous Coward

    immediately thought of

    Precisely. The circuits that cannot be cut are cut automatically in response to a terrorist incident. You asked for miracles, Theo. I give you the F...B...I...

  4. wolfetone Silver badge

    GDPR

    Is there nothing it can't screw up?

    To be fair, I think when I heard of all of this I was expecting something monumentally stupid. Like "rm -rf" in the wrong folder of the server and no backups performed. This though, it's fairly honest and happens to all of us at some point.

    1. stiine Silver badge
      Mushroom

      Re: GDPR

      It wasn't rm -rf, it was shred /dev/disk/customers

    2. OhForF' Silver badge
      WTF?

      Re: GDPR

      I agree with the rest of your post but the title and first sentence make me wonder.

      I don't see where GDPR says you can't mark an app in a cloud for deletion and keep it around for some time when it becomes obsolete.

      It is a bit different when you have to get rid of personally identifiable data, but that's not what they should have deleted, so why do you blame this on GDPR?

      1. Jon 37

        Re: GDPR

        It's not just GDPR. There are a bunch of laws that might require data to be really deleted.

        Without those laws, the sysadmins could always do "mark as deleted", which can be easily undone when someone makes a mistake. Because of those laws, they had to add a "really delete this now" mode to the script. And when someone made a mistake and had used that option, there was no way to get the data back except restoring from backup.

        1. Crypto Monad Silver badge

          Re: GDPR

          May I propose a solution:

          - Script 1 marks listed items for deletion

          - Script 2 permanently deletes listed items, but only if they have already been marked for deletion (i.e. selective "empty trash")

          If you want permanent deletion, you have to run script 1 followed by script 2. Preferably with 24 hours in between.
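
          A rough sketch of that split, assuming a shared register of "marked" IDs (the file name, grace period and print statements are all invented for illustration):

            import json, time
            from pathlib import Path

            MARK_FILE = Path("pending_deletions.json")   # shared "trash" register
            GRACE_SECONDS = 24 * 60 * 60                 # 24 hours between mark and purge

            def mark_for_deletion(ids: list[str]) -> None:
                """Script 1: record the IDs and when they were marked; delete nothing."""
                pending = json.loads(MARK_FILE.read_text()) if MARK_FILE.exists() else {}
                now = time.time()
                for item in ids:
                    pending.setdefault(item, now)
                MARK_FILE.write_text(json.dumps(pending))

            def purge(ids: list[str]) -> None:
                """Script 2: permanently delete only IDs already marked and past the grace period."""
                pending = json.loads(MARK_FILE.read_text()) if MARK_FILE.exists() else {}
                now = time.time()
                for item in ids:
                    marked_at = pending.get(item)
                    if marked_at is None:
                        print(f"refusing to purge {item}: never marked for deletion")
                    elif now - marked_at < GRACE_SECONDS:
                        print(f"refusing to purge {item}: grace period not over")
                    else:
                        print(f"permanently deleting {item}")   # real deletion call goes here

          Anything that was never marked, or was marked too recently, simply refuses to go - which is exactly the class of mistake that bit Atlassian.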

      2. Anonymous Coward
        Anonymous Coward

        Re: GDPR

        Customer data is the *first* thing you have to get rid of. That's almost the entire point of GDPR.

        Scrubbing PII is a secondary concern that affects log retention etc etc.

    3. anticlimber

      Re: GDPR

      I don't know why you were downvoted for this. Large providers now have to balance what they were doing before (keeping backups, in case customers needed them) vs complying with GDPR...which is well-meaning but has deeply eroded the durability of customer data, industry wide.

      Particularly in complex systems, "delete user data U within N days" ends up having N days divided, serially, across multiple systems. An underlying system that could safely and easily restore user data from any deletion in the last 3 months (permanent delete after 3 months) now has a permanent deletion window of, say, 1 month...so the next system up the stack can have its deletion budget.

      Cryptoshredding, you say! Well, lots of smart lawyers have studied that and the results are inconclusive at best.

      Have a look at the definitions for RPO and RTO. The missing thing is -- how long are these guarantees good for after data is accidentally deleted? A lot less time, after GDPR.

  5. Headley_Grange Silver badge

    Every Cloud Problem has Silver Lining?

    I don't know if it's related or not, but since these Atlassian problems I've not been getting my daily junk mails from various Atlassian domains.

    1. Flocke Kroes Silver badge

      Re: Every Cloud Problem has Silver Lining?

      Except for the mushroom shaped ones which have a lining of iodine 133 and strontium 91.

      1. DJohnson
        Happy

        Re: Every Cloud Problem has Silver Lining?

        Isn't Strontium-90 described as a "silvery metal"?

        1. Pascal Monett Silver badge
          Coat

          Ooh, shiny !

          1. Anonymous Coward
            Anonymous Coward

            That's a good marketing theme when trying to sell cows' milk after nuclear fallout contamination of grass.

            A bit like the breakfast cereal advert that showed glowing children. It was quickly parodied by adding mention of the Windscale reactor leak.

            1. Anonymous Coward
              Anonymous Coward

              And don't forget to brush your teeth with Doramad Radioactive Toothpaste for a healthy glow, after you've had your ReadyBrek…

  6. VoiceOfTruth Silver badge

    While I appreciate the honesty...

    -> The bad news is that while the company can restore all customers ... there is no automated system to restore "a large subset" of customers into an existing environment, meaning data has to be laboriously pieced together.

    That doesn't strike me as very good at all. It seems more like a reconstruction from whatever is available rather than a backup/restore. Be warned: if it can happen once it can happen again. So Atlassian should design a better recovery system.

    1. Anonymous Coward
      Anonymous Coward

      Re: While I appreciate the honesty...

      And... "The company is moving to a more automated process to speed [restoration] up"

      well 'automation' caused the problem in the first place, so that should go well...

    2. stiine Silver badge

      Re: While I appreciate the honesty...

      Don't you mean a better backup system?

      1. VoiceOfTruth Silver badge

        Re: While I appreciate the honesty...

        No. I mean a better restore system. Backups are useless if you can't restore from them.

        1. Doctor Syntax Silver badge

          Re: While I appreciate the honesty...

          A good place to start would be to build the option to restore into the automated system. Move the data to a reserve location and only delete it a few days afterwards when it's clear there were no issues. Pretty well every desktop system and every email client has that; it's there for a reason. No, the reason isn't to archive the emails once you've read them.

        2. Falmari Silver badge

          Re: While I appreciate the honesty...

          @VoiceOfTruth "No. I mean a better restore system"

          One that can restore at the individual customer level.

          "The bad news is that while the company can restore all customers into a new environment or roll back individual customers that accidentally delete their own data, there is no automated system to restore "a large subset" of customers into an existing environment, meaning data has to be laboriously pieced together"

          https://forums.theregister.com/forum/all/2022/04/11/atlassian_outage_backups/#c_4443722

          1. Anonymous Custard
            Mushroom

            Re: While I appreciate the honesty...

            This sounds like one of those broken arrow situations where you're not sure whether it would be more reassuring that they had something ready to fix up a major issue like this, or to be worried that they thought it likely enough to happen that they needed such a preparation in place...

            Icon for where the original dilemma came from.

            1. Anonymous Coward
              Anonymous Coward

              Re: While I appreciate the honesty...

              AKA the life-saver "here's one I made earlier".

  7. Arthur the cat Silver badge

    The script was executed with the wrong execution mode and the wrong list of IDs

    It's the wrong trousers Gromit! And they've gone wrong!

    1. Anonymous Coward
      Anonymous Coward

      Re: The script was executed with the wrong execution mode and the wrong list of IDs

      "It's the wrong trousers Gromit! And they've gone wrong!"

      Sounds like a definite case of "Out of cheese error"…

  8. ChipsforBreakfast

    Sh*t Happens

    No matter how many safeguards you build, checks you put in place or precautions you take the fuckup fairy will come calling sooner or later. The more systems you manage, the sooner she's likely to get to you - there is no escape.

    That's why we have things like backup strategies and RTO's, so that when she does visit it's not a company-ending event. At least they've been honest about what happened and how long it's going to take to put it right. No marketing spin. No fluff. Just an honest 'we screwed up, sorry'. They should be commended for that at least.

    Their lackluster RTO on the other hand isn't so easily forgiven....

  9. Anonymous Coward
    Anonymous Coward

    Erosion of trust

    "We know that incidents like this can erode trust"

    They can, but honesty and transparency in owning up to them can offset that. This incident has not inspired confidence in Atlassian's internal processes nor in the competence and experience of the staff responsible for it.

    However, there is no doubt at all that those same people will have learned some valuable hard lessons about change control, review, data validation, the value of staging environments, and enumerating all the recovery scenarios when designing backup and recovery systems. The people around them will have learned those same lessons in a somewhat easier manner.

    And the company's leadership are demonstrating the right values in their response, which when combined with the improvements likely to be made and whatever restitution or workarounds they're offering would probably be enough for most customers. If something really has to be trustworthy, it's already on-prem, and as a customer in this situation you ask yourself whether you'll be any better off with a competing service. My guess here would be probably not, unless I were already inclined that way for other reasons.

    It's much easier to accept a series of awful mistakes unlikely to be repeated than to accept dishonesty, evasion, panic, and refusal to learn. If you want to erode trust, follow the Okta model instead.

    1. Anonymous Coward
      Anonymous Coward

      Re: Erosion of trust

      There's another vector of trust here:

      By publishing an analysis of what went wrong, they made it clear they have worked out what happened, and so can start work on preventing a repeat.

      1. Jellied Eel Silver badge

        Re: Ch11 calling

        Depends on the customers. Fair play for explaining the cause, but the cause was essentially negligence resulting in substantial damage to the customers. If those customers litigate, Atlassian will probably have to pay more compensation than just service credits.

        I think it also neatly demonstrates the problem with cloudybollocks and especially enforcing cloudy SaaS. Businesses are forced to rely on the supplier, or try to find alternatives to Atlassian that they can manage themselves.

        1. Anonymous Coward
          Anonymous Coward

          Re: Ch11 calling

          Atlassian is an Australian company so I don't believe Chapter 11 applies. Maybe there's a US subsidiary that could go bankrupt. As for what customers could get out of them in a lawsuit, that would depend on their contract terms. I'm reminded of the standard clause in many parts of the world in residential Internet service contracts "No refunds, no warranty, for entertainment purposes only"; they're literally allowed to keep your money and never provide any service at all. Business contracts are typically only a little better than that, limiting the service provider's liability to formula-based service credits. Maybe a few giant customers negotiated better terms, I don't know.

          "I think it also neatly demonstrates the problem with cloudybollocks and especially enforcing cloudy SaaS"

          What does "enforcing cloudy SaaS" mean? Dictates from the customer CEO to outsource everything? Regardless, outsourcing core business-critical services is foolish. No one cares about your business like you do, and the SLAs and other contractual terms are never strong enough to compel competence. If you've decided to outsource something, it should be because you need temporary (with a known, committed, FIXED end date!), low-volume, or low-value applications that can't justify the capital cost of building out an owned solution. You must also have the discipline to reassess those attributes periodically and bring things in house if they start to matter, not only for reliability reasons but because the total cost of outsourced service is typically between 2x and 5x what you'd pay to run it yourself. Most companies' managers lack both the awareness and the discipline to do this successfully; only a CEO dumb enough to dictate all-SaaS would also be dumb enough to hire them.

          Expecting reliable service from an outsourced provider is silly; their basic mission is to get you to fire your IT staff and close your data centres, then hold you over a barrel while providing minimally acceptable service at the lowest possible cost to themselves. You should be thrilled with 3 9s and satisfied with 2; if 98% availability (or data integrity) isn't good enough, don't outsource. The best you can hope for is transparency when things go wrong and, within the limits imposed by cost-reduction rules, an effort to learn from mistakes. Trust is a relative thing and for outsourcing the bar is set quite low; somehow most still fail to clear it.

          1. Jellied Eel Silver badge

            Re: Ch11 calling

            I've not looked too closely at how it's structured, but it's common to keep a handy Delaware LLC for tax purposes, and for access to Ch.11 protection. That tends to be more survivable than UK administration and then liquidation once the administrators have extracted all the fees & expenses they can.

            As for contract protection, IANAL, but I don't think a contract automatically overrules torts like negligence.

            On enforcing SaaS, a previous article mentioned Atlassian had been changing their product line and removing stand-alone server licenses to force migration to a cloud or subscription model. I loathe that business practice. On a server, I can take steps to mitigate risks; in a cloud model, my business is at the mercy of the service provider. That's a bit foolish when it's a business-critical service.

            It's also unnecessary, eg if a cloud provider's just spinning up an instance in a VM, there's no real reason why I shouldn't be able to do that on my own hardware. Which is where litigation might help, so encouraging vendors to do a better job. Or just make damages high enough that risks become uninsurable.

            1. Fred Flintstone Gold badge

              Re: Ch11 calling

              I agree 100%.

              As an aside, I would also commend you for the term cloudybollocks :).

  10. Anonymous Coward
    Anonymous Coward

    Yeah, their DR window wasn't accurately or clearly communicated to their customers

    Not sure what was in their SLA, but clearly the whole architecture of their cloud is flawed if customers expected to be able to use the services again in a timely manner. This incident wiped out a small slice of their customer base. What if it wiped the lot? If 4-5 days for 60 tenants and 45 days for 400 is the window, how long to do a bare metal restore of services? What if they had a datacenter fire, flood, or other site disrupting event?

    Is waiting a year for your data to be restored reasonable? Who decides what the queue is? For businesses whose operations are tightly integrated with their software that could be a death sentence.

    Part of provisioning a cloud service at this point should be building a reasonable recovery window, communicating that to your customers, or working with them to build and test their own continuity of business plans to handle failing over to a locally hosted box or an alternate service. Since SaaS offerings may not be as easily swappable as compute workloads, the provider needs to be better prepared to react to issues like this.

    1. Malcolm Weir Silver badge

      Re: Yeah, their DR window wasn't accurately or clearly communicated to their customers

      I think the issue is that DR backups are not a substitute for "archival"-type backups. The problem appears to be that since Tenant A was not impacted (because they didn't have the obsoleted tool) but Tenant B was (because they did), you can't do a DR recovery to make Tenant B "whole" because you'd wipe out everything that's been chugging happily along with Tenant A over the past 9 days.

      Personally, I reckon performing a full DR recovery on an isolated cluster and then transferring all the "Tenant Bs" (which you know, because you have the list of "zap these" IDs) is probably smarter than doing some kind of selective restore of the DR media, but this approach requires having a sufficient pool of hardware available, which is not always the case, because DR scenarios often work on the basis that everything is available for the recovery, with no accommodation for the issue that some tenants are still working fine...

      1. Androgynous Cow Herd

        the issue is that DR backups are not a substitute for "archival"-type backups.

        This - precisely - BACKUPS are NOT a DR/BC plan. They are "Part of a balanced breakfast" but the RTO for this kerfuffle is completely untenable.

        Archive data has no place in the DR/BC runbook. Full Stop. Compliance issues happen when they co-mingle.

        The worst joke in all of IT is "The backups are easy, but the restore is a bit tricky".... And there are so many solutions in place that can do snapshot cloning etc to reduce complexity and RTO for DR/BC runbooks.

        You archive data you hope to never see again...you build a DR play for data you must be able to see again.

    2. flibble

      Re: Yeah, their DR window wasn't accurately or clearly communicated to their customers

      "What if it wiped the lot?"

      They have an automated DR recovery for full recovery, so that would have gone much better. The problem here was the need to restore only some of the data, which they didn't have any automation for.

  11. Filippo Silver badge

    Shit happens. It's what you do when shit happens that counts.

  12. aaaaaargh

    Just asking, all honestly admitted mistakes aside, why does a single team even have the power to delete so many customer installations? Should the users and their installations not be separated into segments or something?

  13. Anonymous Coward
    Anonymous Coward

    Redmine is the better product anyhow.

  14. benderama

    well

    The initial deletion script did exactly as it was supposed to do. The error, as always, was human. Humans provided the incorrect IDs for deletion, humans failed to prepare a feasible restore process.

    They should take from this the lesson to consider as many what-ifs as possible instead of only the expected outcome.

    1. Doctor Syntax Silver badge

      Re: well

      They should take from this the fact that paranoia is a basic requirement in system administration.

      1. Will Godfrey Silver badge

        Re: well

        Paranoia is a basic requirement for any design that comes into contact with wetware.

      2. Brad16800

        Re: well

        Amen. Bothered to sign in on my mobile just to update. Sys admins are the gatekeepers for good reason.

  15. Anonymous Coward
    Anonymous Coward

    I bet someone would have caught the mistake had they been given a list of human-readable customer IDs and app names (e.g. acme.com and app01) instead of UUIDs (e.g. 856fd738-bc55-11ec-8422-0242ac120002 and 7fd2fb10-bc56-11ec-8422-0242ac120002).
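
    A minimal sketch of the kind of sanity check that prefixed, human-readable IDs make possible (the CUST/APP prefixes are an assumption for illustration, not Atlassian's actual scheme):

      def check_ids_are_apps(ids: list[str]) -> None:
          """Refuse to run an app-deletion batch if any ID is not an app ID."""
          bad = [i for i in ids if not i.startswith("APP-")]
          if bad:
              raise SystemExit(f"Aborting: these IDs are not app IDs: {bad}")

      # Example: a whole customer site sneaks into an app-deletion list.
      check_ids_are_apps(["APP-acme-app01", "CUST-acme.com"])   # aborts before deleting anything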

    1. Warm Braw

      Given they're Australian, I imagine it's a case of TITSDOWN - Total Inability To Successfully Delete Over Wrong Nomenclature.

    2. Anonymous Coward
      Joke

      Oi! That's my Jira account you've just listed there. Use yer own!

    3. Down not across

      Or the deletion script could have looked up the ID mappings and bailed out, or warned "Are you really sure you want to do this?", if the ID was for a whole site rather than a customer app.

      Likewise perhaps the mappings should include a GDPR compliance flag, and the "GDPR mode" would not engage if the flag for that ID was not set.
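
      As a sketch of that lookup-and-refuse behaviour (the registry contents and flag names are invented for illustration, reusing the example UUIDs from above):

        # Illustrative mapping of IDs to what they actually refer to.
        ID_REGISTRY = {
            "856fd738-bc55-11ec-8422-0242ac120002": {"kind": "site", "gdpr_delete_ok": False},
            "7fd2fb10-bc56-11ec-8422-0242ac120002": {"kind": "app",  "gdpr_delete_ok": True},
        }

        def confirm_deletion(item_id: str, gdpr_mode: bool) -> bool:
            entry = ID_REGISTRY.get(item_id)
            if entry is None:
                print(f"{item_id}: unknown ID, refusing")
                return False
            if entry["kind"] == "site":
                # Deleting a whole site gets an explicit, human confirmation.
                answer = input(f"{item_id} is an ENTIRE SITE. Really delete it? (yes/no) ")
                if answer != "yes":
                    return False
            if gdpr_mode and not entry["gdpr_delete_ok"]:
                print(f"{item_id}: GDPR permanent-delete flag not set, refusing")
                return False
            return True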

  16. Big_C

    They lost a lot of trust - and rightly so.

    While the CEO gave a nice explanation and most likely anybody in IT can sort of relate to that blunder, it took them way too long to come clean.

    And that "hundreds of engineers" working around the clock statement was imho bs.

    I guess they make restore runs from a large full site backup but have capacity issues, so they can only recreate a limited number of sites each time.

    And the slow speed points to tape systems...? Let's hope it does not break and that the current backups do not need to be stopped during the recovery.

    1. yoganmahew

      Tape is quick if it is distinct. Like other commentators, I suspect they had a backup of everything written horizontally (maybe even at disk level?) that they're now trying to extract little vertical slices from. If you backed up everything in a customer instance logically to tape, it would be trivial to restore.

  17. cjcox

    Atlassian is making the official report available

    You can get the official report if you install the Jira Oopsie Plugin, which is free for up to 10 users, then a mere $10/mo. per user greater than 10. There's also a Premium level that removes some important redactions at $20/mo. per user. The ability to go beyond page 1 of any report is supplied through the Atlassian Marketplace with plugins such as Money Money Money; you will have to check the marketplace for current pricing.

  18. sean.fr

    Scripts

    Is there anyone in IT who has not been burned by a bad script?

    Anything one-off has a high chance of being buggy.

    So I test scripts on a small subset first - possibly as small as one vm/switch/customer.

    Then a bigger subset...

    This does not look like that time my script got a "," when I was expecting a "." because I didn't allow for a mix of country settings.

    This looks like no testing at all.

    1. ecofeco Silver badge

      Re: Scripts

      This. All of this.

    2. OhForF' Silver badge

      Re: Scripts

      I agree that bad scripts are an issue. I have been there myself running a script doing something like "find $LOG/ -mtime +1 | xargs rm -rf" when $LOG was not set ...

      Only this was not an instance of a script being bad/doing the wrong thing.

      As other posters have already pointed out the script did what it was supposed to do.

      They may even have tested with a single site first and a small subset of sites after that - but used the correct (application and not site) IDs in those tests and provided the wrong IDs when they went for the remaining sites in the batch.

      What the article does not state is if there is any instance that checks and approves the "deactivation request" or if that is just forwarded to another team for execution.

      This missing check for what was actually requested for deletion is where it went wrong - I don't think you can blame that on a script.

      1. ecofeco Silver badge

        Re: Scripts

        That was the OP's point. No matter what, testing the script is SOP.

        Brevity is also good.

    3. GermanSauerkraut

      Re: Scripts

      "This does not look like that time my script got a "," when I was expecting a "." because I didn't allow for a mix of country settings."

      As far as I understand, the script worked flawlessly and did exactly what it was told to do, without any issues.

      The problem was a single script used for three completely different operations. 1) to remove an app from a customer instance, 2) to mark a customer instance for deletion, and 3) to completely nuke a customer instance, with the script distinguishing between 1 and 2/3 only by the objects referenced by - most likely - completely synthetic IDs.

      While I can to some degree understand why one would combine 2 and 3, there is no justification for adding 1 to the mix. If you're using similar functionality in the background, put that into a library of some sort, but provide distinct front ends for the ops guys to use. What happened is the textbook example of why one should do that...
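
      In other words: one shared routine, but separate, single-purpose front ends, roughly like this (all names are illustrative, not Atlassian's code):

        # deletion_lib: shared back end
        def _delete_instance(instance_id: str, permanent: bool) -> None:
            action = "permanently deleting" if permanent else "marking for deletion"
            print(f"{action} {instance_id}")   # real work would happen here

        # Front end 1: remove a single app from a customer instance.
        def remove_app(app_id: str) -> None:
            print(f"removing app {app_id} from its instance")

        # Front end 2: mark a whole customer instance for later deletion.
        def deactivate_instance(instance_id: str) -> None:
            _delete_instance(instance_id, permanent=False)

        # Front end 3: the nuke option, deliberately a separate command.
        def nuke_instance(instance_id: str) -> None:
            _delete_instance(instance_id, permanent=True)

      An operator who only ever needs front end 1 never has the nuke option on the same command line.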

  19. Trotts36

    Lol

    “We know incidents like this can erode trust”

    - ahem; yeah you’re screwed. Incompetence personified

  20. Franco

    Twas a major cock up, but hearing how it happened I'm not surprised. I had a contract a while back where one of the things we did was set up Atlassian Cloud (non-profit organisation, so they got very favourable pricing, hence the choice of Atlassian by the higher-ups) and Insight was very clearly an afterthought. Jira and Confluence had lots of tools for importing data and customising it; Insight was almost entirely accessible only via a REST API if you wanted to do anything in bulk, had pretty much no documentation, and support knew almost nothing about it.

    They added a web-based bulk import tool for assets a few months after I'd written a PowerShell script to convert everything from a CSV to a JSON and then import it via the REST API.

  21. I Am Spartacus

    I've said it before...

    Having your apps on the cloud is really having them on someone else's computer. You trust that person to do the right thing, all day, every day. Exporting your business to the cloud does save you an IT department, but it does not relieve you of risk: business, technical or human.

    I use Atlassian on the cloud. The biggest issue is that we can't back it up ourselves. There is no "Export and download" function.

    1. Jellied Eel Silver badge

      Re: I've said it before...

      Of course you can't do local backups. The objective is to maximise revenue, not availability. Backups will be 20c per KB. Restoration $5,995 plus 50c per KB.

      I try to talk clients out of stuff like that. If they have business critical data, they need to make sure they can access it. I've had numerous jobs where apps like SAP were used to control production in sites connected via cheap xDSL. They go down, business stops. Money was saved, briefly.

      Dedicated infrastructure means a CTO might need a bigger budget, but they should be more directly aligned to keeping the business alive. IT staff only have 1 customer (ish) vs being in 1/400th of a ticket queue awaiting restoration.

      It's bizarre to me that a lot of senior execs don't get this, and keep drinking the cloudy kool aid. OK, the business might save $500k a year in payroll, but if a day's outage costs $5m, it can be a bad investment.

      Plus there's other FUN! Telecomms seemed to be a Remedy (oh ARS!) shop. Often used for trouble tickets and tasks. Want me to do something? Send me a task. Manglement can then run reports to see what's going on. I was in one job where the system went down, and had to be restored. Which then meant previously closed tickets re-opening, tickets raised not being in backups, and it took a couple of weeks to get everything back in sync with reality.

      And if you've outsourced your IT to the cloud, you've got a lot less capacity to manage the fallout from a major cloud outage.

    2. lostinspace

      Re: I've said it before...

      There is an export function. There is also a REST API endpoint for it. We've scripted this to back our Jira instance up nightly. This is more in case we make a massive cockup like bulk deleting all tickets, rather than expecting to be doing Atlassian's job for them, though.
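
      For anyone wanting to do similar, a rough sketch of the nightly trigger; the backup endpoint path and JSON body are from memory and may have changed, so treat them as assumptions and check Atlassian's current REST documentation:

        import os
        import requests

        BASE_URL = "https://your-site.atlassian.net"   # placeholder site
        AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_API_TOKEN"])

        def trigger_backup() -> None:
            # Assumed endpoint for kicking off a cloud site export.
            resp = requests.post(
                f"{BASE_URL}/rest/backup/1/export/runbackup",
                json={"cbAttachments": "true", "exportToCloud": "true"},
                auth=AUTH,
                timeout=60,
            )
            resp.raise_for_status()
            print("backup requested:", resp.text)

        if __name__ == "__main__":
            trigger_backup()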

  22. amacater
    Pint

    Lessons learned: backups, test DR, institute a two person rule - everything no-one does

    This is a "We'll keep some of our data in our own data centre, thanks" moment for anyone looking to move to cloud offerings from Atlassian.

    This is an "Instigate a two-person rule for major changes - and test, test and test again" scenario for anyone vaguely competent, as a lesson learned.

    This is "Keep your instances of production, dev and reference on the same version" scenario for anyone using Atlassian data centre versions at the moment - and probably a "distrust the sales droid from Atlassian who will try to upsell you cloud".

    This is a "Run, don't walk, from a major purchase or future deployment of Atlassian" for many considering this.

    Backups and DR don't solve everything - a pint for the poor Atlassian folk saddled with unfsck'ing the mess in spite of everything.

  23. Pirate Dave Silver badge

    Fresh reminder that "cloud" doesn't mean it's properly backed up or replicated or will be available in the event of an incident on the hosting-provider's end.

    I got bit by this a couple of weeks ago - had a CentOS virtual server hosted at vultr.com to run my two websites. They had a "major hardware issue" on the storage hosting my server, and the vm for my server is gone. Poof. Apparently they only do backups/replication if you pay more. Otherwise, server's just gone. "Sorry, have 2 free months of hosting for your trouble. Goodbye."

    My error was that I didn't have my own backup, because, well ..."Cloud Hosting!" But turns out it was really just a relatively fragile, unprotected single box hosting it.

    1. Anonymous Coward
      Anonymous Coward

      Was your error not more the assumption that they had a backup for the lowest tariffs if they screwed up?

      I have been looking at some contracts and T&Cs of late, and the crap that they sometimes hide in the fine print is horrific.

      1. Pirate Dave Silver badge
        Pirate

        I guess my root error was assuming they would try to provide the same level of "always keep it going" that I do at work. Backups, replicas, hot copies, etc, all in an effort to not lose stuff that my users need. I mean, they're a "hosting" company, that's the kind of stuff they should do in the background as a basic part of their hosting, IMHO.

  24. Slow Joe Crow
    Mushroom

    This will affect purchase decisions

    I have one client using one Atlassian product, Bitbucket, which they are forcing into the cloud. We have already been unable to remediate Log4J because Atlassian won't sell a license so we can install a patched version, so I will recommend dropping it in favor of vanilla Git. We have a lot of stuff in the cloud and we have been selling a lot of backup products for Microsoft 365 and G Suite lately.

  25. Anonymous Coward
    Anonymous Coward

    Quite a lot to unpack really, just off the top of my head. Firstly, this sounds like something that should be done with a tool, not random scripts: one that has had testing, can only grab relevant IDs and ideally has rollback built in. The second is that even if the above wasn't going to happen because the company was cheap, they should have meticulously tested the script in a prepped environment with the exact IDs and commands first. Thirdly, there seems to have been a total lack of oversight; two or three people should be checking these kinds of big changes. Did the script and IDs not even get checked by someone else? No approval process?

    Whilst I appreciate their honesty, they have severe management and process deficiencies to rectify once they're back up.

    1. Anonymous Coward
      Anonymous Coward

      I don't think using scripts is a problem. The take-aways for me are:

      1) Lack of roll-back. You should always assume a script will go wrong at any point during its execution sooner or later - if only because of the chance of random hardware failure. So there should be a mechanism to roll back from the point where it went wrong, as well as to roll back from a successful conclusion.

      Additionally, I think I would have pushed for "delete" puts accounts offline and a second script does the GDPR full nuke thing which could be run later after customers have had a chance to complain.

      2) Use IDs that are immediately obvious as to what they refer to. I'm making an assumption here but, as has been mentioned above, if the IDs were prefixed with "CUST" for entire customer accounts and "APP" for apps deployed then a Mk.1 eyeball would have easily detected the impending disaster.

  26. Anonymous Coward
    Anonymous Coward

    Been there seen that

    Having been involved in a situation where a principal engineer decided on his own to run a non reviewed cleanup script in max parallelism across the entirety of AWS on the weekend, creating a day long service outage in the US ... yeah. Started my Sunday morning hung over on a 12 hour technical conference call.

    People, please, just don't do that.

    Obviously he kept his job. Wasn't even chastised.
