Atlassian comes clean on what data-deleting script behind outage actually did

Atlassian has published an account of what went wrong at the company to make the data of 400 customers vanish in a puff of cloudy vapor. And goodness, it makes for knuckle-chewing reading. The restoration of customer data is still ongoing. Atlassian CTO Sri Viswanath wrote that approximately 45 percent of those afflicted had …

  1. EvilGardenGnome
    Pint

    Oof

    This sounds like a series of horrible, dumb, but honest mistakes. Also, I have to commend them on their honesty and directness; most would obfuscate, but this seems like a clear request for forgiveness.

    Icon for all involved, but the victims and people currently fixing the problem get first dibs.

    1. DS999 Silver badge
      Mushroom

      Re: Oof

      An "honest mistake"? Having automated deletion scripts that don't verify the heck out of things, and don't require some sort of special mode to delete entire sites?

      That's not a mistake, that's incompetence.

      1. Yet Another Anonymous coward Silver badge

        Re: Oof

        All mistakes are incompetence - that's rather the definition of competence.

        But for a new deletion script on a hosted environment, you would probably have wanted it to have a "show me what you would do but don't do it" mode.
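
        A minimal sketch of what such a "show me but don't do it" mode could look like, in Python. The function `delete_sites`, the in-memory `store` dict and the `dry_run` flag are all hypothetical names for illustration; nothing here reflects how Atlassian's actual script works.

```python
def delete_sites(store, ids, dry_run=True):
    """Delete the given IDs from store.

    With dry_run=True (the default) nothing is deleted; the function only
    reports what it would do. All names here are hypothetical.
    """
    planned = [i for i in ids if i in store]
    missing = [i for i in ids if i not in store]

    for i in planned:
        print("WOULD DELETE" if dry_run else "DELETING", i, store[i])
    for i in missing:
        print("UNKNOWN ID, skipping:", i)

    if not dry_run:
        for i in planned:
            del store[i]
    return planned


store = {"id-1": "app: old-plugin", "id-2": "site: acme"}
delete_sites(store, ["id-1", "id-3"])  # default is report-only; store is untouched
```

        Making the dry run the default means the destructive path has to be asked for explicitly, and the report shows what each ID resolves to before anything is removed.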

      2. Pascal Monett Silver badge

        There was a special mode.

        The mistake is to have used it.

        RTFA

        1. DS999 Silver badge

          Maybe YOU need to read the article

          The "special mode" was for immediate deletion rather than marking for later deletion, and had nothing to do with deleting entire sites which is what I suggested needed a special mode.

          1. matjaggard

            Re: Maybe YOU need to read the article

            I don't think legal compliance deletion would allow for "special modes" whatever that is. I suspect a full client data deletion is actually quite common to deal with GDPR requests and the like.

      3. ecofeco Silver badge

        Re: Oof

        Incompetence indeed.

        The downvotes are also very disappointing.

      4. An_Old_Dog Silver badge

        meaningless-to-humans GUIDs

        I'm thinking the IDs were GUIDs, which are context-free (same ID format for programs, datasets, sites, directories, user objects, etc.) and are meaningless to humans.

        So a human can't tell they were provided a wrong GUID by just looking at it. That's a safety-check which has been eliminated.

      5. Terje
        Mushroom

        Re: Oof

        I think that the main issue here is that no one seems to have come to the conclusion that having the (supposedly more commonly used) safe and gentle script and the nuke from orbit option in the same script makes it not a question of if but when something goes horribly wrong!

    2. aki009

      Re: Oof

      Assuming that this story is what really happened.

      Given the time to come clean on this, it seems to me that this is more likely to be a fairy tale put together to make something far worse look acceptable. Why else would they not have told the world about it within hours or days of the incident?

      There wasn't a whole lot of trust to begin with, and now at least our company will move to the Atlassian cloud over my dead body.

  2. Doctor Syntax Silver badge

    Measure twice, cut once.

    1. TimMaher Silver badge
      Thumb Up

      Cut once

      ...and always cut away from yourself.

      1. Gene Cash Silver badge

        Re: Cut once

        "Cut toward your chum, not toward your thumb!"

        1. Psmo
          Thumb Up

          Re: Cut once

          Except my chum is the one with chequebook in their name, and they need thumbs for writing....

          1. John Brown (no body) Silver badge
            Coat

            Re: Cut once

            You still write cheques? How very last century! :-)

            1. Robert Helpmann??
              Happy

              Re: Cut once

              "You still write cheques? How very last century! :-)"

              Yes. On a cow's back.

            2. Anonymous Coward
              Anonymous Coward

              Re: Cut once

              Having asked for some historical family records from the MOD, you’d be surprised. To be fair, they accept postal orders as well as cheques.

    2. tip pc Silver badge
      Facepalm

      Measure twice, cut once.

      Great advice until you realise that the detail of what to measure from your colleague was wrong, so you've accurately cut a wrong measurement.

    3. Blofeld's Cat

      "Measure twice, cut once."

      Measure with a micrometer, mark with chalk, cut with a hatchet.

      1. heyrick Silver badge
        FAIL

        Pretty much sums up the one (and only) time I ever did sewing.

        Kind of hard to appreciate the sorts of things girls make look easy until one tries it themselves and it goes horribly wrong. But, then, maybe it was a mistake to not wait for the scissors to be available but instead to try to make do with a large kitchen knife...

        Icon, because I can be honest with myself, I sucked at that.

    4. NXM Silver badge

      I do hardware, my business partner does software.

      He's not allowed any sharp objects in case he hurts himself.

    5. aki009

      Cut cut cut, and maybe measure

      Why measure when one can just cut cut cut cut and throw it in a shredder?

      Henceforth shredders in our office will be known as Atlassian Clouds.

    6. Anonymous Coward
      Anonymous Coward

      A friend was ordering work surfaces for a new kitchen. One had to have a small corner rectangular cut out at one end. When they arrived he found that he had measured the cut-out incorrectly - and the hole was 100mm (4") too long. So he re-measured and ordered a replacement. When it arrived he found he had made the same mistake again. That was a feature until the next kitchen rebuild several decades later.

      A stained glass artist made me two panels for my front door. Came the day she arrived with the two large constructions of glass and lead. Then she found they were 100mm too wide. She went away and soon came back with them trimmed on one side. The interesting thing is that her design looked more aesthetic by leaving the trimmed portion to the imagination to complete. Cramming the whole design into less width would not look as good.

  3. Anonymous Coward
    Anonymous Coward

    immediately thought of

    Precisely. The circuits that cannot be cut are cut automatically in response to a terrorist incident. You asked for miracles, Theo. I give you the F...B...I...

  4. wolfetone Silver badge

    GDPR

    Is there nothing it can't screw up?

    To be fair, I think when I heard of all of this I was expecting something monumentally stupid. Like "rm -rf" in the wrong folder of the server and no backups performed. This though, it's fairly honest and happens to all of us at some point.

    1. stiine Silver badge
      Mushroom

      Re: GDPR

      It wasn't rm -rf. It was shred /dev/disk/customers

    2. OhForF' Silver badge
      WTF?

      Re: GDPR

      I agree with the rest of your post but the title and first sentence make me wonder.

      I don't see where GDPR says you can't mark an app in a cloud for deletion and keep it around for some time when it becomes obsolete.

      It is a bit different when you have to get rid of personally identifiable data, but that's not what they should have deleted, so why do you blame this on GDPR?

      1. Jon 37

        Re: GDPR

        It's not just GDPR. There are a bunch of laws that might require data to be really deleted.

        Without those laws, the sysadmins could always do "mark as deleted", which can be easily undone when someone makes a mistake. Because of those laws, they had to add a "really delete this now" mode to the script. And when someone made a mistake and had used that option, there was no way to get the data back except restoring from backup.

        1. Crypto Monad Silver badge

          Re: GDPR

          May I propose a solution:

          - Script 1 marks listed items for deletion

          - Script 2 permanently deletes listed items, but only if they have already been marked for deletion (i.e. selective "empty trash")

          If you want permanent deletion, you have to run script 1 followed by script 2. Preferably with 24 hours in between.
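
          The two-script proposal above can be sketched roughly as follows, in Python. This is only an illustration under assumptions: the `store` dict, the `marked_at` field, the `GRACE` period and both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(hours=24)  # hypothetical cooling-off period between the two scripts


def mark_for_deletion(store, ids, now):
    """Script 1: soft delete. Records when each listed item was marked."""
    for i in ids:
        store[i]["marked_at"] = now


def purge(store, ids, now):
    """Script 2: permanently delete listed items, but only those that were
    marked at least GRACE ago. Everything else is refused, not deleted."""
    purged, refused = [], []
    for i in ids:
        marked_at = store.get(i, {}).get("marked_at")
        if marked_at is not None and now - marked_at >= GRACE:
            del store[i]
            purged.append(i)
        else:
            refused.append(i)
    return purged, refused


t0 = datetime(2022, 4, 5, tzinfo=timezone.utc)
store = {"site-a": {}, "site-b": {}}
mark_for_deletion(store, ["site-a"], t0)
print(purge(store, ["site-a", "site-b"], t0 + timedelta(hours=1)))   # too early: both refused
print(purge(store, ["site-a", "site-b"], t0 + timedelta(hours=25)))  # only site-a is purged
```

          A wrong ID list handed to the second script does nothing to items that were never marked, which is the point of the split.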

      2. Anonymous Coward
        Anonymous Coward

        Re: GDPR

        Customer data is the *first* thing you have to get rid of. That's almost the entire point of GDPR.

        Scrubbing PII is a secondary concern that affects log retention etc etc.

    3. anticlimber

      Re: GDPR

      I don't know why you were downvoted for this. Large providers now have to balance what they were doing before (keeping backups, in case customers needed them) vs complying with GDPR...which is well-meaning but has deeply eroded the durability of customer data, industry wide.

      Particularly in complex systems, "delete user data U within N days" ends up having N days divided, serially, across multiple systems. An underlying system that could safely and easily restore user data from any deletion in the last 3 months (permanent delete after 3 months) now has a permanent deletion window of, say, 1 month...so the next system up the stack can have its deletion budget.

      Cryptoshredding, you say! Well, lots of smart lawyers have studied that and the results are inconclusive at best.

      Have a look at the definitions for RPO and RTO. The missing thing is -- how long are these guarantees good for after data is accidentally deleted? A lot less time, after GDPR.

  5. Headley_Grange Silver badge

    Every Cloud Problem has Silver Lining?

    I don't know if it's related or not, but since these Atlassian problems I've not been getting my daily junk mails from various Atlassian domains.

    1. Flocke Kroes Silver badge

      Re: Every Cloud Problem has Silver Lining?

      Except for the mushroom shaped ones which have a lining of iodine 133 and strontium 91.

      1. DJohnson
        Happy

        Re: Every Cloud Problem has Silver Lining?

        Isn't Strontium-90 described as a "silvery metal"?

        1. Pascal Monett Silver badge
          Coat

          Ooh, shiny !

          1. Anonymous Coward
            Anonymous Coward

            That's a good marketing theme when trying to sell cows' milk after nuclear fallout contamination of grass.

            A bit like the breakfast cereal advert that showed glowing children. It was quickly parodied by adding mention of the Windscale reactor leak.

            1. Anonymous Coward
              Anonymous Coward

              And don't forget to brush your teeth with Doramad Radioactive Toothpaste for a healthy glow, after you've had your ReadyBrek…

  6. VoiceOfTruth

    While I appreciate the honesty...

    -> The bad news is that while the company can restore all customers ... there is no automated system to restore "a large subset" of customers into an existing environment, meaning data has to be laboriously pieced together.

    That doesn't strike me as very good at all. It seems more like a reconstruction from whatever is available rather than a backup/restore. Be warned: if it can happen once it can happen again. So Atlassian should design a better recovery system.

    1. Anonymous Coward
      Anonymous Coward

      Re: While I appreciate the honesty...

      And... "The company is moving to a more automated process to speed [restoration] up"

      well 'automation' caused the problem in the first place, so that should go well...

    2. stiine Silver badge

      Re: While I appreciate the honesty...

      Don't you mean a better backup system?

      1. VoiceOfTruth

        Re: While I appreciate the honesty...

        No. I mean a better restore system. Backups are useless if you can't restore from them.

        1. Doctor Syntax Silver badge

          Re: While I appreciate the honesty...

          A good place to start would be to build the option to restore into the automated system. Move the data to a reserve location and only delete it a few days afterwards when it's clear there were no issues. Pretty well every desktop system and every email client has that; it's there for a reason. No, the reason isn't to archive the emails once you've read them.

        2. Falmari Silver badge

          Re: While I appreciate the honesty...

          @VoiceOfTruth "No. I mean a better restore system"

          One that can restore at the individual customer level.

          "The bad news is that while the company can restore all customers into a new environment or roll back individual customers that accidentally delete their own data, there is no automated system to restore "a large subset" of customers into an existing environment, meaning data has to be laboriously pieced together"

          https://forums.theregister.com/forum/all/2022/04/11/atlassian_outage_backups/#c_4443722

          1. Anonymous Custard Silver badge
            Mushroom

            Re: While I appreciate the honesty...

            This sounds like one of those broken arrow situations where you're not sure whether it would be more reassuring that they had something ready to fix up a major issue like this, or to be worried that they thought it likely enough to happen that they needed such a preparation in place...

            Icon for where the original dilemma came from.

            1. Anonymous Coward
              Anonymous Coward

              Re: While I appreciate the honesty...

              AKA the life-saver "here's one I made earlier".

  7. Arthur the cat Silver badge

    The script was executed with the wrong execution mode and the wrong list of IDs

    It's the wrong trousers Gromit! And they've gone wrong!

    1. Anonymous Coward
      Anonymous Coward

      Re: The script was executed with the wrong execution mode and the wrong list of IDs

      "It's the wrong trousers Gromit! And they've gone wrong!"

      Sounds like a definite case of "Out of cheese error"…

  8. ChipsforBreakfast

    Sh*t Happens

    No matter how many safeguards you build, checks you put in place or precautions you take the fuckup fairy will come calling sooner or later. The more systems you manage, the sooner she's likely to get to you - there is no escape.

    That's why we have things like backup strategies and RTOs, so that when she does visit it's not a company-ending event. At least they've been honest about what happened and how long it's going to take to put it right. No marketing spin. No fluff. Just an honest 'we screwed up, sorry'. They should be commended for that at least.

    Their lackluster RTO on the other hand isn't so easily forgiven....

  9. Anonymous Coward
    Anonymous Coward

    Erosion of trust

    "We know that incidents like this can erode trust"

    They can, but honesty and transparency in owning up to them can offset that. This incident has not inspired confidence in Atlassian's internal processes nor in the competence and experience of the staff responsible for it. However, there is no doubt at all that those same people will have learned some valuable hard lessons about change control, review, data validation, the value of staging environments, and enumerating all the recovery scenarios when designing backup and recovery systems. The people around them will have learned those same lessons in a somewhat easier manner. And the company's leadership are demonstrating the right values in their response, which when combined with the improvements likely to be made and whatever restitution or workarounds they're offering would probably be enough for most customers. If something really has to be trustworthy, it's already on-prem, and as a customer in this situation you ask yourself whether you'll be any better off with a competing service. My guess here would be probably not, unless I were already inclined that way for other reasons. It's much easier to accept a series of awful mistakes unlikely to be repeated than to accept dishonesty, evasion, panic, and refusal to learn. If you want to erode trust, follow the Okta model instead.

    1. Anonymous Coward
      Anonymous Coward

      Re: Erosion of trust

      There's another vector of trust here:

      By publishing an analysis of what went wrong, they made it clear they have worked out what happened, and so can start work on preventing a repeat.

      1. Jellied Eel Silver badge

        Re: Ch11 calling

        Depends on the customers. Fair play for explaining the cause, but the cause was essentially negligence resulting in substantial damage to the customers. If those litigate, Atlassian will probably have to pay more compensation than just service credits.

        I think it also neatly demonstrates the problem with cloudybollocks and especially enforcing cloudy SaaS. Businesses are forced to rely on the supplier, or try to find alternatives to Atlassian that they can manage themselves.

        1. Anonymous Coward
          Anonymous Coward

          Re: Ch11 calling

          Atlassian is an Australian company so I don't believe Chapter 11 applies. Maybe there's a US subsidiary that could go bankrupt. As for what customers could get out of them in a lawsuit, that would depend on their contract terms. I'm reminded of the standard clause in many parts of the world in residential Internet service contracts "No refunds, no warranty, for entertainment purposes only"; they're literally allowed to keep your money and never provide any service at all. Business contracts are typically only a little better than that, limiting the service provider's liability to formula-based service credits. Maybe a few giant customers negotiated better terms, I don't know.

          "I think it also neatly demonstrates the problem with cloudybollocks and especially enforcing cloudy SaaS"

          What does "enforcing cloudy SaaS" mean? Dictates from the customer CEO to outsource everything? Regardless, outsourcing core business-critical services is foolish. No one cares about your business like you do, and the SLAs and other contractual terms are never strong enough to compel competence. If you've decided to outsource something, it should be because you need temporary (with a known, committed, FIXED end date!), low-volume, or low-value applications that can't justify the capital cost of building out an owned solution. You must also have the discipline to reassess those attributes periodically and bring things in house if they start to matter, not only for reliability reasons but because the total cost of outsourced service is typically between 2x and 5x what you'd pay to run it yourself. Most companies' managers lack both the awareness and the discipline to do this successfully; only a CEO dumb enough to dictate all-SaaS would also be dumb enough to hire them.

          Expecting reliable service from an outsourced provider is silly; their basic mission is to get you to fire your IT staff and close your data centres, then hold you over a barrel while providing minimally acceptable service at the lowest possible cost to themselves. You should be thrilled with 3 9s and satisfied with 2; if 98% availability (or data integrity) isn't good enough, don't outsource. The best you can hope for is transparency when things go wrong and, within the limits imposed by cost-reduction rules, an effort to learn from mistakes. Trust is a relative thing and for outsourcing the bar is set quite low; somehow most still fail to clear it.

          1. Jellied Eel Silver badge

            Re: Ch11 calling

            I've not looked too closely at how it's structured, but it's common to keep a handy Delaware LLC for tax purposes, and access to Ch.11 protection. That tends to be more survivable than UK administration then liquidation once the administrators have extracted all the fees & expenses they can.

            As for contract protection, IANAL, but I don't think a contract automatically overrules torts like negligence.

            On enforcing SaaS, previous article mentioned Atlassian had been changing their product line and removing stand-alone server licenses to force migration to a cloud or subscription model. I loathe that business practice. On a server, I can take steps to mitigate risks, in a cloud model, my business is at the mercy of the service provider. That's a bit foolish when it's a business critical service.

            It's also unnecessary, e.g. if a cloud provider's just spinning up an instance in a VM, there's no real reason why I shouldn't be able to do that on my own hardware. Which is where litigation might help, so encouraging vendors to do a better job. Or just make damages high enough that risks become uninsurable.

            1. Fred Flintstone Gold badge

              Re: Ch11 calling

              I agree 100%.

              As an aside, I would also commend you for the term cloudybollocks :).

  10. Anonymous Coward
    Anonymous Coward

    Yeah, their DR window wasn't accurately or clearly communicated to their customers

    Not sure what was in their SLA, but clearly the whole architecture of their cloud is flawed if customers expected to be able to use the services again in a timely manner. This incident wiped out a small slice of their customer base. What if it wiped the lot? If 4-5 days for 60 tenants and 45 days for 400 is the window, how long to do a bare metal restore of services? What if they had a datacenter fire, flood, or other site disrupting event?

    Is waiting a year for your data to be restored reasonable? Who decides what the queue is? For businesses whose operations are tightly integrated with their software that could be a death sentence.

    Part of provisioning a cloud service at this point should be building a reasonable recovery window, communicating that with your customers, or working with them to build and test their own continuity of business plans to handle failing over to a locally hosted box or an alternate service. Since SaaS offerings may not be as easily swappable as compute workloads, the provider needs to be better prepared to react to issues like this.

    1. Malcolm Weir

      Re: Yeah, their DR window wasn't accurately or clearly communicated to their customers

      I think the issue is that DR backups are not a substitute for "archival"-type backups. The problem appears to be that since Tenant A was not impacted (because they didn't have the obsoleted tool) but Tenant B was (because they did), you can't do a DR recovery to make Tenant B "whole" because you'd wipe out everything that's been chugging happily along with Tenant A over the past 9 days.

      Personally, I reckon performing a full DR recovery on an isolated cluster and then transferring all the "Tenant Bs" (which you know, because you have the list of "zap these" IDs) is probably smarter than doing some kind of selective restore of the DR media, but this approach requires having a sufficient pool of hardware available, which is not always the case, because DR scenarios often work on the basis that everything is available for the recovery, with no accommodation for the issue that some tenants are still working fine...

      1. Androgynous Cow Herd

        the issue is that DR backups are not a substitute for "archival"-type backups.

        This - precisely - BACKUPS are NOT a DR/BC plan. They are "Part of a balanced breakfast" but the RTO for this kerfuffle is completely untenable.

        Archive data has no place in the DR/BC runbook. Full Stop. Compliance issues happen when they co-mingle.

        The worst joke in all of IT is "The backups are easy, but the restore is a bit tricky".... And there are so many solutions in place that can do snapshot cloning etc. to reduce complexity and RTO for DR/BC runbooks.

        You archive data you hope to never see again...you build a DR play for data you must be able to see again.

    2. flibble

      Re: Yeah, their DR window wasn't accurately or clearly communicated to their customers

      "What if it wiped the lot?"

      They have an automated DR recovery for full recovery, so that would have gone much better. The problem here was the need to restore only some of the data, which they didn't have any automation for.

  11. Filippo Silver badge

    Shit happens. It's what you do when shit happens that counts.

  12. aaaaaargh

    Just asking, all honestly admitted mistakes aside, why does a single team even have the power to delete so many customer installations? Should the users and their installations not be separated into segments or something?

  13. Anonymous Coward
    Anonymous Coward

    Redmine is the better product anyhow.

  14. benderama

    well

    The initial deletion script did exactly as it was supposed to do. The error, as always, was human. Humans provided the incorrect IDs for deletion, humans failed to prepare a feasible restore process.

    They should take from this the skill to consider as many what-ifs as possible instead of only the expected outcome.

    1. Doctor Syntax Silver badge

      Re: well

      They should take from this the fact that paranoia is a basic requirement in system administration.

      1. Will Godfrey Silver badge

        Re: well

        Paranoia is a basic requirement for any design that comes into contact with wetware.

      2. Brad16800

        Re: well

        Amen. Bothered to sign in on my mobile just to update. Sys admins are the gatekeepers for good reason.

  15. Anonymous Coward
    Anonymous Coward

    I bet someone would have caught the mistake had they been given a list of human-readable customer IDs and app names (e.g. acme.com and app01) instead of UUIDs (e.g. 856fd738-bc55-11ec-8422-0242ac120002 and 7fd2fb10-bc56-11ec-8422-0242ac120002).

    1. Warm Braw

      Given they're Australian, I imagine it's a case of TITSDOWN - Total Inability To Successfully Delete Over Wrong Nomenclature.

    2. Anonymous Coward
      Joke

      Oi! That's my Jira account you've just listed there. Use yer own!

    3. Down not across

      Or the deletion script could look up the ID mappings and bail out or warn "You really sure you want to do this?" if the ID was for a whole site rather than an app.

      Likewise, perhaps the mappings should include a GDPR compliance flag, and the "GDPR mode" would not engage if the flag for that ID was not set.
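
      A rough Python sketch of that kind of pre-flight check. Everything in it is an assumption for illustration: the `registry` mapping, its `kind`, `name` and `permanent_ok` fields, and the `check_targets` function are hypothetical and not taken from Atlassian's tooling.

```python
class RefusedDeletion(Exception):
    """Raised when an ID list fails the pre-flight check."""


def check_targets(registry, ids, expected_kind, permanent=False):
    """Resolve every ID before deleting anything.

    Refuses the whole batch if any ID is unknown, is not of the expected
    kind (e.g. a site where an app was expected), or is requested for
    permanent deletion without its 'permanent_ok' flag being set.
    """
    for i in ids:
        entry = registry.get(i)
        if entry is None:
            raise RefusedDeletion(f"{i}: unknown ID")
        if entry["kind"] != expected_kind:
            raise RefusedDeletion(
                f"{i}: is a {entry['kind']} ({entry['name']}), expected {expected_kind}"
            )
        if permanent and not entry.get("permanent_ok", False):
            raise RefusedDeletion(f"{i}: permanent deletion not flagged for {entry['name']}")
    return list(ids)


registry = {
    "856fd738": {"kind": "site", "name": "acme.com"},
    "7fd2fb10": {"kind": "app", "name": "app01", "permanent_ok": True},
}
check_targets(registry, ["7fd2fb10"], expected_kind="app", permanent=True)  # passes
```

      Because the error message includes the human-readable name, the operator also gets back the safety check that opaque IDs take away.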

  16. Big_C

    They lost a lot of trust - and rightly so.

    While the CEO gave a nice explanation and most likely anybody in IT can sort of relate to that blunder, it took them way too long to come clean.

    And that "hundreds of engineers" working around the clock statement was imho bs.

    I guess they make restore runs from a large full site backup but have capacity issues, so they can only recreate a limited number of sites each time.

    And the slow speed points to tape systems...? Let's hope it does not break and that the current backups do not need to be stopped during the recovery.

    1. yoganmahew

      Tape is quick if it is distinct. Like other commentators, I suspect they had a backup of everything written horizontally (maybe even at disk level?) that they're now trying to extract little vertical slices from. If you backed up everything in a customer instance logically to tape, it would be trivial to restore.

  17. cjcox

    Atlassian is making the official report available

    You can get the official report if you install the Jira Oopsie Plugin, which is free for up to 10 users, then a mere $10/mo. per user greater than 10. There's also a Premium level that removes some important redactions at $20/mo. per user. The ability to go beyond page 1 of any report is supplied through the Atlassian Marketplace with plugins such as Money Money Money; you will have to check the marketplace for current pricing.

  18. sean.fr

    Scripts

    Is there anyone in IT who has not been burned by a bad script?

    Anything one-off has a high chance of bugs.

    So I test scripts on a small subset first - possibly as small as one vm/switch/customer.

    Then a bigger subset...

    This does not look like that time my script got a "," when I was expecting a "." because I didn't allow for a mix of country settings.

    This looks like no testing at all.

    1. ecofeco Silver badge

      Re: Scripts

      This. All of this.

    2. OhForF' Silver badge

      Re: Scripts

      I agree that bad scripts are an issue. I have been there myself running a script doing something like "find $LOG/ -mtime +1 | xargs rm -rf" when $LOG was not set ...

      Only this was not an instance of a script being bad/doing the wrong thing.

      As other posters have already pointed out the script did what it was supposed to do.

      They may even have tested with a single site first and a small subset of sites after that - but using the correct (application and not site) ID in those tests, then providing the wrong IDs when they went for the remaining sites in the batch.

      What the article does not state is whether there is any instance that checks and approves the "deactivation request" or if that is just forwarded to another team for execution.

      This missing check for what was actually requested for deletion is where it went wrong - I don't think you can blame that on a script.

      1. ecofeco Silver badge

        Re: Scripts

        That was the OP's point. No matter what, testing the script is SOP.

        Brevity is also good.

    3. GermanSauerkraut

      Re: Scripts

      "This does not look like that time my script got a "," when I was expecting a "." because I did allow for a mix of country setting."

      As far as I understand, the script worked flawlessly and did exactly what it was told to do, without any issues.

      The problem was a single script used for three completely different operations. 1) to remove an app from a customer instance, 2) to mark a customer instance for deletion, and 3) to completely nuke a customer instance, with the script distinguishing between 1 and 2/3 only by the objects referenced by - most likely - completely synthetic IDs.

      While I can to some degree understand why one would combine 2 and 3, there is no justification for adding 1 to the mix. If you're using similar functionality in the background, put that into a library of some sort, but provide distinct front ends for the ops guys to use. What happened is the textbook example of why one should do that...
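
      To illustrate the "shared library, distinct front ends" idea, here is a small Python sketch. The three entry points mirror operations 1 to 3 above, but every name in it (`_remove`, `remove_app`, `mark_site_for_deletion`, `purge_site`, the `store` layout) is hypothetical, and treating app removal as an immediate delete is an assumption.

```python
def _remove(store, ids, kind, permanent):
    """Shared back end. Acts only if every ID is a known object of `kind`."""
    wrong = [i for i in ids if store.get(i, {}).get("kind") != kind]
    if wrong:
        raise ValueError(f"not a known {kind} ID: {wrong}")
    for i in ids:
        if permanent:
            del store[i]
        else:
            store[i]["deleted"] = True  # soft delete, can be undone
    return list(ids)


# Distinct front ends: the object kind and the deletion mode are fixed by
# which command is run, not by a flag or by the IDs that were passed in.
def remove_app(store, app_ids):
    return _remove(store, app_ids, kind="app", permanent=True)


def mark_site_for_deletion(store, site_ids):
    return _remove(store, site_ids, kind="site", permanent=False)


def purge_site(store, site_ids):
    return _remove(store, site_ids, kind="site", permanent=True)
```

      With this split, feeding site IDs to `remove_app` fails before anything is touched instead of quietly switching to a different operation.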

  19. Trotts36

    Lol

    “We know incidents like this can erode trust”

    - ahem; yeah you’re screwed. Incompetence personified

  20. Franco Bronze badge

    Twas a major cock-up, but hearing how it happened I'm not surprised. I had a contract a while back where one of the things we did was set up Atlassian Cloud (non-profit organisation, so they got very favourable pricing, hence the choice of Atlassian by the higher-ups), and Insight was very clearly an afterthought. Jira and Confluence had lots of tools for importing data and customising it; Insight was almost entirely accessible only via a REST API if you wanted to do anything in bulk, had pretty much no documentation, and support knew almost nothing about it.

    They added a web-based bulk import tool for assets a few months after I'd written a PowerShell script to convert everything from a CSV to a JSON and then import it via the REST API.

  21. I Am Spartacus

    I've said it before...

    Having your apps on the cloud is really having them on someone else's computer. You trust that person to do the right thing, all day, every day. Exporting your business to the cloud does save you an IT department, but it does not relieve you of risks: business, technical or human.

    I use Atlassian on the cloud. The biggest issue is that we can't back it up ourselves. There is no "Export and download" function.

    1. Jellied Eel Silver badge

      Re: I've said it before...

      Of course you can't do local backups. The objective is to maximise revenue, not availability. Backups will be 20c per KB. Restoration $5,995 plus 50c per KB.

      I try to talk clients out of stuff like that. If they have business critical data, they need to make sure they can access it. I've had numerous jobs where apps like SAP were used to control production in sites connected via cheap xDSL. They go down, business stops. Money was saved, briefly.

      Dedicated infrastructure means a CTO might need a bigger budget, but they should be more directly aligned to keeping the business alive. IT staff only have 1 customer (ish) vs being in 1/400th of a ticket queue awaiting restoration.

      It's bizarre to me that a lot of senior execs don't get this, and keep drinking the cloudy kool aid. OK, the business might save $500k a year in payroll, but if a day's outage costs $5m, it can be a bad investment.

      Plus there's other FUN! Telecomms seemed to be a Remedy (oh ARS!) shop. Often used for trouble tickets and tasks. Want me to do something? Send me a task. Manglement can then run reports to see what's going on. I was in one job where the system went down, and had to be restored. Which then meant previously closed tickets re-opening, tickets raised not being in backups, and it took a couple of weeks to get everything back in sync with reality.

      And if you've outsourced your IT to the cloud, you've got a lot less capacity to manage the fallout from a major cloud outage.

    2. lostinspace

      Re: I've said it before...

      There is an export function. There is also a REST API endpoint for it. We've scripted this to back our Jira instance up nightly. This is more in case we make a massive cockup like bulk deleting all tickets, rather than expecting to be doing Atlassian's job for them, though.

  22. amacater
    Pint

    Lessons learned: backups, test DR, institute a two person rule - everything no-one does

    This is a "We'll keep some of our data in our own data centre, thanks" moment for anyone looking to move to cloud offerings from Atlassian.

    This is an "Instigate a two-person rule for major changes - and test, test and test again" scenario for anyone vaguely competent as a lessons learned.

    This is a "Keep your instances of production, dev and reference on the same version" scenario for anyone using Atlassian data centre versions at the moment - and probably a "distrust the sales droid from Atlassian who will try to upsell you cloud".

    This is a "Run, don't walk, from a major purchase or future deployment of Atlassian" for many considering this.

    Backups and DR don't solve everything - a pint for the poor Atlassian folk saddled with unfsck'ing the mess in spite of everything.

  23. Pirate Dave Silver badge

    Fresh reminder that "cloud" doesn't mean it's properly backed up or replicated or will be available in the event of an incident on the hosting-provider's end.

    I got bit by this a couple of weeks ago - had a CentOS virtual server hosted at vultr.com to run my two websites. They had a "major hardware issue" on the storage hosting my server, and the vm for my server is gone. Poof. Apparently they only do backups/replication if you pay more. Otherwise, server's just gone. "Sorry, have 2 free months of hosting for your trouble. Goodbye."

    My error was that I didn't have my own backup, because, well ..."Cloud Hosting!" But turns out it was really just a relatively fragile, unprotected single box hosting it.

    1. Anonymous Coward
      Anonymous Coward

      Was your error not more the assumption that they had a backup for the lowest tariffs if they screwed up?

      I have been looking at some contracts and T&Cs of late, and the crap that they sometimes hide in the fine print is horrific.

      1. Pirate Dave Silver badge
        Pirate

        I guess my root error was assuming they would try to provide the same level of "always keep it going" that I do at work. Backups, replicas, hot copies, etc, all in an effort to not lose stuff that my users need. I mean, they're a "hosting" company, that's the kind of stuff they should do in the background as a basic part of their hosting, IMHO.

  24. Slow Joe Crow
    Mushroom

    This will affect purchase decisions

    I have one client using one Atlassian product, Bitbucket, which they are forcing into the cloud. We have already been unable to remediate Log4J because Atlassian won't sell us a license that would let us install a patched version, so I will recommend dropping it in favor of vanilla Git. We have a lot of stuff in the cloud and we have been selling a lot of backup products for Microsoft 365 and G Suite lately.

  25. Anonymous Coward
    Anonymous Coward

    Quite a lot to unpack really, just off the top of my head. Firstly, this sounds like something that should be done with a tool, not random scripts: one that has had testing, can only grab relevant IDs and ideally has rollback built in. Secondly, even if the above wasn’t going to happen because the company was cheap, they should have meticulously tested the script in a prepped environment with the exact IDs and commands first. Thirdly, there seems to have been a total lack of oversight. Two or three people should be checking these kinds of big changes. Did the script and IDs not even get checked by someone else? No approval process?

    Whilst I appreciate their honesty, they have severe management and process deficiencies to rectify once they’re back up.

    1. Anonymous Coward
      Anonymous Coward

      I don't think using scripts is a problem. The take-aways for me are:

      1) Lack of roll-back. You should always assume a script will go wrong at any point during its execution sooner or later - if only because of the chance of random hardware failure. So there should be a mechanism to rollback from the point that it went wrong as well as rollback from a successful conclusion.

      Additionally, I think I would have pushed for "delete" puts accounts offline and a second script does the GDPR full nuke thing which could be run later after customers have had a chance to complain.

      2) Use IDs that are immediately obvious as to what they refer to. I'm making an assumption here but, as has been mentioned above, if the IDs were prefixed with "CUST" for entire customer accounts and "APP" for apps deployed then a Mk.1 eyeball would have easily detected the impending disaster.
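
      A minimal sketch of that second point, assuming IDs of the form "CUST-123" and "APP-456" (the prefixes, the format and the function names are all hypothetical). The script states which kind of ID it expects and refuses the whole batch if anything else turns up.

```python
from enum import Enum


class Kind(Enum):
    CUSTOMER = "CUST"
    APP = "APP"


def parse_id(raw):
    """Split 'APP-42' into (Kind.APP, 42); reject anything malformed."""
    prefix, sep, number = raw.partition("-")
    if not sep or not number.isdigit():
        raise ValueError(f"malformed ID: {raw!r}")
    # Kind(prefix) raises ValueError for an unknown prefix.
    return Kind(prefix), int(number)


def check_batch(raw_ids, expected):
    """Return the numeric IDs, or refuse the batch if any ID is the wrong kind."""
    parsed = [parse_id(raw) for raw in raw_ids]
    wrong = [raw for raw, (kind, _) in zip(raw_ids, parsed) if kind is not expected]
    if wrong:
        raise ValueError(f"expected {expected.value} IDs, got: {wrong}")
    return [number for _, number in parsed]
```

      An app-removal job would call `check_batch(ids, Kind.APP)` before touching anything, so a list of customer IDs is rejected up front rather than acted on.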

  26. Anonymous Coward
    Anonymous Coward

    Been there seen that

    Having been involved in a situation where a principal engineer decided on his own to run an unreviewed cleanup script in max parallelism across the entirety of AWS on the weekend, creating a day long service outage in the US ... yeah. Started my Sunday morning hung over on a 12 hour technical conference call.

    People, please, just don't do that.

    Obviously he kept his job. Wasn't even chastised.
