At last, Atlassian sees an end to its outage ... in two weeks

The Atlassian outage that began on April 5 is likely to last a bit longer for the several hundred customers affected. In a statement emailed to The Register, a company spokesperson said the reconstruction effort could last another two weeks. The company's spokesperson explained its engineers ran a script to delete legacy data …

  1. oldtaku Silver badge
    Mushroom

    Cheers to that guy who hit the ENTER key!

    I've certainly done some eff-ups, but nothing that comes remotely close to this scale.

    1. John Riddoch

      Re: Cheers to that guy who hit the ENTER key!

      We eagerly await the "Who, Me?" column...

    2. Peter D

      Re: Cheers to that guy who hit the ENTER key!

      Many moons ago I accidentally deleted the members database of the London Stock Exchange. I stayed in until 4am putting it back, to avoid being on the front page of the Evening Standard the next day.

  2. An_Old_Dog Silver badge

    chicken & pig

    "... we are fully committed to resolving this."

    In a ham-and-egg breakfast, the chicken is involved. The pig is committed.

  3. Falmari Silver badge
    Facepalm

    We maintain extensive... DMML

    "We maintain extensive backup and recovery systems"

    Backup maybe but not recovery.

    There does not seem to be a system in place to recover and restore this particular set of data, because if there had been a well-maintained and tested recovery system in place, the data would have been restored days ago.

    It would not have taken a week to restore just 9% of their customers' data.

    1. A Non e-mouse Silver badge

      Re: We maintain extensive... DMML

      An untested backup is a non-existent backup.

      It sounds like Atlassian haven't properly tested their backups.

      Hopefully regular backup testing will be part of their standard operating procedure.

      1. thondwe

        Re: We maintain extensive... DMML

        Or the backup system is designed to recover from a system-level failure - not individually deleted data sets? So they're having to rebuild a complete system from backups before they can pull the data sets out?

        I suspect this setup is common in cloud land, hence the need for customers to somehow keep backups of their own data sets (read the small print for MS365/Azure, for example).

        Given the embarrassment caused by this outage, cloud providers need to look at their customer backup/recovery processes?

        1. Anonymous Coward
          Anonymous Coward

          Re: We maintain extensive... DMML

          Probably this.

          Working with Confluence and Jira a good number of years ago, the only way to give someone a copy of one workspace was to backup everything, then restore it all to a new instance, delete everything you didn't want and then export what was left over.

          Caught in their own incompetency it seems.

          1. Falmari Silver badge

            Re: We maintain extensive... DMML

            @thondwe “Or the Backup System is designed to recover from a System level failure - not deleted individual data sets? So they're having to rebuild a complete system from backups before they can pull the data sets?”

            I think you are very much on the money there. Probably made all the more difficult and time-consuming by this, @AC:

            “the only way to give someone a copy of one workspace was to backup everything, then restore it all to a new instance, delete everything you didn't want and then export what was left over.”

            It looks like there is no way to export/backup and import/restore at the workspace or even the individual customer level. If that is the case, then Jira/Confluence etc. is missing a very important disaster recovery tool/feature, for both Atlassian's customers and Atlassian themselves.

        2. ronkee

          Re: We maintain extensive... DMML

          Anything an errant script can do is still within the blast radius of a ransomware attack. In this day and age it's more likely than a catastrophic site failure of primary and standby.

          This is the disaster your DR plan should be tested for.

          1. jgard

            Re: We maintain extensive... DMML

            Any ransomware outfit would be delighted to come up with a script that does so much damage for so little effort - the destructive efficiency on show is unprecedented. I hope Atlassian are keeping the script under lock and key, because we would be well and truly f**ked if the Russians got hold of it!

            But on a more serious note, I would like to offer heartfelt and genuine sympathy to the person or persons responsible. It could happen to the best of us at some point in our careers. I can only imagine how they have felt the last week or two. Chin up, guys and/or girls!

        3. John 104

          Re: We maintain extensive... DMML

          @thondwe

          "Given the embarrassment caused by this outage, cloud providers need to look at their customer backup/recovery processes?"

          They won't. It's just easier to go cheap and write it off. This is clearly a case of an incomplete BCP, which is another area that businesses like to go cheap on. Maybe backup software but no recovery plan at all.

      2. Anonymous Coward
        Anonymous Coward

        Re: We maintain extensive... DMML

        >> It sounds like Atlassian haven't properly tested their backups.

        That's a hugely simplistic and disingenuous statement to make!

        Imagine your standard database from a mainstream database vendor. You can have in place all the fault tolerance, log shipping, full and incremental backups, offsite storage and the ability to restore that database (relatively) quickly.

        Now tell me how any of that will allow you to effectively restore data after a script has selectively removed thousands of records from dozens of tables.

        That's much more like what Atlassian are facing than a simple "restore the backup" scenario.

        Not that this is intended to be a defence of Atlassian. When you have a process that may delete copious amounts of data, it's incumbent on you to take whatever time is necessary to ensure it does (only) the right thing. If that means taking three times as long in order to run it in a "what if" mode first, then that's what you need to do. Yes, it will take longer, but I would suggest not as long as it's taking Atlassian to recover from this.
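
        For what it's worth, one way to get that "what if" behaviour out of a plain SQL database is to do the destructive work inside a transaction and only commit once the damage has been inspected. A rough sketch - the table, column and value are invented for illustration, not anything Atlassian actually uses:

          -- Dry-run pattern: nothing becomes permanent until COMMIT
          BEGIN;

          DELETE FROM site_data
          WHERE service = 'legacy-app';   -- the intended clean-up scope

          -- Sanity-check what survives while the change is still reversible
          SELECT COUNT(*) AS surviving_rows FROM site_data;

          ROLLBACK;   -- dry run; switch to COMMIT only after the numbers check out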

    2. Anonymous Coward
      Anonymous Coward

      Re: We maintain extensive... DMML

      Indeed. They'd better add some time criteria to their restore process, or they might as well restore from punch cards and C60 cassettes.

    3. IGotOut Silver badge

      Re: We maintain extensive... DMML

      I don't know. It may be they have the data, just spread over 10,000 tapes with a few MB on each.

    4. Anonymous Coward
      Anonymous Coward

      Re: We maintain extensive... DMML

      There never is. Non-technical execs cannot differentiate between backups and business continuity.

  4. Anonymous Coward
    Anonymous Coward

    slightly good news?

    We got our Confluence and Jira back yesterday after a most uncomfortable week at the coalface. Glad they got it back up, not so pleased it happened (to put it mildly).

    1. Anonymous Coward
      Anonymous Coward

      Re: slightly good news?

      Or - you look for a different solution where you can retain control of your data...

      1. Franco

        Re: slightly good news?

        Unfortunately those options are becoming fewer and farther between. PHBs want cloud, vendors want to sell cloud and, as stated in the article, Atlassian are stopping sales and support for on-prem products.

    2. SgtPepper

      Re: slightly good news?

      "This data was from a deprecated service that had been moved into the core datastore of our products,"

      Out of interest, did you use the Insight asset management plugin?

      They recently moved Insight into the core offering, so I wonder if this was the cause.

  5. Anonymous Coward
    Anonymous Coward

    Unfortunate

    As we have a big project coming soon, and our on-prem Confluence is on discontinued HW with zero support.

    So either I choose a dead on-prem platform or their ... service ...

    1. Doctor Syntax Silver badge

      Re: Unfortunate

      I've always said paranoia is the first requirement for any DBA. Maybe this has been a suitable paranoia upgrade for Atlassian.

      1. Arthur the cat Silver badge

        Re: Unfortunate

        Arguably paranoia is the first requirement for anyone in the computer industry. We're all fighting against Resistentialism.

    2. Anonymous Coward
      Anonymous Coward

      Re: Unfortunate

      Remember to allocate resources for user support if you move - the server & cloud variants are not the same.

  6. Abominator

    So their "Cloud First" strategy is going well. More like bork your business and burn everything to the ground.

    Why would customers trust these fucking idiots with their data going forward?

    1. Pascal Monett Silver badge

      Because the CTO is the CEO's cousin and the CFO is his mistress, and they both say it'll be fine.

  7. Pascal Monett Silver badge

    "no authorized access to customer data has occurred"

    Yes, we know customers have been cut off from their data.

    What about unauthorized access?

  8. Kurgan
    FAIL

    The cloud is the future

    "The cloud is the future". Sure.

  9. Terje

    Yet another reason why I loathe the drive for cloud: when (not if) it falls over, you are fully at the mercy of a cost-optimised, likely understaffed company hopefully having a working backup and recovery system to sort it out at their leisure. When you have control over the system, you are at the mercy of your own backup and recovery procedures, which you yourself can influence.

    1. Plest Silver badge

      Very true, but you can't let the lead PHB be blinded by the freebies and jollies and sign up with the first Flash Harry who says they can "do it all, whatever you need".

      1. An_Old_Dog Silver badge

        Gonna need a rug ...

        after someone "stops" my PHB, because his brain is permanently wired into accepting vendor wooing/outright kickbacks, and loving shiny-new-trendy things. He gets away with it because he's a "people person" and he works it like a god. All his higher-ups think he's the greatest thing since toast.

    2. Anonymous Coward
      Anonymous Coward

      Two incidents reflect the "benefits" of cloud.

      1. Had to stand up a Citrix CVAD environment in the cloud service a couple of years back. If it had been on premises, I could have had the basic service set up in a day, then just spent time on tuning and optimising the VDAs. In cloudland it took a week to get the basic service, as whenever I hit an issue I could not do any troubleshooting and had to log a job and wait for them to get back to me. The first ticket I had to log was that the cloud service sign-up portal wasn't working, and it went downhill from there.

      2. A client wanted to stand up some extra Azure VMs of a type they were already using. Unfortunately, due to capacity issues, these particular VM types were all reserved for large clients. Our client didn't fit into this category so had to build new servers using a different VM type. Basically "We are only giving these VMs to people we give a shit about, and that doesn't include you matey! Just pay your bill and move along."

  10. Anonymous Coward
    Anonymous Coward

    It’s almost like Webex all over again…

    https://www.theregister.com/2018/10/03/cisco_webex_outage_script/

  11. Doctor Syntax Silver badge

    "Instead of deleting the legacy data, the script erroneously deleted sites, and all associated products for those sites including connected products, users, and third-party applications."

    With an SQL database any sufficiently paranoid DBA will run a SELECT with the same WHERE clause as the intended DELETE to check that what would be deleted is what's intended. I suppose this is all trendy NoSQL stuff. Doesn't that have the same facility?
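
    A minimal sketch of that habit - the table, columns and WHERE clause here are made up for illustration, not anything from Atlassian's actual schema:

      -- Preview exactly which rows the intended DELETE would touch
      SELECT site_id, site_name
      FROM sites
      WHERE service = 'legacy-app';   -- same WHERE clause as the planned DELETE

      -- Only once the preview (and its row count) looks right:
      DELETE FROM sites
      WHERE service = 'legacy-app';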

    1. talk_is_cheap

      Considering the age of Jira, and the 16-year-old tickets still outstanding for complicated tasks such as changing the owner of a wiki entry, I would say that NoSQL is right - but in the sense of pre-dating SQL rather than post-dating it.

  12. bazza Silver badge

    Isn't this embarrassing for a company that has encouraged users to go cloud only, to then delete the cloud?

    It's always good to be in a position where it's impossible to delete customer data...

  13. Captain Scarlet
    Facepalm

    This data was from a deprecated service that had been moved into the core datastore of our products

    Looks like changes were made, but not to the script.

  14. Franco

    Mass exodus from Atlassian Cloud in the post, no doubt. I'm reminded of the mass BlackBerry outage of about 10 years ago, which signalled the beginning of the end for BlackBerry.

  15. tip pc Silver badge

    Don't important things use a multi step delete anymore?

    If that was my cloud thing, I'd ensure that any wholesale deletes go through a multistage process (a rough SQL sketch follows the steps below):

    get a list of everything that will be affected by the delete process

    determine a process to undo what you're about to do

    mark each of those entries/files with a will-be-deleted-on-x-date flag

    see what is actually still using those entries/files; if they are due for deletion, no user processes should be accessing them

    remove access to those identified entries/files & see what breaks

    if no one has complained in 4 weeks then fully remove those entries/files.
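
    A rough illustration of that mark-then-purge idea in PostgreSQL-flavoured SQL - the table, columns and grace period are invented for the example:

      -- Step 1: flag the candidates instead of deleting them outright
      UPDATE workspace_data
      SET delete_after = CURRENT_DATE + 28   -- 28-day grace period
      WHERE service = 'legacy-app';

      -- Step 2: weeks later, and only if nothing has broken or complained,
      -- purge the rows whose grace period has expired
      DELETE FROM workspace_data
      WHERE delete_after IS NOT NULL
        AND delete_after < CURRENT_DATE;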

    It's expensive and time-consuming, and no one will know when it goes really well - which is actually the point.

    You want to make absolutely certain mistakes won't happen, so you over-engineer things. That's why people buy cloud solutions: they expect the careful dedication to ensuring things go right is being done by the cloud provider and included in whatever the fee is. The cloud provider benefits from their scale by having teams manage all that boring stuff, instead of each customer needing all the necessary teams to do that individually.

    1. Anonymous Coward
      Anonymous Coward

      Re: Don't important things use a multi step delete anymore?

      they expect the careful dedication to ensuring things go right is being done by the cloud provider

      Give me a moment until I have finished laughing, sorry. The big not-so-secret of these cloudy providers is that they have the same bookkeepers that moved your own company to the cloud in the first place, so I'm afraid you may have to tone down your expectations to the "almost good enough" level, because that's what you're going to get - after all, the fog surrounding cloudy things ensures you have zero insight into how the vapour is actually produced.

      They too have to satisfy shareholders they have not left a cent/penny/dime on the table.

  16. rgb1

    Outsourced

    Is it known whether Atlassian outsourced this job or ran it in house?

    1. Terje

      Re: Outsourced

      Of course it's outsourced - see the reference below (obligatory xkcd):

      https://xkcd.com/908/

  17. fidodogbreath

    15 seconds before the outage:

    > drop table PROD_MAIN: are you sure? Y[n]

    "Hell yes, I'm sure. How dare you question me?" Y <smashes Enter>

    >

  18. Auror

    *#ck the cloud version of Jira

    This is why I'm holding onto my Server instance of Jira as long as I can. At this point it's either selling Data Center to the PHB or finding another platform. As far as I'm concerned, they're pushing a less capable, less available product to try and make the execs more money.

    These kinds of screw-ups in the cloud are why I won't trust core services to a cloud provider without multiple layers of redundancy I control. Inevitably someone will screw up and I'll be the one paying for their "learning experience".

  19. captain veg Silver badge

    As someone forced by the PHBs to use the Bob-awful Jira and, latterly, Confluence, can I be the first to express regret that thus far my employer has not been affected in the slightest by this upcockage?

    -A.

  20. Anonymous Coward
    Anonymous Coward

    Maybe next time, they should run the script with the -whatif switch.

  21. This post has been deleted by its author

  22. Fruit and Nutcase Silver badge
    Joke

    Hercules v Atlas

    Herculean effort needed to fix Atlassian blunder

  23. spold Silver badge

    Small typo....

    >>> <6 hours

    Sorry, we meant months.
