'Database failure ate my data' – Salesforce customer

An apparent database corruption is being blamed for nearly five hours of severe disruption for Salesforce customers last week. Users running their applications on Salesforce’s EU3 instance experienced blackouts and intermittent service between 6.21pm and 11.36pm on 5 August. Salesforce at the time promised customers a " …

COMMENTS

This topic is closed for new posts.
  1. Hollerith 1

    What do you mean...?

    I thought _you'd_ put in the back-up tape!

    No, that was _your_ job.

    No way, it's your job on Wednesdays!

    Aw, man.... we're screwed...

    1. Justicesays
      Facepalm

      Re: What do you mean...?

      They have backups, which are as recent as one hour before the failure.

      So I don't think putting the backup tape in is an issue here.

    2. Pedo Bear
      FAIL

      Re: What do you mean...?

      Well it's not gonna be any of their jobs much longer lol

    3. Anonymous Coward
      Anonymous Coward

      Re: What do you mean...?

      The cloud has failed?

      It can never fail......

      1. Destroy All Monsters Silver badge
        Childcatcher

        Re: What do you mean...?

        The cloud is based on Larry's stuff...

  2. Anonymous Coward
    Anonymous Coward

    Hmmm...

    Only one hour of lost data from a catastrophic db failure?

    I'm not a big believer in cloud services, primarily because you have just about zero visibility into how well they're running the infrastructure - backups included - but in this case I can't imagine that many of us running our own infrastructure and suffering the same type of incident would have done much better.

    1. Shasta McNasty
      Boffin

      Re: Hmmm...

      I have to disagree.

      In this case the disaster recovery component also failed, which meant it wasn't fit for purpose. Having the latest data available when the primary component goes tits up is its ONLY true function.

      If I were cynical, I would suspect that these customers' DR servers also happen to be some other customer's primary servers. To the beancounters, this means the cloud company can save money on hardware and sell the same thing twice.

      One server fails, services fail over, servers can't cope with the increased load and bang everything lost.

      1. R 11

        Re: Hmmm...

        I'm not so sure it's about overselling a server.

        When one database goes down and the replicated server also fails, I'd be suspicious of a bug. A command run on one server caused corruption; the problem is that the command was replicated and run on the slave within a fraction of a second. With both databases otherwise identical, you get the same corruption on your live server and on the slave.

        What I'm wondering is why, if they have an hour-old backup, there's no transaction log from which they can restore. Or did the command that compromised the entire data set actually run one hour before they shut things down to prevent further damage?

        A slave is like RAID for your discs: you get redundancy, but no backup. Your backup is the last image, with the transaction log used to bring your databases up to date. If they have an hour-old backup, there's very little in the way of excuses that would justify not also having a transaction log to complete the restoration.
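
        (A minimal sketch of that backup-plus-roll-forward restore, using a toy in-memory "database" and an invented operation log rather than any real vendor's tooling: stop replaying just before the corrupting command and you lose seconds, not an hour.)

            import copy
            from datetime import datetime

            # Toy model: the "database" is a dict, and the transaction log is an
            # ordered list of (timestamp, operation) pairs recorded since the last
            # full backup. All names and values here are invented for illustration.
            backup_image = {"accounts": 100, "orders": 250}      # last hourly backup

            transaction_log = [
                (datetime(2013, 8, 5, 17, 30), ("set", "orders", 260)),
                (datetime(2013, 8, 5, 17, 55), ("set", "accounts", 101)),
                (datetime(2013, 8, 5, 18, 21), ("drop", "orders", None)),  # the bad command
            ]

            def point_in_time_restore(image, log, stop_before):
                # Restore the backup image, then roll the log forward up to (but
                # not including) the first operation at or after stop_before.
                db = copy.deepcopy(image)
                for ts, (op, key, value) in log:
                    if ts >= stop_before:
                        break
                    if op == "set":
                        db[key] = value
                    elif op == "drop":
                        db.pop(key, None)
                return db

            # Roll forward to just before the corrupting command: almost nothing lost.
            restored = point_in_time_restore(backup_image, transaction_log,
                                             datetime(2013, 8, 5, 18, 21))
            print(restored)   # {'accounts': 101, 'orders': 260}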

    2. John Smith 19 Gold badge
      FAIL

      Re: Hmmm...

      "Only one hour of lost data from a catastrophic db failure?"

      Since this is SaaS and t'Cloud, the correct answer for the amount of downtime should be zero.

      This is a paid service, so hardware and software should be provisioned (and, oh yes, tested) to a level that supports the customers paying for it.

      Fail is appropriate.

      1. cpreston

        Re: Hmmm...

        It's all about the SLA. If you have an SLA that promises zero data loss, then sure, it's reasonable to expect zero data loss. If you don't, then... I'd have to agree that an hour of lost data is better than most companies would do with their own data.

        1. fajensen
          Trollface

          Re: Hmmm...

          If you have an SLA, and bother to read it, there might be some unexpected but legally correct definitions of: "promise", "zero", "data" and "loss". Just Saying.

      2. Anonymous Coward
        Anonymous Coward

        Re: Hmmm...

        SaaS and Cloud may well be running on dual replicated MySQL* databases cared for by university students interested in Linux and in need of money! Just like with sausages, no one likes to know the details of how they are made.

        *Load goes up, the "backup" slips behind; once it is far enough behind, the backup stops (and the primary has to be halted to allow the sync to catch up - possibly it ate all the memory and halted itself, being shit and all). Happened ALL the time in My Worst Job Ever!

  3. cs94njw

    WTF!?

    Hourly backups are pretty good... but actual corrupted data? If a disk goes down, I'm expecting another disk in the array to carry on.

    If a stock exchange trading firm was using Salesforce to hold data, 1 hour could cost millions - and if it held trading data from customers, it could cost even more.

    Salesforce's prime functionality is for CRM. If they've binned a few customers, or even a large customer order....

    Not very reassuring if even a company this large can screw it up :(

    1. John Smith 19 Gold badge
      Unhappy

      Re: WTF!?

      "If a stock exchange trading firm was using Salesforce to hold data, 1 hour could cost millions - and if it held trading data from customers, it could cost even more."

      Stick a few noughts on that number.

      1. Destroy All Monsters Silver badge
        Holmes

        Re: WTF!?

        It would either cost millions, or it would save millions, depending on how the weathervane of the casino was turning.

        "PHEW---WE HAD DOWNTIME!"

        We need an icon with a crazy monkey.

        1. John Smith 19 Gold badge
          Happy

          Re: WTF!?

          "It would either cost millions, or it would save millions, depending on how the weatherwane of the casino was turning."

          Actually I've been around when one of these events happened.

          And when the banksters call their lawyers, the damages they will be suing for will always be losses, i.e. the profit their departments would have made.

          Probably the only time in my life (or his) I'll see a Board member of an investment bank run up 4 flights of stairs.

  4. Tom 38

    Phew!

    Thankfully we're on eu1 here, that could have been a real clusterfuck. We pull the most important data out of salesforce every 15 minutes anyway (don't ask), but that would just mean that someone would expect us to restore it to salesforce.

    1. Barts

      Re: Phew!

      Ever had to restore, or done a dry-run restore, from one of your 15-minuters? Do you think it would be hellish?

  5. Anonymous Coward
    Anonymous Coward

    Ah the joys of CDP, now you can have two copies of that corrupted database for a very competitive price.

    1. This post has been deleted by its author

    2. cpreston

      "Ah the joys of CDP, now you can have two copies of that corrupted database for a very competitive price."

      That's not how CDP works. True CDP can recover from any rolling disaster, including corruption. It does not appear that they were using CDP, or they would not have suffered this loss. They were apparently using near-CDP (AKA snapshots and replication), and had to go back to a snapshot from an hour ago.
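
      (To make the near-CDP vs. true-CDP distinction concrete, here is a toy contrast - hourly snapshots versus a journal of every write - with entirely invented data, not anything from Salesforce's actual setup.)

          # near-CDP: periodic snapshots, so you can only roll back to a snapshot.
          snapshots = {
              "17:00": {"orders": 250},
              "18:00": {"orders": 260},
          }

          # true CDP: every write is journalled, so you can roll back to any instant.
          write_journal = [
              ("17:30", ("orders", 255)),
              ("18:00", ("orders", 260)),
              ("18:20", ("orders", 263)),
              ("18:21", ("orders", None)),   # the corrupting write
          ]

          def near_cdp_restore(snaps, before):
              # Best available: the newest snapshot strictly older than `before`.
              usable = [t for t in snaps if t < before]
              return dict(snaps[max(usable)]) if usable else {}

          def true_cdp_restore(journal, before):
              # Replay every journalled write up to (but not including) `before`.
              db = {}
              for ts, (key, value) in journal:
                  if ts >= before:
                      break
                  db[key] = value
              return db

          print(near_cdp_restore(snapshots, "18:21"))      # {'orders': 260} - loses 18:00-18:21
          print(true_cdp_restore(write_journal, "18:21"))  # {'orders': 263} - loses nothing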

  6. channel extended

    DR tested?

    The problem to me is whether, and how often, they tested their DR systems/procedures. A bank that I worked for tested some systems monthly and a complete systems failure once a year. By complete I mean a nuclear-meltdown-of-the-mainframe level of failure, occurring in an instant. So how often do you check?

  7. joed

    NSA to the rescue

    100% disaster recovery coming soon, courtesy of good folks, the most trusted name in business

  8. Anonymous Coward
    Anonymous Coward

    Limited Disclosure?

    I never want to deal with a company that would make "limited disclosure" about my data.

  9. Anonymous Coward
    Anonymous Coward

    "...been unable to bring back either the primary database that was in trouble or the disaster-recovery instance backup that shadowed it..."

    What, no transaction journals and roll-forward from the last backup that they CAN ACTUALLY READ? I suppose I'm just too old to understand!!

    Oh dear.

  10. John Smith 19 Gold badge
    Unhappy

    So took a backup but didn't test it?

    Or (even bigger fail) backed up one live server to another?

    So it's like a mainframe (in that you don't have the data on site) but not like a mainframe in reliability and recoverability.

    Unimpressive

  11. ecofeco Silver badge
    FAIL

    I may be wrong

    But didn't Salesforce just partner with Oracle somewhat recently?

    I'm sure there's no connection. Just an odd coincidence, right?

  12. Herby

    Isn't the first rule of Databases to have...

    Transaction journals for maximum reliability?

    Seems to me that a nice journal of things would be close to mandatory if you were offering a "cloud" type service.

    Then again, who offers "reliable" services in the (public) "cloud"??

  13. Henry Wertz 1 Gold badge

    "Only one hour of lost data from a catastrophic db failure?"

    I think you mean "an *HOUR* of data lost?!?!?!" I mean, no, *I* wouldn't do any better; I have no high-priority, highly important databases, so I haven't configured anything as such. But, you know, IBM mainframes average an uptime of *30* years. No downtime, and they don't randomly lose data. They have redundancy and failsafes *that work*. These cloud technologies tend to still be immature and prone to bugs in comparison. As daft as it may sound, if I were going to provide some types of cloud service, such as database or e-mail, I might just spec out a mainframe and provide it with that; although mainframes are considered obsolete and stodgy, the types of bugs that down these cloud providers' services were worked out in the mainframe world 30 or 40 years ago.

    1. Nate Amsden

      not worth it for 99.99%

      Go spec out that mainframe and then watch 99.99% of the customers decline to use the service because it is too expensive.

      I had a company decide not to go forward with a DR project *after* they had recently suffered a disaster. The reason? They had another project they hadn't budgeted correctly for and needed the budget from the DR project (which they had just gotten approval to increase by 4x, after exhausting all alternatives to my proposal) to devote to this new project. I left shortly after; several years later the company still has no DR. Though I don't think they need it anymore - the company is spiraling down the drain.

      I was at another company where they too had a couple of disasters, and then signed up to a DR plan from a service provider which they knew from day 1 would *never* work (and they were paying something like $20k/mo for something they could never use). They did it just to show the customers that, yes, they had a DR plan.

      It really is sad to see such logic in action. I don't really have words to describe how I feel when I see that.

      Losing 1 hour's worth of data isn't a big deal to me in the grand scheme of things; it could be far, far worse. As another user noted, they pull their stuff out of SF every 15 mins.

      If your data is *that* vital then you're best off controlling it yourself - be prepared to pay for that though, it likely won't be cheap.

    2. Donald Becker

      You certainly don't mean an "uptime of *30* years"

      I'm pretty certain there isn't a system with an uptime of 30 years. That would be a system built in the early '80s which hasn't had a power outage.

      You might mean a predicted MTTF of 30 years. But, as we say, that calculates predicted failures, when most failures are unpredictable.

      Any system intended to be reliable will have a service schedule that precludes long uptimes. It's amusing to run a machine that has been up for several years. I've done it with Linux systems in the mid- to late-1990s. But when you really care about reliability you regularly shut down the machines to clean the dust out, replace the clock battery, check the UPS batteries, look for corrosion and popped caps, etc.

      1. Richard 12 Silver badge

        Re: You certainly don't mean an "uptime of *30* years"

        Not necessarily.

        A good design allows bits of the server(s) to be removed for inspection, repair and upgrade while the service is running, without ever taking the service as a whole down.

        E.g. dual-or-more PSUs: pull one, check it over, replace it, pull the other, etc. Same with critical software components.

        Mainframes are designed so you can do that, and you can do it with commodity-hardware as well.

        Costs a lot though.

        - And why would you even have a clock battery anyway? That's for keeping the clock going when the machine is off, and it never turns off!

        1. Anonymous Coward
          Thumb Up

          Re: You certainly don't mean an "uptime of *30* years"

          To Richard 12:

          "- And why would you even have a clock battery anyway? That's for keeping the clock going when the machine is off, and it never turns off!"

          You get an upvote for that. I'm adding that to my list of phrases... "Why do we need that, it never turns off/fails". :)

          But I would still add, check and replace a watch battery myself, as if it does fail/turn off I don't have to worry about setting the time and messing with clock bugs/conflicts when it drops off. Then again, I'm only prodding old boxes with a screwdriver - I haven't even tried servers yet. But one thing is for certain: if it can go wrong, I've seen it go wrong. (Like that time I mentioned to the cashier how the bank's nice new mini desktops looked nice but would probably overheat. Cue the next week and a closed till for "computer repairs"! :D)

      2. Arbee
        FAIL

        Re: You certainly don't mean an "uptime of *30* years"

        @Donald Becker

        I'm pretty sure there is - the HP/Tandem NonStop system that runs the ATMs for Britain's biggest building society hasn't had any downtime in 20 years, so I suspect that there are plenty of other systems that have been running for 30 years (I just happen to know about that one).

  14. damian fell

    So is an hour of data loss good or bad?

    No data loss is "good", but without knowing the RPO and RTO in Salesforce's SLAs with their customers (I'm not one of their customers) it's hard to say whether this was poor service or within the scope of expectations.

    If this was a high-availability service with a low or zero RTO/RPO then they've failed; if it had an RPO of greater than an hour and an RTO of less than 6 hours then it has probably met expectations (data loss of an hour and a recovery time of 5-and-a-bit hours). A quick sanity check of that arithmetic is sketched below.

    If, however, you'd chosen a cheap low-cost SLA to save money and the cost to your business is greater than the money you've saved by using Salesforce, you've probably made the wrong outsourcing decision...
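
    (The sketch below just encodes that comparison; the 4 h RPO / 12 h RTO pair matches figures an anonymous commenter quotes further down, while the tighter 15 min / 1 h pair is invented purely for contrast.)

        from datetime import timedelta

        def meets_sla(data_lost, downtime, rpo, rto):
            # An incident is within the SLA if no more data was lost than the RPO
            # allows and service was restored within the RTO.
            return data_lost <= rpo and downtime <= rto

        incident = dict(data_lost=timedelta(hours=1),
                        downtime=timedelta(hours=5, minutes=15))

        # Tight SLA (15 min RPO / 1 h RTO): the incident fails it.
        print(meets_sla(**incident, rpo=timedelta(minutes=15), rto=timedelta(hours=1)))  # False
        # Loose SLA (4 h RPO / 12 h RTO): the incident is within it.
        print(meets_sla(**incident, rpo=timedelta(hours=4), rto=timedelta(hours=12)))    # True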

    1. cpreston

      I'm a salesforce customer and I can tell you that there are no SLAs. None. Seriously. So I'd say that based on that, they handled this outage pretty well.

      And if you need to restore your salesforce data because YOU messed it up? That costs a minimum of $10K and takes weeks.

      1. John Smith 19 Gold badge
        Unhappy

        @cpreston.

        "I'm a salesforce customer and I can tell you that there are no SLAs. None. Seriously. So I'd say that based on that, they handled this outage pretty well."

        Really? I thought this was a charged-for service?

  15. Anonymous Coward
    Anonymous Coward

    An hour of transactions lost? That might be a disaster

    Put yourself in the place of a business that relies on Salesforce to track customer interactions and orders.

    You've just lost an hour's worth of orders.

    That seems like a minor issue from the outside -- it's just an hour of re-entering the orders. But that's not how you run such a business. You track all customer calls and record all sales only through Salesforce. You don't know what orders you've lost. And you now can't trust your customer interaction logs.

    I've seen contracts where the penalty for late delivery was 1%/day. It might take two weeks to configure and test the system, and up to a week for truck freight, leaving a bit over a week of slack on a 30-day delivery contract. If the customer doesn't complain about the lost order until a week after the due date, it could add up to three or four weeks of penalty fees.

    Just as bad, when a customer says that they have already complained about a problem, you can't trust the call log to know if that's true.

    1. cpreston

      Re: An hour of transactions lost? That might be a disaster

      We have a web app that talks directly to salesforce. While an hour of data would not mean millions of dollars, it would stink.

      But our app is designed to handle outages and a loss of some data on Salesforce's end. We would be able to replay what happened with our web app over the last hour. It's called designing for failure - a sketch of the idea follows below.

      When AWS had their big outage, there were customers who were down for a really long time -- and there were customers that experienced no downtime whatsoever. Why? Because they designed for failure. They had their systems running in multiple AWS zones and everything just moved when the one zone went down.

      People that think the cloud will solve all IT problems are nuts. People that think that the cloud is crap because it doesn't solve all IT problems are also nuts. Just saying.
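
      (A minimal sketch of that "replay from our own log" pattern, under the assumption of a local append-only event log and a stand-in upstream client - none of the names here are a real Salesforce or AWS API.)

          import json, time

          LOG_PATH = "outbound_events.jsonl"   # our own durable copy of everything sent upstream

          class FakeUpstream:
              # Stand-in for the real service client; it just prints what it receives.
              def push(self, event):
                  print("pushed:", event)

          def record_event(upstream, event):
              # Append to our own log first, then push to the provider.
              with open(LOG_PATH, "a") as log:
                  log.write(json.dumps({"ts": time.time(), "event": event}) + "\n")
              upstream.push(event)

          def replay_since(upstream, cutoff_ts):
              # After the provider restores an older backup, re-send everything we
              # recorded since their restore point.
              with open(LOG_PATH) as log:
                  for line in log:
                      entry = json.loads(line)
                      if entry["ts"] >= cutoff_ts:
                          upstream.push(entry["event"])

          svc = FakeUpstream()
          record_event(svc, {"type": "order", "id": 42})
          replay_since(svc, time.time() - 3600)   # replay the last hour's events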

      1. Pascal Monett Silver badge

        Re: "running in multiple AWS zones and everything just moved when the one zone went down"

        Interesting. Can Salesforce do that?

  16. Anonymous Coward
    Anonymous Coward

    So, does this count as the first actual data loss by a major cloud firm?

    Of a *paid* service that is, i.e. not including stuff like free web email.

    Any other documented examples?

  17. Anonymous Coward
    Anonymous Coward

    @cpreston

    You're either

    1) Talking intentional rubbish / trolling

    2) Too small a customer to know

    3) Never asked the right questions

    4) Playing semantics with Service Level Agreement vs Service Level Objective

    The RTO is 12 hours after declaration (i.e. add up to 4 hours of "assessment time" before the 12 hours kicks in).

    The RPO is 4 hours

    In this instance, they seem to be well within their obligations.

    As noted, redundancy and reliability cost, and cost a hell of a lot more the closer to perfect you try to go.

    SFDC offer a service at a price. This service includes:

    - Hosting from top tier, enterprise-class geographically diverse data centers.

    - Multiple levels of redundancies in the production systems

    - Each data center providing production services and a set of disaster recovery services for other production data centers (You back me up and I'll back you up)

    - Data replication provided in "near real time" between production and DR services. In reality this is copied at the block level, hence some time lag and potential loss of data if you need to fall back to the offsite DR services.

    - A minimum of 4 copies of production data

    - Practiced failovers to recovery sites

    Crap that it happened, but then crap happens in life. If you think you could design a more robust system, go ahead and see what it costs. If 1 hour of lost data is worth tens of millions to you, then maybe you should be looking elsewhere.

    It's a service; you decide what level you need... or alternatively, "you pays your money, you takes your choice".

    1. John Smith 19 Gold badge
      Meh

      AC@10:01

      < long and interesting description of Salesforce's alleged practices >

      "- Hosting from top tier, enterprise-class geographically diverse data centers.

      Only not quite dispersed enough it would seem, eh Mr AC?

      For the record, I worked with a live/warm backup system that was drip-fed off the live data. Estimated (and tested) time to bring back service was 15 minutes.

      But thank you for those comments from the Salesforce marketing department.

  18. Ian Moyse

    Options to Salesforce

    a) proves that bigger and a big brand is not always better

    b) that not all cloud vendors are born equal

    c) that asking if a vendor has SLAs is a good start - Salesforce as standard does not provide SLAs

    Check alternative CRM solutions to Salesforce at this independent site www.g2crowd.com

    Ian Moyse

    Workbooks

  19. Anonymous Coward
    Anonymous Coward

    Not tested backups?

    If their backups and backup procedures were up to scratch they would not have lost so much data. Even with a good backup solution in place, they need proper testing procedures and need to test regularly, simulating failures to ensure their backups don't fail this badly. I wonder what backup solution they were using.
