Dropbox unplugged its own datacenter – and things went better than expected

If you're unsure how resilient your organization is to a disaster, there's a simple way to find out: unplug one of your datacenters from the internet and see what happens. That's what Dropbox did in November, though with a bit more forethought. It had been planning to take the San Jose datacenter (its largest) offline for some …

  1. John Doe 12
    Joke

    Let Me Just Say...

    Worst episode of "On Call" ever!!! Nothing went wrong and everyone was happy ha ha

    1. Paul Herber Silver badge

      Re: Let Me Just Say...

      Ha, BOFH's plan has worked. Smug mode never lasts for long ... I can't wait for the next stage, and I'm sure the timing will be perfect.

  2. Howard Sway Silver badge

    there's a simple way to find out: unplug one of your datacenters from the internet

    Or even simpler: use one of the major cloud providers such as AWS or Azure, and they'll run the test for you at irregular intervals.

    I was involved in running the same sort of test 20 years ago when working at a bank that had a complete failover datacentre. We only tested one network zone in the datacentre, but it worked, although it, too, had been expensively designed from the ground up to do so.

    1. Anonymous Coward
      Anonymous Coward

      Re: there's a simple way to find out: unplug one of your datacenters from the internet

      And charge you massively for it. There are lots of companies that are moving off cloud providers due to the costs involved.

      Cloud systems may be simpler and allow you to spin up new instances quickly, but that doesn't necessarily mean they're cost-effective. If you're small then maybe it's not worth having your own infrastructure and team, but at some scale it becomes much cheaper to run it yourself. Or go hybrid, where your own DCs handle the base load and you expand into the cloud for peaks, as long as you do it right. Dropbox is easily large enough to hire the right people to do it right.

  3. Anonymous Coward
    Anonymous Coward

    "According to the company, RTOs were reduced from eight to nine minutes down to four or five."

    That's not reducing time by an order of magnitude, Dropbox. Cutting 8-9 minutes down to 40-50 seconds would be an order of magnitude.

    Still, props to the crew for A) making the effort to build redundancy into their product, B) actually testing to see whether it worked, and C) improving when the first tests revealed weaknesses.

    Enjoy it while it lasts, I'm sure new manglement will show up soon and strip out fluff like "resiliency" and "engineers".
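
    A quick back-of-the-envelope check of that in Python (the 8-9 and 4-5 minute figures are the ones quoted in the article; the midpoints are just assumed for the sake of the sums):

    ```python
    # Rough check of the "order of magnitude" claim using the figures quoted in the article.
    old_rto_s = 8.5 * 60    # midpoint of the quoted 8-9 minute RTO, in seconds
    new_rto_s = 4.5 * 60    # midpoint of the quoted 4-5 minute RTO, in seconds

    print(f"Actual improvement: {old_rto_s / new_rto_s:.1f}x")             # ~1.9x, i.e. roughly a halving
    print(f"A genuine 10x cut would land at ~{old_rto_s / 10:.0f} seconds")  # ~51 seconds, i.e. the 40-50s range above
    ```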

    1. Anonymous Coward
      Anonymous Coward

      Well, it went from 23 minutes down to 22 minutes. That's like an order of magnitude, right? ;)

  4. paddy carroll 1

    Paper resilience

    Better than the pretend DR & failover tests I have witnessed in large enterprises.

    The customer's rep gets taken to dinner; some smoke, mirrors and creative reporting later, a huge success is reported.

    It was actually an abysmal failure.

  5. AlanSh

    I had that with an MS Exchange system many years ago - but by accident.

    I'd set up a clustered pair of Exchange servers. The customer wanted it live "NOW", so I made it live. Came in the next day and one of the servers had gone dead (or 'popped its clogs', as we say up here). The customer hadn't actually noticed, and emails were still coming and going. So I fixed the issue (a dead power supply), restarted it and then proclaimed we'd had a successful failover test. Customer was very happy.

    Alan

  6. matjaggard

    Is that all?

    Banks have to do essentially this every few months to satisfy the regulators.

    1. LateAgain

      Re: Is that all?

      What?

      They close a random branch at 5pm :-)

      [Anonymous because I don't want to even think about the stupidity of outsourcing your core business]

      [[that's the IT stuff]]

    2. Anonymous Coward
      Anonymous Coward

      Re: Is that all?

      Yeah, not a chance they actually do it. Least at our place, it's been kicked down the road several times.

      Anon for reasons

  7. Pete 2 Silver badge

    Corporate objective?

    > "Given San Jose's proximity to the San Andreas Fault, it was critical we ensured an earthquake wouldn't take Dropbox offline,"

    It sounds like they are planning that if / when northern California collapses into the ocean, nobody outside the region should notice.

    Although since their HQ is also in San Francisco, that might be a little optimistic.

    1. Norman Nescio Silver badge

      Re: Corporate objective?

      None of the denizens of the C-suite will be at HQ - they will all be remote-working from ski-cabins in Colorado, beach 'huts' in Hawai'i and the like.

      In fact, this gives an excellent justification for remote working - if you cluster all your essential people in a single building, all it takes is a comic-book meteorite to wipe them all out in one swell FOOP! (Roy Lichtenstein eat your heart out.) and your business is gone. If everyone is working from home, then your business continuity problems are far smaller.

      (Off to write business case for permanent home-working for DR/BCP purposes.)

  8. hayzoos

    The only way to do it

    At an earlier job supporting maintenance at healthcare facilities, I saw the same thing first-hand with the power at one of the facilities we managed. The maintenance manager was considered a renegade: he tested his facility's power backup systems by shutting off the mains, regularly, once a month, at a random time and day.

    He had one of the best-maintained facilities of all the ones we managed. Some said he was nuts, risking operations and ICUs with life-support needs. He said, "Does it matter if the power loss is an act of god or of my doing?" He also said, "I do my best to make sure the systems will perform as planned." There was an issue once during one of his tests; he was able to revert to the mains when things went south.

    I had done something similar in an earlier programming job on an Apple II system. It was so resilient that you could pull the plug in the middle of data entry and data loss was limited to the field being entered. Pulling the plug during acceptance testing was how I answered the question of how it responds to a power loss.
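
    That field-at-a-time durability amounts to committing every completed field to storage before moving on, so a power cut can only ever lose the field currently being typed. A minimal sketch of the idea in Python (nothing like the original Apple II code; the file layout and function names here are invented for illustration):

    ```python
    import json
    import os

    def commit_field(record_path: str, field: str, value: str) -> None:
        """Append one completed field to durable storage and flush it to disk,
        so a power cut can only lose the field currently being typed."""
        with open(record_path, "a") as log:
            log.write(json.dumps({"field": field, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())   # push the write out of OS buffers onto the disk

    def recover(record_path: str) -> dict:
        """Rebuild the record from whatever fields made it to disk before the plug was pulled."""
        record = {}
        if os.path.exists(record_path):
            with open(record_path) as log:
                for line in log:
                    try:
                        entry = json.loads(line)
                        record[entry["field"]] = entry["value"]
                    except json.JSONDecodeError:
                        pass   # a torn final line is the in-progress field; drop it
        return record
    ```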

  9. Joseba4242

    Objective

    Looking at this as reducing the RTO from 8 minutes to 4 minutes is a bit of a dangerous angle.

    Imagine there was an earthquake that took out their SJ datacenter, and all services recovered with no data lost 8 minutes later. No one would complain; that would be celebrated as a success.

    Imagine, on the other hand, a dirty failure that introduces instabilities, causes replication issues and hence loses data. Technically the service might have recovered in 4 minutes, but the impact and fallout would be massive.

    Crucially, their test was a planned failover. They FIRST drained all traffic away and THEN failed over. While this by itself is an impressive achievement, reality often doesn't allow you that luxury, and that's where things can go really pear-shaped.
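
    For reference, that drain-first sequencing is roughly what any planned primary/standby failover looks like. A sketch of the general shape (the primary/standby objects and their methods are hypothetical stand-ins, not Dropbox's actual tooling):

    ```python
    import time

    def planned_failover(primary, standby, drain_timeout_s: int = 300):
        """Planned failover: stop new traffic first, let replication catch up,
        then promote the standby. An unplanned failure skips the first two steps,
        and any replication lag at that moment becomes lost data."""
        primary.stop_accepting_traffic()          # drain: no new writes land on the old primary

        deadline = time.time() + drain_timeout_s
        while standby.replication_lag() > 0:      # wait until the standby has every acknowledged write
            if time.time() > deadline:
                raise RuntimeError("standby never caught up; aborting planned failover")
            time.sleep(1)

        standby.promote()                         # safe to promote: nothing left to lose
        return standby
    ```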
