Google Cloud's US-East load balancers are lousy with latency

Google Cloud is having a nasty not-quite outage in its US-East region. At the thoroughly indecent time of 05:56 AM on Sunday, Pacific time, Google reported an incident it called "Cloud L2 overload in us-east4 causing harm in us-east4 & nearby regions" and ascribed the cause to "an issue with Cloud Load Balancing." Google …

  1. that one in the corner Silver badge

    make a hasty move to another region

    I'm being horribly naive, but isn't one of the selling points of all this Cloudiness the ease, almost trivial ease, of creating/starting/stopping/migrating all these machines and their data?

    Aren't you supposed to be able to pretty much select a location on a dropdown, click Ok and sit back whilst the Mighty Machinery Leaps to Your Bidding and seamlessly migrates the instances, finishing with a flourish by directing traffic to the new location?

    And if any of that is true, at what point do SLAs indicate that Google should do that move automatically to work around issues in their own systems?

    1. v13

      Re: make a hasty move to another region

      Not exactly. You can indeed easily create the infrastructure elsewhere. Orders of magnitude easier than having to lease a colo and put servers in yourself. But you still need to set up your stuff. Many already do that and it's standard practice. But those that operate from a single region may have trouble doing it, because they haven't designed their systems that way.

      Google, Amazon and Microsoft won't move your stuff because that's a very sensitive thing to do with legal complications. They do offer some global services where you don't depend on a single location but region-specific services need to be replicated by the user.

      Those that have set up their infrastructure in a redundant way can stop using these regions for a day or two and just scale up their alternate locations.

    2. Doctor Syntax Silver badge

      Re: make a hasty move to another region

      "one of the selling points"

      You know what they say about being able to tell when a salesman's lying. His lips move.

      1. Jellied Eel Silver badge

        Re: make a hasty move to another region

        You know what they say about being able to tell when a salesman's lying. His lips move.

        And they'll move faster. It's the standard drug dealer's approach to flogging cloudybollocks. Yes, you can do all that fancy stuff you saw in the marketing materials. But additional charges apply. Especially when marketing realise that everyone knows customers should always read the small print, and their print can be very small. Or that the average human reacts to something in around 300ms. So blink and you may miss the small print. But it's there.

    3. Nate Amsden

      Re: make a hasty move to another region

      That's not really how the big IaaS clouds work. The customer is responsible for moving data, configs, servers etc. if they really want to move to another zone or region. This can be a significant amount of work unless you prepared for such a situation in advance. You can certainly choose whichever zone or region you want when provisioning a resource (and be aware of any excess costs from cross-zone/region data transfers), but "vMotion-like" moving is not possible with the standard stuff (by design).

      I haven't checked recently, but at one point Amazon's SLA was such that they didn't consider it a breach unless you were unable to spin up resources in another zone in the region. So if you lose a dozen systems in zone A (due to, say, power loss or any reason really) but can build new systems in zone B, they didn't consider that an SLA violation.

    4. two00lbwaster

      Re: make a hasty move to another region

      Moving infrastructure might be easy. Google's forwarding rules come in two types: regional forwarding rules and global forwarding rules (anycast). With the latter it would be reasonably easy to set up a new backend for the LB in another region; that's what they're designed for, serving from the region closest to the client. But your persistent data is now going to be in another region, incurring latency and financial penalties, or you need to move your persistent data too and incur downtime.

      With the former, regional forwarding rules, you're screwed: you'll need to build another LB with another IP address and update your DNS, on top of the practical issues already mentioned.
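
      For the curious, the difference looks roughly like this in Terraform's Google provider. Purely an illustrative sketch: the target proxy and backend service are assumed to already exist and are passed in as variables, and the names and region are made up.

      ```hcl
      # Global (anycast) rule: one IP, backends can sit in any region behind the proxy.
      variable "http_proxy_self_link" {
        type        = string
        description = "Self link of an existing target HTTP proxy (assumed)"
      }

      resource "google_compute_global_forwarding_rule" "web" {
        name       = "web-global"
        target     = var.http_proxy_self_link
        port_range = "80"
      }

      # Regional rule: the rule, its IP and its backends all live in one region.
      variable "regional_backend_self_link" {
        type        = string
        description = "Self link of an existing regional backend service (assumed)"
      }

      resource "google_compute_forwarding_rule" "web_regional" {
        name                  = "web-us-east4"
        region                = "us-east4"
        backend_service       = var.regional_backend_self_link
        load_balancing_scheme = "EXTERNAL"
        port_range            = "80"
      }
      ```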

    5. Zippy´s Sausage Factory
      Devil

      Re: make a hasty move to another region

      You're forgetting the main reason for cloud: to create a revenue stream, billed monthly, where captive users can be subjected to regular, above-inflation price increases and are effectively locked in to one vendor for all time.

      Oh, you mean you wanted a reason for the customers to use it? Er, because all the cool c-suite executives are doing it?

    6. Claptrap314 Silver badge

      Re: make a hasty move to another region

      I learned SRE at G in 2015-16, supporting (primarily) Hangouts. This overlapped the rollout of GCP; I was not involved with that effort at all.

      We had a number of rules that were inviolable. Rule #1 on that list: the minimum number is three. Three DCs (with non-overlapping maintenance schedules), with at least three servers in each, or we would not talk to you (you being an internal G team).

      If you want resilience, you MUST be able to handle simultaneous scheduled & unscheduled outages, both at the DC level & at the level of the individual servers.

      This is NOT cheap. SRE can tell you how to do it without exploding your cost, however.

      Set up this way, and you can simulate scenarios like this one as training. (We preferred Tuesdays.)
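
      Not Google's internal tooling, obviously, but as a hedged public-cloud sketch the "minimum three" rule comes out something like this in Terraform; the zones, machine type and image are arbitrary assumptions.

      ```hcl
      # Three zones in three different regions, three identical servers in each
      # (nine in total). Everything here is illustrative, not anything G-internal.
      variable "zones" {
        type    = list(string)
        default = ["us-east4-a", "us-central1-b", "us-west1-c"]
      }

      locals {
        replicas = {
          for pair in setproduct(var.zones, range(3)) :
          "${pair[0]}-${pair[1]}" => pair[0]
        }
      }

      resource "google_compute_instance" "replica" {
        for_each     = local.replicas
        name         = "svc-${each.key}"
        zone         = each.value
        machine_type = "e2-small"

        boot_disk {
          initialize_params {
            image = "debian-cloud/debian-12"
          }
        }

        network_interface {
          network = "default"
        }
      }
      ```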

  2. Anonymous Coward
    Anonymous Coward

    Isn't that where it rains right now?

    Might explain why its clouds have problems :)

  3. Anonymous Coward
    Anonymous Coward

    I'm being horribly naive, but isn't one of the selling points of all this Cloudiness the ease, almost trivial ease, of creating/starting/stopping/migrating all these machines and their data?

    Ah yes, but that's what the Marketing says. The reality is that it's simply someone else's computer and you're stuck with whatever they actually do.

  4. forbiddenera

    Wtf

    I don't think it's just marketing; it shouldn't be too hard for anyone competent enough to be running the infrastructure in the first place.

    Literally moving to another region should be no harder than cloning your IaC, changing the region variable and running a terraform apply or something. I only even use cloud providers' dashboards during development and design of infra, to verify and check things and occasionally test something before it gets put into tf files. If I had to do it all through their console, then yeah, it'd be frustrating and maybe take a few hours. But using Terraform or similar, it's one command away at worst, and at best, if you've designed the resiliency well, you don't have to do squat unless multiple providers die.
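
    For what it's worth, the "one variable away" shape looks roughly like this as a minimal sketch; the project ID, names and CIDR are placeholders, and it only rebuilds the stateless regional plumbing, not your data.

    ```hcl
    variable "region" {
      type    = string
      default = "us-east4"
    }

    provider "google" {
      project = "my-project-id" # placeholder
      region  = var.region
    }

    # One example regional resource; real stacks obviously have many more.
    resource "google_compute_subnetwork" "app" {
      name          = "app-${var.region}"
      region        = var.region
      network       = "default"
      ip_cidr_range = "10.10.0.0/24"
    }
    ```

    Rebuilding elsewhere is then `terraform apply -var='region=us-central1'`, plus whatever data replication you should already have sorted.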

    Sure, the IaC can be a bit of work to get going in the first place, but the results will save you enough time that you'll never regret it. Plus you know what you deploy is exactly how you wanted it: no clicking the wrong things, adding the wrong role, or racing through a dashboard trying to deploy things quickly because things are broken.

    If it's harder than that, then you fail at infra. In fact, unless you have strict budgetary or other concerns keeping you in a specific region, you've already failed. As someone mentioned above, minimum triple redundancy. Ideally with anycast IP, so your floating addresses aren't stuck in a dead location and you're not stuck relying on DNS changes with excruciatingly long TTLs and propagation. Better still is to use multiple providers, and maybe even keep an edge provider (e.g. Cloudflare) at least in a ready state, if not fully proxying. Last year AWS had a huge outage in Canada which took out all AZs in the region. It affected many huge companies here, with half of Canada's debit card system down for something like 80% of the country, and it persisted for, IIRC, well over 16 hours. GCP in the same region was totally fine, though, so multi-provider is worth considering, even if only in an active-passive failover configuration where nodes spin up automatically if the other provider is offline or high latency, etc.
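
    The DNS-layer half of that active-passive setup can be sketched like this, using Route 53 failover records purely as one example of a health-checked DNS service; the zone ID, hostname and IPs are placeholders, and the spin-up-on-failure part lives elsewhere.

    ```hcl
    variable "zone_id"      { type = string }
    variable "primary_ip"   { type = string } # e.g. the GCP front end
    variable "secondary_ip" { type = string } # e.g. the other provider's front end

    resource "aws_route53_health_check" "primary" {
      ip_address        = var.primary_ip
      port              = 443
      type              = "HTTPS"
      resource_path     = "/healthz"
      failure_threshold = 3
      request_interval  = 30
    }

    resource "aws_route53_record" "primary" {
      zone_id         = var.zone_id
      name            = "app.example.com"
      type            = "A"
      ttl             = 60 # short TTL so failover isn't stuck behind caching
      records         = [var.primary_ip]
      set_identifier  = "primary"
      health_check_id = aws_route53_health_check.primary.id

      failover_routing_policy {
        type = "PRIMARY"
      }
    }

    resource "aws_route53_record" "secondary" {
      zone_id        = var.zone_id
      name           = "app.example.com"
      type           = "A"
      ttl            = 60
      records        = [var.secondary_ip]
      set_identifier = "secondary"

      failover_routing_policy {
        type = "SECONDARY"
      }
    }
    ```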

    The longest part of a deploy or redeploy for me is waiting on AWS's or other providers' APIs to take their sweet time with certain resources.

    As a person from Canadia: all the big cloud providers currently only have ONE region here, and we have to keep all data in Canada. AWS is building a western region in Calgary, but tbh I'm shocked that AWS, GCP, Azure, IBM etc. don't have anything in or near Vancouver.
