Google BigQuery TITSUP caused by failure to scale-yer workloads

A four-hour outage of Google's BigQuery streaming service has taught the cloud aspirant two harsh lessons: its cloud doesn't always scale as well as it would like, and it needs to explain itself better during outages. The Alphabet subsidiary's trouble started last Tuesday when a surge in demand for the BigQuery authorization …

  1. Anonymous Coward
    Anonymous Coward

    Google's service outage notifications are woeful PR fluff

    "There is an issue with X affecting only 0.00Y% of customers."

    "The issue is resolved for most users of X."

    "The issue with X is resolved, we value your business, we hope you have warm fuzzy feelings, we strive to be awesome."

    That is NOT communication with customers about an outage; it's PR BS on par with "due to higher than normal call volumes" and "your call is important to us".

    I work for a large multinational where, every time there's an outage, we put out a business outage notification through the technical operations centre, and the lead team needs to write up an "RFO" (Reason-For-Outage). Regular notifications go out to stakeholders as the outage happens: what the symptoms are, what the cause is determined to be, when and how it's mitigated, and when and how it's resolved. It's then all written up, chapter and verse, in the RFO and made available to stakeholders.

    The whole "oops, but it's only affecting 0.000x% of people" thing is PR BS on par with "95% fat free" (5% fat, people; do the math). At cloud provider scale, that could still be many, many, many people and businesses (quick sums below).

    I accept problems happen, but geez, when they happen, enough of the fluffy "your call is important" crap. Detail please.
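
    A quick back-of-the-envelope sketch of that "only 0.000x%" point; the customer count below is hypothetical, purely to illustrate the scale, not a Google figure:

      # Back-of-the-envelope: "only 0.001% of customers" at cloud-provider scale.
      # The customer count is hypothetical, purely to illustrate the point.
      customers = 10_000_000               # hypothetical customer base
      affected = customers * 0.001 / 100   # "0.001%" from a typical status update
      print(int(affected))                 # 100 businesses, each with their own end users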

    1. Anonymous Coward
      Anonymous Coward

      Re: Google's service outage notifications are woeful PR fluff

      Here are the _actual_ updates they gave out during the outage:

      https://groups.google.com/forum/#!topic/bigquery-downtime-notify/We-PRncjM4U

      I think this (which I received by email as it happened) is actually pretty decent, in that it told us enough to know the impact on our service.

  2. Anonymous Coward
    Coat

    …zipped lips gave users the … heebie jeebies

    I'll bet it did… surely they can use something more modern than that DOS relic ZIP like xz or something… Ohh you mean that kind of "zipped"… never mind.

  3. Anonymous Coward
    Anonymous Coward

    umm

    One of the supposed characteristics of public cloud is elasticity and that relates to capacity on demand. If they have an architecture that hits the capacity wall easily, it isn't much of a public cloud. This tells me Google isn't in the big leagues like Amazon or Microsoft or even IBM. Sorry Diane, but enterprise cloud my ass.

    1. Adam 52 Silver badge

      Re: umm

      You got down-voted, but I'd tend to agree, for different reasons. The whole arrangement, but especially the commercial and support side, just isn't in the same league as AWS.

      This incident highlights the support side, and you only have to look at the Cloud Service agreement to realise that they just aren't mature enough to work in the Enterprise space (e.g. "if we lose all your customer data we'll pay you $5 compensation", or "this contract is with Google Ireland but under US jurisdiction").

  4. Anonymous Coward
    Anonymous Coward

    The Cloud...

    Other people's computers you have no control over.

    1. Anonymous Coward
      Anonymous Coward

      Re: Other people's computers you have no control over.

      No control? Not full control, obviously, but if I can spin up an instance of a server and configure it in Google's cloud that counts as having some control.

  5. Pascal Monett Silver badge
    Trollface

    "the premise of cloud is that it will just scale as demand increases"

    Cloud theory is like military strategy: as soon as the battle starts, you can throw the plans out the window.

  6. monty75

    If only there were some kind of documentation on how to avoid causing yourself an embarrassing DoS: http://www.theregister.co.uk/2016/11/10/how_to_avoid_ddosing_yourself/
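
    The usual advice in that vein comes down to retrying with capped exponential backoff plus jitter rather than hammering the endpoint. A minimal sketch, assuming any flaky zero-argument callable rather than a real BigQuery client:

      import random
      import time

      def call_with_backoff(request, max_attempts=5, base_delay=1.0, max_delay=60.0):
          # Retry a flaky call with capped exponential backoff plus full jitter.
          # `request` is any zero-argument callable that raises on failure;
          # the defaults here are illustrative, not tuned values.
          for attempt in range(max_attempts):
              try:
                  return request()
              except Exception:
                  if attempt == max_attempts - 1:
                      raise  # out of retries, surface the error
                  # 1s, 2s, 4s, ... capped at max_delay, randomised so that
                  # thousands of retrying clients don't all stampede in sync.
                  delay = min(max_delay, base_delay * (2 ** attempt))
                  time.sleep(random.uniform(0, delay))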

  7. yoganmahew

    All hands in the way...

    The all hands on deck response is part of the problem with MTTR. 200 people on a call, running in different (or no) directions. What happened to expertise in operations? Oh yeah, it got outsourced and now requires senior VP approval before anything can be fixed...

    1. Dave Pickles

      Re: All hands in the way...

      In my BOFH days, when there was an outage the least useful PFY du jour would be tasked with answering the phones and keeping visitors out of the way.

  8. batfastad

    60% of the time

    ... it works every time!
