Microsoft's Azure Kubernetes Service mucked my cluster!

Microsoft's Azure Kubernetes Service (AKS) was launched to world+dog in June, but a few disgruntled customers say the managed container confection isn't fully baked yet. In a blog post published on Monday, Prashant Deva, creator of an app and infrastructure monitoring service called DripStat, savaged AKS, calling it "an …

  1. Pascal Monett Silver badge

    "the customer’s workloads had been overscheduled"

    And how exactly is that possible if the interface does not allow for it?

    Shouldn't there be a warning, a popup with a message, and a limitation of the scheduling abilities?

    Because if Microsoft's team can determine after the fact that that was the problem, then it's not the customer's fault. You open a platform to anyone and anyone will come in; I'd have thought that Microsoft would know this by now. So the onus is on Microsoft to ensure that Joe Anybody cannot put himself in a bad situation in the first place.

    And having helpdesk people blame the client to his face is never a good look, even if it is true.

    1. Martin M

      Re: "the customer’s workloads had been overscheduled"

      Most deployments to Kubernetes aren't via a web UI; they're via the standard Kubernetes command-line tools. If you're using those, you probably aren't (or shouldn't be) the type of admin who depends on point-and-drool handholding. As for preventing people from getting into trouble - well, I've never met a technology that can stop a determined idiot.

      Regarding limiting over-scheduling: it can absolutely be a valid user decision. Particularly in non-prod environments, where you may burn a lot of money if you don't contend the workloads, and probably don't care too much about very occasional problems when everything gets busy at the same time.

      If the user tried to deploy to production without using the very rich set of primitives Kubernetes has for controlling scheduling (see the sketch below), I'd definitely say they bear a significant portion of the responsibility. It's like massively overcommitting a VMware cluster. RTFM, know your workloads, and test properly in a prod-like environment.
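
      Those primitives start with per-container resource requests and limits. A minimal, hypothetical sketch (names and numbers invented for illustration) of the sort of thing that was apparently skipped:

      ```yaml
      # Hypothetical Deployment showing the two basic scheduling knobs:
      # requests tell the scheduler how much to reserve for each pod;
      # limits are enforced at runtime (CPU throttling, OOM kill for memory).
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: example-app              # made-up name
      spec:
        replicas: 3
        selector:
          matchLabels:
            app: example-app
        template:
          metadata:
            labels:
              app: example-app
          spec:
            containers:
            - name: web
              image: example/web:1.0   # placeholder image
              resources:
                requests:              # pod only lands on a node with this
                  cpu: "250m"          # much unreserved capacity
                  memory: "256Mi"
                limits:                # hard per-container ceiling
                  cpu: "500m"
                  memory: "512Mi"
      ```

      Without requests, the scheduler treats every pod as effectively free and will happily pack far more onto a node than it can serve - which is exactly what over-scheduling looks like.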

      What I do think was bad was that the user’s poor decision was allowed to affect the system level services. This would have made it difficult for them to debug themselves in a managed cluster, and it shouldn’t have taken a day’s debugging by the Azure team to locate this fairly basic problem. That bit Microsoft should definitely shoulder the blame for. Still, at least they’ve fixed it (according to the HN thread).
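
      For what it's worth, a managed platform has standard tools for keeping user workloads away from system capacity - for instance, capping what a user namespace can consume in aggregate. A sketch only (hypothetical namespace and numbers; I'm not claiming this is the fix Microsoft shipped):

      ```yaml
      # Cap total resource consumption in a user namespace with a
      # ResourceQuota so system components keep headroom. A quota on
      # CPU/memory also forces pods in the namespace to declare
      # requests/limits, or they are rejected at admission time.
      apiVersion: v1
      kind: ResourceQuota
      metadata:
        name: user-workload-quota
        namespace: user-apps         # hypothetical user namespace
      spec:
        hard:
          requests.cpu: "8"          # sum of all pod CPU requests
          requests.memory: 16Gi      # sum of all pod memory requests
          limits.cpu: "12"
          limits.memory: 24Gi
      ```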

    2. Martin M

      Re: "the customer’s workloads had been overscheduled"

      Oh - and just because the forensics team can determine that the cause of the fire was that you had 10 three-bar heaters plugged into the same gang plug overnight, that doesn’t make it their fault.

  2. Anonymous Coward

    No Surprise

    For people who have worked with M$ products for a while, this should come as no surprise. For years, sysadmins would wait for Service Pack 2 to be released before even thinking about upgrading to new products. Microsoft has always released products before they are ready, and Azure is no different.

    M$ always have a great marketing story, but it takes years for their products to catch up.

  3. Martin hepworth

    blame

    "The worst part is them trying to blame the user for issues on their end."

    This is a common issue with Azure support - "your app isn't cloud ready; it's not our fault the underlying OS/hardware failed and took a bunch of data with it..." is a common retort from MS support.

    1. Martin M

      Re: blame

      I have some sympathy with them here. If your application depends on expensive high-availability servers and storage, with every cluster node in the same rack and connected to the same ToR switch pair, you should not deploy it to the cloud.

      It has to be able to cope with an unplanned node failure and recover swiftly and in an automated fashion. It has to be able to cope with transient network connectivity problems, including partitions, one-way packet loss, variable latency, etc. Ideally, it needs to be capable of distribution across multiple availability zones or even regions, as failures at those levels are not unknown.
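
      The zone part of that, at least, is cheap to express in Kubernetes. A hypothetical sketch (names invented; the exact zone label key varies with cluster version) using pod anti-affinity to spread replicas across availability zones:

      ```yaml
      # Run several replicas and prefer that no two land in the same
      # zone, so a single rack or zone outage can't take them all out.
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: resilient-app          # made-up name
      spec:
        replicas: 6
        selector:
          matchLabels:
            app: resilient-app
        template:
          metadata:
            labels:
              app: resilient-app
          spec:
            affinity:
              podAntiAffinity:
                preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  podAffinityTerm:
                    labelSelector:
                      matchLabels:
                        app: resilient-app
                    topologyKey: topology.kubernetes.io/zone
            containers:
            - name: app
              image: example/app:1.0   # placeholder image
      ```

      None of which helps if the application itself can't survive losing a replica - that part you still have to design and test for.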

      What I do not like is the marketing from most major cloud vendors saying that you can migrate your entire legacy data centre to the cloud. But if you believe every bit of marketing you read, then you’re being naive.

      1. elip

        Re: blame

        "It has to be able to cope with an unplanned node failure and recover swiftly and in an automated fashion. It has to be able to cope with transient network connectivity problems including partitions, one way packet loss, variable latency etc. Ideally, it needs to be capable of distribution across multiple availability zones or even regions, as failures at these levels are not unknown."

        ^^^ Show me such an app. I've been doing this for many years, and I haven't seen it.

        Azure is a pile of unfinished garbage (like most software these days seems to be); take it from someone who's migrated multiple data centers to it (including HPC workloads) and was not in the least bit impressed by any single part of the experience.

    2. DrBed

      Re: blame

      "The worst part is them trying to blame the user for issues on their end."

      This is a common issue with Azure support - "your app isn't cloud ready; it's not our fault the underlying OS/hardware failed and took a bunch of data with it..." is a common retort from MS support.

      "advice not to use excessive memory and CPU resources" :D :D :D Microsoft (tm)

      ^ "Please, don't use your hardware at all, it will guarantee you the best cloud experience"

  4. SouthernMonkeyMan

    User at fault

    The end user has to shoulder some of the blame here. If the platform is as unstable as he says, then surely he'd have picked that up in testing before moving his entire production workload to AKS? Yes, the support teams could have been more helpful by the sounds of it, but he should have tested the stability of the service and discovered this before it reached production.

  5. asdf

    Totally irrelevant but ...

    Dumpster fires are beyond nasty, no doubt, but Microsoft deserves more of a tire fire picture, IMO. Dumpster fires don't burn for months or even years.

  6. Disgruntled of TW

    Production on an alpha service ...

    ... he should have known better. By his own admission he regarded AKS as "alpha released to GA"; if he promoted to prod anyway, his CI/CD process is letting him down.
