Companies flush money down the drain with overfed Kubernetes cloud clusters

Cloud optimization biz CAST AI says that companies are still overprovisioning resources and paying too much as a consequence. It claims that in Kubernetes clusters of 50 or more CPUs, only 13 percent of provisioned CPUs and 20 percent of memory are typically utilized. CAST develops a platform to monitor use of Kubernetes …
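
(A rough sketch, with invented numbers, of how a cluster-wide utilization figure like that is typically derived: total resources actually consumed divided by total resources provisioned.)

```python
# Illustration only: how "13 percent of provisioned CPUs / 20 percent of
# memory utilized" can be computed. All values below are made up.

provisioned_cpus = 64            # CPUs allocated across the cluster's nodes
provisioned_memory_gib = 256     # memory allocated across the cluster's nodes

used_cpus = 8.3                  # average CPU actually consumed by workloads
used_memory_gib = 51.0           # average memory actually consumed by workloads

print(f"CPU utilization:    {used_cpus / provisioned_cpus:.0%}")              # ~13%
print(f"Memory utilization: {used_memory_gib / provisioned_memory_gib:.0%}")  # ~20%
```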

  1. elsergiovolador Silver badge

    Time

    Turns out it takes time to scale up when there is a surge of traffic, so companies prefer to overprovision.

    Usually those surges of traffic never happen though.

    I hope someone sees the irony here.

  2. Pascal Monett Silver badge

    "analysis of more than 4,000 clusters [..] prior to optimization"

    Sure. It's called hindsight.

    Once the project is rolling in production, it's easy to know what resources you need after a while.

    It's a lot more difficult to forecast what you need before the project is started, especially when you have no experience managing projects in the cloud.

    Companies will adjust their resources soon enough - the beancounters will see to that.

    1. two00lbwaster

      Re: "analysis of more than 4,000 clusters [..] prior to optimization"

      Some companies just don't look into this stuff. Their job fails with an OOM error, so they double the memory rather than spending dev time sorting out the code. I've seen pods allocated tens of GBs of memory that use MBs to single-digit GBs for most of their run, and only hit tens of GBs for a few seconds at the end of the job, and only for some jobs. That leads to huge instances with one or two of these pods on them, yet only a few GBs of RAM actually in use.

      I saw the same with requests, but I got a change through to remove CPU limits, which led to better utilisation. I also started introducing more stringent, measurement-based values for memory and CPU on the non-prod clusters. None of this required much data, a few months' worth at most, though it's a constant thing you need to do as you gain customers and as part of performance testing of releases and post-release. I just don't think that smaller companies do this.
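
(A sketch of the measurement-based sizing described above, assuming per-pod usage samples exported from monitoring and an illustrative percentile-plus-headroom rule rather than any particular tool:)

```python
# Sketch of measurement-based request sizing: derive a memory request from
# observed per-pod usage (e.g. a few months of monitoring samples) instead
# of guessing. Percentile and headroom factor are illustrative assumptions.
import math

def suggest_memory_request_mib(samples_mib, percentile=0.95, headroom=1.2):
    """Suggest a memory request: a high percentile of observed usage plus headroom."""
    ordered = sorted(samples_mib)
    idx = min(len(ordered) - 1, int(len(ordered) * percentile))
    return math.ceil(ordered[idx] * headroom)

# A job that idles at a few hundred MiB but spikes briefly at the end:
observed = [310, 295, 330, 305, 290, 320, 300, 315, 2050, 2100]
print(suggest_memory_request_mib(observed), "MiB")  # sized by the measured spike, not a guess
```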

    2. mpi

      Re: "analysis of more than 4,000 clusters [..] prior to optimization"

      > Once the project is rolling in production, it's easy to know what resources you need after a while.

      That "while" has come and gone long ago, and the resources are still provisioned. What's the next explanation?

      1. Anonymous Coward
        Anonymous Coward

        Re: "analysis of more than 4,000 clusters [..] prior to optimization"

        Slow ramp times leading to overprovisioning, as another poster mentioned.

        1. mpi

          Re: "analysis of more than 4,000 clusters [..] prior to optimization"

          Yeah, that argument doesn't really fly with me.

          Because one of the MAJOR selling points that powered the entire cloud hype train was, and is, its flexibility. Need more compute/storage/bandwidth/servers? No problem, very flexible, much cloud.

          So how does that fit with constant, and apparently long-running, overprovisioning? Shouldn't the oh-so-flexible cloud make it easy for companies to adapt their provisioning (and thus their costs) to actual usage? Shouldn't it make it easy to ramp up if and when the need arises, or downsize when it doesn't?

          1. TheWeetabix Bronze badge

            Re: "analysis of more than 4,000 clusters [..] prior to optimization"

            In essence, you are suggesting that a (particularly greedy) corporation help us to spend less money on it than we normally would… That kind of footgun is never going to get popular in business circles, even if it's the right move.

  3. Charlie Clark Silver badge

    Price of complexity

    The driver behind this is the continued desire to live without sysadmins, but Kubernetes introduces heaps of complexity that, while shielding users from having to buy and manage boxes, brings its own, possibly bigger, problems with it. I'm sure we will at some point see tools that can automate and simplify this kind of configuration. Who knows, we might even see a return to more monolithic systems because, in many cases, as long as you can easily reproduce them, they're all many projects need, and they use considerably fewer resources than a highly abstracted "microservice mess"™.

  4. Pete Sdev Bronze badge
    Meh

    Average vs Peak

    If it can be afforded, I like to have capacity at 1.5 times what I expect the normal peak load to be.

    Depending on how bursty a system is, there can be a big difference between peak and average. The average also has to be measured over an appropriate timespan.

    Firing up on-demand instances just for peak time can save money, though as someone else mentioned, it's not instant.

    YMMV.
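
(Putting invented numbers on the peak-versus-average point: peak-based provisioning with headroom, plus a bursty load profile, can produce a headline-grabbingly low average utilization on its own.)

```python
# Made-up numbers showing how 1.5x-peak provisioning plus bursty traffic
# yields low average utilization without anyone being obviously careless.

expected_peak_cpus = 40
provisioned_cpus = 1.5 * expected_peak_cpus     # 60 CPUs of headroom-padded capacity

average_load_cpus = 8                           # bursty: average sits far below peak

print(f"Average utilization: {average_load_cpus / provisioned_cpus:.0%}")  # ~13%
```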

  5. Henry Wertz 1 Gold badge

    Yeah no kidding

    Yeah no kidding. They sell configurations with x GB of RAM and y CPU power. And typically, if you need more CPU power, you'll automatically get more RAM; if you need lots of RAM, more CPU power comes with it. I have an in-cloud site I set up for someone that is just like that; it's not intensive, so it's on the smallest available system, and the 2 GB of RAM is fairly full but the CPU usage is probably 1 percent.

    I think what probably happens in many cases is that they set up Kubernetes, and Kubernetes itself may be fully capable of spinning new instances up and down on demand. But if whatever they're running on Kubernetes isn't, then they just spin it up with some excess capacity so they don't have to keep babysitting it. They probably also go for redundancy/failsafe by having enough spare capacity that if one or two instances go down they're still OK.

    I imagine some leave auto-scale-up off as well because they want a predictable bill; they could run fewer instances, let it scale up to perhaps even less than what they are running now, and save money. But there's always the concern that something goes haywire and fires up 100 instances, making for a big bill.

    Finally, there's the matter of spin-up time. If you run these systems very busy and you get load spikes, it may not be particularly helpful if it takes five minutes for a new instance to spin up and be ready to help with the load. Or you may go to spin up a new one and find out there aren't any available!
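
(A minimal sketch of the trade-off described above, using roughly the Horizontal Pod Autoscaler's scaling arithmetic with hypothetical numbers: replicas follow load, but a floor keeps redundancy and a hard ceiling keeps the bill predictable.)

```python
# Simplified sketch of HPA-style scaling with a hard ceiling, mirroring the
# "predictable bill" trade-off: replicas follow load, but never exceed
# max_replicas even if demand spikes. All numbers are hypothetical.
import math

def desired_replicas(current_replicas, current_utilization,
                     target_utilization=0.6, min_replicas=2, max_replicas=10):
    """Roughly the HPA formula: scale by the utilization ratio, then clamp."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(4, 0.30))  # quiet period: scales down to the floor of 2
print(desired_replicas(4, 0.90))  # busy period: scales up to 6
print(desired_replicas(4, 5.00))  # runaway spike: capped at 10, so the bill stays bounded
```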

  6. Anonymous Coward
    Anonymous Coward

    Firing staff instead of streamlining resources

    Firing the entire network team because AWS networking doesn't need a CCNA :smh:

  7. Kevin McMurtrie Silver badge

    Crap code too

    I've worked at places where pushing a server near its limit crashed it. Race conditions happened, threads deadlocked, errors caused resource leaks, and excessive buffering ate memory. The study goes along with what they were doing - running at 20% capacity because everything immediately dropped dead at 100%.

    I think some in management still believe that "computers are cheaper than engineers." They need to check their hosting costs again.

  8. IGotOut Silver badge

    The paradox of choice.

    "Users can also be confused by the sheer choice available, with AWS offering 600 different EC2 instances"

    If there are, say, two to choose from and it's wrong, it's their fault for the lack of choice.

    If there are 600 to choose from and it's wrong, then it's YOUR fault.
