Cloud customers are wasting money by overprovisioning resources

Organizations are overspending on containerized workloads in the cloud by provisioning more resources than they need, and could potentially cut costs by as much as 60 percent. That's according to a report from cloud monitoring and optimization biz CAST AI, which claims organizations on average provision a third more cloud resources than …

  1. Nate Amsden

    cloud architecture is the problem

    I realized this about 12 years ago myself. It's best to move away from the model of fixed provisioning of resources (fixed meaning provisioning a VM with a set CPU/mem/disk) and towards pooling of resources and provisioning from that pool (how ESXi works, and I assume how other hypervisors like Xen/Hyper-V work on prem). Same with disk space/IO. Nearly 70% of the VMs in my internal environment this year were 1 CPU. Memory ranges from 2GB to 32GB for most things.

    Disk space for most systems is less than 10GB each; some have 300-500GB (a couple have more), and some have 1TB. But every Linux VM gets (by default) 1.8TB of thin provisioned storage, controlled via LVM (so I don't have to touch the hypervisor again if I need more space), and I have discard/trim enabled end to end. It works, too, except for things that use ZFS: even though ZFS claims to support trim, and autotrim is enabled at the pool level, my experience shows it is completely ineffective with non-test workloads, at least when compression is enabled. All storage is pooled from the same back end, and of course I keep close tabs on what is using disk I/O, though disk I/O hasn't been an issue since switching to all flash in 2014. There was a time with spinning disks when a single "bad" MySQL query would consume more disk I/O than 500+ other VMs on the same storage array combined. Fortunately Percona wrote pt-kill, so I used that to keep those queries under control.
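
    To make that concrete, the guest side looks roughly like this (device and volume group names are invented for illustration, not my actual config):

        # The array presents a big thin-provisioned LUN; LVM inside the guest
        # means growing a filesystem never requires touching the hypervisor.
        pvcreate /dev/sdb
        vgcreate vg_data /dev/sdb
        lvcreate -L 10G -n app vg_data   # start small; the array only allocates written blocks
        mkfs.ext4 /dev/vg_data/app

        # Need more space later? Extend the LV and filesystem online:
        lvextend -L +50G --resizefs /dev/vg_data/app

        # And make sure freed blocks get released back to the array:
        systemctl enable --now fstrim.timer   # periodic TRIM of mounted filesystems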

    This pooling approach to VMs is easily 15 years old at this point; it just blows my mind that people don't seem to understand this in 2022 (some do for sure, but most do not).

    1. katrinab Silver badge
      Meh

      Re: cloud architecture is the problem

      On Hyper-V, resource pooling works for Windows and Linux guests (unless you have nested VMs), but not for FreeBSD under any configuration.

  2. Claptrap314 Silver badge

    Issue of scale

    When your total AWS spend is less than $5k/month, as is often the case for small businesses, the cost of doing a review, even quarterly, is likely to be higher than the savings. In a larger, more mature organization, the SRE's cost-saving role comes to the front. I would need to understand not just the percentages but the actual dollar amounts involved before drawing conclusions from this study.
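
    The back-of-the-envelope math is simple enough; every number below is hypothetical:

        #!/bin/sh
        # Break-even check for a quarterly cost review; all figures are made up.
        spend=5000        # monthly AWS bill, $
        waste_pct=33      # assume the report's roughly-one-third overprovisioning
        review_cost=3000  # say two engineer-days per quarter at $1500/day

        savings=$(( spend * waste_pct / 100 * 3 ))   # potential savings per quarter
        echo "savings/quarter: \$$savings  vs  review cost/quarter: \$$review_cost"
        # roughly $4950 vs $3000: a thin margin at this scale; at 10x the spend it's a no-brainer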

    Of course, what Nate Amsden said above is precisely correct--if you have the scale and maturity to support it.

    We're looking at the possibility of having our business grow by a factor of 30 near the end of next year. If that happens, I'm going to be turning off Heroku instances every night. Right now, not so much.
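
    If we get there, the nightly shutdown is a couple of cron entries (app name and dyno counts are placeholders):

        # crontab: scale web dynos to zero overnight and back up each morning.
        # "my-app" is a placeholder; times are in the cron host's timezone.
        0 22 * * * heroku ps:scale web=0 --app my-app
        0 6 * * * heroku ps:scale web=2 --app my-app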

  3. Anonymous Coward

    Oversight

    When there is no oversight, people will happily spend money that isn’t theirs, sometimes even without realising it.

    Give people enough rope, and they will make a swing

    1. Anonymous Coward

      Re: Oversight

      This happens on-prem as well.

      My last main client had an IT team where every new system and its servers were spec'd with minimal CPU/RAM/disk based on the system provider's specs and common sense. The whole VMware infrastructure was monitored and analysed with Foglight and PRTG, so resources were adjusted as needed. It was in a well regulated industry and had SOPs for almost everything imaginable... A tight ship.

      My current main client is the complete opposite, with an understaffed IT dept and almost no control over server provisioning. System owners have spec'd something like 8 CPU/64GB for servers that have never utilised even a quarter of that. They could (...and will!) reduce those servers by 50% and still get the same performance.

  4. msknight

    I have wondered about de-dupe

    Something that's been at the back of my mind for a while is whether customers are being charged for de-duped data. I mean, multiple copies of Windows servers must result in a heck of a lot of common data across instances.

    1. Nate Amsden

      Re: I have wondered about de-dupe

      I don't believe most IaaS clouds do dedupe for storage, at least not the big ones. The enterprise clouds I'm sure do. I'd expect customers not to see any line items on their bills related to dedupe; the providers would just factor their typical dedupe ratios into the cost to the customers.

      But forget dedupe, I'd expect most cloud providers to not even do basic thin provisioning and reclamation (except enterprise clouds again, for the same reasons). Thin provisioning AFAIK was mainly pioneered by 3PAR back in the 2003-ish time frame. I started using them in 2006; thin reclaim didn't appear until about late 2010 I think (and it took longer to get that working right). Then discard at the OS/hypervisor level took time to implement as well (3PAR's original reclaim was "zero detection", so I spent a lot of time writing zeros with /dev/zero to reclaim space, and using sdelete on Windows, prior to discard being available).
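
      For the curious, the zero-detection era reclaim trick looked something like this; mount points are examples:

          # Old-school reclaim on a zero-detecting array: fill free space with
          # zeros so the array can release it, then delete the fill file.
          dd if=/dev/zero of=/mnt/data/zerofill bs=1M || true   # runs until the FS fills up
          sync
          rm -f /mnt/data/zerofill
          # (sdelete -z D: was the Windows-side equivalent)

          # Versus the modern way, once discard works end to end:
          fstrim -v /mnt/data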

      For my org's gear, we didn't get end-to-end discard on all of our Linux VMs (through to the backend storage) until moving to Ubuntu 20 (along with other hypervisor VM changes) in late 2020. I had discard working fine on some VMs that used raw device maps for a while prior. I know the technology was ready well before late 2020, but to make the changes to the VMs it was better to wait for a major OS refresh (16.04 -> 20.04 in our case) rather than shoehorn the changes in inline. It wasn't urgent in any case.
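
      Verifying that plumbing is quick; the pool name below is a placeholder:

          # Check each layer advertises discard down the stack:
          lsblk --discard           # nonzero DISC-GRAN/DISC-MAX = device supports discard
          fstrim -v /               # errors out if the filesystem/device can't trim
          zpool get autotrim tank   # ZFS side, per the caveat above ("tank" is a placeholder)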

      I remember NetApp pushing dedupe hard for VMware stuff back in the 2008-2010 time frame; I never really bought into the concept for my workloads, though I'm sure it makes a lot of sense for things like VDI. When I did eventually get dedupe on 3PAR in 2014 (16k fixed-block dedupe; I don't know what NetApp's dedupe block size was/is), I confirmed my original suspicions: the dedupe ratio wasn't that great, since there wasn't that much truly duplicate data (which would have been OS data, and a typical OS was just a few gigs in Linux). I expected better dedupe on VMware boot volumes (boot from SAN). Initially the ratio was great (don't recall what exactly), but my current set of boot LUNs was created in 2019, and now the dedupe ratio is 1.1:1, which is basically no savings (under 10 percent), so next time around I won't enable dedupe on them. (ESXi 6.5 here still; I read that ESXi 7 is much worse for boot disk requirements.) The average VMware boot volume here is 4.3GB of written data on a 10G volume.
