AMD downplays risk of growing blast radius, licensing fees from manycore chips

As AMD pushes to extend its share of the datacenter CPU market, it's urging CTOs to consider how many of their aging Intel systems could be condensed into just one of its manycore chips. However, there are legitimate concerns about the blast radius of these manycore systems. A single Epyc box can now be had with as many as 384 …

  1. ecofeco Silver badge
    Pirate

    The charge by core scam

Charging by core, like SaaS and cloud, is a scam that corporate loves to embrace. Why? Lots of backhanders, brown envelopes and outright conflicts of interest, e.g. VP bobblehead actually owns the VAR that sells to their own company.

    And it's perfectly legal because... the board approved it. For their own greedy ulterior motives.

    1. DS999 Silver badge

      Re: The charge by core scam

      How is that any different than licensing per physical server? If you have 256 servers with one core or one server with 256 cores, why should you be charged differently?

      1. katrinab Silver badge
        Megaphone

        Re: The charge by core scam

        Because you won’t install that software on all 256 servers, and won’t use all 256 cores on that server for that software.

        1. DS999 Silver badge

          Re: The charge by core scam

Depends on what the software is, but if you don't want to pay for 256 cores then partition off a VM with a limited number of cores, run the software in that, and get charged on that basis.
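A minimal sketch of that core-capping idea, assuming a KVM host with the libvirt Python bindings; the domain name, core counts and pinning layout are made up for illustration, and whether a given vendor accepts vCPU caps as a licensing boundary varies.

```python
# Hypothetical sketch: persistently cap a guest named "licensed-db" to 4 vCPUs
# and pin them to dedicated host cores, so the licensed product only ever runs
# on (and is counted against) those cores. Names and figures are illustrative.
import libvirt

HOST_CORES = 256        # physical cores in the box
LICENSED_VCPUS = 4      # cores we actually want to pay for

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("licensed-db")

# Limit the guest's maximum and current vCPU count in its persistent config.
dom.setVcpusFlags(LICENSED_VCPUS,
                  libvirt.VIR_DOMAIN_AFFECT_CONFIG | libvirt.VIR_DOMAIN_VCPU_MAXIMUM)
dom.setVcpusFlags(LICENSED_VCPUS, libvirt.VIR_DOMAIN_AFFECT_CONFIG)

# Pin vCPU i to physical core i (cores 0..3 here), one dedicated core per vCPU.
for vcpu in range(LICENSED_VCPUS):
    cpumap = tuple(core == vcpu for core in range(HOST_CORES))
    dom.pinVcpuFlags(vcpu, cpumap, libvirt.VIR_DOMAIN_AFFECT_CONFIG)

conn.close()
```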

Well, other than Microsoft, which charges per physical core because they want to assume all the VMs on that server are Windows and punish you if they're not, lol

          1. LuxZg

            Re: The charge by core scam

Microsoft isn't the only one. E.g. Oracle requires all PHYSICAL cores to be licensed for their database, even if you are running it inside a limited VM. The only way around it is to use Oracle Linux as the hypervisor, which is just an illegal tie-in with their OS, an OS that isn't any different from many others (it's literally a Red Hat clone, right?). So if you have a VMware or Hyper-V cluster with 1000 cores and each host having 300+ cores, you'd need to license 300+ cores even if you are using Oracle DB in a 2-core VM.

So you're forced to have a single server outside the cluster just for this DB, either with a small quad-core CPU in that physical machine, or by installing Oracle's OS to limit the Oracle DB VM to fewer cores. And this is just one example; there are others with similar scenarios. In our company we have physical machines only as domain controllers, backup, or cluster hosts. When we briefly toyed with Oracle we HAD to have a single separate machine just for a single application. Waste of server, cables, ports, power, rack space, well, basically everything, just a big waste.
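A back-of-the-envelope sketch of why that stings; the per-core price and core factor below are purely illustrative placeholders, not Oracle's actual list prices:

```python
# Illustrative only: price and core factor are placeholders, not real Oracle
# figures. The point is the ratio between the two cases, not the absolute sums.
PRICE_PER_CORE = 10_000   # hypothetical licence cost per countable core
CORE_FACTOR = 0.5         # hypothetical per-core multiplier for x86

def licence_cost(countable_physical_cores: int) -> float:
    """Cost when every physical core in the host must be licensed."""
    return countable_physical_cores * CORE_FACTOR * PRICE_PER_CORE

# A 2-vCPU Oracle DB VM on a 320-core virtualised host vs. a dedicated 4-core box:
print(f"{licence_cost(320):,.0f}")   # 1,600,000 -- whole host counted
print(f"{licence_cost(4):,.0f}")     # 20,000    -- small dedicated server
```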

  2. Anonymous Coward
    Anonymous Coward

    112 of the 128 cores fused off

    Wow! I guess those cores are truly commodity items when one can just fuse-off 87.5% of them without flinching, or cringing (going from 9755 to 9175F). And not only is the huge L3 cache great but the base clock jumps to 4.2 GHz for those remaining 16 cores ... the highest in the Turin lineup, impressive!

I wonder if they could do an Intel SST-PP-style trick on these cores, dynamically reconfiguring the chip between 16, 32, 64 and 128 cores rather than factory-fusing the extra ones off completely, and yet maintain the enhanced clocking of the reduced-core version?

    1. chasil

      Re: 112 of the 128 cores fused off

      This also allows otherwise defective parts (with failed cores) to be repurposed.

  3. Roo
    Windows

    Bigger is better

HPC has *always* faced this challenge of Blast Radius - and it's survived just fine, if not thrived, with many-core boxes... Typically workloads are partitioned and can fail over to other machines or resume from a checkpoint.
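A minimal sketch of that checkpoint-and-resume pattern, assuming a long-running partition of work whose state can be pickled; the file name, step counts and work function are made up for illustration:

```python
# Hypothetical checkpoint/restart loop: if the host dies, the partition is
# simply rescheduled on another host and picks up from the last checkpoint.
import os
import pickle

CHECKPOINT = "partition_042.ckpt"   # per-partition checkpoint file (illustrative)
TOTAL_STEPS = 1_000_000
CHECKPOINT_EVERY = 10_000

def work_unit(step: int) -> float:
    """Stand-in for the real compute kernel."""
    return step * 1e-6

def load_state() -> dict:
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "accumulator": 0.0}

def save_state(state: dict) -> None:
    """Write atomically so a crash mid-write never corrupts the checkpoint."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)

state = load_state()
while state["step"] < TOTAL_STEPS:
    state["accumulator"] += work_unit(state["step"])
    state["step"] += 1
    if state["step"] % CHECKPOINT_EVERY == 0:
        save_state(state)
```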

I am now in a weird new world where I'm seeing HPC running on VMs (at multiples of the TCO of running this stuff on bare metal). The manglement haven't quite got their heads around the fact that HPC workloads *exceed* the capacity of the box; you are actually *reducing* the utilization of a machine by running it under a VM... There's no improvement to hardware resiliency at all - in fact there's an additional cost & (huge) overhead from the redundant fail-over mechanisms provided by the VMs...

So while many-core blast radius hasn't really affected us guys running distributed (HPC) workloads, the VM *software* blast radius has. It turns out that a hypervisor upgrade has to happen across a cluster of nodes all at the same time - so rather than losing a single box we lose several racks at a time. Another "learning" for the VM Poindexters is that migrating 2TB active working sets across a network is a *lot* more expensive than simply resuming on another host from a checkpoint...

HPC on VMs is literally double the cost of bare metal - it requires *far* more manpower to keep that oh-so-clever-but-unnecessary VM layer running.

    TL;DR

    Many Cores = great, more please for HPC, blast radius be damned - but serve them up as bare metal so we don't waste cycles & run-time maintenance overhead on the superfluous VM crapola.

    In most cases I don't think VMs are actually needed - software that is fit-for-purpose, packaged and deployed properly isn't a problem that can be fixed by VMs - ameliorated perhaps, but not remedied. We're not running software on DOS boxes any more folks.

    1. Wexford

      Re: Bigger is better

      I might be misunderstanding your post, but your first paragraph seems to be contradicted by your final one?

      1. Roo
        Windows

        Re: Bigger is better

        It was a bit of a ramble-rant. I'll try and slice it differently... :)

1) Our HPC-type workloads are the opposite: they vastly exceed the capacity of a single host.

        + Thus the blast radius of a host spans a fraction of the workload.

+ Checkpointing and rescheduling the affected portion of the workload to another host is how we address these failures (quick & efficient).

2) Typically VMs (and zillion-core hosts) are used to aggregate lots of workloads that are each *much* smaller than the capacity of the host.

        + The blast radius of a host spans a lot of workloads.

        + VM migration can be used to mitigate this problem if you have some spare capacity (slow and burns compute host resources in the background).

The Blast Radius of a Zillion Core host is not a problem for HPC because your workload vastly exceeds the capacity of that one host and it will inevitably have been engineered to tolerate a bare-metal host failure (this *will* be battle-proven, because the MTBF of a set of hosts is *MUCH* lower than the MTBF of a single host).
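A quick sketch of that MTBF point, under the usual simplifying assumption of independent, exponentially distributed failures; the single-host figure is illustrative, not a measured one:

```python
# With independent exponential failures, the time to the *first* failure in a
# fleet of N identical hosts has mean MTBF_single / N.
SINGLE_HOST_MTBF_HOURS = 50_000     # illustrative, roughly 5.7 years

def fleet_mtbf(hosts: int, single_mtbf: float = SINGLE_HOST_MTBF_HOURS) -> float:
    """Mean time to the first failure anywhere in the fleet."""
    return single_mtbf / hosts

print(fleet_mtbf(1))     # 50000.0 hours (~5.7 years)
print(fleet_mtbf(100))   # 500.0 hours (~3 weeks) -- so the code *must* tolerate it
```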

Meanwhile, in the land where you are using Hypervisors to aggregate multiple workloads together and use VM migration to provide resilience to hosts exploding, the use of Hypervisors can actually multiply the blast radius, because they are necessarily tightly coupled in order to do the complex task of migrating a few thousand processes & TBs of working set... Vendors / operators are reluctant to, or won't, support the migration of workloads between hypervisor hosts at different version/patch levels. Thus if you need to patch the Hypervisor layer you will need to patch hosts in batches - thus multiplying your effective blast radius as that patching happens (in practice that happens at rack level - so the blast radius is 16x bare-metal in the event of VM migration breaking or Hypervisors needing an upgrade).

        ... rant bit ...

        Suffice to say, in practice, the real blast-radius problem wasn't a problem for our HPC workloads on bare-metal (because the hosts are loosely coupled and migration is cheap) - but it *has* become a problem for our HPC workloads that run under VMs - because the Hypervisor layer has multiplied the blast radius from taking down one host to taking down a rack at a time (because the hosts are now tightly coupled at the VM layer). We've seen this manifest itself as outages when running our workloads under VMs - something that the bare metal hosted workloads haven't experienced since a DC burnt down several years ago.

I suspect the main driver for our org forcing us to run stuff under private Cloud is to increase their utilization figures, as our HPC workloads achieve > 90% 365x24 on bare metal. The Cloud utilization figures went from ~17% to ~32% when a small portion of our HPC workloads was migrated to Cloud. ;)

        1. Anonymous Coward
          Anonymous Coward

          Re: Bigger is better

          "Vendors / operators are reluctant or won't support the migration of workload between VMs at different version/patch levels. Thus if you need to patch the Hypervisor layer you will need to patch hosts in batches - thus multiplying your effective blast radius as that patching happens (in practice that happens at rack level - so the blast radius is 16x bare-metal in the event of VM migration breaking or Hypervisors needing an upgrade)."

I'm confused, what hypervisor would need that? If it's the vendor of the software running in the VM, what have the hypervisor version differences got to do with the workload running in the VM during migration? That's a hypervisor compatibility problem, so which hypervisor doesn't allow / recommend migrating VMs between different versions? If it's an operator not wanting to do it, I think a new operator is needed.

          1. Roo
            Windows

            Re: Bigger is better

Migration as in the physical host goes down and the workload needs to find a new home. By design and definition there is an intricate and tight coupling between the fail-over partners - consequently a vendor would be entirely correct to be circumspect about mixing and matching software versions across fail-over partners.

            I'm not talking about sensibly designed & operated systems here, I'm talking about real-world apps and systems. :)

  4. Jimmy2Cows Silver badge

    it's actually more resilient and tolerant as you go up in terms of core counts

    Well, you would say that, wouldn't you.

    Seems highly counterintuitive that putting way more eggs into way fewer baskets is somehow more resilient when a basket breaks. Unless having AMD CPUs magically makes the non-AMD parts more resilient, less prone to failure. But that's obviously bollocks.

More cores = more power consumed = more heat in a confined space = more chance for thermal stresses and instabilities to make something go pop. Hard to see how it could be any other way.

    1. Roo
      Windows

      Re: it's actually more resilient and tolerant as you go up in terms of core counts

It's not written in stone of course, but generally fewer components -> fewer opportunities for mechanical mishap -> better MTBF. Those big old wardrobes full of TTL chips running a handful of text editors weren't any more reliable than a 128-core AMD64 box running a few hundred Monte Carlo simulations. Really this comes down to choosing the correct tools for the job; besides which, if you genuinely need a small blast radius for some workloads you can always *under-utilize* your hosts...

      1. Bitsminer Silver badge

        Re: it's actually more resilient and tolerant as you go up in terms of core counts

I recall one "wardrobe of TTL chips", known as a VAX-11/780, being down for a week, with parts all over the floor, and day-long phone calls (long-distance, mind you!) to Massachusetts.

The wirewrap was one theory. After changing the backplane (3 feet on a side), that theory was, hmmm, disproven.

        The problem turned out to be a cache memory card.

        So yes, the wardrobes of the past were way less reliable and had a lower likelihood of finishing a week-long batch job.

        The integration of several hundred billion transistors onto a small ceramic chip is an amazing accomplishment and it's also quick to fix. And it's reliable because it's built that way.

        1. Roo
          Windows

          Re: it's actually more resilient and tolerant as you go up in terms of core counts

I was actually thinking of VAX-11/78xs when I wrote "wardrobes of TTL chips". I had fairly happy experiences using those boxes (perhaps luckily).
