Stratus ships latest batch of fault-tolerant Xeon servers

Fault-tolerant computing veteran Stratus has released the latest generation of its ftServer systems, which offer zero downtime for mission-critical applications, but lag behind the rest of the market in terms of the latest technology. Stratus has announced availability of the 12th generation of its ftServer platform, claiming …

  1. Nate Amsden

    interesting concept

    Wonder how it works in real life. Protecting against component failures is nice, but there's no real mention of software reliability. If, for example, you are running vSphere on it and you need to install patches, that is still downtime (you could mitigate that with a pair of systems, but that seems like overkill). Or worse, if the OS crashes (host or guest).

    Closest comparison I can think of off the top of my head: https://en.wikipedia.org/wiki/NonStop_(server_computers)

    But that system seems to have software tolerance as well

    "NonStop OS is a message-based operating system designed for fault tolerance. It works with process pairs and ensures that backup processes on redundant CPUs take over in case of a process or CPU failure. Data integrity is maintained during those takeovers; no transactions or data are lost or corrupted."

    But of course you probably can't run things like vSphere on NonStop.

    vSphere itself has had fault tolerance for a long time, which would cover a lot of use cases where you have to be protected against component failure, but there are of course limitations (fewer now than originally: it was limited to 1 vCPU, and it looks like the current limit is 8 vCPU) - https://www.vmware.com/products/vsphere/fault-tolerance.html

    1. Sunset

      Re: interesting concept

      Stratus has their own OS, VOS, which is a Multics-like system that runs on the ftServer hardware and competes directly with NonStop.

      Most ftServers run Linux or NT, but the "absolutely must have nine nines of reliability" lot run VOS.

      1. Steve Aubrey

        Re: interesting concept

        Gotta be a niche market - not many need it, but if you need it, you need it bad.

      2. Jellied Eel Silver badge

        Re: interesting concept

        > Most ftServers run Linux or NT, but the "absolutely must have nine nines of reliability" lot run VOS.

        It being Friday, I immediately thought "Or Crysis". Then I started wondering about licensing and DRM. I'm guessing software businesses would say 2 copies running simultaneously means two licences. But then there's whether DRM can mess this up. I assume not, because it's cloning at a low level, but DRM and licence management often seem like a headache.

      3. Anonymous Coward

        Re: interesting concept

        A colleague of mine related a story about their ex-employer. The CFO for the firm is a bit of a laugh and is in a meeting with the new CTO (and a few others) to discuss new servers and the funding thereof. The CTO outlines the very high-spec models they shortlisted and the one they’ve chosen, running through the reasons for doing so. The CFO says at the end of the presentation, "Well, that’s all lovely, but will it play Doom?"

        Stunned silence, especially from the tech side of the table, until the finance side roared with laughter and said they were joking and much preferred Quake. The Tech Director was horrified at the thought that hardware costing many thousands would be used just to play shoot-'em-up games.

    2. Anonymous Coward

      Re: interesting concept

      I own four Stratus boxes running ESXi. It works well in practice. Stratus does take steps to improve software reliability too. I know that they harden drivers, so a system running on a Stratus box should have greater software reliability than if it was running on another piece of hardware, even if you used the same components.

      It is an FT system, but it isn't designed to stay running through scheduled maintenance events. If you have to patch your OS, or add RAM, then you will still need to schedule downtime. What it does do is reduce unscheduled downtime. What is cool, and a bit scary, is that you can do BIOS updates without needing to take any downtime for your workloads. For my super critical workloads, I run them on ftServer. If I need to do maintenance on ftServer, I vMotion the VMs over to my regular cluster.

      I looked at VMware FT when I was looking at Stratus ftServer. VMware FT was and still is limited by how many vCPUs a VM can have, it requires a SAN, and it is much more complicated to set up. You have the server hardware, ESXi, network switches, and shared storage. That's potentially 4 or 5 different vendors you have (hypervisor, Ethernet, server hardware, storage hardware, and maybe FC switch). With Stratus ftServer, it's all within the chassis and the Stratus hardware and software. Stratus can even help support your OS.

  2. Graham Cobb Silver badge

    Nice to see they're still around

    I worked for Stratus for a while about 25 years ago. Same idea back then. In those days they had 2 target markets: telecoms and financial services.

    It worked well, although it was expensive - with proprietary hardware, OS and software - and not high performance. But if you needed the fault tolerance, you would pay.

    The company got split: the telecoms business was acquired by another company which bought it for its international customer base and for the software and telecoms expertise. They weren't very interested in the fault-tolerant hardware.

    I knew the financial services side had retained the FT hardware design but I will admit to being a bit surprised it is still around. Good luck to them!

  3. Will Godfrey Silver badge

    Close (but no cigar)

    If you are really that desperate for total reliability, you need something like the ancient system Racal had for one of their remote transceivers: two identical but completely separate units with independent power supplies and separate aerials. As far as I'm concerned, dual processors (or anything really) in the same box don't cut it.

    1. Andy Tunnah

      Re: Close (but no cigar)

      But anyone can supply that? That's just twice the amount, and already incredibly common.

      This is redundancy within the same system, that is the specialised bit. This isn't redundancy so that a failure can happen and then be resumed. This is redundancy so that a failure doesn't interrupt anything.

      People who have this sort of redundancy will ALSO have extra systems in place for total failure.

      1. Graham Cobb Silver badge

        Re: Close (but no cigar)

        This is engineered for particular requirements. It is for when you need 5-nines availability (roughly five minutes of downtime a year) while meeting the committed response times. No opportunity for failover or retries.

        Two examples back from my day:

        1) Banking transaction processing.

        2) Mobile network number portability lookups (needed to allow call requests and other transactions to be sent to the right destination network once it was no longer possible to use the first few digits of the number to know where to send it).

        Both required response times measured in milliseconds, with 99.999% availability. Particularly important when the transactions are designed such that no retries are allowed - it either succeeds or fails and the end customer is pissed off if it fails.
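
As a rough check on the availability figures in this comment, here is a minimal Python sketch of the arithmetic. The function name `downtime_minutes_per_year` and the choice of an average 365.25-day year are assumptions made for illustration; they are not from the thread.

```python
# Assumption: an average year of 365.25 days (includes leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines (5 -> 99.999%)."""
    unavailability = 10.0 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes/year")
```

Under those assumptions, 99.999% availability leaves a budget of a little over five minutes of downtime per year, which is why a failover that takes even a minute or two eats a large share of it.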

        1. Blank Reg

          Re: Close (but no cigar)

          Sometime last century, when I worked for a mobile operator, we got our first fault-tolerant system. As I had some involvement in the selection of the machine, I wanted to see it in action after it was installed. So down to the switch room I go, and in talking to the guys that installed it I found out that our very expensive, fully fault-tolerant system had both of its power supplies fed from the same UPS.

    2. Roland6 Silver badge

      Re: Close (but no cigar)

      > If you are really that desperate for total reliability

      I would use a similar pattern to the “ancient Racal system”. However, I would increase the number of nodes to 3, use a Stratus in each node for processing, and closely couple the units with fibre optics to ensure electrical isolation. Naturally, I would implement a voting system over this.

      This was what was effectively at the heart of solid state interlocking deployed on the railways back in the 1980s… writing the distributed OS etc. was challenging and fun…
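
The voting idea in this comment can be sketched as a generic majority vote over the outputs of three nodes. This is only an illustration of the general technique, not the algorithm used in any real interlocking system; the function name `majority_vote` and the example values are hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value a strict majority of nodes agree on, or None if none exists."""
    if not outputs:
        return None
    # most_common(1) gives the single most frequent value and its count.
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


if __name__ == "__main__":
    # One of three nodes disagrees: the majority value still wins.
    print(majority_vote(["proceed", "proceed", "stop"]))
    # All three nodes disagree: no safe answer, so the caller must handle None.
    print(majority_vote(["a", "b", "c"]))
```

With three nodes, a single faulty node is outvoted; with only two, a disagreement can be detected but not resolved, which is one reason to go to three.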

    3. Terafirma-NZ

      Re: Close (but no cigar)

      That is exactly what it is. Two servers fully independent and able to run in a chassis. They are then connected via an interposer that makes sure all CPU input is sent to both systems and any output matches on both systems. Every part is redundant.

      Funny thing about this is that if you need newer hardware you can get their OS and run it on two standard servers. It runs KVM and keeps the VMs on both in sync, or can do normal failover like standard VMware HA. I have deployed lots of this in the past and it worked well for what it was.
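
The lockstep arrangement described in this comment (same input to both systems, outputs compared) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how the Stratus hardware is implemented: the replicas are plain functions, and the names `run_in_lockstep` and `LockstepMismatch` are hypothetical.

```python
from typing import Callable, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


class LockstepMismatch(Exception):
    """Raised when the two replicas produce different outputs for the same input."""


def run_in_lockstep(
    replica_a: Callable[[In], Out],
    replica_b: Callable[[In], Out],
    value: In,
) -> Out:
    """Feed the same input to both replicas and release the output only if they agree."""
    out_a = replica_a(value)
    out_b = replica_b(value)
    if out_a != out_b:
        raise LockstepMismatch(
            f"outputs differ for input {value!r}: {out_a!r} vs {out_b!r}"
        )
    return out_a


if __name__ == "__main__":
    double = lambda x: x * 2
    print(run_in_lockstep(double, double, 21))
```

A two-way comparison like this can only detect a disagreement, not say which side is wrong; how the real hardware decides which half to keep running is not covered in the thread.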
