
Re: 1998
"and I've already won."
You've won at punching yourself in the nuts and making a fool of yourself in public. I applaud you. Congratulations.
Steve Ballmer called, he wants his FUD back.
Just in case you weren't aware, macOS has been at the Desktop game since 2001 - and its market share is in the same ballpark as Linux - despite wrapping its developers and users in straitjackets and having many $bns of development budget lavished on it.
Besides which mobile phones are where the action is these days, and Android (aka Linux) is in a lot more people's hands than iOS or whatever they want to call it.
I honestly don't see why you bother FUDding something that has such a small piece of the market you care about. Let the kids have their fun producing software they like using, they aren't in the way of you enjoying a random collection of 90s vintage UIs known as Windows or the different random collection of 90s vintage UIs running on macOS (not to be confused with the 80s vintage UI stuff that ran on Mac OS).
If you are running at any kind of scale you will need a bunch of well qualified SRE/Sys Admin/Operator types to keep your compute estate pointing in the right direction - wherever it runs. With respect to your last comment, I refer you to a paraphrase of a quote attributed to John Rollwagen: "A computer is like an orgasm - it's better when you don't have to fake it."
Cloud SLAs are kinda meaningless when the infrastructure and Cloud services are continually being overhauled, thus forcing continual development churn just to keep the AWS instances alive. Case in point: some of our internal clients migrated to a major Cloud three years ago - and basically they have an "incident" every other month, much of which is down to the churn and people making mistakes (the odd AWS outage or liquidity shortfall on our 'reserved' capacity happens too).
By contrast the on-prem clusters (even with rolling OS updates, service updates etc) have had *zero* downtime for two years solid (the last incident was literally caused by a back-hoe - but happily the workloads continued executing uninterrupted in the DC regardless). In addition to requiring less overhead within the internal client teams (no Cloud SAs/Operators/SREs needed), the on-prem clusters' running costs are less than a third per CPU hour. It really depends on your workload at the end of the day, and ours are very definitely not the classic AWS use-case. Our workloads tend to run 18x6, saturating 100s-1000s of servers at a time, and the networks get eaten in bursts - we are very bad neighbours. The classic Cloud benefits of consolidating multiple workloads onto a physical host and multi-region availability don't apply (our workloads are HA by default because you don't get a big enough maintenance window to bounce an entire cluster at once).
It is a case of the right tool for the job... For (our) workloads that span multiple hosts and saturate them for the majority of the day - Cloud has zero benefit - even the liquidity argument doesn't cut it as we found out when our internal clients tried to actually *use* their *paid for* reserved capacity (6 month lead time vs 3 months on prem - and of course that wasn't covered by the Cloud SLA). For the sake of fairness I have to point out that there are plenty of internal workloads that have happily moved to the cloud - just not the big ones we look after. :)
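As a back-of-envelope illustration of that "less than a third per CPU hour" point (all rates invented purely for the arithmetic - not our actual numbers), the sums for a sustained 18x6 workload look something like this:

```python
# Back-of-envelope cost sketch with made-up rates - NOT real figures.
# Assumes a cluster that is saturated 18 hours a day, 6 days a week.

servers          = 500        # hosts saturated by the workload (hypothetical)
cores_per_server = 64
hours_per_week   = 18 * 6     # the "18x6" usage pattern

core_hours_per_week = servers * cores_per_server * hours_per_week

cloud_rate_per_core_hour  = 0.045   # hypothetical blended reserved rate, USD
onprem_rate_per_core_hour = 0.015   # hypothetical fully-loaded on-prem rate (~1/3)

cloud_cost  = core_hours_per_week * cloud_rate_per_core_hour
onprem_cost = core_hours_per_week * onprem_rate_per_core_hour

print(f"core-hours/week : {core_hours_per_week:,}")
print(f"cloud           : ${cloud_cost:,.0f}/week")
print(f"on-prem         : ${onprem_cost:,.0f}/week")
```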
The sad thing is that Gates had promised a database style filesystem with "Cairo" (which was axed in 1996). Longhorn seems to have retrodden the path taken by Cairo a full decade before it. Just add it to the pile of ephemeral vaporware that MS used to kill off the competition.
Part of the challenge with OSes in particular is that they are built on hardware - which doesn't necessarily have a formal model, proofs and a means to verify implementations against that model. Thus the OS runs atop silicon quicksand - it can never be proven to be "correct" or "bug-free" because the hardware model it is written against is not adequate - let alone complete in 99.99% of cases. There have been efforts to plug this hole - eg: RSRE's VIPER microprocessor, but as far as I know these methods haven't been applied comprehensively to common or garden x86 type gear.
The Linux codebase doesn't exist in isolation, it's a function of the community around it. Porting the codebase to a language that is familiar to a tiny minority of that community is what would cause the damage, all those folks who aren't familiar with the new language are frozen out of the development effort - and all their domain knowledge is effectively lost with it. Thus the result would *likely* not be as good / useful to people as the original codebase (ie: it would be incomplete and it would suck balls). Ultimately it's a daft argument to have - Linus, bless him, appears to be more pragmatic.
With VAX-11/78x's you needed *physical* access to the machine in order to update the microcode (as it was uploaded by the "console" when the machine was bootstrapped), so it wasn't as easy to compromise the machine - even though the vendor actively encouraged customers to write microcode. :)
There is very little that is genuinely new in computing - the majority of the legwork was already wrapped up by the mid-70s. I remember reading that article in the early 90s, having used late 70s vintage machines for some time, what he wrote wasn't radical or obscure back then - it was a simple statement of fact that was obvious to anyone who had booted up a minicomputer and compiled up a UNIX system. That isn't taking anything away from Ken, his genius was seeing it as a problem and spelling it out clearly for PHBs.
You have to put your balls in a vice whenever you run your code on someone else's hardware anyway, this just shows (yet again) that the Emperor has no Clothes.
With respect to shagging microcode, the attack surface could be severely curtailed by separating the *control* plane from the data plane (aka code the user can load onto the box). In the old days this was an 8" floppy hooked up to a service processor. These days I guess it would be a *physically* separate network cable (or USB port) on a BMC (or maybe an 8" floppy controller hooked up to a service processor USB port for traditionalists).
Quit trying to make everything *too* easy.
I have several bookshelves groaning under the weight of documentation about old machines - a Digital Equipment Corporation RM03 Printset for example, but realistically there is very little that I think *needs* to be kept for our descendants. If I had to pick 3 of the items from my bookshelves, I'd go with:
1) "The Art of Electronics" by Horowitz & Hill
2) "The Transputer Databook 1992" - because it covers parallel programming theory very well - in addition to packing in a lot of info about some very well designed bits of silicon.
3) "The Design and Implementation of the 4.4BSD Operating System" - still relevant today given the influence of BSD, explains why things are the way they are in UNIX land, and teaches the kids a few new-old tricks along the way.
From that lot I reckon you've got enough material to tackle most problems you're likely to meet in the real world - and maybe create a few more interesting problems on the way. ;)
"Being a non-drinker is more likely to make you a user of other recreation substances - alcohol is a drug, a lot of people like it. Trump doesn't seem to, neither do I, except for the occasional Guinness (because it's good for you)."
Trump can't even keep a marriage vow for more than a couple of minutes, he can't even remember what his wife looks like, he talks trash all the time, rants like an angry drunk and lies about everything ... The chances are he is lying about being teetotal as well.
I confess that I loathe Intel for what they've done and their malign influences on the industry - but I did have hopes for Phi before the details of their design started to appear... When the details appeared - such as it being organized around a ring bus and them being absolutely hell bent on using a Pentium core - I figured they had screwed the pooch - but maybe they'd get away with it through the magic of their processes & fabs... They didn't get away with it.
I should correct myself... The P6 core was in fact the first Intel core to have the CISC front end / RISC (micro-op) back-end approach.... I still like it, but it would be more correct to say it was the first "great" CISC frontend/RISC backend rather than the last great CISC design. :)
RISC is really about the approach you take to designing the ISA and implementing it - specifically using real-world metrics to direct your design and implementation - rather than "feels" and "vibes". I chose to contrast System 360 vs CDC6600 because (even though they were shipped before RISC was a thing) they reflect the contrasting design philosophies embodied by CISC & RISC. IBM's T.J.Watson Jr.'s rage-post from 1963 (https://images.computerhistory.org/revonline/images/500004285-03-01.jpg) illustrates the advantages of applying RISC style principles (CDC 6600) vs the CISC ISA approach (IBM System 360) in terms of performance and development cost.
As an aside it's worth studying the CDC 6600 and the System 360's development - plenty of interesting stuff there and lessons to be learnt (and you can see those lessons being applied - or not - in subsequent generations of machines too). :)
"There's a reason they call x86 a CISC and it's because it can do complex integer set computing."
No. That's really not the reason.
Firstly the term "CISC" only came into being as a counterpoint to "RISC".
The point of CISC (before that term was coined) was to minimize the cost of fetching an instruction from (very slow) main memory. It was a hunch that was not actually based on any empirical evidence - and the instructions chosen to be implemented were primarily driven by marketing and implemented by a mixture of microcode and kludgy expansion modules. The emphasis was on making writing assembly & machine code easier - the VAX-11 had stuff like string search and replace for example. RISC changed the focus to making the compiler writer's life easier (I always found writing RISC assembler easier than CISC too for that matter - far fewer corner cases to worry about).
Reduced Instruction Set Computers were designed for efficient implementation of hardware and software (ie: compilers). The first machine designed this way was the 801, the team was led by John Cocke. That guy worked on the IBM S/360 - the original multi-billion-dollar development budget CISC machine, and that work showed him that compilers were not using the instruction set efficiently. So for the 801 project he set out to develop an ISA with the goal of keeping the hardware *AND* the compilers simple and efficient. He achieved that goal - to the point where his tiny 801 processor + compiler outperformed the bigger (and more expensive) CISC S/360 contemporaries. The prize for his success was for IBM to hide the 801 in channel processors and other ancillary gear that wouldn't erode their humongous margins on mainframes.
Even when CISC was king (and IBM were raking in the cash from S/360 and its descendants) the fastest and most efficient machines of the era were rather RISCy: case in point the CDC6600 and early CRAYs - which were an order of magnitude faster and more energy efficient than their CISC (IBM) contemporaries.
These days all high end CISC chips are implemented as a "front end" instruction decoder feeding a bunch of RISC-style "micro-ops" to a very RISC/VLIW style backend. There aren't any true CISC chips left - they are all RISC style back-ends with a CISC style front end on them. This entire argument is over marketing really - the engineering battle was lost to RISC in the 90s whether folks want to accept it or not. :)
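To make that front-end/back-end split concrete - purely a toy sketch, nothing like any real decoder - a CISC-style memory-to-register add gets cracked into simple load/ALU/store micro-ops before the RISC-style backend ever sees it:

```python
# Toy illustration only - not how any real x86 decoder works internally.
# A "CISC" memory-to-register add is cracked into simple RISC-style
# micro-ops (load / alu / store) before the backend executes anything.

def crack(instruction: str) -> list[str]:
    """Translate one CISC-style instruction into a list of micro-ops."""
    op, dst, src = instruction.replace(",", "").split()
    if op == "add" and dst.startswith("["):           # add [mem], reg
        addr = dst.strip("[]")
        return [
            f"ld   t0, {addr}",      # load the memory operand
            f"add  t0, t0, {src}",   # simple register-register ALU op
            f"st   t0, {addr}",      # write the result back
        ]
    return [instruction]                               # already "RISC enough"

for uop in crack("add [rbp+8], rax"):
    print(uop)
```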
From my PoV, engineering wise, the last great (pure?) CISC was the P6 core - which first appeared as the Pentium Pro in the mid 90s. Everything after that has either sucked (eg: Pentium IV) or had a CISC front end grafted onto a RISC/VLIW backend.
All the super-intense compute stuff is done with GPUs today - which consist of many RISC style cores operating in parallel. Intel tried a CISC alternative to that with Xeon Phi - suffice to say it sank without trace (because it was sooooo slow - and inefficient).
Going forward I suspect the future will look like Fujitsu's A64fx - RISC cores with a monster vector unit and a large helping of HBM memory on the side... Just like what NVidia are punting right now in fact. ;)
Sure you can implement anything with a single instruction and a lookup table... That doesn't make it easy to produce something that you can depend on.
The *validation* is a huge part of chip design & implementation...
Try coming up with a validation suite for System Management Mode based on Intel's architecture reference manuals and see how far you get. I'll wager that not even Intel can come up with a single consistent interpretation of their own spec as stated in their published literature... And that's just one mode - without even considering all the ways it interacts with any of the many other modes - and the whacky instructions such as "lock"... Translating that stuff into an alternate instruction stream doesn't actually help you validate it at all.
They can always license the ARM architecture and get into the game at any time - for relatively little investment. They already have customer relationships in place to punt ARM gear if they wish to go that route. There are still big margins in x86 space - cutting down the ISA to size makes a lot of sense (especially the myriad of weird security & system management modes) from the point of view of being able to validate designs & implementations cost-effectively.
ARM implementations have the big advantage of being (relatively) trivial to implement and validate - case in point Fujitsu's A64fx - which represented a huge technical leap forward (HBM memory, hi-speed interconnect, variable length vector support) - but was implemented at a fraction of the budget that AMD & Intel lavish on their latest incremental iteration of x86/AMD64. When it came to shipping the hardware they already had a software eco-system ready to go as well. They could *not* have done that with x86/AMD64 at the time.
It's not written in stone of course, but generally fewer components -> less opportunity for mechanical mishap -> better MTBF. Those big old wardrobes full of TTL chips running a handful of text editors weren't any more reliable than a 128 core AMD64 box running a few hundred Monte-Carlo simulations. Really this comes down to choosing the correct tools for the job, besides which if you genuinely have a need for a small blast radius for some workloads you can always *under utilize* your hosts...
Migration as in the physical host goes down and the workload needs to find a new home. By design and definition there is an intricate and tight coupling between the fail-over partners - consequently a vendor would be entirely correct to be circumspect about mixing and matching software versions across fail-over partners.
I'm not talking about sensibly designed & operated systems here, I'm talking about real-world apps and systems. :)
It was a bit of a ramble-rant. I'll try and slice it differently... :)
1) Our HPC type workloads are the opposite, they vastly exceed the capacity of a single host.
+ Thus the blast radius of a host spans a fraction of the workload.
+ Checkpointing and rescheduling the affected portion of the workload to another host is how we address these failures (quick & efficient - see the sketch just after this list).
2) Typically VMs (and zillion core hosts) are used to aggregate lots of workloads that are *much* smaller than the capacity of the host together.
+ The blast radius of a host spans a lot of workloads.
+ VM migration can be used to mitigate this problem if you have some spare capacity (slow and burns compute host resources in the background).
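A minimal sketch of pattern 1 (checkpoint-and-reschedule), with invented names and a pretend failure model - in real life SLURM/LSF/an in-house dispatcher does the heavy lifting:

```python
# Minimal sketch of checkpoint-and-reschedule for a partitioned HPC job.
# All names and numbers are invented for illustration.
import pickle, random

def run_partition(part_id: int, state: dict) -> dict:
    """Advance one partition of the workload; may die with its host."""
    if random.random() < 0.05:                  # pretend the host exploded
        raise RuntimeError(f"host running partition {part_id} failed")
    state["step"] += 1
    return state

def checkpoint(part_id: int, state: dict) -> None:
    with open(f"part{part_id}.ckpt", "wb") as f:
        pickle.dump(state, f)

def restore(part_id: int) -> dict:
    with open(f"part{part_id}.ckpt", "rb") as f:
        return pickle.load(f)

partitions = {p: {"step": 0} for p in range(8)}     # workload spans many hosts
for part_id, state in partitions.items():           # initial checkpoints
    checkpoint(part_id, state)

for step in range(10):
    for part_id in list(partitions):
        try:
            partitions[part_id] = run_partition(part_id, partitions[part_id])
            checkpoint(part_id, partitions[part_id])
        except RuntimeError:
            # blast radius = one partition: resume it elsewhere from its
            # last checkpoint rather than migrating a live memory image
            partitions[part_id] = restore(part_id)
```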
The Blast Radius of a Zillion Core host is not a problem for HPC because your workload vastly exceeds the capacity of that one host and it will inevitably have been engineered to tolerate a bare-metal host failure (this *will* be battle-proven because the MTBF of a set of hosts is *MUCH* lower than the MTBF of a single host).
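The arithmetic behind that parenthetical, assuming independent failures and a made-up per-host MTBF:

```python
# Assuming independent, exponentially distributed host failures.
mtbf_host_hours = 200_000          # hypothetical per-host MTBF (~23 years)
hosts           = 1_000

failure_rate_fleet = hosts / mtbf_host_hours   # host failures per hour, fleet-wide
mtbf_any_host      = mtbf_host_hours / hosts   # mean time to *some* host failing

print(f"fleet-wide: ~{failure_rate_fleet * 24 * 365:.0f} host failures/year")
print(f"i.e. a host failure roughly every {mtbf_any_host/24:.1f} days")
```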
Meanwhile in the land where you are using Hypervisors to aggregate multiple workloads together and use VM migration to provide resilience to hosts exploding, the use of Hypervisors can actually multiply the blast-radius because they are necessarily tightly-coupled in order to do the complex task of migrating a few thousand processes & TBs of working set between hosts... Vendors / operators are reluctant to - or simply won't - support the migration of workload between VMs at different version/patch levels. Thus if you need to patch the Hypervisor layer you will need to patch hosts in batches - thus multiplying your effective blast radius as that patching happens (in practice that happens at rack level - so the blast radius is 16x bare-metal in the event of VM migration breaking or Hypervisors needing an upgrade).
... rant bit ...
Suffice to say, in practice, the real blast-radius problem wasn't a problem for our HPC workloads on bare-metal (because the hosts are loosely coupled and migration is cheap) - but it *has* become a problem for our HPC workloads that run under VMs - because the Hypervisor layer has multiplied the blast radius from taking down one host to taking down a rack at a time (because the hosts are now tightly coupled at the VM layer). We've seen this manifest itself as outages when running our workloads under VMs - something that the bare metal hosted workloads haven't experienced since a DC burnt down several years ago.
I suspect the main driver for our org forcing us to run stuff under private Cloud is to increase their utilization figures, as our HPC workloads achieve > 90% 365x24 on bare metal. The Cloud utilization figures went from ~17% to ~32% when a small portion of our HPC workloads was migrated to Cloud. ;)
HPC has *always* faced this challenge of Blast Radius - and it's survived just fine - if not thrived with many-core boxes... Typically workloads are partitioned and can fail-over to other machines or resume from a checkpoint.
I am now in a weird new world where I'm seeing HPC running on VMs (at multiples of the TCO of running this stuff on bare metal). The manglement haven't quite got their heads around the fact that HPC workloads *exceed* the capacity of the box, and that you are actually *reducing* the utilization of a machine by running it under a VM... There's no improvement to hardware resiliency at all - in fact there's an additional cost & (huge) overhead from the redundant fail-over mechanisms provided by the VMs... So while many-core blast radius hasn't really affected us guys running distributed (HPC) workloads, the VM *software* blast radius has. It turns out that a hypervisor upgrade has to happen across a cluster of nodes all at the same time - so rather than losing a single box we lose several racks at a time. Another "learning" for the VM Poindexters is that migrating 2TB active working sets across a network is a *lot* more expensive than simply resuming on another host from a checkpoint...
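A rough feel for that last point, with assumed figures (nothing measured): live migration has to haul the whole working set - and re-copy dirty pages while the job keeps running - whereas an application-level checkpoint is typically a fraction of that.

```python
# Rough comparison with assumed figures, not measurements.
working_set_tb = 2.0     # live memory image that migration has to copy
ckpt_tb        = 0.2     # application-level checkpoint (assumed ~10% of RAM)
net_gbit_per_s = 25      # hypothetical inter-host link

def transfer_minutes(tb: float, gbit_per_s: float) -> float:
    return tb * 8e12 / (gbit_per_s * 1e9) / 60

print(f"live-migrate {working_set_tb}TB image : "
      f"~{transfer_minutes(working_set_tb, net_gbit_per_s):.0f} min on the wire")
print(f"restore {ckpt_tb}TB checkpoint   : "
      f"~{transfer_minutes(ckpt_tb, net_gbit_per_s):.0f} min on the wire")
# ...and the migration figure gets worse in practice because dirty pages
# must be re-copied while the workload keeps mutating its memory.
```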
HPC on VMs is literally double the cost of bare-metal - and requires *far* more manpower to keep that oh-so-clever-but-unnecessary VM layer running.
TL;DR
Many Cores = great, more please for HPC, blast radius be damned - but serve them up as bare metal so we don't waste cycles & run-time maintenance overhead on the superfluous VM crapola.
In most cases I don't think VMs are actually needed - software that is fit-for-purpose, packaged and deployed properly isn't a problem that can be fixed by VMs - ameliorated perhaps, but not remedied. We're not running software on DOS boxes any more folks.
"2. There was potential to go SMP very early on,"
Discrete Microprocessor SMP was a dead end in practice. You never got anything close to linear scaling - maybe 30-50% more oomph for > 2x the component count (remember all that extra glue logic). The other problem was that process improvements meant single-processor clock speeds and cache capacities would pretty much eclipse any SMP solution within 6-12 months, regular as clockwork. That did change with NUMA, and then process economics combined with NUMA made multi-core chips worth the extra development & manufacturing cost.
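A crude way to see those economics (numbers assumed, not measured):

```python
# Crude perf-per-component comparison with assumed numbers.
single_cpu_perf = 1.0
smp_speedup     = 1.4     # "30-50% more oomph" from a 2-way discrete SMP
smp_components  = 2.3     # 2 CPUs plus glue logic / arbitration (assumed)

print(f"perf per unit of hardware, 1-way: {single_cpu_perf / 1.0:.2f}")
print(f"perf per unit of hardware, 2-way: {smp_speedup / smp_components:.2f}")
# ...and a process shrink 6-12 months later gives a single-CPU part
# comparable absolute performance with none of the extra board complexity.
```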
I suspect the biggest barrier to ARM's adoption was reliance on third party fabs; in those days those third party fabs were a step or two behind the bleeding edge proprietary foundries - so from the PoV of a desktop/server vendor there was a risk your entire line of business would be dependent on foundries that are a generation or two behind the competition. The DEC Alpha ran into this problem later on with Global Foundries and Samsung.
In retrospect I believe ARM actually did a pretty good job of navigating the swamp in the 90s (and of course Apple + VLSI Tech were major shareholders from day one of ARM).
... and a 5MHz crystal ... which was a lot cheaper & more widely available compared to the faster ones (back in the day). The other nice thing was that you didn't need a common clock or intricate distribution schemes when networking Transputers together - unlike, say, trying to get a pair of 68K's to run in lock-step. ;)
"transputers could have been killer GPUs. But alas no."
I spent a few months exploring a B419 - which had a T800 and some VRAM onboard. It was great for driving a nice high-resolution display - but the processor to video memory bandwidth just wasn't sufficient to do full screen animation; if I had been cleverer maybe I could have got some mileage out of VNC style compression to boost the effective bandwidth. The VRAM did offer a mechanism to do row-wise bitwise ops - but that was not really what I needed. The competition at that time was pretty awful too - anyone remember the i860? ... Thought not.
At that point in time, with DS links, there was a near future where a Transputer hooked up to VRAM could offer just enough bandwidth to the display to do some high resolution real-time rendering; it could have been interesting had it materialized.
Typically Transputers had four 'OS' links (low end Transputers had 2) running at 20Mbit/sec (and you could get close to peak on that); at the time the best Ethernet you could wangle was 10 Mbit, and you got a fraction of that bandwidth - if you were lucky. FWIW the INMOS engineers did know about transmission lines - but the emphasis was on low-cost interconnect and implementing it on a process that was optimized for *digital* circuits rather than analog.
For the ill-fated T9000 they implemented 'DS' links that ran at 100Mbit/sec, same deal again - they were intended to be cheap... IIRC those were being tested in the lab by 1991 - 100Base-T Ethernet turned up in 1995... They also implemented wormhole routing (that might even be another of their innovations) for the (32 port!) DS-link crossbar switch. As it happens one of the chip designers I lived with at the time explained to me how they were using incremental power up of line driver transistors (something I saw being touted as an innovation *20* years after INMOS did it), thus shaping the signal on the transmission line and reducing the ill effect on the ground plane (eg: ground bounce).
Some of the INMOS refugees went on to promote Transputer style links, see IEEE 1355, and I heard rumors that PCIExpress was somewhat inspired by DS links. While there may be some truth in that, I figure something like DS links and wormhole routing are pretty much inevitable if you want to scale up a system and speed up comms links because Physics. It's just that INMOS tackled it in the 80s, while Intel et al took another 10-15 years to get around to PCIExpress. :)
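For anyone who hasn't met wormhole routing: a toy latency model (all figures assumed) shows why forwarding small flits as soon as the header arrives beats store-and-forward as hop counts grow:

```python
# Toy latency model with assumed numbers - just to show why wormhole
# routing wins over store-and-forward as systems scale up.
link_gbit_per_s = 0.1     # ~100Mbit/s DS-link class serial link
packet_bytes    = 4096
flit_bytes      = 4       # small unit forwarded as soon as its header arrives
hops            = 6
router_s        = 500e-9  # per-hop routing decision (assumed)

def wire_seconds(nbytes: int) -> float:
    return nbytes * 8 / (link_gbit_per_s * 1e9)

store_and_forward = hops * (wire_seconds(packet_bytes) + router_s)
wormhole          = wire_seconds(packet_bytes) + hops * (wire_seconds(flit_bytes) + router_s)

print(f"store-and-forward: {store_and_forward * 1e6:.0f} us")
print(f"wormhole         : {wormhole * 1e6:.0f} us")
```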
DEC Engineers actually did build stuff people wanted - everything from LSI-11s, MicroVAXen, (normal) PCs, Ethernet (remember DIX?), MIPS, and eventually Alphas. The problem was the manglement kept pushing weird old shit on customers as well - like VAX-11s (74 series TTL discrete logic in the 80s - seriously folks?), VAX9000s, Rainbows, and VMS. Every time they'd see sense eventually - eg: shipping normal PCs, Ultrix, OSF/1, NT - but always too late and too half-arsed to make a difference. Folks also under-appreciate the crushing weight of the FUD that squeezed the investment, innovation and life out of any company not toeing the "No one ever got fired for buying Wintel" line.
Love him or hate him, Stallman's GCC project has opened a lot of doors over the long run - and given folks a way out of the Wintel trap.
I also worked at INMOS. As a PFY I found Transputers were very easy to write code for (in comparison to say PCs, 68Ks, 6502s, NS32016s, VAX-11s), whether that was in OCCAM, C or even ASM. For the record I am categorically *not* a boffin, and at least the folks I worked with weren't prideful, arrogant or smug - so you're tarring them with an overly broad brush IMO. FWIW I think some of the engineers at INMOS earned the right to a bit of pride - formally verifying a FPU from paper to silicon was amazing (and paid off - judging by the hit & miss efforts of the competition at the time), and coming up with a decent PLL on a digital process was a big deal too (something Intel eventually caught up with some years later via the 486DX2). Also easy to forget that the power/performance ratio of INMOS gear was in another league in comparison to the opposition at the time...
As you worked there, you would know that the design team for the 486 was bigger than INMOS. Yet INMOS produced cutting edge SRAMs, RAMDACs, Microprocessors, Compilers, DSPs, Video Controllers, developed FLASH & DRAM IP, and ran their own fab.
The brick wall for INMOS was securing sufficient investment to continue developing their products, which is something that all the clever engineering in the world isn't going to fix - not even a T800 with an MMU and a UNIX port would have been enough to fix the lack of investment.
TL;DR : Give the engineers a break - smug or otherwise.
Intel jumped the shark with Broadwell IMO. I recall being distinctly unimpressed with it when asked to benchmark it by Intel sales-folks. Drank power at idle, drank power at full chat, same throughput as our existing Haswells, but hey you could get a very slightly higher peak clock rate - that you would never see in practice because you are maxing out every core anyway... To be fair to them they didn't push their luck - presumably because they had a lunch date with some purchasing bod who was short of a couple of yachts later that week.
"Take it offline and keep using it with proper software and local storage. Move your work to a tablet on a memory card as required."
Or run Mint and skip the faff - which you can also run in a disconnected mode, and can lawfully make copies of the distribution media to store under your mattress if you insist.
Really doesn't matter if it was planned or otherwise IMO. Folks still get to vote in November to decide whether they want a convicted Felon, Rapist and suspected kiddie fiddler running the world's most powerful Superpower with total immunity from prosecution - or something else. The Gropey Old Perverts could try actually winning an election with popular policies instead of slinging mud, preventing folks from casting their votes, throwing lawfully cast votes out and attempting coups. Sadly they appear to be deliberately tanking their election chances with blatantly mentally unstable Presidential and VP nominees while yelling out their pre-excuse that the election will be rigged.
It is sad to see America getting taken up the rear end by a corrupt bunch of right whinging theocrats, rapists, crooks and grifters, and even sadder to see folks rooting for it. :(
"I mean, I liked NT 3.51. That was fast, efficient, and nearly bulletproof. "
YMMV as they say...
I found NT 3.51 to be much slower, less efficient and *really* flakey in comparison to Redhat Linux - on Pentium & Pentium Pro machines at least. Hardware support was pretty poor as well relative to Redhat, at least for the hardware I had to work with. At the time I was working on CPU & I/O intensive applications that ran for 8+ hours solid - so I was paying close attention to what the compiler was producing, and crashes were *expensive* in time and money. That was with GCC 2.7.x - a compiler noted for *not* producing fast or efficient binaries in comparison to the alternatives.
I'd agree that 3.51 was better than 4.0 for running code that you cared about - but 4.0 was nicer to use (when it worked).
FWIW the GPUs targeted at HPC aren't much better in my experience. ECC errors are common, and swapping boards because they are flakey is a routine occurrence. Folks even switch ECC off because it flags up errors so often that they'd prefer to have the wrong results rather than wait for an error free run. Yeah - and your observation about these cards being sensitive to heat also applies to the setups I work with - quite often we end up idling or underloading GPUs to keep them within an acceptable failure/error rate. None of this inspires much confidence in the QC of the vendors involved.
Back in the day one of the key differences between a RISC "Server/Workstation" CPU vs a PC CPU was the ECC logic on most (if not all) the internal datapaths, register files and caches - for precisely this reason... ML code tends to be run at scale on GPUs, which may or may not have that old-school "Server" grade ECC logic on every datapath, cache and lump of memory in them - and of course they are substantially larger dies and larger chunks of memory than those old school RISC servers. My experience with large jobs run across hundreds of GPUs is that it's rare that an hour goes by without a single digit percentage of them registering ECC errors (often double bit) - this is across several generations of a certain GPU vendor's products - it hasn't got better as time has gone by. That is assuming that the ECC is actually enabled - some folks like to switch it off because they don't like having their workloads stopped due to GPU errors (indicating that errors are too frequent to tolerate and that the folks in question are cretins).
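The scaling is easy to see with a toy calculation - the per-GPU probability below is an assumption, the point is what happens when you multiply it by a few hundred devices:

```python
# Assumed per-GPU error probability - the point is the scaling, not the number.
p_ecc_per_gpu_hour = 0.01      # 1% chance a given GPU logs an ECC error in an hour
gpus               = 500

expected_errors_per_hour = gpus * p_ecc_per_gpu_hour
p_job_sees_an_error      = 1 - (1 - p_ecc_per_gpu_hour) ** gpus

print(f"expected ECC events per hour across the job: {expected_errors_per_hour:.1f}")
print(f"probability the job sees at least one/hour : {p_job_sees_an_error:.3f}")
```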
Good of Meta to draw attention to this issue - but anyone running large jobs on GPUs should *already* be aware of this phenomenon. :)