I've got a bad feeling about this....
Curious that this happens just after the putsch that ousted Diane Greene et al. Would someone with such a reputation as a stickler for detail have had this happen on their watch?
Irate VMware customers were left unable to power up their virtual servers this morning because of a bug that killed their systems when the clock clicked round to 12 August. The bug was sent out to customers in ESX 3.5 update 2, VMware's latest hypervisor, which went out on 27 July. The version could have been downloaded and …
Being in Australia, we have a short reprieve still ahead of us, but after reading the thread on VMware's site, I already dread going into the office tomorrow.
I canceled scheduled maintenance on all virtual hosts and their VMs on pain of immediate sacking (OK, not really... I'm not that horrid a manager).
But, I will be dancing around the data centre, naked and in body paint with bone-filled rattles to drive out the spirits of VM crashes, which have surely been waiting for just such a moment.
If we make it through the next few days without the board of directors demanding our heads, I'll be hosting an open bar for the entire team.
Given the relatively low cost of hardware, I do believe the benefits of virtualization are oversold.
Also what's magical about Aug 12 2008, does it cross some boundary in a bit field, as when Windows hangs every 49.7 days ..
Computer Hangs After 49.7 Days
http://support.microsoft.com/kb/216641
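For what it's worth, the 49.7-day figure falls straight out of the arithmetic if you assume an uptime counter kept in milliseconds in an unsigned 32-bit value. That counter width is the assumption in this sketch; it says nothing about what VMware's code does on 12 August.

```python
# Assumption: uptime is counted in milliseconds in an unsigned 32-bit field.
MS_PER_DAY = 24 * 60 * 60 * 1000


def wrap_period_days(bits: int = 32) -> float:
    """Days until a millisecond counter of the given width wraps to zero."""
    return (2 ** bits) / MS_PER_DAY


print(round(wrap_period_days(), 1))
```

A fixed calendar date is a different animal from a wrap after N days of uptime, so a bit-field overflow of this kind would not by itself explain a failure tied to one particular day.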
VMware and all other virtualization "solutions" in the same category are nothing more than crutches for incompetent system administrators that can not properly load and/or tune their OS and applications. We have had their consultants in here, and sure, they found stuff to virtualize, but when pointed to the servers I am running, they just say "oh, well, those systems would not be suitable for our product " [insert excuse here]. The reason is that they were properly planned, and laid out ahead of time and are running at 75% load, right in the sweet spot where I like it ;)
Of course, that being said I should also point out, I don't do windoze ;)
So even if I'd had a test plan, and run through it and it all worked fine - how would that save me if the problem is date-related?
Unless you're telling me that part of your test is to run your vmware servers with every possible date - or that you just had a suspicion that August 12th was likely to be a problem....
I wanted to look up the article on poor timekeeping in a Linux guest (http://kb.vmware.com/kb/1420 ?) on a Windows host, having this morning seen my kernel rebuilt to 1000Hz lose seven hours since 6pm yesterday... but the kb website also seems to have a responsiveness problem...
Yesterday would have been a good time to short those VMware shares.
For everyone (including those at VMware), how do you test for a date problem? Do you set the date forward one day at a time until you have covered a five year period? Maybe we could step forward one hour (minute, second,) at a time until we know that the product will run at all dates and times? Clearly this is not a case where proper testing would have caught a problem.
Think before using the "you should have tested" out.
Whilst the application that's running in my VMWare server 1.0.5 is still running nicely - any attempt to log into it using the VMWare Console is failing.
Trying to boot the vmware instance from the console fails. Fortunately, it's set to start automatically when the host system reboots - so yes - I had to reboot this workstation in order to restart it.
Thank goodness I can ssh into it to stop or start it - but I'm wondering if this is co-incidence.
Regards
Neil
I doubt strongly that testing would be effective in this case.
That said.......
It is important to always test things before putting them into use on production servers. My thought has always been test it for a couple days, then hook up a couple of test computers (or a small somewhat ISOLATED segment of the network) and let (L)users brea... err test it some more. If it isn't COMPLETELY ballsed up by a day or two of that then push it to production machines. All told a week or less from download to production or the round file bin aka file 13.
Since this has been out two weeks, that wouldn't have prevented this. I agree with other posters that claiming a test program would prevent this is at best debatable (unless you are running some REALLY protracted testing).
That said, if you have a network environment and aren't testing updates from all vendors before implementing then you are asking for trouble.
All hands, cause given the wide implementation of VMware we are all in for shit for a couple days I suspect.
How many more incidents like this have to happen before somebody in an IT ministry somewhere in the world decides it's high time that software vendors were obliged by law to supply Source Code with every product, precisely in order to prevent this sort of scenario?
Testing for date problems is not exactly rocket science. I would suggest having a list of random dates you try during testing. In addition to that have several systems running with dates set in the future in test. Maybe 1 week in the future, 3 months in the future and 6 months. Hopefully the 3 and 6 month systems will catch date related bugs before shipment. The 1 week system will give you 1 weeks warning of a test escape.
I wonder why VMware did not do this.
Stephen
I saw this this morning, but didn't have time to comment. Now I see still no one has mentioned the DRM angle. Surely there is no functional reason for this. It has to be some bit of crap they tried to add in to ensure no evil pirates would run their product - odd considering by far the most likely audience for their product are huge corporations that don't dare do such things.
But as their attempts to virtually take over the world (pun intended) falter they probably blame pirates rather than the fact that their product adds precious little functionality to a data center. Sure it's neat and all, but when companies are cutting programmers it's probably a hard time for gee-whiz software sales.
One of the few marketing terms I know is "backlog". Best I've ever been able to figure out, it's the term sales people use for sales they "should have made", but honest boss "we'll close them next quarter". So, as people figure out their product is not a silver bullet to replace skilled system admins, and it costs a bloody fortune anyway - they probably put 2 and 2 together and got 22. Obviously the problem is we need a more aggressive DRM system.
...got left in, which is what caused the product to expire.
I guess the lesson to learn is that it wouldn't hurt to have a stage where you whack the clock forward an arbitrary amount of time (e.g. twice the length of a typical update cycle) and make sure it still runs in your test environment. Particularly given software with subscription based licensing, you should definitely be testing with operating system dates either side of the point where you expect the licence to expire, as they mark a known change in conditions.
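A minimal sketch of that kind of check, assuming the licence test can be handed the date to evaluate instead of reading the system clock. The names `licence_valid` and `dates_to_try`, the expiry instant and the 90-day update cycle are all made up for illustration; none of them come from VMware.

```python
from datetime import datetime, timedelta

# Hypothetical expiry instant, chosen only to mirror the date in the story.
EXPIRY = datetime(2008, 8, 12, 0, 0, 0)


def licence_valid(now: datetime, expiry: datetime = EXPIRY) -> bool:
    """True while the (hypothetical) licence has not yet expired."""
    return now < expiry


def dates_to_try(today: datetime, expiry: datetime,
                 update_cycle: timedelta = timedelta(days=90)) -> list:
    """Dates worth running the test suite under.

    Covers today, twice an assumed update cycle ahead, and one second
    either side of the known expiry boundary.
    """
    one_second = timedelta(seconds=1)
    return [
        today,
        today + 2 * update_cycle,
        expiry - one_second,
        expiry,
        expiry + one_second,
    ]


for when in dates_to_try(datetime(2008, 7, 27), EXPIRY):
    print(when.isoformat(), licence_valid(when))
```

A product that stops working at any of the later dates in that list would at least show up before the customers find it.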
No, I wouldn't have thought of this myself. ;-) In any case I believe the smugness above was due to the fact that the bug was made public before that poster's testing cycle happened to finish, so he was just lucky.
No, no, no, my good man. The approved witchdoctor garb is a grass skirt; no nudity. In your part of the world, however, a long penis sheath is the de rigueur accessory, worn either on its own or with the skirt.
Sheesh! Geeks! Especially managerial geeks! No fashion sense at all!
Yes, that's broken too.... all those freebies given out to convince the Hyper-V maybes that VMware is better are now broken as well... Shot themselves in the foot there.
PG
Paris because she is high Quality Ass(urance)
(no, I don't mean that, honest)
There's no good reason not to ship the Source Code to the paying customer.
It doesn't do anything to prevent piracy. And code plagiarism would be obvious anyway, if your competitors were also obliged to ship their Source Code.
All it does is create problems for users.
Until it becomes law to supply Source Code, or a decompiler exists, issues like this -- and worse -- will keep on happening.
Oh, here we go, here come the freetard gang again, with their clarion call of "open source is a panacea for everything". Well it isn't, so STFU.
For a start, not one in a thousand people have the skills, shit, not one in a thousand programmers have the skills, to read through the source listing of a hypervisor and spot a bug like this, unless it's something really glaringly obvious like a great big commented section that says "THIS CODE WILL CAUSE THE SYSTEM TO FAIL ON AUGUST 12". Certainly the average sysadmin has neither the time nor the inclination to do this kind of thing, even if they do have the appropriate skill set.
Plus, if you think someone with a market share like VMWare don't have a code review and testing process that would catch something that was easy to spot, you're clearly living in la la land.
And as a case in point, I'm currently filling in a bug report for a Debian upgrade that totally FUBAR'd my wireless IDS/IPS box, and that code was supposedly QAd by about a thousand developers, so clearly your suggested panacea doesn't work. Period.
If anything, this incident illustrates an issue with VMWare's QA process (although frankly, software is complex, and shit happens), _not_ with the closed source model. So put down your cheerleading pom poms and go back to downloading pr0n for your umbongo desktop. Spankard.
No longer unresponsive, now (deliberately) inaccessible: "This section of the VMware website is currently unavailable while we make important user improvements and upgrades to the site. We apologize for any inconvenience this may cause."
Someone's bonus seems to be at risk.
"if you think someone with a market share like VMWare don't have a code review and testing process that would catch something that was easy to spot, you're clearly living in la la land."
Er, anyone who thinks there's any reliable connection between a company's size/market share/visibility and the quality of their processes and products is surely living in La La Land, no? VMware aren't the only example... one classic (the 49.7 day crash) has already been mentioned here though iirc that was in a Win9x of some flavour.
"the average sysadmin has neither the time nor the inclication to do this kind of thing,"
Which is why enterprise-critical systems shouldn't be designed or deployed by "average" people (not that it stops most companies), they should be designed and deployed by that rare commodity, People With Clue (not me, but I know a few).
"a bug report for a Debian upgrade"
How does a one-off (?) failure of one Linux flavour in one set of circumstances to meet your requirements of the day suddenly mean the whole "open source" model is kaput? There are plenty of happy Linux users out there too (and a few unhappy ones, just like with Microsoft).
Anyway, access to source isn't just an issue of FOSS vs closed source. Back in the day, VMS customers with money and interest and competent (not average) techies could buy the source listings on machine readable media. No FOSS there, but if something catastrophic like this were to happen, the smarter customers would likely be in a position to fix it PDQ if the suppliers didn't.
Got out of bed the wrong side this morning did we?
This is too bad, but I think a lot of folks do not have a realistic understanding of software engineering or systems management. Everyone has bugs, and there is always a chance something bad will slip through.
Shipping source code is kind of a silly idea, it is nearly impossible to find a bug by inspecting a huge source code base, except during focused code reviews by knowledgeable co-workers as it is being developed. Customers don't want to spend resources trying to do that, and the Raymond-esque notion that an army of amateurs can do it is just ridiculous.
You really need a test organization, people who will run the code, stress test it and sleuth out bugs and process reports from customers. You also need a database system to remember and prioritize bugs. When we were working with UNIX programmers from ___ on a project years ago, they were just completely baffled by this concept, they never understood or used the bug tracking system, and routinely left the code base in a state where it wouldn't even compile. Didn't leave us with a very positive impression about hacker culture.
@"The Other Steve" -- don't forget to take your meds!
It's true that openness is not a magical solve-all panacea, but no-one said it is. It beats "play and pray" though! The point is well made that mandatory source-code disclosure would serve the interests of those who deploy and use computing resources.
-- begin quote --
"Also what's magical about Aug 12 2008, does it cross some boundary in a bit field, as when Windows hangs every 49.7 days ..
Computer Hangs After 49.7 Days
http://support.microsoft.com/kb/216641
-- end quote --
If you're still running Windows 95/98, hanging every 50 odd days is the least of your worries.
The first Patriot missile batteries deployed during Desert Unpleasantness Part I had a timing error that was only exposed if the system was left on for a sufficient length of time, allowing for decimal to binary fraction rounding errors to accumulate through repeated addition. This had at least two consequences:
1) The missile performed perfectly according to its lights, and went a number of meters to one side of the target instead of hitting it.
2) The magnificent explosions hailed by the media as Scud interceptions were really Patriot self-destructs to avoid mischief on ground impact.
The problem was later solved by a software update.
In this particular case, code inspection plus numerical analysis might have reasonably been expected to reveal the problem.
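The sort of numerical analysis meant here can be sketched in a few lines. The assumptions, taken from commonly cited accounts and not from anything in this thread, are that the clock counted tenths of a second and that 0.1 was chopped to 23 binary fraction bits; treat both numbers as illustrative.

```python
from fractions import Fraction


def truncated_tenth(fraction_bits: int) -> Fraction:
    """0.1 chopped (not rounded) to the given number of binary fraction bits."""
    scale = 2 ** fraction_bits
    return Fraction(scale // 10, scale)


def drift_seconds(hours_up: int, fraction_bits: int = 23) -> float:
    """Accumulated clock error after running continuously for hours_up hours.

    The total is the same whether the chopped value is added once per tick
    or multiplied by the tick count at the end.
    """
    ticks = hours_up * 3600 * 10  # assumed: one tick per tenth of a second
    per_tick_error = Fraction(1, 10) - truncated_tenth(fraction_bits)
    return float(ticks * per_tick_error)


for hours in (8, 24, 100):
    print(hours, round(drift_seconds(hours), 4))
```

The drift grows linearly with uptime, so the error that is negligible after a few hours is a third of a second or so after a few days, which is a long way at missile closing speeds.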
What is disturbing to me is Niemer’s cavalier attitude that nothing major is wrong and that if your organization is affected it’s your own fault for trusting their software. Personally, I would have liked to have heard VM’s marketing manager explain how important their customers are, how seriously they take any problem and how they will spare no resources in fixing the problem. Niemer left me with the impression that if they can find the problem and fix it they will, but otherwise they’re not going to lose any sleep over it.
...it just doesn't make sense to do that. Takes up a server, which may be Big Iron (thus costly), it won't run all the production stuff anyway and then what kind of bugs is one supposed to catch? And would one even recognize them? One might as well test the CPU adder circuit.
Hell, anyone who has been through a Y2K planning session knows the glazed look across the room when the questions "so what are we looking for" and "so what is the test plan and where are the people to implement it" come up. And in that case, the exact moments of interest were actually known.
It's alien...
In any worthwhile application suite where dates are of any great significance (where shift changes matter, week/month/quarter/year ends matter, leap years matter, etc), the application date (and time) should arguably be isolatable from the OS date+time, specifically so the application's date+time handling can be properly tested without screwing up date+time on the rest of the system.
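One way to get that isolation, as a rough sketch: the application asks its own clock object for the time and never calls the OS directly, so a test can shift the application's idea of "now" without touching the system clock. `AppClock` and `is_month_end` are hypothetical names invented for this example.

```python
from datetime import datetime, timedelta, timezone


class AppClock:
    """Hypothetical application clock: a time source plus a test-controlled offset."""

    def __init__(self, offset: timedelta = timedelta(0), source=None):
        self.offset = offset
        # Default source is the real OS clock; tests can inject a fixed one.
        self._source = source or (lambda: datetime.now(timezone.utc))

    def now(self) -> datetime:
        return self._source() + self.offset


def is_month_end(clock: AppClock) -> bool:
    """Example of date-sensitive business logic that reads only the app clock."""
    today = clock.now().date()
    return (today + timedelta(days=1)).month != today.month


def fixed_source() -> datetime:
    # 2008 is a leap year, so 28 February is not the last day of the month.
    return datetime(2008, 2, 28, 12, 0, tzinfo=timezone.utc)


print(is_month_end(AppClock(source=fixed_source)))
print(is_month_end(AppClock(timedelta(days=1), source=fixed_source)))
```

In production the clock is built with no arguments and simply follows the OS; in test, the offset or the injected source lets you park the application on a leap day, a quarter end or a licence boundary at will.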
But where the application design doesn't permit that, you fiddle with the OS date+time for those tests where it really truly matters (or, occasionally where appropriate and available, use a bit of clever software that intercepts selected date+time related system calls without actually really changing the system-wide OS date+time).
There's no guarantee that such testing would have spotted whatever caused today's VMware hiccup; competent code review sounds more promising.
Another reason for shifted time testing is the small matter of the transitions to and from daylight savings time, especially in applications which may be used across multiple time zones, zones which may not all be changing at the same time, and some of which zones may not even use whole-hour offsets. Maybe here you *do* want the OS to be running on the relevant date+time.
Otherwise you can take the Microsoft/VMware-compatible approach you seem to recommend: write the code, take the money, ship it, and hope.
Have we all done our Y2038 testing by now?
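For anyone who hasn't, the date in question comes from a signed 32-bit count of seconds since 1 January 1970 UTC running out of room. A small sketch of where the boundary sits, assuming that representation (systems with a 64-bit time value are not affected in this way):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_INT32 = 2 ** 31 - 1   # largest value a signed 32-bit counter can hold
MIN_INT32 = -(2 ** 31)    # what the counter wraps to one second later


def as_utc(seconds: int) -> datetime:
    """Convert seconds since the Unix epoch to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


print("last representable second:", as_utc(MAX_INT32).isoformat())
print("value after wrapping:     ", as_utc(MIN_INT32).isoformat())
```

Setting a test box's clock to a few minutes before the first of those instants and leaving it running is the obvious first experiment.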
So at least 2.5k people have worked out a reason to use a VM... please enlighten those of us who don't think having yet another layer of slow software in the way is a good idea for anything close to production.
I know the VM chaps that keep bugging me in the day-job wheel out a huge list of so called benefits, but none of them seem to stand up to any real serious scrutiny.
Such as..
* Cost - what's that? Equipment is a business expense, and commodity Iron is cheap.
* Isolation - Err that's otherwise known as chroot, permissions, security.
* Standardisation - Err buy the same Kit / OS (now that's standardisation done properly).
* Consolidation - Err don't buy too much crap in the first place.
* Testing - I'll give them that one, they are slightly useful for testing.
* Mobility - Err that's called redundancy (hot/cold-spare) in the real biz world, or a Disaster Recovery plan. Or better still Load Balanced with capacity.
* Hardware Support - Err that's why you choose your hardware carefully, and even more carefully choose the OS with the driver support. Come on dummies.
One of the more startling problems with VM's that the sellers of VM's neglect to mention is that by using VM's you have all your eggs in one basket. Now that is dangerous.
In my day-job VM's were considered by the high'n'mighty, but I soon put the kybosh on that with some well placed questions to the VM software sales / technical meetings. Everyone came away knowing VM's are for companies that are downsizing.
I have advocated for 20 years that Redundancy & Resilience cannot be met with any of the new-fangled stuff that comes to market. Good old fashioned planning and preparation is what counts, not being able to move an OS from one box to another because the first has died - Hey, isn't that a Hot/Cold Spare? ... so why have Slow-ware (that's VM's to the un-initiated) in the way?
Guys, invest in a Load Balancer (they call them Application Switches now BTW), you won't regret it, and with a little bit of programming thought, your programmers will see the benefit of being able to scale-out in a very big way.
I know VM's are not the way forward, it's a shame so many others have yet to discover this :(
And no, clustering ain't the answer for the other end of the spectrum (nor cloud computing).
Good luck suckers, you'll need it with any Slow-ware.
win98 nice love it however not used now for about 9 weeks. 50 days mmmm normally the memory leak gets you first lol
Will VM it sometime.
Funny thing is the microsoft site lists "Microsoft Windows 98 Standard Edition" not the first time I have seen that there.
BTW if you are from Microsoft SE is Second Edition
OK, so maybe it wouldn't have been noticed in time if the Source Code had been out there with the customers. We have no way to know.
But the fact remains that there is absolutely no good reason that will stand up to the briefest moment's scrutiny why customers should not be given full access to the Source Code of any applications that they intend to run on their computers. Not one single reason.
Therefore I stand resolutely by my position, calling for mandatory Source Code disclosure and stiff penalties for non-compliance. Vendors who have nothing to hide, have nothing to fear.
Note, I'm NOT saying people should necessarily be allowed to distribute copies of software at will; although since the absence of Source Code has done nothing to prevent this, it is unreasonable to suppose that the presence of Source Code will make this any easier. I AM saying that people should be allowed to examine and modify the Source Code to any software they are properly authorised to use, to delegate such activities to third parties and to pass on details of any modifications they may make to other authorised users of the same software without let or hindrance from the vendor.
This would open up a lucrative secondary market, creating jobs within the IT sector: certifying software as fit for a particular application, and adapting it to the way people do business, as opposed to vice-versa.
Nobody would eat a cake if it didn't have on the packet a list of the ingredients and how much fat, protein and carbohydrate it contained, would they? And I don't think many people would buy a car if the manufacturer refused to allow them to fit fluffy dice, transfers, beaded seat covers or anything that plugs into the cigarette lighter, but forced them instead to trade in their car for a brand-new model with ever-so-slightly-different controls because the old one would not drive down roads that had already been driven on by one of the spiffy new ones.
I am convinced that the only reason anybody puts up with this sort of behaviour around computer software is that most people just haven't been around computers long enough to have seen that there used to be a better way.
(Oh, and by the way: I don't download pr0n. When you've seen one naked body, you've seen them all; and when you've actually seen a real one, computer graphics don't cut it anymore.)
I just get this feeling when using VMware that it's buggier than it should be and the company just seems to accept that state of affairs. So something like this probably has to be expected. Their reaction that, basically, it's not that big a deal just confirms for me that I wouldn't want to run anything really critical on it. That's why I don't.
Rarely do I get a good laugh out of the commenters on El Reg, but today, my thanks go out to The Other Steve with this down-to-earth comment:
"For a start, not one in a thousand people have the skills, shit, not one in a thousand programmers have the skills, to read through the source listing of a hypervisor and spot a bug like this, unless it's something really glaringly obvious like a great big commented section that says "THIS CODE WILL CAUSE THE SYSTEM TO FAIL ON AUGUST 12"."
Virtualization needs to be done well, with the same level of planning and preparation as any other deployment solution. Some of you seem to think VM is a cop out for lazy or unskilled admins to not have to tune their boxes, but anyone effectively using VMs will tell you they had to do plenty of tuning to get the VMs humming.
The benefits of virtualization are obvious to people who are good candidates for virtualization. Not a good fit for you? OK. I run my shop with half the people I would need without VM's and I save my company tens of thousands of dollars every quarter and can deploy new servers in minutes.
Bottom line, we spend a lot less on hardware, electricity, and payroll. And our response times have gotten better every month. And our disaster recovery could not possibly be simpler and faster. We do a full DR simulation every year. Full rebuild from backup tapes. It's a breeze. Can't imagine doing this without VMs. Or maybe I can and don't want to.
Totally agree with AC here. In my organisation, we have 300+ physical boxes, and we are undergoing a "consolidation" regime to move them ...to 300+ *virtual* boxes. It totally flabbergasts me - we still have to pay umpteen squillion for each of the Windows licences, and then there is the VMware licence on top of it. Sure, there are some hardware savings, but there is no cost reduction for installation and maintenance. The hardware savings would be the same (if not better without the overhead) if you did a normal kind of consolidation.
When I mentioned such arcane things as running more than one web-app on a box, or more than one server app (with a local client) on a server, I got mutterings of "compatibility issues" and "performance". Hello? You sort out your compatibility problems by installing apps on the same box that play nicely with others, and as for performance, if you double your RAM, CPU and disk spindles (after eliminating obvious memory leaks and the like), your performance will no doubt improve and cost less than all those stupid VM and Windows licences. Gah!
Time related bugs are indeed hard. Although this may well not have been a bug in the code, more a bug in specification, if indeed, it is a licence issue. Code inspection of the licence code would have revealed a perfectly working sub-system.
Time issues are caught in New Zealand. Companies used to ensure that they had very good relationships with customers in the land of the long white cloud. Because that is the first place the bugs come to light. Gives them nearly 12 hours before it hits Europe and up to 18 hours for the US. (Note to earlier poster - today is the 13th in Australia - we get the bug 2 hours after New Zealand and well before most of the world - NOT after.)
Very very often bugs are not in the code. They are in the specifications. The majority of very well known big time bad bug examples can be traced to the specification, and thence to perfectly correct implementation. The Patriot Missile is a perfect case in point. There was no bug in the Patriot code. The specification called for a missile system that was intended to be highly mobile, and would be set up roughly once a day in a new location. The required drift spec for the clock was derived from this. During Desert Storm the missiles were set up in fixed locations and no-one realised that this would result in the system remaining operational for longer than the time the clock was specified to remain within bounds. The fix was as simple as rebooting the system daily.
Building and managing large code systems is hard. There is a lot of snake oil out there that claims to provide magic (silver) bullets to cope. Most are a waste of time, or only useful in very constrained environments. Building an accounting system is a very different beast to an operating system. But it sounds as if VMWare need to get their release QA process sorted. This one should have been caught.
i was at a customer site today in Oz where 50% of their infrastructure was DOWN HARD. from what i overheard they couldn't get through to the Virtual Support Drones and spent the majority of the entire Oz business day rebuilding literally hundreds of servers to remove the patch.
regardless of platform, imagine most of your servers down all day. and imagine the urge to not piss off to the pub at 0900 when they came in and found Nightmare on Virtual Street.
i just happened to have to hang around most of their day as we were trying to get some apps installed, and i'd be the first to say that this won't be the last that the VM guys hear of this. it smelled worse than a bad Blooper Patch Foolsday from Redmond.
For anyone that doesn't want to use VMware but wants virtualization (or consolidation more accurately), they should take a look at OpenVZ. It's free and works very well. It only runs Linux, so don't plan on using Windows with it. That's actually a feature: you save a ton of money on Windows licenses.
VMware is for the birds.
except maybe the single point of failure ... and unknown hardware contention but we'll see that one later.
I still can't believe production systems are running using this thing.
One extra layer of crap that is very handy during functional/business testing ... but heavy load, stress & volume, network & disks. I'll be watching from the sideline.
Storage virtualisation ... when you are used to precisely locate data on your spindle to max the perf, Trix let us know when it goes tits up, so you don't feel alone when the "I told you so" moment comes.
Of course they're using it - It's ESX - it's the enterprise virtualisation system with a proven track record... thousands of businesses rely on it.
When the free one came out I even moved to it at home because it was amazing how much more efficient ESXi was than VMWare Server (it's able to do far better resource management as it's a custom OS with a tiny footprint).
VMWare haven't been exactly forthcoming on this bug though. They started OK.. emailed everyone on their list and said they'd update 'every two hours'.
Somewhere between sending that and actually working on it they changed their mind.. not only did they not update every two hours they deleted their kb article referring to the bug so it's impossible to find out what the state is now. Not even microsoft attempt that kind of news management.
Programs have bugs. Linux has bugs. Firefox has bugs. VMWare has bugs.
If you sysadmins really believe that all software on your systems is 100% bug free you should be fired. Why is this 'a really big deal' for VMWare? They're embarrassed, the developer responsible is embarrassed, the code reviewers are embarrassed, the static analysers are embarrassed. Apart from that they really couldn't give a shit.
Shit happens. It'll happen again.
@benefits of virtualisation
yeah, good URL: it states:
RESOLUTION
To resolve this problem, use the appropriate method:
>Back to the top
(as in there is none... meh - it's for Win95/Win98...)
@VMs why!!!
Hooray for someone who has their head screwed on... this AC distilled it all in one bucket: iron is dirt cheap, standardise on one OS, plan+prep, and use applic layer balance Xrs to mitigate the user load - dude, you one top man! Welcome at my place any time for a beer....
lastly... did VMware screw itself by itself, or was it change of date by an app that did it? Either way what a hilarious fuck-up... what else is waiting in the wings - system call to query the OS type and it deletes the boot partition? ... caveat-emptor for you boys+girls who want to cut corners....
[hint: never employ Ex-MS execs - MS implant funny devices in their heads b4 they leave....'resistance is futile' is replaced by 'prudence is infantile' as a disguise]
skull+crossbones coz you gotta have a pirates mentality to survive in the IT marketing world of utter bullshit... (and quoting 'parlez' when you're caught out will only attract a quick walk down the plank)
My company was looking (was being the big word) at using VMWare, had a salesman in yesterday telling me how great VMWare was and how stable etc (usual salesman bit).
Weirdly he didn't mention this bug, anyone care to draft me a response to his email today asking when we were thinking of signing up for VMWARE ?
You can only use the following words - Hell, Freezes, Over, When....
Why do you think so many companies have long adoption cycles for new operating systems and software?
Cliches exist for a reason. "Fools rush in", and so on.
When Windows 98 came out, a company I worked for decided it was finally time to upgrade all of their desktops and laptops to Win95. The company I am at now, and all of our clients, are still using XP. Why? Because you don't need the latest and greatest updates for everything.
Whether it's open or closed source software, there will be bugs.
As for the immediately previous poster's comment (unless some got squeezed in before I clicked 'Post'; I am referring to the Paul who hates salesmen), if you were to take the same policy towards every piece of software (rejecting it due to one bug), you would at best be using MS-DOS if not manual typewriters or pen and paper for everything.