Sounds familiar
Didn't the US POTS phone system go down for a similar reason years ago?
They introduced an upgrade into a switch, which then failed, but only after it had passed the upgrade on to the next switch, and so on.
Chastened by its pre-Christmas mega-FAIL, Skype on Wednesday explained in detail why the titanic titsup takedown happened and how the company plans to ensure that the globe will never again go VoIP-less for an extended period of time. For those of you tuning in late, last Wednesday the online telephony service began to wobble …
As I remember it, the reason for that failure was a missing "break" keyword at the end of one branch in a C switch statement. A common programming error in C, and one not caught by the compiler, because according to the rules of C, the execution then continues in the next switch branch if there is one. This is one of the worst design flaws in C (and amazingly, aped by many related languages)!
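For anyone who hasn't been bitten by it, a minimal sketch of the hazard (the enum and the handler are invented purely for illustration, not the actual switch code):

#include <stdio.h>

/* Illustration only: the RECOVERING case was meant to stand alone,
   but without a "break" control falls through into FAILED. */
enum state { OK, RECOVERING, FAILED };

static const char *describe(enum state s)
{
    switch (s) {
    case OK:
        return "node healthy";
    case RECOVERING:
        printf("notifying peers of recovery\n");
        /* missing "break" here: falls through to FAILED */
    case FAILED:
        return "node down, rerouting traffic";
    }
    return "unknown";
}

int main(void)
{
    puts(describe(RECOVERING)); /* prints "node down, rerouting traffic" */
    return 0;
}

The compiler accepts it without complaint, which is exactly the problem.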
It IS desirable for a switch() statement to have the "fall through" capability, as there is quite often the need to group several possible cases together and process them with identical or nearly identical code.
Look at this:
http://codesearch.google.com/codesearch/p?hl=de#1oUPVh-C1Wg/trunk/eval/gx/symfony/web/sf/prototype/js/controls.js&q=switch&sa=N&cd=5&ct=rc
Read some daft posts on here, but dismissing C (one of the most elegant and constantly relevant languages on the planet today) as being '...a bit old now', and stating that the use of C++ (a different language altogether) shows that limited progress has been made, just took the biscuit.
I'm not normally confrontational on these forums, but to harbour such beliefs indicates a pretty narrow field of personal expertise. I have to say that anyone even reasonably knowledgeable in computer language theory, or simply having practical expertise in many different programming languages, whether compiled (C, Delphi, VB, assembler), interpreted (BASIC, Java etc) or JIT semi-compiled (VB .Net, C#, J#), would have found these comments unbelievably crass and ill-informed.
..."simply having practical expertise in many different programming languages" would no better than to place Java in the interpreted group, instead of "JIT semi-compiled".
Ever heard of HotSpot? And do you even know how Java programs are executed?
On another note - you criticize the predecessor's arguments as not relevant (which is true), yet you don't seem to provide your own - e.g. just how is the fall-through switch statement "elegant"?
@AC - Do I even know how Java is executed? Having worked on compilers and interpreters most of my working life, I think so, a little. Pure Java is pseudo-compiled down to P-code, which is a form of super-tokenisation really, not JIT-precompiled. Not the same thing. And the elegance of C code is in its many other aspects, such as being able to both modify and test variables at the same time, as in (very simply):
if (y--) {
// Do this
}
Of course this can be achieved in most other languages with a little more code, but the particular beauty of C is that it was built to work this way, keeping statements succinct.
However, to answer your valid point, default fall-through is indeed elegant and should be the default in all cases. It allows you to create cascaded tests. Understandably, not everyone will agree though.
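For what it's worth, here's a small sketch of the cascaded grouping I mean (the option letters and flags are invented purely for illustration):

#include <stdio.h>

/* Deliberate fall-through: several cases grouped onto shared code. */
static void handle(char opt)
{
    int verbose = 0, quiet = 0, logging = 1;

    switch (opt) {
    case 'v':
    case 'V':                 /* both spellings share one handler */
        verbose = 1;
        break;
    case 'q':
        quiet = 1;
        /* fall through: quiet also implies no logging */
    case 'n':
        logging = 0;
        break;
    default:
        break;
    }
    printf("%c -> verbose=%d quiet=%d logging=%d\n",
           opt, verbose, quiet, logging);
}

int main(void)
{
    handle('V');
    handle('q');
    handle('n');
    return 0;
}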
50% of the userbase on a flawed version caused a cascade failure that rendered the network fully inoperable, so the solution is to make sure 100% of the userbase is on the same release? Seriously?
I would SLOW the dissemination of releases, so that 20% of users are two releases behind, 30% are one release behind, 35% are "current" and 15% are "experimental." Any release that was fatally flawed could be marked as "KOS" at the same time a new "experimental" release is declared. This way at MOST 35% of users will be on a flawed release.
Not upgrading everyone at once is actually a good move, for the exact reason stated in the story. Their plan is not bad, it just didn't actually account for a failure on the scale that was possible. The correct solution is to DECREASE the scale of potential failures, not INCREASE it.
The best way to make a distributed architecture stable is to deliberately make it more diverse. Having no more than, let's say, 50% of clients on the same release is the right approach here. It is not a guarantee of course, but it tends to make things more reliable. Having sufficient numbers on different platforms is also a good idea. And so on, as long as a complete failure of one release still allows the rest of the network to function.
I'm sorry, but this is a pretty poor idea. Bugs tend to persist across multiple versions until they are fixed. A bug like this, which doesn't crop up until a particular failure mode triggers it (delayed messages in this case), might lie dormant for years, meaning it is remarkably easy to have the bug exist in 2 or more of the versions you have in the wild. Not being able to rapidly upgrade the software you have out there means you can't fix major bugs quickly, because people don't upgrade quickly. If you follow this methodology you end up with half the Internet running a buggy security disaster like IE6!
The obvious "right thing to do"TM is pretty simple. The Skype Client and Skype Server should be separate processes on every machine. The server should do very little other than talk to the P2P network and pass messages on to the client. The client can parse all the messages and do the stuff that will likely crash. Hopefully over time the server becomes very stable and is rarely updated and hence likely to have very few bugs; the client becomes the thing you keep changing as new features come on line.
Fixing and upgrading everyone at once sounds like a good idea on the surface, but it is only so if you can be completely sure that you do not introduce a MORE problematic bug during the operation. From what was said in the article they use their software to update itself, so by forcing an update to everyone at once, a newly introduced bug (or did you think that your fixes couldn't introduce new, or exacerbate existing, bugs?) is forced onto the entire network, with no systems left in the network to provide service to the machines coming back online as they are repaired.
From what Mr. Rabble has said, having forced EVERYONE onto 5.0.0.152 would have made the network less resilient, as everyone would have experienced the problem immediately, rather than there being some time for it to cascade. If fewer peers had had to fail over at the same time, because they were running a different release (from his report, ANY different release), the network would likely only have suffered the hosts running the affected versions falling off, and a minor slow-down for everyone else, rather than a complete cascade failure.
Yes, I recognise that you can have bugs across multiple releases; that is exactly WHY I put it all the way up at four releases. Your IE6 comparison is a complete non sequitur, as I EXPLICITLY INCLUDED a way to force the elimination of particularly problematic releases from the network. I am not saying users decide when to upgrade (on a normal basis), but the network does. While this may leave some users vulnerable for some time, the objective is to protect the network from catastrophic failure.
Frankly, dividing the software into multiple interacting programs (I'm not sure stopping at two makes sense) would probably make transparent upgrades far more seamless, and works well with making sure there are always multiple versions in the network.
In addition, if you were on Windows and having problems with Skype, what would be high on your list of things to try... maybe reinstalling Skype? In which case most of the affected users pull the latest version OOB (from your web site) and it doesn't matter.
Simply put, there was a small flaw in the architecture there. If the cluster controller went away, the rest of the cluster looked for a new arbitrator. DEC had determined and enforced that the earlier a MAC address you had, the better qualified you were to serve as the cluster controller, since of course it would always fall back onto a classic VAX.
Until.... the physical MAC address pool ran out, and they needed to start reusing hardware MACs for PC controllers.
If you've had a VAXcluster fall onto a 286 PC as cluster controller, you'd know empirically that you have to define a class of trusted systems that you always look for first.
Way too early for the Sky Hype guys, although they could have read about it.
Actually, that could be a real pain if you had a mixed bag of a cluster. Our first (V4) cluster included an 11/785, an 11/750, an 8600 and an 8700. You didn't want the 750 to own any of the master capabilities (including mastering any distributed locks) if you wanted any decent performance!
Happy (and simpler) days!
@swschrad
The underlying design flaw there was to have the network address be the MAC address, and to decide to override the hardware MAC address with the DECnet network/MAC address.
Most dumbass design I've ever seen. Fell off a chair when I learnt of it.
Made my BICC MPS (Multi-Protocol Support) DECnet driver hard to get to play well with the ISO/OSI driver. And had to jump through some hoops I'm quite ashamed of when the card driver for the 16-bit card didn't allow the MAC address to be changed - like scanning through the driver code looking for the MAC address and then changing it there.
Only in MessyDOS.......
...But why?
Now ask yourself why would a specific version, especially a new version, cause such a catastrophe?
The questions we should be asking are (a) exactly what was the bug, and (b) what is this latest version doing that's so radical that no previous version (over all those years) ever caused such a failure?
Seems to me that Skype needs to become an open standard. Final questions: who has vested interests in keeping Skype closed, and why would they want to keep it closed?
Encrypted end-to-end it might be, but I still don't trust it.
There are any number of companies providing SIP to PSTN gateway services at prices similar to Skype. They don't care if you use open or closed software clients, or a physical SIP handset. Since many (most?) home ADSL routers now include transparent SIP proxying, the old problems of getting it through the firewall should be gone.
You could even be your own gateway provider with a box like the Linksys SPA3000 plugged into your landline. Then there are things like Asterisk and FreeSwitch, but they're probably beyond most home users.
At least here in Germany people normally have free fixed-line calling as part of the telephone/DSL contract.
The OS community could simply allow other people to use their land lines to call fixed-line numbers when they don't need them themselves. ISDN cards can be fully programmed.
Actually it is. The Linux version of Skype has been stalled at... let me check... 2.1.0.81 BETA for quite a while now. Because it hasn't been upgraded for ages, it just couldn't pull such a stunt as the Windows version. Programmers need to "improve" and "fix" the software for shit to hit the fan. Although I'm not happy that the Linux version is 3 major versions behind the Windows version, it seems it's not entirely a bad thing :)
Would some form of back-off on clients trying to reconnect have helped with this? I'm thinking that this way the supernodes would have had time to re-establish themselves without being slammed with a huge amount of traffic.
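Something along these lines, say (just a sketch; the constants and the try_connect() stub are invented, not Skype's actual client code):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sketch of exponential back-off with random jitter for reconnects. */
static int try_connect(void)
{
    return 0; /* stub: pretend the connection keeps failing */
}

int main(void)
{
    int delay = 1;                /* seconds */
    const int max_delay = 300;

    srand((unsigned)time(NULL));

    for (int attempt = 1; attempt <= 8; attempt++) {
        if (try_connect())
            break;
        /* jitter spreads clients out so they don't all retry at once */
        int jitter = rand() % (delay + 1);
        printf("attempt %d failed, waiting %d s\n", attempt, delay + jitter);
        /* sleep(delay + jitter); -- omitted so the sketch runs instantly */
        delay = delay * 2;
        if (delay > max_delay)
            delay = max_delay;
    }
    return 0;
}

With jitter in there, reconnecting clients spread themselves out instead of all hammering the supernodes at the same instant.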
Any idea where they found the bandwidth/processing for their mega-super-duper-nodes to fix the system? I guess it'd be one of those things that processing on demand would be pretty handy for?
--Any idea where they found the bandwidth/processing for their mega-super-duper-nodes to fix the system?--
Presumably one of those virtual instance resellers, Amazon AWS, Rackspace, Azure et al. At least one advantage of virtual machines is that you can copy/start them almost ad infinitum, until either the cloud can hold no more or your bank balance holdeth no more either.
From the above, "...globe will never again go VoIP-less" is a crazy thing to say; after all, Skype is only one of literally thousands of VoIP networks. However, Skype already has the severe limitation of being proprietary, and thus does not interface with anything else, hardware or software. This is exacerbated by the P2P system it relies upon, which as described above lends itself readily to an "avalanche" type of failure. The VoIP industry's open standard SIP is far more widespread and far more flexible, and so the majority of global VoIP users were not affected by the Skype catastrophe.
You clearly know little about the SME sector. Very many use Skype as an excellent and low cost communications system, let alone all the personal users. As I write this there are nearly 18m online.
Whilst big business has its big IT budgets and largely wasteful IT departments (and yes, I speak from experience), the small guys make their hard-earned money work hard. Some will have learned, though, that putting their trust in Skype alone without a fallback, like anything else in IT, was a stupid thing to do.
But most of us missed Skype because it is to us, an excellent comms tool despite the fact that since v3.8 the GUI has gone, well, gooey, like all other apps it seems.
Maybe ... since they introduced Mega SuperNodes to help alleviate the problem, they should run some of their own (or use some VMs in the EC2/Azure/Whatever cloud) so they can help stave this off in the future?
"We found the fix. We added more computers to handle the increased load."
"Great. Now that we have a band-aid on it, what are we doing?"
"We're going to take all those bloody extra computers offline!"
Mine's the one with the mega super node in the cloud.
Just a thought that popped into my head, but you have to wonder where the heck they pulled thousands of mega supernode servers from on basically zero notice. Seriously, where and how did their engineers activate so many in such a short time? Provisioning thousands of servers on any network isn't a trivial task.
Then I wondered: why doesn't Skype, with its wonderful P2P model that generates revenue on Internet capacity paid for by someone else, have a server farm of really big supernodes to handle this kind of thing? And if they do, and this is how they activated 1000s of mega supernodes so quickly, why are they so keen to withdraw them from service as soon as possible?
Then it struck me. Those clever chaps had activated their own botnet of Skype clients and promoted thousands of ordinary customer peers to be mega supernodes. I can't think of any other way they could so quickly provision so many servers on a distributed basis in such a short time. It's no wonder they want to retire as many as possible as quickly as possible, I might be somewhat miffed if my PC and Internet bandwidth were suddenly being eaten alive to serve Skype.
Two suggestions above are strikingly logical. 1) stagger the software releases so that you don't have a high predominance of a single version of your peer server code, just in case, and 2) alter the back-off code so that when a new peer server attempts to join and finds the network is busy it doesn't just hammer the servers into submission, nor do all peers back off for the same time.
Seriously though, where can I get 1000s of supernodes at zero notice?
Either, they provisioned a bunch of virtual devices from an online provider, or your suggestion of promoting "ordinary" users was what they did. To be honest though, as I understand it *any* Skype user can be promoted to supernode - although there is a way to disable it if you want... It's all part of the Skype experience that you sign up for though.
Though IIRC even being a supernode isn't a massive drain on your network.
Similar mechanism to the Great Northeastern Blackouts!
Here we have another unpredictable 'complexity' failure in a gargantuan system. Like the Northeastern Blackouts, they strike when least expected, never get fully understood and cannot be properly analyzed with the tools available.
State analysis methods, if possible, would require a computer the size of 'Deep Thought' and take just as long as it did to calculate '42'. The fact is many of our large engineering systems are vulnerable to 'complexity/scaling' failures and we shouldn't be quite as surprised when they happen.
Let's face it, we've all been aware of them for over 40 years.
It's unlikely they 'promoted' lots of 'ordinary customer peers', as it isn't something you choose when you sign up to Skype whether you become a supernode or not, it's done based on the type and quality of your connectivity, so if you fire up Skype on a very good internet connection (no NAT, lots of bandwidth etc) you'll almost certainly become a supernode and start handling directory info and routing other people's phone calls...
I suspect they got these thousands of 'mega supernodes' from somewhere like Amazon EC2, or other cloud providers - with something like that once you've set up one image firing up thousands of copies of it is trivial, it's not like they had to set each one up manually.
But still,
if they used Amazon, or another cloud... or just 1500 servers with 4 virtual machines,
why did it take such a long time before everybody was online again?
We're just talking about... what, 20 million users. That's less than 3,333 req/sec; everybody should have been online in no time, not 12+ (or more??) hours!!!
If you were an engineer, you'd realise that fixes don't happen instantly. First they had to identify the problem, then they had to create a solution (possibly going down a few blind alleys before hitting on a working one), then they had to come up with a method to implement that solution on a scale large enough to alleviate the problem, then they had to actually go and implement that method, and once they had found it to work they had to monitor the situation and tweak their method as necessary.
If you're not impressed that they did that in the time that they managed to do it in (can't be bothered to go and check the article again, you say 12+ hours, if that's correct then I think that's a miracle on par with turning water into wine, but then I live in the *real* world) then you are clearly not, and have never been, an engineer.
I thought the client detected the amount of bandwidth available to it, and became a supernode automatically over a certain threshold. This is the reason networks in the UK like SuperJANET don't allow Skype. There's a report on their site somewhere; I remember reading it last year when working for an LA that bought an overflow egress from them.
> Routing traffic through a "supernode" is a technical requirement in an inter-network
> that can contain 1-to-many NATed intra-networks.
No it isn't.
Routing the session initiation through a controller is a requirement - that's how the clients find each other. Once that is done, the session can be handed off to the clients to handle.
That's how SIP works, if you want it to.
Vic.
And, given the example described, on what socket do the clients talk to each other?
Neither host can establish a socket to the other, because both can ONLY talk to public IPs. For new incoming connections to the public IP, it is unknown which internal IP they should go to.
Dealing with this is the reason for TURN. To understand the problem, you can use the wackipedia article as a decent jumping-off point:
http://en.wikipedia.org/wiki/Traversal_Using_Relay_NAT
Even using TURN, an external TURN relay is still required, and therefore I stand by my statement (it is being used in place of the "supernode," but the point remains that someone is still in the middle relaying your messages, for a technical reason, not (just) to spy on you).
> And, given the example described, on what socket do the clients talk to each other?
That depends on the configuration of the clients. If there is no pre-existing config, you'd use a SIP server to set up the session. Just like I said in my post...
> Neither host can establish a socket to the other
Yes they can.
> because both can ONLY talk to public IPs.
You've not read up on STUN, then?
> Even using TURN, an external TURN relay is still required
So don't use TURN. Use something that does what you want to do, rather than something that doesn't.
The method I described is one way of using SIP. Telling me it can't work is rather silly - I have working phones to demonstrate the efficacy of the procedure (even if I rarely use that exact method - but that's for other reasons). It's the old gag of "those who claim something is impossible shouldn't get in the way of those of us doing it"...
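For illustration, a rough sketch of the hole-punching idea, assuming the rendezvous step (a SIP/STUN exchange or otherwise) has already told each peer the other's public IP and port; here that is simply passed on the command line:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* Both NATed peers send from the SAME local port to the other side's
   public address, so each NAT creates a mapping that lets the peer's
   packets back in.  Purely a sketch, not any particular product's code. */
int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <local-port> <peer-ip> <peer-port>\n", argv[0]);
        return 1;
    }

    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in local = {0};
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons((uint16_t)atoi(argv[1]));
    bind(fd, (struct sockaddr *)&local, sizeof local);

    struct sockaddr_in peer = {0};
    peer.sin_family = AF_INET;
    inet_pton(AF_INET, argv[2], &peer.sin_addr);
    peer.sin_port = htons((uint16_t)atoi(argv[3]));

    /* Send a few probes from our fixed local port to open the NAT mapping ... */
    for (int i = 0; i < 5; i++) {
        sendto(fd, "hello", 5, 0, (struct sockaddr *)&peer, sizeof peer);
        sleep(1);
    }

    /* ... after which the peer's packets can reach us directly. */
    char buf[64];
    ssize_t n = recvfrom(fd, buf, sizeof buf - 1, 0, NULL, NULL);
    if (n > 0) {
        buf[n] = '\0';
        printf("got \"%s\" directly from peer\n", buf);
    }

    close(fd);
    return 0;
}

This only works where the NATs give consistent mappings (which is what STUN checks for); symmetric NATs still need a relay, which is where TURN comes in. But for the common case the media flows peer to peer with nobody in the middle once the session is up.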
Vic.
How would quality of service be impacted if the servers got overloaded?
How would quality of service be impacted if some of the clients contained buggy software?
How would quality of service be impacted if significant proportion of clients restarted?
How would quality of service be impacted when supernodes shut down?
After all, Skype has been about since, oh, May 2003 if memory serves me right.
Back then I recommended that a certain large telco buy the technology, and add an IP-to-PSTN gateway, central directory servers, and so forth. And proper destruction testing - something telcos *do* do right insofar as voice is concerned. The SVP Global Product Management blocked the suggestion. 18 months later I had an email from the CEO regretting that decision :-)