Skype's mega-FAIL: exec cops to cause

Chastened by its pre-Christmas mega-FAIL, Skype on Wednesday explained in detail why the titanic titsup takedown happened and how the company plans to ensure that the globe will never again go VoIP-less for an extended period of time. For those of you tuning in late, last Wednesday the online telephony service began to wobble …


This topic is closed for new posts.
  1. Yet Another Anonymous coward Silver badge

    Sounds familiar

    Didn't the US POTS phone system go down for a similar reason years ago?

    They introduced an upgrade into a switch which failed - after it had sent the upgrade to the next switch, and so on.

    1. Chad H.


      Kinda reminds me of the Morris worm...

    2. Disco-Legend-Zeke

      The PSTN...

      ...cascading failure was apparently caused by the omission of a single semicolon at the end of a C statement. (Western Electric ESS switches are UNIX driven.)

      1. MacroRodent

        Actually, it was a missing "break"

        As I remember it, the reason for that failure was a missing "break" keyword at the end of one branch in a C switch statement. A common programming error in C, and one not caught by the compiler, because according to the rules of C, the execution then continues in the next switch branch if there is one. This is one of the worst design flaws in C (and amazingly, aped by many related languages)!
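A minimal sketch of the kind of bug described above. It assumes nothing about the actual switch software: the message types and function names are hypothetical. In `handle_buggy` the missing `break` lets the first case run on into the second.

```c
/* Hypothetical message types, for illustration only. */
enum msg { MSG_STATUS = 1, MSG_RESET = 2 };

/* Returns a bitmask of the actions taken: bit 0 = logged, bit 1 = reset. */
int handle_buggy(enum msg m)
{
    int actions = 0;
    switch (m) {
    case MSG_STATUS:
        actions |= 1;      /* log the status message */
        /* missing break: execution falls through into MSG_RESET */
    case MSG_RESET:
        actions |= 2;      /* reset the unit */
        break;
    }
    return actions;
}

/* Same logic with the break in place. */
int handle_fixed(enum msg m)
{
    int actions = 0;
    switch (m) {
    case MSG_STATUS:
        actions |= 1;
        break;             /* stops here, as intended */
    case MSG_RESET:
        actions |= 2;
        break;
    }
    return actions;
}
```

A status message passed to `handle_buggy` also triggers the reset, while `handle_fixed` only logs it. The language itself permits this, but current GCC and Clang can warn about it when `-Wimplicit-fallthrough` is enabled.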

        1. Anonymous Coward

          One man's flaw...

          It's not a flaw at all - it's as designed and works very elegantly when used properly as a cascading switch.

        2. Graham Dawson Silver badge

          Design flaw?

          Sure it's not a feature? It's very handy if you have a couple of operations that are very similar, but where one requires a little extra activity beforehand. Very, very handy. Almost like relay logic.

        3. Anonymous Coward

          @MacroRodent: Nope, Please Go Back To School

          It IS desirable for a switch() statement to have the "fall through" capability, as there is quite often the need to group several possible cases together and process them with identical or nearly identical code.

          Look at this:
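The snippet the poster pointed to is not reproduced here. As a stand-in rather than the poster's own code, here is a minimal sketch of the grouping described: several case labels sharing one body. The function name is hypothetical.

```c
/* Hypothetical helper: number of days in a month (1-12), 0 if invalid. */
int days_in_month(int month, int leap_year)
{
    switch (month) {
    case 4: case 6: case 9: case 11:   /* grouped labels share one body */
        return 30;
    case 2:
        return leap_year ? 29 : 28;
    case 1: case 3: case 5: case 7:
    case 8: case 10: case 12:
        return 31;
    default:
        return 0;
    }
}
```

Without fall-through between adjacent labels, each of the seven 31-day months would need its own copy of the body.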

          1. Anonymous Coward

            It is a flaw

            The "fall through" feature should be an option, not the default. Lousy design, sure, but C is pretty old now and some progress has been made in language design theory. Although the continuing use of C++ suggests that such progress has had limited impact.

            1. Blitterbug

              C 'a bit old now'? Good grief...

              Read some daft posts on here, but dismissing C (one of the most elegant and constantly relevant languages on the planet today) as being '...a bit old now', and stating that the use of C++ (a different language altogether) shows that limited progress has been made, just took the biscuit.

              I'm not normally confrontational on these forums, but to harbour such beliefs indicates a pretty narrow field of personal expertise. I have to say that anyone even reasonably knowledgeable in computer language theory, or simply having practical expertise in many different programming languages, whether compiled (C, Delphi, VB, assembler), interpreted (BASIC, Java etc) or JIT semi-compiled (VB .Net, C#, J#), would have found these comments unbelievably crass and ill-informed.

              1. Anonymous Coward

                Someone, as you say...

                ..."simply having practical expertise in many different programming languages" would know better than to place Java in the interpreted group, instead of "JIT semi-compiled".

                Ever heard of HotSpot? And do you even know how Java programs are executed?

                On another note - you criticize the predecessor's arguments as not relevant (which is true), yet you don't seem to provide your own - e.g. just how is the fall-through switch statement "elegant"?

                1. Blitterbug

                  Java 'semi-compiled'?

                  @AC - Do I even know how Java is executed? Having worked on compilers and interpreters most of my working life, I think so, a little. Pure Java is pseudo-compiled down to P-code, which is a form of super-tokenisation really, not JIT-precompiled. Not the same thing. And the elegance of C code is in its many other aspects such as being able to both modify and test variables at the same time, as in (very simply):

                  if (y--) {
                      // Do this
                  }

                  Of course this can be achieved in most other languages with a little more code, but the particular beauty of C is that it was built to do work this way, keeping statements succinct.

                  However, to answer your valid point, default fall-through is indeed elegant and should be the default in all cases. It allows you to create cascaded tests. Understandably, not everyone will agree though.

            2. Anonymous Coward

              @ It is a flaw

              No it's not.

              Just because you don't understand it doesn't mean it's wrong.

              This is well documented and as designed.

              I guess kids nowadays need stabilisers on their coding too :)

  2. Mage Silver badge

    It's been going down hill in quality.. Seems to be last decent version

  3. slooth
    Thumb Down

    says it all

    Skype for Windows version

    says it all

  4. Oninoshiko

    yer doin' it rong...

    50% of the userbase on a flawed version caused a cascade failure that rendered the network fully inoperable, so the solution is make sure 100% of the userbase is on the same release? Seriously?

    I would SLOW the dissemination of releases, so that 20% of users are two releases behind, 30% are one release behind, 35% are "current" and 15% are "experimental." Any release that was fatally flawed could be marked as "KOS" at the same time a new "experimental" release is declared. This way at MOST 35% of users will be on a flawed release.
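One way to read the split proposed above, as a sketch only: map each client deterministically into a cohort by hashing its ID and taking the result modulo 100. The hash choice, the ID format, and the cohort names are all assumptions for illustration; nothing here reflects how Skype actually distributes releases.

```c
#include <stdint.h>

enum cohort { TWO_BEHIND, ONE_BEHIND, CURRENT, EXPERIMENTAL };

/* FNV-1a hash of a client ID string (any stable hash would do). */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Map a bucket in 0..99 to a cohort: 20% / 30% / 35% / 15%. */
enum cohort cohort_for_bucket(unsigned bucket)
{
    if (bucket < 20) return TWO_BEHIND;
    if (bucket < 50) return ONE_BEHIND;
    if (bucket < 85) return CURRENT;
    return EXPERIMENTAL;
}

/* The same client ID always lands in the same cohort. */
enum cohort cohort_for_client(const char *client_id)
{
    return cohort_for_bucket(fnv1a(client_id) % 100);
}
```

Because the mapping is deterministic, a release marked "KOS" could be retired by changing only the table in `cohort_for_bucket`, without tracking per-client state.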

    Not upgrading everyone at once is actually a good move, for the exact reason stated in the story. Their plan is not bad, it just didn't actually account for a failure on the scale that was possible. The correct solution is to DECREASE the scale of potential failures, not INCREASE it.

    1. Anton Ivanov


      The best way to make a distributed arch stable is to deliberately make it more diverse. Having no more than let's say 50% of clients on the same release is the right approach here. It is not a guarantee of course, but it tends to make things more reliable. Having sufficient numbers on different platforms is also a good idea. And so on as long as a complete failure of one release can allow the rest of the network to function.

    2. Anonymous Coward


      I'm sorry, but this is a pretty poor idea. Bugs tend to persist across multiple versions until they are fixed. A bug like this which doesn't crop up until a particular failure mode triggers it (delayed messages in this case) might lie dormant for years, meaning it is remarkably easy to have the bug exist in 2 or more of the versions you have in the wild. Not being able to rapidly upgrade the software you have there means you can't fix major bugs quickly because people don't upgrade quickly. If you follow this methodology you end up with half the Internet running a bugged, security disaster like IE6!

      The obvious "right thing to do"™ is pretty simple. The Skype Client and Skype Server should be separate processes on every machine. The server should do very little other than talk to the P2P network and pass messages on to the client. The client can parse all the messages and do the stuff that will likely crash. Hopefully over time the server becomes very stable and is rarely updated and hence likely to have very few bugs; the client becomes the thing you keep changing as new features come on line.

      1. Oninoshiko

        not as poor as running everyone on a release that has not had adequate testing.

        Fixing and upgrading everyone at once sounds like a good idea on the surface, but is only so if you can be completely sure that you do not introduce a MORE problematic bug during the operation. From what was said in the article they use their software to update itself, so by forcing an update to everyone at once, a newly introduced bug (or did you think that your fixes couldn't introduce new, or exacerbate existing bugs?) is forced onto the entire network, leaving no systems in the network able to provide service to the others as they are repaired and come back online.

        From what Mr. Rabble has said, having forced EVERYONE onto would have made the network less resilient, as everyone would have experienced the problem immediately, rather than having some time for it to cascade. If fewer peers had to fail over at the same time, because they were running a different release (from his report, ANY different release), the network would likely have suffered only the hosts running the affected versions falling off, and a minor slow-down for everyone else, rather than a complete cascade failure.

        Yes, I recognise that you can have bugs across multiple releases, that is exactly WHY I put it all the way up at four. Your IE6 comparison is a complete non sequitur, as I EXPLICITLY INCLUDED a way to force the elimination of particularly problematic releases from the network. I am not saying users decide when to upgrade (on a normal basis), but the network does. While this may leave some users vulnerable for some time, the objective is to protect the network from catastrophic failure.

        Frankly, dividing the software into multiple interacting programs (I'm not sure stopping at two makes sense) would probably make upgrades far more seamless and transparent, and works well with making sure there are always multiple versions in the network.

        In addition, if you were on Windows, and having problems with Skype, what would be high on your list of things to try... maybe reinstalling Skype? In which case most of the affected users pull the latest version OOB (from your web site) and it doesn't matter.

  5. swschrad

    also familiar to us ancients who had VAXclusters

    simply put, there was a small flaw in the architecture there. if the cluster controller went away, the rest of the cluster looked for a new arbitrator. DEC had determined and enforced that the earlier a MAC address you had, the better qualified you were to serve as the cluster controller, since of course it would always fall back onto a classic VAX.

    until.... the physical MAC address pool ran out, and they needed to start reusing hardware MACs for PC controllers.

    if you've had a VAXcluster fall onto a 286 PC as cluster controller, you'd know empirically that you have to define a class of trusted systems that you always look for first.
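The lesson in this post can be sketched as a two-tier election: consider trusted nodes first, and fall back to the lowest address only among nodes of the same class. The struct, the field names, and the integer stand-in for a MAC address are hypothetical; this is not how VMS clustering is actually implemented.

```c
#include <stddef.h>
#include <stdint.h>

struct node {
    uint64_t mac;      /* hardware address as an integer */
    int trusted;       /* nonzero if in the trusted class */
};

/* Return the index of the elected controller, or -1 if n == 0.
   Trusted nodes always beat untrusted ones; ties go to the lowest MAC. */
int elect_controller(const struct node *nodes, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (best < 0) { best = (int)i; continue; }
        const struct node *b = &nodes[best], *c = &nodes[i];
        if (c->trusted != 0 && b->trusted == 0)
            best = (int)i;
        else if ((c->trusted != 0) == (b->trusted != 0) && c->mac < b->mac)
            best = (int)i;
    }
    return best;
}
```

With this rule an untrusted machine holding the lowest address is elected only when no trusted node is present at all.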

    way too early for the Sky Hype guys, although they could have read about it.

    1. maartent

      cluster controller ?

      What is a cluster controller? I suppose you mean the node that holds the locking database?

      That was governed by the system parameter LOCKDIRWT but that was on VMS 5 and higher.

      Anyway, I never hit on that flaw in 16 years, must have been solved a long time ago.

      1. Anonymous Coward

        Classic VAX?

        Actually, that could be a real pain if you had a mixed bag of a cluster. Our first (V4) cluster included an 11/785, 11/750, an 8600 and an 8700. You didn't want the 750 to own any of the master capabilities (including mastering any distributed locks) if you wanted any decent performance!

        Happy (and simpler) days!

    2. Mike Pellatt

      DECnet design


      The underlying design flaw there was to have the network address be the MAC address, and to decide to override the hardware MAC address with the DECnet network/MAC address.

      Most dumbass design I've ever seen. Fell off a chair when I learnt of it.

      Made my BICC MPS (Multi-Protocol Support) DECnet driver hard to get to play well with the ISO/OSI driver. And had to jump through some hoops I'm quite ashamed of when the card driver for the 16-bit card didn't allow the MAC address to be changed - like scanning through the driver code looking for the MAC address and then changing it there.

      Only in MessyDOS.......

  6. Anonymous Coward


    Bug in Windows version 5.xxxxxx, simple solution would be to ban Windows me thinks..... Linux for the Win

    1. midcapwarrior

      of course since few use linux

      You'd never have an issue. Problem solved.

      1. Anonymous Coward


        So who provided the 'thousands' of 'mega-supernodes'?

    2. Graham Wilson

      @A. Coward. Can't blame Bill and cronies this time, it's Skype's problem....

      ...But why?

      Now ask yourself why would a specific version, especially a new version, cause such a catastrophe?

      The questions we should be asking are (a) exactly what was the bug, and (b) what's this latest version doing that's so radical that no previous version (over all those years) has ever caused?

      Seems to me that Skype needs to become an open standard. Final questions: who has vested interests in keeping Skype closed, and why would they want to keep it closed?

      Encrypted end-to-end it might be, but I still don't trust it.


        open source VoIP

        but how would you make a call to a PSTN? The client is, or will be, open sourced -- but that's just the GUI. At the end of the day, someone has to pay for those telephone calls, so, no, skype can't just open source the whole thing and remain a business.

        1. Anonymous Coward

          Use a SIP service provider?

          There are any number of companies providing SIP to PSTN gateway services at prices similar to Skype. They don't care if you use open or closed software clients, or a physical SIP handset. Since many (most?) home ADSL routers now include transparent SIP proxying the old problems of getting it through the firewall should be gone.

          You could even be your own gateway provider with a box like the Linksys SPA3000 plugged into your landline. Then there are things like Asterisk and FreeSwitch, but they're probably beyond most home users.

        2. Anonymous Coward

          Suggestion Regarding Open Source Voip

          At least here in Germany people normally have free fixed-line-calling as part of the telephone/DSL contract.

          The OS community could simply allow other people to use their land lines to call fixed-line numbers when they don't need them themselves. ISDN cards can be fully programmed.

    3. yosemite
      Thumb Down

      Oh for the love of God!

      This ISN'T about Windows you useless troll.

      1. Uplink
        Gates Halo


        Actually it is. The Linux version of Skype has been stalled at... let me check... BETA for quite a while now. Because it hasn't been upgraded for ages, it just couldn't pull such a stunt as the Windows version. Programmers need to "improve" and "fix" the software for shit to hit the fan. Although I'm not happy that the Linux version is 3 major versions behind the Windows version, it seems it's not entirely a bad thing :)

    4. Anonymous Coward


      If it was a linux-only client, it could have been down for a week before anyone even noticed!

  7. oopsie

    exponential back off?

    Would some form of back off on clients trying to reconnect have helped with this? I'm thinking that this way, the supernodes would have had time to reestablish themselves without being slammed with a huge amount of traffic?
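A minimal sketch of the idea raised here, assuming a client that retries with a doubling delay, capped, plus a random offset so that clients do not all retry in step. The base delay, the cap, and the function names are hypothetical, and `rand()` stands in for a better random source.

```c
#include <stdlib.h>

#define BASE_DELAY_MS 1000u     /* hypothetical first retry delay */
#define MAX_DELAY_MS  300000u   /* hypothetical cap: five minutes */

/* Upper bound on the delay before retry number `attempt` (0-based):
   BASE_DELAY_MS doubled `attempt` times, capped at MAX_DELAY_MS. */
unsigned backoff_ceiling_ms(unsigned attempt)
{
    unsigned delay = BASE_DELAY_MS;
    while (attempt-- > 0 && delay < MAX_DELAY_MS)
        delay *= 2;
    return delay > MAX_DELAY_MS ? MAX_DELAY_MS : delay;
}

/* Jittered delay: a pseudo-random value in [0, ceiling], so that clients
   that failed at the same moment spread their retries out. */
unsigned backoff_delay_ms(unsigned attempt)
{
    unsigned ceiling = backoff_ceiling_ms(attempt);
    double fraction = (double)rand() / ((double)RAND_MAX + 1.0); /* [0, 1) */
    return (unsigned)(fraction * ((double)ceiling + 1.0));
}
```

A reconnect loop would sleep for `backoff_delay_ms(attempt)` milliseconds after each failed attempt and increment `attempt`. The cap keeps a long outage from pushing retry delays out indefinitely.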

    Any idea where they found the bandwidth/processing for their mega-super-duper-nodes to fix the system? I guess it'd be one of those things that processing on demand would be pretty handy for?

    1. Tom Chiverton 1


      "I'm sorry Dave, I'm afraid you can't login right now". Yeah, that'd be 'better'.

    2. ElNumbre

      Copy; Start; Goto 10.

      --Any idea where they found the bandwidth/processing for their mega-super-duper-nodes to fix the system?--

      Presumably one of those virtual instance resellers, Amazon AWS, Rackspace, Azure et al. At least one advantage of virtual machines is that you can copy/start almost ad infinitum until either the cloud can hold no more or your bank balance holdeth no more either.

  8. Will Godfrey Silver badge


    Don't you just love them.

  9. Anonymous Coward

    skype failure - who really cares?

    From the above..." - globe will never again go voipless" is a crazy thing to say, after all Skype is only one of literally thousands of VoIP networks. However Skype already has the severe limitation of being proprietary and thus does not interface with anything else, hardware or software. This is exacerbated by the P2P system it relies upon which as described above lends itself readily to an "avalanche" type of failure. The VoIP industry's open standard SIP is far more widespread and far more flexible and so the majority of global VoIP users were not affected by the Skype catastrophe.

    1. Anonymous Coward


      You clearly know little about the SME sector. Very many use Skype as an excellent and low cost communications system, let alone all the personal users. As I write this there are nearly 18m online.

      Whilst big business has its big IT budgets and largely wasteful IT departments (and yes I speak from experience) the small guys make their hard-earned money work hard. Some will have learned though that putting their trust in Skype alone without a fallback, like anything else in IT, was a stupid thing to do.

      But most of us missed Skype because it is to us, an excellent comms tool despite the fact that since v3.8 the GUI has gone, well, gooey, like all other apps it seems.

  10. Aaron Guilmette

    Take Mega Nodes offline?

    Maybe ... since they introduced Mega SuperNodes to help alleviate the problem, they should run some of their own (or use some VMs in the EC2/Azure/Whatever cloud) so they can help stave this off in the future?

    "We found the fix. We added more computers to handle the increased load."

    "Great. Now that we have a band-aid on it, what are we doing?"

    "We're going to take all those bloody extra computers offline!"

    Mine's the one with the mega super node in the cloud.

    1. Pablo

      I know but...

      Presumably running thousands of mega-super-nodes costs a lot of money.

  11. Highlander

    Where did they get 1000s of Mega supernodes at zero notice?

    Just a thought that popped into my head, but you have to wonder where the heck they pulled 1000s of mega supernode servers from on basically zero notice. Seriously, where and how did their engineers activate so many in such a short time? Provisioning thousands of servers on any network isn't a trivial task.

    Then I wondered, why doesn't Skype, with its wonderful P2P model that generates revenue on Internet capacity paid for by someone else, have a server farm of really big supernodes to handle this kind of thing? And if they do, and this is how they activated 1000's of mega supernodes so quickly, why are they so keen to withdraw them from service as soon as possible?

    Then it struck me. Those clever chaps had activated their own botnet of Skype clients and promoted thousands of ordinary customer peers to be mega supernodes. I can't think of any other way they could so quickly provision so many servers on a distributed basis in such a short time. It's no wonder they want to retire as many as possible as quickly as possible, I might be somewhat miffed if my PC and Internet bandwidth were suddenly being eaten alive to serve Skype.

    Two suggestions above are strikingly logical. 1) stagger the software releases so that you don't have a high predominance of a single version of your peer server code, just in case, and 2) alter the back-off code so that when a new peer server attempts to join and finds the network is busy it doesn't just hammer the servers into submission, nor do all peers back off for the same time.

    Seriously though, where can I get 1000's of supernodes at zero notice?

    1. Annihilator

      Think virtual

      Either, they provisioned a bunch of virtual devices from an online provider, or your suggestion of promoting "ordinary" users was what they did. To be honest though, as I understand it *any* Skype user can be promoted to supernode - although there is a way to disable it if you want... It's all part of the Skype experience that you sign up for though.

      Though IIRC even being a supernode isn't a massive drain on your network.

      1. Michael C


        Per Skype's comments in other places online, super-nodes don't work unless you have Skype on the external IP though. Super-nodes won't NAT by design.

    2. Phil Endecott

      Re: Where did they get 1000s of Mega supernodes at zero notice?

      > I can't think of any other way they could so quickly provision so

      > many servers on a distributed basis in such a short time.

      Amazon EC2.

    3. Matt Piechota

      Cloud 2

      "Then it struck me. Those clever chaps had activated their own botnet of Skype clients and promoted thousands of ordinary customer peers to be mega supernodes"

      Just go to Amazon (or another cloud provider) and ask for them? Or a non-cloud traditional hosting company.

  12. Graham Wilson

    Similar mechanism to the Great Northeastern Blackouts!

    Here we have another unpredictable 'complexity' failure in a gargantuan system. Like the Northeastern Blackouts, they strike when least expected, never get fully understood and cannot be properly analyzed with the tools available.

    State analysis methods, if possible, would require a computer the size of 'Deep Thought' and take just as long as it did to calculate '42'. The fact is many of our large engineering systems are vulnerable to 'complexity/scaling' failures and we shouldn't be quite as surprised when they happen.

    Let's face it, we've all been aware of them for over 40 years.

    1. Anonymous Coward


      I read the computer name as "Deep Throat" and was wondering: who'd name a computer that and why? It took a few passes to read it correctly.

      1. Anonymous Coward

        Only if you're an IT illiterate who never read the Adams books.

        The rest of us see the 42 and don't have to read the name.

  13. Charlie Clark Silver badge

    You're all missing the point

    Rik actually wrote "normalcy"!

    For this he should be locked in El Reg's darkest cupboard for a week.

    1. Steve 114


      Better placed in the irony cage.

  14. Anonymous Coward

    The title is required

    So, if you are running Skype version for Windows you were part of the problem, if you were running a different version of Skype you were part of the solution?

    Nice bit of pre-release testing, BTW.

  15. zb


    Versions on Linux and on a Mac gave me no problems. Maybe I should stop pestering Skype for a newer version

  16. Doug Glass

    The One Thing You Can Always Depend On ...

    ... is if it exists in the natural world, it will break sooner or later. Old but still true, "never put all your eggs in one basket". Oh frelling well.... shit happens.

  17. Alex Brett

    Re: Where did they get 1000s of Mega supernodes at zero notice?

    It's unlikely they 'promoted' lots of 'ordinary customer peers', as it isn't something you choose when you sign up to Skype whether you become a supernode or not, it's done based on the type and quality of your connectivity, so if you fire up Skype on a very good internet connection (no NAT, lots of bandwidth etc) you'll almost certainly become a supernode and start handling directory info and routing other people's phone calls...

    I suspect they got these thousands of 'mega supernodes' from somewhere like Amazon EC2, or other cloud providers - with something like that once you've set up one image firing up thousands of copies of it is trivial, it's not like they had to set each one up manually.

  18. ratfox
    Thumb Up

    Good for them

    They came out with the explanation pretty fast and without finger-pointing.

    If only all companies acted that way...

  19. json

    interesting to know..

    that skype rides free on unsuspecting skype clients set as 'super nodes' while it gives 'free' service to most of us... also why did a significant number of supernodes reset all of a sudden and at the same time? that's rather strange.

    1. Anonymous Coward

      Try reading the article...

      as it says in the title! Jeeze!

    2. Anonymous Coward

      Re: interesting to know..

      If you want to know why the supernodes reset then try reading this register article:

  20. rvt

    why so long

    But still,

    if they used Amazon, or another cloud... or just 1500 servers with 4 virtual machines,

    why did it take such a long time before everybody was online again?

    We just talk about... what, 20 million users. That's less than 3333 req/sec; everybody should have been online in no time, not +12 (or more??) hours!!!

    1. Alex Rose

      You're clearly not an engineer

      As if you were you'd realise that fixes don't happen instantly. First they had to identify the problem, then they had to create a solution (possibly going down a few blind alleys before hitting on a working solution), then they had to come up with a method to implement that solution on a scale large enough to alleviate the problem, they then had to actually go and implement that method, then once they had found it to work they had to monitor the situation and tweak their method as necessary.

      If you're not impressed that they did that in the time that they managed to do it in (can't be bothered to go and check the article again, you say 12+ hours, if that's correct then I think that's a miracle on par with turning water into wine, but then I live in the *real* world) then you are clearly not, and have never been, an engineer.

  21. lk1d

    Janet & Skype

    Here's the report I was on about earlier:

  22. steve hayes

    Simple Answer

    Stop using your customers as Beta testers on 'Release' versions. Get some proper testers with some proper testing schedules.

  23. lk1d

    RE: supernodes

    I thought the client detected the amount of bandwidth available to it, and became a supernode automatically over a certain threshold. This is the reason networks in the UK like SuperJANET don't allow Skype. There's a report on their site somewhere, I remember reading last year when working for a LA that bought an overflow egress from them.

  24. Anonymous Coward

    Very amusing!

    So please correct me if I'm wrong, this sounds like an inadvertent DDOS by their own software on their own servers?

    1. David Dawson

      Yes, you're wrong

      It's an inadvertent DDOS by their own software on their own customers...

  25. doperative
    Big Brother

    mega-supernodes and P2P networks

    It defies logic to route p2p traffic through nodes, unless you wanted a method of monitoring the traffic. All VoIP needs is a method of identifying clients, a kind of super DNS service, something that would be trivially easy to implement.

    1. Oninoshiko


      so how, pray tell, under your system do two users on different networks with the IP talk to each other?

      Routing traffic through a "supernode" is a technical requirement in an inter-network that can contain 1-to-many NATed intra-networks.

      1. Vic

        Reply to post: Very Wrong

        > Routing traffic through a "supernode" is a technical requirement in an inter-network

        > that can contain 1-to-many NATed intra-networks.

        No it isn't.

        Routing the session initiation through a controller is a requirement - that's how the clients find each other. Once that is done, the session can be handed off to the clients to handle.

        That's how SIP works, if you want it to.


        1. Oninoshiko
          Thumb Down

          Re: Vic

          And, given the example described, on what socket do the clients talk to each other?

          Neither host can establish a socket to the other, because both can ONLY talk to public IPs. For new incoming sockets to the public IP it is unknown what internal IP they go to.

          dealing with this is the reason for TURN. To understand the problem, you can use the wackipedia article as a decent jumping off point:

          Even using TURN, an external TURN relay is still required, and therefore I stand by my statement (it is being used in place of the "supernode," but the point remains that someone is still in the middle relaying your messages, for a technical reason, not (just) to spy on you).

          1. Vic


            > And, given the example described, on what socket do the clients talk to each other?

            That depends on the configuration of the clients. If there is no pre-existing config, you'd use a SIP server to set up the session. Just like I said in my post...

            > Neither host can establish a socket to the other

            Yes they can.

            > because both can ONLY talk to public IPs.

            You've not read up on STUN, then?

            > Even using TURN, an external TURN relay is still required

            So don't use TURN. Use something that does what you want to do, rather than something that doesn't.

            The method I described is one way of using SIP. Telling me it can't work is rather silly - I have working phones to demonstrate the efficacy of the procedure (even if I rarely use that exact method - but that's for other reasons). It's the old gag of "those who claim something is impossible shouldn't get in the way of those of us doing it"...


  26. doperative

    Questions I would have asked before ..

    How would quality of service be impacted if the servers got overloaded?

    How would quality of service be impacted if some of the clients contained buggy software?

    How would quality of service be impacted if significant proportion of clients restarted?

    How would quality of service be impacted when supernodes shut down?

    1. Alan Lewis 1

      And probably have been...

      After all, Skype has been about since, oh, May 2003 if memory serves me right.

      Back then I recommended a certain large telco bought the technology, and added IPt-PSTN gateway, central directory servers, and so forth. And proper destruction testing - something telcos *do* do right insofar as voice is concerned. The SVP Global Product Management blocked the suggestion. 18 months later I had an email from the CEO regretting that decision :-)
