WAN router IP address change blamed for global Microsoft 365 outage

The global outage of Microsoft 365 services that last week prevented some users from accessing resources for more than half a working day was down to a packet bottleneck caused by a router IP address change. Microsoft's wide area network toppled a bunch of services from 07:05 UTC on January 25, and although some regions and …

  1. NoneSuch Silver badge
    Joke

    Fat Fingers Foul Frantic Fellows.

    1. big_D Silver badge

      This sounds like a variation on the usual BGP fat fingering.

      1. Jellied Eel Silver badge

        This sounds like a variation on the usual BGP fat fingering.

        Not really. Well, not unless you're running BGP as an IGP (Interior Gateway Protocol) and not an EGP (Exterior Gateway Protocol). That's one of those can you/should you questions where the answer is generally 'NO!' because BGP is blissfully unaware of the network topology. So..

        "As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN,"

        Well, yes. That's what routers do. IP address changes, topology changes, routing tables are recalculated and updated, and traffic resumes. That's more computationally expensive than just forwarding packets, so gives the routers more work to do. This has been known since before 1989, when OSPF became an IETF standard. Or long before then given networks existed prior to the Internet.

        MS seems to have missed that memo, and the importance of managing your routing tables, segmenting your networks to reduce table complexity and reconvergence time, and optimising your time-outs. If not, this situation occurs. Especially if the traffic drops for long enough that other protocols (like BGP*..) time out, then start generating more traffic or attempts to re-route and you get your good'ol fashioned train wreck.

        So kinda surprised this hasn't happened before, and generally indicates that Microsoft's network is getting overloaded. With link-state protocols like OSPF, losing a connection can result in routing table updates. The 'fix' generally requires re-partitioning the network to re-segment it, and reduce the time to reconverge. Or throw hardware at the problem. Which again is an old one, especially from the good'ol days when this was done in CPU, and 'core' routers from the likes of Juniper & Cisco were basically commodity PCs hooked up to a backplane via plain'ol PCI or Ethernet. But the fix is generally a big job.

        *This is FUN! Which gives non-network engineers time to play a bit of Dwarf Fortress, now available on Steam. Plain'ol congestion can also create the same problem, eg if there's too much traffic across a BGP session, keepalives may drop, sessions drop, traffic drops, congestion vanishes into the bit bucket. Then BGP attempts to re-establish the connection, congestion re-occurs, BGP drops again and you go make a coffee until the flap-dampening timeouts have expired and you can try again. One of the reasons why routing protocols and updates should be prioritised, but gets politically complicated for Internet peering sessions because Net Neutrality. Luckily most sane ISP netengs ignore the politics and JFDI because we know we'll get blamed.
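
        For the uninitiated, here's a toy Python sketch of that keepalive/hold-timer dance (the 60s/180s figures are the usual BGP defaults, but the rest is made up - nobody's actual BGP implementation looks like this):

          import random

          def simulate(loss_rate, keepalive=60, hold_time=180, duration=3600, seed=1):
              """Count session drops over `duration` seconds when congestion drops
              each keepalive with probability `loss_rate`."""
              random.seed(seed)
              last_heard, drops = 0, 0
              for t in range(keepalive, duration + 1, keepalive):
                  if random.random() > loss_rate:      # keepalive got through
                      last_heard = t
                  elif t - last_heard >= hold_time:    # silence too long: session resets
                      drops += 1
                      last_heard = t                   # pretend it re-establishes instantly
              return drops

          for loss in (0.0, 0.3, 0.6, 0.9):
              print(f"{loss:.0%} keepalive loss: {simulate(loss)} session drops in an hour")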

        1. Anonymous Coward
          Anonymous Coward

          Like, I understood the words you were saying, but not the meaning.

          1. Jellied Eel Silver badge

            Like, I understood the words you were saying, but not the meaning.

            You have not been initiated into the dark arts of router wrangling. Sooo.. lemme try something that doesn't always work, a motoring analogy!

            You're driving along, and your satnav tells you there's congestion up ahead. Would you like to take an alternate route? Chances are your satnav already used the same sort of algorithm as OSPF (Open Shortest Path First) to pick the route you're on based on driving preferences. Networks do much the same thing, if configured correctly. You may prefer highways or side roads, and OSPF has parameters like link cost.

            But if you say 'yes', chances are everyone else who's had the same update will try the same alternate route. That route will then likely end up more congested, especially if drivers notice and try to get back on the original course. It's a less complex problem because there are generally fewer roads, so fewer options to calculate, or recalculate. In a network, the more nodes and links there are, the more complicated it gets to recalculate all possible paths, and the longer it takes.

            So routing protocols and design generally tries to reduce that complexity by breaking down networks into zones, or areas in OSPF speak. That reduces the calculation/reconvergence time when routing tables update. You can also reduce it by limiting the number of routers involved that have a 'full routing table'. Some think it's a good idea that every router has a full table that knows where everything is, except when this happens, and everything has to recalculate that. Kinda like suddenly finding your car's been teleported to a random location, your satnav is down, you have no map and someone's stolen all the road signs. You still know where you need to go, but have no idea how to get there.
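
            A toy sketch of that "shortest path by link cost" calculation in Python, purely for illustration (real SPF implementations are rather more involved): lose a link and the whole thing gets computed again, and the bigger the graph, the longer that takes.

              import heapq

              def dijkstra(graph, source):
                  """Cheapest-known cost from source to every reachable node."""
                  dist = {source: 0}
                  queue = [(0, source)]
                  while queue:
                      cost, node = heapq.heappop(queue)
                      if cost > dist.get(node, float("inf")):
                          continue
                      for neighbour, weight in graph.get(node, {}).items():
                          new_cost = cost + weight
                          if new_cost < dist.get(neighbour, float("inf")):
                              dist[neighbour] = new_cost
                              heapq.heappush(queue, (new_cost, neighbour))
                  return dist

              graph = {
                  "A": {"B": 1, "C": 5},
                  "B": {"A": 1, "C": 1, "D": 4},
                  "C": {"A": 5, "B": 1, "D": 1},
                  "D": {"B": 4, "C": 1},
              }
              print(dijkstra(graph, "A"))    # A reaches D via A-B-C-D at cost 3

              # the B-C link "fails": every router recomputes, and the more nodes
              # and links there are, the longer this reconvergence takes
              del graph["B"]["C"], graph["C"]["B"]
              print(dijkstra(graph, "A"))    # A-D is now cost 5 via A-B-D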

            Basically it's a scenario that shouldn't happen with a decent network design. Most routers don't really route anyway, because often they don't have that many interfaces to choose from. So just forward the traffic thataway--> towards a device that should have a better idea. Forwarding is cheaper & faster than routing. I guess it's also a bit like being told there's a road closure in Tokyo when you're in the UK. If you're just concerned with your local area, you don't need to know, and don't care. You only really need to know about your bit of the world to get around.

            Datacentres can get more complicated because there might be thousands, or tens of thousands of devices inside a bit barn, but most of those still don't need to route because they're usually just sitting on LANs and can switch/forward traffic instead. You can run routing instances inside VMs though, but often you just point traffic to a provider's default gateway, and hope they can figure it out.

            1. xyz Silver badge

              In layman's terms, what happened was as embarrassing as... I was hoovering naked when....

          2. Norman Nescio Silver badge

            I understood it, and Jellied Eel was giving the high-level explanation. I'm certain JE could go into a lot more detail.

            In general, it's not often you need to change the IP address of a core router. If you do, and you are operating a redundant network, good practice is to disable the interface so there is no traffic transiting it before changing the IP address. Of course, if you haven't got the capacity to move the traffic before changing the address, you have other problems.

            If the network is big enough, I'd also expect it to use redundant BGP route reflectors rather than expecting the traffic-carrying routers to also handle a full-mesh of BGP updates. That way lies madness. And downtime.
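
            Rough back-of-the-envelope sums on why, sketched in Python (a simplification - real reflector designs use clusters, redundant pairs and so on):

              def ibgp_full_mesh(n):
                  return n * (n - 1) // 2    # every pair of routers needs a session

              def with_route_reflectors(n, reflectors=2):
                  clients = n - reflectors   # each client peers with each reflector,
                  return clients * reflectors + ibgp_full_mesh(reflectors)  # plus RR-to-RR

              for n in (10, 100, 1000):
                  print(f"{n} routers: {ibgp_full_mesh(n)} full-mesh sessions "
                        f"vs {with_route_reflectors(n)} with two route reflectors")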

            1. Jellied Eel Silver badge

              In general, it's not often you need to change the IP address of a core router. If you do, and you are operating a redundant network, good practice is to disable the interface so there is no traffic transiting it before changing the IP address. Of course, if you haven't got the capacity to move the traffic before changing the address, you have other problems.

              I think it's a problem with not doing an adequate risk assessment as part of a change control review. It's one of those things that sounds simple, but isn't. It's also something you tend to learn the hard way, either directly (debug all on a core peering router.. oops) or indirectly after swapping war stories with other netengs. This is why the social aspects of events like LINX, RIPE, NANOG etc are important. It's also something that often isn't really covered well during certification bootcamps.

              So the issue is you renumber an interface. Simple. But what happens next? That really requires taking a holistic approach and looking at the impact on the entire network, including applications. Basic problem is IP is dumb. Really dumb. There are a lot of things it won't do, or won't do well without a bunch of kludges. There are also a lot of dependencies on other network layers/protocols, and some of the functionality is often done better there than with IP.

              So often it's a good idea to monitor what's going through the interface you're about to delete. Not what should be going through that interface. That's one way to detect the static route someone added to fix a previous routing problem. But renumbering an interface is going to have all sorts of.. interesting consequences. Especially if you get netmasks wrong.

              Traffic stops, and will drop into the bit bucket. Routing tables will withdraw the route. Applications will probably keep trying to send traffic. Congestion may occur. You are on the actual console, or OOB connection, aren't you? This will also cascade. So Ethernet segments use ARP to tie MAC addresses to IPs. That's gone. If it's a v6 network and NDP, it's still gone. On a router, this can be.. bad, as in "where TF did my default gateway go?". This is also where other fun kicks in. Like, say, DHCP if it's a default (or popular) route. Then there's DNS. There's always DNS. Especially if you're running stuff like load balancers that rely on DNS PTR records, CNAMEs etc. Then, did you remember to update in-addr.arpa for the IP address/host name change? And drop TTLs, add a step to flush DNS caches etc?
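
              To give a flavour of the DNS side, a minimal pre-change sanity check in Python - a sketch using example addresses rather than anything real: does each address's in-addr.arpa PTR record still point at a name that resolves back to it?

                import ipaddress, socket

                def check_reverse_dns(addresses):
                    for addr in addresses:
                        # the reverse-zone record you'd have to update for this address
                        print(ipaddress.ip_address(addr).reverse_pointer)
                        try:
                            name, _, _ = socket.gethostbyaddr(addr)   # PTR lookup
                            forward = socket.gethostbyname(name)      # does the name point back?
                            status = "ok" if forward == addr else f"mismatch: {name} -> {forward}"
                        except (socket.herror, socket.gaierror):
                            status = "no (or broken) PTR record"
                        print(f"  {addr}: {status}")

                check_reverse_dns(["192.0.2.1", "8.8.8.8"])   # documentation + well-known addresses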

              So basically a lot of things that will happen if you 'just' renumber an interface. Which will confuse the heck out of connected servers/applications until everything updates, reconverges and calms down. Which assumes you're aware of all the things that can potentially go wrong, and ways to resolve those or rollback the changes. And because these kinds of changes can (probably will) result in a lot of additional traffic, rolling back to the original IP address may not help, or might trigger a fresh wave of chaos. Especially if that traffic leads to congestion, more packets being lost, more retransmissions etc.

              But basically not something to do lightly on a busy production network. Potential fixes, if this identifies capacity/congestion problems, are also distinctly non-trivial, ie re-designing and re-segmenting a production network the size of MS's without extended downtime is a huge job to do properly, and safely.

              I'm no expert on the server/application side of doing this, but know enough to know it's the kind of change that has to be approached with extreme caution. Then again, it's also a learning experience when you hit 'unexpected' problems like this. It's also potentially one of the scaling problems with IP. That was kind of hit years ago with the realisation that global Internet tables were too large to fit into existing router memory. Since then, the challenge has arguably gotten worse. So now we have MPLS and virtual networks, where one core device may have hundreds, or thousands of virtual routing instances, many (or most) of which will be impacted by any changes to core interfaces. This is one of the reasons why service providers generally restrict, or at least want to know how many routes you expect to cram into a single VRF.

              1. Norman Nescio Silver badge

                We obviously work in the same industry.

                It's probably true everywhere, but one measure of how well we are doing our jobs is the lack of problems to the extent that people think we do nothing and that our job is simple.

                "the realisation that global Internet tables were too large to fit into existing router memory."

                I remember a multinational client who wanted redundant and resilient Internet connections from their AS to ISPs in different countries. Which is a reasonable request. But they were used to having a small router connecting to their ISP via a default route...

                Even clients in the UK who wanted to use different ISPs for redundancy rapidly discovered that this networking lark was more complicated than they imagined, especially if they wanted load-balancing across the different connections. There was a lot of "But surely you can just....?"

                1. Jellied Eel Silver badge

                  Even clients in the UK who wanted to use different ISPs for redundancy rapidly discovered that this networking lark was more complicated than they imagined, especially if they wanted load-balancing across the different connections. There was a lot of "But surely you can just....?"

                  Yep, I hate that one. Just configure 2 default routes. It'll be fine (ish). It's one of those that can actually work, just not necessarily the way you hoped or expected. And not taught in J/CCIE boot camps either. Load balancing is just one of those things IP was never really intended to do. Nor were most routing protocols, given they're generally designed to pick the best route, not routes. Same with resiliency/redundancy. Why pay all this money for capacity we aren't using? Well, why pay all that money for Directors' and Officers' indemnity insurance when you never claim on it? Is that because it doesn't cover negligence?
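
                  Roughly how the "works (ish)" version behaves: per-flow hashing rather than true balancing. A toy Python sketch with invented names, not a real forwarding plane:

                    import hashlib, random
                    from collections import Counter

                    GATEWAYS = ["isp-a", "isp-b"]

                    def pick_gateway(src, dst, sport, dport, proto="tcp"):
                        key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
                        return GATEWAYS[hashlib.sha256(key).digest()[0] % len(GATEWAYS)]

                    random.seed(42)
                    flows = [("10.0.0.5", f"198.51.100.{random.randint(1, 200)}",
                              random.randint(1024, 65535), 443) for _ in range(10_000)]
                    print(Counter(pick_gateway(*f) for f in flows))
                    # lots of small flows split roughly 50/50, but one fat flow still
                    # sticks to a single gateway - hence the disappointment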

                  There are also little tricks that can be abused along the way. Got 2 fibres or wavelengths running diversely? Just use optical protection switching and it'll generally switch paths long before a router notices. Preferred box-shifter solution is, of course, to flog you a 2nd router fully loaded and then leave you to figure out how to configure your traffic management. Which then makes your routing more complicated, which means more stuff to go wrong. Luckily these days there are various ways to do channel bonding and group multiple circuits/wavelengths into a single virtual trunk. Usually best done at layer-2, and generally handles diversity better than an HR department. Providing latencies across the paths are close enough, which is often the case on metro circuits, and can be on WAN connections as well. Fun can also be had with passive optical splitters. Use a cheap Y-cable to clone traffic across 2 links, or single link to 2 interfaces. Just make sure one interface is shut down, cos routers and applications generally don't like dupe packets.

      2. Wzrd1 Silver badge

        Stop fat shaming! So, someone's fingers are in need of going on a diet, so what?

        We didn't like that traffic anyway...

        I'll just get my coat...

  2. Anonymous Coward
    Anonymous Coward

    We said at the time that we shouldn't blame DNS but it's always DNS

    1. Zippy´s Sausage Factory

      Except when it's BGP, of course.

      1. monty75

        IAT - It's Always TLAs

    2. Sp1z

      "D"id "N"ot "S"elect-the-right-IP-address

  3. Stuart Castle Silver badge

    This is the problem with the cloud. One change stopped hundreds of people from being able to access their software/systems. It was only a couple of hours, but it could have been days, and it could have cost a lot of businesses their survival. While Office is unlikely to be used in life and death situations, cloud software could be, and a failure could cost lives.

    1. Hans Neeson-Bumpsadese Silver badge

      cloud software could be used in life and death situations, and a failure could cost lives.

      If it's that critical and that life-and-death, then you absolutely have to make management understand the risks and allow you to build sufficient redundancy into the design.

      If there's anything good to come out of events like this Microsoft outage, it's to have real-world events that you can use as examples for why you want to keep things off the cloud, or be given enough budget to design something that has a fallback in the event that the cloud fails.

      1. Pirate Dave Silver badge

        I'm not disagreeing, but for places that have gone in whole-hog on the Office365 train, where exactly do they fall back to if that train runs off the track? If Teams in O365 is borked, what are the options for a 1-2 day outage? Or OneDrive/Sharepoint? Or Exchange? I mean, for the vast majority of corps that are O365 slaves, if O365 shits itself at this level, then they are just SOL. And that's by design of Microsoft, since they don't want to play well with others.

        1. The Basis of everything is...

          There's always a fallback option

          If you're all-in on O365, and especially if working remote, then do a little homework and prepare yourself for the inevitable local glitch that cuts you off the interwebs. Funnily enough all the capabilities of working offline are baked into the products. e.g.:

          Teams: Keep key contacts in Outlook and/or on your phone and actually talk to them the old fashioned way.

          OneDrive / Sharepoint: Use offline folders and keep a local copy of key docs

          Exchange: Send/receive might be problematic, but you can at least keep recent mails locally on any decent mail client if you tell it to

          Install the apps on your PC rather than use the web versions

          I've lost much more worktime due to muppets digging through phone cables or running over street cabinets. Most organisations should be able to come up with a plan to keep the important things going for a week or two with little cost given all the tools we have available to us now.

          Actually, that's not entirely true. The biggest killer of working time has been pointless meetings that could have been completed much quicker and amicably in a lunchtime pub. Now maybe Teams needs to come with a beer tap?

          1. Pirate Dave Silver badge

            Re: There's always a fallback option

            "the inevitable local glitch that cuts you off the interwebs."

            That's not what we're talking about. We're talking about Microsoft nuking itself off of the interwebs. There should already be redundant links into your DC to overcome the simple loss of local carrier. But if the Mighty Microsoft blows out its own core in such a way that it's out for a day or two or five, then the business world that's bought into their candy-coated dream is fucked. Sure, it can be overcome to an extent, but the truth is, for most non-huge businesses, there aren't many "hot spare" replacements for O365 itself that are plug-and-play ready in 10 minutes to keep things flowing across the business.

            And that's the real crux of the problem - MS has built this huge silo of a system and gotten a lot of businesses dependent on it, but MS still suffers a great deal of internal stupidity that keeps fucking it up. And if you're in whole-hog, it's very difficult to overcome a major outage.

            1. Ken Moorhouse Silver badge

              Re: there aren't many "hot spare" replacements for O365 itself

              I wonder if searches for "Download Libreoffice" spiked during the incident.

          2. SanitizeR
            Trollface

            Re: There's always a fallback option

            No, actually there isn't.

            Let's just use one example - one most actually have to deal with these days. Let's say your company is using O365 and all data is stored in OneDrive. It's not possible to keep an offline copy of your entire dataset if your company has a single repo in OneDrive - not to mention that an hour for some places equates to thousands of changes and new documents. Sure, your own stuff you can keep, but what about the terabytes of stuff you'd have to keep, when most hardware (corporate laptops) doesn't go over 512GB of internal storage and not all of it should be a OneDrive backup.

            Exchange - this comes down to your settings (like caching your emails) and private PST rules and backups. Also you can't email anyone using personal domains, so this will bite you.

            Office apps like Word - these are apps that should be installed locally or served up in a remote services gateway. Open office or Libre Office can be used in a pinch.

            MSTeams - I agree. Call people. However you'll want to keep a secondary backup free service on tap for conference calls only. Chat can wait or you can use a free slack instance or Discord for a day while MSFT figures their mess out.

            Good IT teams will have redundancy in place and directions delivered to the people they help out. Ideally, training when hired on will cover what to do in this type of situation.

            Going all in is fine, just be ready for things to break because this will happen again, and again, and probably again. If your employer does not plan, and has no plan for these situations, they need to hurry up and fix it. All teams in IT plan for failure, so if your in-house help desk has no plans, well, tell someone who can make planning happen. Show them this article for a perfect example.

            1. Jellied Eel Silver badge

              Re: There's always a fallback option

              Going all in is fine, just be ready for things to break because this will happen again, and again, and probably again. If your employer does not plan, and has no plan for these situations, they need to hurry up and fix it. All teams in IT plan for failure, so if your in-house help desk has no plans, well, tell someone who can make planning happen. Show them this article for a perfect example.

              Yep, and this is one of the reasons why I'm glad I'm semi-retired and don't have to deal with this as much any more. Far too many businesses still see IT as a cost centre, rather than essential to their very existence. My usual examples are finance, where network outages can cost colossal amounts of money when markets turn against them and they can't trade. Or fines, given all the legal data retention requirements. The other I saw a lot was industries that rely on apps like SAP for stock and process control. Then 'save money' by hooking sites up with cheap xDSL. So they end up in FUN! situations where the connection goes down, process/inventory control stops and so does their business. Then more fun trying to restore their operations to a known state so they can get back to BAU. SLAs mean the xDSL connection might get fixed next day, and compensation will be a month's credit, or part thereof.

              The solution is usually simple, so I often drafted up a quick network diagram and asked the client to come up with a cost per hour/day for any location to be offline. That often met with a lot of resistance, especially from finance types. Sure, IT costs a lot of money, but generally a lot less than the loss of business or reputational damage when it all goes wrong.

              1. EnviableOne

                Re: There's always a fallback option

                It's the perennial cost-benefit approach, the purse strings tend to open more when you can put things in $ amounts.

                this $connection being down for $time costs you $$$$$, and if it happens during peak business it costs $$$$$$$$$$$$$$, and we have guaranteed it will be up for $9s

                so, simple sums: if we spend $$ extra building the system this way, we save $$$$$$$$ on average, and if the worst happens it will cost us only $$ more, saving us in the long run

                if you want to make the risk disappear entirely, we can build it that way instead, but it will cost you $$$$$, and if the worst doesn't happen it's a waste of $$$.
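
                the same sums with numbers plugged in (every figure below is invented purely for illustration):

                  outage_cost_per_hour = 50_000      # what the business loses while down
                  baseline_hours_down  = 8           # expected downtime per year, cheap design
                  resilient_hours_down = 1           # expected downtime per year, resilient design
                  extra_build_cost     = 150_000     # one-off cost of the resilient design

                  baseline_loss   = baseline_hours_down * outage_cost_per_hour
                  resilient_loss  = resilient_hours_down * outage_cost_per_hour
                  saving_per_year = baseline_loss - resilient_loss

                  print(f"expected outage loss, cheap design:     {baseline_loss:>9,}")
                  print(f"expected outage loss, resilient design: {resilient_loss:>9,}")
                  print(f"payback period: {extra_build_cost / saving_per_year:.1f} years")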

                I spend my time trying to quantify the variables in those equations and have a drawer of reports I can bring out to refute any accusations...

    2. Dr Who

      We all depend on the cloud, whether we like it or not.

      The very term cloud software stems from the cloud symbol used from way back when in network diagrams, originally to depict a large private WAN.

      These days, practically nobody runs a private network to every geographic location that needs access to central systems, and that applies whether those central systems are on prem, in colo or on some sort of SaaS or PaaS offering.

      The cloud in the diagram now depicts the internet, itself a network of many networks, owned and run by many different organisations, any of whom can mess up the world's routing tables. And let's not even mention the DNS root servers.

      Whether you like it or not, you depend utterly on the cloud, wherever your mission critical software is running.

      1. Pirate Dave Silver badge

        Re: We all depend on the cloud, whether we like it or not.

        We must be doing it wrong. We DO have leased fiber between each remote site and don't rely on the cloud for the actual "mission critical software" that runs on our AS400 at Corporate. That's by design, though, not a happy accident.

      2. The Basis of everything is...

        Re: We all depend on the cloud, whether we like it or not.

        There's still a huge amount of private WAN out there. Running VPNs over the internets may well be nice and cheap, but there's no service guarantees so if you've got workloads that need a guaranteed throughput or have sensitivities around latency or response time, or you're simply not allowed to use internet for security/policy reasons then buying a WAN service is still the way to go. And unless you're big enough to be laying your own fiber you're still at the mercy of a provider not screwing up.

      3. An_Old_Dog Silver badge

        Re: We all depend on the cloud, whether we like it or not.

        "Mission-critical software": controls a chip fab's production and testing tools, controls aluminum smelters, controls steel-rolling mills machinery, etc. Those sorts of things will never be "in the cloud". Management of those sorts of facilities is extremely-aware of the costs of an unplanned line stoppage.

    3. Claptrap314 Silver badge

      This is a m$ problem, not a cloud problem

      Outsourcing chunks of IT is absolutely the right decision for many small & medium businesses. Plenty of larger ones as well.

    If, after 40 years, these businesses think that trusting m$ is anything other than a bad gamble, I don't know what more to say.

  4. Duffaboy
    FAIL

    If it was working before then the first thing you must always ask

    WHAT HAS CHANGED

    1. Yet Another Anonymous coward Silver badge

      Re: If it was working before then the first thing you must always ask

      And more importantly, what did this change do to everything else.

      Otherwise you would just change that router back, "fixing" the problem but causing all the other routers to repopulate their tables again - causing another round of outages.

    2. elsergiovolador Silver badge

      Re: If it was working before then the first thing you must always ask

      Where I worked it was a migration from Slack to Teams.

    3. Pascal Monett Silver badge

      Re: If it was working before then the first thing you must always ask

      And the answer almost always will be : "Nothing ! We didn't change anything !"

      Followed by an extensive waste of time re-auditing the entire network until, hey, what's this ? And then you get a "Oh yeah, we had to modify a setting on the B portion of the network because bla bla, but that couldn't possibly have anything to do with the outage, right ?".

      Grrrr.

  5. jollyboyspecial Silver badge

    Waffle

    There's an awful lot of waffle in there, just like every RFO I've ever seen.

    But the essence of it is:

    Planned change wasn't properly peer reviewed

    Shit got fucked up

    Everybody ran around like headless chickens for a bit

    Then we realised the cause of the fuck up

    Shit got fixed

    1. monty75

      Re: Waffle

      You missed:

      Lessons will be learned

      * time passes *

      Same shit happens again

      1. ecofeco Silver badge

        Re: Waffle

        FACT

      2. sgp

        Re: Waffle

        Lessons were learned.

        Staff was replaced.

        1. EnviableOne

          Re: Waffle

          why do you think companies employ Chief Incident Scapegoat Officers?

    2. ITMA Silver badge
      Devil

      Re: Waffle

      So basically a failure to have and use proper change management.

      Who do they think they are? Royal Bank of Scotland?

      1. ITMA Silver badge

        Re: Waffle

        For anyone who is wondering about the Royal Bank of Scotland reference - it "nods" in the direction of at least two incidents.

        The first could be just internal rumour. How based in fact it is I don't know, but the story goes like this and is known as the "P45 incident".

        A guy in network/telecoms decided he was going to implement a minor fix to something and couldn't be bothered to go through change management. Unfortunately the "fix" took out most of their ATM network on one of the last shopping days before Christmas. Cost them a fortune in fees from customers using other banks' ATMs. The guy was immediately given his P45 and shown the door.

        The second is very real and happened not that long after Fred "the Shred" Goodwin almost bankrupted RBS and brought the UK banking system very close to collapse. Afterwards, as a cost-cutting measure, large chunks of RBS IT were "offshored" to the Indian sub-continent, to people who were unfamiliar with almost everything. They rolled out a change - in the middle of the week!!! - to a core banking system, which went wrong and prevented thousands of RBS customers from accessing their accounts. They then made a pig's ear of their "recovery plan" to back the change out.

    3. Anonymous Coward
      Anonymous Coward

      Re: Waffle

      Word on the street is it was an inside job. Some of the recently terminated employees were given a pink slip, but their accounts remained active...

      1. ITMA Silver badge

        Re: Waffle

        Do you mean "inside job" as in deliberate? Or "inside job" due to incompetence?

        1. Anonymous Coward
          Anonymous Coward

          Re: Waffle

          As in "revenge sabotage"

  6. Dan 55 Silver badge

    12:43pm UTC AKA 7:43am EST

    Rather late fixing that issue, the beta testing window had practically closed.

  7. Norman Nescio Silver badge

    SPOF anyone?

    A single IP address change caused this? SPOF anyone?

    I thought 'the cloud' was meant to be resilient and redundant. Where's the chaos monkey when you need it?

    1. Yet Another Anonymous coward Silver badge

      Re: SPOF anyone?

      Amateurs, people have taken down the entire phone system with a single route update

      1. Norman Nescio Silver badge

        Re: SPOF anyone?

        Amateurs, people have taken down the entire phone system with a single route update

        You're not misremembering this incident, perhaps?: The Crash of the AT&T Network in 1990

        1. General Purpose

          Re: SPOF anyone?

          What a superbly elegant cascade. Thank you.

        2. This post has been deleted by its author

        3. Anonymous Coward
          Anonymous Coward

          Re: SPOF anyone?

          I worked for a company that had some virtual Aerohive VPN concentrators. There was a bug in the VMware E1000 virtual driver that caused a PSOD on the hypervisor when the VM was booting up - I think it was ARPing a lot. Unfortunately, the cluster had HA enabled, so it kept powering the VM on on one host after another, each followed by a PSOD. Brought down the entire production cluster in about 5 minutes.

          It nearly happened again because we didn't know what caused the PSOD.

      2. Mike Lewis

        Re: SPOF anyone?

        And Australia's entire EFTPOS network by correcting a spelling error.

        Someone fixed the code they were assigned, noticed a message in adjacent code was incorrectly spelled, corrected it, and down went the network. The length of that message had been hard-coded in the program, and correcting the spelling changed its length.
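
        A toy reconstruction of that failure mode in Python - not the actual EFTPOS code, just the shape of the bug: the length of the status text is baked into the parser, so fixing the spelling silently breaks everything that trusts the old length.

          import struct

          OLD_TEXT = b"TRANSACTION COMPLET"     # misspelled, 19 bytes
          NEW_TEXT = b"TRANSACTION COMPLETE"    # corrected, 20 bytes
          HARD_CODED_LEN = 19                   # still 19 in every deployed parser

          def build_message(text, txn_id=1234):
              return struct.pack(">I", txn_id) + text

          def parse_message(raw):
              txn_id = struct.unpack(">I", raw[:4])[0]
              text = raw[4:4 + HARD_CODED_LEN]       # trusts the hard-coded length
              trailer = raw[4 + HARD_CODED_LEN:]     # anything left over is "garbage"
              return txn_id, text, trailer

          print(parse_message(build_message(OLD_TEXT)))   # parses cleanly
          print(parse_message(build_message(NEW_TEXT)))   # one stray byte downstream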

    2. Kevin McMurtrie Silver badge

      Re: SPOF anyone?

      The way it's written, I figure somebody had to crawl under their desk to power cycle the box with the blinky green lights.

  8. An_Old_Dog Silver badge
    Joke

    Network Switches Like the Heart of Gold's Onboard Computer

    During this re-computation process, the routers were unable to correctly forward packets traversing them.

    ... the switches were trying to figure out how to make tea the proper British way.

    1. CrazyOldCatMan Silver badge

      Re: Network Switches Like the Heart of Gold's Onboard Computer

      how to make tea the proper British way

      In a pot, using freshly-boiled water and loose tea leaves.

      Put tea pot under tea cosy and leave to steep (time required depends on the type of tea and the strength desired).

      Once at the desired strength, pour into a cup/mug that has already got the desired amount of milk in.

      Drink, accompanied by sandwiches/cakes/biscuits (delete as appropriate).

      Simples.

      1. EnviableOne

        Re: Network Switches Like the Heart of Gold's Onboard Computer

        But no! You can't put the milk in first, it has to be tea in first.

        And sometimes milk is the wrong option for the selected tea; black is better, or occasionally a little squeeze of lemon...

  9. An_Old_Dog Silver badge

    Testing at Full Load/Scale

    ... is difficult to do, especially in a heterogeneous environment. Yet megacorps such as Microsoft, Google, etc. have the resources, if not the office-political will, to do it properly.

    With great power comes great responsibility, but you know that song.

    1. Cris E

      Re: Testing at Full Load/Scale

      Oh come on, they did full scale testing and found the problem almost immediately. What'd the article say, detected by 0712 and identified by 0820? That's pretty fast.

      Oh, a *separate* full scale heterogeneous environment.... Yeah, no one has that anymore.

      1. Black Label1
        Black Helicopters

        Re: Testing at Full Load/Scale

        Right, seems like Microsoft, Google, Amazon et al. also have beancounters controlling network gear expenditure.

    2. CrazyOldCatMan Silver badge

      Re: Testing at Full Load/Scale

      but you know that song

      # We all fall together, my oh my..

  10. Black Label1
    Black Helicopters

    Information disclosure

    The wobble also affected Azure Government cloud services.

    Are you, M$ + ElReg, really saying that compromising a few of Microsoft's Cisco routers enables some wrong-doers to compromise USAGov, Pentagon Cloud et al.?

    If I got you correctly, this was a sloppy press release.

  11. ecofeco Silver badge

    LOL wut?

    Isn't this first year network engineering?

    FFS.

  12. Anonymous Coward
    Anonymous Coward

    It's an easy mistake to make. Only the other day I changed the name of my server at home and couldn't access the samba shares. I had a 2-hour outage. It was pandemonium. Had to wait to watch the latest episodes of Velma.

  13. Kev99 Silver badge

    I guess Microsoft uses the same high-level quality control on its WAN as it does on its software.

  14. Ken Moorhouse Silver badge

    ...until those systems were manually restarted...

    So, in a nutshell, they turned the internet off, then back on again.

  15. Toni the terrible Bronze badge
    Devil

    Microsoft? No

    It was a Putin-inspired attack! Or those Norks.

  16. but what do I know?

    Why

    I'm not a network engineer, so can someone explain why changing the IP address of a LIVE system would ever be necessary?

    1. Teal Bee

      Re: Why

      My guess is general housekeeping or network consolidation. Since networks are allocated in blocks, perhaps Microsoft engineers allocated a block that was too large for that network, then realized that fewer than half of those IPs were being used and decided to allocate a smaller block and free up a portion of those addresses.

      With the current deficit of IPv4 addresses, it makes sense to optimize their usage.
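
      If that's what happened, the housekeeping looks something like this with Python's ipaddress module (private example ranges, obviously not Microsoft's real blocks):

        import ipaddress

        oversized = ipaddress.ip_network("10.10.0.0/23")   # 512 addresses allocated
        print(oversized.num_addresses)                     # 512

        # split into two /24s, keep one and return the other to the free pool
        keep, release = oversized.subnets(new_prefix=24)
        print("keep:", keep, "with", keep.num_addresses, "addresses")
        print("release:", release)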
