
Fat Fingers Foul Frantic Fellows.
The global outage of Microsoft 365 services that last week prevented some users from accessing resources for more than half a working day was down to a packet bottleneck caused by a router IP address change. Microsoft's wide area network toppled a bunch of services from 07:05 UTC on January 25 and although some regions and …
This sounds like a variation on the usual BGP fat fingering.
Not really. Well, not unless you're running BGP as an IGP (Interior Gateway Protocol) and not an EGP (Exterior Gateway Protocol). That's one of those can-you/should-you questions where the answer is generally 'NO!' because BGP is blissfully unaware of the network topology. So..
"As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN,"
Well, yes. That's what routers do. IP addresses change, topology changes, routing tables are recalculated and updated, and traffic resumes. That's more computationally expensive than just forwarding packets, so it gives the routers more work to do. This has been known since before 1989, when OSPF became an IETF standard. Or long before then, given networks existed prior to the Internet.
MS seems to have missed that memo, and the importance of managing your routing tables, segmenting your networks to reduce table complexity and reconvergence time, and optimising your time-outs. If you don't, this situation occurs. Especially if the traffic drops for long enough that other protocols (like BGP*..) time out, then start generating more traffic or attempting to re-route, and you get your good ol' fashioned train wreck.
So I'm kinda surprised this hasn't happened before, and it generally indicates that Microsoft's network is getting overloaded. With link-state protocols like OSPF, losing a connection can result in routing table updates. The 'fix' generally requires re-partitioning the network to re-segment it, and reduce the time to reconverge. Or throw hardware at the problem. Which again is an old one, especially from the good ol' days when this was done in CPU, and 'core' routers from the likes of Juniper & Cisco were basically commodity PCs hooked up to a backplane via plain ol' PCI or Ethernet. But the fix is generally a big job.
*This is FUN! Which gives non-network engineers time to play a bit of Dwarf Fortress, now available on Steam. Plain ol' congestion can also create the same problem, eg if there's too much traffic across a BGP session, keepalives may drop, sessions drop, traffic drops, congestion vanishes into the bit bucket. Then BGP attempts to re-establish the connection, congestion re-occurs, BGP drops again and you go make a coffee until the flap-dampening timeouts have expired and you can try again. One of the reasons why routing protocols and updates should be prioritised, but it gets politically complicated for Internet peering sessions because of Net Neutrality. Luckily most sane ISP netengs ignore the politics and JFDI because we know we'll get blamed.
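For anyone who wants to see the flap mechanics without touching a real router, here's a toy sketch - not any real BGP stack, just the common default timers (60s keepalive, 180s hold) and made-up dampening numbers - showing how dropped keepalives under congestion turn into flaps and then suppression:

```python
# Toy illustration of a BGP-style keepalive/hold-timer interaction under
# congestion. Not a real BGP implementation; the timer values are the common
# defaults and the dampening numbers are invented purely for the sketch.

import random

KEEPALIVE_INTERVAL = 60      # seconds between keepalives we try to send
HOLD_TIME = 180              # peer declares the session dead after this much silence
DAMPENING_PENALTY = 1000     # penalty added per flap (toy numbers)
SUPPRESS_LIMIT = 2000        # above this the route gets suppressed
HALF_LIFE = 900              # penalty halves every 15 minutes

def run_session(congestion_drop_rate, duration=3600):
    """Count how often the session flaps when keepalives get dropped."""
    silence = 0              # seconds since the peer last heard a keepalive
    penalty = 0.0
    flaps = 0
    for _ in range(0, duration, KEEPALIVE_INTERVAL):
        penalty *= 0.5 ** (KEEPALIVE_INTERVAL / HALF_LIFE)   # exponential decay
        if random.random() < congestion_drop_rate:
            silence += KEEPALIVE_INTERVAL                    # keepalive lost to congestion
        else:
            silence = 0                                      # keepalive arrived, hold timer reset
        if silence >= HOLD_TIME:
            flaps += 1                                       # session torn down, routes withdrawn
            penalty += DAMPENING_PENALTY
            silence = 0                                      # ...and we try to bring it back up
    return flaps, penalty > SUPPRESS_LIMIT

if __name__ == "__main__":
    for drop in (0.0, 0.5, 0.9):
        flaps, suppressed = run_session(drop)
        print(f"drop rate {drop:.0%}: {flaps} flaps, suppressed={suppressed}")
```

Real dampening is per-prefix and rather more involved, but the shape is the same: the more the session bounces, the longer you sit in the naughty corner.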
Like, I understood the words you were saying, but not the meaning.
You have not been initiated into the dark arts of router wrangling. Sooo.. lemme try something that doesn't always work, a motoring analogy!
You're driving along, and your satnav tells you there's congestion up ahead. Would you like to take an alternate route? Chances are your satnav already used a version of OSPF (Open Shortest Path First) to pick the route you're on based on driving preferences. Networks do much the same thing, if configured correctly. You may prefer highways, or side roads, and OSPF has parameters like link cost.
But if you say 'yes', chances are everyone else who's had the same update will try the same alternate route. That route will then likely end up more congested, especially if drivers notice and try to get back on the original course. It's a less complex problem because there are generally fewer roads, so fewer options to calculate, or recalculate. In a network, the more nodes and links there are, the more complicated it gets to recalculate all possible paths, and the longer it takes.
So routing protocols and design generally try to reduce that complexity by breaking networks down into zones, or areas in OSPF speak. That reduces the calculation/reconvergence time when routing tables update. You can also reduce it by limiting the number of routers that hold a 'full routing table'. Some think it's a good idea for every router to have a full table that knows where everything is, except then this happens, and everything has to recalculate it. Kinda like suddenly finding your car's been teleported to a random location, your satnav is down, you have no map and someone's stolen all the road signs. You still know where you need to go, but have no idea how to get there.
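The 'recalculate all possible paths' bit is essentially a shortest-path-first (Dijkstra) run over the link-state database, which is why smaller areas reconverge faster. A toy sketch with a made-up four-router topology (nothing to do with MS's actual network):

```python
# Toy SPF (Dijkstra) run over a made-up link-state database. Real OSPF does
# this per area, which is the whole point of areas: a smaller graph means a
# faster recalculation every time a link flaps.

import heapq

# adjacency list: router -> [(neighbour, link cost), ...]  (hypothetical topology)
LSDB = {
    "R1": [("R2", 10), ("R3", 5)],
    "R2": [("R1", 10), ("R4", 1)],
    "R3": [("R1", 5), ("R4", 20)],
    "R4": [("R2", 1), ("R3", 20)],
}

def spf(source):
    """Return the lowest-cost distance from source to every other router."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist.get(node, float("inf")):
            continue                      # stale entry, a cheaper path was already found
        for neighbour, link_cost in LSDB[node]:
            new_cost = cost + link_cost
            if new_cost < dist.get(neighbour, float("inf")):
                dist[neighbour] = new_cost
                heapq.heappush(heap, (new_cost, neighbour))
    return dist

print(spf("R1"))   # {'R1': 0, 'R2': 10, 'R3': 5, 'R4': 11}
```

Areas work because each router only has to run this over its own area's graph, plus summaries from elsewhere.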
Basically it's a scenario that shouldn't happen with a decent network design. Most routers don't really route anyway, because often they don't have that many interfaces to choose from. So just forward the traffic thataway--> towards a device that should have a better idea. Forwarding is cheaper & faster than routing. I guess it's also a bit like being told there's a road closure in Tokyo when you're in the UK. If you're just concerned with your local area, you don't need to know, and don't care. You only really need to know about your bit of the world to get around.
Datacentres can get more complicated because there might be thousands, or tens of thousands of devices inside a bit barn, but most of those still don't need to route because they're usually just sitting on LANs and can switch/forward traffic instead. You can run routing instances inside VMs though, but often you just point traffic to a provider's default gateway, and hope they can figure it out.
I understood it, and Jellied Eel was giving the high-level explanation. I'm certain JE could go into a lot more detail.
In general, it's not often you need to change the IP address of a core router. If you do, and you are operating a redundant network, good practice is to disable the interface so there is no traffic transiting it before changing the IP address. Of course, if you haven't got the capacity to move the traffic before changing the address, you have other problems.
If the network is big enough, I'd also expect it to use redundant BGP route reflectors rather than expecting the traffic-carrying routers to also handle a full-mesh of BGP updates. That way lies madness. And downtime.
I think it's a problem with not doing an adequate risk assessment as part of a change control review. It's one of those things that sounds simple, but isn't. It's also something you tend to learn the hard way, either directly (debug all on a core peering router.. oops) or indirectly after swapping war stories with other netengs. This is why the social aspects of events like LINX, RIPE, NANOG etc are important. It's also something that often isn't really covered well during certification bootcamps.
So the issue is you renumber an interface. Simple. But what happens next? That really requires taking a holistic approach and looking at the impact on the entire network, including applications. Basic problem is IP is dumb. Really dumb. There are a lot of things it won't do, or won't do well without a bunch of kludges. There are also a lot of dependencies on other network layers/protocols, and some of the functionality is often done better there than with IP.
So often it's a good idea to monitor what's going through the interface you're about to delete. Not what should be going through that interface. That's one way to detect the static route someone added to fix a previous routing problem. But renumbering an interface is going to have all sorts of.. interesting consequences. Especially if you get netmasks wrong.
Traffic stops, and will drop into the bit bucket. Routing tables will withdraw the route. Applications will probably keep trying to send traffic. Congestion may occur. You are on the actual console, or OOB connection, aren't you? This will also cascade. So Ethernet ports use ARP to tie MAC addresses to IPs. That's gone. If it's a v6 network and NDP, it's still gone. On a router, this can be.. bad, as in "where TF did my default gateway go?". This is also where other fun kicks in. Like, say, DHCP if it's a default (or popular) route. Then there's DNS. There's always DNS. Especially if you're running stuff like load balancers that rely on DNS PTR records, CNAMEs etc. Then, did you remember to update in-addr.arpa for the IP address/host name change? And drop TTLs, add a step to flush DNS caches etc?
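On the DNS point, one sanity check worth scripting after a renumber is whether the forward and reverse records still agree. A minimal sketch using plain stdlib lookups - the host names are placeholders, substitute your own:

```python
# Quick post-renumber sanity check: does the forward (A) record for a host
# and the reverse (in-addr.arpa PTR) record for its new address still agree?
# Host names below are placeholders for illustration.

import socket

HOSTS = ["router1.example.net", "lb1.example.net"]   # hypothetical names

def check(hostname):
    try:
        addr = socket.gethostbyname(hostname)          # forward lookup (A record)
    except socket.gaierror as exc:
        return f"{hostname}: forward lookup failed ({exc})"
    try:
        ptr_name, _, _ = socket.gethostbyaddr(addr)    # reverse lookup (PTR record)
    except socket.herror as exc:
        return f"{hostname} -> {addr}: no PTR record ({exc})"
    status = "OK" if ptr_name.rstrip(".") == hostname else "MISMATCH"
    return f"{hostname} -> {addr} -> {ptr_name}: {status}"

if __name__ == "__main__":
    for host in HOSTS:
        print(check(host))
```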
So basically a lot of things that will happen if you 'just' renumber an interface. Which will confuse the heck out of connected servers/applications until everything updates, reconverges and calms down. Which assumes you're aware of all the things that can potentially go wrong, and ways to resolve those or rollback the changes. And because these kinds of changes can (probably will) result in a lot of additional traffic, rolling back to the original IP address may not help, or might trigger a fresh wave of chaos. Especially if that traffic leads to congestion, more packets being lost, more retransmissions etc.
But basically not something to do lightly on a busy production network. Potential fixes, if this identifies capacity/congestion problems are also distinctly non-trivial, ie re-designing and re-segmenting a production network the size of MS's without extended downtime is a huge job to do properly, and safely.
I'm no expert on the server/application side of doing this, but know enough to know it's the kind of change that has to be approached with extreme caution. Then again, it's also a learning experience when you hit 'unexpected' problems like this. It's also potentially one of the scaling problems with IP. That was kind of hit years ago with the realisation that global Internet tables were too large to fit into existing router memory. Since then, the challenge has arguably gotten worse. So now we have MPLS and virtual networks, where one core device may have hundreds, or thousands of virtual routing instances, many (or most) of which will be impacted by any changes to core interfaces. This is one of the reasons why service providers generally restrict, or at least want to know how many routes you expect to cram into a single VRF.
We obviously work in the same industry.
It's probably true everywhere, but one measure of how well we are doing our jobs is the lack of problems to the extent that people think we do nothing and that our job is simple.
"the realisation that global Internet tables were too large to fit into existing router memory."
I remember a multinational client who wanted redundant and resilient Internet connections from their AS to ISPs in different countries. Which is a reasonable request. But they were used to having a small router connecting to their ISP via a default route...
Even clients in the UK who wanted to use different ISPs for redundancy rapidly discovered that this networking lark was more complicated than they imagined, especially if they wanted load-balancing across the different connections. There was a lot of "But surely you can just....?"
Yep, I hate that one. Just configure 2 default routes. It'll be fine (ish). It's one of those that can actually work, just not necessarily the way you hoped or expected. And not taught in J/CCIE boot camps either. Load balancing is just one of those things IP was never really intended to do. Nor were most routing protocols, given they're generally designed to pick the best route, not routes. Same with resiliency/redundancy. Why pay all this money for capacity we aren't using? Well, why pay all that money for Directors' and Officers' indemnity insurance when you never claim on it? Is that because it doesn't cover negligence?
There are also little tricks that can be abused along the way. Got 2 fibres or wavelengths running diversely? Just use optical protection switching and it'll generally switch paths long before a router notices. Preferred box-shifter solution is, of course, to flog you a 2nd router fully loaded and then leave you to figure out how to configure your traffic management. Which then makes your routing more complicated, which means more stuff to go wrong. Luckily these days there are various ways to do channel bonding and group multiple circuits/wavelengths into a single virtual trunk. Usually best done at layer-2, and it generally handles diversity better than an HR department. Providing latencies across the paths are close enough, which is often the case on metro circuits, and can be on WAN connections as well. Fun can also be had with passive optical splitters. Use a cheap Y-cable to clone traffic across 2 links, or a single link to 2 interfaces. Just make sure one interface is shut down, cos routers and applications generally don't like dupe packets.
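Worth noting that both the 'just configure 2 default routes' trick and most channel bonding end up hashing per flow rather than per packet, which is why the split is rarely the neat 50/50 people expect. A toy illustration (next-hop names and flow sizes made up):

```python
# Toy per-flow (ECMP-style) hashing: two exits doesn't mean a neat 50/50 split
# of traffic, it means each flow gets pinned to one exit based on a hash, so a
# handful of big flows can land on the same link. Next hops are invented names.

import random
from collections import Counter
from zlib import crc32

NEXT_HOPS = ["isp_a", "isp_b"]

def pick_next_hop(src_ip, dst_ip, src_port, dst_port, proto=6):
    """Hash the 5-tuple so every packet in a flow takes the same path."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return NEXT_HOPS[crc32(key) % len(NEXT_HOPS)]

if __name__ == "__main__":
    random.seed(1)
    # Simulate 10 flows of wildly different sizes (arbitrary traffic units).
    flows = [("10.0.0.%d" % i, "203.0.113.9", random.randint(1024, 65535), 443,
              random.choice([1, 5, 500]))            # a few elephant flows
             for i in range(10)]
    load = Counter()
    for src, dst, sport, dport, size in flows:
        load[pick_next_hop(src, dst, sport, dport)] += size
    print(dict(load))   # the split is per-flow, not per-byte, so often lopsided
```

Per-packet spraying would balance better, but it reorders TCP, which is why almost nothing does it by default.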
This is the problem with the cloud. One change stopped hundreds of people being able to access their software/systems. It was only a couple of hours, but it could have been days, and it could have cost a lot of businesses their survival. While Office is unlikely to be used in life and death situations, cloud software could be used in life and death situations, and a failure could cost lives.
cloud software could be used in life and death situations, and a failure could cost lives.
If it's that critical and that life-and-death, then you absolutely have to make management understand the risks and allow you to build sufficient redundancy into the design.
If there's anything good to come out of events like this Microsoft outage, it's to have real world events that you can use as examples for why you want to keep things off the cloud, or be given enough budget to design something that has a fallback in the event that the cloud fails.
I'm not disagreeing, but for places that have gone in whole-hog on the Office365 train, where exactly do they fall back to if that train runs off the track? If Teams in O365 is borked, what are the options for a 1-2 day outage? Or OneDrive/Sharepoint? Or Exchange? I mean, for the vast majority of corps that are O365 slaves, if O365 shits itself at this level, then they are just SOL. And that's by design of Microsoft, since they don't want to play well with others.
If you're all-in on O365, and especially if working remote, then do a little homework and prepare yourself for the inevitable local glitch that cuts you off the interwebs. Funnily enough all the capabilities of working offline are baked into the products. e.g.:
Teams: Keep key contacts in Outlook and/or on your phone and actually talk to them the old fashioned way.
OneDrive / Sharepoint: Use offline folders and keep a local copy of key docs
Exchange: Send/receive might be problematic, but you can at least keep recent mails locally on any decent mail client if you tell it to
Install the apps on your PC rather than use the web versions
I've lost much more worktime due to muppets digging through phone cables or running over street cabinets. Most organisations should be able to come up with a plan to keep the important things going for a week or two with little cost given all the tools we have available to us now.
Actually, that's not entirely true. The biggest killer of working time has been pointless meetings that could have been completed much quicker and amicably in a lunchtime pub. Now maybe Teams needs to come with a beer tap?
"the inevitable local glitch that cuts you off the interwebs."
That's not what we're talking about. We're talking about Microsoft nuking itself off of the interwebs. There should already be redundant links into your DC to overcome the simple loss of local carrier. But if the Mighty Microsoft blows out its own core in such a way that it's out for a day or two or five, then the business world that's bought into their candy-coated dream is fucked. Sure, it can be overcome to an extent, but the truth is, for most non-huge businesses, there aren't many "hot spare" replacements for O365 itself that are plug-and-play ready in 10 minutes to keep things flowing across the business.
And that's the real crux of the problem - MS has built this huge silo of a system and gotten a lot of businesses dependent on it, but MS still suffers a great deal of internal stupidity that keeps fucking it up. And if you're in whole-hog, it's very difficult to overcome a major outage.
No, actually there isn't.
Let's just use one example - one most actually have to deal with these days. Let's say your company is using O365 and all data is stored in OneDrive. It's not possible to keep an offline copy of your entire dataset if your company has a single repo in OneDrive - not to mention that an hour for some places equates to thousands of changes and new documents. Sure, your own stuff you can keep, but what about the terabytes of stuff you'd have to keep, and most hardware (corporate laptops) doesn't go over 512 GB of internal storage, and not all of it should be a OneDrive backup.
Exchange - this comes down to your settings (like caching your emails) and private PST rules and backups. Also you can't email anyone using personal domains, so this will bite you.
Office apps like Word - these are apps that should be installed locally or served up via a remote services gateway. OpenOffice or LibreOffice can be used in a pinch.
MSTeams - I agree. Call people. However you'll want to keep a secondary backup free service on tap for conference calls only. Chat can wait or you can use a free slack instance or Discord for a day while MSFT figures their mess out.
Good IT teams will have redundancy in place and directions delivered to the people they help out. Ideally, training when hired on will cover what to do in this type of situation.
Going all in is fine, just be ready for things to break because this will happen again, and again, and probably again. If your employer does not plan, and has no plan for these situations, they need to hurry up and fix it. All teams in IT plan for failure, so if your in-house help desk has no plans, well, tell someone who can make planning happen. Show them this article for a perfect example.
Yep, and this is one of the reasons why I'm glad I'm semi-retired and don't have to deal with this as much any more. Far too many businesses still see IT as a cost centre, rather than essential to their very existence. My usual examples are finance, where network outages can cost colossal amounts of money when markets turn against them and they can't trade. Or fines, given all the legal data retention requirements. The other I saw a lot was industries that rely on apps like SAP for stock and process control. Then they 'save money' by hooking sites up with cheap xDSL. So they end up in FUN! situations where the connection goes down, process/inventory control stops and so does their business. Then more fun trying to restore their operations to a known state so they can get back to BAU. SLAs mean the xDSL connection might get fixed next day, and compensation will be a month's credit, or part thereof.
The solution is usually simple, so I often drafted up a quick network diagram and asked the client to come up with a cost per hour/day for any location to be offline. That often met with a lot of resistance, especially from finance types. Sure, IT costs a lot of money, but generally a lot less than the loss of business or reputational damage when it all goes wrong.
It's the perennial cost-benefit approach: the purse strings tend to open more when you can put things in $ amounts.
this $connection being down for $time costs you $$$$$, and if this happens during peak business it cost $$$$$$$$$$$$$$, and we have guaranteed it will be up for $9s
so simple sums if we spend $$ extra by building the system this way, we save the $$$$$$$$ on average and if the worst happens it will cost us $$ more, so saving us in the long run
if you want to disappear the risk entirely we can build it this way, but that will cost you $$$$$; if the worst doesn't happen it's a waste of $$$.
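Those sums, with invented numbers just to show the shape of the argument:

```python
# The back-of-an-envelope sums above, with made-up figures: compare the
# expected annual cost of 'cheap and occasionally down' against 'resilient
# and more expensive'. Every number here is invented for illustration.

def expected_annual_cost(build_cost, outage_hours_per_year, cost_per_outage_hour):
    return build_cost + outage_hours_per_year * cost_per_outage_hour

cheap     = expected_annual_cost(build_cost=20_000,  outage_hours_per_year=24, cost_per_outage_hour=15_000)
resilient = expected_annual_cost(build_cost=120_000, outage_hours_per_year=2,  cost_per_outage_hour=15_000)

print(f"cheap build:     ${cheap:,} per year")      # $380,000
print(f"resilient build: ${resilient:,} per year")  # $150,000
```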
I spend my time trying to quantify the variables in those equations and have a drawer of reports I can bring out to refute any accusations...
The very term cloud software stems from the cloud symbol used from way back when in network diagrams, originally to depict a large private WAN.
These days, practically nobody runs a private network to every geographic location that needs access to central systems, and that applies whether those central systems are on prem, in colo or on some sort of SaaS or PaaS offering.
The cloud in the diagram now depicts the internet, itself a network of many networks, owned and run by many different organisations, any of whom can mess up the world's routing tables. And let's not even mention the DNS root servers.
Whether you like it or not, you depend utterly on the cloud, wherever your mission critical software is running.
We must be doing it wrong. We DO have leased fiber between each remote site and don't rely on the cloud for the actual "mission critical software" that runs on our AS400 at Corporate. That's by design, though, not a happy accident.
There's still a huge amount of private WAN out there. Running VPNs over the internets may well be nice and cheap, but there's no service guarantees so if you've got workloads that need a guaranteed throughput or have sensitivities around latency or response time, or you're simply not allowed to use internet for security/policy reasons then buying a WAN service is still the way to go. And unless you're big enough to be laying your own fiber you're still at the mercy of a provider not screwing up.
"Mission-critical software": controls a chip fab's production and testing tools, controls aluminum smelters, controls steel-rolling mills machinery, etc. Those sorts of things will never be "in the cloud". Management of those sorts of facilities is extremely-aware of the costs of an unplanned line stoppage.
Outsourcing chunks of IT is absolutely the right decision for many small & medium businesses. Plenty of larger ones as well.
If after 40 years, these businesses think that trusting m$ is anything other than a bad gamble, I don't know what more to say.
And more importantly, what did this change do to everything else?
Otherwise you would just change that router back, "fixing" the problem but causing all the other routers to repopulate their tables again - causing another round of outages
And the answer almost always will be : "Nothing ! We didn't change anything !"
Followed by an extensive waste of time re-auditing the entire network until, hey, what's this ? And then you get a "Oh yeah, we had to modify a setting on the B portion of the network because bla bla, but that couldn't possibly have anything to do with the outage, right ?".
Grrrr.
For anyone who is wondering about the Royal Bank of Scotland reference - it "nods" in the direction of at least two incidents.
The first could be just internal rumour. How based in fact it is I don't know, but the story goes like this and is known as the "P45 incident".
A guy in network/telecoms decided he was going to implement a minor fix to something and couldn't be bothered to go through change management. Unfortunately the "fix" took out most of their ATM network on one of the last shopping days before Christmas. Cost them a fortune in fees from customers using other banks' ATMs. Guy was immediately given his P45 and shown the door.
The second is very real and happened not that long after Fred "the Shred" Goodwin almost bankrupted RBS and brought the UK banking system very close to collapse. Afterwards, as a cost cutting measure, large chunks of RBS IT were "offshored" to the Indian sub-continent, to people who were unfamiliar with almost everything. They rolled out a change - in the middle of the week!!! - to a core banking system, which went wrong, preventing thousands of RBS customers from accessing their accounts. They then made a pig's ear of their "recovery plan" to back the change out.
A single IP address change caused this? SPOF anyone?
I thought 'the cloud' was meant to be resilient and redundant. Where's the chaos monkey when you need it?
Amateurs. People have taken down the entire phone system with a single route update.
You're not misremembering this incident, perhaps?: The Crash of the AT&T Network in 1990
I worked for a company that had some virtual Aerohive VPN concentrators. There was a bug in the VMware E1000 virtual driver that caused a PSOD on the hypervisor when the VM was booting up; I think it was ARPing a lot. Unfortunately, the cluster's HA behaviour kicked in and continued to power the VM up on each host, each followed by a PSOD. Brought down the entire production cluster in about 5 minutes.
It nearly happened again because we didn't know what caused the PSOD.
And Australia's entire EFTPOS network by correcting a spelling error.
Someone fixed the code they were assigned, noticed a message in adjacent code was incorrectly spelled, corrected it, and down went the network. The length of that message had been hard-coded in the program, and correcting the spelling changed its length.
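For anyone who hasn't met that class of bug, a toy reconstruction - the framing protocol and message are invented, only the 'length hard-coded instead of derived' trap is the point:

```python
# A toy reconstruction of the 'hard-coded message length' trap: the length
# field in the frame header was typed in by hand rather than derived from the
# message, so correcting the spelling (which changes the length) silently
# breaks the framing. The protocol here is made up for illustration.

import struct

def build_frame_buggy(payload):
    # Length hard-coded to 19 back when the (misspelled) message was 19 bytes...
    return struct.pack("!H", 19) + payload

def build_frame_fixed(payload):
    return struct.pack("!H", len(payload)) + payload   # derive it instead

def parse_frame(frame):
    (length,) = struct.unpack("!H", frame[:2])
    return frame[2:2 + length]                         # the reader trusts the header

print(parse_frame(build_frame_buggy(b"TRANSACTION DECLIND")))   # 19 bytes: works by accident
print(parse_frame(build_frame_buggy(b"TRANSACTION DECLINED")))  # 20 bytes: last byte silently lost
print(parse_frame(build_frame_fixed(b"TRANSACTION DECLINED")))  # fixed: length follows the message
```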
How to make tea the proper British way
In a pot, using freshly-boiled water and loose tea leaves.
Put tea pot under tea cosy and leave to steep (time required depends on the type of tea and the strength desired).
Once at the desired strength, pour into a cup/mug that has already got the desired amount of milk in.
Drink, accompanied by sandwiches/cakes/biscuits (delete as appropriate).
Simples.
The wobble also affected Azure Government cloud services.
Are you, M$ + ElReg, really saying that compromising a few Cisco routers from Microsoft enables some wrong-doers to compromise USAGov, Pentagon Cloud et al.?
If I got you correctly, this was a sloppy press release.
My guess is general housekeeping or network consolidation. Since networks are allocated in blocks, perhaps Microsoft engineers allocated a block that was too large for that network, then realized that less than half of those IPs are used and decided to allocate a smaller block and free up a portion of those IP addresses.
With the current deficit of IPv4 addresses, it makes sense to optimize their usage.
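If that guess is right, the mechanics look roughly like this - the prefixes below are RFC 5737 documentation ranges, not anything Microsoft actually uses:

```python
# The mechanics of the guess above: take an over-sized allocation, work out
# how much of it is actually in use, and carve off a right-sized block so the
# rest can be handed back to the pool. Prefixes are documentation ranges only.

import ipaddress

allocated = ipaddress.ip_network("198.51.100.0/24")   # 256 addresses allocated
hosts_in_use = 90                                     # well under half of them used

# Smallest prefix that still fits the hosts in use (plus network/broadcast).
needed_prefix = 32
while ipaddress.ip_network(f"198.51.100.0/{needed_prefix}").num_addresses < hosts_in_use + 2:
    needed_prefix -= 1

keep, *release = allocated.subnets(new_prefix=needed_prefix)
print(f"keep {keep}, release {[str(n) for n in release]}")
# keep 198.51.100.0/25, release ['198.51.100.128/25']
```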