CenturyLink L3 outage knocks out web giants and 3.5% of all internet traffic

Internet backbone operator CenturyLink has experienced an outage that degraded performance of major web companies around the world. CenturyLink acknowledged the outage and posted basic information about the incident. Our technicians are working to resolve an IP outage. Ensuring the reliability of our services is our top …

  1. John Savard

    My Personal Experience

    I know I was able to use Discord during a period when, apparently due to this event, RuneScape was completely unusable. So it may have been one of the worst-hit services. Even parts of RuneScape's web site were not accessible.

  2. Nick Stallman

    Great reporting

    Good write up - first news site I've seen that didn't say it was a Cloudflare outage.

    Cloudflare got blamed by everyone else since a lot of their error pages were visible to end users. The error pages were only there because the origin servers only had Level 3 transit of course so alternate routes weren't available.

    People saw the Cloudflare logo and instantly assumed they were the source of the problem.

    1. Anonymous Coward
      Pint

      Re: Great reporting

      I agree. Outage "reporting" usually runs the gamut from "<site> is down" to "the internet is collapsing!"

      Rarely does anyone other than the affected companies take the time and effort to dig down to the root cause and even more rarely does anyone report on this. Have a pint, Nick.

      1. IGotOut Silver badge

        Re: Great reporting

        It also helps Cloudflare very early on going "It wasn't us!"

        Yup I'm a CF customer as well.

        1. Mike Pellatt

          Re: Great reporting

          When it _is_ Cloudflare, though, they are exemplary at:

          a) putting their hands up

          b) Giving a detailed explanation of what went wrong

          c) Giving a fully detailed exposition of what they're doing/have done to prevent a recurrence.

          No corporate ass-covering by PR droids from them. Like I said, exemplary.

  3. tomppa_28

    Bad user experience on Twitter

    Yesterday I had a very bad user experience with Twitter in the afternoon (CEST). The site was extremely sluggish and refused uploads. But as you mentioned, no information from Twitter itself.

    1. Anonymous Coward

      Re: Bad user experience on Twitter

      Bad user experience on Twitter

      But isn't that to be expected, even when it IS working?

  4. Maelstorm Bronze badge

    Well...

    Well that might explain why I haven't been able to connect to GIMP's website.

  5. Anonymous Coward

    Could this be the reason for fleaBay's TITSUP as well?

    World+dog were getting DNS resolution errors, which made it kind-of hard to update your stock levels and whatnot :-/

  6. TheMeerkat

    Yes, we started getting alerts, switched all traffic from affected DCs in both the USA and the U.K. Continued as normal. Total traffic was what would be expected for Sunday.

  7. Dvon of Edzore
    FAIL

    BGP takes two to untangle

    Gandi.net reported on the issue that they had dropped their BGP routes through CenturyLink/Level3 but CenturyLink was still advertising the dead Level3 routes. This meant that the mitigation built into the Internet for such dead routes wasn't working, so otherwise functional sites couldn't recover using alternate transport. The BGP storm CL unleashed may have caused enough congestion that the good updates simply couldn't get through.

    In my case much of the Web traffic was still working, but email from my three main providers had stopped. Yahoo was feeling poorly, a fact that might bring some glee but it was affecting viewing several rocket launch attempts, dammit! At least the morning (USA time) launches were scrubbed, so no lasting damage beyond stress taking a few more sanity and health points.
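    The stale-route problem described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Router` class, the `frozen` flag, the transit names and the documentation prefix are all hypothetical, and real BGP behaviour is far richer than a set of advertised prefixes.

```python
class Router:
    """Toy BGP speaker that tracks which prefixes it currently advertises."""

    def __init__(self, name):
        self.name = name
        self.frozen = False      # when True, updates are silently ignored
        self.advertised = set()

    def announce(self, prefix):
        if not self.frozen:
            self.advertised.add(prefix)

    def withdraw(self, prefix):
        if not self.frozen:
            self.advertised.discard(prefix)


def reachable_paths(prefix, transits, healthy):
    """Split the transits still advertising `prefix` into working and black-holing ones."""
    working = [t.name for t in transits
               if prefix in t.advertised and t.name in healthy]
    blackholed = [t.name for t in transits
                  if prefix in t.advertised and t.name not in healthy]
    return working, blackholed


PREFIX = "203.0.113.0/24"         # documentation range, purely illustrative
transit_a = Router("transit-a")   # stands in for the provider with the stuck table
transit_b = Router("transit-b")
for transit in (transit_a, transit_b):
    transit.announce(PREFIX)

transit_a.frozen = True           # the route table stops changing
transit_a.withdraw(PREFIX)        # the customer pulls the route, but nothing happens

working, blackholed = reachable_paths(
    PREFIX, [transit_a, transit_b], healthy={"transit-b"}
)
print("working:", working, "black-holing:", blackholed)
```

    In this model the withdrawal never takes effect on the frozen router, so anyone who prefers that path keeps sending traffic into a dead end even though a working alternative exists.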

    1. Anonymous Coward

      Re: BGP takes two to untangle

      The suspicion was initially on CenturyLink route servers as the CenturyLink route table appeared to be "frozen" with no new routes added or old routes deleted.

      Based on CenturyLink indicating a flowspec issue, it's likely that a flowspec rule partially or completely firewalls BGP. The resulting chaos/time to repair was likely caused by techs being unable to access equipment to triage/fix it quickly.
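      To make the guess above concrete: flowspec rules are traffic filters distributed through BGP itself, and BGP sessions run over TCP port 179, so a rule that is too broad can match the very sessions that delivered it. The sketch below is a simplified model, not CenturyLink's actual rule; `FlowRule`, `Packet` and the rule contents are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

BGP_PORT = 179  # TCP port used by BGP sessions


@dataclass(frozen=True)
class Packet:
    protocol: str
    dst_port: int


@dataclass(frozen=True)
class FlowRule:
    """Hypothetical, heavily simplified stand-in for a flowspec match plus action."""
    protocol: Optional[str]   # None matches any protocol
    dst_port: Optional[int]   # None matches any destination port
    action: str               # "drop" or "accept"

    def matches(self, pkt: Packet) -> bool:
        return (self.protocol in (None, pkt.protocol)
                and self.dst_port in (None, pkt.dst_port))


def forward(pkt: Packet, rules) -> bool:
    """Return True if the packet is forwarded; the first matching rule decides."""
    for rule in rules:
        if rule.matches(pkt):
            return rule.action == "accept"
    return True  # no rule matched, so forward normally


def bgp_session_survives(rules) -> bool:
    """Check whether a BGP keepalive would still get through the given rules."""
    return forward(Packet("tcp", BGP_PORT), rules)


narrow_rule = FlowRule(protocol="udp", dst_port=53, action="drop")
broad_rule = FlowRule(protocol="tcp", dst_port=None, action="drop")

print("narrow rule keeps BGP up:", bgp_session_survives([narrow_rule]))
print("broad rule keeps BGP up:", bgp_session_survives([broad_rule]))
```

      If the broad rule also blocks management traffic, it would fit the idea that techs struggled to reach the equipment, though that part remains speculation.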

      1. yoganmahew

        Re: BGP takes two to untangle

        Thank you - that might explain it. A four-hour partial outage (flapping at 20-40% of traffic) at my place with the fallback routes not working (customers not able to reach us). Only the timing stopped it being a much bigger incident, so I guess +1 for a weekend change slot versus continuous deployment...

  8. TimMaher Silver badge
    Pirate

    Eve Offline

    May also explain why my blockade runner got stuck just outside a gate in Eghellende just at the most dangerous point and connections kept failing.

    I got lucky though and managed to dock during a brief uptime.

  9. pavel.petrman

    IPv6

    I haven't been looking into that topic for some time now, but I remember a few years ago (when I was adding a second IPv4 Internet uplink connection to my home router) that the area of multiple-uplink endpoints remained a gray, contested area in the IPv6 world, and the solution at that time was that everyone with more than one Internet connection should start BGPing (I remember the blog post I read about this being strongly against the idea but stating that there are not many other viable options in this particular scenario). Does anyone here have any understanding of the current situation? I know that there are not many consumer-level endpoints with this problem, but even a small number of private homes and/or small businesses trying to increase the reliability of their Internet connection may wreak havoc by advertising themselves wrongly and occasionally not getting filtered out upstream.

    1. Jellied Eel Silver badge

      Re: IPv6

      It gets complicated, but then proper multi-homing always has been. Issue is basically if you have PA IP addresses (Provider Allocated), then the provider advertises the aggregate, then other providers base their route filters off that. So you can't advertise a more specific PA address via another ISP.

      So 'solution' is to either get PI (Provider Independent) addresses, but then you need to route those via multiple ISPs.. Which then (generally) means BGP, which means getting an ASN (Autonomous System Number) and creating/maintaining RADB objects.

      And from the sounds of things, wouldn't help in this case if Centurylink was advertising L3 routes externally, but they weren't reachable internally.. So traffic would still go towards Centurylink, and then end up in an internal bit bucket inside their cloud.

      It's possible to kludge it locally with 2 ISP connections, and run a router/traffic manager to prefer 1 link over another, but without BGP from upstreams, you wouldn't have a routing table to do any fancy route selection, and wouldn't be able to force return path. So you could prefer outbound over 1 link, but return traffic may still end up trying to go via L3/Centurylink and fail.
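      The PA filtering point can be sketched with Python's standard `ipaddress` module. This assumes a deliberately simple, hypothetical filter policy (reject anything more specific than a known PA aggregate); real route filters are built from registry data and prefix-length limits, and the prefixes here are documentation addresses chosen for illustration.

```python
import ipaddress

# Hypothetical PA aggregate announced by "provider A" (documentation space).
PA_AGGREGATES = [ipaddress.ip_network("2001:db8::/36")]


def accepted_by_filter(prefix_text, aggregates=PA_AGGREGATES):
    """Return False if the prefix is a more-specific slice of a known PA aggregate."""
    prefix = ipaddress.ip_network(prefix_text)
    for aggregate in aggregates:
        if (prefix.version == aggregate.version
                and prefix != aggregate
                and prefix.subnet_of(aggregate)):
            return False
    return True


# Customer's PA assignment, re-announced via a second ISP: filtered.
print(accepted_by_filter("2001:db8:123::/48"))
# The provider's own aggregate: accepted.
print(accepted_by_filter("2001:db8::/36"))
# A prefix outside the aggregate, standing in for PI space: accepted.
print(accepted_by_filter("2001:db8:f000::/48"))
```

      Under this policy the more-specific PA announcement never propagates, which is why PI space plus an ASN ends up being the usual answer.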

  10. MickeyTheMoose

    DMVPNs bouncing like yo-yo's

    We saw the same but for our DMVPN traffic. APAC seemed unaffected but our US and UK sites were bouncing like crazy. I put it down to a BGP "mistake" by our Chinese friends and went out with the kids for the day. We use multiple ISPs in all sites but at some point globally most must have touched on CL's network. Glad it was a Sunday...

    1. Anonymous Coward

      Re: DMVPNs bouncing like yo-yo's

      Much of Europe was affected as well. I live in Norway, and many Norwegian websites worked, but not all.. Most foreign websites were unreachable.

  11. Anonymous Coward

    RFO

    Preliminary Reason for Outage Summary:

    This is a preliminary RFO summary. An official RFO will be distributed and posted to the CenturyLink Customer Portal following the completion of the full post incident review.

    Cause

    An offending flowspec announcement prevented Border Gateway Protocol (BGP) from establishing correctly, impacting client services.

    Resolution

    The IP NOC deployed a configuration change to block the offending flowspec announcement, thus restoring services to a stable state.

    Summary

    On August 30, 2020 10:04 GMT, CenturyLink identified an issue to be affecting users across multiple markets. The IP Network Operations Center (NOC) was engaged, and initial research identified that an offending flowspec announcement prevented Border Gateway Protocol (BGP) from establishing across multiple elements throughout the CenturyLink Network. The IP NOC deployed a global configuration change to block the offending flowspec announcement, which allowed BGP to begin to correctly establish. As the change propagated through the network, the IP NOC observed all associated service affecting alarms clearing and services returning to a stable state.

    Additional details:

    Many customers impacted by this incident were unable to open a trouble ticket due to the extreme call volumes present at the time of the issue. Additionally, the CenturyLink Customer Portal was also impacted by this incident, preventing customers from opening tickets via the Portal. As such, once the official RFO document is complete, it will be posted to the CenturyLink Customer Portal.

    1. Anonymous Coward

      Re: RFO

      I wasn't going to post the whole RFO, but yes...came here to share this. :-)

  12. skwdenyer

    This same outage appears to have taken out all BBC services from multiple geographic locations, too.

    1. Anonymous Coward

      The BBC's live reporting from the Belgian Grand Prix went down just after the race started.

      1. Anonymous Coward
        Joke

        Pah! Who cares about people driving round and round in circles

        Who cares about people driving round and round in circles when there is the far more important activity of staring motionless at a chess board to be concerned about?

        India and Russia declared joint winners of the Online Chess Olympiad

        I wonder how long the connection was lost before anyone realised?

  13. pd4361

    I think that the global impact was down to DNS servers being taken out - DNS caching kept most things going, but the outage was so long that by the end of it, cached entries were starting to expire - and presumably the least-used ones went first.

    In my case, email stopped working, and I found that the name of my mail router wasn't resolvable on Virgin Media's DNS servers. However, it was still OK on Google's DNS service (8.8.8.8), so I switched to that. I tried dnschecker.org and found that my mail router wasn't resolvable on roughly half of the randomly-selected DNS servers - but e.g. google.co.uk was still resolvable on all of them.
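    The caching effect described above can be sketched as a tiny TTL cache. This is a minimal model, not how any particular resolver is implemented: `CachingResolver`, the upstream callables, the hostname and the one-hour TTL are all assumptions for illustration.

```python
class CachingResolver:
    """Toy recursive resolver that caches answers until their TTL runs out."""

    def __init__(self):
        self._cache = {}  # name -> (address, expiry time in seconds)

    def resolve(self, name, now, upstream):
        """Answer from cache if still fresh, otherwise ask `upstream`.

        `upstream` is a callable returning (address, ttl_seconds), or None
        when the authoritative servers cannot be reached.
        """
        entry = self._cache.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]
        answer = upstream(name)
        if answer is None:
            return None  # cache expired and nothing upstream to refresh it
        address, ttl = answer
        self._cache[name] = (address, now + ttl)
        return address


def healthy_upstream(name):
    return ("192.0.2.10", 3600)   # documentation address, one-hour TTL


def unreachable_upstream(name):
    return None                   # authoritative servers cut off by the outage


resolver = CachingResolver()
name = "mail.example.net"         # hypothetical mail router name
print(resolver.resolve(name, now=0, upstream=healthy_upstream))
print(resolver.resolve(name, now=1800, upstream=unreachable_upstream))
print(resolver.resolve(name, now=4000, upstream=unreachable_upstream))
```

    Resolvers that had refreshed the name more recently keep answering for longer, which would be consistent with only about half of the checked servers failing.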

  14. Bruce Ordway

    Router trashed by outage?

    My neighbors' fiber optic connection went down at about the same time.

    They contacted CenturyLink tech support and were told that a replacement for their ZyXEL router would be required.

    Somewhere along the line the outage was also mentioned.

    Before they placed the order, they asked me to look at the router.

    All I could tell them was the router indicated it was communicating with CenturyLink but.... no internet.

    However, I'm not sure how the outage could have trashed the router?

    I remember CenturyLink trashing my DSL router a few years ago.

    In that case, they had sent an auto update that was incompatible with my firmware.
