BT blames 'faulty router' for mega outage. Did they try turning it off and on again?

BT has blamed a faulty router for knocking its network offline yesterday, leaving hundreds of thousands of customers without the internet. The telecoms giant apologised for the failure, which began at around 2pm yesterday afternoon. Customers across the country were unable to get online, with reports of the outage affecting …

  1. Captain TickTock
    Joke

    'Faulty Router'

    So which customer had the faulty router? ;-)

    1. TRT Silver badge

      Re: 'Faulty Router'

      We're talking BT's hardware here. All of them.*

      *EDIT. Just kidding. I've no idea how good or bad their HomeHubs are. I'm probably biased because I only get to hear about them when they stop working properly. Which seems to be fairly often, actually.

      1. Hans Neeson-Bumpsadese Silver badge

        Re: 'Faulty Router'

        "I've no idea how good or bad their HomeHubs are"

        They're s**t.

        I had the HH3 (or 4) when I moved to them for regular ADSL, which I replaced with a NetGear somethingorother fairly quickly when I realised how flaky the BT kit was. Then, having lost all patience with the HH5 I got from them for the Infinity service, I've just replaced that with a NetGear Nighthawk.

        1. Anonymous Coward
          Anonymous Coward

          Re: 'Faulty Router'

          Little wrong with HH5 for what it is, router "random" reboots are caused by BT sending instructions over the WAN instructing it to do so

          Individual consumers simply get fobbed off as dumb or replaced router and no way of getting behind the issue to any technical team..

          BT HH seems to = outage-as-a-service despite the router itself being adequate.

          1. Alan Brown Silver badge

            Re: 'Faulty Router'

            HH4 and HH5 are underpowered. If you drive them hard, they break.

            They're fine for users who are only light netters but as soon as people ramp that up, they fall over regularly.

        2. Anonymous Coward
          Anonymous Coward

          Re: 'Faulty Router'

          We have a HH5 configured as a wireless access point and the wifi range is excellent.

          Due to it being an access point, BT has no control over it either

      2. Timmy B

        Re: 'Faulty Router'

        Generally our home hub is pretty ok. I would update it to something better but the OH may have to deal with issues when I am off in the woods and getting BT to sort it is always going to be easier.

        1. cynic56

          Re: 'Faulty Router'

          .. getting BT to sort it is always going to be easier.

          You obviously don't live on the same planet that I do.

      3. Anthony Hegedus Silver badge

        Re: 'Faulty Router'

        BT "Hubs" are an abomination. They aren't fit for any purpose.

        1. Glenturret Single Malt

          Re: 'Faulty Router'

          @ Anthony Hegedus

          That's funny.

          I have a collection of BT-supplied routers going back about a dozen years to when I switched to wireless. All of them are still in perfect working order and the only lengthy hiccup I have experienced in that time (2 days out) was tracked down to a problem at the local exchange. The comment sounds like those that I see about Windows 10 - I hate BT because they are BT.

          1. Anonymous Coward
            Anonymous Coward

            Re: 'Faulty Router'

            "I hate BT because they are BT."

            Most people hate BT because of what those people have learned from experience with BT.

            "I have a collection of BT-supplied routers "

            These broadband routers may have been BT-supplied, but prior to the days of HomeHubs and so on they almost certainly weren't BT-designed.

            So thank BT for choosing a decent vendor's routers back then.

            Nowadays, with HomeHub... let's not go there, OK.

            1. x 7

              Re: 'Faulty Router'

              "So thank BT for choosing a decent vendor's routers back then"

              total bollox

              Before the HomeHub 1, BT used a mix of routers and ADSL modems, mainly from Alcatel-Lucent (or whatever they were called that week) There was the green frog modem, replaced by the purple slug, which both had the same guts as the Alcatel 105

              Total crap.

              1. Anonymous Coward
                Anonymous Coward

                Re: 'Faulty Router'

                The frog was crap, I'll give you that, but what else was there in its day?

                The best SoHo ADSL modem I've ever had was the BT-badged Voyager 2100 and its close relative the V2110. Good at doing what it needs to do, decent user interface, readily usable user-friendly diagnostics and reasonably detailed troubleshooting info for when something breaks, *and* manageable by SNMP, including the DSL Line Statistics MIB. Hadn't seen anything close to it since the D-Link DSL604+. The Voyager 2100 was a GPL-based router (though BT forgot the implications of that detail initially) closely related to another vendor's product, probably a 3Com in a different box, though I forget.

                1. x 7

                  Re: 'Faulty Router'

                  I think the Voyager routers were a french Thomson-branded design, which was an earlier iteration of Alcatel. That unit has had so many names due to the various takeovers and selloffs from Alstom/Alsthom/Alsace-Thomson-Heuston/Roneo/Alcatel/Lucent.............

                  as regards ADSL modems - far superior to the frog and slug was the Fujitsu FX-310, a wonderful little device which had the benefit of when plugged into a phoneline it rebooted stale ADSL sessions automatically. I used to carry one just to clear bad sessions. Not something you hear of nowadays

                  1. Anonymous Coward
                    Anonymous Coward

                    Re: 'Faulty Router'

                    "I think the Voyager routers were a french Thomson-branded design, which was an earlier iteration of Alcatel."

                    Some of them might well have been. Back in the day, I was far more familiar than a user should be with the V21x0 and its internals, and it wasn't French.

                    "I used to carry [Fujitsu] just to clear bad sessions. Not something you hear of nowadays"

                    That brings back memories. Not necessarily good ones either. Thank you (?).

    2. JamesPond

      Re: 'Faulty Router'

      One faulty router brings the network down? Seems like a poorly designed and/or configured and/or monitored system to allow that to happen and to affect disparate parts of the UK for so long. However one faulty GPS satellite seems to have stopped DAB radio from being accessible to parts of the UK for several days, so maybe it's possible.

      1. kmac499

        Re: 'Faulty Router'

        Must have been fairly close to the center of the network, maybe next to a DNS server??

      2. Voland's right hand Silver badge

        Re: 'Faulty Router'

        One faulty router brings the network down?

        If it starts announcing gibberish instead of what it is supposed to announce as routing updates - why not.

        There is bugger all protection at the routing protocol level for most internal gateway protocols. OSPF has none by design (no, do not talk to me about the admin weight hack in Cisco IOS; the person who invented that should be shot), ISIS is not much better, iBGP is usually unfiltered as well and let's not even talk about various protos that used to be popular with BT CTO like PBB and PBB-TE.

        The solution in this case is to have good "view"/"analysis" of the routing protocol state and KILL the router from the wall switch right away to localize the failure. I am not going to comment on BT and either one of these.
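A rough illustration of the "view/analysis" idea above: compare what each router announces now against a known-good snapshot and flag any router whose announcements have churned far more than expected. This is only a sketch under stated assumptions. The function name, the snapshot format and the 50% churn threshold are all hypothetical, and a real network would do this with protocol-aware tooling rather than plain sets of prefix strings.

```python
from typing import Dict, List, Set


def suspicious_routers(
    baseline: Dict[str, Set[str]],
    current: Dict[str, Set[str]],
    max_churn: float = 0.5,  # hypothetical threshold, tune per network
) -> List[str]:
    """Return routers whose announced prefixes changed by more than max_churn.

    Churn is the number of prefixes added or withdrawn, divided by the
    number of prefixes in the baseline snapshot for that router.
    """
    flagged = []
    for router, announced_now in current.items():
        announced_before = baseline.get(router, set())
        # Symmetric difference: prefixes that appeared or disappeared.
        changed = len(announced_before ^ announced_now)
        total = max(len(announced_before), 1)
        if changed / total > max_churn:
            flagged.append(router)
    return sorted(flagged)


# Example snapshots using documentation prefixes (made-up data).
baseline = {
    "core-a": {"192.0.2.0/24", "198.51.100.0/24"},
    "core-b": {"203.0.113.0/24"},
}
current = {
    "core-a": {"192.0.2.0/24", "0.0.0.0/0", "203.0.113.0/24"},
    "core-b": {"203.0.113.0/24"},
}
flagged = suspicious_routers(baseline, current)
print(flagged)
```

A router that is flagged this way is a candidate for being isolated quickly, which is the "kill it from the wall switch" step described above.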

        1. Alistair
          Windows

          Re: 'Faulty Router'

          @ VRH

          or: someone left the spanning tree turned on. What's border protection do again?

        2. pompurin

          Re: 'Faulty Router'

          > If it starts announcing gibberish instead of what it is supposed to announce as routing updates - why not.

          Thank you for being the sensible one here. I was expecting the standard "BT are shit" comments. I've had bad experiences in former houses but the last two I've lived in Cheshire have had no problems for 6+ years. I'll give BT credit where it's due.

          Would you not expect a company of BTs size to have multiple CCIE types on their books, with an incredibly high spec network that is well designed to cope with the network traffic of the UK? Unfortunately all it takes is bit rot somewhere down the line, and you're sending out spurious data. The above poster is spot on.

          1. Anonymous Coward
            Anonymous Coward

            Re: network ... designed to cope with the network traffic of the UK

            "Would you not expect a company of BTs size to have multiple CCIE types on their books, with an incredibly high spec network that is well designed to cope with the network traffic of the UK?"

            No I wouldn't, based on the last few years, though obviously it might be a laudable goal. Nor would anyone else who has followed various BT network shenanigans e.g. those involving congestion in BTwholesale's backbone network, congestion which was repeatedly denied in public by the then "BT Chief Network Architect" around Feb 2014, and various similar foul ups. His name is public domain but out of misguided courtesy I won't post it here.

            http://www.revk.uk/2014/02/stats-arent-facts.html

        3. E 2

          @Voland's right hand Re: 'Faulty Router'

          If you let ISIS handle your routing then you are asking for trouble!

        4. Anonymous Coward
          Anonymous Coward

          Re: 'Faulty Router'

          Router Status: Grand Old Duke of York.

      3. This post has been deleted by its author

        1. Anonymous Coward
          Anonymous Coward

          Re: 'Faulty Router'

          > That explains the DAB problems I was getting at home and in the car last week

          The BT router problems were yesterday. The DAB problems last week were due to a GPS fault. Have you replied to the wrong story?

      4. doofus

        Re: 'Faulty Router'

        Quite believable - quite often when there is a considerable failure they then realise that the redundancy path is configured incorrectly, doesn't work and probably never would have.

    3. TheVogon

      Re: 'Faulty Router'

      "Did they try turning it off and on again?"

      More importantly did they try working out why failover to the backup router didn't work?! You do have a resilient design BT for half the country's internet?!

      1. Flywheel

        Re: 'Faulty Router'

        "did they try working out why failover to the backup router didn't work"

        Backup router? but that would double the cost surely, and how often would we need it, really?

        1. Anonymous Coward
          Anonymous Coward

          Re: 'Faulty Router'

          "Backup router? but that would double the cost surely, and how often would we need it, really?"

          14 or so days out of every 28, because you deliberately fail it over every few days, just to test the failover procedures still work as intended.

          Though as has been noted elsewhere, there can be a world of difference between total failure of a network component, and same component going insane in a way which simple automated procedures don't quickly detect. Even so...

    4. Anonymous Coward
      Joke

      Re: 'Faulty Router'

      The router's name was Steve. He is now getting a very big telling off.

  2. Anonymous Coward
    Anonymous Coward

    Redundancy?

    No backup equipment?

    1. Big_Ted
      Devil

      Re: Redundancy?

      No need for it, it's most likely Cisco kit and the Chinese will have used the backdoor to enter it and fix it for them by rolling back to the version where they get copies of all the data going through.

      You would expect that the NSA would have informed them they were going to do some software mods though......

      1. Anonymous Coward
        Anonymous Coward

        Re: Redundancy?

        I believe BT are using Huawei for the 21CN kit. And they are not exactly the sort of router you'd have at home.

        Something like this I would guess....but those with more knowledge of BT backbone may know better.

        http://e.huawei.com/uk/products/enterprise-networking/routers/ne/ne5000e

        1. Cynical Observer
          Trollface

          Re: Redundancy?

          @ Lost all faith...

          Taken from that link ....

          with carrier-grade 99.999% uptime performance.

          Well that's going to need some rewriting on the website or it had better be rock solid and stay up for the next 24 years (Estimating a two hour outage yesterday.)
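For anyone wanting to check the arithmetic behind that estimate, here is a minimal sketch. It assumes a 365.25-day year and takes the two-hour outage figure from the comment above; the function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumes an average year length


def allowed_downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget per year for a given availability (0.99999 = five nines)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def years_to_absorb(outage_minutes: float, availability: float) -> float:
    """Years of otherwise perfect uptime needed to stay within the budget."""
    return outage_minutes / allowed_downtime_minutes_per_year(availability)


budget = allowed_downtime_minutes_per_year(0.99999)
years = years_to_absorb(120, 0.99999)
print(f"{budget:.2f} minutes/year allowed, {years:.1f} years to absorb 2 hours")
```

Five nines allows a little over five minutes of downtime a year, so a two-hour outage works out at roughly 23 years of budget, in the same ballpark as the figure quoted above.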

          1. Primus Secundus Tertius

            Re: Redundancy?

            Virgin Internet struggle to hit two-nines reliability, let alone five-nines.

            1. jeremyjh

              Re: Redundancy?

              Mine didn't hit one nine. After two outages lasting entire bank holiday weekends and killing all cable services, I gave up and switched everything.

              They may yet have the last laugh though. On an exchange-only line. No FTTC available.

          2. Tom 38
            Headmaster

            Re: Redundancy?

            Well that's going to need some rewriting on the website or it had better be rock solid and stay up for the next 24 years (Estimating a two hour outage yesterday.)

            The router can be up and misconfigured.

          3. Disgruntled of TW
            Joke

            Re: Redundancy?

            @Cynical Observer .... 99.9999% uptime as measured over the last 300 seconds. They made 100%, as no outages in the last 300 seconds.

            Spin, without the detail. Lies, damn lies ...

        2. Anonymous Coward
          Anonymous Coward

          Re: Redundancy?

          Probably more like this

          https://www.dnorth.net/wp-content/uploads/2014/10/IMAG0236.jpg

          or this

          http://rufee.eu/servurz/4.JPG

        3. Anonymous Coward
          Anonymous Coward

          Re: Redundancy?

          The true IP/MPLS backbone for 21CN is now Alcatel-Lucent (Nokia now) XRS (7950) and 7750 SR12e. Although DC edges and various other PEs are no doubt still Huawei and Cisco. I expect the cutover from Cisco to Nokia backbone is still happening.

    2. MyffyW Silver badge

      Re: Redundancy?

      Probably, for some poor sod that's exactly what they'll get - redundancy.

  3. TonyJ

    One f*****g router?

    Dear BT... if you have single points of failure that can bring down hundreds of thousands of your customers then you need to fire the useless gimps who designed it, and employ someone who actually understands the concepts of resilience, high availability and possibly even DR.

    Understandable that you had your support call centres flooded with calls but over the space of an hour I either got cut off immediately or an engaged tone. On the handful of occasions I did get in a queue you disconnected me to silence after 3m 30s. You are BT ffs! If you can't handle call volume correctly there's no hope for anyone.

    Eventually you had a recorded message. Of course, by then world+dog had already worked out there was a major outage.

    Single router ffs.

    As I said elsewhere, myself and most people will accept the odd problem and this is only the second in 3 years. I am pissed off though that it was nigh on impossible to get information from the horse's mouth. Bad, BT, bad.

    1. Volvic

      Re: One f*****g router?

      You need to read up on Spanning Tree. You can have a single faulty core router cause massive problems across your entire network regardless of the "concepts of resilience, high availability and possibly even DR" which you heard about on your last ITIL course.

      1. TonyJ

        Re: One f*****g router?

        "...

        You need to read up on Spanning Tree..."

        No I don't. I have networking specialists to do that side of it for me. Failing that there's always someone on el reg happy to prove they know more than someone else and to do it in the most condescending manner possible. You know, like you just did.

        Delegation. Didn't learn that on any course, ITIL or otherwise.

        1. Volvic

          Re: One f*****g router?

          Ah, I see, so you're happy to just mouth off about completely irrelevant ITSM disciplines when an incident that you don't understand occurs. You've got a bright managerial future ahead of you in that case!

          1. TonyJ

            Re: One f*****g router?

            ITIL...ITSM... all nicely wrapped up in arrogance... I smell a "consultant"

            Oooh I know. .. I'll see yours and raise you a TOGAF!

            People so needlessly aggressive as you are being and trying to prove endlessly they know more than random strangers online tend to be hiding behind the buzzwords and bravado.

            Ladies and gentlemen we seem to have found Dick from the internet (see various Dilbert sketches). That or... do you by any chance work for Capita???? I suspect so given your complete lack of a sense of irony or humour.

        2. Anonymous Coward
          Anonymous Coward

          Re: One f*****g router?

          Spanning tree is for switching (layer 2). Routers are for routing (layer 3) and don't use spanning tree in a network core since all their connected networks are usually point to point (no loops). They do use it at a layer 2 edge though yes, but a problem there would only affect that layer 2 domain.

    2. E 2

      Re: One f*****g router?

      I could be wrong, but if the offending router just started advertising bad routes or forwarding to bad destinations it could break stuff while not actually experiencing a hardware fail.

  4. msknight

    Ohh....

    ... just tell them we had a router failure. It took us a while to try a new filter but that didn't fix it so we had to unplug all the phones and disconnect the extension cables and plug the router directly in to the wall port, you know, so that the only equipment on the line was the router itself, but don't tell them that while we were keeping everyone looking at the right hand, they didn't see the left hand plug in the network cable that it accidentally unplugged at the exchange.

    And if they do find out, just blame an unknown third party because we've been forced to open up our premises to the competition.

    1. Disgruntled of TW
      Joke

      Re: Ohh....

      @msknight and ummm ... did they try:

      "Remember to reinstall Windows as we can't help you diagnose why your internet isn't working until you've done that. My call plan says you have to. You might be using your computer for things we're not aware of, so you have to reinstall it. Really you do."

  5. wolfetone Silver badge

    Faulty BT Router?

    They should replace them with a Billion router, like I did.

    Be smart BT, like me.

  6. Halfmad

    Yeah.. don't believe it.

    Such a bad excuse, might pacify home users but come on folks we all know this is more likely someone making a cock up of a change somewhere or during routine maintenance/testing (assuming it happens).

    1. Amorous Cowherder
      Facepalm

      Re: Yeah.. don't believe it.

      Exactly what I was thinking..

      a) a bad firmware update through a few core devices

      b) accidental flip from prod to test setup

      c) accidental flip from prod to poorly tested DR setup

  7. sysconfig

    Don't believe it

    About 2 hours after the outage began, @BTCare posted that "Engineers are now on site" (God knows where they had been, but that's another issue), and less than an hour later my Infinity connection is working like a charm again (so were those of other users).

    So they determined that the router is faulty and repaired/replaced it within an hour? But it takes them two hours to even get on site? A company of that size doing manual fail-over? Hard to believe.

    My money is on human error to do with the purchase of EE (and network changes related to that). But admitting failure is of course becoming increasingly politically incorrect. Attributing failure to others (kit and people) is the way to go.

    1. TheFifth

      Re: Don't believe it

      You were one of the lucky ones. Mine was off (in Plymouth) until at least 19:40. Not sure exactly what time it was back up as that's the time I gave up trying to work and went to the pub.

      It did briefly crawl back to life around 16:00, albeit very slow, but after 10 minutes or so it was gone again. Also, when it was back up I could only view certain websites (BBC, The Register), but others like GitHub were still not reachable.

      So the BT PR statement saying everyone was back up within 2 hours is a big fat lie. I agree that there seems to be more here than meets the eye.

    2. breakfast Silver badge

      Re: Don't believe it

      Two hours is extraordinarily fast by the standards of BT engineers turning up to sort out a router problem. They normally need at least two weeks.

    3. Disgruntled of TW

      Re: Don't believe it

      Someone had to press "F1"

  8. Camilla Smythe

    Wuh!?

    It does seem ever so slightly implausible that a 'single router' would be able to inflict such carnage across the entire network unless the entire network suddenly, for some reason, became dependent on that 'single router' perhaps by having all of its traffic sent through that 'single router'... Is a 'single router' now the acceptable technical term for a chuffing great big rack of DPI kit bought by GCHQ from Huawei and installed in the network?

    1. leexgx

      Re: Wuh!?

      i would of thought this was the Login server that crashed (as routers that had not disconnected stayed working or the one at work never went off) they really need to make the authentication server redundant at least this time it was only 2 hours not 12

      1. Anonymous Coward
        Anonymous Coward

        Re: Wuh!?

        @ leexgx "I would HAVE thought this..." or, if you are feeling particularly slovenly: "I WOULD'VE thought this....."

  9. s. pam Silver badge
    Coffee/keyboard

    Horsefeathers -- was it a DDOS or a failed upgrade?

    To air is human but to really fuck things up is a oft overused phrase.

    You don't take down a couple hundred thousand users without either being hit by a DDOS or an upgrade gone wrong. No one could be so cheap as to have a SPOF or could they?

    Perhaps, BT couldn't get through to the scummy helpLESS lines in OpenReach to get help? That'd be fitting karma!

    1. Alister

      Re: Horsefeathers -- was it a DDOS or a failed upgrade?

      It wasn't a DDOS.

      Not unless a DDOS can suddenly cause the BT network to be handing out non-routable addresses to customer equipment, which is what I saw happening last night to our business ADSL links.

      Our router/modems were being assigned addresses in the 172.16.0.0/12 subnet on their WAN interfaces, for a few hours, then suddenly they were assigned proper BT external addresses, and away we went.
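A quick way to spot this symptom on your own kit is to test whether the WAN address falls inside the RFC 1918 block 172.16.0.0/12. The sketch below uses Python's standard `ipaddress` module; the function name and the sample addresses are made up for illustration.

```python
import ipaddress

# RFC 1918 private block mentioned in the comment above.
PRIVATE_172 = ipaddress.ip_network("172.16.0.0/12")


def looks_like_private_wan(wan_ip: str) -> bool:
    """True if the WAN address is inside 172.16.0.0/12 (not publicly routable)."""
    return ipaddress.ip_address(wan_ip) in PRIVATE_172


for addr in ("172.16.4.9", "172.31.255.255", "172.32.0.1"):
    print(addr, looks_like_private_wan(addr))
```

Note that the /12 covers 172.16.0.0 to 172.31.255.255 only, so 172.32.0.1 is outside it.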

      1. Flere-Imsaho
        WTF?

        Re: Horsefeathers -- was it a DDOS or a failed upgrade?

        <Our router/modems were being assigned addresses in the 172.16.0.0/12 subnet on their WAN interfaces>

        Sounds to me like a rogue DHCP server within their network.

      2. Phil O'Sophical Silver badge

        Re: Horsefeathers -- was it a DDOS or a failed upgrade?

        Our router/modems were being assigned addresses in the 172.16.0.0/12 subnet on their WAN interfaces, for a few hours, then suddenly they were assigned proper BT external addresses, and away we went.

        Somebody activated the CG NAT too soon?

      3. Chloe Cresswell Silver badge

        Re: Horsefeathers -- was it a DDOS or a failed upgrade?

        Same here, which is normally an indication that the local control node can't talk to the ISP auth systems.

        Normally it's ESR(number).sheffield returning IPs instead of BT broadband's systems.

        I used to get it with demon as well, when sheffield lost contact with demon/anchorage house, and again, I'd get a PPP login with a 172.16 address.

        Had 5 clients down and 2 not down but slow. Which looks again like a reauth happened on the 5, and nothing was responding, hence they went down.

      4. Andrew Jones 2

        Re: Horsefeathers -- was it a DDOS or a failed upgrade?

        That was only for the people who could get connected in the first place though - the rest of us couldn't even make it past CHAP authentication - so yes - a DDOS against the authentication server is certainly a possibility.

      5. Vince

        Re: Horsefeathers -- was it a DDOS or a failed upgrade?

        Alister, those 172.16.0.0/12 addresses were not due to the faulty router handing out non-routable addresses, but are what happens if BT Wholesale can't reach your ISP's authentication platform (RADIUS etc), or if there's too much load.

        Rather than not authenticate you, so your router tries immediately again, they effectively throttle by accepting the login (regardless), issue a 172 address, and you can get to a BT Wholesale page that basically says something like this if you accept the DHCPd addresses and DNS. It disconnects you a bit later so you can try again.

        This essentially helps them manage the surge in login demands, but is nothing to do with the original fault.
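The behaviour described above can be modelled as a small decision function. This is purely an illustrative reading of the comment, not BT Wholesale's actual logic: the names, the "public" pool label and the 300-second holding timeout are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

HOLDING_POOL = "172.16.0.0/12"   # private range mentioned in the thread
HOLDING_TIMEOUT_S = 300          # hypothetical value, not from the source


@dataclass
class SessionDecision:
    accepted: bool
    address_pool: str
    disconnect_after_s: Optional[int]


def handle_login(
    isp_auth_reachable: bool,
    credentials_ok: bool,
    overloaded: bool = False,
) -> SessionDecision:
    """Decide what to do with an incoming PPP login attempt."""
    if not isp_auth_reachable or overloaded:
        # Accept regardless of credentials, park the session on a holding
        # address, and drop it later so the router retries after a delay
        # instead of hammering the authentication platform.
        return SessionDecision(True, HOLDING_POOL, HOLDING_TIMEOUT_S)
    if credentials_ok:
        return SessionDecision(True, "public", None)
    return SessionDecision(False, "none", None)


print(handle_login(isp_auth_reachable=False, credentials_ok=True))
```

The point of the design, as described, is that a rejected login is retried immediately while an accepted one is not, so parking sessions spreads out the retry surge.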

        1. Alister

          Re: Horsefeathers -- was it a DDOS or a failed upgrade?

          @Vince.

          Ah right, thanks very much for the info, I didn't know that that was how it worked.

      6. Matt Bradley

        Re: Horsefeathers -- was it a DDOS or a failed upgrade?

        Ooh. Interesting. I did not notice that. Good spot.

    2. Yugguy
      Headmaster

      Re: Horsefeathers -- was it a DDOS or a failed upgrade?

      I'm sorry I can't help myself.

      It is "To err is human"

      I gave you an upvote to make up for my pedantic correction.

      1. Anonymous Coward
        Anonymous Coward

        Re: Horsefeathers -- was it a DDOS or a failed upgrade?

        Here's an upvote for YOU, because if you hadn't I would HAVE..

    3. x 7

      Re: Horsefeathers -- was it a DDOS or a failed upgrade?

      "To air is human"

      Look, I know all humans pass wind but there's no need to remind us

      Or did you just err as you typed?

  10. payne747

    Come on people

    As those engineers who have been around long enough know, no amount of backup or redundancy can save you from that one massive core router that refuses to die in all the traditional senses but nevertheless still fails to do its job. That's how one router causes an outage.

    Consider a device that's pingable, responds to SNMP and can even accept traffic but despite being online in every monitored way, still fails to route, drops 85% of its packets and kicks you out of the console every 10 seconds - those, my friends, are the ones that causes f**k-ups like this.

    But knowing BT - what they actually mean is a cluster of primary and standby routers went "faulty" because they pushed out a duff config to it and took 5 hours to find out who did what where.
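The "online in every monitored way but not doing its job" case can be made concrete with a health check that looks at forwarding loss as well as liveness. This is a minimal sketch: the field names and the 5% loss threshold are assumptions, and in practice the delivered-packet count would have to come from test traffic sent through the device rather than to it.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    ping_ok: bool           # device answers ICMP
    snmp_ok: bool           # device answers SNMP polls
    packets_sent: int       # test packets sent *through* the device
    packets_delivered: int  # test packets that reached the far side


def classify(probe: ProbeResult, max_loss: float = 0.05) -> str:
    """Classify a device as down, unknown, degraded or healthy."""
    if not (probe.ping_ok and probe.snmp_ok):
        return "down"  # the easy case: liveness checks fail outright
    if probe.packets_sent == 0:
        return "unknown"  # alive, but nothing tells us whether it forwards
    loss = 1 - probe.packets_delivered / probe.packets_sent
    return "degraded" if loss > max_loss else "healthy"


# The device from the comment: pingable, answers SNMP, drops 85% of packets.
print(classify(ProbeResult(True, True, 100, 15)))
```

A monitor that only checks ping and SNMP would report this device as fine, which is why failover never triggers for it.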

    1. sysconfig

      Re: Come on people

      Consider a device that's pingable, responds to SNMP and can even accept traffic but despite being online in every monitored way, still fails to route, drops 85% of its packets and kicks you out of the console every 10 seconds - those, my friends, are the ones that causes f**k-ups like this.

      Valid point, but it doesn't take so many hours to resolve during daytime! In a scenario like that you'd probably send over the person who can be there quickest (often some junior techie) and instruct them to pull cables of certain colours or with certain labels. Connectivity is already severely affected, there's little risk to make things worse.

      They'd shit themselves, but do it, and bang you're back online, because your failover kit is configured properly and takes over in a snap. When reinforcements arrive, they can work out and resolve the issue.

      More likely that they did a not-rolling update, indeed. Always a good idea to apply changes to all parts of a redundant infrastructure simultaneously...not.

      Or maybe, BT are readers of El Reg's new DevOps column and embrace the concept of automated (and unattended) deployments, and especially failure being a good thing!

      1. The First Dave

        Re: Come on people

        From personal observation, it was about 1h 45mins +/- five minutes, so not exactly 'many hours'.

        1. TheFifth

          Re: Come on people

          From my observation it was 5.5 hours at least (I gave up waiting and went to the pub at this point). The difference in recovery time around the UK possibly points to there being something wrong other than a single router dying?

      2. Anonymous Coward
        Anonymous Coward

        Re: Come on people

        Agreed, in many situations where a failover has failed, it doesn't need to be someone trained to physically solve it, just someone who can follow instructions.

        One of my customer's switches didn't come back from a power outage in a good state, and spanning tree went nuts taking their entire network offline and causing chaos. Problem resolved remotely: "see the cable between switch A port X, and switch B port X - unplug it" - instruction followed by minion on the ground, network instantly back online, to the extent the dodgy switch could be remotely diagnosed, reconfigured not to use the port that had gone bad, and redundant link re-plugged while we waited for a replacement part to turn up on site...

        Too many people assume that failures will manifest themselves as complete absence of a piece of kit, not that a piece of kit will merrily start fucking up everything around it. Designing redundant systems that cater for the first kind of failure is easy; designing redundant systems that cater for the second kind of failure is much harder and orders of magnitude more expensive...

  11. Anonymous Coward
    Anonymous Coward

    do i smell a spanning tree loop...

    Router error sounds a bit like a generic "IT" problem for a network organisation.

    Similar to "its a batch scheduling problem" when there are no issues with the scheduler but the CR*P scripts being scheduled.

    Probably translated by a press officer that has only the vaguest notion of networks' existence, let alone technical issues.

    I smell a fundamental configuration error rather than a router problem, spanning tree, BGP, backup route switched off or under specified or similar. Magnify that by poor monitoring and you get a nice combination for management flappage (don't change anything / blamestorming) and a nice long outage.

    1. Chika
      FAIL

      Re: do i smell a spanning tree loop...

      Router error sounds a bit like a generic "IT" problem for a network organisation.

      I wondered if someone might say that.

      Even if the "router" in question was faulty, why did it take so long to resolve? No failover? Configuration balls up? No. Either this was something a lot bigger than the sum of one router (or indeed smaller so that they couldn't see what it was) or they are covering their arses.

      Given the money and reputation (if any) that this has likely cost BT, I'd expect at least one head to roll.

      1. Roland6 Silver badge

        Re: do i smell a spanning tree loop...

        Would agree the problem was probably more to do with spanning tree and table contents rather than hardware. That would help to explain why it took time to identify and fix the 'one' misbehaving router. Then flush and rebuild the tables...

  12. Nigel 11

    Rural broadband

    There's only one answer for rural broadband, and it's independent of the future of BT Openreach.

    Some time during the last century a telephone line went from being a luxury to an essential, and the Post Office (as it then was) was placed under a universal service obligation. Which inevitably made telephone lines slightly more expensive for everyone in towns and cities.

    Around now, a broadband service of at least 4Mbps (I'd say 8Mbps) has gone from being a luxury to an essential, and OpenReach needs to be placed under a universal service obligation. And yes, it means that everyone's fixed line rental will have to go up a bit.

    Until they are under a USO it is simple economics that they will concentrate on the 90% of the profit that comes most easily (ie folks in towns and cities) and pay lip-service only to providing folks living a long way away from an exchange with anything but the least good service that they can get away with.

    1. David Roberts

      Re: Rural broadband

      Not my rental, kid. I'm on Virgin cable.

      Oh, hang on, the rental goes up anyway.....

    2. Anonymous Coward
      Anonymous Coward

      Re: Rural broadband

      You know the government are proposing a USO of 10Mbps? However, today's outage isn't linked to this.

      An independent Openreach wouldn't affect an USO either.

      BT are responsible for the current USO but one of the issues with a new broadband USO is that Openreach now face infrastructure competition in over 50% of their market - not just from Virgin but also a lot of other providers - so it still has to be decided who will fund the new USO and implement it.

      The other operators have already said they would be very unhappy/object if only Openreach is given funding/responsibility for the USO as it will help OR compete more effectively against them....

    3. SImon Hobson Bronze badge
      Facepalm

      Re: Rural broadband

      > ... was placed under a universal service obligation.

      Never looked at the terms of that USO have you ?

      Yes, you can call up BT and have a phone line installed just about anywhere, but that doesn't mean you can have it installed in the middle of nowhere for the same £99 (or whatever it is these days) fee that someone next to the green cabinet pays. If you're in the middle of nowhere, they'll quote you for excess construction charges - £xk/100m for trenching, £/pole for overhead - and it'll mount up very quickly. In fact, it'll mount up so very quickly that few can afford to take up that universal service.

      The same already applies more or less to broadband. Sure there's no USO, but if you are prepared to pay for it, there are some options available - they just aren't within the budget of a typical household !

      1. Anonymous Coward
        Anonymous Coward

        Re: Rural broadband

        If you're in the middle of nowhere, they'll quote you for excess construction charges

        Yep, used to be that the first 5 new poles were free, after that you'd get a bill for the rest.

  13. pewpie

    Fawlty router.. Sounds like Basil cooked up that story himself..I hope they gave it a damn good thrashing.

    1. x 7

      fawlty router?

      was the techie named Manuel?

      1. Phil O'Sophical Silver badge
        Coat

        Re: fawlty router?

        was the techie named Manuel?

        RTFM ?

  14. fnusnu

    Presumably this was Whoarewe testing their kill switch?

  15. Orwell44

    Splicing the main cable for an intercept?

    1. Sir Runcible Spoon
      Pirate

      Get yerself into that nest laddie...yarrr!

  16. ukgnome

    Faulty Router?

    Faulty excuse - like everyone has said, when you have a SPOF you have a problem. As for openreach, it's not totally their fault. Take my sub-exchange for instance, it serves 500 houses. That's not many, but most of them are close enough to connect directly to it. There isn't space for the roadside furniture to sit. So what often happens is that openreach cable up the exchange and then find themselves in the predicament of not having a cabinet to fibre up. It would be prohibitive to fibre to the premises so they need another solution.

    1. Anonymous Coward
      Anonymous Coward

      Re: Faulty Router?

      Round here, they have taken to putting cabinets in the exchange car park to solve this particular issue. Of course, not all exchanges have sufficient space available outside. Who knows what happens then - I suppose you are SOL until they figure it out.

  17. mrmond

    Didn't affect me. Not had any problems with BT so far in 4 months. Router's basic but fairly reliable.

    Sometimes the network goes down. Any network. I don't care who you're with. Sky, Virgin, Talk Talk. They all have outages and I've had them all fail over the years. Do something else for a few hours till it comes back online. The world didn't end. Yet.

  18. Anonymous Coward
    Anonymous Coward

    Metro Node - possibly a part of the issue

    I am not a 21CN techie but this complaint seems to relate to a real concern in 2009 in using a single router from the backhaul system.

    https://groups.google.com/forum/#!topic/uk.net.providers.aaisp/ctvXu8C08wc

    Any real 21CN'ers going to comment? :)

  19. Anonymous Coward
    Anonymous Coward

    Faulty router? I'm not so sure I reckon it was probably some little old cleaner plugging a henry in to clean the room.

    Sounds a bit more plausible.

  20. Ryan Kendall

    As a BT user

    I find this funny living out in the middle of nowhere, the downtime didn't affect me.

  21. Anonymous Coward
    Anonymous Coward

    Metro Node - possibly a part of the issue

    It would seem that this has been a concern since 2009 in the 21CN network as this transcript alludes to :

    Posted at 2009-09-08 15:30 BST by AAISP

    Update #1: 2009-09-08 15:44 BST

    We were shocked to learn that BT had a "single box design" for WMBC links to us, where by they have, in each metro node, only one (albeit very large) router. The consequence is that planned maintenance causes 3 to 4 hour outages for customers even if they have multiple (BT) lines. It also makes them vulnerable to vendor specific bugs and issues.

    This sounds very familiar - any real 21CN techies like to comment?

    1. Anonymous Coward
      Anonymous Coward

      Re: Metro Node - possibly a part of the issue

      Certainly there should be a failover device, however for maintenance the routers I use have two routing engines, one is active the other is standby. To do maintenance/upgrades on them you simply failover to the other. I guess I haven't worked with enough different types of equipment to know if this is common; however, maintenance and upgrades should take place during the slowest periods and customers should be made aware long enough before the maintenance that they can plan for potential issues. Like not exchanging database info with head office and such.

    2. Anonymous Coward
      Anonymous Coward

      Re: Metro Node - possibly a part of the issue

      "This sounds very familiar - any real 21CN techies like to comment?"

      Adrian Kennard at AAISP covers topics like this in his personal blog at www.revk.uk

      He's already covered the "single point of failure" design some time ago (more recently than 2009)

      I can't find the particular post right now, which is a shame, but this is worth reading:

      http://www.revk.uk/2014/02/bt-21cn-not-fit-for-purpose.html

  22. Wade Burchette

    Reason #1,063 as to why the "cloud" and SaaS is a bad idea

    Outages can and will happen. If all the important work you need to do is in the cloud and there is an outage, you are screwed.

  23. Derichleau

    Redundancy much?

    Good job BT don't make planes.

  24. Tim Brown 1
    Mushroom

    Twenty years from now...

    a former BT engineer may post the real story in "On Call"!

  25. Anonymous Coward
    Anonymous Coward

    sounds like

    Someone forgot to clear the logs at GCHQ and they ran out of disk space.

  26. scrubber
    Big Brother

    Anyone use a VPN?

    When the network stopped working it initially failed to serve US pages, then UK. At which point I switched on my VPN, routed to Holland and it was all absolutely fine.

    From this limited data I find it unlikely that a single router would cause this behaviour.

    Big Brother because now they know I have VPN they'll be watching me...

    1. Andrew Jones 2

      Re: Anyone use a VPN?

      It seems that the authentication server, DNS server and DHCP server all fell over - possibly caused by problems with whatever core router it was that failed. Anyway, provided your connection stayed up, the lack of the ISP DNS server would result in you only being able to load webpages for which your computer or home router already had cached DNS results. Many users reported that changing their DNS settings to Google or OpenDNS fixed their browsing problems - which it would, because their only browsing problem was that they could not reach the DNS server to translate domains into IP addresses. Using a VPN would result in pretty much the same behaviour as simply pointing your computer at a different (working) DNS server - except of course that all your traffic would be routed down the VPN pipe too. None of these things, however, made the slightest bit of difference to people who could not even get past the authentication stage or to the people who were being handed non-routable IP addresses.
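
      The cached-versus-fresh lookup behaviour can be sketched in a few lines. This is a toy model only, assuming the ISP resolver is simply unreachable and a public one still answers. The resolver functions, hostnames and addresses are all invented (the addresses come from the 192.0.2.0/24 documentation range) and nothing here touches a real network.

```python
from typing import Callable, Dict, List, Optional


class ResolverUnreachable(Exception):
    """Raised when a (simulated) DNS server does not answer."""


Resolver = Callable[[str], Optional[str]]


def isp_resolver(name: str) -> Optional[str]:
    # Simulates the ISP's DNS server being down during the outage.
    raise ResolverUnreachable("ISP DNS server not answering")


def public_resolver(name: str) -> Optional[str]:
    # Simulates a working third-party resolver with one known record.
    records = {"fresh.example": "192.0.2.20"}
    return records.get(name)


def lookup(name: str, cache: Dict[str, str], resolvers: List[Resolver]) -> Optional[str]:
    """Answer from the local cache first, then try each resolver in order."""
    if name in cache:
        return cache[name]
    for resolve in resolvers:
        try:
            address = resolve(name)
        except ResolverUnreachable:
            continue  # this server is down, try the next one
        if address is not None:
            cache[name] = address
            return address
    return None


cache = {"cached.example": "192.0.2.10"}

# With only the broken ISP resolver, cached names still work, new ones fail.
print(lookup("cached.example", cache, [isp_resolver]))
print(lookup("fresh.example", cache, [isp_resolver]))

# Adding a working resolver to the list fixes the new lookups.
print(lookup("fresh.example", cache, [isp_resolver, public_resolver]))
```

      As in the comment, none of this helps if the line never authenticates or the address handed out is not routable, because the lookup never gets as far as a working resolver.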

      1. scrubber

        Re: Anyone use a VPN?

        "It seems that the authentication server, DNS server and DHCP server all fell over "

        No, it was just a single router - the PR dept. said so so it must be true.

      2. carl0s

        Re: Anyone use a VPN?

        BTnet leased lines (fibre to the prem) at two of my sites in South Manchester both lost connectivity to various destinations, while other destinations were fine.

        DNS lookups were OK, using BT's resolvers as it happens, but there was no working route to the problematic destinations.

        We had people in remote locations who lost access to our stuff as well. On-prem mail servers not receiving mail from a majority of sources, or sending to, etc.

        Nightmare. Thankfully the SIP provider was still reachable, else I'd have been having a total meltdown :D

        1. Anonymous Coward
          Anonymous Coward

          Re: Anyone use a VPN?

          "BTnet leased lines (fibre to the prem) at two of my sites in South Manchester both lost connectivity to various destinations, while other destinations were fine.

          DNS lookups were OK, using BT's resolvers as it happens, but there was no working route to the problematic destinations."

          Thank you. That's quite possibly the single most enlightening bit of "reporting" I've seen so far:

          1) The failures were not confined to BT Retail's broadband customers

          2) The failures were not simple DNS failures

          3) The failures included loss of routing to some networks, as well as loss of servers/services

          It's almost as though somebody backhoe'd (or DevOps equivalent) one of BT's internal backbone circuits, which was carrying multiple classes of traffic on behalf of multiple classes of BT user (internal and external), and multiple classes of failover either didn't exist or didn't work as planned.

          1. scrubber

            Re: Anyone use a VPN?

            Got it!

            It was GCHQ's first large-scale attempt at geographically limiting UK traffic. I'm sure they're telling themselves it will only be used in times of national emergency...

  27. Anonymous Coward
    Anonymous Coward

    Faulty router or...

    another unauthorised remote firmware update by spooks gone awry :D

  28. Anonymous Coward
    Anonymous Coward

    So the idea of having a Free WiFi hotspot at any sized BT-Router is out the window?

  29. Outcast

    Never had a problem here in Peterborough. I was online all day & night (using the rest of last year's hols up). I have changed BT's HH to a Netgear jobbie though (cost me about £160)

  30. Anonymous Coward
    Anonymous Coward

    Took them 5 hours to fly in one of the support team from India to restart the system.

  31. Bladeforce

    They just downgraded to..

    ..Windows 10tacles and the drivers went t1ts up

  32. Kubla Cant
    Unhappy

    BT email seems to have died

    It's 17:30 on Wednesday, and the last incoming email in my BT account was received at 09:06. Perhaps nobody loves me any more. But a test mail sent from work has failed to arrive, so it looks like all is still not well with Brutish Telecom.

  33. Anonymous Coward
    Anonymous Coward

    Not just BT

    My ISP isn't BT, but they too were caught up in this, but only at some point between 18:30 and 20:30 yesterday. My ADSL was working until 18:30 when I went out. When I returned 2 hours later, the connection had dropped. Rebooting my router saw the connection to the exchange re-established, but I couldn't authenticate. It remained this way until 22:20 when it managed to. My ISP blames BT for their "network-wide outage".

  34. PNGuinn
    Boffin

    Word on the street...

    Word on the street is that BT are at least not wholly to blame for the problem.

    It appears that the problems began when a maintenance johnny at a swish new place somewhere to the east of the City was trying to fix something up and hammered a 6 inch nail into a wall severing a cable. This took down the BT Home Hub that was powering the outfit's customer support operation. When the hub crashed it somehow fed back into BT's systems and the rest is history.

    Apparently replacing the damaged cable was a simple operation but it took the outfit a while to locate the appropriate pallet of spares in the stores and unpack it. When this was finally accomplished the Home Hub happily rebooted itself and the system started to recover.

    When asked why he had used a 6 inch nail Maintenance replied "Why ever not - we bought a pallet of the buggers and we've got plenty."

    It's not officially confirmed what exactly he was trying to fix up but it's been suggested that it was some sort of new sign for their swanky new hospitality suite which had only recently been delivered.

    So now we know. Cockup not conspiracy. And apparently the sign was delivered by the front door. At ease. Remove tinfoil hats. Move along, please - nothing to see here. Normal service - next outage will be along shortly.

  35. duncandunnit

    bt have a monopoly

    bt have a monopoly and should be broken up.

  36. Anonymous Coward
    Anonymous Coward

    Five 9's Uptime

    That's odd - some of BT's literature claims five 9's uptime, a quick Google finds one example:

    "the solution is hosted in the core of BT's network with 99.999% availability"

    Or just a few minutes a year.. Seems they broke that promise for many customers yesterday alone off just one faulty piece of kit. Hopefully they'll change such documents to 99.95% now! :-D
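
    For the curious, the arithmetic behind "a few minutes a year" is easy to check. A quick sketch, assuming an average year of 365.25 days; the function name is just for illustration.

```python
# Average year length in minutes, counting leap years (an assumption).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service may be down at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.999, 99.95):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% availability allows about {minutes:.1f} minutes of downtime a year")
```

    Five nines works out at roughly 5.3 minutes a year, while 99.95% allows about 263 minutes, a little under four and a half hours.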

  37. Anonymous Coward
    Big Brother

    Aye, right

    Get real everyone, of course it wasn't a 'faulty router'. The real reason is subject to a 'D Notice' on grounds of national securidee.

    Anyone remember the Manchester tunnel fire - http://www.telegraph.co.uk/finance/yourbusiness/2882204/BT-fire-brings-chaos-to-Manchester.html

    That knocked out the bizzies' national network, but wasn't reported for obvious reasons.

  38. Ken Moorhouse Silver badge

    Faulty Router

    We'll send another one out in the post. Should arrive in a couple of days.

  39. Anonymous Coward
    Anonymous Coward

    TeraData are going to buy HortonWorks

  40. Spaceman Spiff

    Only one bad router was required to nuke their entire operation? God what ID10TS! They should immediately fire their entire network operations management team and promote some of their better and more experienced engineers. No doubt they complained to management on numerous occasions about this Achilles Heel? And were ignored, or told that fixing it would be too expensive? Where have I seen this before?

  41. Barry Page

    Apparently they did restart the router, but they didn't wait 30 seconds before turning it on again.

  42. Charles Smith

    Right tool?

    Why would a woodworker's power tool cause a major Internet failure at BT?

    1. tim 13

      Re: Right tool?

      Have you seen the damage a router can do to a router?

  43. PeteKernow

    BT Core and Metro Routers

    They use the Alcatel-Lucent 7950 XRS (Core Router) in the 21CN https://drive.google.com/file/d/0Bw8QIB3rIxgqeEFHSjMzNE5vb0k/view?usp=sharing

  44. spiny norman

    Why it took so long to fix ....

    They ordered the new router from Argos with same-day delivery.

  45. This post has been deleted by its author
