Facebook rendered spineless by buggy audit code that missed catastrophic network config error

Facebook has admitted buggy auditing code was at the core of yesterday's six-hour outage – and revealed a little more about its infrastructure to explain how it vanished from the internet. In a write-up by infrastructure veep Santosh Janardhan, titled, "more details about the October 4 outage," the outrage-monetization giant …

  1. Mark 85 Silver badge
    Facepalm

    Then it was basically a single point of failure... a single system failure at that, which dragged everything down. Perhaps they need a human to sit inside the door to open it just in case something like this happens again?

    On the other hand, I wonder how many divorces, babies, etc. will result from this outage by folks who finally had to talk to each other instead of posting? Sort of like a power outage that lasts a day or so.

    I think this icon is appropriate for them.... They should self apply it about 20 times with a sledge hammer.

  2. This post has been deleted by its author

  3. Mark 85 Silver badge

    Not likely but possible or it was hackers.

    https://www.businessinsider.com/facebook-outage-likely-not-linked-whistleblower-claims-hearing-frances-haugen-2021-10

    1. HildyJ Silver badge
      Holmes

      More likely

      More likely it was Zuck trying to distract the news from coverage of the whistleblower.

      The estimated $60m in losses isn't even pocket change compared to the billions Zuck paid the FTC to take his name and testimony off the Cambridge Analytica settlement.

  4. The Man Who Fell To Earth Silver badge
    WTF?

    Facebook?

    What's that?

  5. DS999 Silver badge

    Too bad their security wasn't better

    If the DNS error had permanently locked them out of remote and physical access with no way around it they'd have to close down their business and we'd all be better off!

    1. Pascal Monett Silver badge

      Re: Too bad their security wasn't better

      Maybe, maybe not.

      I say that in reference to the fact that, since FaceBook's rise, my email spam count has fallen into the very low single-digit zone, so most of the shit I previously had to deal with is now floating in FaceBook's waters and that suits me fine.

    2. Danny 2 Silver badge

      Re: Too bad their security wasn't better

      Actually, the IT staff wanting to fix it were locked out of the building because their security passes didn't work. Quite rightly Stephen Colbert had a comedy ball -

      Facebook's Bad Day: Whistleblower's Claims Go Viral Before Global Outage Takes It All Down

      1. Snowy Silver badge
        Go

        Re: Too bad their security wasn't better

        Thanks for the link to the video, shame I can only upvote you once.

    3. Mark 85 Silver badge

      Re: Too bad their security wasn't better

      They needed a back up system to get in the doors... say a metal key. Quaint but works.

      1. herman Silver badge

        Re: Too bad their security wasn't better

        A metal key, like a medium sledge hammer you mean?

    4. Anonymous Coward
      Anonymous Coward

      Re: Too bad their security wasn't better

      If the DNS error had permanently locked them out of remote and physical access with no way around it they'd have to close down their business and we'd all be better off!

      That is not quite true. As a social network for the elderly and impaired, Facebook does have its uses, like it or not.

      https://www.stonekettle.com/2021/10/recap-october-4-2021.html

  6. lowwall
    Pint

    Who me?

    "During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network,"

    I'd like to nominate the lady or gent who entered this command for your next edition of Who Me.

    And offer them a no-doubt sorely needed pint.

    1. katrinab Silver badge
      Alien

      Re: Who me?

      Judging by recent episodes of Who, Me?, I think we will read about it in around 2040-2045.

      It is only fairly recently that we read about the AOL email outage in the late 1990s.

      1. chivo243 Silver badge

        Re: Who me?

        Or banking issues back when computers were brought in to save money to the customer, not cost an extra surcharge for withdrawing your own money...

        1. Noel Morgan
          Trollface

          Re: Who me?

          Computers were never brought in to save the customers money, they were brought in to increase profits for the banks.

      2. Xalran

        Re: Who me?

        Well there has to be some delay... the Who, Me? could make people lose their jobs if they get identified.

        I have a few I'll have to send for Who, Me? and On Call ( mainly On Call, I was not directly involved in the best Who, Me? )... one day... probably when they hit the 20 years of age. ( so that most of the people involved would have moved on )

        They are so specific and the French Telecom Universe is so small (it's common to meet former colleagues in the corridors of $TELCO, to the point that nobody is surprised anymore about it, or to meet somebody that worked for $TELCO1 you worked with on a project in the corridors of $TELCO2 a few years later) that many people that heard about them would be able to identify me and the other people involved despite all the Regomizer effort.

    2. John Robson Silver badge
      Trollface

      Re: Who me?

      I was too lazy, so I wrote a report saying zero accessibility and then updated reality to reflect my report.

    3. fajensen

      Re: Who me?

      Anyone done 'Reload' on a CISCO router only to find out that the live configuration was never stored despite the thing running for years?

      1. John Robson Silver badge

        Re: Who me?

        About as many people as have run Reload on a cisco router.

        1. Flightmode

          Re: Who me?

          How about messing up the config register setting, causing a newly upgraded router to end up in rommon with a blank config?

          I got the Platinum Level Achievement in this class - I did it on one of our out-of-band access routers. It's ironic having to get someone on site to log on to the console of your remote console server.

          1. John Robson Silver badge

            Re: Who me?

            Isn't that the one device you want *in band* management for?

        2. WanderingHaggis

          Re: Who me?

          Whatever happened to the "you now have five minutes to confirm everything is happy or I'll reboot and fall back to the previous configuration" approach? Fortigate can be told to behave like this and Junos does it out of the box; probably a script or something could be added to Cisco. Really cool, except when you sigh with relief and go for a coffee, forgetting to confirm and save.
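The confirm-or-roll-back idea described above can be sketched as a toy model. This is only an illustration of the pattern, not any vendor's implementation: the class name `ConfirmedCommit`, its methods and the 300-second window are all made up for the example, and time is passed in explicitly so the behaviour is easy to follow.

```python
class ConfirmedCommit:
    """Toy model of a 'confirm within N seconds or roll back' config change.

    All names here are hypothetical; real devices implement this internally.
    """

    def __init__(self, config, window_seconds=300):
        self.active = dict(config)          # configuration currently in force
        self.window_seconds = window_seconds
        self._previous = None               # saved copy to fall back to
        self._deadline = None               # time by which confirm() must arrive

    def commit(self, new_config, now):
        """Apply new_config provisionally and start the confirmation window."""
        self._previous = dict(self.active)
        self.active = dict(new_config)
        self._deadline = now + self.window_seconds

    def confirm(self, now):
        """Make the provisional change permanent if still inside the window."""
        if self._deadline is not None and now <= self._deadline:
            self._previous = None
            self._deadline = None
            return True
        return False

    def tick(self, now):
        """Called periodically; rolls back if the window expired unconfirmed."""
        if self._deadline is not None and now > self._deadline:
            self.active = self._previous
            self._previous = None
            self._deadline = None
            return True
        return False
```

The point of the design is that the rollback needs no working management path: if the change cuts the operator off, the missing confirmation is itself the trigger. It also reproduces the coffee-break failure mode, since a forgotten confirm undoes a perfectly good change.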

  7. jollyboyspecial Bronze badge

    Out of band management?

    One thing that stunned me about this is that they needed physical access to their data centres to resolve the issue. Has nobody at Facebook ever heard of out of band management?

    You'd think that a company as big as Facebook could afford to chuck in some off net connections to their DCs for the purposes of device management. I remember the days when we used to have a dial-up modem connection to the console port of each of our core routers and switches. Maybe things have moved on a bit since then, but that sort of setup would have saved them a lot of time. There's virtually nothing you can do locally that you couldn't do with a console port connection.

    1. Anonymous Coward Silver badge
      Big Brother

      Re: Out of band management?

      The old saying "without physical security, there is no security"... well out of band management effectively puts your physical security in the realm of that management system.

      How well secured is it?

      Is it at least as secure as your physical access precautions, including access control that's definitely up to date but not linked to your main authentication systems?

      OOB management has its place, but (in my view) some systems are just too critical to risk that additional exposure.

      1. Graham Cobb Silver badge

        Re: Out of band management?

        While it is true that there are serious security issues with OOB access, it is reasonably well understood how to make the tradeoff. Basically, you put lots of tedious locks on the OOB access so it is very secure and requires a lot of cooperation between people to make it work.

        But, more importantly, there are external companies who can take this on. Sure, FB's needs are well above the scale of most of these but there are a few global players who could do it. I am guessing a few of the global telco groups, and one or two of the major IT outsourcers (maybe IBM?).

        The hardest part is testing it - remembering that it needs to work even when FB's DNS servers are all down is the sort of thing that could have been forgotten (but not now!).

      2. jgard

        Re: Out of band management?

        That's simply untrue, it's actually far easier to secure out-of-band than normal in-band access. In band access has to be provided on a wider basis for many use cases and many users. The attack-surface and number of potential vulnerabilities are many times greater than OOB done properly.

        For a company the size of Facebook, implementing a secure OOB network is trivial. Point to point ethernet over fibre, mutually authenticated point to point VPN (authenticated by cert and another factor), physically secured and dedicated terminal in remote Facebook office. Designated engineers using multifactor auth and protected, physically secured creds, plus a code only the engineer knows. Monitored 24/7.

        I suppose you could dig up the fibre, splice to your identical hacking hardware, use the cert you previously nicked from the physically secured and network isolated machine in the FB office. Then get your coconspirator (who has managed to break into the live datacentre) to go through the mutual auth process with you. Then log in with the physically protected credentials and MFA tokens you have stolen from the Facebook offices, along with the access code that only the designated engineer(s) knows. You would of course have to do this before the 24/7 sec ops team saw the link go down.

        Your other option would be to go to Facebook HQ, bash all the guards on the head with a truncheon and make your way through their labyrinthine super-secure building, get to the physically secured terminal, read the mind of the engineer(s) with access codes, get the MFA tokens from the other safe. Again, this would have to be done before sec ops found out that HQ is under attack by hackers armed with truncheons, bashing guards on the noggin and running round the offices like barbarians in Rome.

    2. Victor Ludorum
      FAIL

      Re: Out of band management?

      Quote from Santosh's writeup:

      'Our primary and out-of-band network access was down, so we sent engineers onsite to the data centers to have them debug the issue and restart the systems.'

      1. Anonymous Coward
        Anonymous Coward

        Re: Out of band management?

        primary and out-of-band network access was down

        That kind of implies that the "out-of-band" access really isn't - like the backup route from the alternative supplier that's actually routed through the same fibre as your primary.

        I'm rather surprised that loss of DNS broke many of the internal tools - you'd kind of want your network management tools to be robust in the face of DNS failure. Mind you, the same applies to door security systems...

        1. Pascal Monett Silver badge

          Re: Out of band management?

          Well, not a network guy, but if their network is down it kinda seems logical to me that they couldn't use remote tools to see what was going on.

          In the end, somebody will always have to go and press the I/O button.

          1. Xalran

            Re: Out of band management?

            In theory, the Out of Band Management Network is built for that specific purpose.

            It's supposed to be a fully separate standalone network, that can be reached through other means than the normal operating network.

            In security and availability aware companies, it's really physically separate from the normal network (read: it has its own network equipment, its own routing configuration and its own (sometimes remote) access that's fully independent of the rest of the network). But in many cases, while it's a separate network, it has no remote access for security reasons: you need to be in the datacenter to access it... Security.

            Fesse de Bouc apparently is not that kind of company.

        2. Lon24 Silver badge

          Re: Out of band management?

          Yep, DNS failure/compromise is an expected hazard to any SysAdmin. One would expect someone to have a fresh copy of a host file to their most critical servers that could be quickly circulated to other engineers. That would have speeded up diagnosis and allowed the Zuck back into his office.

          Indeed my memories of 'The Social Network' would imply Zuck would have one. But maybe his lappy was stuck in his orifice?

          1. 42656e4d203239 Bronze badge

            Re: Out of band management?

            As it was borked BGP at the root of the issue, subsequently cascading network wide, having the IP addresses probably wouldn't work either (from anywhere outside of the target's LAN) - depending on route caching and aging...

            As the other poster said, sometimes you just have to get physical with the servers/routers.

            1. Graham Cobb Silver badge

              Re: Out of band management?

              Yes, I am sure FB will now be putting caching DNS servers, with longer-than-normal caches, at important internal network sites. It might be a good idea for external DNS servers that have lost contact with the root to withdraw their BGP routes but you don't want the internal ones to do so - better that they have out of date info than that they prevent using the tools necessary to fix things.

              1. emfiliane

                Re: Out of band management?

                That leads you right back to the *everyday* problem that they fixed this way in the first place: When the systems that gateway is managing go down, it can either cache the IP and advertisement, so that everyone on the same wider network continues trying to connect to the dead systems and productivity and access comes to a complete halt for minutes or hours, or it can immediately take proactive action to withdraw its advertisement so that any of the alternate mirror networks can be used instead. They implemented this in the first place because it's such a routine, everyday problem at their scale, while they've accidentally unpublished the whole company once in 20 years.

                No, they won't add a bunch of long caches to fix this, and if that's your knee-jerk reaction, I hope you don't work at multi-site scale.
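The trade-off described above can be sketched as a small model. It assumes, as the comments in this thread describe, that each DNS site stops advertising itself when its health check cannot reach the backbone. The names `DnsSite`, `backbone_ok` and `sites_answering` are invented for illustration and are not Facebook's tooling.

```python
from dataclasses import dataclass


@dataclass
class DnsSite:
    """Hypothetical DNS location that advertises itself only while healthy."""
    name: str
    backbone_ok: bool = True    # result of the (assumed) backbone health check
    advertising: bool = True    # whether the site currently announces its route

    def run_health_check(self):
        # Withdraw as soon as the data centres behind this site look
        # unreachable, so clients fail over instead of hitting a dead end;
        # re-advertise once the check passes again.
        self.advertising = self.backbone_ok
        return self.advertising


def sites_answering(sites):
    """Return the names of the sites still advertising after a health check."""
    return [site.name for site in sites if site.run_health_check()]
```

With one unhealthy site the others keep answering, which is the everyday case this design handles well. If every site fails the same check at once, the list comes back empty and nothing is left to answer, which is the rare case this thread is about.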

            2. Xalran

              Re: Out of band management?

              As I read it, it was an unspecified borked change in their backbone, that led the BGP to withdraw all the routes, which then led the DNS server to go FUBAR.

          2. Blank Reg Silver badge

            Re: Out of band management?

            It's been a few years since I last visited Facebook HQ, but at that time Zuckerberg didn't have an office, just a desk among the sea of desks in the open plan office

          3. fajensen

            Re: Out of band management?

            Except the disk drive with that host file was virtualised, its physical location recorded in the EAM system - also virtualised?

        3. Anonymous Coward
          Anonymous Coward

          Re: Out of band management?

          In most cases, "out of band" comms are simply a different subnet/VLAN potentially on a different switch but still using the same routers. I've not heard of anyone using an entire mirrored network for management comms, although it's entirely possible, if expensive.

          As for DNS breaking internal tools - having worked in places where IPs were hardcoded or host files were in place, it was a pain to manage and the recommendation was to use DNS to resolve a FQDN (not short name) when connecting to another server, then rely on DNS being accurate/up to date. It does make DNS highly critical to all your systems, but DNS is well understood and generally pretty reliable. Until the network becomes a NotWork. In any case, even if servers had hard coded IPs or local hosts files, they probably couldn't have done much anyway, since routing was TITSUP (Total Inability To Support Usual Packet-switching).

          1. Anonymous Coward
            Anonymous Coward

            Re: Out of band management?

            I've not heard of anyone using an entire mirrored network for management comms, although it's entirely possible, if expensive

            At our place the core routers (number in double figures) all have physically separate connectivity ( fibre or 4G dongle) to their console ports. We've been bitten too many times with router crashes* that needed someone to go to site and type "reload now" into the console port.

            * - Or when you do a "Who me?" moment and screw up the config and you can't SSH in.

            1. IGotOut Silver badge

              Re: Out of band management?

              "all have physically separate connectivity ( fibre or 4G dongle"

              Oh you poor deluded fool.

              Just wait until a digger slices through the trunk line of an exchange. Then you have a cascade fault, which means the alternative exchange that your other carrier uses falls over. Then you realise your 4G is relying on the same fibre backhaul that has just been sliced.

              It's OK you have a third link, but that alternative European link you have is now routing through another country, who just happens to have a big outage themselves.

              If this sounds like fantasy, it's not. Had it happen..... Twice.

              1. Graham Cobb Silver badge

                Re: Out of band management?

                Sure - but there is nothing you can reasonably do in that case - and it won't just be you having been hit in that case but a significant part of the economy of that country.

                OOB isn't OOB unless it goes through someone else's fibres and routers - that is what global telco groups are for: your OOB should be using a different telco from your main comms, and you put reasonable effort into getting diverse routing where possible. And you expect that about 30% of that will turn out not to work ("both telcos put their fibres in the same duct which the backhoe caught") but at least you can get 60% of the network back.

              2. A Non e-mouse Silver badge

                Re: Out of band management?

                It's all about risk management. You evaluate the risk, costs & mitigation costs.

            2. Anonymous Coward
              Anonymous Coward

              Re: Out of band management?

              I work for a mobile operator and our core routers have 4G remote access. From a competitor with whom we do not share towers or PoP locations.

            3. Anonymous Coward
              Anonymous Coward

              Re: Out of band management?

              At our place the core routers (number in double figures) all have physically separate connectivity

              Core routers. In Facebook's BGP-all-the-racks setup, every rack switch in every DC becomes a core router.

              https://research.fb.com/wp-content/uploads/2021/03/Running-BGP-in-Data-Centers-at-Scale_final.pdf

          2. Anonymous Coward
            Anonymous Coward

            Re: Out of band management?

            Yeah, but a *different* DNS right? Separating internal from external, or production from management, or whatever?

        4. Anonymous Coward
          Anonymous Coward

          Re: Out of band management?

          Transfer me to the Galactica. None of this modern fly-by-wire Battlestar shite...

          1. Throatwarbler Mangrove Silver badge
            Thumb Up

            Re: Out of band management?

            No networks on the Old Man's boat! Apparently no WD-40 either, based on the sounds the doors made.

        5. 2+2=5 Silver badge

          Re: Out of band management?

          > Mind you, the same applies to door security systems...

          The door security systems that I've worked with (clearly not the latest and greatest!) all have local cache in the door controllers so that they can still allow in cards they've 'seen' recently even if the network connection to the controller is down.
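A door controller with that kind of local cache might behave roughly like the sketch below. It is a guess at the general pattern rather than any real product's logic: `DoorController`, `request_entry`, `central_check` and the cache lifetime are all hypothetical names and parameters.

```python
class DoorController:
    """Toy door controller that falls back to a cache of recently seen cards."""

    def __init__(self, cache_ttl_seconds):
        self.cache_ttl = cache_ttl_seconds
        self._last_seen = {}  # card_id -> time of last approval by the central system

    def request_entry(self, card_id, now, central_check=None):
        """Decide whether to open the door.

        central_check is a callable returning True/False, or None when the
        network link to the central access system is down.
        """
        if central_check is not None:
            allowed = central_check(card_id)
            if allowed:
                self._last_seen[card_id] = now      # remember for offline use
            else:
                self._last_seen.pop(card_id, None)  # forget revoked cards
            return allowed
        # Offline: only admit cards approved recently enough.
        seen = self._last_seen.get(card_id)
        return seen is not None and now - seen <= self.cache_ttl
```

The limits of the approach show up directly in the offline branch: a card the door has never approved, or one last approved longer ago than the cache lifetime, is refused until the central system is reachable again.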

          1. PRR

            Re: Out of band management?

            > The door security systems that I've worked with ....all have local cache in the door controllers so that they can still allow in cards they've 'seen' recently even if the network connection to the controller is down.

            But there's a Pandemic on. Many of these rooms were hardly visited in the best of times. Now in Oct 2021 they may not have been entered since Mar 2020.

            Yet normal job-shuffle continues. Bob visited in Feb 2020, the door remembers seeing him, but today Sue is tasked with router touchy-feely and the door has never seen Sue.

            Of course Bob loans Sue his card, touching off a firestorm of security violations; or maybe Bob is confined to his house in another state and can't even contact Sue.

            Yes yes yes: battery drills and thermite. Or as at my last IT job: stout screwdriver and short hammer were everyday tools. Keys? Don't need no steenkin' keys.

      2. jgard

        Re: Out of band management?

        If their out-of-band network was taken out by their DNS servers going down, it's emphatically NOT an out-of-band network.

    3. Xalran

      Re: Out of band management?

      Apparently their OOB manglement relied on the non available DNS.

      As well as the offsite access to said OOB manglement.

      There's days, I wonder why I love IP addresses when it comes to OOB management ?

      1. This post has been deleted by its author

  8. Filippo Silver badge

    locked out by their own system

    I have a mental image of a team of Facebook engineers having to sneak in through the data center ventilation ducts, then having to dodge a grid of high-energy laser beams, and then having to fend off a horde of drones while one of their team hastily sets up thermite charges on the door to the security system AI, the whole scene lit by flashing red lights and with a background of blaring sirens and a synthesized voice repeating "Warning! Intruders detected! Lethal countermeasures deployed!"

    1. Pascal Monett Silver badge
      Thumb Up

      Re: locked out by their own system

      I'd watch that !

    2. Mike 125

      Re: locked out by their own system

      Brilliant.

      And what Resides in FB's data centres is even more Evil !

    3. Binraider Silver badge

      Re: locked out by their own system

      Facebook AI. Prototype of SHODAN?

      For those that don't get the reference, go play System Shock

    4. Plest Silver badge
      Facepalm

      Re: locked out by their own system

      Given that it's "Facecrap" I had images of 5 nerdy engineers with glasses banging on the glass front door of the datacentre while 2 security guards laugh at them from the desk just inside.

      Finally one of the engineers calls his manager, who in turn calls Zuck to come down and sign the visitors book as the only authorised manager allowed onsite, there then follows a heated exchange between Zuck and guards about letting "friends" into the comms room....

      ...4 hours later the engineers finally get in and start rebooting stuff, however it takes longer than normal as there's only one KVM/monitor every row of racks!

      1. Anonymous Coward
        Anonymous Coward

        Re: locked out by their own system

        I remember an outage of a specific-service at a particular European mobile operator back in the early 2000s (we provided the software and support for that service). A hardware failure took out their Sun mirrored RAID arrays for the Oracle database - it turned out that the 2 independent drive shelves were not actually fully independent (they shared a "passive" backplane which is where the fault occurred).

        Anyway what should have been maybe 2 hours of outage turned into almost 5 hours outage as once the Sun engineer arrived at the data centre with the replacement drive shelves Security wouldn't let him in ("your name is not on the list!"). It took well over an hour until someone sufficiently high up in the mobile OpCo could be found to shout down the phone at the security guard to let the engineer on-site (this was a DC where applications for access typically had to be made 7 days in advance).

    5. Jedit Silver badge
      FAIL

      Re: locked out by their own system

      It may be even better than that. From what I have heard - unverified, so please don't take this as gospel - the Facebook engineers were unable to access the office because the security card system verifies your identity using your Facebook credentials.

      If this is true, and if only for the associated amusement value I dearly hope that it is, then Facebook have managed to create the ultimate rendition of the PC support network that moves its phones to VOIP: a system where it becomes impossible for an engineer to enter the data centre if there is any need for them to do so.

      1. adam 40 Silver badge

        Re: locked out by their own system

        That is definitely true.

    6. fajensen
      Terminator

      Re: locked out by their own system

      This being FaceBook, I'd imagine a security bypass being wave after wave of interns driven forward until the sentry guns overheat or run out of ammo ...

  9. PerlyKing

    Some people

    Are they still saying that "some people" were affected? Technically accurate but totally misleading!

    1. katrinab Silver badge
      Paris Hilton

      Re: Some people

      I wasn't affected, because I never use any of their services.

      1. John Robson Silver badge

        Re: Some people

        They still use you :(

      2. PerlyKing
        Headmaster

        Re: Some people

        If I were being really picky, I would say that you were affected inasmuch as you are aware of the outage and it appears to have somewhat amused you ;-)

        But my point was that Facebook's initial response (on Twitter? I may die laughing!) was to say that "some people" had been affected. In the Good Old Days "some people" would imply a small minority, but these days it has become a stock PR phrase which abuses the old usage to trick people into inferring that only a small number of people were involved, while being technically true. Yesterday's outage affected probably 100% of Facebook's data providers and while this is "some" it is also, what, two billion people? It takes the PR usage of "some people" to a whole new level!

    2. Throatwarbler Mangrove Silver badge
      Terminator

      Re: Some people

      Right!? Think of the poor bots!

  10. Mike 125

    'interesting'

    Janardhan said he found it "interesting" to see how Facebook's security measures "slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making."

    Way to go with the positive spin. If he really found it so 'interesting', he's in the wrong job.

    These failure scenarios are logically predictable. If humans can't be arsed, then use AI to suggest such scenarios. Virtual networks can then model and test such scenarios.

    Why are people like Janardhan failing to do that? That is their job. And they should be kicked out when these things fail.

    Them saying 'Oh well, the tech teched the tech, with the unexpected result that backup tool tech failed to tech the tech...' is not good enough anymore.

    With thanks to - Charlie Stross.

    BTW, I detest FB too, like any sane person.

  11. T. F. M. Reader Silver badge

    And I thought...

    ... that FB asked their AI to find the most reliable way to stop the spread of FAKE NEWS(TM)... And that the AI, uncharacteristically, worked...

  12. Chris G Silver badge

    Spineless?

    I wasn't aware that slimy pond dwellers had any vertebrae, many nematodes are parasitic and can suck blood or damage vital organs such as brain tissue.

    1. veti Silver badge

      Re: Spineless?

      You're over-generalising. Newts, for instance, are vertebrates.

      1. DJV Silver badge

        Re: Spineless?

        Indeed, even those who only got turned into newts* due to witches.

        * until they got better.

    2. Cybersaber

      Re: Spineless?

      I was going to say that becoming spineless would be an important step forward for Farcebook. Their spine is so crooked there was never going to be a way to straighten it, ergo complete removal and replacement of the nervous system is necessary to cure the patient.

  13. Mike 137 Silver badge

    Network engineering or network winging it?

    "reports of employees' door keycards not even working on Facebook's campuses during the downtime let alone internal diagnosis and collaboration tools, hampering recovery"

    Two omissions are apparent: [1] functional network segregation (or the door cards would have still worked); [2] redundancy (no comment needed).

    When any system (be it an organisation or a technology setup) reaches a critical size, control is commonly lost. The solution is segmentation, so each element is below critical size and each can operate (at least at baseline) autonomously in emergency.

    That's more expensive to implement than chucking everything onto one huge pile, but it's safer and ultimately cheaper to keep running.

    1. yetanotheraoc Silver badge

      Re: Network engineering or network winging it?

      But ... cloud!

  14. Anonymous Coward
    Anonymous Coward

    ... why are Facebook hosting both their primary and secondary DNS?

    What even is the fucking point of a secondary DNS resolver, if it's on the same network? Just fucking pay someone else a few thousand dollars a year to host it and this never would have happened.

    1. John Riddoch

      It would still have happened. All the internal DNS servers removed their BGP advertisements so would have been unreachable anyway. We'd have known their IP addresses, but the backbone routers on the internet wouldn't have known where to send the traffic.

      As to why they're hosting both, probably a combination of "control" and the hubris of "we're so big and distributed at least one of our DNS servers will be available". Until this happens.

      In any case, tying BGP to DNS is potentially another problem in all this and I'm guessing there will be discussions in FB as to whether that really is such a good idea in future.

    2. Peter Gathercole Silver badge

      Secondary DNS

      If they had their own multiple DNS servers, hosted from a different site using different next-hop links to the backbones, that would normally be sufficient, and pretty much the same as using someone independent to provide a backup DNS.

      But if the internal links to their own core network all go out at the same time, causing each small datacentre to decide to stop providing BGP information then whether they had an independent secondary DNS or not would not make any difference. Even if there were valid DNS entries out there, chances are that something else in the non-functional core network would stop things working correctly.

      I'm not saying that their design was adequate, because obviously it wasn't, but it's not as simple as you think. Problems with DNS and network routing never are.
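The argument in the two replies above can be sketched in a few lines of Python. This is a toy model of the commenters' reasoning, not of Facebook's actual setup: the hostname, the address, the prefix, and the helper names (`resolve`, `has_route`, `reachable`) are all hypothetical, and the addresses come from a documentation range. It shows that a name held by an independent secondary still resolves, but the traffic has nowhere to go once the covering BGP prefix has been withdrawn.

```python
import ipaddress

# Hypothetical records held by an independent secondary DNS provider.
# 192.0.2.0/24 is a documentation range, not a real Facebook prefix.
SECONDARY_DNS = {"www.example.com": "192.0.2.10"}


def resolve(name, zone):
    """Return the address the zone holds for name, or None if unknown."""
    return zone.get(name)


def has_route(address, advertised_prefixes):
    """True if any currently advertised BGP prefix covers the address."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(prefix) for prefix in advertised_prefixes)


def reachable(name, zone, advertised_prefixes):
    """A host is reachable only if the name resolves AND a route exists."""
    address = resolve(name, zone)
    return address is not None and has_route(address, advertised_prefixes)


normal_routes = ["192.0.2.0/24"]   # prefix advertised as usual
after_withdrawal = []              # every advertisement withdrawn

print(reachable("www.example.com", SECONDARY_DNS, normal_routes))     # True
print(reachable("www.example.com", SECONDARY_DNS, after_withdrawal))  # False
```

In the second call the lookup still succeeds, so an externally hosted secondary would have kept answering queries; what fails is the routing step, which is the point both replies make.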

  15. John 62
    Mushroom

    Move fast...

    ...and break things

  16. Anonymous Coward
    Anonymous Coward

    I've had 2 I can think of:

    - Generator in a room in a building with 3 card-key access doors which lock automatically if the power is lost. We lost power, the generator did not start (it was actually faulty and waiting for a new part so wouldn't automatically spin up on a power failure) and the UPS failed rather dramatically. So we had to kick the doors down to get to the generator room to start the generator (with a screwdriver)!

    - Callout for Y2K (yes, I am old). Phone company using phones on their own network for callout (which isn't going to help if the phone network went down). Luckily nothing major broke, but I had a laugh when I raised it in one of the many meetings and then showed off my shiny new phone from another operator (though to be fair this was really because they were dragging their heels about providing any sort of phone for callout, and I needed a phone anyway).

    Fun times!

    1. Anonymous Coward
      Anonymous Coward

      Used to work for an access control company a few years ago, and the interviewer told me about one of their systems installed at a military base. They had a problem with it (no idea what) and were prepared to send a Harrier jump jet halfway across the country (UK, not USA) to get an engineer to sort it out.

      It was decided that wouldn't help, so the captain responded with "We'll just blow the doors off instead!"

    2. Graham Cobb Silver badge

      Earlier than that (late 90's) when I was providing critical tech to one of the UK network operators, the customer insisted that our support engineers used mobile phones from one of their competitors.

      I understood that that was standard practice across all the UK operators.

  17. Doctor Syntax Silver badge

    Quis audit auditors - or something like that.

    1. Irony Deficient Silver badge

      something like that

      Quis scrutabit ipsos scrutatores?

  18. David Roberts

    Some sympathy

    We read many tales of security loopholes.

    For example, operations staff using the fire exit at night to avoid all the hassle with the front doors.

    A very secure data centre probably wouldn't have easily accessible out of band access because that could be a major vulnerability.

    However, I assume that they are now trying to make emergency access a little easier without compromising security.

    1. Medieval Research Council

      Re: Some sympathy

      "For example, operations staff using the fire exit at night to avoid all the hassle with the front doors."

      At one place I worked in the 1970s (CADCentre Cambridge), it was impossible in the early hours to find the security guard to be let out, because he'd be sleeping in some random place. So standard practice was to go out through a window and simply close it; we couldn't secure it, of course. When someone wanted to come in at night they would wander round the outside until they found an office with the lights on and knock on the window. We all recognised each other of course, especially the regular night owls.

      1. yetanotheraoc Silver badge

        Re: Some sympathy

        And if you can't find anyone to answer the knock, try all the windows until you find the unsecured one -- left that way by the last person out.

      2. Anonymous Coward
        Anonymous Coward

        Re: Some sympathy

        Ah yes, the infamous in-security guard. Much like the one at one place I worked who had worked out, to several decimal places, exactly how many lap dances he could get at the local strip club per paycheck.

  19. roblightbody

    Wishing it went down permanently

    I can't have been the only one wishing it never came back up. I don't remember thinking that about any other big outage of any system before.

    The world would be a better place without Facebook.

    1. Gene Cash Silver badge

      Re: Wishing it went down permanently

      Especially considering dips in power usage in the range of tens of megawatts

      So people get vilified for using power to mine cryptocurrency, but FB gets away without a scratch.

  20. Plest Silver badge
    Facepalm

    Shame...

    What a shame my daughter couldn't WhatsApp pictures of her and her mates getting pissed during "Fresher's Week" at Uni to her other mates at other unis.

  21. Roger Kynaston Silver badge
    Happy

    must try harder

    "We try hard, we're sorry we failed, we'll try to do better"

    Next time make sure it is a week's outage.

  22. Graham Cobb Silver badge

    Thanks Santosh

    I am no fan of (or user of) Facebook - I wish they were replaced with a competing infrastructure of small companies, for those who care about social media at all. However, I want to acknowledge and thank Santosh Janardhan for providing some details very quickly. And Simon, of course, for writing it up.

    I feel some companies (including most of the big telcos) would have said nothing at all by this stage.

    I hope Santosh will feel that he can authorise his people to continue to provide information about this outage that we can all learn from.

  23. Anonymous Coward
    Anonymous Coward

    automated BOFH?

    "During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally."

    FB tech: "hey AutoBOFH, how much capacity is available on the 1Gig backbone_link_AQ3543?"

    AutoBOFH: "122Mbps"

    FB Tech: "could you increase the available capacity, pretty please? I really need more bandwidth on that link"

    AutoBOFH: "Sure thing" <CLICKETY> "You now have 1 Gbps of available bandwidth".

    FB Tech: "Oh wow, thank you! Hey, how'd you double my bandwidth so quickly?"

    AutoBOFH "Double? No, I just kicked off a process that will end up dropping all BGP routes. That will clear all the traffic off of that link".

    The VoIP traffic held up just long enough to hear the tech start begging for a configuration restoration as he realized the impending outage would result in a quick career change once the trail led back to his request.

    1. This post has been deleted by its author

  24. RobThBay

    No ad income for HOURS -- OMG!

    How will Emperor Mark survive?

  25. Potemkine! Silver badge
    Mushroom

    No, really

    Don't need to be sorry. You did the world a favor. Please do it again, for much longer if possible.

  26. Anonymous Coward
    Anonymous Coward

    Sadly we will see more of this in the future. 'Network engineers' nowadays are increasingly managing (virtual) networks via GUI with little to no understanding of routing and switching. Same with infosec - too much abstraction is basically turning IT staff into the Eloi.

    I get virtualisation and automation is good, but someone has to understand the underlying hardware and protocols. Those people are retiring or getting out of the game. It'll be an interesting 5-7 years.

  27. Tired and grumpy

    Sad things:

    1. They fixed it.

    2. It's apparently a rare event.

    3. They hope to learn from it to avoid repetition.

    Unfortunately it didn't take Twitter down with it, or we might have had a moment where the world really took a breath. Imagine if Facebook held a restart party and nobody came.

    Which level of the inferno is reserved for Zuckerberg and Dorsey, I wonder?

Biting the hand that feeds IT © 1998–2022