The day I took down the data centre - I mean, the day I saved the day. Right, boss?

Welcome to a Who, Me? story in which the moral might be: "Be careful what you kick off before lunch if you want a mealtime free of phone calls." Today's tale concerns the exploits of "Anthony," who was working in the security department of one of the larger cable internet providers. One of his jobs was assessing systems due to …

  1. Anonymous Coward

    All the RDCs were on their own backbone?

    "All of the RDCs were connected via backbone connections, so latency was negligible, and bandwidth was massive."

    Just also happened to be sharing the corp firewall too.

    If it’s important enough to segregate on its own links but still needs firewalling, then it’s an own goal to use the enterprise firewall for that shared task. A dedicated device with enough capacity to handle the throughput achievable on those “backbones” should have been implemented too.

    I suspect those RDCs had their own “backbones” but the testing stuff relied on those same “backbones” and didn’t have dedicated connectivity as the story hints at.

  2. Anonymous Coward

    memories...

    Did the same with zmap once - without any proper thought, I decided that since my /24 scan had found so many of our corporate subnets, a /16 sweep would be worthwhile as well. I got about 2 minutes into it before my connection dropped. And the phones lit up...

    The VM I was running it from was in the DC, and I couldn't kill the scan because, of course, I had lost my connection too. However, the Cisco ASA was always having issues (mainly due to being configured incorrectly), so I went to lunch and hoped nobody would notice the "you're doing a zmap!" banner in the Linux system log while I was out...

    Anonymously, just in case!

    1. Anonymous Coward

      Re: memories...

      the old cut you're own throat shit show scenario

      1. Youngone Silver badge

        Re: memories...

        Your. It's your.

        And we've all done it.

  3. chivo243 Silver badge
    Alert

    NMAP's 11 setting!

    As Jamie would say, "Well, there's your problem." I've seen it in action, used by an uninformed admin. Insane about sums it up... all the network monitors went red in the space of 10 seconds, and then stopped reporting until he killed NMAP. That was a fun morning.

  4. A K Stiles

    Takes me back

    When I worked in a place that ran AS400 / iSeries / i5 systems, we developers sensibly had our own box for dev work, separate to the live server.

    Sometimes a new function or a fix would require a reasonably chunky bit of code to be run and, obviously, we'd try it first on the dev box to check it wasn't going to run wild and destroy all the account records.

    Frequently these would result in calls from the sys-admin department to complain that "Your job is taking 85% of the processor" or something similar, along with a request to terminate the job to clear the alert on the big screen. Now obviously as developers (on the dev box) our usual ponder was "what's happening with the other 15% then?".

    Various techniques were employed to see if we could get the jobs to run to completion, from being "on a call" (or at least the phone off the hook) when they might phone us, to prolonging the conversation about which job was a problem, to even coding in some 5 second sleeps every 20 seconds or so to reduce the persistence of the notification, just so the job would actually get a chance to run through to completion without getting terminated.

    I can only remember a couple of occasions in 10 years where one of the other colleagues actually working on the dev box shouted across the office about not being able to do stuff because of server obstruction, so it wasn't exactly a significant issue!
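For anyone curious what "some 5 second sleeps every 20 seconds or so" might look like, here is a minimal sketch. It is not the original AS400 job, just the same duty-cycle idea in Python; `run_throttled`, `work_items` and `process` are hypothetical names, and the clock and sleep functions are injectable purely so the behaviour can be checked without waiting.

```python
import time


def run_throttled(work_items, process, busy_seconds=20.0, sleep_seconds=5.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Process each item, pausing after roughly every `busy_seconds` of work.

    The pause only happens between items, so a single long item is never
    interrupted; the job simply backs off before picking up the next one.
    """
    results = []
    window_start = clock()
    for item in work_items:
        results.append(process(item))
        if clock() - window_start >= busy_seconds:
            sleep(sleep_seconds)      # let the CPU graph drop for a moment
            window_start = clock()    # start a fresh busy window
    return results
```

The trade-off is the one described above: the job takes longer overall, but the alert has less chance to persist on the big screen.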

    1. Remy Redert

      Re: Takes me back

      We routinely make other testers angry when our automated tests run. Usually when we're doing single machine test runs to verify our automated tests are good before putting them in the weekly run over the weekend.

      Since the transition to the cloud servers, we've routinely soft locked machines over the weekend and come back to find over half our tests have timed out.

    2. Anonymous Coward

      Re: Takes me back

      I regularly have to run CPU-intensive tasks on a VM dedicated to this process, but every time I ran it the VM admins complained it was using too much CPU. I somehow persuaded them to double the core count, and with the help of taskset it now never uses more than 50% available CPU so they don't complain any more!
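The taskset trick above amounts to restricting which cores a process may run on. A rough Python equivalent, assuming a Linux host (the `os.sched_*` calls are not available everywhere), might look like the sketch below; `half_of` and `pin_to_half` are hypothetical helper names, not part of any tool mentioned in the thread.

```python
import os


def half_of(cpus):
    """Return the lower half of a set of CPU ids (always at least one)."""
    ordered = sorted(cpus)
    return set(ordered[: max(1, len(ordered) // 2)])


def pin_to_half(pid=0):
    """Restrict `pid` (0 means this process) to half of its allowed CPUs.

    Returns the set of CPUs kept, or None where the platform does not
    expose CPU affinity (for example macOS or Windows).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    keep = half_of(os.sched_getaffinity(pid))
    os.sched_setaffinity(pid, keep)
    return keep
```

With the core count doubled and the job pinned to half of them, the job gets the same compute as before while the VM-wide CPU graph tops out around 50%.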

    3. Anonymous Coward

      Re: Takes me back

      Worked at a small company with a small network some years ago, doing basic admin stuff.

      One day we had an outage: no one could access emails, files etc.

      Found that one PC was flooding the network, so disconnected it from the switch, went back to speak to the owner to find a bunch of programmers gathered around a PC wondering why their new code was causing 100% CPU usage while the rest of the company were panicking.....

  5. Pascal Monett Silver badge

    That's interesting

    So, you have a network tool that has a setting that can basically kill the network. It's up to you to not use that setting.

    That doesn't sound like a useful thing to me.

    Is there any reason to have that setting ? Stress test, maybe ?

    1. DJV Silver badge

      Re: That's interesting

      Maybe it needs an "Are you sure you really want to fuck the network completely? (y/N)" prompt before running at that setting, followed by an "Ok, on your head be it! Commencing to fuck the network NOW!" response should the person enter Y and hit return.

      1. Mark 85

        Re: That's interesting

        Maybe it needs an "Are you sure you really want to fuck the network completely? (y/N)" prompt before running at that setting, followed by an "Ok, on your head be it! Commencing to fuck the network NOW!" response should the person enter Y and hit return.

        There are admins who would think "how bad can it be?" and then "challenge accepted" while hitting "Y" and "Enter".

      2. jake Silver badge

        Re: That's interesting

        "Maybe it needs an "Are you sure you really want to fuck the network completely? (y/N)" prompt"

        No. That way lies madness. Where do we draw the line on commands that require a yes or no to proceed? What about in scripts/aliases? Etc.

        1. rwessman

          Re: That's interesting

          That trick never works. Years ago at a former company, there was a disk maintenance tool with the potential to be very destructive if used incorrectly. In fact, it would issue a "Are you sure? (y/n)" prompt twice.

          To which customers wrote wrappers like:

          dpmaint << END
          y
          y
          END
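To see why the double prompt did not help, here is a small sketch of a y/N confirmation, assuming it reads its answer from standard input like most command-line tools. `confirm` is a hypothetical function, not the real dpmaint code; the point is only that anything reading a stream can be fed canned answers.

```python
import io
import sys


def confirm(question, stream=sys.stdin, out=sys.stderr):
    """Ask a yes/no question; anything other than y/yes counts as No."""
    out.write(f"{question} (y/N) ")
    out.flush()
    answer = stream.readline().strip().lower()
    return answer in ("y", "yes")


# A wrapper script can pre-load both answers, exactly like the heredoc above.
canned = io.StringIO("y\ny\n")
quiet = io.StringIO()
both_confirmed = (confirm("Are you sure?", canned, quiet)
                  and confirm("Are you sure?", canned, quiet))
```

Defaulting to No protects against a stray Enter key, but it does nothing against a script that answers on the user's behalf.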

      3. Down not across

        Re: That's interesting

        I prefer tools not to be nobbled, and instead expect people to actually understand the tool they are using and how to use it. It is not too much to ask is it?

        1. Anonymous Coward

          Re: That's interesting

          Your expectations are way too high, from the administrators that have more power than they know how to deal with to the trainee that "wants to see what it does"

          If you have proper documentation on tools then that is fair enough but usually the documentation is buried in a locked filing cabinet in an unused basement with a sign on the door saying, "Beware of the Leopard"

        2. Terry 6 Silver badge

          Re: That's interesting

          Been mulling that over. An issue here is that there are all sorts of people legitimately using some of these tools, if only because there's no one available who's fully expert in using them- or maybe whose employment can not be justified on a regular basis (at least in smaller enterprises). But there is no filter between expert and knowing just enough to be dangerous.

    2. BenM 29 Silver badge

      Re: That's interesting

      >>So, you have a network tool that has a setting that can basically kill the network. It's up to you to not use that setting.

      I have a car that has a top rated speed of 140mph. Just because it can (optimistically, I will admit) doesn't mean that I should drive it that fast on a public road...

      >>That doesn't sound like a useful thing to me.

      In which case, you might want to read up about nmap and why it was written in the first place... people who don't understand nmap should, it could be argued, perhaps not be the people using it.

      >>Is there any reason to have that setting ?

      Well yes, otherwise it wouldn't have been included as a setting... there must be a use case for it.

      YMMV of course and, being Open Source, you could always fork the project and create a version without any of the built in functionality you disapprove of...

      >>Stress test, maybe ?

      Indeed... or perhaps taking down a firewall, data link or data centre...

      multiple edits: fixing my uncaffinated typos.

      1. anothercynic Silver badge

        Re: That's interesting

        Don't tell me you've never red-lined your car before! Awww man! Lots of fun that... ';-)

        1. Anonymous Coward

          Re: That's interesting

          Not my car no. Company and rental cars are a different matter....

          1. jake Silver badge

            Re: That's interesting

            You really ought to redline your car occasionally, you know. It's actually good for it.

            HOWEVER, if a vehicle has been babied for 100,000 miles, suddenly taking it to redline a couple times per day might cause problems because it hasn't been worn in properly. Talk to a real mechanic, not one of those "factory trained technicians" who can't diagnose a blown powervalve without $125,000 in computerized test equipment ... and probably not even then, because they don't teach carburetors anymore.

      2. David Hicklin Bronze badge

        Re: That's interesting

        so you have never done the "maximum speed" acceptance test with a new car then?

    3. Jou (Mxyzptlk) Silver badge

      Re: That's interesting

      If an nmap packet flood can kill a firewall, any device can. Literally any device, including a defective network chip which wreaks havoc on its own without the machine/OS behind the chip noticing anything.

    4. jake Silver badge

      Re: That's interesting

      "That doesn't sound like a useful thing to me."

      Then don't use it to do that. Simples.

      Note that you can do the exact same thing with perl. Or C. Or assembler.

    5. SImon Hobson Bronze badge

      Re: That's interesting

      That doesn't sound like a useful thing to me

      As said, there must be uses for it.

      Without thinking very hard, you may have a large network where you want to run a scan quickly. Bear in mind that the network didn't crash - one device that the traffic ran through crashed. From memory, nmap will scan across a network - i.e. spread its probes across many IP addresses rather than doing all the probes for one IP at a time. If you don't have a firewall that will crash with that level of traffic, then there isn't a problem running a fast scan against a large address range.

      You'll be restricted by the speed of your single link to the switch, and the performance of your own machine - once the traffic is spread across multiple IPs (and almost certainly across multiple hosts on different links), then the traffic to any single host shouldn't be too bad. If your switches aren't a pile of poo, then they will quite happily pass all the packets - as should routers.

      And of course, if you are doing a scan during a maintenance window, then getting it done quickly is a benefit. Some scans can be quite slow.
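The "spread its probes across many IP addresses" point can be pictured with a toy ordering. This is only an illustration of the idea as described above, not nmap's actual scheduler; `probes_spread_across_hosts` is a hypothetical function and the address range is a documentation prefix.

```python
import ipaddress


def probes_spread_across_hosts(network, ports):
    """List (host, port) probes so consecutive probes hit different hosts.

    Iterating ports in the outer loop means each host sees one probe per
    pass over the address range, instead of a burst of every port at once.
    """
    hosts = [str(h) for h in ipaddress.ip_network(network).hosts()]
    return [(host, port) for port in ports for host in hosts]


order = probes_spread_across_hosts("192.0.2.0/30", [22, 80])
```

No single host is hammered, but every packet still crosses whatever sits in the middle, which is why the shared firewall was the thing that fell over.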

      1. whitepines
        Boffin

        Re: That's interesting

        That doesn't sound like a useful thing to me

        We use it for stress testing, mostly (in test environments). It's found several pile of poo switch vendors / models before they were put into production.

        It's also been used at commissioning, to stress test entire networks. Put it on the fattest pipe and watch it do its magic; if nothing crashes, the network is probably good to go for years.

        1. Rich 11

          Re: That's interesting

          the network is probably good to go for years a while.

          FTFY. Never make a specific claim further ahead than you can actually see. Fuck knows what a client will decide to do the day after handover.

    6. TSM

      Re: That's interesting

      From the documentation page that was linked in the article:

      > Some people love -T5 though it is too aggressive for my taste.

      Granted, that's not very specific, but apparently it is useful for some.

    7. Anonymous Coward

      Re: That's interesting

      No you have an incorrectly configured network that buckled at the command.

  6. oiseau
    Facepalm

    Should ..

    Maybe it needs Should have an "Are you sure you really want to fuck the network completely? (y/N)" prompt ...

    There you go, fixed it.

    And without having had my espresso yet. ;^*

    O.

    1. jake Silver badge

      Re: Should ..

      No. Just no.

  7. Anonymous Coward

    We used to get a similar issue at an MSP I used to work for, with Nessus scans giving the same sort of shit show. Another old favourite is Windows Update deployments swamping things at the end of remote links.

  8. Anonymous Coward

    Mentioned it before...

    .... was out in the sticks at a community hospital trying to sort out a printer problem with a PC that was running Windows Server. Strange, but maybe it was to do with the specialist software running.

    Rebooted server - then found out that the PC was actually a Winterm thing and I've just messed up every GUM clinic in the county.... Once the server had rebooted, it was discovered that the database services didn't start automatically....

    It all worked out - the generic login no longer had permissions to reboot servers, and the database starts up automatically. I described it as "unscheduled disaster recovery testing"

  9. Anonymous Coward

    One of my guys somehow physically disconnected an entire office block at the mains input, triggering all kinds of problems including automatic calls to the police and fire brigade. Just in time to avoid the power surge that took out the entire neighborhood's electronics. Clients not as upset as they'd otherwise have been.

    1. jake Silver badge

      A similar tale ... One that happened to a friend down at IBM Almaden ... Running late to get out the door (baseball game was due to start), he accidentally entered something along the lines of rm -rf / tmp (note inadvertent space). At approximately 5:04 PM local time on the 17th of October, 1989. About one millisecond later he realized what he had done. About one microsecond after that, the Magnitude 6.9 Loma Prieta earthquake hit, with the epicenter approximately 14 miles to the South South West.

      The SCSI drive, which, in his words, "was happily losing its tiny little mind, and destroying mine alongside it" suffered a hard crash before the power went out. Seems that even high-end SCSI drives don't like imitating a pogo stick when the heads are moving around frenetically. DriveSavers in Marin managed to salvage most of the drive, thus saving a high-temp superconductor project over a year of data. DriveSavers didn't volunteer that the command had been run, so he didn't lose his job ... but his entire department got yelled at for not having a proper off-site backup strategy in place.
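The damage done by that one stray space comes down to how the shell splits a command line into arguments. A quick way to see it, using Python's standard `shlex` module to mimic shell-style splitting (nothing is executed here):

```python
import shlex

# Intended command: a single path argument, /tmp
intended = shlex.split("rm -rf /tmp")

# Mistyped command: the space turns one path into two, "/" and "tmp"
mistyped = shlex.split("rm -rf / tmp")

print(intended)
print(mistyped)
```

In the mistyped version rm is handed the root directory as its first target and a relative path called tmp as its second.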

      1. anothercynic Silver badge

        Damn those damn spaces... And thank God for an earthquake. :-)

      2. Precordial thump Silver badge

        Must have forgotten that /inhibit_san_andreas_fault file he'd left in the root directory....

        SUBDUCTION FAULT - CORE DUMPED

        1. jake Silver badge

          Except the San Andreas isn't a subduction zone.

          TRANSFORM FAULT - CORE DUMPED

          Attempt to re-map already mapped territory through right-lateral strike-slip.

      3. Terry 6 Silver badge

        It's many years since I did stuff at that end of computers (being one of the 1980s self-taught IT guys), because I've left anything too complicated to proper professionals for a few decades now.

        But that whole "butterfly effect" of a small error in typing a command having devastating consequences haunted me for a good few years.

        And my view hasn't changed. There must be a better way.

        1. Eclectic Man Silver badge

          'Hard' engineering and Software engineering

          I have often considered that software engineering and physical engineering (you know, like actually building things out of Lego, Rolled Steel Joists (RSJs) concrete, glass etc.) are similar in some ways but oh so very different in their failure modes. A few million extra atoms of Iron on a girder make little to no difference to the girder's failure modes. A single byte out of place can make the world of difference to a computer program.

      4. Richard Pennington 1

        Not just any baseball game

        I was in Baltimore at a computer security conference at the time; it was just after 8pm East Coast time, so I was back in my hotel room watching the World Series (two teams from California, IIRC).

        It was the only time I got to watch a major earthquake on live TV.

    2. whitepines
      Boffin

      Are you certain the sudden loss of the load from that entire block isn't what caused the subsequent surge elsewhere?

      1. TSM

        Exactly what I was thinking!

      2. Anonymous Coward

        Fairly certain, the power company investigation had to explain the millions in damage. They'd have got us to pay if they could.

    3. Doctor Syntax Silver badge

      Are you sure it wasn't your guy who was also responsible for the surge?

  10. tweell
    FAIL

    'Router Testing'

    As a shiny new CCNA I was trying to figure out why the main WAN router was slow, and input the debug all command. To Cisco's credit they do come back with "This may severely impact network performance. Continue? (yes/no)" and of course I typed Y. The router hung, refused any more commands, and I had to give it a power cycle. OOPS!

    It turned out that the telco was having problems (backhoe took out a fiber bundle), but their first step is to lie about it. 'Problem? What problem? Must be on your end.'

    Two lessons learned that day. First, don't trust telco NOCs. Second, don't do a debug all command!

    1. Terry 6 Silver badge

      Re: 'Router Testing'

      Hmmm.

      This is very like Virgin Media's policy of not telling front line staff that answer calls (or update the "service status" page) that there's a major outage in the area. So they have punters doing resets and arranging home visits (ffs).

      Only once has a call handler said "Funny that, I've had a lot of calls....Let me check", then come back and said it was an area issue. And that isn't really the point, because he should have been told. He had the grace to be embarrassed.

      1. Alan Brown Silver badge

        Re: 'Router Testing'

        "Virgin Media's policy of not telling front line staff that answer calls ...that there's a major outage in the area. So they have punters doing resets and arranging home visits"

        Not just VM

        The local DSLAM is flakey. Openreach knows it but refuses to admit it to ISPs. Contractors are doing lift and shifts, meaning you might have a working line one day and a rotten one the next. Everyone blames everyone else.

        It seems the best way forward is to encourage the local P*** to steal the cabling and vent some H2SO4 gas into the cabinet's air intakes (then let them think batteries boiled). or arrange for the cabinet to be totalled by a HGV

    2. Down not across

      Re: 'Router Testing'

      We had one or two incidents where someone did that on a core router (at an ISP..). From what I recall no one ever did it more than once.

      Having tried it in an isolated lab environment for fun, it does not necessarily hang the router, but it doesn't take much traffic passing through for it to saturate the CPU. In most cases there is also such a thing as too much information.

  11. Eclectic Man Silver badge

    I wonder..

    who coded the different levels and described them (rather unhelpfully IMHO) as

    "The template names," explain the docs, "are paranoid (0), sneaky (1), polite (2), normal (3), aggressive (4), and insane (5)."

    Could they be persuaded to explain to us ignorant readers of el Reg?

    Are there other sysadmin type commands that can total a network or computer we should all know about, so as not to use them until the day after retirement / redundancy / summary dismissal ?
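For reference, the template names quoted from the docs map onto the numbers 0 to 5, and the thread mentions `-T5` as the flag for the top one. A minimal lookup, assuming only that naming and the `-T<number>` flag form (the helper `timing_flag` is a hypothetical name, not part of nmap):

```python
# Template names and numbers exactly as quoted from the nmap docs above.
NMAP_TIMING_TEMPLATES = {
    "paranoid": 0,
    "sneaky": 1,
    "polite": 2,
    "normal": 3,
    "aggressive": 4,
    "insane": 5,
}


def timing_flag(name):
    """Return the -T<number> flag for a template name (case-insensitive)."""
    level = NMAP_TIMING_TEMPLATES[name.lower()]
    return f"-T{level}"
```

So "NMAP's 11 setting" from earlier in the thread is `timing_flag("insane")`, and an unknown name raises a `KeyError` rather than silently picking a speed.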

    1. jake Silver badge

      Re: I wonder..

      Whimsy in naming things like this has been common since at least the 1960s. It's one way that grad students attempt to remain sane in a rather stressful time of life. Some make it out of the lab, some don't, alas.

      "Are there other sysadmin type commands that can total a network or computer"

      Almost all of them, when used (im)properly. This is why important systems shouldn't allow users to run as root.

    2. Doctor Syntax Silver badge

      Re: I wonder..

      Recursive delete or move in the wrong place is the classic. However, as per Jake's post above most of us have managed to run them in more everyday circumstances (starting a move of root was mine). There's a good argument that you're not a real Unix admin until you've had an OOOPS!!! like that.

      More subtle is installing what looks like a routine update that uses just enough extra memory to drive a system that was marginal into thrashing. At least manglement finally and very, very quickly accepted what they'd been told about a memory upgrade.

      1. jake Silver badge

        Re: I wonder..

        My biggest one[0] was in the late '80s and involved running a script (as root, and from /, of course) that copied all my lovingly compiled ELF system binaries and attendant source and config files onto a production system that was pure COFF ... OOOPS!!! doesn't quite cover it :-)

        I had been watching the production system for a couple days for something unrelated, and the server name, Pluto, was in my head. The test system that I was aiming for was Goofy ... Mea Culpa. Production systems are now named after something related to the business at hand, test systems are science related, and my personal stuff is allowed to be more whimsical.

        And all such scripts are now hard-coded with the proper names.

        I've made similar, smaller mistakes since, of course. I'm only human. A good sysadmin knows how to recover from luser error quickly ... especially his/her own!

        [0] Other than dropping the odd hand-truck full of card decks ... and I once backed a forklift over the only known copy of a boot tape for a rather proprietary bit of kit ... fortunately it was still running, and I managed to generate a new tape before anyone important noticed.

  12. jake Silver badge

    "It's just a couple lines of code. It'll be OK, ship it!"

    With those words, in late 1977 I managed to take down all the PDP10 kit at Stanford and Berkeley with a software upgrade. Effectively split the West coast ARPANet in half for a couple hours. Not fun having bigwigs from Moffett and NASA Ames screaming because they couldn't talk to JPL and Lockheed without going through MIT ... Needless to say, I'm a trifle less cavalier about large-scale software upgrades these days. Even the little ones.

    Live and learn.

    1. Deimos

      Re: "It's just a couple lines of code. It'll be OK, ship it!"

      Saw somebody do something similar to all the UK ATMs of a certain bank at 5pm on a Friday just before a bank holiday. It was literally two lines of code but “somebody” got “and” mixed up with “or” which made withdrawing anything but a single fiver impossible.

      Made even more sweet because the somebody wasn’t me.

    2. Eclectic Man Silver badge
      Facepalm

      Re: "It's just a couple lines of code. It'll be OK, ship it!"

      We DEFINITELY need that as a "Who Me?" entry in its own right, young Jake.

  13. earl grey
    WTF?

    It wasn't a network command

    But the effect was similar.

    Once upon a time in the long distant past I was admin on a Unisys 1100/74 (4 CPU system) on which we had 3 production Mappers and a test version running. I set them to run real-time (and on Unisys that's what it really means) and the operators couldn't key into the console for 40-45 minutes at a time.....la de da.

    Changed that back shortly after.

  14. PhillW

    Ever screwed something up

    so badly that the only way out of a P45 and the march of shame was via the medium of spin?

    Who ya' gonna call?

    Four Seasons Total Landscaping

    1. Eclectic Man Silver badge

      Re: Ever screwed something up

      One can only assume that the Four Seasons Hotel 'declined' the last minute attempt to hire them and save face for Trump's team.

      1. anothercynic Silver badge

        Re: Ever screwed something up

        The Four Seasons pointed out the problem in the first place and made sure they couldn't be booked ;-)

  15. MacroRodent
    Go

    Useful also against the Matrix

    In one of the Matrix movies, I think it was the second one, the heroes are shown using nmap to find holes in the Matrix. One of the very rare cases where Hollywood shows actually plausible software in use, instead of fakes that look flashy.

    1. Old Shoes

      Re: Useful also against the Matrix

      nmap and sshnuke. Actual real tools and hacks at the time!

      https://nmap.org/movies/
