Their conclusions are about the same as every rail accident investigation board report:
not documented - not sticking to protocol - bang.
Cloudflare has admitted that a four-and-a-bit-hour outage today was caused by someone pulling out cables that should have been left in place, but which were yanked because techies were given unhelpfully imprecise instructions. The incident started with some “planned maintenance at one of our core data centers” that saw techies …
Surely the cables should have a more meaningful label than just the presumed port number, as that would mean the two ends would be unlikely to have matching labels, and that way lies madness.
The way I was taught was that cables are labelled with a defined cable number, and the purpose/connections for that cable are kept in a table as a known good state.
Or even sooner, depending on the environment they are in. The "M-Tape" labels are even worse as far as their ability to stay stuck to things. The ones I applied fell off within six months of application. :(
I'm not certain what model we have at [RedactedCo]; they've held up well enough for us for the labelling we've done, although when we swapped core switches last year we used plain old beige masking tape and red Sharpie to label stuff prior to moving things around. I do know that it's a specialised one designed specifically for creating cable labels.
"I find Brother TZ labels tend to start letting go after 5-6 years"
Not if you use the "strong" TZE-S (will take the heat in the back of a rack) or "flexible" TZE-FX (labels roll around the cable, not stick out like flags) adhesive types, or go the whole nine yards and use the HS (heatshrinkable) labels.
HINT: Every character in a Brother TZ name means something (width, fg/bg colour, substrate, etc)
Of course if you use cheap "compatible" labels then all bets are off (and you can get interesting deals on specific genuine types, down to around 1/4 the normal retail price, if you look around, so there's no need to bother with the compatibles)
1. Clearly label the ports with a unique ID
2. Clearly label the cable with a unique ID
3. Document what each ID currently connects to, etc.
As noted, a label on a patch panel identifying the source never gets changed when it should, and labels on patch cables even less so. Altering the documentation is far easier and more reliable.
Much like comments in source code. That's why I prefer to use good identifier naming to indicate what the code is doing. The compiler doesn't care about comments so won't ever alert you if they become out of date. Plus in my experience most programmers don't read comments anyway.
Sadly I don't think there's any equivalent for wiring.
Can't upvote this enough.
Code should always be self-documenting, i.e. meaningful descriptive identifiers, no side-effects. A function/method should do exactly what its name states and nothing more.
Comments are to understand the reasoning, expectations and failure points.
Especially valuable in these agile times where documentation is, at best, an afterthought.
Quote "Comments shouldn't tell you what code is doing. the code should do that."
One of my old team lead's mantras was that if you couldn't figure out what the code was doing by reading the code, then the code was wrong, and he'd reject it.
On the rare occasions when some fancy code trick was needed (say for performance reasons or an outlying use case, like working round a known bug/undocumented feature in the language/compiler), then we'd do the code, but add a comment to state exactly why it was done this way. This usually included pseudo code showing how we actually wanted to do it, explaining why it doesn't work, and what the 'hack' is actually doing to fix the issue.
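A sketch of that practice in Python. Everything here is invented for illustration: the function, the "intended" version in the comment, and the interpreter quirk being worked around are all hypothetical.

```python
def checksum(data: bytes) -> int:
    """Sum of all bytes in `data`, modulo 2**16."""
    # Intended, obvious version:
    #     return sum(data) % 65536
    # Why we don't do that: (hypothetical) our target interpreter build
    # is pathologically slow summing very large byte strings in one go,
    # so we sum in fixed-size chunks and reduce modulo as we go.
    total = 0
    for i in range(0, len(data), 4096):
        total = (total + sum(data[i:i + 4096])) % 65536
    return total
```

The point is that the comment carries the reasoning (why the code has this shape, plus the pseudo-code we actually wanted), while the identifiers carry the "what".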
My personal method for patch cables is to colour code the cables themselves for the type of service running on them. That way at a glance you know that red is a connection from the outside world to a firewall, yellow is between switches down to individual servers being connected with white etc.
That way if you see a large nexus of red and yellow cables warning mental signs sound before unplugging them even if documentation says that they are supposed to only be internal cables.
Admittedly this doesn't work quite so well with fiber, since unlike cat5, fiber cable colours are already supposed to mean something.
If a supplier tells you this then seriously, find yourself a different supplier. Suppliers exist to supply you with equipment, not problems.
The chap I use prefers to supply through the normal supply chain, but if he can't then he'll phone and give me options, such as him sourcing the equipment through Amazon, eBay or wherever at $price, with delivery of course being down to the end supplier rather than him, and more uncertain than normal. He's found all sorts of odd spares for obscure bits of equipment tucked away in warehouses for me, as well as plenty of things that are "out of stock" with normal suppliers.
If your supplier won't do that and supplies you with problems as a result, don't supply them with further orders.
Ran into that at a previous job, One day a slightly different rack with slightly different wire management made the longest run slightly longer than the longest cable of the proper color we had on-hand. Had a black cable in my bag that would do, and wouldn't clash with the scheme and appear to belong elsewhere. But how to label it? After much thought I printed out a handful of labels which said "I AM GREEN" and attached them to various positions. Others who came upon the mismatched situation would inspect the oddity, find the label and (after a few moments of careful consideration) the quizzical look on their faces gave way to a sudden bolt of clarity.
Tell me about it....
What a nasty, nasty little test.
That's when I found out I would never be a fighter pilot. Turned to IT out of despair (and probably lack of any kind of other competence).
You and me both - I was firmly headed for RAF until I hit that problem.
No, I just spend meetings repeatedly asking which cells on the spreadsheet of the day are green and which are red. Given that a significant % of the population have difficulty telling them apart, why do people persist in using those two colours? There's a reason the green on traffic lights has a lot of blue in it...
"There's a reason the green on traffic lights has a lot of blue in it..."
Normally it's the red that has blue added, as a number of forms of CVD mean the sufferer can't see red _AT ALL_, whilst not seeing green is rarer than rocking horse shit (apart from anything else, the night-vision receptors in our eyes are centred on the green spectrum, so even someone with no colour vision at all is normally seeing green light).
Which makes you wonder why brake/tail lights continue to be red, and increasingly use monochromatic red LEDs.
Same here. I was tested by my optician when I was about 14 / 15 years old as part of the whole 'careers advice' thing.
At the end of the test, he handed me a document (three A4 pages, single-spaced print on both sides) that listed all the jobs that colour-deficient vision disqualified me from ever doing, with Military / Commercial Pilot on the first line.
This was at the bottom of page six:
Careers where colour blindness may be beneficial
Obviously I'm unable to discuss any subsequent work for Military Intelligence.
Mains cable colour coding is a classic example of CVD changing things worldwide.
Black/white/green is as dangerous as red/black/green when some people can't see red (it shows as somewhere between brown and black), some can't tell the difference between red and green (both show as brownish), and CVD testing was never part of electrical training.
Brown / blue / green-yellow stripe can be distinguished by virtually ALL people regardless of CVD
I don't have any CVD, but my father does, and the area that I'm from has somewhere between 1/4 and 1/3 of the male population suffering from red/green colourblindness. This stuff got drilled into us because it kills people.
To this end, we take a different approach.
We keep the number of colors to a minimum, with one for data, one for management, and one for external traffic, plus three colors for power cabling to denote which feed they're connected to (A, B and ATS).
Each cable gets an ID tag (with both text and barcode) at each end with a serial number. These numbers are all recorded on a database showing where they go from and to, and staff are forced to keep this updated.
Whenever a cable gets moved, no new labels are needed - the database just gets updated. When we need someone to check or change something, we can simply say "look for cable x", and it's easy to find.
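A minimal sketch of that database, assuming an SQLite table; the table name, endpoint format and serial format are invented for illustration.

```python
import sqlite3

# One row per physical cable; the serial on the tag at both ends is the key.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE cables (
    cable_id TEXT PRIMARY KEY,  -- serial printed on the tag at each end
    end_a    TEXT NOT NULL,     -- e.g. 'rack12/sw3/port07'
    end_b    TEXT NOT NULL,
    service  TEXT NOT NULL      -- data / management / external / power-A ...
)""")
db.execute("INSERT INTO cables VALUES "
           "('C-004217', 'rack12/sw3/port07', 'rack14/patch2/port31', 'data')")

def locate(cable_id):
    """Answer 'look for cable x': return its two ends and service type."""
    return db.execute(
        "SELECT end_a, end_b, service FROM cables WHERE cable_id = ?",
        (cable_id,)).fetchone()

# Moving a cable is a single UPDATE; the physical tags never change.
db.execute("UPDATE cables SET end_b = 'rack15/patch1/port02' "
           "WHERE cable_id = 'C-004217'")
print(locate('C-004217'))
```

This is why no relabelling is needed on a move: the serial number stays with the cable for its whole life, and only the endpoint columns change.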
A place I once worked for had a system of route numbers, not cable numbers. This was for highly distributed Audio/Video systems, so you'd see a number on a balanced audio patch panel and the same number would appear on all cables and interconnects right up to the mixing desk. With almost all bundles, there were also a few unconnected lines with a totally different series of numbers.
Server rooms are clearly a major pain to manage properly.
For me, the issue was simple: the techs were sent to empty a cabinet that was being decommissioned, and nobody noticed that an essential piece of kit was in it. Result: whoopsie. No surprise there.
And honestly, I'm not convinced that cable labels would have changed much. The techs were told to decommission the entire cabinet and that's what they did.
It's the higher-ups that fouled up on that one; good on them for recognising it. Now they're going to have the fun of starting a global review of every panel in every cabinet, to ensure that it is properly recorded along with what it is used for. That's the only way they will be able to avoid a repeat of this kind of snafu.
Cable labels may not have prevented the mistake, but they may have made it faster to recover once the mistake had been recognised.
Also, note that their "redundant" fibre wasn't diversely routed, since it all went through one patch panel... the redundant pairs should have been kept as far apart from each other as possible.
When [RedactedCo] originally built its main data center, each cabinet that held a core switch also had a set of fiber and copper interconnects between it and the various server cabinets. For some reason, one of the fiber interconnect sets was 'value-engineered' out, so as a result we ended up with a moderate quantity of extra-long fiber patches going from the single LIU under the floor to the other cabinet with the core switch.
Attempts at fixing it over the years have been shot down every time during budgeting, because the business doesn't want to drop roughly 20 thousand USD on fixing it right and for good.
Then there was the incident with the backhoe that severed one of the backbone fiber bundles for one of the buildings, because putting in direct-burial-rated fiber was also one of the things that was value-engineered out. That did get fixed properly, and at twice the cost of what doing it right originally was...
"That did get fixed properly, and at twice the cost of what doing it right originally was..."
Yup. It's always worth bringing that kind of thing up and asking that your objections on that basis be minuted, so that when it happens you can bring it up and point fingers at the responsible parties - and if they refuse to minute it, then you'll just note by email that they've refused to do so - and by the way, being told about this possibility may affect liability insurance, which means that whoever makes the decision might find insurers coming after him _personally_ to recover costs at a later date.
You'll get LOTS of dirty looks but it usually makes the person who makes the decision think twice.
The problem here is they unplugged all the cables. So knowing what a port is for doesn't help you know which of the 48 cables you have hanging there should go into that port. You need labels on the cables too!
(Preferably a unique number per cable, with the same unique number at both ends of the cable. Then your documentation can tell you what ports that cable number is supposed to be plugged in to).
We have a simple system of patching at work (having sometimes hundreds of patches in each data room, and dozens of data rooms, makes that a bit of a necessity). Each socket is numbered. On the switch side of the patch cable, we have a label showing the socket number, and on the other side of the patch cable, we have a label showing which switch and which port it is attached to. We also use different colours of patch cable to denote different networks.
I haven't read all the way to the end of the comments on this thread, but I am surprised that no one has mentioned the possibility that the labels could all be white, with the function denoted by a shape. I am sure most people who drive in the UK will have seen the diversion signs, or signs guiding you to a specific location, which are usually yellow with black squares, circles, dots etc.
We're not the size of Cloudflare, but we use software which tracks every server, switch, patch panel, PDU, cable (data & mains), duct, etc. in our machine rooms. Any scheduled work has to be pre-booked through the software (which gives you a report on what you're going to do); any emergency work has to be updated into the software ASAP.
The penalty for failing to use the software is quite simple: everyone gets to take the piss out of you for causing someone else pain. (Yes, I've been on the receiving end.) No management intervention required, as peer pressure is a far more effective stick in this situation.
The initial inputting of the data was a tedious piece of work and getting buy-in took some time, but now everyone sees its value. Now, when someone says "What happens if I cut this cable?", you click a button and it immediately tells you what will be affected.
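That "what will be affected?" button can be sketched as a reachability check over a graph of devices and cables. This is a toy illustration, not how any particular product actually works; all device names and the `links` map are invented.

```python
from collections import deque

# Adjacency list: which devices each device is cabled to.
links = {
    "core-sw1": ["rack1-sw", "rack2-sw"],
    "rack1-sw": ["core-sw1", "server-a", "server-b"],
    "rack2-sw": ["core-sw1", "server-c"],
    "server-a": ["rack1-sw"],
    "server-b": ["rack1-sw"],
    "server-c": ["rack2-sw"],
}

def affected_by_cut(a, b, root="core-sw1"):
    """Devices that lose all connectivity to `root` if cable a<->b is cut."""
    reachable, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for peer in links[node]:
            if {node, peer} == {a, b}:  # pretend this cable is severed
                continue
            if peer not in reachable:
                reachable.add(peer)
                queue.append(peer)
    return set(links) - reachable

# Cutting the core-to-rack1 uplink strands rack1's switch and its servers.
print(affected_by_cut("core-sw1", "rack1-sw"))
```

The hard part in real life isn't the query, of course; it's keeping the `links` data honest, which is what the peer-pressure regime above is for.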
Device42 is one example, although I can't comment on how good the cable management side is, as we are not using that functionality yet. But it can include things like server room floor plans, what's in each rack, all the cabling down to exactly where it's plugged in, etc.
For reference, we are currently only using it for auto-discovery of devices and services (i.e. servers, both hardware and virtual). This is pulling out things like all the OS versions and patch-level details, all the running services on every machine (services, daemons, running exe files, Oracle DBs, etc.), and then all the network traffic for all these services, covering internal, server-to-server, and any external traffic. This is part of a major migration project covering around 150 servers (or more, still counting) on 10+ year old OSs and apps, where the docs are either out of date or just missing! So Device42 is helping us map what is actually there and still has traffic.
>The penalty for failing to use the software is quite simple: everyone gets to take the piss out of you for causing someone else pain. (Yes, I've been on the receiving end.) No management intervention required, as peer pressure is a far more effective stick in this situation.
This is also a tried and tested method for compliance with source code version control rules. Not that I'd know, obviously. A friend told me...
It probably wasn't "they," it was "that guy." Doesn't matter how big your company, there is always "that guy". That guy goes in to do something, the something gets a little complicated, he fixes it "temporarily, I'll clean it up later," and then gets sent off to do something else and promptly forgets.
I've been "that guy." More often than I like to admit.
Probably, but I get the impression that this is one of those things where a BOFH would normally have been asked to double-check first and inform the techies not to touch the patch panel.
It's the sort of water cooler conversation part of my career is built on (being a point of reference for all kinds of stuff not necessarily in my actual job description).
In these strange times I would put money on that being the case, as it's never a problem until that undocumented conversation doesn't happen.
Gimp because these sorts of things usually become my problem to fix.
Building Management used to periodically come down from their ivory tower and go round empty offices, unplugging unused devices and cabinets and sticking notices on them so the poor sap who had just taken the day off knew who to complain to.
Unfortunately this occasionally meant they got locked in a building with no way to phone home as the 'unused' (ie "it didn't have any lights on") cabinet contained the modems/routers/servers feeding the local building services... like entry systems/alarms and phones
Two things are alarming. One rack for all four internet connections - that's just crazy; everybody has been using diversely routed fibre to a number of racks for years now, so Cloudflare obviously didn't even know they were reliant on one rack. Not knowing is worse than it being that way - we all have older infrastructure which needs to be upgraded or rebuilt, but Cloudflare missed this because they didn't know about it.
Secondly they were dithering over switching to DR because they admit switching back is difficult. It shouldn't be difficult - switching back is part of the DR plan, especially when it concerns your main control system. For such a critical piece of control they should be able to swap between two sites very quickly, both directions.
I like Cloudflare very much and am a customer spending a lot of money with them but they have had their pants pulled down in public and it's shown that perhaps they've grown too quickly and not put enough effort into reviewing key items of infrastructure and processes to ensure resilience is maintained at all times.
"Two things are alarming, one rack for all four internet connections - that's just crazy, everybody has been using diversely routed fibre to a number of racks for years now, so Cloudflare obviously didn't even know they were reliant on one rack. Not knowing is worse than it being that way, we all have older infrastructure which need to be upgraded or rebuilt but Cloudflare obviously missed this because they didn't now about it.
Secondly they were dithering over switching to DR because they admit switching back is difficult. It shouldn't be difficult - switching back is part of the DR plan, especially when it concerns your main control system. For such a critical piece of control they should be able to swap between two sites very quickly, both directions."
Can't echo that enough.
Not many people get that though.
DR in my current place is a joke; every project incorporates an element of it, but it's just plain lip service and doesn't do what it should. Heaven forbid we actually need to invoke DR, as most of it won't work & requires manual intervention.
"DR in my current place is a joke; every project incorporates an element of it, but it's just plain lip service and doesn't do what it should. Heaven forbid we actually need to invoke DR, as most of it won't work & requires manual intervention."
Haha -we don't even pay lip service to it - I know what will happen if things ever go pear shaped: we're screwed. I keep paper copies of all the emails I've sent over the years pointing this out!
"Is it just me or is the instruction “remove all the equipment in one of our cabinets” not imprecise, but just plain wrong?"
I refer to this at work as "Star Trek Management" based on the famous command "Fire All Weapons!".
A manager did it at one place that I worked "Get rid of everything in that cabinet!" I asked if he had a decommissioning plan and a list of the specific assets to be decommissioned and was called a time-waster. Someone else started to remove and chuck into the secure waste disposal (a ball mill) everything in the cabinet. An hour or so later I got a call. "There's something wrong with the network, the time source has failed."
The Meinberg Lantime was in that cabinet, or rather wasn't because it was now dust.
In 2008 I turned up for work to find access to the car park a bit difficult because of the number of emergency service vehicles parked on the access road. There were several buildings close together with a number of IT businesses on site. It turned out that the director of one of the businesses had hanged himself from the mezzanine floor of the office block. The reason was because someone at Telehouse had pulled out the wrong cable then spent the early hours of the morning randomly swapping cables to try and fix the problem. The business factored credit card sales on behalf of e-commerce sites and had been able to authenticate and authorise payments but not debit the cardholders' accounts. In a few hours overnight the business had gone spectacularly bankrupt.
I didn't even know the programs I've loaded onto my laptop were using Cloudflare until yesterday, when I found myself unable to get into anything!
I had rebooted my PC so many times after cleaning all my data history, cache files, etc etc etc.
I kept getting these messages from Cloudflare on the websites I was trying to get into, saying "Error 525: SSL Handshake Failed"!
I said to myself, "What the heck is Cloudflare?" I Google searched everything about Cloudflare to find out what it had to do with the programs I've downloaded.
These programs I've downloaded on my PC I've had for about 8 years or so, and never once saw anything about Cloudflare in the past if I had any trouble opening any one of them!
I worked on my laptop for almost 5 hours yesterday morning, then again last night, trying to figure out what's wrong with these programs! Late yesterday afternoon I finally decided to use the "Fix Me Stick", and after I tried that and rebooted my PC, all my programs I could get into again with no problems! I thought I'd fixed the downloaded-program issue myself!
This morning I come back online only to find I'm in the same exact situation as I was yesterday!
So obviously, when Cloudflare finally fixed the cable problem yesterday, it must've been at the exact time I rebooted my system after using the "Fix Me Stick"!
But Cloudflare was fixed only for a few hours, since they're down again today, April 16, 2020!
Now I'm registered with Cloudflare, so hopefully I'll get emails that'll let me know when Cloudflare goes down again for maintenance, so I don't have to spend hours on end trying to figure out and fix what's wrong with the programs I've loaded on my PC for nothing ever again!
I see Cloudflare is down now and I hope they'll be fixed and working again soon! I'm going to search online now to see if there's another program I can use to open the programs on my PC! Seems like bad things are going on everywhere nowadays!
Quite a number of years ago, I was (indirectly!) responsible for dropping large chunks, if not all, of the UK off the internet... for about 45 seconds.
I was in a major UK interworking point (obscurity re. this 'Tier 1' entity here is very, very deliberate!) at 0-dark-zero (= 4 AM), tasked with doing the cut-over between the outgoing 'old' equipment/architecture and the incoming 'new' equipment and its new topology.
This involved a co-ordinated move of 16 fibre pairs, in the correct order, and at specific intervals with validation pauses after each change (You can tell it was years ago as the procedures were sensible!). So, 16 discrete stages to the changeover, then. We were well aware of what was at stake, at least in terms of raw bandwidth..the commercial impact we couldn't give a toss about *.
Changeover of the first 7 pairs went fine. Then came the 8th - monitoring bodies told us traffic on *all* important physical segments dropped to effectively zero, and all the observable activity LED indicators where I was stopped flashing..which definitely shouldn't have happened. Multiple instances of 'WTF' and 'O,F' were heard :-)
Myself and colleague at the far end autonomously reverted change #8, then #7 , then re-did #7 and things recovered and stabilised after an ass-twitchingly nervous 30 seconds or so, enjoyed by all concerned.
Turns out some contracting entity had mis-labelled the last 8 fibre pairs (the docs given to them were correct), leading to us interrupting two MPLS routes that *should* have been protected by being on different DWDM wavelengths. They weren't, so we interrupted local routing protocols just long enough for them to switch over to... something that no longer existed... which upset some people's implementations of BGP. Oooops.
IIRC, the actual sound was of several rather important ASes 'flapping'.
Only took the project coroner and the architects a week to figure out what went wrong. Root cause was human error in execution and inspection of labelling.
AC for obvious reasons.
*Was younger and 'differently focussed' in those days.
Biting the hand that feeds IT © 1998–2020