Reply to post: Put a label on everything...but make sure it's correct.

Cloudflare outage caused by techie pulling out the wrong cables

Anonymous Coward

Put a label on everything...but make sure it's correct.

Quite a number of years ago, I was (indirectly!) responsible for dropping large chunks, if not all, of the UK off the internet...for about 45 seconds.

I was in a major UK interworking point (obscurity re. this 'Tier 1' entity here is very, very deliberate!) at 0-dark-zero (=4 AM), tasked with doing the cut-over between the outgoing 'old' equipment/architecture and the incoming 'new' equipment and its new topology.

This involved a co-ordinated move of 16 fibre pairs, in the correct order, and at specific intervals with validation pauses after each change (you can tell it was years ago as the procedures were sensible!). So, 16 discrete stages to the changeover, then. We were well aware of what was at stake, at least in terms of raw bandwidth...the commercial impact we couldn't give a toss about *.

Changeover of the first 7 pairs went fine. Then came the 8th - monitoring bodies told us traffic on *all* important physical segments dropped to effectively zero, and all the observable activity LED indicators where I was stopped flashing...which definitely shouldn't have happened. Multiple instances of 'WTF' and 'O,F' were heard :-)

Myself and my colleague at the far end autonomously reverted change #8, then #7, then re-did #7, and things recovered and stabilised after an ass-twitchingly nervous 30 seconds or so, enjoyed by all concerned.
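
For anyone who hasn't done one of these, the shape of the procedure is roughly the sketch below. To be clear, this is an illustration and not what we actually ran: the names `staged_cutover`, `is_healthy` and `pause_s` are made up, and the rollback here simply backs out the newest change first until validation passes, which is simpler than what we did by hand on the night.

```python
import time
from typing import Callable, List, Tuple

# Each stage is a pair of callables: (apply the change, revert the change).
Stage = Tuple[Callable[[], None], Callable[[], None]]


def staged_cutover(
    stages: List[Stage],
    is_healthy: Callable[[], bool],
    pause_s: float = 0.0,
) -> int:
    """Apply stages in order, validating after each one.

    If validation fails, revert the most recent changes (newest first)
    until validation passes again, then stop. Returns the number of
    stages left applied.
    """
    applied: List[Callable[[], None]] = []
    for apply, revert in stages:
        apply()
        applied.append(revert)
        time.sleep(pause_s)  # validation pause before checking
        if not is_healthy():
            # Back out newest-first until the network looks sane again.
            while applied and not is_healthy():
                applied.pop()()
            break
    return len(applied)
```

The catch, as the rest of this story shows, is that the health check only tells you *that* a stage broke something, not that the labels you were working from were wrong.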

Turns out some contracting entity had mislabelled (the docs given to them were correct) the last 8 fibre pairs, leading to us interrupting two MPLS routes that *should* have been protected by being on different DWDM wavelengths. They weren't, so we interrupted local routing protocols just long enough for them to switch over to...something that no longer existed...which upset some people's implementations of BGP. Oooops.

IIRC, the actual sound was of several rather important ASes 'flapping'.

Only took the project coroner and the architects a week to figure out what went wrong. Root cause was human error in execution and inspection of labelling.

AC for obvious reasons.

*Was younger and 'differently focussed' in those days.
