> when a firewall upgrade
"a" firewall. Singular, i. e. No redundancy.
The things companies have to resort to just to earn an (in)decent buck.
Australian telco Optus says its staff may not have followed established processes when a firewall upgrade they conducted resulted in customers not being able to call emergency services for 14 hours – a period during which it is thought three of the carrier’s customers died after trying to seek help, according to the company's …
"A firewall" as in meaning a singular device and "a firewall" as in meaning the network definition of a firewall.
They may very well (and I would bet money that they do) have redundant hardware (whether it is active/active with multiple paths or active/passive) and updated first one, then the other as part of this update. You really trust a press release to give you detailed enough information to deduce topology information about their internal network?
We also don't know what the procedure was that wasn't followed. Did they bypass some change control procedures they normally have in place? Did they not do it during a defined maintenance window? Did they not follow some procedure about notification and testing? Were they supposed to upgrade only part of an active/active environment, so that if they identified an "intermittent" problem they would know their update had some unintended effect?
One possibility is that 000 calls look a little different and are handled differently to normal calls. For example, there could be location information which isn't normally there. The way the calls are routed (e.g. speak to THAT endpoint for those calls) is also slightly different.
Firewalls can be finicky, but firewalls handling voice traffic that is not your bog-standard arrangement (and inside a carrier, this counts as non-standard) are diabolical.
Could a firewall block certain calls and not others? You bet it could. They fail in new and exciting and definitely unpredictable ways.
That being said, it could also have just been a typo'ed config line.
You might as well be asking "how could connections to a single network port be blocked, while connections to other ports get through just fine?". Yes, I know phone numbers are not network ports, but unless you've worked in a telco networking environment you don't know how calls are routed and what sort of information is available for blocking or routing.
I would guess that they use information about what number is being called to route calls differently, so a regular call goes one place, perhaps a toll free call goes another, "911" goes to another - because clearly it is handled very differently than typical calls. If the 911 calls got routed to the wrong place or had the wrong flags set or whatever, the router they were sent to might have dropped them on the floor as invalid.
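To make that concrete, here's a toy sketch (entirely hypothetical names and routing, nothing to do with Optus's real network) of number-class routing where only the emergency path is broken, so normal calls sail through while 000 fails:

```python
# Hypothetical sketch: calls are routed by dialled-number class, and a bad
# entry on the emergency path fails only emergency calls.

ROUTES = {
    "standard":  "regional-voice-core",   # made-up next-hop names
    "toll_free": "tollfree-gateway",
    "emergency": None,                    # broken entry left by the upgrade
}

def classify(dialled: str) -> str:
    """Pick a route class from the dialled digits (simplified)."""
    if dialled in ("000", "112"):
        return "emergency"
    if dialled.startswith("1800"):
        return "toll_free"
    return "standard"

def route_call(dialled: str) -> str:
    next_hop = ROUTES[classify(dialled)]
    if next_hop is None:
        # Ordinary traffic never takes this branch, so the fault stays
        # invisible until someone actually dials an emergency number.
        raise RuntimeError(f"no route configured for {dialled!r}")
    return next_hop

print(route_call("0299991234"))        # routes normally
try:
    route_call("000")                  # only the emergency class fails
except RuntimeError as exc:
    print("dropped:", exc)
```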
Sorry I can't answer authoritatively. But there is a special process for 000 calls. For example, telcos have to forward them, even if your SIM has no coverage!! For example, I used to live in a remote part of Western Australia where my Vodafone coverage was patchy and seemed to depend on a base station that was installed for a construction project and never decommissioned (Vodafone denied that they provided service in my area, despite me having 3 bars in my backyard). Although I had a Vodafone SIM, if I was out of town where there was no Vodafone service, a 000 call would be picked up by Telstra and connected. Long-winded way of saying there must be a special and separate protocol for 000 calls.
I have no idea what happened here, but it is easy to imagine scenarios. Modern phone networks use protocols where the signalling tells the network what sort of call it is - not the actual number. Signalling is provided by end user equipment (dumb phones, mobile phones, VoIP, etc) in many different forms and using different protocols, and is validated and heavily firewalled on entry to the telco network (for example, so that people can't pretend to be another operator delivering a call to avoid being charged or traced). Pretending your call is an emergency call might be used by hackers to avoid charges or to cause a denial of service attack, for example. So equipment and firewalls apply all sorts of validations.
I could easily imagine that if a link had been incorrectly marked as "emergency calls are never carried on this link" a firewall might reject the call. Or if a software upgrade to a firewall broke the configuration somehow I could easily imagine this failure.
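A minimal sketch of that first scenario, assuming invented field and flag names rather than any vendor's real configuration: per-link provisioning says whether emergency-marked signalling is allowed, and a bad value rejects exactly the calls that matter most:

```python
# Purely illustrative; the flag and field names below are made up.
from dataclasses import dataclass

@dataclass
class LinkConfig:
    name: str
    carries_emergency: bool  # hypothetical per-link provisioning flag

@dataclass
class CallSetup:
    calling: str
    called: str
    is_emergency: bool  # set by the originating network's signalling

def admit(call: CallSetup, link: LinkConfig) -> bool:
    """Validate inbound signalling the way a carrier-edge firewall might."""
    if call.is_emergency and not link.carries_emergency:
        # The 'safe' default is to reject anything claiming to be an
        # emergency call on a link not provisioned for it -- which is
        # exactly wrong if the provisioning data itself is broken.
        return False
    return True

link = LinkConfig("core-link-7", carries_emergency=False)  # post-upgrade state
print(admit(CallSetup("0412345678", "0298765432", False), link))  # True
print(admit(CallSetup("0412345678", "000", True), link))          # False
```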
Of course, with hindsight, there should have been (i) proper testing, (ii) high priority alarms generated when rejecting validation for calls claiming to be emergency calls, (iii) proper capture and very rapid escalation of the call centre reports of emergency calls failing.
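For point (ii), a rough sketch of the intent, with a hypothetical paging hook: a single rejected call that claims to be an emergency call is treated as an incident in its own right, not folded into a generic drop counter:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("signalling-validation")

EMERGENCY_NUMBERS = {"000", "112"}

def on_call_rejected(called: str, reason: str) -> None:
    """Called whenever inbound signalling validation rejects a call setup."""
    if called in EMERGENCY_NUMBERS:
        # One rejected 000 call is already an incident, not a statistic.
        log.critical("emergency call rejected: called=%s reason=%s", called, reason)
        page_on_call_engineer(f"000 rejection: {reason}")
    else:
        log.warning("call rejected: called=%s reason=%s", called, reason)

def page_on_call_engineer(message: str) -> None:
    # Placeholder for whatever paging / escalation system is actually in use.
    print(f"PAGE: {message}")

on_call_rejected("000", "link not provisioned for emergency traffic")
```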
Second time? There have been numerous high-impact issues and outages within Optus over recent years.
IIRC they were all due to fat fingers or insufficient testing.
I wonder if this was an Optus or Nokia person on the hook for this one.
From speaking to people in vendor land, apparently neither Optus nor Nokia would stump up the cash to automate this stuff properly.
Optus's view was that Nokia should pay - they are paid for an outcome.
Nokia's view was that there wasn't enough fat in the contract to cover it.
Rinse and repeat a few times where the impact was inconvenient or commercially costly.
Now people are dying due to culturally ingrained incompetence.
> In a Sunday statement Rue said the company is speaking to staff who performed the upgrade to understand why they did not follow procedures.
Reads like: "Oh sure, dear public, OF COURSE we have safe procedures that do not allow things like this to happen, IF ONLY these pesky engineers would do what we told them to do. It was them, not us"
A statement like this, accusing employees of being personally at fault for causing death or injury (and at least hinting at the corporation bearing no responsibility), should be issued, at the earliest, after a thorough investigation establishing that fact.
Issuing this statement a few hours after an incident, on a Sunday, is nothing more than a despicable form of preemptive corporate damage control: a corporation throwing its employees under the bus ahead of any real investigation in order to deflect responsibility.
Pathetic.
Came here to say the same. You can't start a proper corrective action and improvement review based on the assumption that your procedures are correct and therefore the staff must be at fault for not following them. If they are going into the review with prejudices, then "arrogant management" might also be a better start than "staff didn't follow procedures."
Irrespective of whether it was a case of staff not following procedures, a company like this, with a critical public service function, needs to have a public service ethos, for want of a better term, that ensures everyone is aware of the consequences of their actions. Such an ethos starts at the top. If a thoughtless action at any level was the cause, that is ultimately due to the attitude of the board and senior management.
They are likely operating under a quality regime (ISO27nnn etc.) that mandates auditing of compliance with procedures, so they can't just deflect blame onto employees if they never set up a process to audit compliance. Company and management are still to blame.
Back in the day when I was playing with the UK PSTN, emergency calls were of paramount importance and were treated as such by any entity with a stake in the 'emergency' process. Things got tested to death during our development/test, and doubly so pre- and post- each upgrade. We all took this seriously: my lot as exchange/system manufacturers, our customers as exchange/network operators, and the industry/country as a whole.
In those days, we had very detailed call records with a Call Termination Reason field (Records were generally used for billing, but they were also very useful telemetry for us!). Standard procedure during our automated testing was to 'grep' call records for emergency calls and examine these CTRs in detail and investigate if anything other than a 'normal' call termination occurred. Us testing types were allowed to be far more creative and to take longer with the '999' tests than with any other feature.
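A rough modern-day equivalent of that step, sketched with invented CDR field names (the real records were obviously not CSV files): pull every emergency call out of a batch of records and flag any Call Termination Reason that isn't a normal cleardown:

```python
import csv

EMERGENCY_NUMBERS = {"999", "112"}
NORMAL_CTRS = {"normal_clearing", "caller_cleared"}  # assumed CTR labels

def suspicious_emergency_calls(cdr_path: str):
    """Yield emergency-call records whose termination reason looks abnormal."""
    with open(cdr_path, newline="") as f:
        for record in csv.DictReader(f):
            if record["called_number"] in EMERGENCY_NUMBERS \
                    and record["termination_reason"] not in NORMAL_CTRS:
                yield record

# Usage: anything this prints gets investigated before the upgrade is
# declared good.
# for r in suspicious_emergency_calls("post_upgrade_cdrs.csv"):
#     print(r["call_id"], r["termination_reason"])
```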
Admittedly, it was easier to do then, as things were simpler, moved slower, and people generally gave a shit, instead of treating things as an exercise without any real-world consequences for other people.
Standard procedure both during and after any real-world upgrade included a period of monitoring (Statistically and from human feedback) to catch any outlying 'edge' cases and - here's the rub - explicitly check the operation of the '999' system. In fact, there always was a 'background task' running to monitor the performance of emergency calls. There probably still is :-)
Reading the article shows someone in Optus-land got it wrong. They don't have the monopoly on cockups in this area, see this. The scale of these two seems similar to the Optus hiccup, and the causes (Human error, procedural error) look similar also.
Why did I choose "Commitment required" as the title? Because you MUST be committed to avoiding stuff like this. IMO, there are way too many technological fuckups - ones with either actual or potential real-world serious consequences for others, that is - that could be avoided if people gave more of a shit and accepted that giving more of a shit in the important areas isn't a financial liability. There are times when it's more important to be altruistic than to make a monetary profit for yourself or others. Allowing people to reliably call for help is one of them.
Here endeth this too-verbose rant.
You're absolutely correct, and in general BT in those days did do proper testing.
Even so, glitches occasionally happened, especially for non-essential services. One that comes to mind was the Directory Enquiries (DQ) service. One area was getting complaints that some people calling DQ were left hanging on a ringing phone for very long periods (no fancy 'you are position X in the queue' then) without having any idea how long they would have to wait. A rule was made that no-one should wait more than 5 minutes, and this was implemented by monitoring the queues and if they got close to 5 minutes new callers were given a busy tone, rather than being left in the queue. This was assumed to affect only a few callers, and those who got ring tone would know that they had no more than 5 minutes wait ahead. Others could try again later, without an indefinite wait on hold.
Needless to say this all looked great, metrics showed that no-one queued for more than 5 minutes (obviously!), pats on the back all round. Until some months later someone looked at the call records, to discover that the load had been waayyy underestimated, and that 90% of calls to DQ in that area were being given busy tone. Ooops.
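A toy simulation of that blind spot, with made-up numbers: when callers are turned away with busy tone as soon as the projected wait nears five minutes, the 'longest wait' metric stays perfect even while most callers never get served at all:

```python
import random

random.seed(1)
MAX_WAIT_MIN = 5
SERVICE_RATE_PER_MIN = 2           # operators can answer 2 calls a minute
queue = 0
answered = rejected = 0

for minute in range(480):          # an 8-hour shift
    arrivals = random.randint(5, 15)          # far more demand than capacity
    for _ in range(arrivals):
        projected_wait = queue / SERVICE_RATE_PER_MIN
        if projected_wait >= MAX_WAIT_MIN:
            rejected += 1          # busy tone -- never appears in wait stats
        else:
            queue += 1
    served = min(queue, SERVICE_RATE_PER_MIN)
    answered += served
    queue -= served

print(f"answered={answered} rejected={rejected} "
      f"({rejected / (answered + rejected):.0%} got busy tone)")
```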
"Here endeth this too-verbose rant."
Not at all too verbose. It's all about what I termed in another comment "a public service ethos".
One thing stands out in the example you linked - difficulties switching to a DR system. The things that people tend to do badly are those they do seldom, simply because of lack of practice. DR rehearsals are important: initially to introduce the plan to reality so that the plans, rather than reality, can be changed; secondly so that documentation can be filled out by recording what's actually done; and finally to give practice to those who will have to carry them out.
Optus have blocked 000 before; there was a major failure to abide by the telco law in 2023. They clearly have their emergency spinners out and working hard: it was reported consistently on ABC as being a 'technical fault', as if checking stuff, having procedures and following them, etc., were not management responsibilities. Clearly a proactive attempt to shovel the blame onto some poor techie so the management can go away scot-free, once again, with bonuses intact. By the way, several of the people who could not get through to 000 rang Optus's call centre. However, being an outsourced overseas operation, they thought nothing of it and just logged the calls, leaving Optus none the wiser. As you might be able to tell from the tone of my comment, I am annoyed.
I worked in the Australian telecoms industry back in the mid-to-late nineties. Optus were a bunch of cunts to deal with then. Sadly some things never change. Telstra aren't exactly saints either. Horrible industry as a whole.
Back then Optus was Cable&Wireless and full of the very worst US corporate crap which also seemed to have infected Telstra (remember Trujillo and his three amigos?)
Unfortunately nothing in my experience has changed, except for the worse, since Singtel acquired Optus long ago, but Telstra does seem to have regained a little customer focus.
A lot of the nonsense around unsupported 4G phones when the 3G network was switched off was due to the mandatory VoLTE support on which 000 calls depended.
Previously 112 worked fine even without a SIM in the phone, but I am not sure whether that's still the case, as I suspect it was a GSM/3G feature.
112 hasn't worked without a SIM for many many years as it was vulnerable to crank callers. However, you don't need credit or an active 'account', and I believe it works even if the IMEI is blacklisted. Presumably the emergency call operator can blacklist "frequent flyers" of the service (or potentially schedule a visit by two nice officers who can relieve the individual of their freedom).
> He also vowed to implement an escalation process for any reports of problems with calls to Triple Zero.
Surely this is top-of-the-checklist stuff: even if the rest of the network shits the bed, the emergency calls should still be working.
This statement just shows it for the tacked-on afterthought it is with Optus.
This is where an independent third-party review and audit of the process would be useful. Mistakes get made, and it's terrible that in this instance lives were lost; you figure out where the mistakes were made, update the process, and at least ensure the same mistake doesn't get made again. If it turns out to be an honest mistake - say it was a unique edge case no-one ever even thought of, the testing didn't trap that scenario and, for the sake of argument, it was humanly unforeseeable - then, like most rules written in blood, you update the process, add it to the "thou shall not" list and ensure it is followed.
However, if the mistake occurred because testing was limited due to time/budget constraints, or someone took a gamble that it would be fine, or some other foreseeable scenario, then severe punishments should follow. Not to punish the organization that caused this particular incident, but to push it up the risk priorities of other companies. If you want to stop these types of issues recurring, the risk of sufficient pain, be it monetary, reputational or even being put out of business, concentrates minds at the right level to ensure sufficient attention is paid to the problem. Pointing the finger at some lowly tech or even a middle-management person has about zero impact on the company from a commercial perspective.
I called up the cop shoppe on the non-emergency number this one time, to report some asswipes driving golf balls off the ridge. (They thought they could hit across the river and land on the driving range over there, which charges $7.50 per bucket of balls, and gives you the balls, but these jokers had some stolen ones or whatever and were trying to return them, but were so incompetent they could not even reach the river! And were menacing the hikers walking beneath the ridge!)
So while explaining all this, I casually ask “oh and by the way, what’s the phone number for 911” to which they snorted, and said (as expected) well yeah that would be 911. So I explained to them, well no, I don’t use the phone anymore, I have the freephoneline.ca which is freeware voip, plus I have the “tablet plan” on the cellular, which has no sms or voice service, and all I got is free voip and they don’t have the 911, so can you tell me the phone number for 911? And the cop says, well there is no phone number only 911. So I said, suppose you phone “out” using the 911 phone, and it registers on someone’s call display, what number does that say? And he says he’s not allowed to say, and I say ok what does that mean, is it a secret? And he says no no, there’s no secret, but nobody is allowed to know, and I say well why is that, and he says they don’t want anybody calling that number, and I say what’s the difference if the phone rings either way, and he says they are not allowed to disclose the 911 phone number, and I say what if I come and do industrial espionage and steal the 911 phone number, and he said good thing I’m calling the non-emergency phone line because stealing the 911 phone number is probably a crime, and I’m like ???
Meanwhile the golfers walked away, so I’m like, ah Forget The Whole Thing This Conversation Never Happened and he’s like, and what’s your number?
But I don’t have phone service anymore!