So....restore from backup
The issue is with the docs.
CTO is a knob. And why is this even an issue? They obviously have backups, don't they....
"How screwed am I?" a new starter asked Reddit after claiming they'd been marched out of their job by their employer's CTO after "destroying" the production DB – and had been told "legal" would soon get stuck in. The luckless junior software developer told the computer science career questions forum: I was basically given a …
Only this shouldn't have been a backup issue...
A common-sense strategy would have been to use live backups (given that this was what they were doing, I'm assuming there's nothing in prod that would lead to regulatory/privacy/other issues that would dictate NOT using prod data) and read-only accounts to restore the production DB into test. Script the whole thing, run it daily with a few simple checks on restore size, timestamps and simple query results, e-mail the results to the DB team, and you save the company the hassle of discovering that their restore process stopped working correctly months ago.
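For what it's worth, a minimal sketch of the sort of nightly restore-and-verify job I mean, assuming a Postgres-style setup; the hosts, paths, table name and thresholds are all invented for illustration:

    #!/usr/bin/env python3
    """Nightly restore-and-verify sketch (hypothetical paths/names/thresholds).

    Restores last night's production backup into a throwaway test database,
    runs a few cheap sanity checks, and mails the result to the DB team.
    """
    import subprocess, smtplib, os, datetime
    from email.message import EmailMessage

    BACKUP = "/backups/prod_latest.dump"        # assumed backup location
    TEST_DB = "restore_check"                   # throwaway DB on the TEST server
    MIN_SIZE_BYTES = 500 * 1024 * 1024          # alert if the dump is suspiciously small
    DB_TEAM = "db-team@example.com"

    def run(cmd):
        return subprocess.run(cmd, capture_output=True, text=True)

    def check_restore():
        problems = []

        # 1. Is the backup file recent and big enough?
        st = os.stat(BACKUP)
        age_hours = (datetime.datetime.now().timestamp() - st.st_mtime) / 3600
        if age_hours > 26:
            problems.append(f"backup is {age_hours:.0f}h old")
        if st.st_size < MIN_SIZE_BYTES:
            problems.append(f"backup is only {st.st_size} bytes")

        # 2. Does it actually restore? (drop/recreate a scratch DB on the test host)
        run(["dropdb", "--if-exists", TEST_DB])
        run(["createdb", TEST_DB])
        restore = run(["pg_restore", "--no-owner", "-d", TEST_DB, BACKUP])
        if restore.returncode != 0:
            problems.append("pg_restore failed: " + restore.stderr[-500:])

        # 3. A couple of cheap sanity queries (table name is a placeholder).
        rows = run(["psql", "-tA", TEST_DB, "-c", "SELECT count(*) FROM customers"])
        if restore.returncode == 0 and (rows.returncode != 0 or int(rows.stdout.strip() or 0) == 0):
            problems.append("sanity query returned no rows")

        return problems

    def mail(subject, body):
        msg = EmailMessage()
        msg["From"], msg["To"], msg["Subject"] = "restore-check@example.com", DB_TEAM, subject
        msg.set_content(body)
        with smtplib.SMTP("localhost") as s:
            s.send_message(msg)

    if __name__ == "__main__":
        problems = check_restore()
        if problems:
            mail("RESTORE CHECK FAILED", "\n".join(problems))
        else:
            mail("Restore check OK", "Backup restored and passed sanity checks.")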
Imagine - testing your backups AND creating sensible procedures AND showing new starters sensible ways to do things from day one. And if the backups failed to restore as expected, fix the issue...
Compare this to "break it, discover backups aren't working and fire people" - and, I would wager, making the same mistakes again in the future...
This seems to be an all too common problem. Too many small companies say they can't afford backups and backup validation; I guess their data isn't all that important then.
No, nothing to do with how valuable data is. It's to do with how valuable time is.
Take a small IT firm with 2 staff, both working 16 (and sometimes more) hours per day to get the business built up. Large amount of work starting to come in as well, but not enough work/income-over-costs to hire someone else.
If you get jobs out, you eat. If you don't get jobs out, you don't eat. Or don't pay the rent, or power, or..
Verifying that backups can be restored is a pretty long process that ties up resources. Small firms have a hard time buying extra kit, let alone having a spare machine they can put on such "unimportant" stuff to actually verify they can re-image from a recent copy. And if your backup is data only (plus licences etc.), then re-installing a machine from scratch can take a couple of hours (*nix) or several days (you-know-which-OS), especially where updates are involved. And in either case, there can be a lot of time involved in making sure the task is completed properly and everything is back. Just comparing numbers of files and bytes can be a good start, and only takes a few minutes, but they don't always match even in a perfect restore (say the work machine has an extra temp file the backup image doesn't).
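That files-and-bytes check really is only a few minutes to script; a rough sketch, with the paths being placeholders:

    #!/usr/bin/env python3
    """Quick-and-dirty backup comparison: file counts and total bytes.

    Not proof the restore is good, but a cheap first check. Paths are placeholders.
    """
    import os

    def tally(root):
        count = total = 0
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                try:
                    total += os.path.getsize(os.path.join(dirpath, name))
                    count += 1
                except OSError:
                    pass   # file vanished mid-walk (temp files etc.)
        return count, total

    live_count, live_bytes = tally("/srv/data")          # the working copy
    back_count, back_bytes = tally("/mnt/restore_test")  # the restored copy

    print(f"live   : {live_count} files, {live_bytes} bytes")
    print(f"restore: {back_count} files, {back_bytes} bytes")

    # Allow a little slack - as noted above, a perfect restore can still differ
    # by the odd temp file that existed on the live machine but not in the image.
    if abs(live_count - back_count) > 10 or abs(live_bytes - back_bytes) > 50 * 1024 * 1024:
        print("WARNING: live and restored copies differ more than expected")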
Backups take a lot of time to check properly. When I did them, I set up a system that worked on the first test (and pulled some all-nighters getting stuff set up and testing it), worked on the week-later test, and then we went to monthly checks. Not proper testing procedure, I know, but I didn't have the time to be sure that everything worked perfectly. I could do that, or I could earn money. Small and new businesses have to put every moment possible into earning, because the moment you're forced to close the doors for the last time, any backups or other data instantly become worthless. Unless you did some sellable dev stuff.
PS El Reg - love that voting no longer requires a page reload! Thanks for changing this!
".....This was a cockup waiting to happen." First rule of process design - always assume at least one user will do something wrong if you give him/her the means to do so, do not assume they will be able to over-ride the impulse to follow an incorrect direction. The management were responsible for the process documentation, therefore, IMHO, they were responsible for the failure.
".... sounds like the CTO needs to visit example.com"
That might not help as much as you think if documentation elsewhere for adding new users also uses "example.com". I.e. it might already exist...
Rule #1 about documentation is that someone somewhere will blindly follow any code or command snippets that are included.
"The issue is with the docs."
Absolutely. Firstly, who wrote the documentation and thought it was ok to include the production password?
If it was a technical author, they could be excused for assuming it was not a real production password. In which case, who gave it to them? This password should be handed out on a need-to-know basis.
If it was one of the developers, what kind of moron are they? Even if they have received no training at all on basic security, you'd have to be an idiot to think it was ok to record a real password in clear text on documentation.
Secondly, what are the production details doing on a document used to set up a dev environment anyway? A dev, particularly a novice one, shouldn't be anywhere near production, simply because they have no need to be.
In fact, fire everyone. The whole affair suggests rampant sloppy practice everywhere. The fact their backup restores failed just confirms that.
But I'm pretty sure which ones weren't.
> you'd have to be an idiot to think it was ok to record a real password in clear text on documentation.
It wasn't the large multinational where I cut my teeth, where no passwords were *ever* written down or even spoken. And it wasn't the one where I am part of manglement now, where our code of conduct says loud and clear that no one will be disciplined for making honest mistakes¹, and we do abide by that.
¹ Our mistakes have a very real potential to kill people. Going around pointing fingers is not going to help anyone. Making sure that mistakes are open and candidly reported, and fixed promptly, is a far better use of everyone's time, and a humbling experience.
Nobody but the DBAs in charge of the production database should have had them. And each of them should have had separate accounts (because of auditing, etc.). Database owner credentials should be used only for database/schema maintenance - after the scripts have been fully tested (and at least a backup is available). They should never be used for other tasks. Any application accessing the database must have at least one separate account with only the required privileges on the required objects (which usually don't include DDL statements).
All-powerful accounts (i.e. Oracle's SYS) should be used only when their privileges are absolutely necessary, and only by DBAs who know what they're doing. Very few developers, if any, should have privileged access to the production system, and they must coordinate with DBAs if they need to operate on a production system for maintenance or diagnostic needs. DBAs should review and supervise those tasks.
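To put it concretely, here's a sketch of what provisioning that kind of limited application account could look like - PostgreSQL-flavoured SQL driven from Python; the role, database and table names are made up for illustration, and a real DBA would adapt it to their own engine:

    """Sketch: a DBA provisioning a least-privilege application account.

    PostgreSQL-flavoured SQL via psycopg2; role/table names are invented. The
    point: the app account gets DML on specific tables only - no DDL, no
    ownership, and certainly not the superuser credentials.
    """
    import psycopg2

    DDL_AS_DBA = [
        "CREATE ROLE app_orders LOGIN PASSWORD %s",
        "GRANT CONNECT ON DATABASE shop TO app_orders",
        "GRANT USAGE ON SCHEMA public TO app_orders",
        # Only the tables the application actually needs, only the verbs it needs.
        "GRANT SELECT, INSERT, UPDATE ON orders, order_lines TO app_orders",
        "GRANT SELECT ON products TO app_orders",
        # No CREATE/DROP/ALTER anywhere - schema changes go through the DBA.
    ]

    def provision(dba_dsn, app_password):
        with psycopg2.connect(dba_dsn) as conn, conn.cursor() as cur:
            for stmt in DDL_AS_DBA:
                cur.execute(stmt, (app_password,) if "%s" in stmt else None)

    if __name__ == "__main__":
        # The DBA runs this once, against the right environment, after review.
        provision("dbname=shop user=dba_alice", app_password="change-me")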
But I guess they followed the moronic script I've seen over and over (and stopping the bad habits usually encounters a lot of resistance):
1) Developers are given a privileged account when the application is first created, so they can create stuff without bothering the DBA.
2) Developers write an application that runs only with a privileged account, and doesn't allow separate database users (granting permissions to other users is boring stuff, as is writing stored procedures to control access to things). DBAs are happy because they have little to do.
3) The privileged account credentials are stored everywhere, including the application configuration files, unencrypted, so everybody knows them
4) The development system becomes the production system. DBAs don't change anything for fear something will break.
5) Developers still retain access to the production system "to fix things quickly", and may even run tests in production "because replicating it is unfeasible".
6) DBAs are happy developers can work on their own so they aren't bothered.
7) Now that you have offloaded most tasks to developers, why care about backups? Developers will have a copy, for sure!
8) Then comes a fat-fingered, careless new developer....
Let me bring my IT expertise into play here. You're all wrong of course!
Obviously database passwords shouldn't be written in public documents. They should be kept on a post-it note on the screen of one of the shared PCs.
What's wrong with you people for not knowing this basic piece of security best-practice!
Apart from using 'careless' instead of 'only human' new developer ...
I can report similar (some years back now) - new member of the ops staff was instructed to setup a test database, mistyped the DB URL by one character, and because of the risible beyond belief setup of i) dev can see prod network, ii) dev hostnames differ by one character from prod hostnames, and iii) dev, test and prod DBs all shared the same passwords and user names, they wiped the entire production database by running the schema creation scripts.
The impact - this was a bank (no names but North American and generally an IT train wreck at the time), the DB data loss took out the SWIFT gateway, the bank ceased to be a bank for about six hours, and they had to notify the regulator of a serious IT incident. The length of time was due to the backups being inaccessible and no-one having a clue how to restore.
And FWIW, we'd already advised the bank that the one character difference in hostnames on an open, flat network was a suicide pact six months before.
On the plus side, the irascible, voluble Operations Manager took one look at the root cause, said it wasn't the operator's fault and went off to shout at someone on the program management team. Much respect for that move.
Where's the segregated VLAN? Anywhere with such important data and of such a size should be capable of setting up an environment where dev network logon credentials only work on the dev VLAN and so do not permit the crossing over into the production VLAN whether you know the prod db connection string or not. One account for troubleshooting prod environments (which they wouldn't have had in this case), and one for performing dev tasks. Not that difficult.
This happened on my watch in Prod Support for a certain oil company.
The dev/prod Oracle databases had the same internal password. Luckless dev connected to what they believed was their dev instance in Toad and then dropped the schema.
They then called us to advise what had happened once they realised their mistake. We restored from backup (not before they bitched about how long it took to recover) but they never should have been able to do so in the first place.
Rule Number 1: Restrict access properly to your prod environment. That means no recording random username/password combos in docs, scripts etc.
Rule Number 2: Don't share credentials across environments.
Anon for obvious reasons.
> Rule Number 1: Restrict access properly to your prod environment. That means no recording random username/password combos in docs, scripts etc.
> Rule Number 2: Don't share credentials across environments.
Rule 3 - never lets Devs have access to stuff. Like small children, they *will* break stuff..
> Rule 3 - never lets Devs have access to stuff. Like small children, they *will* break stuff..
In a controlled environment this is good. If the Dev breaks something before the users do, it can be patched to prevent anyone else from doing the same. My unofficial CV has 'breaking things' as a skill, which dates back to using BASIC at school and being unable to resist entering '6' when asked to enter a number from 1 to 5, just to see what happened.
My CV also includes breaking things, but I was one to always stick to the rules.
> being unable to resist entering '6' when asked to enter a number from 1 to 5, just to see what happened.
So when asked to enter a number from 1 to 5 I broke into the school in the dead of the night.
"Rule 3 - never lets Devs have access to stuff. Like small children, they *will* break stuff.."
I cannot upvote this enough. Stupid t**ts that I had the misfortune of dealing with would have had a breakdown keeping the systems running. "I want you to deploy xxxx." OK, who to? Oh, everyone in the organization. OK, how? "Oh, this installation program that I have written - you just need to install a, uninstall a, then install b and c and remove b; that will leave a stable version running on their machines."
So... I'm not going to implement it until I get a firm written procedure for how to deploy this. We also need to test it in a small department. I could have fixed it in a couple of hours, but this wasn't the first time I'd had to put up with them having no idea about dependencies, and I had already wasted four or five days fixing similar messes of theirs - they were on better money too. I never got any recognition, 'cause little ever went wrong.
I was so happy when they chose the planning department to trial it on. (This department had 'skilled' staff who liked to set up their own 'uncontrolled' infrastructure.) I had a good 3.5 months of peace, until guilt at all the wasted man-hours finally got to me, and I sorted it that afternoon.
Then there is the always fun "Yes we do backup database transactions, here is your database backup, and here are your transaction logs, now you buggered up the database, rolled back and forth until you have a confused mess, so which transaction sets do you want to roll forward with?" - (Gives me a warm fuzzy feeling of joy remembering those looks of dawning horror)
Devs really are like small kids, they have NO idea of consequences, and will run away given a challenge.
You are spot on...
You need to separate prod from dev and not allow dev access to prod.
So while the Reg asks who should be fired ... there are several people involved.
1) CTO / CIO for not managing the infrastructure properly because there was no wall between dev and prod.
2) The author and owner of the doc. You should never have actual passwords, system names, etc. in a written doc that gets distributed around the company. The manager should also get a talking-to...
3) The developer himself.
Sure, he's new to the job; however, he should have been experienced enough not to follow the instructions cookbook-style, and should have made sure that they were correct. He was the one who actually did the damage.
As to getting legal involved.... if this happened in the US... it wouldn't go anywhere. As an employee, he's covered and the worst they could do was fire him. If he were a contractor... he could get sued.
We know this wasn't Scotland or the UK. (Across country? Edinburgh to Glasgow ... 40 min drive. )
I do have some sympathy for the guy... however, he should have known to ask questions if things weren't clear in the instructions.
He should chalk this up to a life lesson and be glad that his mistake didn't cost someone their life.
I maintain a small file server for a small company (about 45 computers). I use Ubuntu with Samba as the server. I have another desktop running Ubuntu with Samba on the ready and use rsync to provide a nightly copy of files from the server to this system which is a live copy of all the files served by the primary server. If the main server were to go down or hit with a virus (or ransomware), all I have to do is take the primary file server off-line, run a script on the system with the live data from the previous day, and change the IP address to make the backup system a temporary server.
Another Ubuntu system provides nightly backups, using Back In Time, for off-site storage on portable USB hard drives.
My philosophy is you can never have too many backups, so the primary server also makes hourly backups during business hours that are retained for 2 days. The backups are then pared down to one copy per day for 14 days, then one a week for a couple of months, then one copy a month until drive space runs low. The drive is then removed from service to retain historical data and a new drive put in service. This set is for on-site.
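That thinning-out schedule is easy enough to script; a rough sketch of the idea, assuming one backup directory per run named YYYY-MM-DD_HHMM (the path and exact windows are just placeholders):

    #!/usr/bin/env python3
    """Backup retention thinning, roughly matching the schedule above: keep
    everything for 2 days, one per day for 14 days, one per week for ~2 months,
    one per month after that (monthly ones stay until the drive is rotated out).
    Assumes one backup directory per run named YYYY-MM-DD_HHMM under BACKUP_ROOT.
    """
    import os, shutil, datetime
    from collections import defaultdict

    BACKUP_ROOT = "/mnt/backup"          # placeholder path

    def backup_stamp(name):
        # directories assumed to be named YYYY-MM-DD_HHMM
        return datetime.datetime.strptime(name[:15], "%Y-%m-%d_%H%M")

    now = datetime.datetime.now()
    backups = []
    for name in os.listdir(BACKUP_ROOT):
        try:
            backups.append((backup_stamp(name), name))
        except ValueError:
            pass                                         # not a backup directory

    # Bucket each backup by the period we want to keep one of.
    buckets = defaultdict(list)
    for stamp, name in backups:
        age = (now - stamp).days
        if age <= 2:
            key = ("all", name)                          # unique key -> keep them all
        elif age <= 14:
            key = ("day", stamp.date())
        elif age <= 60:
            key = ("week", tuple(stamp.isocalendar()[:2]))
        else:
            key = ("month", (stamp.year, stamp.month))
        buckets[key].append((stamp, name))

    keep = {max(group)[1] for group in buckets.values()}  # newest in each bucket
    for stamp, name in backups:
        if name not in keep:
            shutil.rmtree(os.path.join(BACKUP_ROOT, name))
            print("pruned", name)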
The system makes it easy to restore anywhere from one to all files in a short period of time. There is currently about 90GB of user data on the server. To restore one file takes only a couple minutes to find the file from the desired backup and restore it. A full restore of user data takes about an hour.
The system is regularly tested, and a full restore had to be performed once when the Cryptolocker ransomware encrypted files the victim had access to on her computer and the server. More time was spent ensuring the ransomware had been isolated and eliminated on all computers on the network than getting the server ready.
While some may consider triple redundancy overkill, I like to be prepared in case one of the backup systems happened to fail the night before a restore is needed on the server. There is always at least one backup drive stored off-site. In the case of catastrophic loss of the server (flood, fire, explosion, etc.), server configuration files in the nightly backups make it easy to set up and configure a new server in about 2 hours, ready to have files restored.
Test and test often...
"Yet it's your fault that guide is wrong?"
Actually, the story quite clearly says he *didn't* follow the document. He should have used the credentials generated by the script; instead he copied the credentials from the document.
The fact that the document contained the production credentials is an error, but the implication is that he would have been fine had he used the generated ones.
Of course, he should have been supervised, so that's also an error. Should he have been sacked? Probably not, and certainly not in the way he was.
In the original story, the teller did not precisely follow the "document's" instructions. The script he was supposed to run would generate codes that he was supposed to use. Instead he used the codes typed into the "document." I would bet the teller actually read the script and, rather than waste a chance to know more about the system, simply "tried to follow the logic" of the script. He would learn about the system by doing the script's tasks by hand. But, presumably, he did not think through the privileges that the script would need to set up his account. Or worse, he did (as an amateur, would-be BOFH), but had not properly counted on the laziness of upper levels, or anticipated the linkage to the production system, which should not have existed.
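For what it's worth, the sort of setup script the document should have pointed him at would generate fresh local credentials itself, so the doc never needs to contain any. A hypothetical sketch (all names and paths invented):

    #!/usr/bin/env python3
    """Sketch of a dev-environment setup script that *generates* the credentials
    it needs, rather than the doc listing any. Everything here (DB names, paths)
    is hypothetical - the point is that the document only needs to say
    "run setup_dev_env.py", with no credentials printed in it at all.
    """
    import secrets, subprocess, pathlib

    DEV_DB = "devdb_local"
    DEV_USER = "dev_" + secrets.token_hex(4)
    DEV_PASS = secrets.token_urlsafe(16)

    # Create a local database and user - note: localhost only, nothing here can
    # reach production even if mistyped.
    subprocess.run(["createdb", "-h", "localhost", DEV_DB], check=True)
    subprocess.run(["psql", "-h", "localhost", "-d", DEV_DB, "-c",
                    f"CREATE USER {DEV_USER} WITH PASSWORD '{DEV_PASS}'"], check=True)

    # Write the generated credentials to the dev's own config, never to shared docs.
    cfg = pathlib.Path.home() / ".myapp_dev.env"
    cfg.write_text(f"DB_HOST=localhost\nDB_NAME={DEV_DB}\nDB_USER={DEV_USER}\nDB_PASS={DEV_PASS}\n")
    cfg.chmod(0o600)
    print(f"Dev environment ready; credentials written to {cfg}")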
Least-privilege security is there to protect you from yourself as well as from more malign threats.
Even in the good ol' days, when my role was almost entirely performed with admin rights, I had a healthy dose of "don't give myself the option to mess with production if I can help it".
As for giving a newbie dev production access, with production examples, and clearly no supervision - that was unthinkable even then. Shoulder-surfing their activity, answering questions, and perhaps even setting up their environment for them avoids a lot of unnecessary upset and complexity...
IMHO the story is highly plausible even if not true, and the dev was the fall guy to divert attention from sackable CTO stupidity.
Since this should be a controlled document, it's not (just) the author but whoever signed the document off. Authors can make mistakes (likely a cut-and-paste from their production screen), but it should get spotted when the document is approved for publication, which should have multiple signers to ensure technical correctness, that it contains no "classified" material, and so forth.
Since this should be a controlled document...
That's assuming a lot. Given the general level of competence demonstrated, either it was controlled inasmuch as it was written and never revised, so there was never a need to do change management, or their control process was along the lines of "use the document in this directory, it should be the latest version", which happened to have global write privilege.
Yep. To give production credentials to a junior, on their first day of employment, is asking for trouble. They should never have been written into this document. Not even a rough draft.
And as far as "legal" getting involved? Well, surely any type of investigation would highlight serious security flaws internally. So the CTO and subordinate managers would be far more screwed than the (ex) new guy.
I was hoping there'd be an option for "The dickhead who wrote the guide".
There are three issues which tag the CTO:
- defective documentation. WTF is production data doing in a junior document, and why would a junior have credentials to production? Who cleared this document for release?
- who does operation and security reviews, and who signs them off?
- whose responsibility is it to keep a production environment operational, a responsibility which includes recovery processes?
If it had been one item I'd say a firm talking-to and docking of bonus would be in order, but all 3 combined is unacceptable and ends up being a choice between incompetent and negligent. For C-level staff, either is unacceptable.
Whilst I'm all for seeing CTOs' heads roll for data breaches, fraud, gross negligence, bribery ad infinitum... it's all too easy to say (correctly) that the CTO is ultimately responsible.
Surely in this instance the aforementioned dickhead is far more responsible?
As well as whoever didn't check the backups.
Surely there's an overpaid, under-braincelled fatcat calling themselves IT Director who should be in the firing line before the CTO?
Remember that old IBM advert with a boardroom and some suit demanding to know "whose job is it to make sure this stuff works?" followed by "Yours!"? It was better than Dilbert! Do IBM really think it's a good idea to tell their customers they are morons who don't know what they are doing?
Full access credentials for a production database shouldn't be readily available in a guide. Most new devs would never need to be explicitly told the credentials as they should just be part of the application's secured connections. All db access can then be locked to the production application server and all dev can be done against dev/test/staging rather than production.
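In practice that means something like this: the application pulls its credentials from the environment of the box it runs on (set by deployment tooling or a secrets manager), so a setup guide never has anything worth copying. A small illustrative sketch - the variable names and psycopg2 usage are assumptions:

    """Sketch: the application reads its DB credentials from the environment of
    the machine it runs on, so a setup guide never needs to contain them.
    """
    import os
    import psycopg2

    def get_connection():
        # On the production app server these are set by deployment tooling /
        # a secrets manager; on a dev box they point at a local or test DB.
        return psycopg2.connect(
            host=os.environ["DB_HOST"],
            dbname=os.environ["DB_NAME"],
            user=os.environ["DB_USER"],
            password=os.environ["DB_PASSWORD"],
        )

    if __name__ == "__main__":
        with get_connection() as conn, conn.cursor() as cur:
            cur.execute("SELECT 1")
            print("connected to", os.environ["DB_HOST"])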
A company that is so free and easy with the DB credentials to its master production database, on top of untested backups, is just asking for trouble.
The CTO can't have had any idea what he was doing managing a tech company.
I've done almost this. I didn't get sacked on the spot, the company learnt from it, we found some bugs in production too!
Basically I wiped out the entire testing environment's DB (which was a copy of the live DB) on my 2nd day in a job. I had my boss sat with me, and we were seeing how 'good' the documentation was (not great!) for doing an upgrade of the environment. He wanted someone with no experience of the process to undertake it, to see how well it'd go. (It didn't go well, clearly!)
One of the lines was slightly unclear, stating to delete some files. I misread it - after asking the boss for verbal confirmation and not understanding him either - and deleted the entire DB, nuking the environment.
They considered restoring from a backup but for various reasons this was decided against; instead they decided to test out the 'build out a new env' documentation. From this it was discovered that a lot of the binary processes in the live DB (which the testing one was built from) were in fact broken anyway, and the entire stack required a whole lot of fixing. I ended up leaving about 9 months later when something similar went wrong and I decided I couldn't take the stress of 'screwing up a million customer records' through a simple finger slip-up. Prior to leaving I'd been working on scripting away a lot of this processing because I felt it was a danger for the company to be doing it all manually - not sure that ever got finished... Maybe said redditor took over my old job? ;)
Actually, I once wiped a DB using a backup/restore - it was just the wrong one <G>. Luckily it was a copy of the production database on a separate server, but it was to be used for a press presentation demo the following day (it was software for a museum collection, so the data was copyrighted but not "sensitive").
It had been an exhausting day setting everything up, and it was well past 11:00 PM. We had tested the backup/restore procedure over and over when my boss asked for one last run, "just to be sure". Unluckily, that time I made the mistake of doing a full backup of the wrong database (a test one on the same server) to the production backup file, and then restoring it over the production database (it was a single-server demo setup, no big disks and tapes with multiple backups) - OOOOPS!!!
It was SQL Server 6.5, and the backup/restore dialogs were the "Are you sure (Y/N)" type - no source/destination information to check (that taught me how to design proper dialogs...)
The original database was at a location 40km away from where the demo was scheduled. The following morning I had to beg to be let in to access the server very early (disappointing some people there...), make a backup in a hurry (and there were no handy, fast external USB disks back then...) and then "fly" to the other city centre during the peak hour to restore it in time... The worst thing was my boss tried to cover up the cock-up despite our advice to come clean, and the customer wasn't happy. They were fine people (except the curator), and they would have helped us more if they had known.
You all make me feel relieved about the day I did an "are you sure" "Y" on a new external drive reformat... when I realised I'd slipped and chosen the OS drive.
Thankfully I'd done a quick format, and after 3 days of research could copy over the backup MBR(?) and nothing was lost.
I'd hate to think what going near a DB is like... and so don't even question my brother as to what he does in his IT job other than "have you found somewhere better/easier yet" when it comes to problems, old antiquated systems and people asking for stupid IT setups...
The fault is with whoever created the documentation, and then whoever failed to spot the problem whilst proofreading.
I would think the new starter would have plenty of legal recourse if said company's legal team did try to do anything, especially on their first day... Where was the supervision? Why were the documents so badly written?
"I wouldn't have thought so. Unless I'm very out of date, isn't it usual for new starters to be on a month's trial?"
Even these usually have "F-up" clauses that allow for immediate termination on gross negligence/incompetence grounds. No matter who behind the scenes allowed it to happen, actually causing the firm to be forced down for long enough to likely trigger a legal event IS going to trigger the "F-up" clause. It'd be a lot safer for them than gardening for the rest of the month since someone like this might trigger Murphy or even contemplate sabotage if they feel they're screwed in any event. In it up to one's neck, so to speak.
Anyone can make a single button/selection error. I did it while doing office work for an accountant. Went way too deep for my pay grade and qualifications, thought I was in to type up letters, and got asked to input accounting data in the main accounting software database.
I was asked to type data into A. Smith's account. I asked, puzzled, why there were two instances of their details (perhaps one personal, one business, but we had separate software for each)... I think I was told something along the lines of deleting the wrong one. But after being asked to leave one day... I realised it was because they were trying to get around the software customer limit (50 or so) by duplicating details between packages, putting extra business records on the personal system.
Yeah, my fault for deleting it, but really, asking me to navigate a GUI with instructions from across the room to go to "accounts:customer" when I accidentally clicked "customer:accounts" (Yes, the GUI had that kind of duplicated and ambiguous layout) was just the start of the misunderstandings.
I should have put my hands up and said "nope, I don't know how to use computers, can you do it for me". ;)
Does it explain clearly what is required?
Are there any steps where reader-specific inputs are defaulted to a safe variable, e.g. "Enter your details as /<servername>/<yourname>"?
Can a person not familiar with the system carry out the commands? If not, is a warning clearly visible?
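And beyond the wording, the setup script itself can refuse to touch anything that looks like production. A small defensive sketch - the hostname patterns are just assumed examples of a naming convention:

    """Defensive sketch: a dev-setup script that refuses to run against anything
    that looks like production unless the user really, explicitly insists.
    The hostname patterns are made-up examples.
    """
    import sys

    PROD_PATTERNS = ("prod", "prd", "live")    # assumed naming convention

    def confirm_target(host):
        if any(p in host.lower() for p in PROD_PATTERNS):
            print(f"'{host}' looks like a PRODUCTION host.")
            answer = input("Type the hostname again in full to continue, or anything else to abort: ")
            if answer != host:
                sys.exit("Aborted - refusing to touch a production-looking host.")

    if __name__ == "__main__":
        target = sys.argv[1] if len(sys.argv) > 1 else "localhost"
        confirm_target(target)
        print(f"OK, proceeding against {target}")
        # ... the actual environment setup would go here ...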
No matter how clearly one reads directions, Murphy WILL find a way for someone to MISread them and cause havoc, as seen here. See it all the time when I call for someone to "look behind you" and they spin 360 instead, or I say, "Look down!" and they say, "I AM looking down!" while looking UP.
I did this back in the 80's at a large international bank. Not my 1st day, but my 1st night shift. Told to run a command on a terminal, did so and wiped out all that days input. Thankfully my boss told the directors it was a procedural fault and procedures hadn't been updated because of a shortfall in manpower (basically he put the blame on them!!!!). Stayed there for another 4 years before moving on.
Does anybody else say it out loud as a check?
"extend partition X on drive Y", "remove user Joe Blogs", as if, formatting an action through verbal channels is closer to the physical universe than just a thought.
Yes I do work alone most of the time, thank you for asking.
Cards on the table: I'm biased as I'm a DBA (most hated of the hated admins!), click those downvotes now!
The real problem as I see it is why was a developer allowed free rein on a production system? They should be restricted to dev systems on a day-to-day basis. Devs ask for temporary access to UAT and prod systems when required to fix things. Their actions are audited and recorded so all parties are in the clear if anything goes wrong - everyone's arse is covered.
Sorry, but on your first day on the job you are ultra careful and check everything first - at least that's what's in my mind in those first few hours. I don't mean to be callous, but there is a reason they don't pay IT people minimum wage. We get paid damn good money for our knowledge, attention to detail and professionalism; we get paid danger money 'cos we carry the can for the company systems. I've had to fix things when stuff has gone wrong - shit happens as a fact of life, but you try to mitigate that with a bit of care and attention. Too many people treat systems, especially databases, like dirty black boxes that can do anything. Sadly the shit often flows downhill to the hated DBAs like me, who then have to pick up someone else's crap. It's what we do, but we'd appreciate it if you just stepped back for a second and thought before you click GO to submit that untested transaction, to make sure it's pointing at the right system.
I am a developer and I agree with you that developers should not have day-to-day access to production. They might need this access for troubleshooting/deployment etc., but in that case it should be through appropriate channels, not "just select what you need and press GO". It is clearly a failure of the organization (and the CTO, by proxy) not to enforce this separation, and the developer who exposed this huge hole should be thanked and rewarded, not sacked.
As for the documentation - as long as no actual password was stored there, it should not matter. The system should have been designed in such a way as to tell the user "you have no access rights there", unless their authentication (specifically for production system - not day-to-day one) allowed them. It is not the new developer's fault that the system was designed badly and it was not the documentation author's fault either (unless actual production access password was stored in the documentation, which does not seem to be the case)
> at least that's what's in my mind in those first few hours. I don't mean to be callous, but there is a reason they don't pay IT people minimum wage.
From the Reddit page:
> Today was my first day on the job as a Junior Software Developer
This is someone who has almost zero experience, not someone who would be expected to hit the ground running. (And, harking back to my first professional position as a junior developer: I started on less than today's minimum wage.)
So, the chap was essentially given:-
1) no induction from a live person;
2) no training from a live person;
3) no supervision from a live person;
4) Documentation which said "do this".
Doing "this" caused a severe disaster, which was then blamed on the person following the documentation. Let's say they want to go after the person for criminally wiping their database.
In the UK, to prosecute you'd have to prove "actus reus", essentially Latin for "the guilty act", as in "he did it", and "mens rea", meaning "a guilty mind", as in "knowing and intending to cause [the guilty act]".
So in short, as he just followed documentation he was given there was no intention to cause "the guilty act" and therefore he's not committed an offense.
However, the person who wrote the documentation was a pillock. The person who authorised it was an idiot, and the person managing the lot of them ought to be fired for incompetence.
"2. The dev should have had enough knowledge to have identified possible issues and raised it higher"
The documentation gave an example with, AFAICS, no indication that this was the production database. Absent a clear indication that it actually was the live database, the only way for him to identify a possible issue would have been independent knowledge of the credentials for that database.
Question:
Why did first-day-worker have write access to the production database anyway?
It's not even a question of backups (that goes without saying) - but how did someone walk in and get given production database write access by default without even a hint of training?
And why are THEY setting up their dev environment? Why is that not done for them, especially if it involves copying the production database?
The problem is the culture of a place like that, not what the guy did - even if you assume he's being quite modest about how careless he was.
"Why did first-day-worker have write access to the production database anyway?"
Because he was given documentation that he was supposed to follow, and the documentation included the password. Password = write access. No password = no access. Everyone with access to the documentation, including you and me, had write access to the database.
AND WHY?!
Why would you do that? Is he going to be committing to the live database on his first day? No. Read-access, yes. Write? No.
Least privilege principle. If you don't have write access to it, you can't damage it.
And what prat just puts passwords for write access to the production database in a document that's going to end up just about anywhere in six months' time?
This is my question, not "how", which you answer. WHY!?
...that "whoever wrote the docs" was an option in the survey.
Really, though, I don't think anyone *needs* to be fired over the actual failure. It's a small company, and they just experienced the results of a comedy of collective errors across various people that are probably the result of technical debt accrued as they got up and running. This never has to happen, but it's pretty common. It's bad, but you learn, adjust and move on. Well, you move on unless the lack of backups in this case means they are so thoroughly screwed that they have to shutter the business.
All that said, I do think the CTO needs to go, not necessarily because this happened, but for how he's handled this. Politically, he might have needed to go anyway by virtue of falling on his sword for owning the collective failures of the team(s) under his control, but the obvious scapegoating of the new hire for what's a really broad failure of IT processes says to me that he's not going to learn from this and get his ship right. One way or another he needs to own this, and firing the new guy who copy/pasted commands right from an internal document doesn't convey that kind of ownership to me.
One of the analysts working with a Prod DB ran an update query without the WHERE clause. Much hand-wringing and gnashing of teeth followed, possibly accompanied by some braid-pulling.
Way back in the depths of time (mid 90s), I was by default the DBA of an Informix database and was learning SQL. I managed to set the primary key field for a large subset of records in the database to the same value through a badly configured UPDATE query. That took some sorting out, but I managed it eventually.
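The habit that guards against both of those: run the risky statement inside a transaction, check the row count, and only commit if it matches what you expected. A sketch along those lines (table, column and the expected count are invented):

    """Sketch: run a risky UPDATE inside a transaction and check the row count
    before committing. Table/column names and the expected count are invented.
    """
    import psycopg2

    EXPECTED_ROWS = 1          # how many rows we *think* this should touch

    conn = psycopg2.connect("dbname=shop user=analyst")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE customers SET status = %s WHERE customer_id = %s",
                ("inactive", 12345),
            )
            if cur.rowcount != EXPECTED_ROWS:
                # The WHERE clause didn't do what we thought - walk away unharmed.
                raise RuntimeError(f"would have changed {cur.rowcount} rows, expected {EXPECTED_ROWS}")
        conn.commit()
        print("committed")
    except Exception as exc:
        conn.rollback()
        print("rolled back:", exc)
    finally:
        conn.close()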
Who has:
- Ultimate responsibility for ensuring all documentation is correct?
- Ultimate responsibility for ensuring back-ups work (including the testing regime)?
In both cases it's the CTO's job.
If the tale had stopped before it got to the bit about the new chap being fired, I'd have gone with "No-one, it was a wake-up call for everyone." Even C-level folks are only human. Things can get overlooked; over-reliance on assurances from the management chain can combine with a monstrous workload to help this along.
But it didn't stop there. The actions of the CTO suggest someone lacking the emotional maturity and willingness to take personal responsibility, and with the primary desire of covering their own behind at all costs, ever to be trusted with that responsibility.
Rosie
I agree that the first one to be fired should be the documentation author, second the person who approved it, then the backup admin, and finally the CTO should follow them too. But one thing: this guy was supposed to use the credentials he got from the script output, not the ones from the doc. There were creds in the doc that shouldn't have been there, true, but it doesn't mean this guy is completely innocent. He didn't follow his instructions. The fact that the keys to the safe are lying on the safe doesn't mean you are allowed to open it and destroy whatever is inside.
I guess the credentials given in the doc were given as an example - it's unbelievable idiocy to give real credentials as an example, but an example is an example; it doesn't mean those are the values you should use.
"He didn't follow his instructions. The fact that the keys to the safe are lying on the safe doesn't mean you are allowed to open it and destroy whatever is inside."
This is more like the Big Red Button for a data centre being next to the exit button for the door, as was in another Reg story recently. You cannot blame the hapless person who hit the BRB.
This is assuming the credentials were for the production server at the time they were written? Perhaps they really were for a dev server and the production server used other credentials, and then the credentials were changed for some reason - aided by foggy memory and so on? Does anyone know how long it had been since the document was checked? Is there a legal requirement for keeping the document current? If so, how long is that limit under the law?
"Is there a legal requirement for keeping the document current?"
Very unlikely in most jurisdictions. Would there even be a legal requirement for the document to exist? There might be a requirement if the business were ISO 9000 accredited or something similar; if so, I'd say this was a clear fail of that.
I worked at a place where we had a major cock up like this. As IT Manager, I took full responsibility.
I insisted that the production environment should be kept separate, and access limited; but I was overruled by the directors.
I said that the consultants should not be given admin access; but I was overruled by the directors.
I demanded extra backup and continuity resources; but I was overruled by the directors. They also cancelled 3 separate planned DR tests in a 2.5 year period.
When inevitably the consultants screwed up, the entire system went titsup. We were able to get it back up and running, but it took 8 days just to determine the problem. As it was not in the data, restore from backup did not fix the issue.
Shit happens; how you deal with it shows the character of the individual.
"The CTO told me to leave and never come back. He also informed me that apparently legal would need to get involved due to the severity of the data loss."
"Do what you feel is right, Sir. However, I promise you that if you get legal involved, I'll make you famous."
Somehow I don't think I'd be hearing from the company ever again.
> "Do what you feel is right, Sir. However, I promise you that if you get legal involved, I'll make you famous."
I would never say that. You need to complete the live test of the madhouse by including legal. If they decide to go after you, the lawyers also have earned their share of the notoriety.
Devs should not get credentials for production databases.
All stuff they do should be on various dev databases.
With test deploys of development changes done on a DB that should be essentially a clone of any production database (with necessary anonymization done if actually populated from a production DB). The testing/deploy should be done by non-dev staff, to make sure everything is clear and simple and there's no hidden, undocumented stuff needed.
I work in dev and have no access / knowledge of any customer database credentials, nor should I (not even access to production servers, never mind the databases therein)
All new dev updates are deployed and tested by test engineers (who, again, do not have production access, so cannot accidentally run on the wrong system), running against a test DB that mimics typical customer production databases (and can be easily restored if things break horribly).
If testing is OK, updates are deployed on customer production sites - before deploying, database and old-application backups are taken (even in cases where customers have "realtime" backup DB replication).
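That "backup before deploy" step is cheap to automate, so it never gets skipped under time pressure. A rough sketch, assuming pg_dump-style tooling and made-up paths:

    #!/usr/bin/env python3
    """Sketch: take a timestamped database and application backup before applying
    an update, and stop dead if the backup step fails. Paths, DB name and the
    deploy command are placeholders.
    """
    import subprocess, datetime, sys, shutil

    STAMP = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    DB_DUMP = f"/backups/pre_deploy_{STAMP}.dump"
    APP_DIR = "/opt/myapp"
    APP_COPY = f"/backups/myapp_{STAMP}"

    # 1. Database backup - refuse to deploy if this fails.
    dump = subprocess.run(["pg_dump", "-Fc", "-f", DB_DUMP, "proddb"])
    if dump.returncode != 0:
        sys.exit("pre-deploy database backup failed - NOT deploying")

    # 2. Copy of the old application, so the binaries can be rolled back too.
    shutil.copytree(APP_DIR, APP_COPY)

    # 3. Only now run the actual deployment step (placeholder command).
    deploy = subprocess.run(["/opt/myapp/deploy_update.sh"])
    sys.exit(deploy.returncode)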
People can make mistakes, especially someone nervous and overwhelmed on the first day of a new job. Mistakes happen, and no back-covering system is ever perfect, but at least you try to have half-decent procedures in place to reduce the likelihood of nasties (so where I am, non-dev people do have access to production DBs, and they are the ones who have to be careful, as they could wreak havoc - hence others do the testing first).
The scenario described, where production DB creds are flying around for all and sundry to see, is just deranged.
If I do need to look at a problem on a production system (e.g. if it looks like an odd data / some corruption issue causing problems) then I work on a clone of that system that reproduces the problem, not the system itself
AC for obv detailed real world reasons.
Oh, the stories I could tell, if only I were allowed to tell them.
Such as the time, while working as a code librarian, that I wiped out 600 man-years (Yes, man-years, as in 30 programmers having worked for 20 years) of work, while doing library maintenance, coupled with a failing disk. Whoopsie! Fortunately, we had multiple backups, including some which were stored off-site. Whew! It shut us down for a day, while we recovered the library, but that (and a round of beers for the system administrator), was a small price to pay. Whew!
Dave
I once worked on a medical prescription handling project. We had the production system, along with a test system that the developers used, as well as a handful of clueless developers who couldn't seem to differentiate between the production and test systems. So, we'd occasionally get a call from a local pharmacy, wanting to know why Dr. Feel Goode had sent a prescription for 10,000 Xanax for patient John Doe to them. Ack!
Anon Y. Mus
I have a similar story from one of my first dev jobs, where everything was ASP classic, nothing had any tests. We did have local environments but you couldn't realllly test anything properly without doing it on production.
The website sold sex toys, and another developer put a test order through to his home address (lived with his mum) and a 12" black rubber dildo turned up at her door.
So.... there is no data security, if the production credentials are in a dev guide...
So.... there are no backups of production data...
So.... they let a junior developer who is totally new to their system set it up on their own...
We all mess up once in a while. That is why we do things in such a way that it's really damn hard to do things like this without knowing what you are doing.
Sure, at my company I can connect to our production system, and in theory could wipe it if I wanted to. It would have to be very, very deliberate. If it did happen, we have several layers of backup we can fall back on. Fortunately it has never happened.
If something like this can happen so easily by accident, it is not the junior developer's fault; it is the CTO's for not ensuring that the systems are built with consideration for such things.
Hopefully the CTO gets fired. He deserves it. I'd like to say the junior dev could file for wrongful dismissal, but try explaining the above to a judge who has no idea how to computer. It'd be a waste of everyone's time.
"but try explaining the above to a judge who has no idea how to computer"
It really does depend on the judge and I'm sure there's a car analogy you could use in an emergency.
The points you make don't actually depend on understanding how to protect or run a computer system. It should be clear even to the most out-of-touch judge that if the company depends on "things working", then efforts should be made to ensure that "things work" and "things can be repaired" if they stop working. Then, of course, there's the possibility of dragging in an expert witness and just letting them laugh their arse off in open court when asked whether the company's setup was either "fit for purpose" or "best practice".
At all levels.
The best test environment I've used had a level above "company" in the DB hierarchy. Although there was only 1 real company (no separate divisions) there was a second test company and it duplicated all the structure of the main company, so anything could be tested on it. If it didn't work, no problem. Scrub the stuff you created (or copied off the live branch) and start again. Your (faulty) code just nuked the (test) database? Well your co-workers are going to be a bit p***ed and Ops will have to recover a version so everyone else can get on with their work but otherwise not a disaster.
It's an obvious tactic (once you've used it or think about it) but it takes someone on day 1 to realize you want to have this feature in your system and make sure it's put in.
This story implies the people involved were so stupid it makes me think it might be a troll. If it is real, the idiots who think it's a good idea to print production credentials in training materials should be canned, along with the people who supervise backups (or rather don't), and probably the CTO for being a totally clueless fool. This is just an after-effect of massively incompetent management; you can't blame the first-day junior dev for it.
"My money ... is on the document having been written by a BOFH type of person."
My money is on a scenario such as the current sysadmin being asked to write the instructions for his successor.
Then at some point down the line someone else sees that document and asks for a copy to edit and redistribute for more general use, not realising that it contains production-specific information.
If the developer had enterprise database experience, this wouldn't have happened. My fingers wouldn't let me blithely type commands that alter the origin database when the point is to create a local copy.
Where I work, erroneous, unmoderated and outdated home-brew docs are commonplace. In time, you learn not to follow them to the letter, but new hires generally get bitten once before they gain this wisdom.
It's not the dev's fault for not knowing, it's the fault of the hiring manager for not properly qualifying the applicant. Punt said manager, move the applicant into a more suitable role, then make this incident an interview question to ensure that future candidates are screened for database skills.
In the 180 posts above I assume the following is repeated many times, and I have not read it all.
I just feel compelled to add my mere two cents worth.
The level of skill at the small end of practice is often abysmal, as in this case. The CTO is an inexperienced, unqualified idiot and almost certainly a bully. If, for example, he genuinely had an ITIL Manager certificate, none of this could have happened had he been following even the basic aspects of best practice.
In over 40 years of working for large multinationals and one government department: at the multinationals, production systems were not accessible to development staff at all. One of them even had separate networks, as in several air gaps and military-style access control to the production worlds. A contrast is provided by the government department controlling driving licence data. It, of course, was dominated by unqualified bullies, who were even happy to have offshore contractors access the internal network to build and maintain systems despite being warned. Oh, and of course the troublemaking people who objected no longer work there.
As a newbie I was given some JCL to test a program change and specifically assured that it only read a more-than-usually important database, used in thousands (literally, it was a large company) of jobs. After 2 or 3 runs I noticed that the output showed a run number that increased by 1 each time and reported it, only to be told that was impossible as my Dev sign-on didn't have update authority. Oh dearie me. Turned out there'd been a special dispensation for another project to update using Dev sign-ons, and the authorisation hadn't been cancelled.
My supervisor was visibly worrying about whether he'd keep his job! Luckily what I was running didn't update client data, only the run number for audit.
1976: us, a group of A-level mixed tech students in Kent being taught the very basics of computing, invited to go and see a big, all-floor mainframe just up the road (literally) from the college. We were shown around for an hour or two with brief descriptions of what each chunk of hardware was doing, then left alone for a few minutes - nobody at all except us 12 students in the computer room. Someone decided to tap a bit of nonsense on a keyboard; nothing obvious happened. The fella came back, finished the tour with a thank you and a "maybe I'll see you in a few years as employees", we all said cheers and bye, and walked back to college. An hour later, panicked tutors pulled all of us visitors out of various lectures and gathered us in the common room, to tell us that the mainframe we had just visited appeared to have died, and did any of us touch anything? Everyone shook their heads vigorously and denied everything. We found out next morning that the entire data system was gone, and we had probably just bankrupted a major American oil/mineral exploration analysis company used by lots of smaller companies!! Oops..
From what we got told/heard over the months, it was all the tapes with fresh data on them that got killed - the ones from small companies with fresh data about new exploration holes etc. were lost!
You would have hoped it would have been standard practice for someone somewhere to take copies, but apparently not.
The firm involved went into lockdown for months, with all sorts of rumours about big oil firms sabotaging the systems etc. etc. Nobody ever outright accused us of killing the system/data, so it could have been their own cock-up. This was an American firm, staffed solely by Americans, who had very little interaction with the local population, so all we ever heard was gossip..
They did survive, but apparently it took 3 years to get over the problem..
I think they used it as an excuse/reason to upgrade/rebuild the entire system, and it was big: the floor was 70 yards by 30 yards, one entire wall was reel-to-reel decks, the rest of the space was well crammed with the rest of the hardware, and the AC system was massive, taking up another entire floor..
Am racking my brains trying to remember the name of the company; I have Occidental stuck in my head at the moment, but am almost certain it was something different.
Ashford, Kent, UK, 1976 - there cannot have been many big mainframes in the area, and this one was right in the town centre. The building has gone now, but someone must have some memories from then.
But as I said, it was a totally American firm, with employees all Americans..
It's all a long time ago; I can only remember one other name of the 12 of us involved, and she was Polish!!
1) No new member of staff in any area of work should be given an instruction document and told to get on with it, because there will always be ambiguity, uncertainty and complexity that some new staff will trip over. That's why organisations have a proper induction (supposedly).
2) No new member of staff in any area of work should be given any kind of task that brings them close to a destructive error until they know the systems and where the risk factors are, because you can't avoid the trip hazard until you know where the trip hazards are. (See induction comment.)
3) No new member of staff in any area of work should be given any kind of instruction (written or verbal) that hasn't been looked at and had potential dangers removed, because they don't have the background to recognise those dangers. (See ditto.)
4) No new member of staff should be expected to discover what the "unknown unknowns" are, because you can only do that by making a mistake, possibly a catastrophic one. The staff responsible for induction should always ensure that they know what those risks for the new employee are likely to be (which effectively means knowing what the job is and what the organisation's culture is) and pre-empting them. Perhaps by giving them a mentor who can still remember being new. (Another ditto.)
1. The documentation for the new dev was FUBAR. That's on the CTO, he has to sign off on that.
2. The new dev should not have had access to the production database at this point to begin with. Again, that falls within the CTO's responsibility.
3. Not having working backups means that they do not have a valid DR procedure in place. And that is absolutely on the CTO, he has to sign off on that as well...
Definitely a docs issue, clearly written by a really 'clever' person who knew what they meant and assumed everybody else reading (using) the doc was as clever as them.
Docs like this need to be properly idiot proof, you need to ensure that whoever is using them cannot get anything wrong.
Oh and working, verified backups is a good thing too!
I can think of at least 3 cases where I've been dropped in the s**t by some "genius."
One case was down to a MENSA member, so he really was a genius. That I could fix quite easily, although that will fail in about 60 years. When I did it I assumed the code would be junked long before it was an issue. Now I'm not so sure.
The others were completely out of my hands to fix. Why would you not use a case statement for multi way decisions based on a state field?
"One case was down to a MENSA member, so he really was a genius. "
Not necessarily.
Mensa is supposed to be open to the top 2% of the population by IQ, which is only about an IQ of 130, depending on the test (two standard deviations above the mean, near enough).
IIRC, 'genius' starts somewhere around 145 to 160 (about 1 standard deviation above the Mensa lower limit), depending on who is doing the definition... one source quoted 140, and another stated that 125 was 'necessary but not sufficient'. I expect that last can be discounted as an idiosyncratic outlier, using a rather different conception of the term.
> by a really 'clever' person who knew what they meant and assumed everybody else reading (using) the doc was as clever as them.
Hope that was sarcasm. But from a communication POV, the first rule is that you can't assume the reader knows what is in your head. If it isn't in the explanation, you haven't explained it. To some extent this may go with the territory for IT people. The tendency, noted in El Reg a few years back, for people with Asperger's-type thinking to go into IT and be really good at it probably means that some quite important people have a very poor understanding of what is in other people's heads.
Well to be fair, most of the time the documentation people don't know what they are talking about.
They rely on "SME" or "subject-matter-expert".
That's probably the guy who should be fired. Or one of them...
Or the guy who never proofread the docs, or the guy who never asked them to be corrected/updated, etc.
The purpose of making backups is not to hide them away in a cabinet, but to have those be the ones users are making changes to. Then at the end of each day the modifications are rolled up into the real database.
No one uses the main database, live.
The one responsible for a bulletproof system is the CTO.
Not doing this right is his fault.
That is why he is paid the most.
@kirk_augustin@yahoo.com
> The purpose of making backups is not to hide them away in a cabinet, but to have those be the ones users are making changes to. Then at the end of each day the modifications are rolled up into the real database.
All you have done is redefine "backup" as "real database" and "real database" as "backup".
The live production database is, by definition, the one that's being used live and in production. The backup is, by definition, the one you update every night.
It's entirely possible that the database concerned was one of the countless left installed "open to the skies", as discovered in the MongoDB trouble in January. Database systems often tend to be unsecured by default on installation, and if no-one gets around to adding security, that's what's going to happen. Presumably that introductory document dates back to when there were only five people working there in the same room?
I was given admin access to corporate Navision for local user administration, and I wanted to create a new user with the same permissions and roles as an existing production manager.
In the version of Navision that was in use at the time, the roles looked like a spreadsheet, so I selected them with Ctrl-C, then opened the roles in the newly created user and hit Ctrl-V. Nothing obvious happened, except that the roles closed and the original user was deleted. I had to ask corporate to recover him from the backup, and I was not asked to administer Navision any more.
I was actually quite pleased, because if that sort of cock-up is possible and not controlled in the software, only luck will save the user from another one.
Just because software is f^cking expensive doesn't make it well designed.
To be fair, it depends on who you are.
At one job, I did a days-long upgrade to the core application for the company starting on my fourth day, on an OS I had not previously worked on (though probably my 9th OS, so that was less traumatic than it could have been)... paying close attention to the manufacturer's docs was critical.
Back in the day when Novell was used, I was on work experience and given the role of looking after the network. While logged in as the admin account, I was cleaning up user accounts and thought it would be a good idea to delete the admin account and create a new one. For some reason, back then it allowed me to delete the master admin account I was logged in under, and from that point I had no privileges to create user accounts, leaving the network with no admin account to maintain it. I wasn't popular for a while.
If admin, then yes, still the CTO's fault for the docs and the lack of working backups. Most of you seem to be answering from the viewpoint that he was an admin, just questioning the context and procedures, but assuming he would have business on that db later on.
If dev... WTF is a dev supposed to be doing with live root passwords to the prod db? Most places don't, and shouldn't, give even system specialist devs (like me) access to the live db, figuring that any interaction should take place through admin-run scripts and/or the product SMEs. I find that sometimes that's a bit of overkill and it's nice to have access to support troubleshooting. But I specifically ask for read-only access - no way to screw things up (well, except for mis-written 5M-row Cartesian joins). And none of this justifies root.
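As a rough illustration of the read-only idea, here's a small Python sketch using sqlite3's read-only URI mode as a stand-in; on a proper server database the same thing would be a read-only role granted by the DBA. The file and table names are invented:
----------------------------------------------------------------
import sqlite3

# mode=ro opens the file read-only: SELECTs work, any write is rejected.
conn = sqlite3.connect("file:prod.db?mode=ro", uri=True)

print(conn.execute("SELECT COUNT(*) FROM orders").fetchone())

try:
    conn.execute("DELETE FROM orders")      # a slip like this...
except sqlite3.OperationalError as err:
    print("write refused:", err)            # ...is refused by the database itself
----------------------------------------------------------------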
That's a system management fail, a task segregation fail, a people fail and a security fail (writing down those credentials in a disseminated document), all in one.
This is soooo stupid... Firing this junior won't solve anything. He'd never do it again, and this was clearly due to the stupidity of the people at this company. Not protecting prod data, not having a solid backup in place, requiring prod access to create a dev/test database. All of this is so amateurish. I can hardly feel sorry for the company. It's their own stupid fault, if...
If it's true. Maybe this guy screwed up another way and just made it all up to save his reputation.
To Be Fair: at least in Canada, you can fire anybody you hire within their first 3 months with no notice or explanation and no legal problem. Similarly, the new hire has the right to quit on the spot, within the same time period, without giving any notice or reason, either.
(Because giving notice goes Both Ways).
Firing the ignorant clumsy new guy who wrecks the whole operation seems wrong, but is it?
No, it is not.
Installing & configuring new servers in branches of a Canadian bank.
They came pre-configured up to a point & then it was just a case of following the documentation until one crucial point, where the required configuration details had just been input by yourself...
...the document tails off to the point where the bottom half of the page is blank, the next page is also blank, as is the first half of the one after that...
...which finally puts up an image of a Stop sign & on the next full page says "Do not press Enter at this point; if you have, you will have to reimage the server from scratch using your provided reimage media."
This is not what you want to realise you have done at 12.30am.
But if someone expected their "new" pro to be able to figure out what's actually deployed instead of following a script verbatim, then the company didn't get the caliber of employee they expected. Not gonna weigh in on realistic or unrealistic expectations in this regard.
But legally actionable? I don't see any way that's gonna fly. Especially if there IS a documented script and those values were followed.
Unless you've got the guy on social media bragging about how he's gonna "destroy" this company because he's got some agenda, then talking about "Legal" is just someone who's clueless and realizes HE's got some responsibility too.
Restore from backup and move on. No backup? no policies for such? Mr CTO calling for "Legal" is doing so because HE screwed the pooch.
This reminds me of the Unix version of the "I Love You" virus:
----------------------------------------------------------------
If you receive this email, delete a bunch of GIFs, MP3s, and binaries from
your home directory, then send a copy of this email to everyone you know.
----------------------------------------------------------------
So who to fire?
1. the creator of the virus
2. the supervisor of the creator of the virus
Proceed with legal action against the pair.
What else?
3. put CTO on administrative leave (if he wasn't already caught by #1 or #2)
4. investigate CTO
5. audit all documentation, beginning with all new employee orientation
6. institute a new orientation regime where the new employee is assigned a mentor
7. apologize to the poor victim of this prank whose career is now upended, perhaps offering them a try at the CTO position
This is obvious stuff. If any business wanted to do it any other way, I wouldn't want to work there. So:
8. Turn down the CTO position at such a lax place of work. They probably have passwords on stickies, if they have them at all.
a) The developer should not have had access to production at all. Separation of Duties. Fundamental tenet of security. Enforced by network and host separation.
b) WTF was confidential material, DBA account details, doing in the developer's document in the first place?
c) No backups. W.T.F.
The CTO should have been marched out of the door.
The fired developer can undoubtedly claim unfair dismissal.
I have trouble believing this really happened....then again, people on this planet really are that stupid! Presuming it is true, it's certainly not the new guy's fault....surely, I would have thought, he's got a legal claim for unfair dismissal.....
Even if he did it deliberately and maliciously....it shouldn't even have been possible, as that production DB should have been password-protected, at the very least.
This person did nothing wrong at all. The story should be unbelievable, but unfortunately this kind of half-arsedness is not surprising. Half-arsedness is rampant in the industry. An investigation into security, configuration, documentation and other things would undoubtedly reveal plenty of people and processes at this company that are to blame, and no shortage of people who have privately or loudly complained about it and expected something like this to happen eventually. This person is a victim of the company's collective stupidity, and the CTO and the company deserve no sympathy. Anyone there who tried to highlight such problems, and this poor person, do deserve sympathy.
CTO 100% at fault, documentation just gave more fuel for the fire.
None of this is the new guy's fault at all.
They had given him high-level access as a junior??
The principle of least privilege within a good role-based access control (RBAC) model could have helped them. However, their backups should have been tested as well...
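A minimal sketch of what that least-privilege/RBAC idea could look like, in Python, with invented role and permission names - deny by default, so a junior simply never holds a prod-write permission in the first place:
----------------------------------------------------------------
# Roles map to explicit permission sets; anything not listed is denied.
ROLE_PERMISSIONS = {
    "junior_dev": {"read:dev_db", "write:dev_db"},
    "senior_dev": {"read:dev_db", "write:dev_db", "read:prod_db"},
    "dba":        {"read:prod_db", "write:prod_db", "restore:prod_db"},
}

def is_allowed(role: str, permission: str) -> bool:
    # Deny by default: unknown roles and ungranted permissions are refused.
    return permission in ROLE_PERMISSIONS.get(role, set())

# The new starter's day-one setup would have been stopped here, not in court.
assert is_allowed("junior_dev", "write:dev_db")
assert not is_allowed("junior_dev", "write:prod_db")
----------------------------------------------------------------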