KCL external review blames whole IT team for mega-outage, leaves managers unshamed

An external review into last October's catastrophic data loss at King's College London has placed the blame squarely at the feet of the IT technical team, which it found neither understood nor followed the university's system for backing up data. The probe found that "IT did a pretty good job of managing the actual incident" …

  1. Aristotles slow and dimwitted horse

    Erm, no...

    "Whether it would name managers who had been responsible for poor decisions that caused the data loss. These were not included."

    Quite right too. Put yourself in the shoes of those on the ground deemed responsible and ask yourself if you would like that to happen to you. The public naming and shaming of these individuals, and the certain witch-hunt that would follow by the press and other external non-vested-interest parties crowing for a good story, should not be encouraged. I've not read the report, and to be honest I'm not going to do so, so I have no idea if it is simply just down to human f*ckwittery as stated, or a combination of that and/or a lack of business and IT DR or BCM awareness, or of funding, or resourcing, or adequate regular downtime to test that those BCM or data recovery processes operate and function as they "should" have been designed and implemented.

    I'm sure the college however will have its own disciplinary procedures and those deemed responsible will be dealt with accordingly. They are still human beings, however, and after learning from their mistakes they may need to go and get another job to support their families. So as long as KCL learns its lessons as an organisational whole and puts in procedures (and possibly better staff?) then it shouldn't happen again.

    But it's IT... and it's not always as clear cut as that is it?

    1. TRT

      Re: Erm, no...

      There are names. On the timeline diagram. Big names. And decisions. But no line joining the two.

    2. ToddRundgrensUtopia

      Re: Erm, no...

      And this is why people get away with doing this. They are not named and will not even be disciplined. Classic public sector, be average and when you fall below that call in a consultant to print off some paper.

      They should be fired, as should the balloon that employed them/it as they didn't have a clue about system backup.

      1. Anonymous Coward
        Anonymous Coward

        Re: Erm, no...

        Which private companies publicly name responsible people for screwups in a press release? Even when I've seen people fired for big screwups, it isn't acknowledged internally that that was the reason - even when everyone knows it was. Even if they didn't have to worry about getting sued (which we all know is the reason they tend to be tight-lipped about the reasons for firing people) they wouldn't be putting out a press release about it.

        So why should "public sector" employees be named publicly? Whether they are fired, demoted or given a raise as a result of their screwup certainly doesn't need to be made public.

  2. Anonymous Coward
    Anonymous Coward

    Executive summary

    1) Throw tech team under bus

    2) Exonerate manglement

    1. TRT

      Re: Executive summary

      I don't think manglement are exonerated at all. Depends where you draw the line, I suppose.

    2. Dwarf

      Re: Executive summary

      Well, at least the bonuses are still safe then.

      After all, the techs are replaceable.

      I also wonder where the insufficient backup capacity issue comes from. I can't see any technical person under-specifying a platform. Normally this comes from budgetary constraints. I wonder who owns the budgets - the technical staff or the execs?

    3. quxinot

      Re: Executive summary

      Thank you.

      So a report from the manager's perspective shows the management in a good light. We're surprised somehow?

      FFS, kids.

      1. Anonymous Coward
        Anonymous Coward

        Re: Executive summary

        It was produced using PowerPoint FFS. Obviously devised as a slideshow rather than a real report.

  3. Jonathan Knight

    So if I read this report correctly

    HP engineers were sent to fix a failed piece of hardware several weeks after a firmware upgrade had been released to address a problem that would cause catastrophic failure of the system if a piece of hardware was replaced.

    Presumably HP should have had a procedure in place to upgrade the firmware before carrying out any operations that might lose the customer's data.

    However, this report does not question the actions of the HP engineers and puts the entire blame for the failure to install the correct firmware on the IT team.

    I'm of the opinion that the HP engineers are the cause of the failure and have failed to carry out due diligence before carrying out the procedure to replace the hardware.

    1. Paul Crawford Silver badge

      You are right that HP probably were the cause of the primary failure.

      However, the disastrous consequences of such an array failure lies squarely with the management and IT teams for not having a working DR system in place (that includes making sure *all* data is backed up, and that the backups are tested regularly). Even if HP didn't fsck-up, failures can and *do* happen all by themselves.

      But the blame shone on the IT team is worrying, though maybe to be expected from this sort of commissioned report. I'm sure all of us have made mistakes, and all of us have jerry-rigged systems to get by, but not having proper DR in place for an organisation-wide storage system is likely a management failure, in terms of not funding and/or not asking the right questions (or not being prepared to hear the true answers).

    2. Anonymous Coward
      Anonymous Coward

      Yes, in an ideal world. However, in this one HP will only do that if you are an enterprise services client and they manage your systems. Otherwise it's up to the respective sysadmin(s) to check and apply firmware updates. Their EULA states that you are responsible for applying firmware updates on time. If they come and replace hardware they assume that you have patched the system; if something goes wrong they do not really care.

      However, if that happened to my IT department I would encourage every member of it to sue both KCL and PAC for defamation, if they have proof that the KCL management refused upgrades and were informed that their backups were not reliable.

      It would be nice if the respective government agency sued/fined KCL for gross negligence; that way they would send a strong message to all universities. Yes, I will keep dreaming.

  4. Warm Braw

    Not comprehending the business criticality of the data

    And yet the data owners, the ones fully understanding the business criticality of their data, were expressly forbidden to make personal backups after the incident.

    Sounds like someone has a full understanding of the employment criticality of maintaining their budget line, though...

  5. Anonymous Coward
    Anonymous Coward

    TL;DR

    I'll leave these statements from the report, exonerating the senior management from any responsibility, here:

    1. IT is doing many things at once, which has overwhelmed College stakeholders

    2. Insufficient time has been given for the senior IT governance team to constructively challenge IT plans

    3. IT teams following process mechanically with a narrow focus on their own work

    KCL & PA you wankers!

  6. Anonymous Coward
    Anonymous Coward

    Currently working at KCL

    As someone who worked at KCL during this disaster and still does, I share the view of many colleagues. This report has been watered down and responsibility has been distributed far and wide. If you read the report throughout, while it highlights the total incompetence of our IT team and senior management team, the undercurrent is a collective responsibility, where everyone from the end user to the IT team is at fault for some reason or another. We have already seen this theme with communications from KCL about what we "end users" must do and how we "should store" our data. The failures of a few will impact myself, the taxpayer and research grants, as we now have to pay more for the IT services offered by KCL.

    What it boils down to, and the report avoids this, is that King's IT knowingly did not back up the infrastructure and knowingly did not test the system. In some cases they knew they were not backing up. HOWEVER, all the while they were telling people like me, who are putting in research grants and ethics applications, that we WERE backing up and the system was 100% foolproof. Surprisingly, these facts did not make it into the report.

    There is no excuse for this to have happened unless there was a culture of laziness to begin with. That starts from the top down.

    1. Anonymous Coward
      Anonymous Coward

      Re: Currently working at KCL

      Always back up, even if it's illegal!

    2. Anonymous Coward
      Anonymous Coward

      Re: Currently working at KCL

      As another King's worker, I agree with this. They made promises, they imposed dictates, they negotiated purchasing deals without consulting end users or analysing needs or meeting with teams. They insisted at one point on making research groups pay ~£5000 for multi-function, floor-standing Samsung copier/scanner/printers that were to be hooked up to follow-me printing - yes, devices that occupy around 3 sq m of floor space and are appropriate for 300-400 users - for which money and homes were to be found within the small offices of a group of just 4 researchers on a shoe-string budget. They had survey teams coming around our offices near Denmark Hill looking for desks to remove to fit these things in. We refused and kept our old HP machines, thankfully, as the follow-me system turned out to be riddled with problems. It's this kind of rubbish that really sets end users at war with central departments.

    3. ToddRundgrensUtopia

      Re: Currently working at KCL

      Well said Sir!

  7. Alan J. Wylie
    Joke

    "multiple copies with identical hash sums"

    I hope those weren't SHA-1 hashes.

    shattered.io
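
    If anyone is deduplicating recovered copies by hash sum, here's a minimal sketch in plain Python (the path is hypothetical) that groups files by SHA-256, which the SHAttered collision doesn't affect, rather than SHA-1:

      # Minimal sketch: group recovered files by SHA-256 rather than SHA-1.
      # The mount point below is hypothetical.
      import hashlib
      from collections import defaultdict
      from pathlib import Path

      def sha256_of(path, chunk=1024 * 1024):
          h = hashlib.sha256()
          with open(path, "rb") as f:
              for block in iter(lambda: f.read(chunk), b""):
                  h.update(block)
          return h.hexdigest()

      copies = defaultdict(list)
      for p in Path("/srv/recovered").rglob("*"):   # hypothetical path
          if p.is_file():
              copies[sha256_of(p)].append(p)

      for digest, paths in copies.items():
          if len(paths) > 1:   # same digest => same content, for practical purposes
              print(digest, *paths, sep="\n  ")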

  8. Notas Badoff
    Megaphone

    CYA via documentation

    "never been backed up on tape due to capacity constraints and the potential impact of this was never communicated to the College, ..." (they said...)

    Remember that summaries of hallway/meeting conversations should be emailed to relevant fellows and highers. When your worries are not translated into effective remedies - for *years* - people will not volunteer that ole Ned said this would all end in tears back when. And print out a copy if your email about storage mentioned "losing everything!"

  9. TRT

    Well, to be fair...

    If it was a routine firmware update, I'd have left it a few weeks before doing it. But if it was flagged as a critical update, I would have done the update within that working week. Knowing the implications of the patch, HP should have validated the system they were performing a replacement on before doing the replacement. As they said "if this update had been done, none of this would have happened" they must have KNOWN of that vulnerability and the consequences of swapping a component without the update.

    HP's EULA etc. might protect them from being sued for the full damages, but it does not absolve them entirely.

    My fear is that this will be used as an excuse for even more red tape and bureaucracy in an already management-heavy system. A massive swelling of the ranks, and the associated expense, and all the ITIL and ISO certifications and training and documentation, but that's all so much fluff on top of getting the actual job done. They'll point the finger firmly at having to support and migrate legacy systems, accelerate the move away from those towards a corporate IT model, and make even more people do their own thing. Tying all IT purchasing into a single supplier, for example, would exclude many medical and scientific instruments that come supplied with integrated systems; everything from water purifiers to brain scanners, from chocolate dispensers to air sampling systems, building management systems and audio visual systems. That's the feeling I get from the document, anyway. Users feel IT don't understand the business, especially in research; IT say they do and that you should do things their way for "reasons". Trust them. They are experts. But no-one else is allowed to be.

  10. Henry 8

    "the core College IT systems and data and file storage were backed up on a different location of the same storage unit"

    I'm sorry, but whatever organisational problems might also have been at play in the sorry episode, any sysadmin who thinks that copying data to the "same storage unit" can in any way count as a backup is incompetent.

    1. Martin Gregorie

      Agreed. Any copy of the data that isn't complete and held offline, either in a firesafe or (preferably) in a different building far enough away to survive destruction of the data center, is not a DR backup.
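
      And "complete" shouldn't be taken on trust either. As a minimal sketch (both paths are hypothetical), something like this walks the source and the off-array copy and compares SHA-256 digests file by file:

        # Minimal sketch: confirm an offsite/offline copy is complete and
        # intact by comparing SHA-256 digests file by file.
        # Both paths below are hypothetical.
        import hashlib
        from pathlib import Path

        def digest(path, chunk=1024 * 1024):
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(chunk), b""):
                    h.update(block)
            return h.hexdigest()

        def compare_trees(source_root, copy_root):
            source_root, copy_root = Path(source_root), Path(copy_root)
            missing, mismatched = [], []
            for src in source_root.rglob("*"):
                if not src.is_file():
                    continue
                dst = copy_root / src.relative_to(source_root)
                if not dst.is_file():
                    missing.append(src)
                elif digest(src) != digest(dst):
                    mismatched.append(src)
            return missing, mismatched

        missing, mismatched = compare_trees("/data/live", "/mnt/offsite_copy")
        print(f"{len(missing)} missing, {len(mismatched)} mismatched")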

      1. TRT

        According to one diagram...

        They had two similar storage units.

        Why weren't they being cross backed-up?

          How many training sessions on data security do they run? Do they issue certificates to show that all their end user services staff and other IT staff have attended training on data security, the Data Protection Act, etc.? Compulsory training for all staff? What kind of a bollocking do people get for forgetting their training?

        Still unanswered questions here. But on the whole I think the document isn't as much of a whitewash as people were expecting.

    2. TRT

      Amazon...

      Amazon's AWS status control panel was on the same system as it was reporting on, allegedly.

  11. Anonymous Coward
    Anonymous Coward

    What stood out for me was this long, long, years-long transition. Who migrates data between DCs without a project? The half-arsedry of that migration is 100% managerial and this whole episode seems to stink of a chronically underfunded operation.

    1. Anonymous Coward
      Anonymous Coward

      They'll blame having to support legacy systems and they'll bring the axe down. Just watch!

  12. This post has been deleted by its author

  13. Anonymous Coward
    Anonymous Coward

    Let's just add a bit to that report

    1 - grab PDF, look at doc properties => person who authored this (and that he used Powerpoint 2013)

    2 - dig out this person on LinkedIn and find ".. is a versatile, successful, and experienced consultant specialising in business transformation, analysis and change"

    So, this is not an IT guy. Now, the problem I have with that is that said author is thus making statements about something that he is, in part, not competent in, which also happens to be demonstrated by his interview list. There is ONE (1) guy in there whom I would consider to be at the coal face, and we all know that those who are absent tend to take the blame.

    This means that the repeated claim that "IT followed processes mechanically" (i.e. without thinking) lacks a certain amount of validation. The author seems to be very proud of that phrase, but I'm left with the question of why that team was not led by someone who ensured processes were in place.

    I thus get the feeling that the objective as stated in the report does not quite equal the objective that got the undoubtedly fat bill signed off. It feels like a whitewash.

  14. RichardB

    Quite clear who is to blame here:

    "Hed of Architecture (IT)" [sic]

    To all those saying this report squarely blames IT, I disagree entirely.

    Throughout there is a common strand of how 'the users' and 'the business' willfully fail to pay attention to the concerns of IT: a litany of failures to understand basic IT concepts, and a choice to form an antagonistic relationship with a 'supplier' of internal IT.

    It's all there in black and white, and borne out in the comments here...

  15. CliveH

    Learn from this and sympathise

    So KCL had a problem and got hit. While they are trying hard to get it working we are going to kick them while they are down rather than supporting them?

    I've worked in a university and am willing to bet that the level of investment in IT has not matched the constant demands to improve the "student experience". IT will have been trying to stabilise a crap infrastructure whilst at the same time getting constant demands to deliver the things students demand, all with too little money and too few staff. All this, tied with uncompromising, fragmented requirements, leads to over-complicated and flaky systems.

    Let's be honest. I know that here we have not done any sort of backup test for ages. I'm willing to bet that most (if not all) of you, and The Register too, haven't either. Businesses aren't normally all that keen to let you take systems down to do this. We currently stand in the belief that we've done a good job. How are we going to feel when some rare complication causes an issue and everyone stands back to laugh and point?

    As for the whole 'KCL won't let people store data where they want' thing: my uni didn't have a health school, though I think KCL does. Do you really want your personal information stored on a USB stick by some numpty who then leaves it unencrypted on a train? I think KCL's reason for this was probably trying to avoid the next article about irresponsible data storage leading to leaked patient information.

    Oh, and if it is anything like I've seen in the past, I bet nobody takes responsibility for their own data at all. If our users started demanding backup tests, perhaps they would help to keep us honest. It's just easy to assume IT know all about it and then moan and curse them when something goes wrong. Personally, I'm off to see about getting those tests arranged. Be you IT or business, maybe you too should learn from the KCL experience rather than wasting your time name-calling. That is what a truly pro-IT publication would encourage.
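
    Something like the following is what I have in mind as a first pass: a sampled restore test, sketched in Python. The manifest format, paths and sample size are all hypothetical, and the backup tool itself does the restoring into the scratch directory beforehand; this just checks what came back.

      # Minimal sketch of a sampled restore test. Assumes a manifest of
      # "sha256  relative/path" lines written at backup time, and that the
      # sampled files have already been restored into a scratch directory
      # by whatever backup tool is in use. Paths and format are hypothetical.
      import hashlib
      import random
      from pathlib import Path

      def digest(path, chunk=1024 * 1024):
          h = hashlib.sha256()
          with open(path, "rb") as f:
              for block in iter(lambda: f.read(chunk), b""):
                  h.update(block)
          return h.hexdigest()

      manifest = {}
      for line in Path("/var/backups/manifest.sha256").read_text().splitlines():
          expected, rel = line.split(maxsplit=1)
          manifest[rel] = expected

      sample = random.sample(sorted(manifest), k=min(50, len(manifest)))
      restore_root = Path("/scratch/restore-test")   # restored here beforehand

      failures = [rel for rel in sample
                  if not (restore_root / rel).is_file()
                  or digest(restore_root / rel) != manifest[rel]]
      print(f"{len(sample) - len(failures)} of {len(sample)} sampled files verified")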

    1. Anonymous Coward
      Anonymous Coward

      Re: Learn from this and sympathise

      Let's be honest. I know that here we have not done any sort of backup test for ages. I'm willing to bet that most (if not all) of you, and The Register too, haven't either. Businesses aren't normally all that keen to let you take systems down to do this.

      Oh no, we test. There's a simple reason for that: we also have BCM competence and we class a failing backup as a company extinction event - if we cannot recover IT within 48h we hit all sorts of problems, including regulatory ones. This makes it easy to defend such tests, and we have two data centres that are each capable of taking the full load for a few hours (at impaired performance, but at least it doesn't all come to a grinding halt), so we basically kill one off completely, every 6 months (which means that in a year we have tested both).

      That said, it is NOT a very comfortable thing to do. Even though we know it all works in theory, hearing it all spin down at once is nerve-wracking (first we kill power to see if the UPS + generators pick up, but after that we kill main power to simulate a catastrophic event). I think we may put the next one on video for our customers; that gives us at least some marketing capital as a return on the effort.

      1. Doctor Syntax Silver badge

        Re: Learn from this and sympathise

        "so we basically kill one off completely, every 6 months"

        Scary. A real failure on the other live system midway through the test?...

        1. Anonymous Coward
          Anonymous Coward

          Re: Learn from this and sympathise

          Scary. A real failure on the other live system midway through the test?...

          The systems are actually redundant in themselves (hence their ability to handle the full load). We also share a DC with another company with a cold standby (tested annually, in rotation with the other company), but we don't count that as a backup because it is by now barely able to handle the load, and the sync lags so it's perpetually out of date.

          To be honest, I'm about to flag that one as a risk and have it replaced..

    2. Anonymous Coward
      Anonymous Coward

      Re: Learn from this and sympathise

      Oh, I agree. It was a big problem, and it's a learning issue for ALL professionals (and others). Why KCL tried to suppress the media coverage, I don't know. Honesty counts for something, yes?

      The investment in IT, though, seems to be on the up and up. Massive swelling of staffing eating up the budget. Have you seen their organisational chart? They've got a CO managing four directors and a Chief Digital Officer. Directors are IT Governance, Transformations, IT solutions and IT services.

      Each of these has between six and eight heads underneath them, and each of the heads has between two and thirty people under them, and a lot of those have people under them. All told, it's 331 people, of which 279 are permanent staff.

      KCL does have a health school, but "personally identifiable information" is present everywhere in a university: student reports, letters about personal circumstances sent to tutors, appraisals. And beyond that there's also "sensitive" information which isn't personally identifiable: things like accounts, finances, vivisection data, purchase deals, commercially exploitable data, patents, etc. So I would suggest practically everyone, everywhere, in every organisation and business should be looking to implement the appropriate data protection standards. That doesn't mean you get all Draconian on people's ass - although some at management level seem to take this approach.

    3. ToddRundgrensUtopia

      Re: Learn from this and sympathise

      CliveH, stop defending incompetence, it's embarrassing.

    4. Doctor Syntax Silver badge

      Re: Learn from this and sympathise

      "Businesses arn't normally all that keen to let you take systems down to do this."

      With good reason. The downtime is a secondary consideration. The main one is that if you're needing downtime on the live system it means you're doing the test on the live hardware and if the backup/restore fails for any reason you've just blown away the system you were trying to restore. You do not do your restore tests on your live hardware. You rent hardware for that purpose, ideally you have a DR arrangement which includes the facility for periodic tests. That way you can do your testing without any down time and without any time pressure other than the slot allocated. Your first test will be an interesting learning experience.

  16. Anonymous Coward
    Anonymous Coward

    Blame

    They are to blame because they allowed HP kit to be used for a critical system.

  17. timerider

    What I find surprising is...

    I don't normally comment on this kind of public forum, but I found this piece of information shocking...

    Page 13. Point 3.

    IT has not been able to convince users on the need for doing full Disaster Recovery tests and negotiate windows for these to occur. The infrastructure limitations mean that any such test will involve downtime which business has so far refused, without properly understanding the consequence. Had these occurred it would have demonstrated that the backup systems were not functioning correctly.

    I can't believe that: 1. Any company would put itself at huge risk without testing the DR plan, and 2. The business is so blind that it will refuse any testing of what is more or less a mandatory requirement for any modern business.

    I know there are a huge number of companies that don't have the resources or means to execute a full DR plan, but I rest my case with this particular example.

    BTW, who told them that snapshots = backups... Another huge mistake, and I blame them 100% for believing this...

    1. Doctor Syntax Silver badge

      Re: What I find surprising is...

      "1. Any company ... 2. The business"

      A University or College is not a company, neither is it a business in the sense you seem to mean. Putting all the resources into a single IT operation in a college ought to be about as likely an undertaking as herding cats. It's not surprising that there was no effective communication between IT and users, as just about every researcher in the place probably has different requirements.

      1. Anonymous Coward
        Anonymous Coward

        Re: What I find surprising is...

        A University or College is not a company, neither is it a business in the sense you seem to mean.

        To be honest, I don't rate the subject-appropriate expertise used for the report as very high. The whole report exudes the sense of a foregone conclusion that the data had to match.

  18. Anonymous Coward
    Anonymous Coward

    From the report, I think it says they used the same 3PAR to also store their primary Veeam backups. These primaries were then copied to tape. They then lost their primaries in the controller failure, so had to go to the tapes. Looking at the end of the report, the now-fixed solution is still using the 3PAR for the Veeam primaries; let's hope they get some investment to allow for some Veeam repositories on different hardware than their prod data.

    A few years back we had HP come in and design some unified storage for us using EVAs; they also designed it so that our backup repositories sat on the same stack using the same controllers. This seems like their standard advice to people who don't have millions to spend on kit. It seems that when budgets are tight, these big companies recommend bad solutions rather than a good solution that might cost 20% more.

    And the firmware bug? This array was apparently 4 years old; after 4 years, surely system-breaking bugs like this should never happen. Chances are this bug was introduced in a `maintenance` firmware that was recommended previously. It's never nice doing FW updates that could potentially destroy all your data, so it's acceptable to me that they were a few months behind. The idea of a unified, one-system storage array that accommodates all workloads has always made me nervous.
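
    For what it's worth, even a crude sanity check helps here. A minimal sketch (hypothetical mount points, nothing Veeam- or 3PAR-specific): st_dev only tells filesystems apart, so it won't spot two LUNs carved from the same array - that still needs checking at the SAN layer - but it will at least flag a repository sitting on the very same volume as production:

      # Minimal sketch (not Veeam- or 3PAR-specific): warn if a backup
      # repository sits on the same filesystem/LUN as the data it protects.
      # st_dev only distinguishes filesystems, so two LUNs from the same
      # array still look separate; check that at the SAN layer instead.
      # All mount points below are hypothetical.
      import os

      PROD_PATHS = ["/srv/vmfs", "/srv/fileshares"]
      REPO_PATHS = ["/backup/repo1", "/backup/repo2"]

      prod_devs = {os.stat(p).st_dev: p for p in PROD_PATHS}

      for repo in REPO_PATHS:
          dev = os.stat(repo).st_dev
          if dev in prod_devs:
              print(f"WARNING: {repo} shares a device with {prod_devs[dev]}")
          else:
              print(f"OK: {repo} is on a different device to production data")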

    1. Anonymous Coward
      Anonymous Coward

      "This seems like their standard advice to people who don't have millions to spend on kit."

      TBH this requirement usually stems from the customer; the conversation is usually something like the below:

      Me: You don't want your backups on primary storage, you need an off-array copy; here's a cheap disk-based backup solution that would do that for you.

      Customer: Why not? Vendor X, your competition, told me snaps are as good as a backup; besides, I don't want to manage something else and budget is tight.

      Me: But if the primary fails so do the snaps, in which case you're SOL.

      Customer: But you're selling me a five-nines array, are you suggesting it's going to fail?

      Me: Here, have some nearline drives.

    2. Anonymous Coward
      Anonymous Coward

      As the saying goes hardware fails and software eventually works.

      But in order to really cock things up and cause irretrievable data loss you need some wetware in the equation.

      Looks like once again people and process trump hardware failures.

  19. Anonymous Coward
    Anonymous Coward

    Those who don't learn from the past...

    I worked at King's for the better part of a decade, and what I find truly sad is that, in spite of the clear investment in technology and the massive increase in headcount, there is STILL a "Strand data centre". I remember the joys of 'Manage Data Once', a supposedly future-proof storage project that turned out to be a bunch of hard drives in a rack in the basement. I remember suffering weeks without email in the summer of 2008, power outages that lasted days, and all the while a politically driven campaign to pretend that outhosting wasn't a viable option.

    That this sort of thing can happen in spite of a DECADE of 'change', strategic plans, and new CIOs—that is the real horror.

    1. Doctor Syntax Silver badge

      Re: Those who don't learn from the past...

      "That this sort of thing can happen in spite of a DECADE of 'change', strategic plans, and new CIOs—that is the real horror."

      ISTM that it didn't necessarily happen in spite of these things but maybe because of them.

  20. Anonymous Coward
    Anonymous Coward

    Losses?

    Another thought is that the report doesn't mention what was lost. How much data? What was the impact to research? Are there financial implications for this public institution?

    I wonder if someone needs to put in a solid Freedom of Information request.

    1. Anonymous Coward
      Anonymous Coward

      Re: Losses?

      I believe there's another article coming with a more in depth analysis and possibly some FOI requests. Much of it will be rumour, of course. I'm not sure they even have a list of what was there to begin with - the story seems to be that they didn't have catalogues. Page 11, item 3.

      And more and more stuff is getting recovered every day, so I hear, so the figures would be out of date by the time the report came out.

      The big funders like Wellcome and MRC have been along to talk to the grande fromages - after all, they paid for the research in the first place.

  21. Anonymous Coward
    Anonymous Coward

    no interview with the Director of IT services, no interview with problem manager or change manager...

    1. Anonymous Coward
      Anonymous Coward

      There may be a problem...

      with getting hold of some of the individuals concerned. Rumours abound concerning some individuals.

  22. Anonymous Coward
    Anonymous Coward

    no interview with Director of IT Services, problem manager or change manager...

  23. Anonymous Coward
    Anonymous Coward

    "KCL external review blames whole IT team for mega-outage, leaves managers unshamed"

    What type of managers are they? Managers are paid more money. What for? If they're IT managers, they should have the IT expertise to know how their systems operate, and a responsibility to explain the processes to the less knowledgeable staff in the IT team. If they're more expert in people management, they have a responsibility to know when their staff are performing their duties properly and when there are concerns. Either way, there is a responsibility..
