SSH keys & the coming apocalypse
[Anonymous because I can hear the black helicopters hovering overhead.]
I used to work somewhere with a machine closely related to ARCHER: I think I may have had an account on ARCHER but I'm not sure. One of the reasons I eventually left was the repeated amusing cycle of:
ac: 'you have a security problem around x: here are some approaches you could use to fix this, I'm very willing to help you do that because I am quite good at thinking about this stuff'.
them: [very many polite words explaining why they couldn't be bothered fixing it as it would inconvenience the science people and would also just generally mean they would have to think about security and who cares about that].
ac: 'you didn't read my suggestion: if you did what I suggest the science people would not even notice. Also you are doing stuff here which is critical to the security of the country, you really do want to think about security'.
them: [repeat previous statement with more and different words].
ac: [gets depressed, gives up, leaves organisation, rants on The Register, dies in ditch].
So, the basic deal with SSH keys here is this: you need to be able to run jobs on the HPC. Those jobs involve lots of stages: extract code for model from repo, do configurationy stuff, compile code for machine, do lots more configurationy stuff, run model for n cycles, postprocess, archive output, run model for another n cycles, postprocess, archive, iterate that a few hundred times, clean up, shut down. During that time things will die: the model will hit some bad thing and will fall over, so you'll need to be able to go in, perturb things a bit from the previous cycle, restart the cycle that died, and so on. All of the substantive steps run by submitting jobs into whatever Cray's batch scheduler is, which I forget, but it's some open source thing, so there's also a lot of stuff like 'wait for job to find enough nodes to run on'.
Many of these jobs run for months of wallclock time: possibly years in some cases: I ran jobs which ran for hundreds of days.
So all of this is run by, essentially, some agent on your behalf and you talk to the agent by ssh from your login machine, and it then talks to other bits of the system, also by ssh. The agent itself will often die during the course of a job, so it carefully keeps all its state in files. In general you really want to be able to recover things when things die, so everything dumps state in files every fairly short while. The whole system needs to be able to recover from a complete cold start of everything: you really do not want to lose half-a-year's run because someone did the equivalent of tripping over the power lead. There's a good chance the OS will be upgraded on some or all of the system during a run.
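To make the 'keeps all its state in files' bit concrete, here is a minimal sketch of the pattern, not the real agent: every name in it (`save_state`, `load_state`, `run`, the `next_cycle` field, the JSON state file) is made up for illustration. The idea is just that the state file is rewritten atomically after each cycle, so whatever restarts the agent after a cold start picks up at the first cycle that did not finish.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Atomically replace the state file: write a temp file, fsync, rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".state-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic on the same filesystem, so a crash leaves
        # either the old state file or the new one, never half of each.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_state(path):
    """Read the saved state, or start from cycle 0 if there is none yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_cycle": 0}


def run(path, total_cycles, run_cycle):
    """Run the remaining cycles, recording progress after each one."""
    state = load_state(path)
    for cycle in range(state["next_cycle"], total_cycles):
        run_cycle(cycle)  # hypothetical: submit the job and wait for it
        state["next_cycle"] = cycle + 1
        save_state(path, state)
```

If `run_cycle` falls over, or the whole machine does, calling `run` again with the same state file restarts at the cycle that died rather than at cycle 0.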
So things like SSH agent forwarding are not even close to being solutions for this, because no single part of the system, other than the filesystem, will survive for the duration of a job.
So there are some possible solutions to this, which would at least include manufacturing per-job pairs of SSH keys which never got reused. But that is, of course, not what they do. What they do is say that your SSH keys must not have passphrases, so the system can get access to your private key and things can work. Because that will be fine.
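For what the per-job key idea might look like, here is a rough sketch under my own assumptions, not the scheme from any real site. It shells out to the standard `ssh-keygen` to make a fresh ed25519 pair for one job, and builds an `authorized_keys` line using OpenSSH's `restrict` and `command=` options so the key can only run one fixed command. The function names, the `job-<id>` file naming and the `/usr/local/bin/job-agent` path in the usage note are all hypothetical. The key still has no passphrase, because the system has to use it unattended, but it is good for one job only and can be deleted when the job ends.

```python
import subprocess
from pathlib import Path


def make_job_key(key_dir, job_id):
    """Generate a fresh, passphrase-less ed25519 key pair for a single job."""
    key_dir = Path(key_dir)
    key_dir.mkdir(mode=0o700, parents=True, exist_ok=True)
    key_path = key_dir / f"job-{job_id}"
    if key_path.exists():
        # Never reuse a key: refuse rather than overwrite or share.
        raise FileExistsError(f"key for job {job_id} already exists")
    subprocess.run(
        ["ssh-keygen", "-q", "-t", "ed25519", "-N", "",
         "-f", str(key_path), "-C", f"job-{job_id}"],
        check=True,
    )
    return key_path, key_path.with_name(key_path.name + ".pub")


def authorized_keys_line(public_key_text, forced_command):
    """Build an authorized_keys entry that may only run forced_command."""
    if '"' in forced_command or "\n" in forced_command:
        raise ValueError("forced command must not contain quotes or newlines")
    # 'restrict' turns off forwarding, PTY allocation and similar features.
    return f'restrict,command="{forced_command}" {public_key_text.strip()}'
```

Usage would be something like `priv, pub = make_job_key("keys", "42")` followed by appending `authorized_keys_line(pub.read_text(), "/usr/local/bin/job-agent 42")` to the target account's `authorized_keys`, and removing that line and both key files when the job finishes.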
Well, of course, it's not fine. Will they learn from this? No, of course they won't. Are there other security problems, both logical and physical? Hell yes there are. Will something really bad happen in due course as a result? Yes, of course it will.
I have a paper I wrote on all this before leaving which I'm tempted to make public but no doubt I would be in violation of everything if I did.