The weakest link in the toolchain
If the payloads are compiled on site, does that suggest the CI toolchain and automated CD are the weak link?
A British supercomputer hacked last week was among a group of big beasts targeted around the world to mine cryptocurrency, it has emerged. The European Grid Infrastructure (EGI) team reported this week that, in addition to the hijacked user accounts on the ARCHER cluster at the University of Edinburgh, Scotland, supercomputers …
It's HPC; it is not like an ordinary server. Firstly, the users are academics. Secondly, you can't run an HPC system without access to compilers. A significant proportion of the users bring their own code and need to compile it. Trying to replicate the environment on their personal machines to produce code that will work on the HPC system is really quite tricky, and many HPC users have access to multiple HPC sites (which is how it has spread among HPC systems), so it is way, way easier to compile on the machine. Thirdly, we prefer our users to use vendor compilers because they can typically squeeze an extra 5-10% out of the machine, and there are never enough compute cycles.
Many moons ago, when I was a supercomputer specialist, I was asked to develop a little app to report which process was using how much CPU, then link it all to projects.
This was of course on a top class super-comp ...
First run of the app was surprising: about 70% of last month's CPU usage was "rshd". Yes, the whole month, day and night.
After a WTF session with the super's admin team, it turned out one of the dudes working on this super had never understood a thing.
Instead of running all the big loops on the super, he was running the main loop on his workstation and would launch an rsh to the super for each iteration, millions of them. And on the poor super, the iteration would last one millisecond but take many seconds to rsh to ...
Lots of laughs. And much very expensive CPU power lost ...
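To put rough numbers on that story, here is a back-of-the-envelope sketch in Python. The 1 ms of work per iteration is from the anecdote; the 2 seconds of rsh overhead and the one million iterations are assumptions picked purely for illustration ("many seconds" and "millions" are all the story gives), and both function names are made up for this example.

```python
def useful_fraction(compute_s, overhead_s):
    """Share of wall-clock time spent on real work for one remote call."""
    return compute_s / (compute_s + overhead_s)


def total_wall_hours(iterations, compute_s, overhead_s):
    """Wall-clock hours for a loop that pays the overhead on every iteration."""
    return iterations * (compute_s + overhead_s) / 3600.0


if __name__ == "__main__":
    compute_s = 0.001       # one millisecond per iteration (from the story)
    overhead_s = 2.0        # ASSUMPTION: rsh setup cost per call
    iterations = 1_000_000  # ASSUMPTION: "millions of them"

    print(f"useful work: {useful_fraction(compute_s, overhead_s):.4%}")
    print(f"one rsh per iteration: "
          f"{total_wall_hours(iterations, compute_s, overhead_s):.1f} h")
    # Same loop run entirely on the super, paying the rsh cost once.
    print(f"single remote job: "
          f"{(iterations * compute_s + overhead_s) / 3600.0:.2f} h")
```

Whatever the real overhead was, the shape of the result is the same: almost all of the time goes on setting up the connection, which fits with rshd topping the accounting report.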
Physicist Dr. Robert Helling at LMU Munich, one of the sites similarly attacked, published a preliminary analysis of the malware at https://atdotde.blogspot.com/2020/05/high-performance-hackers.html. He discovered altered files in /etc/fonts. In particular, the .fonts file was an executable with SUID root that simply gave a root shell (running bash). Another file in /etc/fonts named .low was larger and obfuscated by XORing. He was able to decode some of this and determine that it contained lists of files in /var/log, presumably because it cleaned the logs. Also, the attackers were likely able to steal additional SSH private keys from user directories, enabling them to log in as many different legitimate users and make tracing even harder.
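For anyone wanting to check their own boxes for that kind of file, a minimal Python sketch follows. It is not taken from Dr. Helling's analysis; the function names are hypothetical, and it only flags regular files that are owned by root and have the setuid bit set, which is the property described for the .fonts file. It says nothing about whether a flagged file is malicious, since plenty of legitimate binaries are SUID root.

```python
import os
import stat


def is_setuid_root(mode, uid):
    """True for a regular file that is owned by uid 0 and has the setuid bit."""
    return stat.S_ISREG(mode) and bool(mode & stat.S_ISUID) and uid == 0


def find_setuid_root(root):
    """Yield paths under `root` that look like SUID-root regular files."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if is_setuid_root(st.st_mode, st.st_uid):
                yield path


if __name__ == "__main__":
    # /etc/fonts is the directory named in the analysis; a SUID binary
    # in a configuration directory is suspicious wherever it turns up.
    for path in find_setuid_root("/etc/fonts"):
        print(path)
```

Bear in mind that a rootkit which has already got root can hide from a scan run on the same machine, so this is a first look, not proof of a clean system.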
Clearly, a sophisticated attack. Less clear is how they managed to implant the rootkit in /etc/fonts in the first place. Stealing an SSH private key from someone's personal device doesn't explain it. That would get them a shell as a legitimate user, but it still should not allow planting the rootkit in /etc/fonts. I wonder if their IPS only looks for remote password guessing attacks and not for sudo attempts.
In any case, it looks like they were able to delete logging and probably more, so the evidence of crypto mining, or anything else they did, is of course going to be limited. There is probably a gateway or router from which investigators could determine IP traffic, and that would reveal the extent of crypto mining.
Local privilege escalation. Possibly CVE-2019-15666, and lots of HPC uses GPFS so CVE-2020-4273 would be a possibility. The evidence of crypto mining is very weak and most likely a smoke screen for far more nefarious activities. There is evidence of data extraction to Chinese IP addresses...
Don't most HPC tasks consume lots of CPU? If you compromise the account of a known userA who runs the known taskB, how will you know that the task isn't doing what it says it's doing: surely you just see the processB, owner userA, at roughly where you expect in the process monitor?
This sort of hack could be very damaging because it's not just compromised accounts - what if they've compromised some of your code? Every time you run it you get bogus results ... and someone else gets cryptocurrency.
EDIT: I know nothing whatsoever about HPC, although I'd love to.
Because on an HPC system you have one process to one core as a rule. Typically you will have somewhere between 85-90% of your cores busy at any one time. The slack is mostly down to differing job sizes, and when someone submits a large job you have to hold nodes in reserve waiting for enough to become available to schedule the large job.
Now sure, there are some background processes, but let's say you have 40 cores in a node: then a load average of more than 45, say 50 tops, will produce an alert because something is wrong. As a sysadmin you would be investigating pretty damn sharpish if all or even a significant number of your nodes were showing a sustained load average above the number of cores on the node.
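The rule of thumb above can be sketched in a few lines of Python. This is only an illustration, not how any particular site's monitoring works: the function names and the default allowance of 5 (taken from the 40-core, alert-above-45 example) are assumptions, and the 15-minute load average stands in for "sustained". os.getloadavg() is only available on Unix-like systems.

```python
import os


def load_alert(load_avg, cores, allowance=5.0):
    """True when the load average exceeds the core count plus an allowance.

    The allowance covers background processes on top of the usual
    one-process-per-core job load.
    """
    return load_avg > cores + allowance


def check_this_node(allowance=5.0):
    """Apply load_alert to the local machine using the 15-minute average."""
    cores = os.cpu_count() or 1  # cpu_count() can return None
    _load1, _load5, load15 = os.getloadavg()
    return load_alert(load15, cores, allowance)


if __name__ == "__main__":
    if check_this_node():
        print("load average is above core count plus allowance: investigate")
    else:
        print("load looks normal for this core count")
```

With the figures from the post, load_alert(50, 40) raises the alert and load_alert(38, 40) does not. A miner running alongside legitimate jobs pushes the load above the core count, which is why it would stand out.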