Programming supercomputers is hard! You need to write (hopefully!) optimal, cache-oblivious algorithms, manage domain decomposition (between nodes, between GPUs per node, between cores per GPU), and do message passing on a substrate that hopefully reaches its theoretical peak speed without bottlenecks, while the CPU spends as little time as possible on busywork or in sleep states that cause latency spikes. You need a good mental model of EVERYTHING: complete mastery of the entire stack, and of exactly how, when, and why every message, interrupt, memory copy, packet, and bus read/write happens :D
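To make the domain decomposition part a bit more concrete, here is a minimal sketch of the three-level split (nodes, then GPUs per node, then cores per GPU) on a toy 1D grid. Everything in it is an assumption for illustration: the helper names `split_range` and `decompose`, the 1D layout, and the node/GPU/core counts are all hypothetical, and a real code would also have to deal with halo exchange, multi-dimensional grids, and load imbalance.

```python
def split_range(start, stop, parts):
    """Split the half-open range [start, stop) into `parts` contiguous
    chunks whose sizes differ by at most one cell."""
    base, extra = divmod(stop - start, parts)
    chunks = []
    lo = start
    for i in range(parts):
        # The first `extra` chunks each take one leftover cell.
        hi = lo + base + (1 if i < extra else 0)
        chunks.append((lo, hi))
        lo = hi
    return chunks


def decompose(n_cells, nodes, gpus_per_node, cores_per_gpu):
    """Return a dict mapping (node, gpu, core) to the half-open cell range
    it owns. Hypothetical toy layout: a 1D grid split level by level."""
    layout = {}
    for node, (n_lo, n_hi) in enumerate(split_range(0, n_cells, nodes)):
        for gpu, (g_lo, g_hi) in enumerate(split_range(n_lo, n_hi, gpus_per_node)):
            for core, cells in enumerate(split_range(g_lo, g_hi, cores_per_gpu)):
                layout[(node, gpu, core)] = cells
    return layout


# Assumed toy machine: 2 nodes, 2 GPUs per node, 4 cores per GPU.
layout = decompose(n_cells=1000, nodes=2, gpus_per_node=2, cores_per_gpu=4)
for owner, (lo, hi) in sorted(layout.items()):
    print(owner, "owns cells", lo, "to", hi - 1)
```

The split is done top-down so that neighbouring cells stay on the same GPU or node wherever possible, which keeps most neighbour communication on the faster, more local links.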