Re: C is the new COBOL
You can multithread in C/C++ fairly easily, but not portably unless you use POSIX APIs for everything you want to do. It's not hacky at all. There are libraries that bring portability into the mix, but your binaries themselves will never be portable the way pure JVM (Java) or CLR (C#) binaries can be.
You can also write SIMD code in C/C++ for general-purpose processors (x86, x64, ARM or MIPS) that have SIMD extensions. You can drop down into ASM (via #pragma or inline assembly) to use the SIMD opcodes directly, or you can call libraries that do that for you. I did the #pragma thing with x86 SSE2 back in the late 90s. Nowadays you are more likely to use intrinsics or libraries than roll your own like I did. Rolling your own was fun though!
C was initially developed for Unix, which eventually became a multiprocessor-capable operating system. The C language itself, however, had no direct support for threads (until the optional &lt;threads.h&gt; arrived in C11). That is what operating system runtime libraries were for, with POSIX as an attempt at a portable one (which the likes of IBM and Microsoft unfortunately only supported halfheartedly). This way the C language could remain pure. Not all operating systems supported threads (or even multiple processes) back in the 70s/80s. You can even write C code right down to the metal with no OS available, which is what makes C great for embedded applications and bootstrapping code. A C compiler lets you stay small and tight, with the ability to layer more complex stuff on later, after you have the hardware functioning properly.
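The down-to-the-metal style usually boils down to volatile stores to fixed memory-mapped register addresses. A sketch of the pattern (the UART name and address are purely illustrative; here a plain variable stands in for the hardware register so the snippet can run hosted):

```c
/* Bare-metal flavored C: no OS, no libc needed at the core.
 * On real hardware UART_TX would be a fixed memory-mapped address, e.g.
 *   #define UART_TX (*(volatile unsigned int *)0x10000000)
 * (address is hypothetical). A static variable stands in here so the
 * sketch compiles and runs on a hosted system. */
static volatile unsigned int uart_tx_reg;
#define UART_TX uart_tx_reg

void uart_putc(char c) {
    /* volatile forces an actual store the compiler cannot optimize away */
    UART_TX = (unsigned int)c;
}

void boot_banner(void) {
    const char *msg = "boot\n";
    while (*msg)
        uart_putc(*msg++);
}
```

Nothing here needs an operating system, a heap, or even a C runtime beyond the compiler itself, which is why the same language works for boot code and embedded firmware.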
It's important to note that multithreading != SIMD. Single instruction, multiple data means a single "thread" of instructions where each instruction can process a vector in parallel: for example, 8 additions in parallel rather than 8 adds back-to-back in serial. You can have multiple threads of serial instructions or multiple threads of SIMD instructions, depending on what your hardware supports. Back when I was working with SSE2, you had to be careful to keep threads from messing with each other because there were not enough SSE registers to go around for all the (hyperthreaded) cores you could throw instruction streams at.
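The "8 additions in one instruction" idea looks like this with SSE2 intrinsics (assuming an x86/x86-64 target, where SSE2 is baseline on x86-64). One `paddw` instruction does eight 16-bit adds at once; the serial version below it does the identical work one add at a time:

```c
/* 8 additions in one SSE2 instruction vs. 8 back-to-back serial adds.
 * Assumes an x86/x86-64 target with SSE2. */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

void add8_simd(const int16_t *a, const int16_t *b, int16_t *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vs = _mm_add_epi16(va, vb);   /* 8 lane-wise 16-bit adds, one opcode */
    _mm_storeu_si128((__m128i *)out, vs);
}

void add8_serial(const int16_t *a, const int16_t *b, int16_t *out) {
    for (int i = 0; i < 8; i++)           /* same work, eight adds in serial */
        out[i] = (int16_t)(a[i] + b[i]);
}
```

Either of these functions could also be run on several OS threads at once over different slices of a big array, which is the multithreading-of-SIMD combination the paragraph above describes.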
Operating systems generally support the serial or native SIMD instruction threads directly. But GPUs and NPUs can operate in separate memory spaces from the OS itself, so those instruction streams are handled by coordination between the OS and the external device, with device-driver communication mechanisms sending data back and forth across memory busses. This type of work currently falls outside the capabilities of ordinary language compilers, which do not have enough context on how that communication happens. Instead, you compile code in a special language such as CUDA, then use a vendor-supplied library and device driver to hand that code off from the CPU side: the library copies the code over the memory bus to where the GPU/NPU executes it, and the results get copied back into memory space that the serial code (and operating system) can see and manage.
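The round trip has a characteristic shape regardless of vendor. Here is that shape sketched in plain C; everything in it is a stand-in (the names are hypothetical, `device_mem` simulates memory across the bus, `memcpy` stands in for the driver's DMA transfer, and `launch_kernel` stands in for the vendor library call, e.g. a CUDA kernel launch, that the CPU compiler never sees into):

```c
/* Shape of the CPU -> GPU/NPU -> CPU offload round trip, in plain C.
 * All names are illustrative stand-ins: device_mem simulates device
 * memory, memcpy stands in for the driver's bus/DMA transfer, and
 * launch_kernel stands in for the opaque vendor-library kernel launch. */
#include <string.h>

#define N 4
static float device_mem[N];           /* pretend this lives across the bus */

static void launch_kernel(void) {     /* "device" code: double every element */
    for (int i = 0; i < N; i++)
        device_mem[i] *= 2.0f;
}

void offload_round_trip(float *host_buf) {
    memcpy(device_mem, host_buf, sizeof device_mem); /* host -> device copy  */
    launch_kernel();                                 /* device executes      */
    memcpy(host_buf, device_mem, sizeof device_mem); /* device -> host copy  */
}
```

The point of the sketch is the three-step structure: the C compiler sees only the two copies and an opaque call, while the kernel itself is built and shipped by an entirely separate toolchain.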
Perhaps someday there will be enough standardization that a language such as Rust, or even a modern C/C++, Java or C#, could handle both sides of the CPU/G(N)PU problem in a single language. And maybe someday operating systems will also deal with the low-level memory management in a way where a compiler could reason about things at compile time and protect against the interop mistakes between CPU and GPU/NPU code that lead to CPU/G(N)PU panics.