It might be more precise to say we know why in general, but not specifically.
Deep-learning architectures stack many layers of neural nets, and typically most of the discrimination is done by convolutional layers. Convolutions are good at detecting signals probabilistically: each filter slides across the input and computes how well the incoming signal matches a learned reference signal, so the response is largest where the overlap between the two is greatest.
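To make that concrete, here's a toy sketch of the "sliding similarity" idea in plain numpy – a hypothetical 1-D edge filter, not a filter from any real model:

```python
import numpy as np

# A convolutional filter responds most strongly where the input best
# matches its reference pattern -- in effect a sliding similarity score.
reference = np.array([-1.0, 1.0])                   # learned "edge" filter (made up)
signal = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # input containing a step edge

# Valid cross-correlation: dot the filter with each window of the input.
response = np.array([
    np.dot(reference, signal[i:i + len(reference)])
    for i in range(len(signal) - len(reference) + 1)
])
print(response.tolist())       # [0.0, 0.0, 1.0, 0.0, 0.0]
print(int(response.argmax()))  # 2 -- the window where the edge sits
```

The key point is that the filter's output is a graded match score, which is why a weak partial match still produces a nonzero response.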
If you stack a lot of convolutions and don't provide the equivalent of a null hypothesis in your training – a reject bucket that the model can dump low-probability inputs into – then even a weak signal will be "recognized" as something from the training set.
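You can see the "no null hypothesis" problem directly in the output layer: a softmax always produces a distribution over the existing classes, so even pure noise gets assigned to one of them. A minimal sketch (the logits here are random stand-ins, not from any real model):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a vector of logits.
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
# Hypothetical final-layer logits for a nonsense input: pure noise.
noise_logits = rng.normal(size=5)
p = softmax(noise_logits)

print(round(p.sum(), 6))  # 1.0 -- probabilities must sum to one,
print(int(p.argmax()))    # so SOME class always "wins",
print(p.max() > 0.2)      # often with non-trivial confidence
```

Without a dedicated "none of the above" class (or a calibrated rejection threshold), there is simply no output the model can use to say "I don't recognize this."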
Basically, the input runs through the stack and has to end up in some category. DALL-E, I believe (can't be bothered to look it up), is a transformer architecture, so it's all based on attention, which gives it a certain amount of "memory" or context. So the context in which the nonsense words appear will affect which categories get chosen.
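The context-dependence is easy to demonstrate with a toy scaled dot-product attention (the core transformer operation). All the vectors below are random placeholders, not real embeddings – the point is only that the same token gets a different representation in a different context:

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
d = 4
nonsense = rng.normal(size=(1, d))   # embedding of one nonsense token (made up)
context_a = rng.normal(size=(3, d))  # surrounding prompt A (made up)
context_b = rng.normal(size=(3, d))  # surrounding prompt B (made up)

# Same nonsense token attending over two different contexts
# yields two different representations downstream.
out_a = attention(nonsense, context_a, context_a)
out_b = attention(nonsense, context_b, context_b)
print(np.allclose(out_a, out_b))  # False
```

Since the representation that reaches the later layers already has the context mixed in, the "category" the nonsense word lands in shifts with the prompt around it.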
In short, what's happening here isn't surprising, and in fact is precisely what you'd expect from a large transformer DL model. And it is a fine example of an inexplicable model, as you suggested. This is a big problem for the deployment of DL architectures, as many researchers have discussed at length.
Note that there are other approaches to ML and "AI" (insofar as "AI" means anything) that are explicable or at least more transparent. Deep-stack ANN architectures like DL are particularly resistant to explication and other types of analysis.