Re: IBM ROMP vs. ARM
MUL & MLA were indeed slow when both sides of the multiplication were variable, but lots of multiplies have a constant on one side, often sparse in bits (e.g. 2^N, such as 8, 16 or 256, or 2^N+2^M, such as 10). The great trick (of ARM assembler hackers like me, and of the Norcroft compiler, which was brilliant at the time) was to unfold the multiply into shift-adds (one per set bit of the constant) using the barrel shifter, one cycle each.
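For anyone who hasn't seen the trick, here is a rough sketch of it in C rather than ARM assembler. The function names (mul_shift_add, mul10) are made up for illustration, and the loop is only a model of what a compiler or a human would unroll by hand for a known constant; it is not taken from Norcroft's actual output.

```c
#include <stdint.h>

/* Hypothetical helper: multiply x by a constant k with one shift-add
 * per set bit of k. For a compile-time constant the loop would be fully
 * unrolled, leaving only the adds for the bits that are set. */
uint32_t mul_shift_add(uint32_t x, uint32_t k)
{
    uint32_t acc = 0;
    for (unsigned bit = 0; bit < 32; bit++) {
        if (k & (UINT32_C(1) << bit)) {
            /* On ARM this is a single instruction thanks to the
             * barrel shifter: ADD acc, acc, x, LSL #bit */
            acc += x << bit;
        }
    }
    return acc;
}

/* Hand-unfolded case for the constant 10 = 2^3 + 2^1.
 * One possible ARM sequence, two instructions:
 *   ADD r1, r0, r0, LSL #2   ; r1 = x + 4x = 5x
 *   MOV r1, r1, LSL #1       ; r1 = 10x */
uint32_t mul10(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```

A power of two such as 256 has a single set bit and so collapses to one shifted move, while 10 needs two operations. Results wrap modulo 2^32, the same as a 32-bit MUL.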
One of my most treasured possessions is an original ARM-1 dot-matrix instruction set description with CONFIDENTIAL scrawled over it in red ink...