It is quite surprising that ARM1 has a ripple-carry adder. Even more surprising, the 32-bit barrel shifter operates in the same cycle, which delays one of the inputs to the ALU. And as far as I could tell (but no, see the edits below), ARM2 and then ARM3 didn't get a faster adder, even though the target clock speed went up and, in the case of ARM3, there's an on-chip cache. (But I may be missing something, and I believe that at some point barrel shifts came to cost an extra cycle of execution.)
Edit: this presentation says ARM2 has a 4-bit carry-lookahead structure.
Edit: it also says that ARM1's use of a complex gate means there is only one gate delay per bit.
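For anyone who wants to see the difference concretely, here is a rough Python sketch of the two carry schemes. It is a bit-level model for illustration only, not the actual ARM1 or ARM2 circuit: the function names are mine, and I am assuming the 4-bit lookahead groups simply pass their carry from one group to the next, which the presentation may or may not bear out.

```python
WIDTH = 32  # ARM ALU operand width

def ripple_add(a, b, carry_in=0):
    """Ripple carry: each bit waits for the carry out of the bit below it."""
    result, carry = 0, carry_in
    for i in range(WIDTH):
        x, y = (a >> i) & 1, (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return result, carry

def lookahead4_add(a, b, carry_in=0):
    """4-bit carry lookahead: within a group, every carry is a flat function
    of the generate/propagate bits and the group's carry-in.
    Assumption: carries still ripple from one 4-bit group to the next."""
    result, carry = 0, carry_in
    for base in range(0, WIDTH, 4):
        g = [((a >> (base + i)) & (b >> (base + i))) & 1 for i in range(4)]  # generate
        p = [((a >> (base + i)) ^ (b >> (base + i))) & 1 for i in range(4)]  # propagate
        c0 = carry
        c1 = g[0] | (p[0] & c0)
        c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
        c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
        # Group carry-out, also computed without waiting on c1..c3.
        carry = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                 | (p[3] & p[2] & p[1] & g[0])
                 | (p[3] & p[2] & p[1] & p[0] & c0))
        for i, c in enumerate((c0, c1, c2, c3)):
            result |= (p[i] ^ c) << (base + i)
    return result, carry

# Worst case for the carry chain: a carry has to travel the full width.
print(ripple_add(0xFFFFFFFF, 1))
print(lookahead4_add(0xFFFFFFFF, 1))
```

Both functions give the same sums. The point is the length of the serial chain: in this model the ripple version has 32 dependent carry stages, while the grouped version has 8, one per 4-bit group. How that translates into real delay depends on the gates used, which is where the "one gate delay per bit" remark about ARM1 comes in.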
It's certainly about balance though: if the CPU and external RAM speeds are linked, and the RAM speed is the limit, the CPU need not be faster and probably shouldn't be; otherwise someone has spent effort, and maybe some area or power, needlessly.
Some interesting reading if you can find these papers:
Furber, S., & Thomas, A. (1990). ARM3 — a study in design for compatibility. Microprocessors and Microsystems, 14(6), 407–415. doi:10.1016/0141-9331(90)90113-a
Furber, S. (1988). The advantages of RISC architectures. Computer Standards & Interfaces, 8(1), 29–35. doi:10.1016/0920-5489(88)90073-6