Slightly OT, but for instructions that pick a range of bits, add to it, and insert the result back into that range: does x86 have dedicated silicon in the ALU to implement this, or is it done in microcode? If it's the latter, how can it be faster than the equivalent unrolled instructions on a RISC ISA?
This is really a more general question about how microcode can be faster than a sequence of separate instructions, which is something I have never quite understood. Can any CPU engineers enlighten me?