There appears to be dedicated silicon for e.g. ADD AH, BL: uops.info shows it as 1 uop across multiple microarchitectures. (For example, 1*p0156 is notation for one uop that can issue on any of ports 0/1/5/6, i.e. a theoretical throughput of 4 instrs/cycle, or 0.25 cycles per instruction. I think the displayed 0.4 is just an artifact of the benchmark only using 3 different destination registers, even though each instruction depends on its destination.) The newer Alder Lake actually has lower throughput, but still takes only one uop.
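To make the port notation concrete, here is a rough sketch of how a string like 1*p0156 turns into a best-case throughput number. The function names are made up for illustration and are not anything uops.info provides. It assumes each character after the "p" names one port, and it ignores dependency chains, front-end limits, and contention between overlapping port groups, so treat the result as an optimistic bound rather than a prediction.

```python
import re
from fractions import Fraction


def parse_port_usage(usage):
    """Parse e.g. '1*p0156+1*p23' into [(1, '0156'), (1, '23')].

    Hypothetical helper: assumes one character per port name.
    """
    groups = []
    for term in usage.split("+"):
        m = re.fullmatch(r"\s*(\d+)\*p([0-9A-Za-z]+)\s*", term)
        if m is None:
            raise ValueError(f"unrecognized port usage term: {term!r}")
        groups.append((int(m.group(1)), m.group(2)))
    return groups


def best_case_cycles_per_instr(usage):
    """Optimistic reciprocal throughput: uops in a group spread evenly
    over that group's ports; the slowest group sets the bound."""
    return max(
        Fraction(count, len(ports))
        for count, ports in parse_port_usage(usage)
    )


cycles = best_case_cycles_per_instr("1*p0156")
print(cycles)      # 1/4 cycle per instruction
print(1 / cycles)  # 4 instructions per cycle
```

A measured value above this bound (like the 0.4 mentioned above) would then point at something other than port pressure, such as a dependency on the destination register.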