Outline

Pentium Pro

simple instructions executed by hardwired control
complex instructions executed under microprogram control
if compilers generate many simple instructions, as compilers are likely to do, the machine acts like a RISC: instructions execute quickly
if a (legacy) program contains many complex instructions, microcode is invoked frequently, and the CPI rises (each instruction takes more cycles, so the program runs more slowly)

microcode makes it relatively easy for one machine to emulate another -- for example, for the Pentium Pro to emulate the 80286
the first successful application of microcode was in building a family of computers with the same instruction set architecture, the IBM System/360 family
microcode does have limitations -- the datapath must be well-suited to implementing the instruction set.

if I have a washer, a drier, a folder, and a put-away unit
and if I have a large number of loads of laundry to do
I would want each unit to be working full time
so I would load the drier as soon as
- the drier is done with the previous load, and
- the washer is done with this load
each load takes 4 cycles, whether or not we pipeline
but with pipelining, I can do a new load every cycle, so 1000 loads only take 1003 cycles
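
The 1003-cycle figure can be checked with a short sketch. The function names are illustrative, and the sketch assumes every unit takes exactly one cycle and no load ever has to wait:

```python
def sequential_cycles(loads, stages=4):
    """Cycles needed when each load finishes every stage before the next load starts."""
    return loads * stages


def pipelined_cycles(loads, stages=4):
    """Cycles needed when a new load enters the first stage every cycle."""
    if loads == 0:
        return 0
    # The first load needs `stages` cycles; every later load finishes one cycle after its predecessor.
    return stages + (loads - 1)


print(sequential_cycles(1000))  # 4000
print(pipelined_cycles(1000))   # 1003
```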

starting from the single-cycle implementation, each major functional unit can be processing a successive instruction
example: if I have 5 instructions, I₁ I₂ I₃ I₄ I₅, then the functional units can simultaneously be doing:
- instruction fetch for I₅
- instruction decode/register fetch for I₄
- execution for I₃
- memory access (if any) for I₂
- register file write-back (if any) for I₁
See Figure 6.3
We need to add registers between successive functional units, to hold the values between clock cycles (see Figure 6.12)
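
The five-instruction example above can be reproduced with a small sketch. The stage abbreviations and the function name are illustrative choices, and the sketch assumes one instruction enters the pipeline per cycle with no stalls:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_occupancy(num_instructions, cycle):
    """Return {stage: instruction number} for a 1-based clock cycle.

    Instruction i (1-based) enters IF in cycle i and advances one stage per cycle.
    """
    occupancy = {}
    for offset, stage in enumerate(STAGES):
        instr = cycle - offset  # the instruction that entered IF `offset` cycles ago
        if 1 <= instr <= num_instructions:
            occupancy[stage] = instr
    return occupancy


# In cycle 5 every functional unit is busy with a different instruction.
print(stage_occupancy(5, 5))  # {'IF': 5, 'ID': 4, 'EX': 3, 'MEM': 2, 'WB': 1}
```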

in the best of cases, we can execute one new instruction every clock cycle.
so with a good pipeline, we can get the CPI close to 1
the clock cycle is comparable to that of the multi-cycle implementation.
so in the best of cases the performance (instructions/second) is better than either the single-cycle or the multi-cycle implementation.
even complicated instructions simply lengthen the pipeline, which in the best of cases does not affect throughput: a longer pipeline still completes one instruction per cycle once it is full
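
A rough sketch of the best-case comparison follows. The stage latencies are hypothetical numbers chosen only for illustration, and the multi-cycle implementation is not modeled. It assumes the single-cycle clock must fit all five stages, while the pipelined clock only has to fit the slowest stage:

```python
# Hypothetical stage latencies in picoseconds (assumed for illustration only).
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_time(n):
    """Total time for n instructions when one long clock cycle covers every stage."""
    return n * sum(STAGE_PS.values())


def pipelined_time(n):
    """Total time for n instructions in an ideal pipeline with no stalls."""
    if n == 0:
        return 0
    clock = max(STAGE_PS.values())  # the clock is set by the slowest stage
    return (len(STAGE_PS) + n - 1) * clock  # fill the pipeline once, then one instruction per cycle


def pipelined_cpi(n):
    """Average cycles per instruction for n instructions in the ideal pipeline."""
    return (len(STAGE_PS) + n - 1) / n


print(single_cycle_time(1000))  # 800000
print(pipelined_time(1000))     # 200800
print(pipelined_cpi(1000))      # close to 1
```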

consider the instruction sequence
```
lw $r1, 12($r2)
add $r2, $r1, $r2
```
in order to execute the addition, we need the result of the load instruction to be in the register file by the time the add instruction reaches its instruction decode/register fetch stage. This is the 3rd cycle, if the load instruction started executing on the 1st cycle
but the result of the load instruction is not stored into the register file until the 5th cycle, so the add instruction would use the wrong value
this is a data hazard
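
The cycle counts in this example can be checked with a short sketch. The helper name is illustrative, and the sketch assumes the five-stage pipeline with no stalls and no forwarding:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_cycle(start_cycle, stage):
    """Clock cycle (1-based) in which an instruction that enters IF at start_cycle reaches `stage`."""
    return start_cycle + STAGES.index(stage)


lw_writes = stage_cycle(1, "WB")  # lw starts in cycle 1 and writes the register file in cycle 5
add_reads = stage_cycle(2, "ID")  # add starts one cycle later and reads its registers in cycle 3

# The add reads the register before the lw has written it: a data hazard.
print(lw_writes, add_reads, add_reads < lw_writes)  # 5 3 True
```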

data hazard: an instruction requires the result of a previous instruction, which hasn't been computed yet
structural hazard: two instructions in different stages of the pipeline need the same functional unit in the same cycle
example: if we have a single instruction/data memory, the instruction fetch for one instruction cannot be overlapped with the memory stage of a load instruction
control hazard: deciding which instruction to fetch next requires the outcome of a previous instruction that has not yet been computed
example: the branch instruction has to determine whether the next instruction should be executed or not
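
The structural hazard example above (a single instruction/data memory) can be made concrete with a sketch. The function name and the sample program are hypothetical, and the sketch assumes a five-stage pipeline that issues one instruction per cycle without stalling:

```python
def memory_conflicts(program):
    """Return the cycles in which a single memory is needed for both IF and MEM.

    program: list of mnemonics; instruction i (1-based) is fetched in cycle i.
    """
    uses_data_memory = {"lw", "sw"}
    conflicts = []
    for i, op in enumerate(program, start=1):
        if op not in uses_data_memory:
            continue
        mem_cycle = i + 3  # MEM is three stages after IF
        # Instruction number `mem_cycle` is being fetched in that same cycle, if it exists.
        if mem_cycle <= len(program):
            conflicts.append(mem_cycle)
    return conflicts


# The lw accesses data memory in cycle 4, exactly when the 4th instruction is fetched.
print(memory_conflicts(["lw", "add", "sub", "or", "and"]))  # [4]
```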

solution: stall the pipeline for a sufficient number of cycles until the decision can be made
detail: actually, instruction fetch and decode can be executed for the following instructions (or for the branch target), since no register or memory is affected. If these turn out to be the wrong instructions, simply abort their execution (i.e. do not let them proceed to the EXE and further stages), and start executing the correct instructions instead
note that we usually predict a branch taken if it is at the bottom of a loop, i.e. if it is a backward branch
another solution is to say that the instruction following the branch is always executed
this instruction is in the branch delay slot
the compiler can move the instruction that was before the branch into the branch delay slot if the branch does not depend on the results of that instruction
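
Two of the rules above can be written down as simple checks. This is a minimal sketch: the function names, addresses, and registers are hypothetical, and a real compiler or processor considers more conditions than these:

```python
def predict_taken(branch_address, target_address):
    """Static prediction: a backward branch (bottom of a loop) is predicted taken."""
    return target_address < branch_address


def can_fill_delay_slot(moved_dest, branch_sources):
    """True if the instruction before the branch may move into the delay slot.

    moved_dest: register written by the candidate instruction (None if it writes none).
    branch_sources: set of registers the branch reads to make its decision.
    """
    return moved_dest not in branch_sources


# Hypothetical addresses and registers, chosen only for illustration.
print(predict_taken(branch_address=0x40, target_address=0x20))  # True: backward branch
print(can_fill_delay_slot("$t0", {"$s0", "$s1"}))  # True: the branch does not read $t0
print(can_fill_delay_slot("$s0", {"$s0", "$s1"}))  # False: the branch depends on $s0
```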

data hazard: an instruction requires the result of a previous instruction, which hasn't been computed yet
```
add $s0, $t0, $t1
sub $t2, $s0, $t3
```
sub needs the value of $s0 in its second cycle, but add only records the value into the register file in its 5th cycle
without further hardware support, we would have to stall sub for at least two cycles
however, the result is actually available at the end of the 3rd cycle of the add
we could forward the result from the ALU output directly to the ALU input (as well as storing it into the register) -- this means no stall is needed in this case (forwarding is also called bypassing, since the value reaches the ALU directly, bypassing the register file)
for other instructions, stalls might be needed even with forwarding. For example, a lw instruction would still stall a subsequent R-type instruction that uses the loaded value, since the data is not available until the end of the lw's 4th cycle
compilers sometimes reorder instructions to try to avoid data hazards
computer hardware sometimes reorders instructions to try to avoid data hazards
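
The stall counts discussed above can be tabulated with a sketch. The function name and its parameters are illustrative. It assumes the five-stage pipeline, and, for the case without forwarding, that the register file is written in the first half of a cycle and read in the second half:

```python
def stalls_needed(producer, forwarding, distance=1):
    """Stall cycles for a consumer issued `distance` instructions after the producer.

    producer: "alu" for an R-type instruction, "load" for lw.
    The producer is fetched in cycle 1, the consumer in cycle 1 + distance.
    """
    consumer_start = 1 + distance
    if forwarding:
        # The value exists at the end of EX (cycle 3) for ALU results, end of MEM (cycle 4) for loads.
        ready_at_end_of = 3 if producer == "alu" else 4
        consumer_ex = consumer_start + 2
        return max(0, ready_at_end_of + 1 - consumer_ex)
    # Without forwarding the consumer's ID stage must not come before the producer's WB (cycle 5).
    consumer_id = consumer_start + 1
    return max(0, 5 - consumer_id)


print(stalls_needed("alu", forwarding=False))   # 2
print(stalls_needed("alu", forwarding=True))    # 0
print(stalls_needed("load", forwarding=True))   # 1
```

Passing a larger `distance` shows why reordering helps: an independent instruction placed between the lw and its consumer removes the remaining stall.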