We need to add registers between successive functional
units, to hold the values between clock cycles
Figure 6.12
Pipelining Performance
in the best of cases, we can execute one new instruction
every clock cycle.
so with a good pipeline, we can get the CPI close to 1
the clock cycle is comparable to that of the multi-cycle
implementation.
so in the best of cases the performance (instructions/second)
is better than either the single-cycle or the multi-cycle implementation.
even complicated instructions simply lengthen the pipeline, which
in the best of cases adds latency to each instruction but does not
reduce throughput
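The performance claims above can be checked with a small calculation. This is only a sketch: the 5-stage pipeline, the clock periods (10 ns single-cycle, 2 ns otherwise), the multi-cycle CPI of 4, and the instruction count are assumptions chosen for illustration, not figures from these notes.

```python
# Illustrative comparison only: the clock periods, CPI values and
# instruction count below are assumptions, not measured figures.

def pipelined_cycles(n_instr, n_stages):
    """Cycles needed by an ideal pipeline (no hazards, no stalls)."""
    # The first instruction needs n_stages cycles to pass through the pipeline;
    # every later instruction completes one cycle after its predecessor.
    return n_stages + (n_instr - 1)

def exec_time_ns(cycles, cycle_ns):
    """Total execution time for a given cycle count and clock period."""
    return cycles * cycle_ns

N = 1_000_000   # assumed number of instructions
STAGES = 5      # assumed pipeline depth

single_ns = exec_time_ns(N * 1, 10.0)   # CPI 1, long clock cycle (assumed 10 ns)
multi_ns = exec_time_ns(N * 4, 2.0)     # assumed average CPI 4, short cycle (2 ns)
pipe_cycles = pipelined_cycles(N, STAGES)
pipe_ns = exec_time_ns(pipe_cycles, 2.0)  # same short cycle as multi-cycle
pipe_cpi = pipe_cycles / N

print(f"single-cycle: {single_ns:.0f} ns")
print(f"multi-cycle:  {multi_ns:.0f} ns")
print(f"pipelined:    {pipe_ns:.0f} ns (CPI {pipe_cpi:.6f})")
```

Under these assumptions the pipelined CPI is within a few millionths of 1, and the pipelined run time is roughly a quarter of the multi-cycle time.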
Hazards
consider the instruction sequence
lw $r1, 12($r2)
add $r2, $r1, $r2
in order to execute the addition, we need the result of the load
instruction to be in the register file by the time the add instruction
executes instruction decode/register fetch. This is the 3rd cycle, if
the load instruction started executing on the first cycle
but the result of the load instruction is not stored into the
register file until the 5th cycle, so the add instruction would use
the wrong value
this is a data hazard
Hazard classification
data hazard: an instruction requires the result of a previous
instruction, which hasn't been computed yet
structural hazard: two instructions in different stages of the
pipeline need the same functional unit in the same clock cycle
example: if we have a single instruction/data memory, the instruction
fetch for one instruction cannot be overlapped with the memory stage
of a load instruction
control hazard: the decision about which instruction to execute next
depends on the outcome of a previous instruction, which is not yet known
example: the branch instruction has to determine whether
the next instruction should be executed or not
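The structural hazard example (a single instruction/data memory) can be sketched as a small check. It assumes a five-stage pipeline in which an instruction fetched in cycle c reaches its memory stage in cycle c + 3, one instruction fetched per cycle, and no stalls; the function name and the tuple layout are hypothetical.

```python
# Assumption: IF in cycle c, ID in c+1, EX in c+2, MEM in c+3, WB in c+4.
MEM_OFFSET = 3
MEMORY_OPS = ("lw", "sw")   # instructions that use memory in their MEM stage

def memory_conflicts(ops):
    """Find cycles where a single shared memory is needed twice.

    ops is a list of mnemonics, fetched one per cycle starting in cycle 1.
    Returns (cycle, index of memory instruction, index of instruction
    being fetched) for every conflict.
    """
    conflicts = []
    for i, op in enumerate(ops):
        if op in MEMORY_OPS:
            mem_cycle = (i + 1) + MEM_OFFSET
            fetch_index = mem_cycle - 1   # instruction whose IF is in mem_cycle
            if fetch_index < len(ops):
                conflicts.append((mem_cycle, i, fetch_index))
    return conflicts

print(memory_conflicts(["lw", "add", "sub", "or", "and"]))
```

In this sequence the lw uses memory in cycle 4, the same cycle in which the fourth instruction (or) would be fetched, so one of the two has to wait unless there are separate instruction and data memories.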
Control Hazards
solution: stall the pipeline for a sufficient number of
cycles until the decision can be made
detail: actually, instruction fetch and decode can be executed
for the following instructions (or for the branch target), since no
register or memory is affected. If these turn out to be the wrong
instructions, simply abort their execution (i.e. do not let them proceed
to the EXE and further stages), and start executing the correct instructions
instead
note that we usually predict a branch taken if it is at the bottom
of a loop, i.e. if it is a backward branch
another solution is to say that the instruction following the
branch is always executed
this instruction is in the branch delay slot
the compiler can move the instruction that was before the branch
into the branch delay slot if the branch does not depend on the results
of that instruction
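The cost of stalling on every branch, compared with predicting backward branches as taken, can be estimated with a short sketch. The 2-cycle penalty, the 20% branch frequency, and the addresses are assumptions for illustration only; the function names are hypothetical.

```python
def predict_taken(branch_pc, target_pc):
    """Static prediction: a backward branch (bottom of a loop) is predicted taken."""
    return target_pc < branch_pc

def loop_mispredictions(iterations, branch_pc, target_pc):
    """Mispredictions for a loop-closing branch executed `iterations` times."""
    # The branch is taken on every iteration except the last, where it falls through.
    outcomes = [True] * (iterations - 1) + [False]
    guess = predict_taken(branch_pc, target_pc)
    return sum(1 for taken in outcomes if taken != guess)

def effective_cpi(branch_fraction, mispredict_rate, penalty_cycles):
    """CPI of an otherwise ideal pipeline (CPI 1) with branch penalties added."""
    return 1.0 + branch_fraction * mispredict_rate * penalty_cycles

# Assumed numbers: 20% of instructions are branches, 2 lost cycles per wrong guess.
misses = loop_mispredictions(10, branch_pc=0x40, target_pc=0x20)
always_stall = effective_cpi(0.2, 1.0, 2)
with_prediction = effective_cpi(0.2, misses / 10, 2)
print(misses, always_stall, with_prediction)
```

For a loop of 10 iterations only the final, falling-through execution of the branch is mispredicted, so under these assumptions the branch overhead is a tenth of what always stalling would cost.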
Data Hazards
data hazard: an instruction requires the result of a previous
instruction, which hasn't been computed yet
add $s0, $t0, $t1
sub $t2, $s0, $t3
sub needs the value of $s0 in its second cycle, but add
only records the value into the register file in its 5th cycle
would have to stall sub for at least two cycles
however, the result is actually available at the end of the
3rd cycle of the add
we could forward the result from the ALU output directly
to the ALU input (as well as storing it into the register) -- this
means no stall is needed in this case (forwarding is also called
bypassing, since the value reaches the ALU directly, bypassing
the register file)
for other instructions, stalls might be needed even with
forwarding, for example, a lw instruction would still stall
a subsequent R-type instruction that uses the loaded value for one cycle,
since the data is not available until the end of the lw's 4th cycle
compilers sometimes reorder instructions to try to avoid data hazards
computer hardware sometimes reorders instructions to try to avoid
data hazards
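The stall counts in this section can be reproduced with a small sketch. It assumes a five-stage pipeline (IF, ID, EX, MEM, WB), a producer fetched in cycle 1, and a register file that is written in the first half of a cycle and read in the second half, so a value written in cycle 5 can be read in cycle 5. The function and its parameters are hypothetical.

```python
def stalls(producer, distance=1, forwarding=True):
    """Stall cycles for a consumer fetched `distance` cycles after the producer.

    producer is the mnemonic of the instruction producing the value;
    "lw" produces it at the end of MEM, anything else at the end of EX.
    """
    consumer_issue = 1 + distance           # producer is fetched in cycle 1
    if forwarding:
        ready_end = 4 if producer == "lw" else 3   # end of MEM or end of EX
        need_start = consumer_issue + 2            # consumer's EX cycle
        return max(0, ready_end + 1 - need_start)
    # Without forwarding the consumer must read the register file in ID,
    # no earlier than the producer's WB cycle (cycle 5).
    wb_cycle = 5
    id_cycle = consumer_issue + 1
    return max(0, wb_cycle - id_cycle)

print(stalls("add", forwarding=False))  # add followed directly by sub, no forwarding
print(stalls("add", forwarding=True))   # same pair with forwarding
print(stalls("lw", forwarding=True))    # load followed directly by a use
print(stalls("lw", distance=2))         # one independent instruction moved in between
```

The last call illustrates why reordering helps: if the compiler or the hardware places one independent instruction between the lw and the instruction that uses its result, the stall disappears under these assumptions.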