Outline

Combining all the instructions: principles

simple implementation: one clock cycle per instruction
hence, each component can be used for at most one phase of the execution
this means we need two separate memories, one for data, one for instructions: to relax this, we would need to allow an instruction to execute over multiple clock cycles
multiplexers can allow us to provide different inputs to the same component, to implement different instructions

see Figure 5.13
same ALU can be used for:
- R instructions and testing for equality (inputs are two registers), as well as for
- load/store instructions, for adding a sign-extended offset to a register
so use a 32-bit wide, 2-1 mux to select the second ALU input
value to be written to a register comes from memory (lw), or the ALU (R-instructions), so use a 32-bit wide, 2-1 mux to select the register file data input
the PC should be loaded from either PC+4 (most instructions), or PC+4+offset (beq if the condition is true) -- the 16-bit offset is sign-extended to 32 bits and shifted left by 2, since it counts words -- so use a 32-bit wide, 2-1 mux to select the PC register next value
for the jump instruction (j):
- the topmost 4 bits come from PC+4
- the next 26 bits come from the instruction
- the last 2 bits are zero
so use another 32-bit wide, 2-1 mux to select between the output of the previous mux and the above bits
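
the two PC muxes above can be sketched in software. This is only an illustration, not the hardware design: the function names and the take_branch / jump flags are hypothetical stand-ins for the mux select lines.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def next_pc(pc, instruction, take_branch=False, jump=False):
    """Select the next PC the way the two chained 2-1 muxes would."""
    pc_plus_4 = (pc + 4) & MASK32

    # first mux: PC+4, or PC+4 plus the sign-extended word offset (beq taken)
    offset = sign_extend16(instruction) << 2
    branch_target = (pc_plus_4 + offset) & MASK32
    candidate = branch_target if take_branch else pc_plus_4

    # second mux: previous result, or the jump address built from
    # top 4 bits of PC+4, 26 bits of the instruction, and two zero bits
    jump_target = (pc_plus_4 & 0xF0000000) | ((instruction & 0x03FFFFFF) << 2)
    return jump_target if jump else candidate
```

in the real datapath both targets are computed every cycle and the muxes simply pick one, which is what the unconditional computation of branch_target and jump_target mirrors.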

The ALU needs 3 bits to control its operation (it can ADD, SUB, AND, OR, or Set on Less Than)
looking at the machine instruction, these bits must come from:
- the instruction opcode (bits 31-26 of the instruction), for everything but R instructions: specifically, ADD for lw and sw, SUB for beq
- the function field (bits 5-0 of the instruction) for R instructions
assume that our ALU control hardware obtains two bits from elsewhere reflecting the opcode
- for load or store, ALUOp = 00
- for beq, ALUOp = 01
- for R instructions, ALUOp = 10
then the truth table for the ALU control unit is relatively simple (Figure 5.15):

ALUOp1	ALUOp0	F5	F4	F3	F2	F1	F0	op
0	0	X	X	X	X	X	X	010
X	1	X	X	X	X	X	X	110
1	X	X	X	0	0	0	0	010
1	X	X	X	0	0	1	0	110
1	X	X	X	0	1	0	0	000
1	X	X	X	0	1	0	1	001
1	X	X	X	1	0	1	0	111
implementing this control unit can be done by hand, but can also be done automatically.
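
as a rough software model of the truth table (a sketch, not a circuit), the function below takes the two ALUOp bits as a 2-bit integer, with the second table column as the low-order bit, and the 6-bit function field. The name alu_control and the integer encoding are assumptions made for illustration.

```python
def alu_control(alu_op, funct):
    """Return the 3-bit ALU operation for a 2-bit ALUOp and 6-bit funct."""
    if alu_op == 0b00:
        return 0b010          # lw / sw: add
    if alu_op & 0b01:
        return 0b110          # beq: subtract (other ALUOp bit is a don't-care)

    # remaining case: high ALUOp bit set (R instructions).
    # F5 and F4 are don't-cares, so only F3..F0 are examined.
    by_funct = {
        0b0000: 0b010,        # add
        0b0010: 0b110,        # subtract
        0b0100: 0b000,        # AND
        0b0101: 0b001,        # OR
        0b1010: 0b111,        # set on less than
    }
    # a function field not listed in the table raises KeyError
    return by_funct[funct & 0b1111]
```

the don't-care entries are what keep the logic small: the first two rows never look at the function field at all.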

control is a combinational circuit
for the ALU control, inputs are the function field and the two ALUOp bits, output is the 3 ALU control bits
for the main control unit, inputs are the 6 bits of the opcode, outputs are the control lines for the datapath, i.e. the lines for each of the multiplexers, the ALUOp bits, and bits to read or write memory or registers
in the machine word:
- the two registers to read are always in positions rs and rt. This includes the two registers for R instructions and the base register for load and store, as well as the source register for store
- the destination register is in position rt for a load, and position rd for R instructions
- the 16-bit offset is always in the low 16 bits of the word, for beq, lw, and sw
this information leads to a slightly updated datapath design, as in Figure 5.19
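
the field positions above can be illustrated with a small decoder. The bit ranges for rs (25-21), rt (20-16) and rd (15-11) are the standard MIPS encoding; the function names and the is_load flag are hypothetical, standing in for the mux that selects the destination register.

```python
def decode_fields(instruction):
    """Split a 32-bit MIPS instruction word into its fixed-position fields."""
    low16 = instruction & 0xFFFF
    return {
        "opcode": (instruction >> 26) & 0x3F,   # bits 31-26
        "rs": (instruction >> 21) & 0x1F,       # first register to read
        "rt": (instruction >> 16) & 0x1F,       # second register to read
        "rd": (instruction >> 11) & 0x1F,       # R-instruction destination
        "funct": instruction & 0x3F,            # bits 5-0
        # 16-bit offset, sign-extended
        "offset": low16 - 0x10000 if low16 & 0x8000 else low16,
    }


def write_register(fields, is_load):
    """Destination register: rt for a load, rd for an R instruction."""
    return fields["rt"] if is_load else fields["rd"]
```

because rs and rt are always in the same place, the register file can start reading before the control unit has finished decoding the opcode.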

Truth table in Figure 5.27
yet another combinational circuit, that can be implemented in a variety of ways, most of them involving a computer automatically designing a circuit
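
since the figure is not reproduced in these notes, here is a sketch of what such a table might look like as a lookup. The signal names (RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch) and their values are assumptions following the usual textbook single-cycle design and should be checked against the figure; None marks a don't-care.

```python
# opcode -> control lines (assumed names and values, see note above)
CONTROL = {
    0b000000: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),   # R
    0b100011: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),   # lw
    0b101011: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),   # sw
    0b000100: dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),   # beq
}


def main_control(opcode):
    """Look up the control lines for a 6-bit opcode (KeyError if unknown)."""
    return CONTROL[opcode]
```

a dictionary lookup has no state, which matches the point that control is purely combinational: the outputs depend only on the current opcode.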

CPI is always 1, and all operations take the same time
cycle time must be sufficient for the slowest operation to complete
drawbacks:
- adding a slow operation, such as floating point multiply, lengthens the clock cycle for every instruction, so such hardware should never be added to this design
- speeding up frequent operations gains nothing: only making the slowest operation faster shortens the cycle, so resources have to be "wasted" there
- functional units can be used at most once per clock cycle, hence the need for one ALU plus two adders, and for two separate memories

assume:
- memory takes 2ns to read or write
- the ALU and adders take 2ns to compute their results
- the register file takes 1ns to read or write
the load word instruction uses the following functional units:
1. instruction fetch: 1 memory access, 2ns
2. register access: 1 register file access, 1ns
3. adding base to offset: 1 ALU operation, 2ns
4. data memory access: 2ns
5. storing the result: 1 register file access, 1ns
so the minimum clock cycle would be 8ns
but most other operations use fewer functional units
with a variable clock cycle or (more practically) with different CPIs for different instructions, we can let the other instructions perform faster.
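
the arithmetic above can be repeated for other instructions. The latencies are the ones assumed in these notes; the lists of functional units for sw, R instructions, beq and j are not given here and are filled in from the usual single-cycle datapath, so treat them as assumptions. The instruction mix passed to average_time is a hypothetical input.

```python
LATENCY_NS = {"memory": 2, "alu": 2, "regfile": 1}

# functional units used in sequence by each instruction class
# (lw is from the notes; the others are assumed from the usual datapath)
UNITS_USED = {
    "lw": ["memory", "regfile", "alu", "memory", "regfile"],
    "sw": ["memory", "regfile", "alu", "memory"],
    "R": ["memory", "regfile", "alu", "regfile"],
    "beq": ["memory", "regfile", "alu"],
    "j": ["memory"],
}


def instruction_time(name):
    """Time in ns one instruction class needs to pass through its units."""
    return sum(LATENCY_NS[unit] for unit in UNITS_USED[name])


def single_cycle_clock():
    """Fixed clock period: long enough for the slowest instruction."""
    return max(instruction_time(name) for name in UNITS_USED)


def average_time(mix):
    """Average time per instruction if each took only as long as it needs.

    mix maps instruction class to its fraction of executed instructions.
    """
    return sum(fraction * instruction_time(name)
               for name, fraction in mix.items())
```

for example, with a made-up mix of half loads and half R instructions, average_time({"lw": 0.5, "R": 0.5}) gives 7.0 ns against the fixed 8 ns cycle.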
