Outline
- memory hierarchies
- temporal and spatial locality
- basic cache behavior
- direct-mapped and set-associative caches
- blocks: spatial locality
- cache misses on read, write
Memory Hierarchies
- wanted: lots of fast memory
- available: tiny amounts of really fast memory (registers),
relatively little fast memory (SRAM), lots of slower
memory (DRAM), extremely large amounts of very slow memory (disk)
- DRAM: 120ns for random access, 25-60ns for access to data in the
  currently accessed page
- SDRAM: 8-10ns for sequential access
- SRAM: 5+ns for sequential access
Locality
- if a program executes an instruction at address X, in the future
we will probably need instructions at address X+4, X+8, etc: spatial
locality
- also the instruction at location X is probably going to be needed
again in the future: temporal locality
- if we access a data structure, we are probably:
- going to access other fields in the structure: spatial locality
- going to access it more than once (e.g. read-modify-write):
temporal locality
- without locality, we would be stuck with big, slow memories
- with locality, we can try to keep copies of items that are
  likely to be needed in our small, fast memories: caches
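- as a small illustration, a minimal C sketch (the array and loop are made up
  for the example, not taken from any particular program):

    #include <stddef.h>

    #define N 1024

    long sum_array(const int a[N])
    {
        long sum = 0;                   /* sum is reused every iteration: temporal locality */
        for (size_t i = 0; i < N; i++)  /* the loop's own instructions repeat: temporal locality */
            sum += a[i];                /* a[0], a[1], a[2], ... are adjacent: spatial locality */
        return sum;
    }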
Cache Basics
- automatically (in hardware) keep copies of recently-requested words
of memory
- the processor requests the data from the cache:
- the cache can serve the data if it has it: a cache hit
- the cache can request the data from memory if it does not have it:
  a cache miss
- to exploit locality, after a cache miss the cache will store the
data received from memory for a while
- when new data is stored in the cache, sometimes we have to decide
what old data to remove: this is done according to a
cache replacement algorithm, e.g. LRU (Least Recently Used)
- every cache entry must have a valid bit to record whether its contents
  and address are meaningful (a minimal entry layout is sketched below)
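- a rough sketch in C of what one cache entry records (field names are
  illustrative, not taken from any real design):

    #include <stdbool.h>
    #include <stdint.h>

    struct cache_entry {
        bool     valid;  /* does this entry hold anything useful yet? */
        uint32_t tag;    /* which memory address the data came from */
        uint32_t data;   /* copy of the memory word */
    };

    /* a lookup is a hit only if the entry is valid AND its tag matches */
    static bool is_hit(const struct cache_entry *e, uint32_t tag)
    {
        return e->valid && e->tag == tag;
    }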
Direct-Mapped Caches
- the simplest way to implement a cache is to use a fast memory:
- each cache location stores one word of data, and the corresponding
memory address
- the low-order bits of the address are used to index the cache
- a memory word can only be in one location in the cache: this is
a direct-mapped cache
- finding a word is easy: cache index = word address modulo number of
  cache entries (for a byte-addressable memory with 4-byte words, get the
  word address by dividing the memory address by 4, i.e. ignoring the two
  low-order bits)
- the replacement algorithm is trivial: a new word can only go in its one
  location, so whatever is there gets replaced (sketched below)
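- a minimal C sketch of a direct-mapped lookup, assuming byte addresses,
  4-byte words, and a made-up size of 1024 one-word entries:

    #include <stdbool.h>
    #include <stdint.h>

    #define ENTRIES 1024                  /* hypothetical cache size: 1024 words */

    struct cache_entry { bool valid; uint32_t tag; uint32_t data; };
    static struct cache_entry dcache[ENTRIES];

    static bool lookup(uint32_t addr, uint32_t *word_out)
    {
        uint32_t word_addr = addr >> 2;            /* drop the two byte-offset bits (divide by 4) */
        uint32_t index     = word_addr % ENTRIES;  /* low-order bits pick the single possible slot */
        uint32_t tag       = word_addr / ENTRIES;  /* remaining high-order bits are stored as the tag */

        if (dcache[index].valid && dcache[index].tag == tag) {
            *word_out = dcache[index].data;        /* cache hit */
            return true;
        }
        return false;                              /* cache miss: fetch from memory, then fill this slot */
    }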
Set-Associative Caches
- with direct mapping, frequently-used words may be evicted because of a
  collision: two words that map to the same cache location keep replacing
  each other
- instead, we could divide the cache into sets, with each set holding up to
  i words (the associativity i is usually a power of 2)
- when a word is inserted into the cache, it can replace any word in its set
- for example in a 2-way set-associative cache, there are two
possible locations for each memory word
- in a 4-way set-associative cache, there are four
possible locations for each memory word
- in a fully associative cache with n words, there are n
  possible locations for each memory word (a worked example follows)
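- a worked example with made-up numbers: in a 64-word cache, direct-mapped
  means 64 sets of 1 word (one possible location per memory word); 2-way
  means 32 sets of 2 words (two possible locations); 4-way means 16 sets of
  4; fully associative means 1 set of 64 (any word can go anywhere)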
Implementing Set-Associative Caches
- an i-way set-associative cache with n words total:
- divide the cache memory into n/i sets, each holding i words
- on an access, the index selects one set; read all i words of that set
  in parallel
- compare each stored address (tag) against the desired address
- select the word (if any) whose tag matches
- if no tag matches, it is a cache miss (a C sketch follows below)
- see Figure 7.19
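- a rough C sketch of the lookup (WAYS, SETS and the names are assumptions
  for illustration; real hardware performs the WAYS comparisons in parallel
  rather than in a loop):

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 4                          /* hypothetical: i = 4 */
    #define SETS 256                        /* hypothetical: n/i = 256 sets */

    struct cache_entry { bool valid; uint32_t tag; uint32_t data; };
    static struct cache_entry cache[SETS][WAYS];

    static bool lookup(uint32_t addr, uint32_t *word_out)
    {
        uint32_t word_addr = addr >> 2;      /* byte address -> word address */
        uint32_t set       = word_addr % SETS;
        uint32_t tag       = word_addr / SETS;

        for (int way = 0; way < WAYS; way++) {       /* in hardware: all ways at once */
            if (cache[set][way].valid && cache[set][way].tag == tag) {
                *word_out = cache[set][way].data;    /* the matching way supplies the word */
                return true;
            }
        }
        return false;                                /* no tag matched: cache miss */
    }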
Cache Blocks
- with the caches described so far, we take advantage of temporal locality:
if a word is reused quickly, it will likely be in the cache
- to take advantage of spatial locality, we can load multiple words
at once:
- in the instruction cache, we are likely to need the next few sequential
instructions
- page mode access to RAMs is faster than random access
- even in the data cache, this might be useful
- a block only needs one stored address (tag) and one valid bit, however
  many words it holds (address decomposition sketched below)
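- a minimal C sketch of how a byte address splits up once each entry holds
  a block of several words (the block and cache sizes are made-up
  assumptions):

    #include <stdint.h>

    #define WORDS_PER_BLOCK 4     /* hypothetical: 4 words (16 bytes) per block */
    #define NUM_BLOCKS      256   /* hypothetical: 256 block slots in the cache */

    struct addr_fields { uint32_t byte_off, word_off, index, tag; };

    static struct addr_fields split(uint32_t addr)
    {
        struct addr_fields f;
        uint32_t word_addr  = addr >> 2;                    /* drop byte-within-word bits */
        uint32_t block_addr = word_addr / WORDS_PER_BLOCK;  /* drop word-within-block bits */
        f.byte_off = addr & 0x3;
        f.word_off = word_addr % WORDS_PER_BLOCK;           /* which word inside the block */
        f.index    = block_addr % NUM_BLOCKS;               /* which block slot in the cache */
        f.tag      = block_addr / NUM_BLOCKS;               /* one tag + one valid bit per block */
        return f;
    }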
Handling a cache read miss
- try to fetch the word from the cache
- if a miss is detected (no address match), send the address
to memory and stall the datapath pipeline
- note the missing address might no longer be available in the datapath,
  e.g. the PC may already have been updated to PC+4, so the miss address
  has to be saved or recomputed
- with multiple execution units, stalling a memory execution unit
might not slow down the other units
- stalling the instruction fetch is likely to stall the entire pipeline
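- a rough software model of the cost (cache_lookup, memory_read, cache_fill
  and MISS_PENALTY are assumptions for illustration; real hardware stalls
  the pipeline instead of counting cycles):

    #include <stdbool.h>
    #include <stdint.h>

    #define MISS_PENALTY 50        /* assumed cycles for memory to return the word */

    extern bool     cache_lookup(uint32_t addr, uint32_t *word);  /* as sketched earlier */
    extern uint32_t memory_read(uint32_t addr);                   /* hypothetical DRAM read */
    extern void     cache_fill(uint32_t addr, uint32_t word);     /* install the word for next time */

    static long stall_cycles;      /* cycles the pipeline spends waiting on misses */

    uint32_t read_word(uint32_t addr)
    {
        uint32_t word;
        if (cache_lookup(addr, &word))
            return word;                   /* hit: no stall */
        stall_cycles += MISS_PENALTY;      /* miss: the datapath waits for memory */
        word = memory_read(addr);
        cache_fill(addr, word);            /* keep it around, exploiting locality */
        return word;
    }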
Cache Writes
- with a word-wide cache, we can simply write the word to the cache, then
  have a FIFO (write buffer) also write it to memory: a write-through cache
- a deep enough FIFO will absorb bursts of writes; the memory bandwidth only
  has to handle the average write rate
- a cache with multi-word blocks has to check for a miss on a write, and
  may have to fetch the rest of the block for a word that is being written
- an alternative to write-through is to simply record in a bit (a "dirty"
  bit) that memory is out of date, and write the modified block back before
  replacing it: a write-back cache (both policies are sketched below)
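- a minimal C sketch contrasting the two policies (fifo_push and
  memory_write are hypothetical helpers, not a real API):

    #include <stdbool.h>
    #include <stdint.h>

    struct cache_entry { bool valid; bool dirty; uint32_t tag; uint32_t data; };

    extern void fifo_push(uint32_t addr, uint32_t word);     /* hypothetical write-buffer enqueue */
    extern void memory_write(uint32_t addr, uint32_t word);  /* hypothetical DRAM write */

    /* write-through: update the cache and queue the write for memory;
       a deep enough FIFO absorbs bursts of writes */
    void write_through(struct cache_entry *e, uint32_t addr, uint32_t word)
    {
        e->data = word;
        fifo_push(addr, word);
    }

    /* write-back: update only the cache and mark it dirty;
       memory is brought up to date when the entry is evicted */
    void write_back(struct cache_entry *e, uint32_t word)
    {
        e->data  = word;
        e->dirty = true;           /* memory now holds a stale copy */
    }

    void evict(struct cache_entry *e, uint32_t old_addr)
    {
        if (e->valid && e->dirty)
            memory_write(old_addr, e->data);   /* write the modified block back first */
        e->valid = false;
    }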