Outline
- memory hierarchies
- temporal and spatial locality
- basic cache behavior
- direct-mapped and set-associative caches
- blocks: spatial locality
- cache misses on read, write
Memory Hierarchies
- wanted: lots of fast memory
- available: tiny amounts of really fast memory (registers),
relatively little fast memory (SRAM), lots of slower
memory (DRAM), extremely large amounts of very slow memory (disk)
- DRAM: 120ns for random access, 25-60ns for access to data in the
  currently accessed page
- SDRAM: 8-10ns for sequential access
- SRAM: 5+ns for sequential access
Locality
- if a program executes an instruction at address X, in the future
we will probably need instructions at address X+4, X+8, etc: spatial
locality
- also the instruction at location X is probably going to be needed
again in the future: temporal locality
- if we access a data structure, we are probably:
- going to access other fields in the structure: spatial locality
- going to access it more than once (e.g. read-modify-write):
temporal locality
- without locality, we would be stuck with big, slow memories
- with locality, we can try to keep copies of items that are
  likely to be needed in our small, fast memories: caches
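- as a small illustration, a minimal C sketch (the array and loop are made up
  for the example, not taken from any particular program):

    #include <stddef.h>

    #define N 1024

    long sum_array(const int a[N])
    {
        long sum = 0;                   /* sum is reused every iteration: temporal locality */
        for (size_t i = 0; i < N; i++)  /* the loop's own instructions repeat: temporal locality */
            sum += a[i];                /* a[0], a[1], a[2], ... are adjacent: spatial locality */
        return sum;
    }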
Cache Basics
- automatically (in hardware) keep copies of recently-requested words
of memory
- the processor requests the data from the cache:
- the cache can serve the data if it has it: a cache hit
- the cache can request the data from memory if it does not have it:
  a cache miss
- to exploit locality, after a cache miss the cache will store the
data received from memory for a while
- when new data is stored in the cache, sometimes we have to decide
what old data to remove: this is done according to a
cache replacement algorithm, e.g. LRU (Least Recently Used)
- every cache entry must have a valid bit to record whether its contents
  and address are meaningful (a minimal entry layout is sketched below)
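- a rough sketch in C of what one cache entry records (field names are
  illustrative, not taken from any real design):

    #include <stdbool.h>
    #include <stdint.h>

    struct cache_entry {
        bool     valid;  /* does this entry hold anything useful yet? */
        uint32_t tag;    /* which memory address the data came from */
        uint32_t data;   /* copy of the memory word */
    };

    /* a lookup is a hit only if the entry is valid AND its tag matches */
    static bool is_hit(const struct cache_entry *e, uint32_t tag)
    {
        return e->valid && e->tag == tag;
    }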
Direct-Mapped Caches
- the simplest way to implement a cache is to use a fast memory:
- each cache location stores one word of data, and the corresponding
memory address
- the low-order bits of the address are used to index the cache
- a memory word can only be in one location in the cache: this is
a direct-mapped cache
- finding a word is easy: cache index = word address modulo number of
  cache entries (for a byte-addressable memory with 4-byte words, get the
  word address by dividing the memory address by 4, i.e. ignoring the two
  low-order bits)
- the replacement algorithm is trivial: a new word can only go in its one
  location, so whatever is there gets replaced (sketched below)
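- a minimal C sketch of a direct-mapped lookup, assuming byte addresses,
  4-byte words, and a made-up size of 1024 one-word entries:

    #include <stdbool.h>
    #include <stdint.h>

    #define ENTRIES 1024                  /* hypothetical cache size: 1024 words */

    struct cache_entry { bool valid; uint32_t tag; uint32_t data; };
    static struct cache_entry dcache[ENTRIES];

    static bool lookup(uint32_t addr, uint32_t *word_out)
    {
        uint32_t word_addr = addr >> 2;            /* drop the two byte-offset bits (divide by 4) */
        uint32_t index     = word_addr % ENTRIES;  /* low-order bits pick the single possible slot */
        uint32_t tag       = word_addr / ENTRIES;  /* remaining high-order bits are stored as the tag */

        if (dcache[index].valid && dcache[index].tag == tag) {
            *word_out = dcache[index].data;        /* cache hit */
            return true;
        }
        return false;                              /* cache miss: fetch from memory, then fill this slot */
    }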
Set-Associative Caches
- with direct mapping, frequently-used words may be evicted because of a
  collision: two words that map to the same cache location keep replacing
  each other
- instead, we could divide the cache into sets, with each set holding up to
  i words (the associativity i is usually a power of 2)
- when a word is inserted into the cache, it can replace any word in its set
- for example in a 2-way set-associative cache, there are two
possible locations for each memory word
- in a 4-way set-associative cache, there are four
possible locations for each memory word
- in a fully associative cache with n words, there are n
  possible locations for each memory word (a worked example follows)
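- a worked example with made-up numbers: in a 64-word cache, direct-mapped
  means 64 sets of 1 word (one possible location per memory word); 2-way
  means 32 sets of 2 words (two possible locations); 4-way means 16 sets of
  4; fully associative means 1 set of 64 (any word can go anywhere)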
Implementing Set-Associative Caches
- an i-way set-associative cache with n words total:
- divide the cache memory into n/i sets, each holding i words
- on an access, the index selects one set; read all i words of that set
  in parallel
- compare each stored address (tag) against the desired address
- select the word (if any) whose tag matches
- if no tag matches, it is a cache miss (a C sketch follows below)
- see Figure 7.19
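- a rough C sketch of the lookup (WAYS, SETS and the names are assumptions
  for illustration; real hardware performs the WAYS comparisons in parallel
  rather than in a loop):

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 4                          /* hypothetical: i = 4 */
    #define SETS 256                        /* hypothetical: n/i = 256 sets */

    struct cache_entry { bool valid; uint32_t tag; uint32_t data; };
    static struct cache_entry cache[SETS][WAYS];

    static bool lookup(uint32_t addr, uint32_t *word_out)
    {
        uint32_t word_addr = addr >> 2;      /* byte address -> word address */
        uint32_t set       = word_addr % SETS;
        uint32_t tag       = word_addr / SETS;

        for (int way = 0; way < WAYS; way++) {       /* in hardware: all ways at once */
            if (cache[set][way].valid && cache[set][way].tag == tag) {
                *word_out = cache[set][way].data;    /* the matching way supplies the word */
                return true;
            }
        }
        return false;                                /* no tag matched: cache miss */
    }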
Cache Blocks
- with the caches described so far, we take advantage of temporal locality:
if a word is reused quickly, it will likely be in the cache
- to take advantage of spatial locality, we can load multiple words
at once:
- in the instruction cache, we are likely to need the next few sequential
instructions
- page mode access to RAMs is faster than random access
- even in the data cache, this might be useful
- a block only needs one stored address (tag) and one valid bit, however
  many words it holds (address decomposition sketched below)
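- a minimal C sketch of how a byte address splits up once each entry holds
  a block of several words (the block and cache sizes are made-up
  assumptions):

    #include <stdint.h>

    #define WORDS_PER_BLOCK 4     /* hypothetical: 4 words (16 bytes) per block */
    #define NUM_BLOCKS      256   /* hypothetical: 256 block slots in the cache */

    struct addr_fields { uint32_t byte_off, word_off, index, tag; };

    static struct addr_fields split(uint32_t addr)
    {
        struct addr_fields f;
        uint32_t word_addr  = addr >> 2;                    /* drop byte-within-word bits */
        uint32_t block_addr = word_addr / WORDS_PER_BLOCK;  /* drop word-within-block bits */
        f.byte_off = addr & 0x3;
        f.word_off = word_addr % WORDS_PER_BLOCK;           /* which word inside the block */
        f.index    = block_addr % NUM_BLOCKS;               /* which block slot in the cache */
        f.tag      = block_addr / NUM_BLOCKS;               /* one tag + one valid bit per block */
        return f;
    }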
Handling a cache read miss
- try to fetch the word from the cache
- if a miss is detected (no address match), send the address
to memory and stall the datapath pipeline
- note the missing address might no longer be available in the datapath,
  e.g. the PC may already have been updated to PC+4, so the miss address
  has to be saved or recomputed
- with multiple execution units, stalling a memory execution unit
might not slow down the other units
- stalling the instruction fetch is likely to stall the entire pipeline
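- a rough software model of the cost (cache_lookup, memory_read, cache_fill
  and MISS_PENALTY are assumptions for illustration; real hardware stalls
  the pipeline instead of counting cycles):

    #include <stdbool.h>
    #include <stdint.h>

    #define MISS_PENALTY 50        /* assumed cycles for memory to return the word */

    extern bool     cache_lookup(uint32_t addr, uint32_t *word);  /* as sketched earlier */
    extern uint32_t memory_read(uint32_t addr);                   /* hypothetical DRAM read */
    extern void     cache_fill(uint32_t addr, uint32_t word);     /* install the word for next time */

    static long stall_cycles;      /* cycles the pipeline spends waiting on misses */

    uint32_t read_word(uint32_t addr)
    {
        uint32_t word;
        if (cache_lookup(addr, &word))
            return word;                   /* hit: no stall */
        stall_cycles += MISS_PENALTY;      /* miss: the datapath waits for memory */
        word = memory_read(addr);
        cache_fill(addr, word);            /* keep it around, exploiting locality */
        return word;
    }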
Cache Writes
- with a word-wide cache, we can simply write the word to the cache, then
  have a FIFO (write buffer) also write it to memory: a write-through cache
- a deep enough FIFO will absorb bursts of writes; the memory bandwidth only
  has to handle the average write rate
- a cache with multi-word blocks has to check for a miss on a write, and
  may have to fetch the rest of the block for a word that is being written
- an alternative to write-through is to simply record in a bit (a "dirty"
  bit) that memory is out of date, and write the modified block back before
  replacing it: a write-back cache (both policies are sketched below)
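- a minimal C sketch contrasting the two policies (fifo_push and
  memory_write are hypothetical helpers, not a real API):

    #include <stdbool.h>
    #include <stdint.h>

    struct cache_entry { bool valid; bool dirty; uint32_t tag; uint32_t data; };

    extern void fifo_push(uint32_t addr, uint32_t word);     /* hypothetical write-buffer enqueue */
    extern void memory_write(uint32_t addr, uint32_t word);  /* hypothetical DRAM write */

    /* write-through: update the cache and queue the write for memory;
       a deep enough FIFO absorbs bursts of writes */
    void write_through(struct cache_entry *e, uint32_t addr, uint32_t word)
    {
        e->data = word;
        fifo_push(addr, word);
    }

    /* write-back: update only the cache and mark it dirty;
       memory is brought up to date when the entry is evicted */
    void write_back(struct cache_entry *e, uint32_t word)
    {
        e->data  = word;
        e->dirty = true;           /* memory now holds a stale copy */
    }

    void evict(struct cache_entry *e, uint32_t old_addr)
    {
        if (e->valid && e->dirty)
            memory_write(old_addr, e->data);   /* write the modified block back first */
        e->valid = false;
    }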