Outline
- cache performance
- multi-level caches
- virtual memory
- page table
- translation lookaside buffer
- context switch
- page fault
- segmentation
- cache concepts
cache performance data
- we can measure or compute:
- miss rate for both instruction cache and data cache:
e.g. 5% miss rate --
should be measured on actual programs, perhaps on a simulator
- miss cost, in either time (100ns) or clock cycles (50 cycles) --
can be measured, simulated, or computed
- the CPI for the same machine and same program,
with a cache large enough to give 100% hit rate -- usually from simulation
- this data can be used to compute the overall CPI
CPI computation for caches
- inputs: CPIzero for infinitely fast cache, miss penalty
MissPenalty, and miss rates IMissRate and DMissRate
- a fraction IMissRate of the instructions has to wait an additional
MissPenalty clock cycles, so
- the CPI considering only instruction misses is
CPI = CPIzero + IMissRate x MissPenalty
- example: CPIzero = 2 clocks/instruction,
IMissRate = 3% = 0.03, MissPenalty = 40 clocks, then
CPI = 2 + 0.03 * 40 = 2 + 1.2 = 3.2 cycles per instruction
- if DDataInstructions is the fraction of instructions
accessing the data memory, the data miss cost per instruction is
DMissRate * DDataInstructions * MissPenalty
- in the above example, if data instructions are 30% of the instructions
and the data miss rate is 5%, we have
additional CPI = 0.3 * 0.05 * 40 = 0.6 cycles per instruction
- combining the two gives a CPI with cache misses of 3.8 cycles per
instruction
- book uses the same formulas, but multiplying by I throughout, where I is
the number of instructions
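- a minimal C sketch of the CPI computation above, using the example numbers
(the variable names are mine, not from the book):

    #include <stdio.h>

    int main(void) {
        double cpi_zero     = 2.0;   /* CPI with a perfect (100% hit) cache */
        double i_miss_rate  = 0.03;  /* instruction-cache miss rate */
        double d_miss_rate  = 0.05;  /* data-cache miss rate */
        double d_fraction   = 0.30;  /* fraction of instructions accessing data memory */
        double miss_penalty = 40.0;  /* clock cycles per miss */

        double i_stall = i_miss_rate * miss_penalty;              /* 1.2 */
        double d_stall = d_fraction * d_miss_rate * miss_penalty; /* 0.6 */

        printf("CPI with cache misses: %.1f\n", cpi_zero + i_stall + d_stall); /* 3.8 */
        return 0;
    }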
multi-level caches
- clearly, reducing the cost of a miss would be beneficial
- a second-level cache can (on average) reduce this cost
- the second-level cache:
- can have higher hit time than the first-level cache
- can be much larger than the first-level cache
- can use higher associativity
- can be useful even if it only has a relatively small hit ratio, e.g. 50%
- if
- the first level cache has miss ratio M1, and
- the second level cache has miss ratio M2, and
- the cost of a level-2 miss is C2 cycles, and
- the cost of a level-1 miss (i.e. a level-2 hit) is C1
then the overall cost, in CPI, for a two-level finite cache is
M1 * C1 + M1 * M2 * C2
- Note that for a one-level cache, the cost would be
M1 * C2
assuming that the cost of accessing main memory remains C2
- the stall CPI just computed is simply added to the no-memory-stall CPI
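- a small C sketch of these formulas (the example numbers are made-up
assumptions, only to exercise the code; it can also be used to check the
in-class exercise that follows):

    #include <stdio.h>

    /* memory-stall CPI for a two-level cache: M1*C1 + M1*M2*C2,
       where M2 is the local miss ratio of the second-level cache */
    static double two_level_stall(double m1, double c1, double m2, double c2) {
        return m1 * c1 + m1 * m2 * c2;
    }

    /* for comparison, a one-level cache with miss ratio M1 and miss cost C2 */
    static double one_level_stall(double m1, double c2) {
        return m1 * c2;
    }

    int main(void) {
        printf("one-level stall CPI: %.2f\n", one_level_stall(0.04, 40.0));
        printf("two-level stall CPI: %.2f\n", two_level_stall(0.04, 6.0, 0.5, 40.0));
        return 0;   /* either value is added to the no-memory-stall CPI */
    }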
In-class exercise
- three architectures, all (unless otherwise noted)
with a CPI=1 if there are no memory stalls:
- single-level cache, with 40 cycle miss penalty, 2% misses
- two-level cache, with 5 cycle miss penalty to the second-level cache
and a further 40 cycle penalty when the second-level cache misses (so that
a second-level cache miss really stalls for 45 cycles). First-level cache
miss rate is 2%, second-level cache miss rate is 50%
- single-level cache, with 40 cycle miss penalty, 1% misses, but
a no-memory-stall CPI=5
- which machine has the best overall CPI? For simplicity, consider
only instruction fetches, i.e. there is exactly one memory access per
instruction
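- one way to check your answers afterwards (a C sketch, with the numbers
taken from the exercise statement):

    #include <stdio.h>

    int main(void) {
        double cpi1 = 1.0 + 0.02 * 40.0;                      /* single-level cache */
        double cpi2 = 1.0 + 0.02 * 5.0 + 0.02 * 0.50 * 40.0;  /* two-level cache    */
        double cpi3 = 5.0 + 0.01 * 40.0;                      /* higher base CPI    */
        printf("machine 1: %.2f\nmachine 2: %.2f\nmachine 3: %.2f\n",
               cpi1, cpi2, cpi3);
        return 0;
    }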
virtual memory
- when multiple programs are running at the same time, we need:
- protection: prevent a program from accessing other programs' data
- relocation: let the operating system move data or code pages around,
perhaps to temporarily make room for a higher-priority program
virtual memory mechanisms
- the addresses used by the programs do not correspond to fixed memory
locations: they are virtual addresses
- both physical and virtual memory are split up into blocks called pages,
usually of size somewhere between 4KBytes and 256MBytes
- when a program is executing, its pages are typically mapped so that
each page of virtual memory is actually stored in (mapped to) a page
in physical memory -- the low-order bits of the virtual and physical
addresses will be the same, while the high-order bits have no correlation
- a directly indexed page table, usually stored in main
memory, provides the translation between virtual addresses and physical
addresses
- when a program is not executing, its virtual pages may be stored on
disk instead of being mapped to physical memory
- the memory subsystem can detect when a virtual address is not mapped,
and trigger an exception, called a page fault
- on a page fault, the operating system finds the appropriate data on
disk, locates (or allocates) a free physical page, and maps the virtual
page to the new physical page
- LRU or approximate LRU can be used to free pages
- whenever a virtual address is translated to a physical address,
the permissions are checked, ensuring protection
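- a minimal C sketch of this translation, assuming 32-bit virtual addresses,
4KByte pages, and a directly indexed table (the entry layout is an
assumption, not any particular machine's; a real entry would also hold the
protection bits checked above):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS  12                        /* 4KByte pages */
    #define PAGE_SIZE  (1u << PAGE_BITS)
    #define NUM_VPAGES (1u << (32 - PAGE_BITS))  /* one entry per virtual page */

    /* one page-table entry: a valid bit and a physical page number */
    struct pte { uint32_t valid : 1, ppage : 20; };

    static struct pte page_table[NUM_VPAGES];    /* indexed by virtual page number */

    /* translate a virtual address; returns 0 and sets *pa, or -1 on a page fault */
    static int translate(uint32_t va, uint32_t *pa) {
        uint32_t vpage  = va >> PAGE_BITS;       /* high-order bits: translated */
        uint32_t offset = va & (PAGE_SIZE - 1);  /* low-order bits: unchanged   */
        if (!page_table[vpage].valid)
            return -1;                           /* not mapped: page fault      */
        *pa = ((uint32_t)page_table[vpage].ppage << PAGE_BITS) | offset;
        return 0;
    }

    int main(void) {
        page_table[5].valid = 1;                 /* map virtual page 5 ...      */
        page_table[5].ppage = 17;                /* ... to physical page 17     */
        uint32_t pa;
        if (translate(5 * PAGE_SIZE + 0x123, &pa) == 0)
            printf("physical address 0x%x\n", pa);   /* prints 0x11123 */
        return 0;
    }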
virtual memory and caches
- a virtual memory is like a cache for the disk
- virtual memory works like a write-back cache: dirty pages
are written back to disk only when they are replaced in memory
- the miss penalty is huge -- thousands to millions of cycles
- can use page protection mechanisms to implement approximate LRU and
dirty page bits if they are not available
page table
- the page table is accessed on every memory access, since all program
addresses are virtual
- simplest implementation is a table with as many entries as there are
possible pages
- a valid bit can be used to record whether the page is mapped in memory
or not
- the simplest implementation might lead to a very large table (see the
size estimate below), so
- a better implementation might only store the pages that the program
has actually used -- this can be done in different ways
- note that now every program memory access requires two memory accesses --
one to the page table, one to the actual data
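- a quick size estimate (numbers chosen only for illustration): with 32-bit
virtual addresses and 4KByte pages there are 2^32 / 2^12 = 2^20, about one
million, possible pages; at 4 bytes per entry the simple table is 4MBytes
per process, which is why the more compact implementations matter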
translation lookaside buffer
- assuming that we have locality of accesses, we can have a small
cache of page table translations
- cache entries have a valid bit, a virtual page number (the tag), and a
physical page number
- if the entry is found in the cache, the memory access can proceed without
consulting the page table
- this cache is called a translation lookaside buffer, or TLB
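- a small C sketch of a fully-associative TLB lookup (the sizes and entry
layout are made-up assumptions):

    #include <stdint.h>
    #include <stdio.h>

    #define TLB_ENTRIES 8
    #define PAGE_BITS   12

    /* one TLB entry: valid bit, virtual page number (the tag), physical page number */
    struct tlb_entry { int valid; uint32_t vpage, ppage; };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* fully associative lookup: compare the tag in every entry */
    static int tlb_lookup(uint32_t va, uint32_t *pa) {
        uint32_t vpage = va >> PAGE_BITS;
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].vpage == vpage) {
                *pa = (tlb[i].ppage << PAGE_BITS) | (va & ((1u << PAGE_BITS) - 1));
                return 1;       /* TLB hit: no page-table access needed */
            }
        return 0;               /* TLB miss: consult the page table     */
    }

    int main(void) {
        tlb[0] = (struct tlb_entry){ 1, 5, 17 };   /* virtual page 5 -> physical page 17 */
        uint32_t pa = 0;
        if (tlb_lookup(5 * 4096 + 0x123, &pa))
            printf("TLB hit, physical address 0x%x\n", pa);
        else
            printf("TLB miss, walk the page table\n");
        return 0;
    }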
cache addresses
- with a TLB and a cache, the cache can be either:
- virtually addressed cache: the cache lookup and TLB lookup can
proceed in parallel, but the cache must be flushed whenever a new
address space (a context) is used
- physically addressed cache: the cache lookup follows the TLB lookup,
with potential consequences for performance and for complexity should the
TLB lookup fail
- first- and second-level caches can be different, e.g. virtual first-level
cache and physical second-level cache
- TLB miss rates typically 0.01% to 1%
context switch
- the same virtual address for two different processes refers to
two different bytes of physical memory
- on a context switch, the hardware must know which translation to
use
- simple but expensive solution: flush the TLB and any virtually-addressed
cache on a context switch
- more complex solution: in every virtually-addressed cache (including the
TLB), keep track of a "process ID", which may have fewer bits than
the operating system's global process ID -- sketched at the end of this section
- most operating systems use physical addressing, so the TLB is not
used in kernel mode
- if the first-level cache uses virtual addressing, it must be
flushed whenever there is a system call or exception
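- a sketch of the process-ID idea (the field layout is hypothetical; the rest
of the lookup is as in the TLB sketch earlier):

    #include <stdint.h>

    /* TLB entry extended with an address-space ("process ID") tag */
    struct tlb_entry { int valid; uint32_t asid, vpage, ppage; };

    /* a hit now also requires the running process's ID to match,
       so the TLB need not be flushed on a context switch */
    static int tlb_match(const struct tlb_entry *e, uint32_t cur_asid, uint32_t vpage) {
        return e->valid && e->asid == cur_asid && e->vpage == vpage;
    }

    int main(void) {
        struct tlb_entry e = { 1, /*asid*/ 3, /*vpage*/ 5, /*ppage*/ 17 };
        return tlb_match(&e, 3, 5) ? 0 : 1;   /* matches only for process 3 */
    }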
page fault
- if the memory subsystem detects that a virtual address is not mapped,
we have a page fault, and the OS starts to execute the page fault handler
- since disks can be slow, the page fault handler generally tells the
disk to start the transfer, then schedules another process for execution
- once the disk delivers the page, the OS (on a disk interrupt) can
reschedule the page-faulting process for execution
- if there is insufficient memory or if the page size is too large,
the pages that a process uses a lot -- the process's working set --
can get swapped out while the process is waiting on a page fault. This
is called thrashing, since very little useful work gets done, but
the computer (and especially the disk) keeps very busy.
- once the page is mapped to physical memory, the computer has to
restart the instruction that had the page fault
- page faults can be quite complex to handle if a single instruction can
reference several different pages of memory
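- an outline of these steps in C (every helper below is a hypothetical
placeholder stubbed out so the sketch compiles, not a real operating system
interface):

    #include <stdint.h>

    static uint32_t find_disk_block(uint32_t vpage)        { return vpage; }
    static int      allocate_physical_page(void)           { return 17; }  /* may evict an approx-LRU victim */
    static void     start_disk_read(uint32_t blk, int pp)  { (void)blk; (void)pp; }
    static void     schedule_another_process(void)         { }

    /* the page-fault handler steps described above */
    static void page_fault_handler(uint32_t faulting_vpage) {
        uint32_t blk = find_disk_block(faulting_vpage);  /* where the page lives on disk  */
        int ppage    = allocate_physical_page();         /* free (or freed) physical page */
        start_disk_read(blk, ppage);                     /* the disk is slow: start it... */
        schedule_another_process();                      /* ...and run something else     */
        /* later, on the disk interrupt: map the page, mark the faulting process
           runnable, and restart the instruction that faulted */
    }

    int main(void) { page_fault_handler(5); return 0; }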
segmentation
- the above concepts describe paging
- an alternative way to share the computer is called segmentation
- in segmentation, a small number of registers keeps track of the
virtual-to-physical mapping, and all accesses are relative to the segment base
- segmented systems can also keep track of protections and bounds checking
- relocation is simple but expensive: the OS can move the data in
memory, then change the segment register
- weaker segmented systems, such as the original 8086's, have no protection
- segmentation has also been used to give a program access to more physical
memory than its addresses can name directly, e.g. 32-bit memory spaces for
programs with 16-bit memory addresses
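- a minimal C sketch of the base-plus-offset translation described above (the
register layout and numbers are made up; a real system would also check
permissions):

    #include <stdint.h>
    #include <stdio.h>

    /* one segment register: physical base and segment length */
    struct segment { uint32_t base, limit; };

    /* every access is relative to the segment base, with a bounds check */
    static int seg_translate(const struct segment *s, uint32_t offset, uint32_t *pa) {
        if (offset >= s->limit)
            return -1;                /* bounds violation: trap */
        *pa = s->base + offset;
        return 0;
    }

    int main(void) {
        struct segment data = { 0x40000, 0x10000 };
        uint32_t pa;
        if (seg_translate(&data, 0x1234, &pa) == 0)
            printf("physical address 0x%x\n", pa);   /* prints 0x41234 */
        return 0;
    }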
Cache Concepts
- what is the size of a cache?
- TLB: as few as 8 entries
- first-level cache: a few hundred to a few thousand bytes
- second-level cache: up to about a megabyte
- main memory: up to about a gigabyte
- what is the size of a block?
- memory: one page
- cache: 1 or a few words
bigger blocks give a bigger miss penalty and may make writes more complicated,
but, if there is spatial locality, increase the likelihood of a cache hit
- where can a block be placed?
- direct mapped: in one place
- n-way set-associative: in one of n possible locations
- fully-associative: anywhere (e.g. a virtual memory system, some TLBs)
- how is a block found? (a direct-mapped sketch follows at the end of this section)
- direct mapped: use the low order address bits
- n-way set-associative: use some of the low order bits to index
the cache, then search
- fully-associative: search, or (e.g. in a virtual memory system)
find in a full lookup table
- which block is replaced?
- LRU: the one that was used least recently
- approximate LRU: one of the blocks that were used least recently
- random: any one -- if it is frequently used, it will come back
into the cache soon
- how is data written?
- write-through: immediately write both the cache and the backing store
(main memory)
- write-back: immediately write to the cache, only write to backing store
when replacing the block
- write-around: only write to the backing store
a write to a cache with a block size greater than one word may require
reading the rest of the block first, unless we can keep track of the
individual words (or bytes) written
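- a small C sketch of the direct-mapped case above (the geometry is a made-up
assumption), including the dirty bit that a write-back policy needs:

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_LINES  64    /* 64 lines of 4 words each: a 1KByte cache */
    #define WORDS_LINE 4

    /* one line: valid bit, tag, data, and a dirty bit for write-back */
    struct line { int valid, dirty; uint32_t tag; uint32_t data[WORDS_LINE]; };

    static struct line cache[NUM_LINES];

    /* how a block is found: offset from the lowest bits, index from the next
       low-order bits, tag from the remaining high-order bits */
    static int cache_hit(uint32_t addr) {
        uint32_t index = (addr >> 4) & (NUM_LINES - 1);  /* which line to look in */
        uint32_t tag   =  addr >> 10;                    /* identifies the block  */
        return cache[index].valid && cache[index].tag == tag;
    }

    int main(void) {
        cache[3].valid = 1;
        cache[3].tag   = 0x1234;
        printf("hit: %d\n", cache_hit((0x1234u << 10) | (3u << 4)));  /* prints 1 */
        return 0;
    }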
Cache Model
- compulsory misses: needed to initially fill the cache, reduced by
increased block size
- capacity misses: needed to read words again that would have
been in cache had the cache been large enough
- conflict/collision misses: caused by insufficient associativity
Memory history
- mercury delay lines would present the data every T ms, working
like a rotating drum (but less reliably)
- magnetic core storage: very reliable, nonvolatile, randomly
addressable, but relatively slow and expensive
- semiconductor storage: very reliable, volatile, very fast
- memory density is increasing very quickly, as is processor speed,
so more caching is foreseeable
- Pentium Pro: second-level cache is on a separate chip in the same
package: low delay, high bandwidth (higher cost?)