Outline
- cache performance
- multi-level caches
- virtual memory
- page table
- translation lookaside buffer
- context switch
- page fault
- segmentation
- cache concepts
cache performance data
- we can measure or compute:
- miss rate for both instruction cache and data cache:
e.g. 5% miss rate --
should be measured on actual programs, perhaps on a simulator
- miss cost, in either time (100ns) or clock cycles (50 cycles) --
can be measured, simulated, or computed
- the CPI for the same machine and same program,
with a cache large enough to give 100% hit rate -- usually from simulation
- this data can be used to compute the overall CPI
CPI computation for caches
- inputs: CPIzero for infinitely fast cache, miss penalty
MissPenalty, and miss rates IMissRate and DMissRate
- a fraction IMissRate of the instructions has to wait an additional
MissPenalty clock cycles, so
- the CPI considering only instruction misses is
CPI = CPIzero + IMissRate x MissPenalty
- example: CPIzero = 2 clocks/instruction,
IMissRate = 3% = 0.03, MissPenalty = 40 clocks, then
CPI = 2 + 0.03 * 40 = 2 + 1.2 = 3.2 cycles per instruction
- if DDataInstructions is the fraction of instructions
accessing the data memory, the data miss cost per instruction is
DMissRate * DDataInstructions * MissPenalty
- in the above example, if data instructions are 30% of the instructions
and the data miss rate is 5%, we have
additional CPI = 0.3 * 0.05 * 40 = 0.6 cycles per instruction
- combining the two gives a CPI with cache misses of 3.8 cycles per
instruction
- book uses the same formulas, but multiplying by I throughout, where I is
the number of instructions
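- a minimal C sketch of the CPI computation above, using the example numbers
(the variable names are mine, not from the book):

    #include <stdio.h>

    int main(void) {
        double cpi_zero     = 2.0;   /* CPI with a perfect (100% hit) cache */
        double i_miss_rate  = 0.03;  /* instruction-cache miss rate */
        double d_miss_rate  = 0.05;  /* data-cache miss rate */
        double d_fraction   = 0.30;  /* fraction of instructions accessing data memory */
        double miss_penalty = 40.0;  /* clock cycles per miss */

        double i_stall = i_miss_rate * miss_penalty;              /* 1.2 */
        double d_stall = d_fraction * d_miss_rate * miss_penalty; /* 0.6 */

        printf("CPI with cache misses: %.1f\n", cpi_zero + i_stall + d_stall); /* 3.8 */
        return 0;
    }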
multi-level caches
- clearly, reducing the cost of a miss would be beneficial
- a second-level cache can (on average) reduce this cost
- the second-level cache:
- can have higher hit time than the first-level cache
- can be much larger than the first-level cache
- can use higher associativity
- can be useful even if it only has a relatively small hit ratio, e.g. 50%
- if
- the first level cache has miss ratio M1, and
- the second level cache has miss ratio M2, and
- the cost of a level-2 miss is C2 cycles, and
- the cost of a level-1 miss (i.e. a level-2 hit) is C1
then the overall cost, in CPI, for a two-level finite cache is
M1 * C1 + M1 * M2 * C2
- Note that for a one-level cache, the cost would be
M1 * C2
assuming that the cost of accessing main memory remains C2
- the stall CPI just computed is simply added to the no-memory-stall CPI
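- a small C sketch of these formulas (the example numbers are made-up
assumptions, only to exercise the code; it can also be used to check the
in-class exercise that follows):

    #include <stdio.h>

    /* memory-stall CPI for a two-level cache: M1*C1 + M1*M2*C2,
       where M2 is the local miss ratio of the second-level cache */
    static double two_level_stall(double m1, double c1, double m2, double c2) {
        return m1 * c1 + m1 * m2 * c2;
    }

    /* for comparison, a one-level cache with miss ratio M1 and miss cost C2 */
    static double one_level_stall(double m1, double c2) {
        return m1 * c2;
    }

    int main(void) {
        printf("one-level stall CPI: %.2f\n", one_level_stall(0.04, 40.0));
        printf("two-level stall CPI: %.2f\n", two_level_stall(0.04, 6.0, 0.5, 40.0));
        return 0;   /* either value is added to the no-memory-stall CPI */
    }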
In-class exercise
- three architectures, all (unless otherwise noted)
with a CPI=1 if there are no memory stalls:
- single-level cache, with 40 cycle miss penalty, 2% misses
- two-level cache, with 5 cycle miss penalty to the second-level cache
and a further 40 cycle penalty when the second-level cache misses (so that
a second-level cache miss really stalls for 45 cycles). First-level cache
miss rate is 2%, second-level cache miss rate is 50%
- single-level cache, with 40 cycle miss penalty, 1% misses, but
a no-memory-stall CPI=5
- which machine has the best overall CPI? For simplicity, consider
only instruction fetches, i.e. there is exactly one memory access per
instruction
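- one way to check your answers afterwards (a C sketch, with the numbers
taken from the exercise statement):

    #include <stdio.h>

    int main(void) {
        double cpi1 = 1.0 + 0.02 * 40.0;                      /* single-level cache */
        double cpi2 = 1.0 + 0.02 * 5.0 + 0.02 * 0.50 * 40.0;  /* two-level cache    */
        double cpi3 = 5.0 + 0.01 * 40.0;                      /* higher base CPI    */
        printf("machine 1: %.2f\nmachine 2: %.2f\nmachine 3: %.2f\n",
               cpi1, cpi2, cpi3);
        return 0;
    }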
virtual memory
- when multiple programs are running at the same time, we need:
- protection: prevent a program from accessing other programs' data
- relocation: let the operating system move data or code pages around,
perhaps to temporarily make room for a higher-priority program
virtual memory mechanisms
- the addresses used by the programs do not correspond to fixed memory
locations: they are virtual addresses
- both physical and virtual memory are split up into blocks called pages,
usually of size somewhere between 4KBytes and 256MBytes
- when a program is executing, its pages are typically mapped so that
each page of virtual memory is actually stored in (mapped to) a page
in physical memory -- the low-order bits of the virtual and physical
addresses will be the same, while the high-order bits have no correlation
- a directly indexed page table, usually stored in main
memory, provides the translation between virtual addresses and physical
addresses
- when a program is not executing, its virtual pages may be stored on
disk instead of being mapped to physical memory
- the memory subsystem can detect when a virtual address is not mapped,
and trigger an exception, called a page fault
- on a page fault, the operating system finds the appropriate data on
disk, locates (or allocates) a free physical page, and maps the virtual
page to the new physical page
- LRU or approximate LRU can be used to free pages
- whenever a virtual address is translated to a physical address,
the permissions are checked, ensuring protection
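- a minimal C sketch of this translation, assuming 32-bit virtual addresses,
4KByte pages, and a directly indexed table (the entry layout is an
assumption, not any particular machine's; a real entry would also hold the
protection bits checked above):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS  12                        /* 4KByte pages */
    #define PAGE_SIZE  (1u << PAGE_BITS)
    #define NUM_VPAGES (1u << (32 - PAGE_BITS))  /* one entry per virtual page */

    /* one page-table entry: a valid bit and a physical page number */
    struct pte { uint32_t valid : 1, ppage : 20; };

    static struct pte page_table[NUM_VPAGES];    /* indexed by virtual page number */

    /* translate a virtual address; returns 0 and sets *pa, or -1 on a page fault */
    static int translate(uint32_t va, uint32_t *pa) {
        uint32_t vpage  = va >> PAGE_BITS;       /* high-order bits: translated */
        uint32_t offset = va & (PAGE_SIZE - 1);  /* low-order bits: unchanged   */
        if (!page_table[vpage].valid)
            return -1;                           /* not mapped: page fault      */
        *pa = ((uint32_t)page_table[vpage].ppage << PAGE_BITS) | offset;
        return 0;
    }

    int main(void) {
        page_table[5].valid = 1;                 /* map virtual page 5 ...      */
        page_table[5].ppage = 17;                /* ... to physical page 17     */
        uint32_t pa;
        if (translate(5 * PAGE_SIZE + 0x123, &pa) == 0)
            printf("physical address 0x%x\n", pa);   /* prints 0x11123 */
        return 0;
    }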
virtual memory and caches
- a virtual memory is like a cache for the disk
- virtual memory works like a write-back cache: dirty pages
are written back to disk only when they are replaced in memory
- the miss penalty is huge -- thousands to millions of cycles
- can use page protection mechanisms to implement approximate LRU and
dirty page bits if they are not available
page table
- the page table is accessed on every memory access, since all program
addresses are virtual
- simplest implementation is a table with as many entries as there are
possible pages
- a valid bit can be used to record whether the page is mapped in memory
or not
- the simplest implementation might lead to a very large table (see the
size estimate below), so
- a better implementation might only store the pages that the program
has actually used -- this can be done in different ways
- note that now every program memory access requires two memory accesses --
one to the page table, one to the actual data
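- a quick size estimate (numbers chosen only for illustration): with 32-bit
virtual addresses and 4KByte pages there are 2^32 / 2^12 = 2^20, about one
million, possible pages; at 4 bytes per entry the simple table is 4MBytes
per process, which is why the more compact implementations matter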
translation lookaside buffer
- assuming that we have locality of accesses, we can have a small
cache of page table translations
- cache entries have a valid bit, a virtual page number (the tag), and a
physical page number
- if the entry is found in the cache, the memory access can proceed without
consulting the page table
- this cache is called a translation lookaside buffer, or TLB
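- a small C sketch of a fully-associative TLB lookup (the sizes and entry
layout are made-up assumptions):

    #include <stdint.h>
    #include <stdio.h>

    #define TLB_ENTRIES 8
    #define PAGE_BITS   12

    /* one TLB entry: valid bit, virtual page number (the tag), physical page number */
    struct tlb_entry { int valid; uint32_t vpage, ppage; };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* fully associative lookup: compare the tag in every entry */
    static int tlb_lookup(uint32_t va, uint32_t *pa) {
        uint32_t vpage = va >> PAGE_BITS;
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].vpage == vpage) {
                *pa = (tlb[i].ppage << PAGE_BITS) | (va & ((1u << PAGE_BITS) - 1));
                return 1;       /* TLB hit: no page-table access needed */
            }
        return 0;               /* TLB miss: consult the page table     */
    }

    int main(void) {
        tlb[0] = (struct tlb_entry){ 1, 5, 17 };   /* virtual page 5 -> physical page 17 */
        uint32_t pa = 0;
        if (tlb_lookup(5 * 4096 + 0x123, &pa))
            printf("TLB hit, physical address 0x%x\n", pa);
        else
            printf("TLB miss, walk the page table\n");
        return 0;
    }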
cache addresses
- with a TLB and a cache, the cache can be either:
- virtually addressed cache: the cache lookup and TLB lookup can
proceed in parallel, but the cache must be flushed whenever a new
address space (a context) is used
- physically addressed cache: the cache lookup follows the TLB lookup,
with potential consequences for performance and for complexity should the
TLB lookup fail
- first- and second-level caches can be different, e.g. virtual first-level
cache and physical second-level cache
- TLB miss rates typically 0.01% to 1%
context switch
- the same virtual address for two different processes refers to
two different bytes of physical memory
- on a context switch, the hardware must know which translation to
use
- simple but expensive solution: flush the TLB and any virtually-addressed
cache on a context switch
- more complex solution: in every virtually-addressed cache (including the
TLB), keep track of a "process ID", which may have fewer bits than
the operating system's global process ID -- sketched at the end of this section
- most operating systems use physical addressing, so the TLB is not
used in kernel mode
- if the first-level cache uses virtual addressing, it must be
flushed whenever there is a system call or exception
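- a sketch of the process-ID idea (the field layout is hypothetical; the rest
of the lookup is as in the TLB sketch earlier):

    #include <stdint.h>

    /* TLB entry extended with an address-space ("process ID") tag */
    struct tlb_entry { int valid; uint32_t asid, vpage, ppage; };

    /* a hit now also requires the running process's ID to match,
       so the TLB need not be flushed on a context switch */
    static int tlb_match(const struct tlb_entry *e, uint32_t cur_asid, uint32_t vpage) {
        return e->valid && e->asid == cur_asid && e->vpage == vpage;
    }

    int main(void) {
        struct tlb_entry e = { 1, /*asid*/ 3, /*vpage*/ 5, /*ppage*/ 17 };
        return tlb_match(&e, 3, 5) ? 0 : 1;   /* matches only for process 3 */
    }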
page fault
- if the memory subsystem detects that a virtual address is not mapped,
we have a page fault, and the OS starts to execute the page fault handler
- since disks can be slow, the page fault handler generally tells the
disk to start the transfer, then schedules another process for execution
- once the disk delivers the page, the OS (on a disk interrupt) can
reschedule the page-faulting process for execution
- if there is insufficient memory or if the page size is too large,
the pages that a process uses a lot -- the process's working set --
can get swapped out while the process is waiting on a page fault. This
is called thrashing, since very little useful work gets done, but
the computer (and especially the disk) keeps very busy.
- once the page is mapped to physical memory, the computer has to
restart the instruction that had the page fault
- page faults can be quite complex to handle if a single instruction can
reference several different pages of memory
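- an outline of these steps in C (every helper below is a hypothetical
placeholder stubbed out so the sketch compiles, not a real operating system
interface):

    #include <stdint.h>

    static uint32_t find_disk_block(uint32_t vpage)        { return vpage; }
    static int      allocate_physical_page(void)           { return 17; }  /* may evict an approx-LRU victim */
    static void     start_disk_read(uint32_t blk, int pp)  { (void)blk; (void)pp; }
    static void     schedule_another_process(void)         { }

    /* the page-fault handler steps described above */
    static void page_fault_handler(uint32_t faulting_vpage) {
        uint32_t blk = find_disk_block(faulting_vpage);  /* where the page lives on disk  */
        int ppage    = allocate_physical_page();         /* free (or freed) physical page */
        start_disk_read(blk, ppage);                     /* the disk is slow: start it... */
        schedule_another_process();                      /* ...and run something else     */
        /* later, on the disk interrupt: map the page, mark the faulting process
           runnable, and restart the instruction that faulted */
    }

    int main(void) { page_fault_handler(5); return 0; }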
segmentation
- the above concepts describe paging
- an alternative way to share the computer is called segmentation
- in segmentation, a small number of registers keeps track of the
virtual-to-physical mapping, and all accesses are relative to the segment base
- segmented systems can also keep track of protections and bounds checking
- relocation is simple but expensive: the OS can move the data in
memory, then change the segment register
- weaker segmented systems, such as the original 8086's, have no protection
- segmentation has also been used to give a program access to more physical
memory than its addresses can name directly, e.g. 32-bit memory spaces for
programs with 16-bit memory addresses
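- a minimal C sketch of the base-plus-offset translation described above (the
register layout and numbers are made up; a real system would also check
permissions):

    #include <stdint.h>
    #include <stdio.h>

    /* one segment register: physical base and segment length */
    struct segment { uint32_t base, limit; };

    /* every access is relative to the segment base, with a bounds check */
    static int seg_translate(const struct segment *s, uint32_t offset, uint32_t *pa) {
        if (offset >= s->limit)
            return -1;                /* bounds violation: trap */
        *pa = s->base + offset;
        return 0;
    }

    int main(void) {
        struct segment data = { 0x40000, 0x10000 };
        uint32_t pa;
        if (seg_translate(&data, 0x1234, &pa) == 0)
            printf("physical address 0x%x\n", pa);   /* prints 0x41234 */
        return 0;
    }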
Cache Concepts
- what is the size of a cache?
- TLB: as few as 8 entries
- first-level cache: a few hundred to a few thousand bytes
- second-level cache: up to about a megabyte
- main memory: up to about a gigabyte
- what is the size of a block?
- memory: one page
- cache: 1 or a few words
bigger blocks give a bigger miss penalty and may make writes more complicated,
but, if there is spatial locality, increase the likelihood of a cache hit
- where can a block be placed?
- direct mapped: in one place
- n-way set-associative: in one of n possible locations
- fully-associative: anywhere (e.g. a virtual memory system, some TLBs)
- how is a block found? (a direct-mapped sketch follows at the end of this section)
- direct mapped: use the low order address bits
- n-way set-associative: use some of the low order bits to index
the cache, then search
- fully-associative: search, or (e.g. in a virtual memory system)
find in a full lookup table
- which block is replaced?
- LRU: the one that was used least recently
- approximate LRU: one of the blocks that were used least recently
- random: any one -- if it is frequently used, it will come back
into the cache soon
- how is data written?
- write-through: immediately write both the cache and the backing store
(main memory)
- write-back: immediately write to the cache, only write to backing store
when replacing the block
- write-around: only write to the backing store
a write to a cache with a block size greater than one word may require
reading the rest of the block first, unless we can keep track of the
individual words (or bytes) written
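- a small C sketch of the direct-mapped case above (the geometry is a made-up
assumption), including the dirty bit that a write-back policy needs:

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_LINES  64    /* 64 lines of 4 words each: a 1KByte cache */
    #define WORDS_LINE 4

    /* one line: valid bit, tag, data, and a dirty bit for write-back */
    struct line { int valid, dirty; uint32_t tag; uint32_t data[WORDS_LINE]; };

    static struct line cache[NUM_LINES];

    /* how a block is found: offset from the lowest bits, index from the next
       low-order bits, tag from the remaining high-order bits */
    static int cache_hit(uint32_t addr) {
        uint32_t index = (addr >> 4) & (NUM_LINES - 1);  /* which line to look in */
        uint32_t tag   =  addr >> 10;                    /* identifies the block  */
        return cache[index].valid && cache[index].tag == tag;
    }

    int main(void) {
        cache[3].valid = 1;
        cache[3].tag   = 0x1234;
        printf("hit: %d\n", cache_hit((0x1234u << 10) | (3u << 4)));  /* prints 1 */
        return 0;
    }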
Cache Model
- compulsory misses: needed to initially fill the cache, reduced by
increased block size
- capacity misses: needed to read words again that would have
been in cache had the cache been large enough
- conflict/collision misses: caused by insufficient associativity
Memory history
- mercury delay lines would present the data every T ms, working
like a rotating drum (but less reliably)
- magnetic core storage: very reliable, nonvolatile, randomly
addressable, but relatively slow and expensive
- semiconductor storage: very reliable, volatile, very fast
- memory density is increasing very quickly, as is processor speed,
so more caching is foreseeable
- Pentium Pro: second-level cache is on a separate chip in the same
package: low delay, high bandwidth (higher cost?)