Outline
- Performance scaling
- Amdahl's law
- Benchmark evolution
Performance scaling
- what happens if you double processor speed?
- the answer depends on the benchmarks and on the rest of the system
- given the benchmarks, the result depends on the rest of the system
- if the system is CPU bound, the benchmark performance should
double
- if the system is memory bound, the benchmark performance might
not change at all (both cases are worked through in the sketch at the end
of this section)
- most systems are somewhere in-between:
- reasonable designers put effort/money where it will do some good
- if increasing the CPU speed does no good, no point in spending the
extra money
- while performance scaling is in many cases close to linear
(even if the proportionality factor is less than 1), this is not necessarily the case
- a memory bound system can be improved by adding well-designed
cache memories
- memory hierarchy: the main memory is a cache for the disk, the
cache caches the memory, the registers cache the cache
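- a minimal sketch (Python, with made-up times and fractions) of how the
benefit of doubling CPU speed depends on how CPU bound the system is:

    # hypothetical workload: a fraction of the run time is CPU bound, the rest
    # is memory bound and does not benefit from a faster CPU
    def scaled_runtime(total_time, cpu_fraction, cpu_speedup):
        cpu_time = total_time * cpu_fraction
        other_time = total_time * (1.0 - cpu_fraction)
        return cpu_time / cpu_speedup + other_time

    for cpu_fraction in (1.0, 0.8, 0.5, 0.0):  # fully CPU bound ... fully memory bound
        new_time = scaled_runtime(10.0, cpu_fraction, 2.0)  # double the CPU speed
        print(f"CPU-bound fraction {cpu_fraction:.1f}: 10.0 s -> {new_time:.1f} s")

- fully CPU bound: the time halves; fully memory bound: no change; anything
in between gives a partial benefit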
Amdahl's Law
- the law:
- if component A takes x% of the time
- no matter how much we improve component A,
- we cannot save more than x% of the time
- sounds straightforward, but easy to forget or overlook
- this means: spend time and effort improving components that
take large percentages of the time (also applies to programming)
- often results in different components later taking a larger
percentage of the time, making them better candidates for optimization
- speedup: ratio of performance after to performance before improvement
- equivalently: ratio of time before to time after improvement
- if it takes 10 seconds before improvement and 9 seconds after, the speedup
is 10/9 (about 1.11), i.e. an 11% improvement (worked out in the sketch below)
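- a sketch of both calculations in Python, writing f for the fraction of time
spent in component A and s for the factor by which A is improved:

    # Amdahl's law: if component A accounts for a fraction f of the time and is
    # sped up by a factor s, the overall speedup is 1 / ((1 - f) + f / s)
    def overall_speedup(f, s):
        return 1.0 / ((1.0 - f) + f / s)

    # even an enormous speedup of A cannot save more than f of the time:
    print(overall_speedup(0.25, 1e9))  # ~1.33, bounded by 1 / (1 - 0.25)

    # speedup as time before / time after: 10 s -> 9 s
    print(10.0 / 9.0)                  # ~1.11, about an 11% improvement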
No hardware-independent metrics
- it would be nice to have a way to predict performance from
general considerations
- for example, larger code sizes for the same program often
imply slower execution
- however, this is not always true
- in modern processors (e.g. RISC), larger code size can go together with
higher processing speed
- measured (or computed) execution time really is the best measure
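- a minimal sketch of measuring execution time directly (the workload here is
a made-up stand-in for a real benchmark program):

    import time

    def toy_workload(n):
        # stand-in for a real benchmark program
        return sum(i * i for i in range(n))

    start = time.perf_counter()
    toy_workload(1_000_000)
    elapsed = time.perf_counter() - start
    print(f"execution time: {elapsed:.3f} s")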
Not useful: MIPS
- Million Instructions Per Second
- different instructions do different things!
- different instructions have different CPI, so different
programs on the same machine have different MIPS
- a program which executes more (simpler) instructions and takes more time
can still show a higher MIPS rating!
- also referred to as MOPS
- sometimes peak MIPS are quoted, which is even less helpful
- relative MIPS: given a time t1 on a reference n-MIP machine and t2
on the target machine,
relative MIPS = (t1 / t2) * n
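- a small worked example of the formula above (the times are made up; the
reference machine is rated at 1 MIPS):

    # relative MIPS = (t1 / t2) * n, with t1 measured on the n-MIP reference machine
    def relative_mips(t1, t2, n):
        return (t1 / t2) * n

    # made-up numbers: 50 s on the 1-MIP reference machine, 5 s on the
    # machine under test -> rated at 10 relative MIPS
    print(relative_mips(50.0, 5.0, 1.0))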
Marginally useful: MFLOPS
- some user programs (typically used on supercomputers) have a need
to perform a certain number of floating point operations
- hence, machines that perform more FLoating point Operations Per Second
are more desirable
- problem: some computers might take more or less time for different
floating point operations, so the mix of operations may affect the number
of MFLOPS
- problem: if an operation (e.g. square root) can be implemented either in
hardware or in software, the number of floating point operations will
not be the same on different machines for a given program
- problem: says nothing about the performance of programs that
are not floating-point intensive
- problem: peak MFLOPS...
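- a sketch of how MFLOPS is computed for a toy kernel; only the nominal
multiplies and adds are counted, which is exactly why the operation mix and
software-implemented operations (the problems above) distort the number:

    import time

    def dot(a, b):
        # nominal floating point work: one multiply and one add per element
        s = 0.0
        for x, y in zip(a, b):
            s += x * y
        return s

    n = 1_000_000
    a = [1.0] * n
    b = [2.0] * n

    start = time.perf_counter()
    dot(a, b)
    elapsed = time.perf_counter() - start

    flop_count = 2 * n  # nominal count: n multiplies + n adds
    print(f"{flop_count / elapsed / 1e6:.1f} MFLOPS")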
Benchmark evolution
- the VAX-11/780 was considered a 1-MIP machine
- running programs on our machine and the VAX-11/780 gives us a number
of MIPS, but:
- the VAX-11/780 was later reclassified as a 1/2-MIP machine
- do you have access to a VAX? the baseline machine needs to be available
for comparison
- as compilers evolve, do you evolve the
baseline compiler? (in-class discussion)
- continuously evolving numbers: hard to maintain a good historical perspective
- SPEC uses a mix of programs: SPEC (1988), SPEC 92, SPEC 95, SPEC 2000
(a sketch of SPEC-style scoring follows at the end of this section)
- SPEC synthetic benchmarks which include I/O:
- SPEC system development multitasking: compiling, editing,
execution, system commands
- SPEC system level file server: file server performance
- manufacturers need to look good, consumers need accurate information
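- a sketch of SPEC-style scoring (not the exact SPEC procedure; program names
and times are made up): each program's time is normalized to the reference
machine and the ratios are combined with a geometric mean:

    from math import prod

    # reference-machine and measured times per benchmark program, in seconds
    ref_times      = {"progA": 100.0, "progB": 80.0, "progC": 120.0}
    measured_times = {"progA": 25.0,  "progB": 40.0, "progC": 30.0}

    ratios = [ref_times[p] / measured_times[p] for p in ref_times]
    score = prod(ratios) ** (1.0 / len(ratios))
    print(f"per-program ratios: {ratios}")
    print(f"overall score (geometric mean): {score:.2f}")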
Other benchmarks
- synthetic benchmarks:
- whetstone: scientific and engineering programs, first in Algol, now
in Fortran
- dhrystone: systems programs, first in Ada, now in C
- program kernels:
- livermore loops: 21 small loop fragments,
representative of supercomputer loads
- linpack: linear algebra benchmark
- small programs:
- sieve of Eratosthenes
- quicksort
- do not really reflect real workloads
- major drawback: too easy to optimize, and these optimizations do not
benefit other programs
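- one of the small programs above, the sieve of Eratosthenes, as a Python
sketch; a kernel this small is exactly the kind of code a compiler can
special-case, which is the drawback noted above:

    def sieve(limit):
        # sieve of Eratosthenes: returns all primes <= limit
        is_prime = [True] * (limit + 1)
        is_prime[0:2] = [False, False]
        for i in range(2, int(limit ** 0.5) + 1):
            if is_prime[i]:
                for j in range(i * i, limit + 1, i):
                    is_prime[j] = False
        return [i for i, prime in enumerate(is_prime) if prime]

    print(len(sieve(10_000)))  # 1229 primes below 10,000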