Slides
Overview
- file system implementation
File System Implementation
- keeping track of which blocks of data belong to which file:
- contiguous allocation is simple and makes sequential or random
access fast, but requires knowing the file size when allocating
and may cause external fragmentation, where free space is available
but not usable (until defragmentation)
- linked list allocation makes sequential access fast, but
random access is very slow, and the data portion of each block is no
longer a power of two in size, since each block must also hold a
pointer to the next block
- linked list allocation in memory keeps a global table with one
pointer per block (the table is in memory while running, and frequently
backed up on disk): sequential and random access are fast and block
data sizes are a power of two, but the memory table may be large (a 4GB
disk with 16KB blocks and 4-byte block pointers requires 1MB of RAM);
see the sketch after this list
- i-node (index node) allocation keeps a per-file table of blocks,
sequentially ordered, which stays on disk until the file is opened. The
inode may also store the file attributes, and some of the entries may
point to indirect blocks which contain further pointers. Fast for
sequential and random access, it uses little memory or disk space and
does not cause fragmentation
- the last two algorithms could be combined, with an inode containing
just a pointer to the part of the in-memory table where the file begins
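- a minimal sketch in C of the in-memory table for linked list
allocation, as promised above; the table size and end-of-file marker
are invented for illustration:

    #define NBLOCKS 1024          /* hypothetical number of disk blocks */
    #define FAT_EOF  (-1)         /* marks the last block of a file */

    static int table[NBLOCKS];    /* table[b] = next block of the file, or FAT_EOF */

    /* random access to the nth block of a file is n lookups in the
       in-memory table, not n disk reads, so it is fast */
    int nth_block(int first, int n)
    {
        int b = first;
        while (n-- > 0 && b != FAT_EOF)
            b = table[b];
        return b;                 /* FAT_EOF if the file has fewer than n+1 blocks */
    }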
Unix Inodes
- keeping track of which blocks of data belong to which file:
- some versions of Unix used i-nodes with 13 block addresses per inode
- this gives the inode a constant size, small enough to fit in a block
- if more than 10 data blocks are needed, the 11th pointer is a single
indirect block, i.e. points to a block containing addresses of data blocks
- if this is not sufficient, the 12th pointer is a double indirect
block, i.e. points to a block containing the addresses of single
indirect blocks, each of which contains addresses of data blocks
- the 13th and last address points to a triple-indirect block
- if each indirect block holds the addresses of up to 64 other blocks
of 512 bytes each, what is the maximum file size (in bytes) in this system?
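- one possible answer, assuming 10 direct pointers plus the single,
double, and triple indirect blocks, each indirect block holding 64
addresses: (10 + 64 + 64^2 + 64^3) blocks of 512 bytes each, i.e.
(10 + 64 + 4096 + 262144) x 512 = 266,314 x 512 = 136,352,768 bytes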
Pseudo File Systems
- a device file is really a pseudo file -- it does not correspond
to space on disk
- a file system can implement arbitrary access to internal data
structures -- the simplest example is the RAM disk
- a file system can also be used to support access to kernel-internal
data structures, e.g. in Linux the /proc and sysfs file systems,
and in many Unix-like systems /dev/mem and /dev/kmem (see the example
below)
- for example, to access the USB on a specific Linux system I have
to mount /proc/bus/usb
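- a minimal example in C of reading a pseudo file: /proc/meminfo is a
standard Linux path, but its contents are generated by the kernel on
each read and correspond to no disk space:

    #include <stdio.h>

    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/meminfo", "r");   /* ordinary file operations */
        if (f == NULL)
            return 1;
        while (fgets(line, sizeof line, f) != NULL)
            fputs(line, stdout);                 /* kernel data, not disk data */
        fclose(f);
        return 0;
    }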
Directory Implementations
- fixed-size directory entries are simple, but either
- are too limited in the length of the file name that is supported, or
- waste too much space
- MS-DOS uses fixed-size (32-byte) directory entries with 8 bytes
for the file name and 3 for the extension, 10 unused bytes, and space
for the size, time, and date
- Unix stores the inode number and the file name in variable-sized
entries (see the sketch below):
- management of directory entry deletion is more complicated, unless
- space is compacted after each deletion, which makes deletion from
big directories very slow
- Unix directories are stored in files, i.e. inodes point to the blocks
on disk holding the directory entries
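- a sketch in C of a variable-sized Unix directory entry, loosely
modeled on the ext2 on-disk format; the field names and sizes here are
illustrative, not the exact ext2 layout:

    #include <stdint.h>

    struct dir_entry {
        uint32_t inode;      /* inode number of the file */
        uint16_t rec_len;    /* total length of this entry; a deleted entry
                                can be absorbed by growing the rec_len of
                                the previous entry, avoiding compaction */
        uint8_t  name_len;   /* length of the name that follows */
        char     name[];     /* file name, not NUL-terminated on disk */
    };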
Disk Space Management
- similar to (virtual) memory management, e.g. block size selection,
keeping track of free blocks
- must select a block size
- smaller block sizes waste less space:
- small files leave most of a block unused
- even large files typically leave about half of the last block unused
- larger blocks are faster (for accessing large files or large
directories), since a single seek results in reading or writing more data
- median file size reported as 1K (1984) or 12K-15K (1997), so a disk
block should not be dramatically larger than this
- Linux tries to allocate blocks nearly sequentially (i.e. as close as
possible) to try to improve time to access a large file without taking
more space than needed
- the Berkeley Fast File System uses small blocks for small files
and large blocks for large files, which is very efficient but more complicated
- free blocks are kept track of as either linked lists or bitmaps
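- a minimal sketch in C of free-block tracking with a bitmap, including
allocation near a goal block as Linux does; the sizes are invented for
illustration:

    #include <stdint.h>

    #define NBLOCKS 4096
    static uint8_t freemap[NBLOCKS / 8];    /* one bit per block, 1 = free */

    int  block_is_free(int b) { return (freemap[b / 8] >> (b % 8)) & 1; }
    void mark_used(int b)     { freemap[b / 8] &= ~(1u << (b % 8)); }

    /* allocate as close to the goal block as possible, so that blocks of
       the same file end up nearly sequential on disk */
    int alloc_near(int goal)
    {
        for (int d = 0; d < NBLOCKS; d++) {
            int b = (goal + d) % NBLOCKS;   /* scan forward from the goal */
            if (block_is_free(b)) {
                mark_used(b);
                return b;
            }
        }
        return -1;                          /* no free block left */
    }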
File System Reliability
- any given block may be (or may become) bad, that is, unable to retain
data
- hard disk controllers often
- are configured with lists of bad blocks
- may check for bad blocks (corrupted CRCs) during normal accesses
- have spare tracks, or spare blocks in each track/cylinder
used when the CPU tries to access one of the bad blocks
- seek time to a spare track is usually much larger; seek time to a
spare block in the same track/cylinder is usually quite small
- most commonly, controllers only know about bad blocks that
are present when the disk leaves the factory
- file system could also build a special file containing all the bad
blocks, and could grow this file as more bad blocks are found
Backups
- backups are essential for any data that is hard to replace:
- to external medium such as tape, CD, or a remote file system (e.g. on
another computer)
- to another disk within the same system, e.g. by mirroring
- with redundant storage such as RAID, where the same data is
automatically written to multiple disks, or the data is written to
several disks and check bits or check words are written to a number of
other disks (see the parity sketch at the end of this section)
- backups may be incremental, only saving data changed since the last
backup -- this makes backups much faster and more comprehensive
(several snapshots are available going back through time), but
restores are harder and slower (when did you create this file?)
- it is a good idea to protect users against user errors, e.g.
copy files to a "wastebasket" directory rather than delete them
(VMS automatically created backups of every modified file, e.g.
with version numbers ";1", ";2", etc, which could be cleared
with a fairly safe purge command)
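- a minimal sketch in C of RAID-style check words, as mentioned above:
the parity disk holds the XOR of the corresponding words on the data
disks, so the contents of any one lost disk can be rebuilt from the
others; the disk count and block size are invented for illustration:

    #include <stddef.h>
    #include <stdint.h>

    #define NDATA 4          /* hypothetical number of data disks */
    #define WORDS 128        /* 32-bit words per block */

    void compute_parity(const uint32_t data[NDATA][WORDS], uint32_t parity[WORDS])
    {
        for (size_t w = 0; w < WORDS; w++) {
            uint32_t p = 0;
            for (int d = 0; d < NDATA; d++)
                p ^= data[d][w];             /* XOR across all data disks */
            parity[w] = p;
        }
    }

    /* rebuilding a failed disk is the same XOR, taken over the surviving
       data disks and the parity disk */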
Bad Block Checking
- the controller usually has indications of errors during reading
or writing. The most common response is to try again up to a maximum
number of times
- a read-only test for a bad block simply reads each block
multiple times, and reports the block as bad if the CRC does not verify
or if not all reads return the same result
- a read-write test for a bad block saves the content of the
block, then writes new patterns on it and tests to make sure they can
be read again (sketched after this list)
- all this requires at least some cooperation from the controller to
prevent reading/writing cached versions of the block (which are normally
correct), or requires reading/writing more blocks than the controller
can cache
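- a sketch in C of the read-write test described above: save the block,
write a pattern, read it back and compare, then restore the original
content; this assumes a file descriptor for the raw device and that the
controller's cache can be bypassed:

    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    int block_is_bad(int fd, off_t offset, size_t blksize)
    {
        char *save = malloc(blksize), *pat = malloc(blksize), *back = malloc(blksize);
        int bad = 1;
        if (save && pat && back &&
            pread(fd, save, blksize, offset) == (ssize_t)blksize) {
            memset(pat, 0xAA, blksize);          /* one pattern; a real test uses several */
            if (pwrite(fd, pat, blksize, offset) == (ssize_t)blksize &&
                pread(fd, back, blksize, offset) == (ssize_t)blksize &&
                memcmp(pat, back, blksize) == 0)
                bad = 0;                         /* the pattern survived */
            pwrite(fd, save, blksize, offset);   /* restore the saved content */
        }
        free(save); free(pat); free(back);
        return bad;
    }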
Consistency Checking
- usually, when the OS shuts down (or unmounts a file system) it
writes a block to disk that says the file system was shut down correctly
- if this block does not have the correct value when mounting the
file system, the OS (often automatically) starts a file system consistency
checking program (fsck on Unix variants)
- block consistency: each block should occur exactly once in at most
one file, and otherwise should be listed as a free block (see the
sketch after this list)
- if the block is not in any file, it is added to the free list if
it is not already there
- if the block occurs multiple times in the same file, or in multiple
files, it can be copied to a free block (as often as necessary), though
this normally means the file(s) is/are corrupted
- file consistency: each inode should be in as many directories as
its reference count specifies
- an inode may have a nonzero reference count but appear in no
directory entry (e.g. if a deletion was interrupted); such a file
should be deleted
- a file accessible from multiple directories may have a reference
count too low, in which case the reference count is simply set correctly
- further checks can verify that inode numbers are valid, that
permissions are reasonable, that directory structures are reasonable, etc
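- a minimal sketch in C of the block consistency check above: count how
often each block occurs in files and in the free list, then compare;
the two iterator functions are hypothetical stand-ins for walking the
inodes and the free list:

    #include <stdio.h>

    #define NBLOCKS 4096

    extern int next_block_in_use(void);  /* hypothetical: -1 when done */
    extern int next_free_block(void);    /* hypothetical: -1 when done */

    void check_blocks(void)
    {
        static int in_use[NBLOCKS], in_free[NBLOCKS];
        int b;
        while ((b = next_block_in_use()) != -1) in_use[b]++;
        while ((b = next_free_block()) != -1)   in_free[b]++;
        for (b = 0; b < NBLOCKS; b++) {
            if (in_use[b] + in_free[b] == 0)           /* missing block */
                printf("block %d: add to the free list\n", b);
            else if (in_use[b] > 1)                    /* duplicate use */
                printf("block %d: copy to free blocks, file(s) corrupted?\n", b);
            else if (in_use[b] == 1 && in_free[b] > 0)
                printf("block %d: remove from the free list\n", b);
        }
    }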
Log-Structured (journaling) File Systems
- with a large disk cache, files can be read-ahead, so latency for
file reads can be small
- it is hard to optimize writes, since caching a write for too long
increases the likelihood of inconsistency
- so instead, group all the writes together and write them (as a log
segment) when a sufficient number of them has accumulated (see the
sketch at the end of this section)
- the log segment may contain i-nodes, data blocks, and directory
blocks
- finding i-nodes requires searching the entire log to build a map
(this is done once), then using the map in memory to locate the inode
- a cleaner daemon (thread, process) scans the log from its oldest
segment to find out which inodes and data blocks are still in use, and
writes the live data back to the head of the log
- as the cleaner proceeds along the log, less data will be
written back for each segment read (assuming files are deleted or blocks
overwritten), and the log gets smaller
- any log segments the cleaner has processed become free space and can
be reused
- makes performance very good when there are many small writes, and
makes consistency checking a lot faster, since only the last segment could
be corrupted
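- a minimal sketch in C of grouping writes into a log segment: writes
accumulate in memory and are flushed to the head of the log as one
large sequential write; the sizes and the disk_write helper are
invented for illustration:

    #include <string.h>

    #define SEGSIZE (512 * 1024)           /* hypothetical segment size */

    extern void disk_write(long offset, const void *buf, size_t n);  /* hypothetical */

    static char   segment[SEGSIZE];
    static size_t seg_used;
    static long   log_head;                /* disk offset of the head of the log */

    void log_write(const void *data, size_t n)   /* n <= SEGSIZE assumed */
    {
        if (seg_used + n > SEGSIZE) {      /* segment full: one sequential write */
            disk_write(log_head, segment, seg_used);
            log_head += seg_used;
            seg_used = 0;
        }
        memcpy(segment + seg_used, data, n);   /* data, i-node, or directory block */
        seg_used += n;
    }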
Performance: caching
- disk access is much slower than memory access
- therefore (using the working set principle) keep the most
recently used disk blocks in a cache
- LRU can be used with disk blocks, since the overhead of recording
each read/write access is small
- strict LRU leads to long delay writing back inodes (which affects
consistency), and to keeping some blocks which are only read very rarely
- some categories of blocks, e.g. inode blocks, could be placed at the
front (eviction end) of the LRU queue, so they will be evicted soon
(see the sketch at the end of this section)
- blocks that are needed for file system consistency must be
written back quickly, or perhaps immediately
- all blocks should be written back reasonably quickly
- the sync system call writes all dirty blocks back to disk,
and a daemon can call sync every few seconds
- alternatively, every block might be written back immediately --
a write-through cache, very inefficient if blocks are
written one byte at a time, but otherwise much better for consistency
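- a minimal sketch in C of the LRU list for cached disk blocks: each
access moves the block to the most-recently-used end, and eviction
takes from the other end; blocks that should leave the cache soon
(e.g. inode blocks) could instead be left near the eviction end; the
structure is illustrative:

    #include <stddef.h>

    struct buf {
        int         blockno;              /* which disk block is cached */
        int         dirty;                /* must be written back before eviction */
        struct buf *prev, *next;          /* position in the LRU list */
    };

    static struct buf *lru_head, *lru_tail;  /* head = evict next, tail = MRU */

    static void unlink_buf(struct buf *b)
    {
        if (b->prev) b->prev->next = b->next; else lru_head = b->next;
        if (b->next) b->next->prev = b->prev; else lru_tail = b->prev;
    }

    void touch(struct buf *b)             /* called on each read or write */
    {
        unlink_buf(b);
        b->prev = lru_tail;               /* append at the MRU end */
        b->next = NULL;
        if (lru_tail) lru_tail->next = b; else lru_head = b;
        lru_tail = b;
    }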
Performance: disk accesses
- new blocks should be allocated close to blocks that are logically
adjacent -- easy to do with a bit-map or an in-memory free list or
sequential allocation, but hard to do with a free list on disk
- access when using inodes requires at least one access to the
inode, followed by one access to the data, probably with a long seek
- inodes can be placed in the middle of the disk (halving the average
seek distance between inode and data), or cylinder groups can contain
inodes and (usually) the associated data, making many seeks small

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 License.