Today's Outline

Implementation layers for disk and file access

disks serve a number of purposes, including implementing file systems and swap (and memory mapping a file combines the two)
implementers find it convenient to think in terms of abstractions
low-level abstractions, e.g. "read disk block n", are used to implement higher-level abstractions, e.g. "read bytes a through b of file x".
in an OS, the highest-level abstraction implements the system calls, the lowest-level abstraction controls the hardware
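as a minimal sketch of this layering (not Linux code): the "disk" below is just an in-memory array, and the names disk, read_block, and read_bytes are made up for illustration. The file is assumed to occupy the disk contiguously starting at block 0, which a real file system would not require.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 16   /* tiny block size, chosen only to keep the example small */
#define NUM_BLOCKS 8

/* Hypothetical "disk": an array of fixed-size blocks held in memory. */
unsigned char disk[NUM_BLOCKS][BLOCK_SIZE];

/* Low-level abstraction: "read disk block n" into buf. */
void read_block(size_t n, unsigned char *buf)
{
    assert(n < NUM_BLOCKS);
    memcpy(buf, disk[n], BLOCK_SIZE);
}

/*
 * Higher-level abstraction: "read bytes a through b" (inclusive) of a file
 * that is assumed to be stored contiguously from block 0.
 * Returns the number of bytes copied into out.
 */
size_t read_bytes(size_t a, size_t b, unsigned char *out)
{
    unsigned char buf[BLOCK_SIZE];
    size_t copied = 0;

    assert(a <= b && b < (size_t)NUM_BLOCKS * BLOCK_SIZE);
    for (size_t pos = a; pos <= b; ) {
        size_t block  = pos / BLOCK_SIZE;
        size_t offset = pos % BLOCK_SIZE;
        size_t chunk  = BLOCK_SIZE - offset;   /* bytes left in this block */
        if (chunk > b - pos + 1)
            chunk = b - pos + 1;               /* do not read past b */
        read_block(block, buf);
        memcpy(out + copied, buf + offset, chunk);
        copied += chunk;
        pos += chunk;
    }
    return copied;
}
```

the caller of read_bytes never sees block boundaries; only read_block knows how the data is laid out.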
the following is the layering for the Linux disk and file access system (note that this picture is not in the book):

The Virtual File System layer (VFS) supports file system operations across a variety of file systems, including
- native file systems such as ext2 and ext3
- foreign file systems such as FAT
- distributed file systems such as NFS, Coda, and Andrew
- in-memory filesystems such as ramfs
the VFS is unix-oriented, in that it expects (for example) that each file will be represented by an inode and that a directory will be stored in a file
if a file system does not have inodes (for example, the file metadata is stored in the file itself, rather than in a separate block), the inode object in memory is built by the file system code to reflect the information in the file itself

the superblock object describes a mounted filesystem. There is one per mounted filesystem
the inode object describes a file. There is one per file being accessed, and inodes may be cached while files are not accessed
the dentry object is a cached file name, with a pointer to the corresponding inode
the file object represents an open file. It includes a pointer to a dentry which has a pointer to the inode
each process has a files_struct (open file table) object which represents all files opened by that process (p. 208). This includes an array of file descriptors, the index into which is the fd returned by an open call
each process also has an fs object that records the root and current directories, and a namespace object which allows each process to have a different set of mounted devices
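the chain fd -> file -> dentry -> inode can be sketched with simplified structures. The toy_ types below are illustrative assumptions, not the kernel's definitions, which carry many more fields plus locking and reference counts.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_FDS 8

struct toy_inode  { unsigned long ino; long size; };               /* one per file */
struct toy_dentry { const char *name; struct toy_inode *inode; };  /* cached name -> inode */
struct toy_file   { struct toy_dentry *dentry; long pos; };        /* one per open() */

/* Per-process open file table: the fd is just an index into this array. */
struct toy_files_struct { struct toy_file *fd[TOY_MAX_FDS]; };

/* Install an open file in the lowest free slot; return the fd, or -1 if full. */
int toy_fd_install(struct toy_files_struct *files, struct toy_file *f)
{
    assert(files != NULL && f != NULL);
    for (int i = 0; i < TOY_MAX_FDS; i++) {
        if (files->fd[i] == NULL) {
            files->fd[i] = f;
            return i;
        }
    }
    return -1;
}

/* Follow fd -> file -> dentry -> inode; return NULL for a bad descriptor. */
struct toy_inode *toy_fd_to_inode(const struct toy_files_struct *files, int fd)
{
    if (fd < 0 || fd >= TOY_MAX_FDS || files->fd[fd] == NULL)
        return NULL;
    return files->fd[fd]->dentry->inode;
}
```

opening the same file twice gives two toy_file objects (each with its own position) that share one dentry and one inode.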

each structure includes a pointer to an _op structure (e.g. i_op for the inodes) which implements functions to be used for that structure
each such function pointer defaults to NULL, in which case the system default is used
if the function pointer is defined (e.g. i_op->rename != NULL in the inode structure), then that function is used
this is a form of object-oriented programming in C: each specific file system implementation can use the default methods or override them
this is implemented entirely by the programmers, not by the language, and so (a) may allow for more flexibility and efficiency, and (b) permits more mistakes
as in many object oriented systems, memory is allocated and deallocated frequently, e.g. whenever a file is opened. This memory comes from the slab layer
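the "default unless overridden" dispatch described above can be sketched as follows. All toy_ names are hypothetical; the real inode_operations table has many more entries, and the handling of a NULL entry varies from one operation to another.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode;

/* Table of "methods"; a NULL entry means "use the generic default". */
struct toy_inode_ops {
    int (*rename)(struct toy_inode *inode, const char *new_name);
};

struct toy_inode {
    const char *name;
    const struct toy_inode_ops *i_op;
};

/* Generic default, used when a file system does not override rename. */
int toy_generic_rename(struct toy_inode *inode, const char *new_name)
{
    inode->name = new_name;
    return 0;
}

/* Override for a hypothetical read-only file system: always refuse. */
int toy_rofs_rename(struct toy_inode *inode, const char *new_name)
{
    (void)inode;
    (void)new_name;
    return -1;
}

/* Dispatch: call the override if present, otherwise the default. */
int toy_vfs_rename(struct toy_inode *inode, const char *new_name)
{
    assert(inode != NULL && inode->i_op != NULL);
    if (inode->i_op->rename != NULL)
        return inode->i_op->rename(inode, new_name);
    return toy_generic_rename(inode, new_name);
}
```

nothing in the language checks that an override has the right behavior, or even that i_op was initialized, which is the "permits more mistakes" side of the tradeoff.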

implementing a new file system under Linux simply means implementing the required VFS functions
for example:
- write_super should save a copy of the superblock
- truncate should change the size of the file corresponding to the given inode to the new size, which the VFS stores in the inode (i_size) before making the call
- d_revalidate should confirm that the dentry is still valid (if the pointer is left NULL, dentries are simply treated as always valid)
- aio_read starts an asynchronous read operation
- writev writes data that is spread across several buffers (i.e. in a vector of buffers)
the VFS code makes certain assumptions about what kind of things the specific functions will do, but is not dependent on how exactly they are implemented
for a disk-based file system, the VFS calls eventually have to be mapped to calls to the block I/O subsystem, which is shared among all parts of the kernel which use the disk
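from user space, the "vector of buffers" idea behind writev looks like this. The sketch uses the POSIX writev system call; write_two_buffers is a made-up helper name, and how the call reaches a file system's own method depends on the kernel version.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Write a header and a body, held in two separate buffers, with one call.
 * Returns the number of bytes written, or -1 on error.
 */
ssize_t write_two_buffers(int fd, const char *header, const char *body)
{
    struct iovec iov[2];

    assert(header != NULL && body != NULL);
    iov[0].iov_base = (void *)header;   /* writev only reads from the buffers */
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);
    return writev(fd, iov, 2);
}
```

the alternative is to copy both buffers into one contiguous buffer first, or to issue two separate write calls; writev avoids both the copy and the extra system call.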

the essential function of block I/O is to transfer data blocks between the disk (disk blocks) and memory (buffers)
specific aspects of data storage on disk are handled outside the block I/O system: for example, finding the correct block (file system), allocating buffers (memory allocation), and actually accessing the disk (device driver)
buffers in memory can be clean, that is, identical to the corresponding block on disk, or dirty, that is, different from the corresponding block on disk
Linux also has locked buffers, whose data is in the process of being transferred to or from disk
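the clean, dirty, and locked states can be sketched as two flags on a buffer. This is an illustrative model only: the toy_ names are assumptions, the "disk block" is an ordinary array, and the transfer is a synchronous memcpy rather than real asynchronous I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16

struct toy_buffer {
    unsigned char data[TOY_BLOCK_SIZE];
    bool dirty;    /* contents differ from the block on disk */
    bool locked;   /* a transfer to or from disk is in progress */
};

/* Modifying the in-memory copy makes the buffer dirty. */
void toy_buffer_modify(struct toy_buffer *b, unsigned char value)
{
    assert(!b->locked);              /* do not touch data during a transfer */
    memset(b->data, value, TOY_BLOCK_SIZE);
    b->dirty = true;
}

/* Write-back: lock, copy to the (simulated) disk block, then mark clean. */
void toy_buffer_writeback(struct toy_buffer *b, unsigned char *disk_block)
{
    if (!b->dirty)
        return;                      /* clean buffers need no I/O */
    b->locked = true;
    memcpy(disk_block, b->data, TOY_BLOCK_SIZE);
    b->dirty = false;
    b->locked = false;
}
```

a clean buffer can be discarded at any time to free memory; a dirty one must be written back first.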

reads are probably holding up an application process, so should complete as soon as possible
writes must be completed for correctness and to free up memory, but probably do not affect performance as much as reads
for highest throughput, it is best to minimize seeks:
- minimize the number of times the head is moved by reading the largest possible unit at once: merge adjacent requests so they can be serviced in a single disk operation
- minimize the distance the head must seek: sort requests so near requests are satisfied before far requests
maximizing throughput is not sufficient, since it is important that no single request be starved for too long -- this could be rephrased as, "if you wait long enough, every process is a real-time process" (analogously, no process would need special real-time treatment on an infinitely fast machine)
these are conflicting requirements, leading to tradeoffs and approximate solutions
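the sort and merge steps can be sketched on a simple array of requests. This is a toy model under stated assumptions, not any actual Linux I/O scheduler: the toy_request type and function names are made up, requests are assumed not to overlap, and nothing here prevents starvation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_request {
    unsigned long start;   /* first sector */
    unsigned long count;   /* number of sectors */
};

/* qsort comparator: order requests by starting sector. */
static int by_start(const void *a, const void *b)
{
    const struct toy_request *x = a, *y = b;
    return (x->start > y->start) - (x->start < y->start);
}

/*
 * Sort the queue by starting sector, then merge each request that begins
 * exactly where the previous one ends. Returns the new number of requests.
 */
size_t toy_sort_and_merge(struct toy_request *q, size_t n)
{
    size_t out = 0;

    assert(q != NULL || n == 0);
    if (n == 0)
        return 0;
    qsort(q, n, sizeof q[0], by_start);
    for (size_t i = 1; i < n; i++) {
        if (q[out].start + q[out].count == q[i].start)
            q[out].count += q[i].count;   /* adjacent: extend the earlier request */
        else
            q[++out] = q[i];              /* gap: keep as a separate request */
    }
    return out + 1;
}
```

a request for a far sector can wait indefinitely here if near requests keep arriving, so a practical scheduler would also need some bound on waiting time, such as a per-request deadline.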