One of the first challenges is what to call this filesystem. It will be S51K-like in many ways (inodes, directories are files like everything else, etc.) but not all ways. So I can't call it S51K. Also, I don't want to have "816" in the name, since a filesystem structure is (should be) portable to any system capable of supporting it. Hmm... Anyone got a good name for a filesystem that works on a 65C816-powered computer but could be run on something else?
Anyhow, I started out by listing the salient features of this filesystem, aside from the basic one of being able to keep data organized:
- Maximum filesystem size of 64 GB.
- Maximum file size of 4 GB.
- Maximum of 65,535 files per filesystem.
- Support for 30 character filenames.
- Support for both hard and symbolic links.
The S51K filesystem supports 24 bit block addresses for files, which means a file could theoretically be as large as 16.7 GB. In practice, that size is unattainable because the bytes-in-file field in the inode is a 32 bit field, limiting the real file size to 4 GB. Hence my file size limit.
The S51K filesystem gets its name from the fact that it was first introduced in the Bell Labs UNIX System V operating system and that it uses a 1 KB logical block structure. On the DEC PDP-11 hardware (where much UNIX development took place), a physical disk block is 512 bytes in size. However, on the DEC VAX hardware, a disk block is 1024 bytes in size. In order to make the filesystem portable between different machines, the decision was made to standardize on a 1024 byte block size. The low level disk driver in the UNIX kernel took care of this detail by being specific to the hardware on which it was being run—it was one of the few areas in the kernel that was written in assembly language. As it turned out, the 1 KB block size produced better efficiency in the filesystem kernel code as compared to the same code using 512 byte blocks.
Structurally, the S51K filesystem consists of a superblock (supervisor block, the first block in the filesystem), an inode array, a data block array, and allocation maps to control usage of the inode and data block arrays. Each file has an inode that describes the file's size, access times, data block locations, type and permissions. The matching directory entry consists of the inode number (a 16 bit quantity) paired with a filename of 1 to 14 characters. Hard links to files can be created, as long as they point to a file in the same filesystem. No support is present for symbolic links (symlinks) or links to directories.
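For the curious, the two on-disk structures just described look roughly like this in C. This is a sketch from memory of the System V layout, not copied from any actual UNIX source; the field names are illustrative:

```c
#include <stdint.h>

/* An S51K directory entry: a 16 bit inode number plus a 14 character
   name, giving a fixed 16 byte entry (and the 14 character name limit). */
struct s51k_dirent {
    uint16_t d_ino;       /* inode number; 0 marks a free slot          */
    char     d_name[14];  /* filename, NUL-padded, not NUL-terminated   */
};

/* The inode records everything about the file except its name. */
struct s51k_inode {
    uint16_t i_mode;        /* file type and permission bits            */
    uint16_t i_nlink;       /* hard link count                          */
    uint16_t i_uid, i_gid;  /* owner and group                          */
    uint32_t i_size;        /* bytes in file: 32 bits, hence the 4 GB cap */
    uint8_t  i_addr[40];    /* 13 block addresses, 3 bytes (24 bits)
                               each: 10 direct + 3 indirect, plus one
                               pad byte to round out the field          */
    uint32_t i_atime;       /* last access time                         */
    uint32_t i_mtime;       /* last modification time                   */
    uint32_t i_ctime;       /* last inode change time                   */
};                          /* packs into the classic 64 byte inode     */
```

Opening a directory and reading these fixed-size 16 byte entries is exactly how utilities such as ls walk a directory.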
Directories are like ordinary files in that they take up space in the data block array, are referred to by an inode and (in the case of subdirectories) an entry in a directory (the parent directory). Like any other file, a directory can be opened and read, which is how UNIX utilities such as ls and find are able to do what they do (the UNIX system function fstat is able to retrieve data from an inode when given an open file descriptor). Only the kernel is permitted to write to a directory.
The inode and data block allocation maps in the S51K filesystem are a set of linked lists. Linked lists made sense on the relatively slow DEC PDP-11 hardware where S51K first came to life. Also, disks of the time were very small by today's standards, so a lot was done to keep filesystem overhead from using too much precious space. However, linked lists become a bottleneck on large filesystems and often lead to severe fragmentation if frequent file creation and deletion occurs. These limitations were addressed by the S51K's descendant, the Acer Fast Filesystem (AFS), which replaced linked lists with bitmaps. The Extended Acer Fast Filesystem (EAFS) added support for much longer filenames via a directory kludge.
My filesystem will incorporate/borrow/steal internal features from both S51K and AFS, as well as have some features that I conjured. Here's a pictorial representation of my filesystem:
NEW FILESYSTEM LAYOUT Logical Block Offset (LBO)
+=================================+ ==================================
| | starting LBO depends on the number
| data block array | of data blocks in filesystem
+—————————————————————————————————+ $??????
| block allocation map |
+—————————————————————————————————+ $00040a
| block allocation map tag block |
+—————————————————————————————————+ $000409
| inode array |
+—————————————————————————————————+ $000009
| inode allocation map |
+—————————————————————————————————+ $000001
| superblock |
+=================================+ $000000

Internally, the following are features that I plan to incorporate:
- 1024 byte logical block size, achieved by pairing contiguous 512 byte disk blocks.
- Logical block offset (LBO) addressing. Everything in the filesystem will be addressed with a number that is a 32 bit offset relative to the superblock. This feature makes the filesystem portable to any hardware that has a kernel capable of accessing the filesystem. It also simplifies the arithmetic operations required to address blocks and compute filesystem geometry values.
- Inodes structured like the S51K model but with an expanded data block address table for faster access. The S51K inode had a table with 10 slots for direct block addresses and three slots for indirect addresses, with all slots being 24 bits wide. I am going to increase the direct block address slots to 16, which means that data in the first 16 KB of any file can usually be read or written with a single block access. The indirect address structure of S51K will be retained. Table slot width will be increased to 32 bits to be compatible with the 64 GB filesystem size limit. The S51K's 24 bit address table slot size limits the maximum block address within the filesystem to 16,777,215, which would prevent full usage of a larger filesystem.
- 16 bit inode numbers, which is what sets the 65,535 maximum files limit. Once I have built POC V2, which will have much more RAM than POC V1, I will look at implementing a 24 bit inode number, which would expand the limit to 16,777,215 files per filesystem. I'm sticking with 16 bits for now to keep the inode allocation map structure down to a smaller size. See next.
- Bitmaps to control inode and data block usage, similar to those used in AFS. The inode allocation bitmap (IAM) is a fixed size of eight logical disk blocks. The data block allocation map (BAM) will be variable in size, depending on the size of the filesystem. The maximum filesystem size limitation of 64 GB is set by the maximum permissible size of the BAM, which is 8192 logical blocks.
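To make the LBO idea concrete, here is a sketch of the address arithmetic in C. The function names and the `part_start` parameter are mine, purely illustrative:

```c
#include <stdint.h>

/* LBO arithmetic: everything in the filesystem is addressed by a 32 bit
   offset relative to the superblock, in 1 KB logical blocks, each of
   which is a pair of contiguous 512 byte physical blocks. */

#define LOG_BLOCK_SIZE  1024u
#define PHYS_BLOCK_SIZE  512u
#define PHYS_PER_LOG    (LOG_BLOCK_SIZE / PHYS_BLOCK_SIZE)   /* 2 */

/* Byte offset of a logical block from the start of the filesystem:
   just a shift left by 10. */
static inline uint64_t lbo_to_byte_offset(uint32_t lbo)
{
    return (uint64_t)lbo * LOG_BLOCK_SIZE;
}

/* First of the two 512 byte sectors holding logical block `lbo`.
   `part_start` is the partition's first physical sector, an assumed
   parameter that is not part of the original design notes. */
static inline uint64_t lbo_to_phys_sector(uint64_t part_start, uint32_t lbo)
{
    return part_start + (uint64_t)lbo * PHYS_PER_LOG;  /* shift left by 1 */
}
```

The point of the shifts is that the 65C816 never needs a divide or multiply to find anything on the disk.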
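And a sketch of the proposed inode's expanded block address table (the names are mine, and the inode's other fields are omitted):

```c
#include <stdint.h>

/* The proposed table: 16 direct slots (covering the first 16 KB of a
   file) plus the S51K-style single, double and triple indirect slots,
   all widened to 32 bits. */

#define N_DIRECT   16
#define N_INDIRECT 3

struct inode_addr_table {
    uint32_t direct[N_DIRECT];      /* LBOs of the first 16 data blocks */
    uint32_t indirect[N_INDIRECT];  /* single/double/triple indirect    */
};

/* With 1 KB blocks and 4 byte slots, an indirect block holds 256
   addresses, so the table reaches:
     16 direct blocks                   =  16 KB
     + 256 single indirect              = 256 KB more
     + 256 * 256 double indirect        =  64 MB more
     + 256 * 256 * 256 triple indirect  =  16 GB more
   which comfortably covers the 4 GB file size limit imposed by the
   32 bit bytes-in-file field. */
```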
A problem that develops with large filesystems is that the BAM also becomes very large and the time required to track down and allocate a free data block greatly increases as the filesystem fills up. In a 1 KB logical block filesystem, there is one BAM block for every 8192 data blocks, so the BAM block count goes up proportionally with the size of the filesystem's data area.
Consider a maximum-size filesystem of 64 GB. The stats on this filesystem would be:
Filesystem Blocks: 67,108,864
Usable Blocks: 67,104,758
Data Blocks: 67,096,567
BAM Blocks: 8,191

Usable blocks means the blocks available for the BAM and data block array. Note the number of BAM blocks. There are enough blocks to provide one bit for each data block in the filesystem—the last block is partially used, since 67,096,567 bits won't fit into an integral number of blocks.
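For anyone who wants to check the arithmetic, here's a small C sketch that reproduces the quoted figures. The fixed overhead assumed here (1 superblock, 8 IAM blocks, a 4096 block inode array and 1 tag block) is inferred from the quoted numbers rather than stated explicitly above:

```c
#include <stdint.h>

/* Geometry of a maximum-size (64 GB) filesystem with 1 KB blocks. */

static uint64_t total_blocks(void)
{
    return (64ull << 30) / 1024;            /* 67,108,864 logical blocks */
}

static uint64_t usable_blocks(void)
{
    /* Assumed fixed overhead: superblock + IAM + inode array + tag block. */
    return total_blocks() - (1 + 8 + 4096 + 1);
}

/* Smallest BAM that covers the leftover data blocks: each BAM block
   holds 8192 bits, and we need bam * 8192 >= usable - bam, which
   works out to bam = ceil(usable / 8193). */
static uint64_t bam_blocks(void)
{
    return (usable_blocks() + 8192) / 8193;
}

static uint64_t data_blocks(void)
{
    return usable_blocks() - bam_blocks();
}
```

The division by 8193 rather than 8192 is what makes the BAM come out at 8,191 blocks with the last block only partially used.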
Now, suppose 75 percent of the data blocks have been allocated to files and no files have ever been deleted. There will be no fragmentation and 75 percent of the BAM will have cleared bits—a cleared bit meaning the corresponding data block is in use. Now, if a file is opened and data is added to it, it is probable that a new data block will have to be allocated. This means the BAM would have to be scanned until a set bit was found. However, in our hypothetical case, the first set bit will be in relative BAM block 6142, which means 6141 full block scans would have to be performed in search of that set bit—a lot of processing. The compute-bound activity would go fast enough, especially since I could use 16 bit loads and stores to manipulate the bitmap. However, a lot of time would be consumed in reading in the BAM from disk—as much as 45 seconds if the disk hasn't cached any tracks and nothing is buffered in core. That sort of performance would make a Commodore 1541 floppy drive seem lightning fast by comparison.
This is a significant problem that I mulled for some time before devising a solution, which was to add an extra block to the BAM I call a "tag block," a term I borrowed from the tag RAM that was used with SRAM caches on 80386 and 80486 motherboards. The following pictorial represents a BAM with a tag block:
+———————————————————————————+
| |
| BAM Tag Block — 8192 Bits |
------------------------ | | ------------------------
/ / +———————————————————————————* \ \
/ / | \ \
/ / | \ \
/ / | \ \
+—————————————————————+ +—————————————————————+ +—————————————————————+ ... +—————————————————————+ +—————————————————————+
| | | | | | | | | |
| BAM Block $0000 for | | BAM Block $0001 for | | BAM Block $0002 for | | BAM Block $1FFE for | | BAM Block $1FFF for |
| Data Blocks | | Data Blocks | | Data Blocks | | Data Blocks | | Data Blocks |
| $00000000—$00001FFF | | $00002000—$00003FFF | | $00004000—$00005FFF | | $03FFC000—$03FFDFFF | | $03FFE000—$03FFFFFF |
| | | | | | | | | |
+—————————————————————+ +—————————————————————+ +—————————————————————+ ... +—————————————————————+ +—————————————————————+

Each bit in the tag block indicates the status of one BAM block, with bit 0 of byte $0000 in the tag block representing BAM block $0000 (address is an LBO). If the tag bit is set, then the corresponding BAM block has at least one set bit. When all bits in a BAM block have been cleared because all corresponding data blocks have been allocated, the tag bit is likewise cleared. As there are 8192 bits in the tag block, a maximum of 8192 BAM blocks can be controlled, representing $04000000 (67,108,864) data blocks. The result is a filesystem size of slightly more than 64 GB, once overhead is taken into account.
With the use of the tag block, it isn't necessary to perform a time-consuming sequential scan of the entire BAM looking for a free data block. Instead, the process is reduced to first scanning the tag block to find the first BAM block that points to at least one free data block. Once that BAM block has been determined, it will be scanned to locate the free block. Bitmap scanning is a quick process, as no bit-twiddling is required until a non-zero word is encountered in the map. Shifting and masking is then used to pinpoint the first set bit. With the 65C816, 16 bits can be handled at a time, which means the worst-case situation (the last word in the block is the one with the set bit) amounts to 512 scan iterations, with only the last one requiring shifting and masking. As a full BAM scan is avoided, no more than two disk reads are required to get the relevant BAM block, followed by one write to update the BAM and another to update the tag block if it has changed. If the relevant blocks are already in core, then disk activity is reduced to one or two writes only. In a true multitasking environment, the write operations would be handled by a "flush dirty buffers" routine that would run at periodic intervals to keep the filesystem consistent.
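The two-level search can be sketched in C as follows. This is a minimal in-core model, not the eventual 65C816 implementation: `uint16_t` words stand in for the 16 bit loads, a set bit means "free", and `read_bam` represents a hypothetical buffer-cache read. All names and the full-filesystem sentinel are mine:

```c
#include <stdint.h>

#define WORDS_PER_BLOCK (1024 / 2)   /* 512 sixteen-bit words per 1 KB block */
#define BITS_PER_BLOCK  8192u        /* data blocks controlled per BAM block */

/* Find the first set bit in a 1 KB bitmap block; -1 if all bits clear.
   No bit-twiddling happens until a non-zero word is found. */
static int first_set_bit(const uint16_t *map)
{
    for (int w = 0; w < WORDS_PER_BLOCK; w++) {
        if (map[w] != 0) {
            uint16_t v = map[w];
            int b = 0;
            while (!(v & 1)) {       /* shift and mask to pinpoint the bit */
                v >>= 1;
                b++;
            }
            return w * 16 + b;
        }
    }
    return -1;
}

/* Two-level allocation search: scan the tag block for a BAM block with a
   free bit, then scan only that BAM block.  Returns the data block
   number, or 0xFFFFFFFF if the filesystem is full. */
uint32_t find_free_data_block(const uint16_t *tag,
                              const uint16_t *(*read_bam)(int))
{
    int t = first_set_bit(tag);      /* which BAM block has a free bit?   */
    if (t < 0)
        return 0xFFFFFFFFu;          /* filesystem is full                */

    int b = first_set_bit(read_bam(t));
    if (b < 0)
        return 0xFFFFFFFFu;          /* tag and BAM inconsistent          */

    return (uint32_t)t * BITS_PER_BLOCK + (uint32_t)b;
}
```

Clearing the allocated bit, and clearing the tag bit when the BAM block goes to all zeros, would follow the same word-at-a-time pattern.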
There are some other filesystem details I have to work out, but most of them are more about algorithms than actual design. One such algorithm that will demand some serious attention is the one that searches a directory for a filename and returns an inode number to the caller. That one should keep me thinking for a while.