HOWTO: Multi Disk System Tuning: File Systems

5. File Systems

Over time the requirements for file systems have increased, and the demand for large structures, large files, long file names and more has prompted ever more advanced file systems. A file system is the part of the system that accesses and organises the data on mass storage. Today there is a large number of file systems to choose from and this section will describe these in detail.

The emphasis is on Linux but with more input I will be happy to add information for a wider audience.

5.1 General Purpose File Systems

Most operating systems have a general purpose file system for everyday use with most kinds of files, reflecting available features in the OS such as permission flags, protection and recovery.

`minix`

This was the original fs for Linux, back in the days when Linux was hosted on minix machines. It is simple but limited in features and is hardly ever used these days, other than on some rescue disks, as it is rather compact.

`xiafs` and `extfs`

These are also old, have fallen into disuse and are no longer recommended.

`ext2fs`

This is the established standard for general purpose use in the Linux world. It is fast, efficient and mature. It is under continuous development, and features such as ACL and transparent compression are on the horizon.

For more information check the ext2fs home page.

`ext3fs`

This is the name of the successor to ext2fs, available in kernel 2.4 and later. Many features are added to ext2fs, and to avoid confusion after such a radical upgrade the name is changed too.

Metadata, data that describes the structure of the files, is written to a journal on the disk. Note that there are 3 modes of operation for ext3fs.

data=writeback

This is metadata-only journalling. File data is written back to the main fs lazily.

After a crash and recovery the file integrity is intact but file content can be old.

Normally this is the fastest mode.

data=ordered

Here file data is written before metadata.

After a crash and recovery the file integrity is intact and files contain correct, recent data.

Normally this is only slightly slower.

data=journal

All data (as well as metadata) is written to the journal before it is released to the main fs for writeback.

This is a specialised mode for cases where synchronous operations are required, such as for mail spools and synchronous NFS mounts.
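The mode is selected with the data= mount option. As a minimal sketch, the following Python builds an /etc/fstab entry for a chosen mode and rejects anything other than the three modes listed above. The helper name fstab_line, the device and the mount point are made up for illustration, and the default of "ordered" is simply a choice made for this example.

```python
# Sketch only: build an /etc/fstab line selecting an ext3 journalling mode.
# fstab_line() is a hypothetical helper, not part of any standard tool.

EXT3_DATA_MODES = ("writeback", "ordered", "journal")


def fstab_line(device, mount_point, data_mode="ordered"):
    """Return an fstab entry mounting `device` as ext3 with data=<mode>."""
    if data_mode not in EXT3_DATA_MODES:
        raise ValueError("unknown ext3 data mode: %r" % data_mode)
    options = "defaults,data=%s" % data_mode
    # Fields: device, mount point, fs type, options, dump flag, fsck pass.
    return "%s %s ext3 %s 0 2" % (device, mount_point, options)


if __name__ == "__main__":
    # Example values; substitute your own device and mount point.
    print(fstab_line("/dev/hda3", "/var/spool/mail", "journal"))
```

Validating the mode before writing the line catches a mistyped option before it ends up in a file that is read at boot time.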

`ufs`

This is the fs used by BSD and variants thereof. It is mature but also developed for older types of disk drives where geometries were known. The fs uses a number of tricks to optimise performance but as disk geometries are translated in a number of ways the net effect is no longer so optimal.

`efs`

The Extent File System (efs) is Silicon Graphics' early file system, widely used on IRIX before version 6.0, after which xfs took over. While migration to xfs is encouraged, efs is still supported and much used on CDs.

There is a Linux driver in early beta stage, available at the Linux extent file system home page.

`XFS`

Silicon Graphics Inc (sgi) has started porting its mainframe grade file system to Linux. Source is not yet available as they are busily cleaning out legal encumbrance but once that is done they will provide the source code under GPL.

More information is already available on the XFS project page at SGI.

`reiserfs`

As of July 23rd, 1997 Hans Reiser reiser (at) RICOCHET.NET has put up the source to his tree based reiserfs on the web. His filesystem has some very interesting features, is much faster than ext2fs and is in use by a number of people. It is available in kernel 2.4 and later.

`enh-fs`

The Enhanced File System project is now dead.

`Tux2 fs`

The Tux2 File System project is now dead.

5.2 Microsoft File Systems

This company is responsible for a lot, including a number of filesystems that have at the very least caused confusion.

`fat`

Actually there are 2 fats out there, fat12 and fat16 depending on the partition size used but fortunately the difference is so minor that the whole issue is transparent.

On the plus side these are fast and simple, and most OSes understand them and can both read and write them. And that is about it.

The minus side is limited safety, severely limited permission flags and atrocious scalability. For instance with fat you cannot have partitions larger than 2 GB.
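The 2 GB figure follows from simple arithmetic. As a rough sketch, assuming the fat16 case with 16-bit cluster numbers and a largest cluster size of 32 KiB (both figures are assumptions not stated above, and a handful of cluster numbers are reserved, so the real limit is slightly lower):

```python
# Rough arithmetic behind the 2 GB fat16 limit (assumptions noted inline).

CLUSTER_NUMBER_BITS = 16          # fat16 uses 16-bit cluster numbers
MAX_CLUSTER_BYTES = 32 * 1024     # assumed largest cluster size: 32 KiB

max_clusters = 2 ** CLUSTER_NUMBER_BITS        # upper bound; a few are reserved
max_partition_bytes = max_clusters * MAX_CLUSTER_BYTES

print(max_partition_bytes // 1024 ** 3, "GiB")
```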

`fat32`

After about 10 years Microsoft realised fat was about, well, 10 years behind the times and created this fs which scales reasonably well.

Permission flags are still limited. NT 4.0 cannot read this file system but Linux can.

`vfat`

At the same time as Microsoft launched fat32 they also added support for long file names, known as vfat.

Linux reads vfat and fat32 partitions by mounting with type vfat.

`ntfs`

This is the native fs of Win-NT but as complete information is not available there is limited support for other OSes.

5.3 Logging and Journaling File Systems

These take a radically different approach to file updates by recording modifications to files in a log and checkpointing the log at some later time.

Reading is roughly as fast as traditional file systems that always update the files directly. Writing is much faster as only updates are appended to a log. All this is transparent to the user. It is in reliability and particularly in checking file system integrity that these file systems really shine. Since the data before last checkpointing is known to be good only the log has to be checked, and this is much faster than for traditional file systems.
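The idea of appending updates to a log and checkpointing later can be illustrated with a toy in-memory model. This is only a sketch of the principle: the class and method names are invented, and no real file system is implemented exactly this way.

```python
# Toy model of "append to a log, checkpoint later". Illustrative only;
# class and method names are invented for this example.


class ToyLoggedStore:
    def __init__(self):
        self.checkpointed = {}   # state known to be good
        self.log = []            # updates appended since the last checkpoint

    def write(self, name, data):
        # A write is just an append to the log, which is cheap.
        self.log.append((name, data))

    def read(self, name):
        # The newest logged update wins over the checkpointed copy.
        for logged_name, data in reversed(self.log):
            if logged_name == name:
                return data
        return self.checkpointed.get(name)

    def checkpoint(self):
        # Fold the logged updates into the main state and empty the log.
        for name, data in self.log:
            self.checkpointed[name] = data
        self.log = []

    def entries_to_check(self):
        # After a crash only the log needs to be examined, not everything.
        return len(self.log)


store = ToyLoggedStore()
store.write("a.txt", "first")
store.write("b.txt", "second")
store.checkpoint()
store.write("a.txt", "third")

print(store.read("a.txt"))        # newest version, served from the log
print(store.entries_to_check())   # only the entries after the checkpoint
```

The number of entries to check after a crash depends on how much was written since the last checkpoint, not on the total size of the store, which is why integrity checking is so much quicker.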

Note that while logging filesystems keep track of changes made to both data and inodes, journaling filesystems keep track only of inode changes.

Linux has quite a choice of such file systems but none are yet of production quality. Some are also on hold.

Adam Richter from Yggdrasil posted some time ago that they have been working on a compressed log file based system but that this project is currently on hold. Nevertheless a non-working version is available on the Yggdrasil ftp server, where special patched versions of the kernel can also be found.
Another project is the Linux log-structured Filesystem Project, which sadly is also on hold. Nevertheless its page contains much information on the topic.
Then there is the LinLogFS -- A Log-Structured Filesystem For Linux (formerly known as dtfs), which seems to be going strong. It is still in alpha but sufficiently complete to run programs off this file system.
Finally there is the Journaling Flash File System, designed by Axis Communications for their embedded diskless systems such as their Linux based web camera.

Note that ext3fs, XFS and reiserfs also have features for logging or journaling.

5.4 Read-only File Systems

Read-only media has not escaped the ever increasing complexities seen in more general file systems, so again there is a wide choice, with corresponding opportunities for exciting mistakes.

Note that ext2fs works quite well on a CD-ROM and seems to save space while offering the normal file system features such as long file names and permissions that can be retained when copying files across to read-write media. Also having /dev on a CD-ROM is possible.

Most of these are used with CD-ROM media, but the new DVD can also be used, and you can even use them through the loopback device on a hard disk file to verify an image before burning a ROM.

There is a read-only romfs for Linux but as that is not disk related nothing more will be said about it here.

`High Sierra`

This was one of the earliest standards for CD-ROM formats, supposedly named after the hotel where the final agreement took place.

High Sierra was so limited in features that new extensions simply had to appear and while there has been no end to new formats the original High Sierra remains the common precursor and is therefore still widely supported.

`iso9660`

The International Organization for Standardization (ISO) made their extensions and formalised the standard into what we know as the iso9660 standard.

The Linux iso9660 file system supports High Sierra as well as the Rock Ridge extensions.
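Before burning an image it can be handy to confirm that the file at least carries an iso9660 signature. The sketch below looks for the standard identifier "CD001" in the first volume descriptor, which starts at sector 16 of 2048 byte sectors. The function name is made up, the demonstration uses a fake in-memory image, and this check is far weaker than actually mounting the image.

```python
# Sketch: check whether a file looks like an iso9660 image by looking for
# the standard identifier "CD001" in the first volume descriptor.
# The function name is made up; this does not replace a loopback mount.
import io

SECTOR_SIZE = 2048
FIRST_DESCRIPTOR_SECTOR = 16      # sectors 0-15 are the system area


def looks_like_iso9660(stream):
    """Return True if `stream` (opened in binary mode) has an iso9660 signature."""
    stream.seek(FIRST_DESCRIPTOR_SECTOR * SECTOR_SIZE)
    header = stream.read(6)       # 1 type byte followed by the 5-byte identifier
    return len(header) == 6 and header[1:6] == b"CD001"


# Self-contained demonstration with a fake in-memory "image".
fake_image = io.BytesIO(b"\0" * (16 * 2048) + b"\x01CD001\x01")
print(looks_like_iso9660(fake_image))
print(looks_like_iso9660(io.BytesIO(b"not an image")))
```

For a real image, open the file in binary mode and pass the file object to the function in the same way.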

`Rock Ridge`

Not everyone accepts limits like short filenames and lack of permissions so very soon the Rock Ridge extensions appeared to rectify these shortcomings.

`Joliet`

Microsoft, not to be outdone in the standards extension game, decided it should extend CD-ROM formats with some internationalisation features and called it Joliet.

Linux supports this standard in kernels 2.0.34 or newer. You need to enable NLS in order to use it.

Trivia

Joliet is a city outside Chicago, best known for being the site of the prison where Jake was locked up in the movie "Blues Brothers". Rock Ridge (the UNIX extensions to ISO 9660) is named after the (fictional) town in the movie "Blazing Saddles".

`UDF`

With the arrival of DVD with up to about 17 GB of storage capacity the world seemingly needed another format, this time ambitiously named Universal Disk Format (UDF). This is intended to replace iso9660 and will be required for DVD and is available in modern Linux kernels.

5.5 Networking File Systems

There is a large number of networking technologies available that let you distribute disks throughout a local or even global network. This is somewhat peripheral to the topic of this HOWTO but as it can be used with local disks I will cover this briefly. It would be best if someone (else) took this into a separate HOWTO...

`NFS`

This is one of the earliest systems that allows mounting a file space on one machine onto another. There are a number of problems with NFS ranging from performance to security but it has nevertheless become established.

`AFS`

Also known as Andrew File System, it allows efficient sharing of files across large networks. Starting out as an academic project it is now sold by IBM whose home page gives you more details.

AFS has also branched into open source; more information is available on the OpenAFS home page.

Derek Atkins, of MIT, ported AFS to Linux and has also set up the Linux AFS mailing list (linux-afs@mit.edu) for this, which is open to the public. Requests to join the list should go to linux-afs-request@mit.edu and bug reports should be directed to linux-afs-bugs@mit.edu.

Important: as AFS uses encryption it is restricted software and cannot easily be exported from the US.

IBM, which owns Transarc, has announced the availability of the latest version of client as well as server for Linux.

Arla is a free AFS implementation; check the Arla homepage for more information as well as documentation.

`Coda`

A networking filesystem similar to AFS is underway and is called Coda. This is designed to be more robust and fault tolerant than AFS, and supports mobile, disconnected operations. Currently it does not scale very well and does not really have proper administrative tools, as AFS does and Arla is beginning to.

`nbd`

The Network Block Device (nbd), see the Sourceforge project pages, is available in Linux kernel 2.2 and later and reportedly offers excellent performance. The interesting thing here is that it can be combined with RAID (see later).

`enbd`

The Enhanced Network Block Device (enbd) is a project to enhance the nbd with features such as block journaled multi channel communications, internal failover and automatic balancing between channels and more.

The intended use is for RAID over the net.

`GFS`

The Global File System is a file system designed for storage across a wide area network.

5.6 Special File Systems

In addition to the general file systems there are also a number of more specific ones, usually providing higher performance or other features at the cost of a tradeoff in other respects.

`tmpfs` and `swapfs`

For short term fast file storage SunOS offers tmpfs, which is about the same as the swapfs on NeXT. This overcomes the inherent slowness of ufs by caching file data and keeping control information in memory. This means that data on such a file system will be lost when rebooting, so it is mainly suitable for the /tmp area but not for /var/tmp, which is where temporary data that must survive a reboot is placed.

SunOS offers very limited tuning for tmpfs and the number of files is even limited by total physical memory of the machine.

Linux has featured tmpfs since kernel version 2.4; it is enabled by turning on virtual memory file system support (formerly shm fs). Under certain circumstances tmpfs can lock up the system in early kernel versions, so make sure you use version 2.4.6 or later.

Note that tmpfs is a file system, which means formatting is not needed. This is in contrast to block devices, which must be partitioned and formatted before use.
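To see which directories on a running Linux system are backed by tmpfs, the mount table can be inspected. The following is a small sketch that parses text in the /proc/mounts format; the helper name and the sample mount table are invented for illustration.

```python
# Sketch: list tmpfs mount points from text in /proc/mounts format.
# tmpfs_mount_points() is a hypothetical helper.


def tmpfs_mount_points(mounts_text):
    """Return mount points whose file system type field is 'tmpfs'."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 3 and fields[2] == "tmpfs":
            points.append(fields[1])
    return points


# Invented sample: /tmp lives in memory, /var/tmp stays on disk.
SAMPLE = """\
/dev/hda1 / ext2 rw 0 0
tmpfs /tmp tmpfs rw 0 0
/dev/hda5 /var/tmp ext2 rw 0 0
"""

print(tmpfs_mount_points(SAMPLE))
```

On a live system the same function can be fed the contents of /proc/mounts.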

`userfs`

The user file system (userfs) allows a number of extensions to traditional file system use, such as an FTP based file system, compression (arcfs), fast prototyping and many other features. The docfs is based on this filesystem. Check the userfs homepage for more information.

`devfs`

When disks are added, removed or just fail it is likely that disk device names of the remaining disks will change. For instance if sdb fails then the old sdc becomes sdb, the old sdd becomes sdc and so on. Note that in this case hda, hdb etc. will remain unchanged. Likewise if a new drive is added the reverse may happen.

There is no guarantee that SCSI ID 0 becomes sda and that adding disks in increasing ID order will just add a new device name without renaming previous entries, as some SCSI drivers assign from ID 0 and up while others reverse the scanning order. Likewise adding a SCSI host adapter can also cause renaming.

Generally device names are assigned in the order they are found.
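The renaming effect can be simulated in a few lines. This is only a model of the "assigned in the order found" rule; the drive labels are invented and the function handles at most 26 drives.

```python
# Simulation of "names are assigned in the order devices are found".
# Drive labels below are invented; the real naming is done by the kernel.
import string


def assign_sd_names(drives_in_scan_order):
    """Map each drive to sda, sdb, ... following the scan order (max 26)."""
    return {
        drive: "sd" + string.ascii_lowercase[index]
        for index, drive in enumerate(drives_in_scan_order)
    }


drives = ["system", "home", "news", "scratch"]
before = assign_sd_names(drives)

# The second drive (sdb) fails and is no longer found at boot.
after = assign_sd_names([d for d in drives if d != "home"])

for drive in drives:
    print(drive, before[drive], "->", after.get(drive, "(gone)"))
```

Every drive scanned after the failed one moves up one letter, so any configuration file that refers to those names would now point at the wrong disk.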

The source of the problem lies in the limited number of bits available for major and minor numbering in the device files used to describe the device itself. You can see these in the /dev directory; info on the numbering and allocation can be found in man MAKEDEV. Currently there are 2 solutions to this problem in various stages of development:

scsidev: works by creating a database of drives and where they belong, check man scsifs and the scsidev home page for more information
devfs: is a more long term project aimed at getting around the whole business of device numbering by making the /dev directory a kernel file system in the same way as /proc is. More information will appear as it becomes available.
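To give a feel for how quickly the numbering space mentioned above runs out, here is a sketch of the traditional SCSI disk numbering. It assumes an 8-bit minor number, major number 8 for SCSI disks and 16 minors per disk (the whole disk plus up to 15 partitions). These constants are assumptions about kernels of this era and are not taken from the text above.

```python
# Sketch of the traditional SCSI disk numbering, assuming an 8-bit minor
# number and 16 minors per disk (whole disk plus up to 15 partitions).
# These constants are assumptions about kernels of this era.

MINOR_BITS = 8
MINORS_PER_DISK = 16
SD_MAJOR = 8


def sd_device_number(disk_letter, partition=0):
    """Return (major, minor) for e.g. sdb1 -> disk_letter='b', partition=1."""
    disk_index = ord(disk_letter) - ord("a")
    if not 0 <= partition < MINORS_PER_DISK:
        raise ValueError("partition out of range")
    minor = disk_index * MINORS_PER_DISK + partition
    if minor >= 2 ** MINOR_BITS:
        raise ValueError("no minor numbers left under this major")
    return SD_MAJOR, minor


# How many whole disks fit under a single major number.
disks_per_major = 2 ** MINOR_BITS // MINORS_PER_DISK
print(disks_per_major)
print(sd_device_number("b", 1))
```

Under these assumptions a single major number covers only a small, fixed set of disks, and a device file identifies a slot in the scan order rather than a particular physical drive.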

`smugfs`

For a number of reasons it is currently difficult to have files bigger than 2 GB. One file system that tries to overcome this limit is smugfs which is very fast but also simple. For instance there are no directories and the block allocation is simple.

It is available as compressed tarred source code and while it worked with kernel version 2.1.85 it is quite possible some work is required to make it fit into newer kernels. Also the low version number (0.0) suggests extra care is required.

5.7 File System Recommendations

There is a jungle of choices but generally it is recommended to use the general file system that comes with your distribution. If you use ufs and have some kind of tmpfs available you should first start off with the general file system to get an idea of the space requirements and if necessary buy more RAM to support the size of tmpfs you need. Otherwise you will end up with mysterious crashes and lost time.

If you use dual boot and need to transfer data between the two OSes one of the simplest ways is to use an appropriately sized partition formatted with fat as most systems can reliably read and write this. Remember the limit of 2 GB for fat partitions.

For more information on file system interconnectivity you can check out the file system page, which has been superseded by a newer file system page, and the article Kragen's Amazing List of Filesystems.

That guide is being superseded by a HOWTO which is underway and a link will be added when it is ready.

To avoid total havoc with device renaming if a drive fails, check out the scanning order of your system and try to keep your root system on hda or sda, and removable media such as ZIP drives at the end of the scanning order.
