A Bradbury reference! Excellent!
Red Hat 7.3 uses ext3 by default. ext3 is a filesystem obtained by
tacking a journal onto the ext2 filesystem. I will refer to ext2 in
what follows, as its limitations are passed on to ext3, so the
information applies to both. Journalling has implications of its own,
which I can go into if you would like (ask for a clarification), but it
does not affect the limitations inherent in the filesystem itself,
which are what I will cover here.
A file consists of data on the disk, described by an inode that
records where that data lives. Multiple directory entries (created by
"hard-linking" to a file) can point to the same inode, giving the
appearance of multiple files. If two such links reference the same
inode and one is deleted, the file is still there - only that
directory entry is removed. The data remains on disk until the last
link is removed and the inode is freed - then the disk space may be
reused for new files. An inode references data blocks on disk through
12 direct block pointers - each points to one block of the file.
Following the direct blocks is an indirect block, a pointer to a block
which contains pointers to data blocks. Then there is a doubly
indirect block, a pointer to a block which contains pointers to blocks
which point to data. There is also a triply-indirect block, which adds
one more level. This is why fragmentation is not a problem on ext2 -
the blocks have no need to be in linear order, so when they are not,
the system does not bog down.
The physical disks will have to seek just as they would with any other
filesystem if the file is not linear, but the filesystem itself has no
performance degradation as a result of fragmentation. Random access
is just as fast on a "fragmented" ext2 file as on an unfragmented one
because the filesystem knows where all the blocks are and can fetch
the millionth byte without having to read through and find out after
the first hundred thousand that the rest of the file is elsewhere.
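To make the direct/indirect business concrete, here is a little Perl
sketch (Perl, since that is what your scripts are written in) that
works out which level of indirection a given byte offset falls under.
The 4-byte pointer size is an assumption on my part, based on ext2's
32-bit block numbers, and the function is only an illustration, not
how the kernel actually does it:

  use strict;
  use warnings;

  sub block_region {
      my ($offset, $blocksize) = @_;
      my $ptrs  = $blocksize / 4;             # assumes 4-byte block pointers
      my $block = int($offset / $blocksize);  # which data block holds this byte

      return "direct"          if $block < 12;
      $block -= 12;
      return "indirect"        if $block < $ptrs;
      $block -= $ptrs;
      return "doubly indirect" if $block < $ptrs ** 2;
      return "triply indirect";
  }

  # The millionth byte of a file, with 4kB blocks:
  print block_region(1_000_000, 4096), "\n";   # prints "indirect"

The point is that finding the millionth byte is a couple of pointer
lookups, not a scan through the first million bytes.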
Now that we know a bit about how inodes work, we are in a better
position to understand their weaknesses. The number of blocks that an
inode can reference is fixed, so the maximum file size depends on your
block size, as shown in the following table:
Filesystem block size:    1kB       2kB       4kB       8kB
File size limit:          16GB      256GB     2048GB    2048GB
Filesystem size limit:    2047GB    8192GB    16384GB   32768GB
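If you are curious where those numbers come from, the file size limit
is just the pointer arithmetic of the 12 direct blocks plus the three
levels of indirect blocks. Here is that arithmetic as a quick Perl
sketch (again assuming 4-byte pointers; for the larger block sizes the
real ext2 limits in the table are lower than this raw arithmetic
because of 32-bit counters inside the filesystem):

  use strict;
  use warnings;

  for my $blocksize (1024, 2048, 4096, 8192) {
      my $ptrs   = $blocksize / 4;                    # pointers per block
      my $blocks = 12 + $ptrs + $ptrs**2 + $ptrs**3;  # direct + 3 indirect levels
      printf "%dkB blocks: roughly %.0f GB\n",
             $blocksize / 1024, $blocks * $blocksize / 2**30;
  }
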
However, the linux 2.4 kernel limits single block devices
(filesystems, partitions, what have you) to 2TB, even though ext2 can
handle larger filesystems with larger blocks. It does not sound like
you will need to worry about the maximum size of files.
You are starting to accumulate quite a few files in one place, so we
need to explore those limitations as well. There is a limit of 32k
(32768) subdirectories in a single directory, a limitation likely of
only academic interest, as many people don't even have that many files
(though huge mail servers may need to keep that in mind). The ext2
inode specification allows for over 100 trillion files to reside in a
single directory; however, because of the current linked-list directory
implementation, only about 10-15 thousand files can realistically be
stored in a single directory. This is why systems such as Squid (
http://www.squid-cache.org ) use cache directories with many
subdirectories - searching through tens of thousands of files in one
directory is sloooooooow. There is however a hashed directory index
scheme under development which should allow up to about 1 million
files to be reasonably stored in a single directory. To overcome this
limitation, I would suggest you build subdirectories to store your
text files in, and implement some simple hash function within your
perl scripts to store the files in the subdirectories.
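A minimal sketch of what I mean, in Perl. The directory names, the
choice of 100 bins, and the use of the filename itself as the hash key
are all just assumptions for illustration - adjust them to taste:

  use strict;
  use warnings;
  use File::Path qw(mkpath);

  # Pick a subdirectory (bin) for a file based on a cheap hash of its name.
  sub hashed_path {
      my ($filename, $bins) = @_;
      my $sum = 0;
      $sum += ord($_) for split //, $filename;   # cheap, stable hash of the name
      my $bin = sprintf "%02d", $sum % $bins;    # "00" .. "99" with 100 bins
      return "htmlfiles/$bin/$filename";
  }

  my $path = hashed_path("article12345.html", 100);
  (my $dir = $path) =~ s{/[^/]+$}{};             # strip the filename
  mkpath($dir);                                  # create htmlfiles/NN if needed
  print "$path\n";

As long as the hash depends only on the filename, your scripts can
always re-derive where a given file lives, so lookups stay cheap.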
You say that about 5000 files are being created per year. Would it be
possible to create directories for each year, and put files in them on
that basis? That would keep you below the practical limits of ext2,
and should give easy-to-accommodate splitting of files amongst
directories. If the files need to be accessible as one chunk, or are
not reasonably sorted by date, I would suggest the hashing approach I
mentioned above. Your database project will also help you manage many
small data records, but still leaves you with the problem of how to
arrange your HTML files.
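The by-year version is even simpler; a sketch (again, the directory
names are made up):

  use strict;
  use warnings;
  use File::Path qw(mkpath);

  sub yearly_path {
      my ($filename) = @_;
      my $year = (localtime)[5] + 1900;   # year the file is being created
      my $dir  = "htmlfiles/$year";       # e.g. htmlfiles/2002
      mkpath($dir);
      return "$dir/$filename";
  }

  print yearly_path("article12345.html"), "\n";
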
One possible solution to all your problems is to switch from ext2 to a
different filesystem. There are many different filesystems under
development, each with differing strengths and weaknesses. You may
want to keep your eye on the big three: ReiserFS, XFS, and JFS. XFS
is probably the most mature, followed by ReiserFS and then JFS.
However, all three are less mature than ext2/ext3.
So far, you're in fine shape - you've got plenty of inodes for several
years' operation. Unfortunately, there is no way to tune the number
of inodes once a filesystem is created - you have to recreate it with
new parameters. It looks to me like your only concern is the number
of files in one place.
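If you want to keep an eye on how many inodes you have left as the
years go by, df's -i switch reports inode usage; from a Perl script
you could do something like this (the path is just an example):

  use strict;
  use warnings;

  # Print the inode-usage line for the filesystem holding your HTML files.
  my @df = `df -i /var/www`;    # substitute your own path
  print $df[-1];
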
Your question is a straightforward one in terms of data gathering, yet
there seems to be relatively little analysis to do. In any event, I
have not put $100 worth of time into this question yet, so if there is
any area you would like more clarification on, don't hesitate to ask -
I've barely scratched the surface of filesystem implementation, though
you quickly get bogged down in very technical aspects of the data
structures implementing the filesystem, which you may not care about.
Thanks for an interesting question,
-Haversian
Searches:
Structure of Inodes
ext2 filesystem limits
ext2 large directories
Links:
The data fields in an inode:
http://www.nondot.org/sabre/os/files/FileSystems/ext2fs/ext2fs_7.html
A more verbose link on the above topic, but not quite complete:
http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs-7.html
Straight from the linux documentation:
/usr/src/linux-2.4/Documentation/filesystems/ext2.txt
/usr/src/linux-2.4/Documentation/filesystems/vfs.txt
A very in-depth (though easy to read) treatment of ext2 at MIT:
http://web.mit.edu/tytso/www/linux/ext2intro.html
A post, with links to papers, by Daniel Phillips on large directories:
http://lwn.net/2001/1213/a/directory-index.php3
Clarification of Answer by haversian-ga on 10 Dec 2002 01:10 PST
> And as I understand it, the problem will show up as performance
> degradation. Correct?
Correct. The directory entries are stored in a linked list, and
searching that list is an O(n) operation, so the more files there are,
the longer each lookup takes. An ideal hash table would be O(1),
meaning lookup time would not depend on the number of files at all.
The subdirectory scheme is not quite that good: lookup is still
linear, but in the number of files in each bin, which should be
roughly equal to the total number of files divided by the number of
bins. Basically, it can raise the useful limit from 10,000 to
1,000,000 by using 100 bins. It is slightly less space-efficient
(perhaps - it depends on the implementation), but faster, which is
what matters here.
When a file gets modified, the data gets written to the disk, and the
inode is updated to reflect the new disk blocks. If power dies between
writing the data and updating the inode, the new data is lost. If the
inode is written first and the power dies, then the inode references
disk blocks that contain gibberish - also very bad. The process of
fscking (FileSystem ChecKing) takes forever because the entire disk
must be read, searching for disk blocks that are gibberish and disk
blocks that were written but never added to an inode.
The way a journal works is that it adds extra steps to preserve what
is known as consistency - the inodes and the disk blocks are in sync.
First, a note is written which says "I'm going to update the disk
blocks now". Then the disk blocks are written. Then "I'm done
updating the disk blocks" and then "I'm going to update the inodes"
and then the inodes are updated, and then "I'm done updating inodes".
After a while the old journal entries get deleted because they're not
needed. If power dies between "going to update disk" and "updated
disk" then the data is invalid and is deleted. If the power dies
between "done updating disk" and "done updating inodes", the inodes
get updated and the data is saved - no need to read hundreds to
thousands of gigabytes of data. This is what is known as an atomic
update - the whole thing happens at once, or nothing happens.
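If it helps to see the ordering spelled out, here is a toy Perl sketch
of the idea - emphatically not ext3's real journal format, just the
"note, do, note" pattern described above, with made-up filenames:

  use strict;
  use warnings;

  sub log_step {
      my ($msg) = @_;
      open my $j, '>>', 'journal.log' or die "journal: $!";
      print {$j} "$msg\n";
      close $j;
      # A real journal would also force each note to disk (syncs/write
      # barriers) before moving on - that is where the extra cost comes from.
  }

  sub journalled_append {
      my ($data) = @_;
      log_step("BEGIN: updating data blocks");
      open my $d, '>>', 'data.db' or die "data: $!";
      print {$d} "$data\n";
      close $d;
      log_step("END: data blocks updated");

      log_step("BEGIN: updating inode");
      # ... the metadata update would happen here ...
      log_step("END: inode updated");
  }

  journalled_append("hello, journal");
  # Recovery after a crash only has to scan journal.log for a BEGIN with
  # no matching END, instead of reading the entire disk.
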
The performance implication is that what used to be two operations
(write data, write inodes) is now six (will write data; write data;
wrote data; will write inodes; write inodes; wrote inodes) so things
could be a bit slower. This is not as bad as it sounds though because
the linux filesystem layer is very efficient and caches absolutely
everything. If you give a linux system 4GB of RAM and it only needs a
few hundred meg, it will use the rest to cache the filesystem which
makes it very fast. If only Windows did that instead of its braindead
insistence on writing memory contents to disk even when there are
hundreds of megabytes of RAM free!
In reality, it will be slower on I/O bound servers, but all the tiny
writes to be done will be cached by linux in system RAM, then by the
controller, then by the hard drives; a lightly-loaded system will have
plenty of idle time to write the updates when you're not using the
machine for anything else. If the system is heavily loaded, the disk
writes may take so long to propagate to the physical disks that the
journal operations are already undone (written, verified, and
discarded) so *nothing* ends up being written except your data. For
higher-end applications where data must be written to disk and not
cached, the 6 writes versus 2 becomes an issue, but is offset by very
expensive disk subsystems with large battery-backed redundant onboard
caches. It starts being mind-boggling after that.
Anything else you're curious about?
-Haversian