A Bradbury reference! Excellent!
Red Hat 7.3 uses ext3 by default. ext3 is a filesystem obtained by
tacking a journal onto the ext2 filesystem. I will refer to ext2 in
what follows, as its limitations are passed on to ext3, so the
information applies to both. Journalling has implications of its own,
which I can go into if you would like (ask for a clarification), but it
does not affect the limitations inherent in the filesystem itself,
which are what I will cover here.
A file consists of data on the disk, described by an inode that
records where that data lives. Multiple directory entries (created by
"hard-linking" to a file) can point to the same inode, giving the
appearance of multiple files. If two such links reference the same
inode and one is deleted, the file is still there - only that
directory entry is removed. The data remains on disk until the last
link is removed and the inode is freed - then the disk space may be
reused for new files. An inode references data blocks on disk through
12 direct block pointers - each points to one block of the file.
Following the direct blocks is an indirect block, a pointer to a block
which contains pointers to data blocks. Then there is a doubly
indirect block, a pointer to a block which contains pointers to blocks
which point to data. There is also a triply-indirect block, which adds
one more level. This is why fragmentation is not a problem on ext2 -
the blocks have no need to be in linear order, so when they are not,
the system does not bog down.
The physical disks will have to seek just as they would with any other
filesystem if the file is not linear, but the filesystem itself has no
performance degradation as a result of fragmentation. Random access
is just as fast on a "fragmented" ext2 file as on an unfragmented one
because the filesystem knows where all the blocks are and can fetch
the millionth byte without having to read through and find out after
the first hundred thousand that the rest of the file is elsewhere.
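To make the direct/indirect business concrete, here is a little Perl
sketch (Perl, since that is what your scripts are written in) that
works out which level of indirection a given byte offset falls under.
The 4-byte pointer size is an assumption on my part, based on ext2's
32-bit block numbers, and the function is only an illustration, not
how the kernel actually does it:

  use strict;
  use warnings;

  sub block_region {
      my ($offset, $blocksize) = @_;
      my $ptrs  = $blocksize / 4;             # assumes 4-byte block pointers
      my $block = int($offset / $blocksize);  # which data block holds this byte

      return "direct"          if $block < 12;
      $block -= 12;
      return "indirect"        if $block < $ptrs;
      $block -= $ptrs;
      return "doubly indirect" if $block < $ptrs ** 2;
      return "triply indirect";
  }

  # The millionth byte of a file, with 4kB blocks:
  print block_region(1_000_000, 4096), "\n";   # prints "indirect"

The point is that finding the millionth byte is a couple of pointer
lookups, not a scan through the first million bytes.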
Now that we know a bit about how inodes work, we are in a better
position to understand their weaknesses. The number of blocks that an
inode can reference is fixed, so the maximum file size depends on your
block size, as shown in the following table:
Filesystem block size:    1kB       2kB       4kB       8kB
File size limit:          16GB      256GB     2048GB    2048GB
Filesystem size limit:    2047GB    8192GB    16384GB   32768GB
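If you are curious where those numbers come from, the file size limit
is just the pointer arithmetic of the 12 direct blocks plus the three
levels of indirect blocks. Here is that arithmetic as a quick Perl
sketch (again assuming 4-byte pointers; for the larger block sizes the
real ext2 limits in the table are lower than this raw arithmetic
because of 32-bit counters inside the filesystem):

  use strict;
  use warnings;

  for my $blocksize (1024, 2048, 4096, 8192) {
      my $ptrs   = $blocksize / 4;                    # pointers per block
      my $blocks = 12 + $ptrs + $ptrs**2 + $ptrs**3;  # direct + 3 indirect levels
      printf "%dkB blocks: roughly %.0f GB\n",
             $blocksize / 1024, $blocks * $blocksize / 2**30;
  }
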
However, the linux 2.4 kernel limits single block devices
(filesystems, partitions, what have you) to 2TB, even though ext2 can
handle larger filesystems with larger blocks. It does not sound like
you will need to worry about the maximum size of files.
You are starting to accumulate quite a few files in one place, so we
need to explore those limitations as well. There is a limit of 32k
(32768) subdirectories in a single directory, a limitation likely of
only academic interest, as many people don't even have that many files
(though huge mail servers may need to keep that in mind). The ext2
inode specification allows for over 100 trillion files to reside in a
single directory; however, because of the current linked-list directory
implementation, only about 10-15 thousand files can realistically be
stored in a single directory. This is why systems such as Squid (
http://www.squid-cache.org ) use cache directories with many
subdirectories - searching through tens of thousands of files in one
directory is sloooooooow. There is however a hashed directory index
scheme under development which should allow up to about 1 million
files to be reasonably stored in a single directory. To overcome this
limitation, I would suggest you build subdirectories to store your
text files in, and implement some simple hash function within your
perl scripts to store the files in the subdirectories.
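A minimal sketch of what I mean, in Perl. The directory names, the
choice of 100 bins, and the use of the filename itself as the hash key
are all just assumptions for illustration - adjust them to taste:

  use strict;
  use warnings;
  use File::Path qw(mkpath);

  # Pick a subdirectory (bin) for a file based on a cheap hash of its name.
  sub hashed_path {
      my ($filename, $bins) = @_;
      my $sum = 0;
      $sum += ord($_) for split //, $filename;   # cheap, stable hash of the name
      my $bin = sprintf "%02d", $sum % $bins;    # "00" .. "99" with 100 bins
      return "htmlfiles/$bin/$filename";
  }

  my $path = hashed_path("article12345.html", 100);
  (my $dir = $path) =~ s{/[^/]+$}{};             # strip the filename
  mkpath($dir);                                  # create htmlfiles/NN if needed
  print "$path\n";

As long as the hash depends only on the filename, your scripts can
always re-derive where a given file lives, so lookups stay cheap.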
You say that about 5000 files are being created per year. Would it be
possible to create directories for each year, and put files in them on
that basis? That would keep you below the practical limits of ext2,
and should give easy-to-accommodate splitting of files amongst
directories. If the files need to be accessible as one chunk, or are
not reasonably sorted by date, I would suggest the hashing approach I
mentioned above. Your database project will also help you manage many
small data records, but still leaves you with the problem of how to
arrange your HTML files.
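The by-year version is even simpler; a sketch (again, the directory
names are made up):

  use strict;
  use warnings;
  use File::Path qw(mkpath);

  sub yearly_path {
      my ($filename) = @_;
      my $year = (localtime)[5] + 1900;   # year the file is being created
      my $dir  = "htmlfiles/$year";       # e.g. htmlfiles/2002
      mkpath($dir);
      return "$dir/$filename";
  }

  print yearly_path("article12345.html"), "\n";
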
One possible solution to all your problems is to switch from ext2 to a
different filesystem. There are many different filesystems under
development, each with differing strengths and weaknesses. You may
want to keep your eye on the big three: ReiserFS, XFS, and JFS. XFS
is probably the most mature, followed by ReiserFS and then JFS.
However, all three are less mature than ext2/ext3.
So far, you're in fine shape - you've got plenty of inodes for several
years' operation. Unfortunately, there is no way to tune the number
of inodes once a filesystem is created - you have to recreate it with
new parameters. It looks to me like your only concern is the number
of files in one place.
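If you want to keep an eye on how many inodes you have left as the
years go by, df's -i switch reports inode usage; from a Perl script
you could do something like this (the path is just an example):

  use strict;
  use warnings;

  # Print the inode-usage line for the filesystem holding your HTML files.
  my @df = `df -i /var/www`;    # substitute your own path
  print $df[-1];
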
Your question is a straightforward one in terms of data gathering, yet
there seems to be relatively little analysis to do. In any event, I
have not put $100 worth of time into this question yet, so if there is
any area you would like more clarification on, don't hesitate to ask -
I've barely scratched the surface of filesystem implementation, though
you quickly get bogged down in very technical aspects of the data
structures implementing the filesystem, which you may not care about.
Thanks for an interesting question,
-Haversian
Searches:
Structure of Inodes
ext2 filesystem limits
ext2 large directories
Links:
The data fields in an inode:
http://www.nondot.org/sabre/os/files/FileSystems/ext2fs/ext2fs_7.html
A more verbose link on the above topic, but not quite complete:
http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs-7.html
Straight from the linux documentation:
/usr/src/linux-2.4/Documentation/filesystems/ext2.txt
/usr/src/linux-2.4/Documentation/filesystems/vfs.txt
A very in-depth (though easy to read) treatment of ext2 at MIT:
http://web.mit.edu/tytso/www/linux/ext2intro.html
A post, with links to papers, by Daniel Phillips on large directories:
http://lwn.net/2001/1213/a/directory-index.php3
Clarification of Answer by haversian-ga on 10 Dec 2002 01:10 PST
> And as I understand it, the problem will show up as performance
> degradation. Correct?
Correct. The directory entries are stored in a linked list, and
searching that list is an O(n) operation, so the more files there are,
the longer each lookup takes. An ideal hash table would be O(1),
meaning lookup time would not depend on the number of files at all.
The subdirectory scheme is not quite that good: lookup is still
linear, but in the number of files in each bin, which should be
roughly equal to the total number of files divided by the number of
bins. Basically, it can raise the useful limit from 10,000 to
1,000,000 by using 100 bins. It is slightly less space-efficient
(perhaps - it depends on the implementation), but faster, which is
what matters here.
When a file gets modified, the data gets written to the disk, and the
inode is updated to reflect the new disk blocks. If power dies between
writing the data and updating the inode, the new data is lost. If the
inode is written first and the power dies, then the inode references
disk blocks that contain gibberish - also very bad. The process of
fscking (FileSystem ChecKing) takes forever because the entire disk
must be read, searching for disk blocks that are gibberish and disk
blocks that were written but never added to an inode.
The way a journal works is that it adds extra steps to preserve what
is known as consistency - the inodes and the disk blocks are in sync.
First, a note is written which says "I'm going to update the disk
blocks now". Then the disk blocks are written. Then "I'm done
updating the disk blocks" and then "I'm going to update the inodes"
and then the inodes are updated, and then "I'm done updating inodes".
After a while the old journal entries get deleted because they're not
needed. If power dies between "going to update disk" and "updated
disk" then the data is invalid and is deleted. If the power dies
between "done updating disk" and "done updating inodes", the inodes
get updated and the data is saved - no need to read hundreds to
thousands of gigabytes of data. This is what is known as an atomic
update - the whole thing happens at once, or nothing happens.
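If it helps to see the ordering spelled out, here is a toy Perl sketch
of the idea - emphatically not ext3's real journal format, just the
"note, do, note" pattern described above, with made-up filenames:

  use strict;
  use warnings;

  sub log_step {
      my ($msg) = @_;
      open my $j, '>>', 'journal.log' or die "journal: $!";
      print {$j} "$msg\n";
      close $j;
      # A real journal would also force each note to disk (syncs/write
      # barriers) before moving on - that is where the extra cost comes from.
  }

  sub journalled_append {
      my ($data) = @_;
      log_step("BEGIN: updating data blocks");
      open my $d, '>>', 'data.db' or die "data: $!";
      print {$d} "$data\n";
      close $d;
      log_step("END: data blocks updated");

      log_step("BEGIN: updating inode");
      # ... the metadata update would happen here ...
      log_step("END: inode updated");
  }

  journalled_append("hello, journal");
  # Recovery after a crash only has to scan journal.log for a BEGIN with
  # no matching END, instead of reading the entire disk.
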
The performance implication is that what used to be two operations
(write data, write inodes) is now six (will write data; write data;
wrote data; will write inodes; write inodes; wrote inodes) so things
could be a bit slower. This is not as bad as it sounds though because
the linux filesystem layer is very efficient and caches absolutely
everything. If you give a linux system 4GB of RAM and it only needs a
few hundred meg, it will use the rest to cache the filesystem which
makes it very fast. If only Windows did that instead of its braindead
insistence on writing memory contents to disk even when there are
hundreds of megabytes of RAM free!
In reality, it will be slower on I/O bound servers, but all the tiny
writes to be done will be cached by linux in system RAM, then by the
controller, then by the hard drives; a lightly-loaded system will have
plenty of idle time to write the updates when you're not using the
machine for anything else. If the system is heavily loaded, the disk
writes may take so long to propagate to the physical disks that the
journal operations are already undone (written, verified, and
discarded) so *nothing* ends up being written except your data. For
higher-end applications where data must be written to disk and not
cached, the 6 writes versus 2 becomes an issue, but is offset by very
expensive disk subsystems with large battery-backed redundant onboard
caches. It starts being mind-boggling after that.
Anything else you're curious about?
-Haversian