one directory with 500 000 files (5-8mb each)

Rob Mueller robm at fastmail.fm
Mon Jun 23 03:31:44 MSD 2008


>>I am guessing ReiserFS could do quite well - it was designed to handle
>>huge numbers of files. What about others? XFS? Would be interesting to
>>know why so many files need to be in one directory as well...
>
> maybe this post can help.
>
> http://tservice.net.ru/~s0mbre/old/?section=projects&item=fs_contest2

The problem with benchmarks is when you compare different systems but only
tune one of them and not the others.

They mounted reiserfs with only "noatime". If they had done any research at
all, they would have found that something like
"noatime,nodiratime,notail,data=ordered" or
"noatime,nodiratime,notail,data=writeback" gives you much, much (10x)
better performance. The default "tails" implementation is designed to save
space, but it trades away performance to do so. I think that tradeoff is too
large and they should have defaulted to "notail", but that's all history.
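
As a rough illustration of the mount option point, the sketch below checks a
comma-separated option string (the kind you would pass to mount -o) against
the options named above. The function name and the RECOMMENDED tuple are made
up for this example; they are not part of any existing tool, and the data=
mode is deliberately left out of the check since either value was suggested.

```python
# Options named in the post. This tuple is an assumption of the example,
# not a list taken from any tool.
RECOMMENDED = ("noatime", "nodiratime", "notail")


def missing_options(option_string):
    """Return the recommended options absent from a comma-separated string."""
    present = {opt.strip() for opt in option_string.split(",") if opt.strip()}
    return [opt for opt in RECOMMENDED if opt not in present]


if __name__ == "__main__":
    # The benchmark's setup, then one of the combinations suggested above.
    print(missing_options("noatime"))
    print(missing_options("noatime,nodiratime,notail,data=writeback"))
```

Feeding it the option string of an existing mount is one quick way to spot a
filesystem that was benchmarked with tails still enabled.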

Reiserfs really does handle large directories well. We run an email system,
and had a user with > 1,000,000 emails in a folder, which means > 1,000,000
files in a directory, and there were no problems accessing individual files
in that directory at all. That user has since trimmed the folder down to
about 160,000, so I'll show that access is fine at that size.

The directory is hot in the cache now:

$ time ls | wc -l
161865

real    0m0.596s
user    0m0.500s
sys     0m0.110s

Most of the time there is user time, not system time.
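
ls sorts the names by default before printing them, which is user-space work.
A rough way to see how much of the elapsed time is the listing itself is to
count entries without sorting or stat-ing them. This is a minimal sketch,
assuming Python 3 is available on the machine; count_entries is a name made
up for the example.

```python
import os
import time


def count_entries(path="."):
    """Count directory entries without sorting or stat-ing them."""
    with os.scandir(path) as entries:
        return sum(1 for _ in entries)


if __name__ == "__main__":
    start = time.perf_counter()
    n = count_entries(".")
    elapsed = time.perf_counter() - start
    print(f"{n} entries in {elapsed:.3f}s")
```

The printed time includes interpreter overhead, so it is only good for
comparing a cold run against a warm one on the same machine.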

Accessing a random file for the first time:

$ strace -tt -o /tmp/st perl -e 'open(my $F, "1005527."); print scalar <$F>; close($F);'

...
19:15:24.463622 open("1005527.", O_RDONLY|O_LARGEFILE) = 3
19:15:24.463693 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf8af7c8) = -1 ENOTTY (Inappropriate ioctl for device)
19:15:24.463755 _llseek(3, 0, [0], SEEK_CUR) = 0
19:15:24.463818 fstat64(3, {st_mode=S_IFREG|0600, st_size=3576, ...}) = 0
19:15:24.463919 fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
19:15:24.463998 read(3, "Return-Path: <192.168.10.239@xyz"..., 4096) = 3576
19:15:24.477487 write(1, "Return-Path: <192.168.10.239@xyz"..., 40) = 40
19:15:24.477588 _llseek(3, 40, [40], SEEK_SET) = 0
19:15:24.477649 _llseek(3, 0, [40], SEEK_CUR) = 0
19:15:24.477703 close(3)                = 0
...

All the seeks + stats + fcntls are just perl doing various rubbish around a
file. You can see there are no big pauses in accessing the file, just about
0.013 seconds on the first read, which seems reasonable on this already
loaded email server. Let's see a second run.

...
19:18:23.275604 open("1005527.", O_RDONLY|O_LARGEFILE) = 3
19:18:23.275681 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfc6ab88) = -1 ENOTTY (Inappropriate ioctl for device)
19:18:23.275744 _llseek(3, 0, [0], SEEK_CUR) = 0
19:18:23.275805 fstat64(3, {st_mode=S_IFREG|0600, st_size=3576, ...}) = 0
19:18:23.275908 fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
19:18:23.275988 read(3, "Return-Path: <192.168.10.239@xyz"..., 4096) = 3576
19:18:23.276090 write(1, "Return-Path: <192.168.10.239@xyz"..., 40) = 40
19:18:23.276189 _llseek(3, 40, [40], SEEK_SET) = 0
19:18:23.276250 _llseek(3, 0, [40], SEEK_CUR) = 0
19:18:23.276305 close(3)                = 0

The file is hot in the cache, so the first read is about 0.0001 seconds 
there.
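
The same cold versus warm comparison can be made without strace. The sketch
below is a minimal example, assuming Python 3; time_first_read is a
hypothetical helper, and its timings include interpreter overhead, so treat
them as rough. The second call is only a warm-cache number if the first one
actually had to go to disk.

```python
import sys
import time


def time_first_read(path, size=4096):
    """Return (seconds, bytes_read) for opening a file and doing one read."""
    start = time.perf_counter()
    # buffering=0 gives a raw file object, so read() maps to one read call.
    with open(path, "rb", buffering=0) as f:
        data = f.read(size)
    return time.perf_counter() - start, len(data)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: pass the path of one file to time", file=sys.stderr)
    else:
        for label in ("first", "second"):
            seconds, nbytes = time_first_read(sys.argv[1])
            print(f"{label} run: {nbytes} bytes in {seconds:.6f}s")
```

Run it from inside the large directory with a file name that has not been
touched recently, the same way the perl one-liner was used above.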

As you can see, I'm a big reiserfs defender. It's worked really well for us,
and most people who think it sucks have one of the following problems.

1. They use unreliable hardware. Reiserfs does not cope well in the face of
unreliable hardware. If writes or reads return IO errors at any time, or any
data corruption occurs on disk, reiserfs is much more likely to crash
because of its more complex b-tree structure and lack of checksums. I think
that's why using it on user desktop/laptop machines is a big mistake. On
reliable server hardware though, it's great.
2. They use LVM. From our testing, for some reason, there still seem to be
strange LVM/reiserfs interactions. Use hardware RAID.
3. They use some stupid mount options which cause shocking performance.
4. They use some dumb filesystem speed test (e.g. untar + retar linux
kernel), rather than using long term benchmarks that show how a real world
filesystem performs after years of reading/writing/creating/deleting files
and fragmentation.
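
To make point 4 a bit more concrete, here is a toy sketch of the kind of
create/delete churn a long term benchmark would apply before measuring
anything. Everything in it (the churn function, the 40% delete ratio, the
file sizes, the round count) is an assumption for illustration only. Real
aging on a mail server takes months and far more data than this.

```python
import os
import random
import tempfile


def churn(directory, rounds=1000, max_size=8192, seed=0):
    """Randomly create and delete files so the directory ages a little.

    Returns the names of the files still present at the end.
    """
    rng = random.Random(seed)
    live = []
    for i in range(rounds):
        # Once some files exist, delete one about 40% of the time;
        # otherwise create a new file of random size.
        if live and rng.random() < 0.4:
            victim = live.pop(rng.randrange(len(live)))
            os.remove(os.path.join(directory, victim))
        else:
            name = f"{i}."
            with open(os.path.join(directory, name), "wb") as f:
                f.write(os.urandom(rng.randint(1, max_size)))
            live.append(name)
    return live


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        remaining = churn(d)
        print(f"{len(remaining)} files left after churn")
```

A meaningful test would run something like this against the real filesystem
for a long time, and only then time listings and random file access.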

Just my experience over 8+ years of trying filesystems in a server 
environment.

Rob

PS. I'm looking forward to btrfs becoming stable. Chris Mason did a lot of
work on reiserfs, and he knows the ins and outs of filesystem and Linux VM
development. He also has a good relationship with the rest of the kernel
team, so hopefully it won't suffer the Hans PR nightmare. Initial benchmarks
of btrfs look very promising, and it's being developed pretty quickly.
Definitely one to keep an eye on.

More information about the nginx mailing list