FreeBSD NFS performance and OpenSolaris

While setting up and testing our Sun X4540 OpenSolaris NFS server, I noticed that our FreeBSD NFS clients were having severe performance issues while writing to the server. After a few days of digging around, I came across some ancient posts (circa 2005) on the FreeBSD-performance mailing lists describing similar problems.
Here is a brief explanation of how an NFS write can happen:
Assuming we have a generic NFS server and a generic NFS client mounted over NFSv3/TCP, every file write can be followed by an FSYNC. This signals the NFS server to immediately write to disk what it has received from the client; once the data is on disk, the server acknowledges to the client that the write was successful so the client can send more data.

What I ran into was that the FreeBSD NFS client was issuing an FSYNC for every 32KB written (the default write size over TCP). Normally this isn’t too much of a problem for most NFS servers, but the issue here is ZFS’s guarantee that the on-disk state is always consistent. ZFS treats an FSYNC literally and commits to disk on every FSYNC. Most other file systems will lie at some level and acknowledge the write before the data actually hits the disk. That can improve performance, but if there is a crash or power failure before the data is actually committed to disk, you may lose data.
In my case, lying would be a good thing, as those FSYNCs cause performance to suffer dramatically.
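To make the cost concrete, here is a minimal, purely illustrative C sketch (my own code, not anything from FreeBSD; the function name and chunk size are just for the example) of what committing after every 32KB chunk looks like from a client’s point of view:

#include <unistd.h>

#define CHUNK	(32 * 1024)	/* the default NFS/TCP write size mentioned above */

/*
 * Illustrative only: write a buffer in 32KB chunks, forcing a disk commit
 * after every chunk -- roughly the access pattern described above.  On a
 * ZFS-backed NFS export, each fsync() is honored literally.
 */
static int
write_sync_per_chunk(int fd, const char *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		size_t n = (len - off > CHUNK) ? CHUNK : len - off;

		if (write(fd, buf + off, n) != (ssize_t)n)
			return (-1);
		if (fsync(fd) == -1)	/* one commit per 32KB chunk */
			return (-1);
		off += n;
	}
	return (0);
}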

In the FreeBSD cp(1) source (bin/cp/utils.c), we have the following section near line 143:


#ifdef VM_AND_BUFFER_CACHE_SYNCHRONIZED
	if (S_ISREG(fs->st_mode) && fs->st_size > 0 &&
	    fs->st_size <= 8 * 1048576) {
		if ((p = mmap(NULL, (size_t)fs->st_size, PROT_READ,
		    MAP_SHARED, from_fd, (off_t)0)) == MAP_FAILED) {
			warn("%s", entp->fts_path);
			rval = 1;
		}

What this roughly amounts to is that for files smaller than 8MB (8 * 1048576), cp uses mmap to handle the file; larger files use the native write() path.
Normally, mmap can speed things up a little when writing many small files at once, hence the behavior of using mmap for files smaller than 8MB. The drawback appears to be that over NFS, the mmap path ends up issuing an FSYNC for every write-size block.
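For illustration, here is a simplified sketch of the two copy strategies. This is hypothetical code of my own, not the FreeBSD implementation; the function names and buffer size are made up:

#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

/* mmap path: map the whole (small) source file and write it out in one call. */
static int
copy_via_mmap(int from_fd, int to_fd, off_t size)
{
	void *p;

	if ((p = mmap(NULL, (size_t)size, PROT_READ, MAP_SHARED,
	    from_fd, (off_t)0)) == MAP_FAILED)
		return (-1);
	if (write(to_fd, p, (size_t)size) != (ssize_t)size) {
		(void)munmap(p, (size_t)size);
		return (-1);
	}
	return (munmap(p, (size_t)size));
}

/* write() path: plain read/write loop, used for larger files. */
static int
copy_via_write(int from_fd, int to_fd)
{
	char buf[64 * 1024];
	ssize_t n;

	while ((n = read(from_fd, buf, sizeof(buf))) > 0)
		if (write(to_fd, buf, (size_t)n) != n)
			return (-1);
	return (n == 0 ? 0 : -1);
}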

To get around this behavior, you can decrease the file size at which mmap will be used. In my case I set it to (8 * 8) bytes; you are unlikely to come across files that small. Now pretty much all file copy operations will use write(), which goes much faster over NFS. You may incur a 10-15% performance hit on local disk copies of small files, but in exchange you gain an almost 100x speed improvement over NFS. This still gives you the protection of an FSYNC at file close. It is true that under some circumstances you may lose data on the NFS server if there is a crash mid-write, but ZFS ensures that the on-disk state is always consistent, so your file system should not have any errors.
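Concretely, the change is just to the size check in the excerpt above, something like:

	/* was: fs->st_size <= 8 * 1048576 (use mmap for files up to 8MB) */
	if (S_ISREG(fs->st_mode) && fs->st_size > 0 &&
	    fs->st_size <= 8 * 8) {	/* now only files up to 64 bytes are mmap'd */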

The OpenSolaris ‘cp’ utility (mv and cp are actually the same binary, just linked) already seems to have this logic built in:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libcmdutils/libcmdutils.h#65

As you can see, it will not use mmap for really small files (under 32KB), but for everything larger it does indeed use mmap, mapping in 8MB chunks.
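From memory, the relevant definitions in libcmdutils.h look roughly like this (approximate; check the link above for the exact values):

/* Approximate, quoted from memory -- see the libcmdutils.h link above. */
#define	MAXMAPSIZE	(8 * 1024 * 1024)	/* map files in chunks of at most 8MB */
#define	SMALLFILESIZE	(32 * 1024)		/* don't bother with mmap below this size */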

Below are two comparisons on the same hardware: one with FreeBSD's cp compiled to use mmap for small files, and the other compiled to use write() for all files.
The data set was 62 gigabytes of mail; most files are under 1MB, commonly around 12KB to 300KB, with a random scattering of larger files up to 25MB.

Mmap speed:

# time cp -Rv Maildir/ /mnt/obsmtp/ei_obscan/Maildir01


real 315m5.054s
user 0m8.418s
sys 10m12.513s

Write() speed:

$ time cp -Rv Maildir/ /mnt/obsmtp/ei_obscan/Maildir02


real 199m11.364s
user 0m7.324s
sys 5m39.594s

You can see the obvious performance increase, but it could be greater still if the files were larger on average, since every single file close still issues an FSYNC, which slows things down a little.
Average data transfer speed went from bursts of less than 1MB/sec to bursts of over 100MB/sec; sustained write speed was still limited by the FSYNCs on small files.

However, it appears there are other problems with FreeBSD NFS performance: the FreeBSD NFSv4 client always issues an FSYNC for every write block, regardless of whether mmap or write() is used. If you compare the FreeBSD NFSv3 and NFSv4 client code, the NFSv3 client has more logic to handle ASYNC versus FSYNC writes and will tend to use ASYNC given the choice, while the NFSv4 client has no such ASYNC code in it, from my examination (please correct me if I’m wrong).

I have done some packet snoops of the NFSv4 client, and not a single ASYNC write is ever issued, while mounting the same share via NFSv3 does issue ASYNC writes. Other NFS clients, such as the OpenSolaris NFSv4 client, use ASYNC and FSYNC appropriately, so this seems limited to FreeBSD.
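For reference, the “ASYNC” versus “FSYNC” writes visible in a packet snoop correspond to the stable_how field of the NFS WRITE call (RFC 1813 for NFSv3; NFSv4 uses the same values):

/* Stability level a client requests in an NFS WRITE (RFC 1813). */
enum stable_how {
	UNSTABLE  = 0,	/* "async": server may cache, client issues COMMIT later */
	DATA_SYNC = 1,	/* data must be on stable storage before the reply */
	FILE_SYNC = 2	/* data and metadata committed before the reply */
};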

I’ll have to bring this discussion to the attention of the FreeBSD people to see if there are any ways to improve the NFSv4 client, but it seems that simply adding ASYNC writes would improve it significantly.

A 552MB ISO being copied over NFSv4 and NFSv3, showing the impact of the FSYNC behavior:

552M Jan  2 18:22 7.1-RELEASE-i386-disc1.iso

NFSv4 mounted:
# mount_nfs -4T 10.0.0.19:/pdxfilu01/obsmtp /mnt/obsmtp/


$ time cp 7.1-RELEASE-i386-disc1.iso /mnt/obsmtp/1001/


real 9m39.517s
user 0m0.000s
sys 0m3.349s

And here we have NFSv3 mounted:

# mount_nfs -3T 10.0.0.19:/pdxfilu01/obsmtp /mnt/obsmtp/

$ time cp 7.1-RELEASE-i386-disc1.iso /mnt/obsmtp/1001/
real 0m12.682s
user 0m0.000s
sys 0m1.265s

OpenSolaris 2008.11 NFSv4 client:
# mount -F nfs -o vers=4 10.0.0.19:/pdxfilu01/obsmtp /mnt/obsmtp


$ time cp 7.1-RELEASE-i386-disc1.iso.1 /mnt/obsmtp/101/


real 0m10.997s
user 0m0.008s
sys 0m1.772s

Pretty clear example of how much ASYNC writes improve NFS performance!

One thought on “FreeBSD NFS performance and OpenSolaris”

  1. Maybe someone should work to improve FreeBSD’s NFS. I have noticed that it is pokey.

    NFS data should not be written without commit unless the client is willing to roll-back to the uncommitted bit when the server crashes.

    Bob
