Optimizing NFS Performance
Aside from the general network configuration (appropriate network capacity, faster NICs, full duplex settings to reduce collisions, agreement in network speed among the
switches and hubs, and so on), the most important client optimization settings are the NFS data transfer buffer sizes, specified by the mount command options rsize and wsize.
The theoretical limit for the NFS V2 protocol is 8K. For the V3 protocol, the limit is specific to the server. On the Linux server, the maximum block size is defined by the value of the kernel
constant NFSSVC_MAXBLKSIZE, found in the Linux kernel source file ./include/linux/nfsd/const.h. The current maximum block size for the kernel, as of 2.4.17, is 8K (8192 bytes), but
the patch set implementing NFS over TCP/IP transport in the 2.4 series, as of this writing, uses a value of 32K (defined in the patch as 32*1024) for the maximum block size.
All 2.4 clients currently support up to 32K block transfer sizes, allowing the standard 32K block transfers across NFS mounts from other servers, such as Solaris, without client
modification.
The defaults may be too big or too small, depending on the specific combination of hardware and kernels. On the one hand, some combinations of Linux kernels and network cards
(largely on older machines) cannot handle blocks that large. On the other hand, if they can handle larger blocks, a bigger size might be faster.
You will want to experiment and find an rsize and wsize that works and is as fast as possible. You can test the speed of your options with some simple commands, if your network
environment is not heavily used. Note that your results may vary widely unless you resort to using more complex benchmarks, such as Bonnie, Bonnie++, or IOzone.
The first of these commands transfers 16384 blocks of 16k each from the special file /dev/zero (which if you read it just spits out zeros really fast) to the mounted partition. We will time it
to see how long it takes. So, from the client machine, run a timed write of that size with a tool such as dd.
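A minimal sketch of that timed write, assuming dd is available on the client and the NFS filesystem is mounted at the hypothetical path /mnt/nfs (the function name and the file name testfile are illustrative, not part of any tool):

```shell
# Write COUNT blocks of 16k each from /dev/zero into a file in DIR.
# The default count of 16384 produces a 256 MB file.
nfs_write_test() (
    dir=${1:-/mnt/nfs}      # hypothetical NFS mount point
    count=${2:-16384}       # number of 16k blocks to write
    dd if=/dev/zero of="$dir/testfile" bs=16k count="$count"
)

# On the client, time the transfer:
#   time nfs_write_test /mnt/nfs
```

Raising the count argument is the simplest way to make the file larger than the server's RAM.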
This creates a 256 MB file of zeroed bytes. In general, you should create a file that's at least twice as large as the system RAM on the server, but make sure you have enough disk space!
Then read back the file into the great black hole on the client machine (/dev/null) by typing the following:
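A matching sketch of the read-back, under the same assumptions (dd on the client, a hypothetical mount point /mnt/nfs, and a test file named testfile written beforehand):

```shell
# Read the test file from DIR and discard the data in /dev/null.
nfs_read_test() (
    dir=${1:-/mnt/nfs}      # hypothetical NFS mount point
    dd if="$dir/testfile" of=/dev/null bs=16k
)

# On the client, time the transfer:
#   time nfs_read_test /mnt/nfs
# Between runs, unmount and remount to drop cached data, for example
# (assuming /mnt/nfs has an /etc/fstab entry):
#   umount /mnt/nfs && mount /mnt/nfs
```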
Repeat this a few times and average how long it takes. Be sure to unmount and remount the filesystem each time (both on the client and, if you are zealous, locally on the server as
well), which should clear out any caches.
Then unmount, and mount again with a larger and smaller block size. They should be multiples of 1024, and not larger than the maximum block size allowed by your system. Note that
NFS Version 2 is limited to a maximum of 8K, regardless of the maximum block size defined by NFSSVC_MAXBLKSIZE; Version 3 will support up to 64K, if permitted. The block size
should be a power of two since most of the parameters that would constrain it (such as file system block sizes and network packet size) are also powers of two. However, some users
have reported better successes with block sizes that are not powers of two but are still multiples of the file system block size and the network packet size.
Directly after mounting with a larger size, cd into the mounted file system and do things like ls, and explore the filesystem a bit to make sure everything is as it should be. If the rsize/wsize is too
large the symptoms are very odd and not 100% obvious. A typical symptom is incomplete file lists when doing ls, and no error messages, or reading files failing mysteriously with no
error messages. After establishing that the given rsize/wsize works you can do the speed tests again. Different server platforms are likely to have different optimal sizes.
Remember to edit /etc/fstab to reflect the rsize/wsize you found to be the most desirable.
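As an illustration, the size rules above (a multiple of 1024, no larger than the system maximum) can be checked before writing the /etc/fstab entry. This is only a sketch: the function name, the export foo:/tmp, the mount point /mnt/nfs, the rw option, and the default maximum of 32768 are all assumptions to adapt.

```shell
# Print an /etc/fstab line for an NFS mount that uses SIZE for both
# rsize and wsize. Fails if SIZE is not a positive multiple of 1024
# or if it exceeds MAX (default 32768 bytes).
nfs_fstab_line() (
    remote=$1          # e.g. foo:/tmp (hypothetical export)
    mountpoint=$2      # e.g. /mnt/nfs (hypothetical mount point)
    size=$3            # candidate rsize/wsize in bytes
    max=${4:-32768}    # largest block size to accept
    if [ "$size" -le 0 ] || [ $((size % 1024)) -ne 0 ] || [ "$size" -gt "$max" ]; then
        echo "invalid block size: $size" >&2
        return 1
    fi
    printf '%s %s nfs rw,rsize=%s,wsize=%s 0 0\n' \
        "$remote" "$mountpoint" "$size" "$size"
)

# Trying a size by hand before committing it to /etc/fstab (as root):
#   mount -o rsize=8192,wsize=8192 foo:/tmp /mnt/nfs
```

For example, nfs_fstab_line foo:/tmp /mnt/nfs 8192 prints a line ending in rw,rsize=8192,wsize=8192 0 0, while a size of 5000 is rejected.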
If your results seem inconsistent, or doubtful, you may need to analyze your network more extensively while varying the rsize and wsize values. In that case, here are several pointers to
benchmarks that may prove useful:
Bonnie: http://www.textuality.com/bonnie/
Bonnie++: http://www.coker.com.au/bonnie++
IOzone File System Benchmark: http://www.iozone.org
The official NFS benchmark, SPECsfs97: http://www.spec.org/osg/sfs97/
The easiest benchmark with the widest coverage, including an extensive spread of file sizes, and of IO types - reads, writes, rereads, and rewrites, random access, etc. - seems to be
IOzone. A recommended invocation of IOzone (for which you must have root privileges) includes unmounting and remounting the directory under test, in order to clear out the caches
between tests, and including the file close time in the measurements. Assuming you've already exported /tmp to everyone from the server foo, and that you've installed IOzone in the
local directory, this should work:
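A sketch of such an invocation, built only from the options described below. It assumes the server foo's /tmp is mounted on the client at the hypothetical mount point /mnt and that the iozone binary sits in the current directory; the IOZONE variable and the function name are illustrative.

```shell
# Run IOzone in full automatic mode against the NFS mount at MNT,
# unmounting and remounting MNT between tests and including close time.
run_iozone() (
    mnt=${1:-/mnt}     # hypothetical mount point of foo:/tmp
    "${IOZONE:-./iozone}" -a -R -c -U "$mnt" -f "$mnt/testfile"
)

# As root, with foo:/tmp already mounted on /mnt:
#   run_iozone /mnt > logfile
```

Setting IOZONE=echo prints the argument list instead of running the benchmark, which is a cheap way to check the command line first.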
The benchmark should take 2-3 hours at most, but of course you will need to run it for each value of rsize and wsize that is of interest. The web site gives full documentation of the
parameters, but the specific options used above are:
-a: Full automatic mode, which tests file sizes of 64K to 512M, using record sizes of 4K to 16M
-R: Generate report in Excel spreadsheet form (The "surface plot" option for graphs is best)
-c: Include the file close time in the tests, which will pick up the NFS version 3 commit time
-U: Use the given mount point to unmount and remount between tests; it clears out caches
-f: When using unmount, you have to locate the test file in the mounted file system
Try pinging back and forth between the two machines with large packets using the -f and -s options with ping (see ping(8) for more details) and see if a lot of packets get dropped, or if
they take a long time for a reply. If so, you may have a problem with the performance of your network card.
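A small sketch of such a test, assuming the iputils ping found on most Linux systems, whose summary line contains "N% packet loss". The host name server, the 8192-byte payload, and the packet count are placeholders; flood mode (-f) requires root.

```shell
# Extract the packet loss percentage from ping's summary line (read on stdin).
ping_loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# As root, flood the server with 1000 large packets and report the loss:
#   ping -f -s 8192 -c 1000 server | ping_loss_percent
```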
For a more extensive analysis of NFS behavior in particular, use the nfsstat command to look at nfs transactions, client and server statistics, network statistics, and so forth. The -o net
option will show you the number of dropped packets in relation to the total number of transactions. In UDP transactions, the most important statistic is the number of retransmissions, due
to dropped packets, socket buffer overflows, general server congestion, timeouts, etc. This will have a tremendously important effect on NFS performance, and should be carefully
monitored. Note that nfsstat does not yet implement the -z option, which would zero out all counters, so you must look at the current nfsstat counter values prior to running the
benchmarks.
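Since the counters cannot be zeroed, one approach is to record them before and after a run and subtract. The sketch below assumes a Linux client that exposes its RPC statistics in /proc/net/rpc/nfs, where the line starting with "rpc" lists calls, retransmissions, and authentication refreshes in that order; the function name is illustrative.

```shell
# Print the client-side RPC retransmission counter from an rpc stats file.
rpc_retrans() {
    awk '$1 == "rpc" { print $3 }' "${1:-/proc/net/rpc/nfs}"
}

# Around a benchmark run:
#   before=$(rpc_retrans)
#   ... run the benchmark ...
#   after=$(rpc_retrans)
#   echo "retransmissions during run: $((after - before))"
```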
To correct network problems, you may wish to reconfigure the packet size that your network card uses. Very often there is a constraint somewhere else in the network (such as a router)
that causes a smaller maximum packet size between two machines than what the network cards on the machines are actually capable of. TCP should autodiscover the appropriate
packet size for a network, but UDP will simply stay at a default value. So determining the appropriate packet size is especially important if you are using NFS over UDP.
You can test for the network packet size using the tracepath command. From the client machine, execute:
$ tracepath server
1: server (x.x.x.x) 0.274ms pmtu 1500
1: x.x.x.x (x.x.x.x) 0.494ms
2: x.x.x.x (x.x.x.x) 0.424ms
3: x.x.x.x (x.x.x.x) 1.042ms
4: server (x.x.x.x) 0.421ms reached
Resume: pmtu 1500 hops 4 back 4
$
and the path MTU should be reported at the bottom. You can then set the MTU on your network card equal to the path MTU, by using the MTU option to ifconfig, and see if fewer
packets get dropped. See the ifconfig man pages for details on how to reset the MTU.
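As a sketch, the path MTU can be pulled out of tracepath's final "Resume:" line and handed to ifconfig. The interface name eth0 and the host name server are placeholders, and the parsing assumes the output format shown above.

```shell
# Print the path MTU from tracepath output read on stdin.
path_mtu() {
    awk '$1 == "Resume:" && $2 == "pmtu" { print $3 }'
}

# Show the path MTU to the server:
#   tracepath server | path_mtu
# As root, apply it to the (hypothetical) interface eth0:
#   ifconfig eth0 mtu "$(tracepath server | path_mtu)"
```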
In addition, netstat -s will give the statistics collected for traffic across all supported protocols. You may also look at /proc/net/snmp for information about current network behavior; see
the next section for more details.
Packets may be dropped for many reasons. If your network topology is complex, fragment routes may differ, and fragments may not all arrive at the server for reassembly. NFS server capacity
may also be an issue, since the kernel has a limit of how many fragments it can buffer before it starts throwing away packets. With kernels that support the /proc filesystem, you can
monitor the files /proc/sys/net/ipv4/ipfrag_high_thresh and /proc/sys/net/ipv4/ipfrag_low_thresh. Once the number of unprocessed, fragmented packets reaches the number specified by
ipfrag_high_thresh (in bytes), the kernel will simply start throwing away fragmented packets until the number of incomplete packets reaches the number specified by ipfrag_low_thresh.
Another counter to monitor is IP: ReasmFails in the file /proc/net/snmp; this is the number of fragment reassembly failures. If it goes up too quickly during heavy file activity, you may
have a problem.
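A sketch for watching these values, assuming the usual layout of /proc/net/snmp in which a header line "Ip: ..." naming the columns is followed by a line "Ip: ..." holding the values; the function name is illustrative.

```shell
# Print the ReasmFails counter by locating its column in the Ip: header
# line and reading the same column from the Ip: values line.
reasm_fails() {
    awk '
        $1 == "Ip:" && !seen {
            for (i = 2; i <= NF; i++) if ($i == "ReasmFails") col = i
            seen = 1
            next
        }
        $1 == "Ip:" && col { print $col; exit }
    ' "${1:-/proc/net/snmp}"
}

# Current fragment buffer thresholds, in bytes:
#   cat /proc/sys/net/ipv4/ipfrag_high_thresh /proc/sys/net/ipv4/ipfrag_low_thresh
# Sample the failure counter every 10 seconds during heavy file activity:
#   while sleep 10; do reasm_fails; done
```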
The disadvantage of using TCP is that it is not a stateless protocol like UDP. If your server crashes in the middle of a packet transmission, the client will hang and any shares will need to
be unmounted and remounted.
The overhead incurred by the TCP protocol will result in somewhat slower performance than UDP under ideal network conditions, but the cost is not severe, and is often not noticeable
without careful measurement. If you are using gigabit ethernet from end to end, you might also investigate the usage of jumbo frames, since the high speed network may allow the larger
frame sizes without encountering increased collision rates, particularly if you have set the network to full duplex.
If you are already encountering excessive retransmissions (see the output of the nfsstat command), or want to increase the block transfer size without encountering timeouts and
retransmissions, you may want to adjust these values. The specific adjustment will depend upon your environment, and in most cases, the current defaults are appropriate.
Several published runs of the NFS benchmark SPECsfs97 specify usage of a much higher value for both the read and write value sets, [rw]mem_default and [rw]mem_max. You might
consider increasing these values to at least 256k. The read and write limits are set in the proc file system using (for example) the files /proc/sys/net/core/rmem_default and
/proc/sys/net/core/rmem_max. The rmem_default value can be increased in three steps; the following method is a bit of a hack but should work and should not cause any problems:
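A minimal sketch, assuming the three steps are: raise the limits, restart the NFS daemons so that their sockets are created with the larger buffers, and then restore the previous limits for the rest of the system. The helper name is hypothetical, and the restart command is distribution-specific, so it appears only as a placeholder comment.

```shell
# Write VALUE to rmem_default and rmem_max under DIR
# (DIR defaults to /proc/sys/net/core; writing there requires root).
set_rmem() (
    value=$1
    dir=${2:-/proc/sys/net/core}
    echo "$value" > "$dir/rmem_default"
    echo "$value" > "$dir/rmem_max"
)

# As root:
#   old=$(cat /proc/sys/net/core/rmem_default)   # remember the current value
#   set_rmem 262144                              # step 1: raise the limits to 256k
#   (restart the NFS daemons here)               # step 2: distribution-specific
#   set_rmem "$old"                              # step 3: restore the limits
```

This sketch restores both files to the saved rmem_default value for brevity; saving and restoring rmem_max separately would be more careful.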
If network cards auto-negotiate badly with hubs and switches, and ports run at different speeds, or with different duplex configurations, performance will be severely impacted due to
excessive collisions, dropped packets, etc. If you see excessive numbers of dropped packets in the nfsstat output, or poor network performance in general, try playing around with the
network speed and duplex settings. If possible, concentrate on establishing a 100BaseT full duplex subnet; the virtual elimination of collisions in full duplex will remove the most severe
performance inhibitor for NFS over UDP. Be careful when turning off autonegotiation on a card: The hub or switch that the card is attached to will then resort to other mechanisms (such
as parallel detection) to determine the duplex settings, and some cards default to half duplex because it is more likely to be supported by an old hub. The best solution, if the driver
supports it, is to force the card to negotiate 100BaseT full duplex.
In order to conform with "synchronous" behavior, used as the default for most proprietary systems supporting NFS (Solaris, HP-UX, RS/6000, etc.), and now used as the default in the
latest version of exportfs, the Linux Server's file system must be exported with the sync option. Note that specifying synchronous exports will result in no option being seen in the
server's export list:
Now we can see what the exported file system parameters look like:
# /usr/sbin/exportfs -v
/usr/local *(rw)
/tmp *(rw,async)
If your kernel is compiled with the /proc filesystem, then the file /proc/fs/nfs/exports will also show the full list of export options.
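Because a synchronous export shows no option in this listing, the asynchronous exports are the ones to look for. The sketch below filters exportfs -v output in the one-line-per-export form shown above; newer versions of exportfs may print more options or wrap long entries, so treat the parsing as an assumption. The function name is illustrative.

```shell
# Print the path of every export whose option list contains "async".
async_exports() {
    awk '/[(,]async[,)]/ { print $1 }'
}

# As root on the server:
#   /usr/sbin/exportfs -v | async_exports
# Re-exporting a file system synchronously might then look like:
#   /usr/sbin/exportfs -o rw,sync '*:/tmp'
```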
When synchronous behavior is specified, the server will not complete (that is, reply to the client) an NFS version 2 protocol request until the local file system has written all data/metadata
to the disk. The server will complete a synchronous NFS version 3 request without this delay, and will return the status of the data in order to inform the client as to what data should be
maintained in its caches, and what data is safe to discard. There are 3 possible status values, defined as an enumerated type, nfs3_stable_how, in include/linux/nfs.h. The values, along
with the subsequent actions taken due to these results, are as follows:
NFS_UNSTABLE: Data/Metadata was not committed to stable storage on the server, and must be cached on the client until a subsequent client commit request assures that the
server does send data to stable storage.
NFS_DATA_SYNC: Metadata was not sent to stable storage, and must be cached on the client. A subsequent commit is necessary, as is required above.
NFS_FILE_SYNC: No data/metadata need be cached, and a subsequent commit need not be sent for the range covered by this request.
In addition to the above definition of synchronous behavior, the client may explicitly insist on total synchronous behavior, regardless of the protocol, by opening all files with the O_SYNC
option. In this case, all replies to client requests will wait until the data has hit the server's disk, regardless of the protocol used (meaning that, in NFS version 3, all requests will be
NFS_FILE_SYNC requests, and will require that the Server returns this status). In that case, the performance of NFS Version 2 and NFS Version 3 will be virtually identical.
If, however, the old default async behavior is used, the O_SYNC option has no effect at all in either version of NFS, since the server will reply to the client without waiting for the write to
complete. In that case the performance differences between versions will also disappear.
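One way to observe the cost of O_SYNC from the shell is sketched here, assuming GNU dd (whose oflag=sync opens the output file with O_SYNC) and a hypothetical NFS mount point /mnt/nfs; the function and file names are illustrative.

```shell
# Write COUNT blocks of 16k to DIR/synctest with the file opened O_SYNC.
sync_write_test() (
    dir=${1:-/mnt/nfs}     # hypothetical NFS mount point
    count=${2:-1024}       # number of 16k blocks to write
    dd if=/dev/zero of="$dir/synctest" bs=16k count="$count" oflag=sync
)

# Compare against the same write without oflag=sync:
#   time sync_write_test /mnt/nfs
#   time dd if=/dev/zero of=/mnt/nfs/synctest bs=16k count=1024
```

Against a sync export the O_SYNC run would be expected to be noticeably slower; against an async export the two timings should be close, for the reason given above.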
Finally, note that, for NFS version 3 protocol requests, a subsequent commit request from the NFS client at file close time, or at fsync() time, will force the server to write any previously
unwritten data/metadata to the disk, and the server will not reply to the client until this has been completed, as long as sync behavior is followed. If async is used, the commit is
essentially a no-op, since the server once again lies to the client, telling the client that the data has been sent to stable storage. This again exposes the client and server to data
corruption, since cached data may be discarded on the client due to its belief that the server now has the data maintained in stable storage.
If you have access to RAID arrays, use RAID 1/0 for both write speed and redundancy; RAID 5 gives you good read speeds but lousy write speeds.
A journalling filesystem will drastically reduce your reboot time in the event of a system crash. Currently, ext3 will work correctly with NFS version 3. In addition, Reiserfs version 3.6
will work with NFS version 3 on 2.4.7 or later kernels (patches are available for previous kernels). Earlier versions of Reiserfs did not include room for generation numbers in the
inode, exposing the possibility of undetected data corruption during a server reboot.
Additionally, journalled file systems can be configured to maximize performance by taking advantage of the fact that journal updates are all that is necessary for data protection.
One example is using ext3 with data=journal so that all updates go first to the journal, and later to the main file system. Once the journal has been updated, the NFS server can
safely issue the reply to the clients, and the main file system update can occur at the server's leisure. The journal in a journalling file system may also reside on a separate device
such as a flash memory card so that journal updates normally require no seeking. With only rotational delay imposing a cost, this gives reasonably good synchronous IO
performance. Note that ext3 currently supports journal relocation, and ReiserFS will (officially) support it soon. The Reiserfs tool package found at
ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.x.0k.tar.gz contains the reiserfstune tool, which will allow journal relocation. It does, however, require a kernel patch which has not yet been
officially released as of January, 2002.
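To confirm that an exported ext3 file system really is mounted with data=journal, the mount table can be checked. This sketch assumes the kernel lists the option in /proc/mounts when it is in effect; the device and directory names in the comment are placeholders, and the function name is illustrative.

```shell
# Print the mount point of every ext3 file system mounted with data=journal.
journal_data_mounts() {
    awk '$3 == "ext3" && $4 ~ /(^|,)data=journal(,|$)/ { print $2 }' \
        "${1:-/proc/mounts}"
}

# Mounting with full data journalling (as root; names are placeholders):
#   mount -t ext3 -o data=journal /dev/sdb1 /export
#   journal_data_mounts
```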
Using an automounter (such as autofs or amd) may prevent hangs if you cross-mount files on your machines (whether on purpose or by oversight) and one of those machines
goes down. See the Automount Mini-HOWTO for details.
Some manufacturers (Network Appliance, Hewlett Packard, and others) provide NFS accelerators in the form of Non-Volatile RAM. NVRAM will boost access speed to stable
storage up to the equivalent of async access.