AIX IO Tuning
Agenda
The importance of IO tuning
The AIX IO stack
Data layout
Tools to characterize IO
Testing IO thruput
Tuning IO buffers
VMM tuning
Mount options
Asynchronous IO tuning
Queue depth tuning
Actuator and rotational speed are increasing relatively slowly
Network bandwidth - doubling every 6 months
Approximate CPU cycle time:     0.0000000005 seconds
Approximate memory access time: 0.000000270 seconds
Approximate disk access time:   0.010000000 seconds
Memory access takes 540 CPU cycles Disk access takes 20 million CPU cycles, or 37,037 memory accesses
System bottlenecks are being pushed to the disk
Disk subsystems are using cache to improve IO service times
Performance metrics
Disk metrics
  MB/s and IOPS
System metrics
  CPU, memory and IO
Size for your peak workloads
Size based on maximum sustainable thruputs
Bandwidth and thruput sometimes mean the same thing, sometimes not
For tuning - it's good to have a short running job that's representative of your workload
Disk performance
When do you have a disk bottleneck?
Random workloads:
  Reads average > 15 ms
  With write cache, writes average > 2.5 ms
Sequential workloads:
  Two sequential IO streams on one disk
  You need more thruput
[Chart: IOPS vs IO service time (ms) for a 15,000 RPM disk]
Lower layers of the IO stack:
  Multi-path IO driver (optional)
  Disk Device Drivers
  Adapter Device Drivers
  Disk subsystem (optional)
  Disk - read cache or memory area used for IO, and write cache
Queues exist for both adapters and disks
Adapter device drivers use DMA for IO
Disk subsystems have read and write cache
Disks have memory to store commands/data
IOs can be coalesced (good) or split up (bad) as they go thru the IO stack
  IOs adjacent in a file/LV/disk can be coalesced
  IOs greater than the maximum IO size supported will be split up
At the top of the stack: Raw LVs, or file systems (JFS, JFS2, NFS, Other)
NFS caches file attributes
NFS has a cached filesystem for NFS clients
JFS and JFS2 caches use extra system RAM
  JFS uses persistent pages for cache
  JFS2 uses client pages for cache
Data layout
Data layout affects IO performance more than any tunable IO parameter
Good data layout avoids dealing with disk hot spots
  An ongoing management issue and cost
Data layout must be planned in advance
  Changes are often painful
iostat and filemon can show unbalanced IO
Best practice: evenly balance IOs across all physical disks
Random IO best practice: spread IOs evenly across all physical disks
For disk subsystems:
  Create RAID arrays of equal size and RAID level
  Create VGs with one LUN from every array
  Spread all LVs across all PVs in the VG
[Diagram: RAID arrays, each presented as a LUN (logical disk) and seen by AIX as a PV (hdisk1-hdisk5), all in one VG named datavg]
# mklv -y lv1 -e x datavg <#LPs> hdisk1 hdisk2 ... hdisk5
# mklv -y lv2 -e x datavg <#LPs> hdisk3 hdisk1 ... hdisk4
Use a random order for the hdisks for each LV
Data layout
Sequential IO (with no random IOs) best practice:
Create RAID arrays with data stripes a power of 2
  RAID 5 arrays of 5 or 9 disks
  RAID 10 arrays of 2, 4, 8, or 16 disks
Create VGs with one LUN per array
Create LVs that are spread across all PVs in the VG, using a PP or LV strip size >= a full stripe on the RAID array
Do application IOs equal to, or a multiple of, a full stripe on the RAID array
N-disk RAID 5 arrays can handle no more than N-1 sequential IO streams before the IO becomes randomized
N-disk RAID 10 arrays can do N sequential read IO streams and N/2 sequential write IO streams before the IO becomes randomized
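A minimal sketch of this layout, assuming four RAID 5 arrays (4+1 disks, 256 KB segments, so a 1 MB full stripe) presented as hdisk1-hdisk4; the VG/LV names, PP size and LP counts are illustrative only:
# mkvg -S -y seqvg -s 64 hdisk1 hdisk2 hdisk3 hdisk4
  Scalable VG with one LUN (PV) per RAID array; 64 MB PPs leave room for growth
# mklv -y seqlv -e x seqvg 400 hdisk1 hdisk2 hdisk3 hdisk4
  PP-spread LV: -e x spreads the LPs across all PVs in the VG
# mklv -y striplv -S 1M seqvg 400 hdisk1 hdisk2 hdisk3 hdisk4
  Alternative LV striping: -S sets the strip size to match the assumed 1 MB full array stripe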
Data layout
Random and sequential mix best practice:
Use the random IO best practices
If the sequential rate isn't high, treat it as random
Determine sequential IO rates to an LV with lvmstat (covered later)
Data layout
Best practice for VGs and LVs:
Use Big or Scalable VGs
  Both support having no LVCB header on LVs (only important for raw LVs)
  LVCBs can lead to issues with IOs split across physical disks
  Big VGs require the mklv -T O option to eliminate the LVCB
  Scalable VGs have no LVCB
Only Scalable VGs support mirror pools (AIX 6100-02)
For JFS2, use inline logs
For JFS, one log per file system provides the best performance
If using LVM mirroring, use active MWC
  Passive MWC creates more IOs than active MWC
Use RAID in preference to LVM mirroring
  Reduces IOs, as there are no additional writes for MWC
Use PP striping in preference to LV striping
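A hedged sketch of these practices - VG, LV and file system names, disks and sizes are illustrative:
# mkvg -B -y bigvg -s 128 hdisk1 hdisk2
  Big VG; raw LVs in it need -T O to be created without an LVCB
# mklv -y rawlv -T O -e x bigvg 100
# mkvg -S -y datavg -s 128 hdisk3 hdisk4
  Scalable VG: no LVCBs at all, and supports mirror pools at AIX 6100-02
# crfs -v jfs2 -g datavg -m /data -A yes -a size=50G -a logname=INLINE
  JFS2 file system with an inline log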
LVM limits
                    Standard VG   Big VG (-B)   Scalable VG (AIX 5.3)
Max hdisks in VG            32           128            1024
Max LVs in VG              256           512            4096
Max PPs per VG          32,512       130,048       2,097,152
Max LPs per LV          32,512        32,512          32,512
Max PPs per VG and max LPs per LV restrict your PP size
Use a PP size that allows for growth of the VG
Valid LV strip sizes range from 4 KB to 128 MB in powers of 2 for striped LVs
The smit panels may not show all LV strip options depending on your ML
Application IO characteristics
Random IO
  Typically small (4-32 KB)
  Measure and size with IOPS
  Usually disk actuator limited
Sequential IO
  Typically large (32 KB and up)
  Measure and size with MB/s
  Usually limited on the interconnect to the disk actuators
To determine application IO characteristics, use filemon:
# filemon -o /tmp/filemon.out -O lv,pv -T 500000; sleep 90; trcstop
or at AIX 6.1:
# filemon -o /tmp/filemon.out -O lv,pv,detailed -T 500000; sleep 90; trcstop
Check for trace buffer wraparounds, which may invalidate the data; run filemon with a larger -T value or a shorter sleep
Use lvmstat to get LV IO statistics
Use iostat to get PV IO statistics
Using filemon
Look at the PV summary report
  Look for balanced IO across the disks
  Lack of balance may be a data layout problem
    Depends upon the PV to physical disk mapping
    LVM mirroring scheduling policy also affects balance for reads
  IO service times in the detailed report are more definitive on data layout issues
    Dissimilar IO service times across PVs indicate IOs are not balanced across the physical disks
Look at the most active LVs report
  Look for busy file system logs
  Look for file system logs serving more than one file system
Using filemon
Look for increased IO service times between the LV and PV layers, caused by:
  Inadequate file system buffers
  Inadequate disk buffers
  Inadequate disk queue_depth
  Inadequate adapter queue depth - can lead to poor PV IO service time
  i-node locking: decrease file sizes or use the cio mount option if possible
  Excessive IO splitting down the stack (increase LV strip sizes)
  Disabled interrupts
    Page replacement daemon (lrud): set lru_poll_interval to 10
    syncd: reduce file system cache for AIX 5.2 and earlier
A tool is available for this purpose: a script on AIX and a spreadsheet
[Example filemon data: an average LV read IO time of 17.19 ms compared with much lower PV read IO times indicates time lost between the LV and PV layers]
Using iostat
Use a meaningful interval, 30 seconds to 15 minutes
The first report is since system boot (if sys0's attribute iostat=true)
Examine IO balance among hdisks
Look for bursty IO (based on syncd interval)
-p for tape statistics (AIX 5.3 TL7 or later) -f for file system statistics (AIX 6.1 TL1)
Using iostat
# iostat <interval> <count>      For individual disk and system statistics

tty:   tin    tout    avg-cpu:  % user  % sys  % idle  % iowait
      24.7    71.3               8.3    2.4    85.6    3.6

Disks:     % tm_act    Kbps     tps    Kb_read   Kb_wrtn
hdisk0      2.2        19.4     2.6    268       894
hdisk1      5.0        231.8    28.1   1944      11964
hdisk2      5.4        227.8    26.9   2144      11524
hdisk3      4.0        215.9    24.8   2040      10916
...

# iostat -ts <interval> <count>  For total system statistics

System configuration: lcpu=4 drives=2 ent=1.50 paths=2 vdisks=2

tty:   tin      tout    avg-cpu:  % user  % sys  % idle  % iowait  physc  % entc
       0.0    8062.0               0.0    0.4    99.6    0.0       0.0    0.7
       Kbps     tps     Kb_read   Kb_wrtn
       82.7     20.7    248       0
       0.0   13086.5               0.0    0.4    99.5    0.0       0.0    0.7
       Kbps     tps     Kb_read   Kb_wrtn
       80.7     20.2    244       0
       0.0   16526.0               0.0    0.5    99.5    0.0       0.0    0.8
What is %iowait?
Percent of time the CPU is idle and waiting on an IO so it can do some more work
Low %iowait does not necessarily mean you don't have a disk bottleneck The CPUs can be busy while IOs are taking unreasonably long times
Conclusion: %iowait is a misleading indicator of disk performance
Using lvmstat
Provides IO statistics for LVs, VGs and PPs You must enable data collection first for an LV or VG:
# lvmstat -e -v <vgname>
Useful to find busy LVs and PPs
# lvmstat -sv rootvg 3 10
Logical Volume    iocnt   Kb_read   Kb_wrtn    Kbps
hd8                 212         0       848   24.00
hd4                  11         0        44    0.23
hd2                   3        12         0    0.01
hd9var                2         0         8    0.01
..
hd8                   3         0        12    8.00
.
hd8                  12         0        48   32.00
hd4                   1         0         4    2.67

# lvmstat -l lv00 1
Log_part  mirror#   iocnt   Kb_read   Kb_wrtn      Kbps
       1        1   65536     32768         0      0.02
       2        1   53718     26859         0      0.01
Log_part  mirror#   iocnt   Kb_read   Kb_wrtn      Kbps
       2        1    5420      2710         0  14263.16
Log_part  mirror#   iocnt   Kb_read   Kb_wrtn      Kbps
       2        1    5419      2709         0  15052.78
Log_part  mirror#   iocnt   Kb_read   Kb_wrtn      Kbps
       3        1    4449      2224         0  13903.12
       2        1     979       489         0   3059.38
Testing thruput
Sequential IO
Test sequential read thruput from a device:
# timex dd if=<device> of=/dev/null bs=1m count=100
Test sequential write thruput to a device:
# timex dd if=/dev/zero of=<device> bs=1m count=100
Note that /dev/zero writes the null character, so writing this character to files in a file system will result in sparse files
For file systems, either create a file, or use the lptest command to generate a file, e.g.:
# lptest 127 32 > 4kfile
To test multiple sequential IO streams, use a script and monitor thruput with topas:
dd if=<device1> of=/dev/null bs=1m count=100 & dd if=<device2> of=/dev/null bs=1m count=100 &
Testing thruput
Random IO
# ndisk -R -f ./tempfile_10MB -r 50 -t 60
Command: ndisk -R -f ./tempfile_10MB -r 50 -t 60
        Synchronous Disk test (regular read/write)
        No. of processes = 1
        I/O type         = Random
        Block size       = 4096
        Read-Write       = Equal read and write
        Sync type: none  = just close the file
        Number of files  = 1
        File size        = 33554432 bytes = 32768 KB = 32 MB
        Run time         = 60 seconds
        Snooze %         = 0 percent
----> Running test with block Size=4096 (4KB) .
Proc - <-----Disk IO----> | <-----Throughput------>  RunTime
 Num -    TOTAL    IO/sec |    MB/sec       KB/sec   Seconds
   1     331550    5517.4 |     21.55     22069.64     60.09
Read cache
Unmount and remount file systems to clear file system cache
For disk subsystem read cache, use:
# cat <unused file(s)> > /dev/null
The unused file(s) must be larger than the disk subsystem read cache
It's recommended to prime the cache, as most applications will be using it and you've paid for it, so you should use it
Write cache
If we fill up the write cache, IO service times will be at disk speed, not cache speed
Tuning IO buffers
1. Run vmstat -v to see counts of blocked IOs
# vmstat -v | tail -7        <-- only the last 7 lines are needed
        0 pending disk I/Os blocked with no pbuf
        0 paging space I/Os blocked with no psbuf
     8755 filesystem I/Os blocked with no fsbuf
        0 client filesystem I/Os blocked with no fsbuf
     2365 external pager filesystem I/Os blocked with no fsbuf
2. Run your workload
3. Run vmstat -v again and look for larger numbers
4. Increase the resources:
   For pbufs, increase pv_min_pbuf with ioo, or see the next slide
   For psbufs, stop paging or add paging spaces
   For fsbufs, increase numfsbufs with ioo
   For external pager fsbufs, increase j2_nBufferPerPagerDevice (not available in 6.1) and/or j2_dynamicBufferPreallocation with ioo
   For client filesystem fsbufs, increase nfso's nfs_v3_pdts and nfs_v3_vm_bufs (or the NFS4 equivalents)
5. Unmount and remount your filesystems, and repeat
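Example ioo/nfso commands for step 4 - the values are illustrative only, and some of these become restricted tunables at AIX 6.1; raise them gradually and re-check vmstat -v:
# ioo -p -o numfsbufs=1024
  File system fsbufs (takes effect at mount time)
# ioo -p -o j2_dynamicBufferPreallocation=64
  JFS2 external pager fsbufs
# ioo -p -o pv_min_pbuf=1024
  pbufs, applied to VGs as they are varied on
# nfso -p -o nfs_v3_pdts=2 -o nfs_v3_vm_bufs=10000
  NFS client (client filesystem) fsbufs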
To increase a VG's pbufs:
# lvmo -v <vgname> -o pv_pbuf_count=<new value>
pv_min_pbuf is tuned via ioo and takes effect when a VG is varied on
Increase the value, collect statistics, and change it again if necessary
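To see the per-VG pbuf settings and blocked IO counts (VG name illustrative):
# lvmo -a -v datavg
  Reports pv_pbuf_count, total_vg_pbufs, max_vg_pbuf_count and pervg_blocked_io_count, plus the global pv_min_pbuf and global_blocked_io_count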
VMM read ahead sizing guideline: make sure (maxfree - minfree), tuned with vmo, is large enough to absorb the Max Read Ahead
Where,
Max Read Ahead = max(maxpgahead, j2_maxPageReadAhead)
Number of memory pools = # echo mempool \* | kdb and count them
Read ahead
Read ahead detects that we're reading sequentially and gets the data before the application requests it
Reduces %iowait
Too much read ahead means you do IO that you don't need
Operates at the file system layer - sequentially reading files
Set maxpgahead for JFS and j2_maxPageReadAhead for JFS2
  Values of 1024 for the max page read ahead are not unreasonable
Disk subsystems read ahead too, when sequentially reading disks
  Tunable on DS4000; fixed on ESS, DS6000, DS8000 and SVC
If using LV striping, use strip sizes of 8 or 16 MB
  Avoids unnecessary disk subsystem read ahead
Be aware of application block sizes that always cause read aheads
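Example read ahead settings - illustrative values, set with ioo (-p makes them persistent):
# ioo -a | grep -E "pgahead|PageReadAhead"
  Display the current JFS and JFS2 read ahead values
# ioo -p -o maxpgahead=1024
# ioo -p -o j2_maxPageReadAhead=1024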
Write behind
Write behind tuning for sequential writes to a file:
  Tune numclust for JFS
  Tune j2_nPagesPerWriteBehindCluster for JFS2
Write behind tuning for random writes to a file:
  Tune maxrandwrt for JFS
Tune j2_maxRandomWrite and j2_nRandomCluster for JFS2
  j2_maxRandomWrite is the max number of random writes allowed to accumulate to a file before additional IOs are flushed; the default is 0 (off)
j2_nRandomCluster specifies the number of clusters apart two consecutive writes must be in order to be considered random
If you have bursty IO, consider using write behind to smooth out the IO rate
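Example write behind settings - illustrative values; these are all ioo tunables:
# ioo -p -o numclust=8
  JFS sequential write behind: number of 16 KB clusters allowed before write behind starts
# ioo -p -o j2_nPagesPerWriteBehindCluster=64
  JFS2 sequential write behind cluster size in pages
# ioo -p -o maxrandwrt=32
  JFS random write behind threshold (pages per file)
# ioo -p -o j2_maxRandomWrite=32 -o j2_nRandomCluster=4
  JFS2 random write behind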
If you have a lot of file activity, you have to update a lot of timestamps
File timestamps:
  File inode change time (ctime)
  File last modified time (mtime)
  File last access time (atime)
The new mount option noatime disables last access time updates for JFS2
File systems with heavy inode access activity due to file opens can see significant performance improvements
  The first customer benchmark of the efix reported a 35% improvement with DIO and a noatime mount (20K+ files)
  Most customers should expect much less in production environments
APARs: IZ11282 (AIX 5.3), IZ13085 (AIX 6.1)
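A hedged example of enabling noatime, assuming a JFS2 file system /data on an AIX level with the APARs above:
# mount -o noatime /data
# chfs -a options=rw,noatime /data
  Makes the option persistent in /etc/filesystems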
Mount options
Release behind: rbr, rbw and rbrw
  Says to throw the data out of file system cache
  rbr is release behind on read
  rbw is release behind on write
  rbrw is both
  Applies to sequential IO only
DIO: Direct IO
  Bypasses file system cache
  No file system read ahead
  No lrud or syncd overhead
  No double buffering of data
  Half the kernel calls to do the IO
  Half the memory transfers to get the data to the application
  Requires the application be written to use DIO
CIO: Concurrent IO
  The same as DIO but with no i-node locking
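Example mounts for these options (file system names are hypothetical):
# mount -o rbrw /backup
  Release behind on read and write for a sequentially accessed backup file system
# mount -o dio /rawdata
  Direct IO - bypasses file system cache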
i-node locking: when 2 or more threads access the same file and one is a write, the write will block all read threads at the file system layer
Mount options
Direct IO
  IOs must be aligned on file system block boundaries
  IOs that don't adhere to this will dramatically reduce performance
  Avoid large file enabled JFS file systems - the block size is 128 KB after 4 MB
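The JFS2 block size that DIO must align to is fixed when the file system is created (the agblksize option); a hypothetical example matching a 4 KB application IO size:
# crfs -v jfs2 -g datavg -m /db01 -a agblksize=4096 -a size=20G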
Mount options
Concurrent IO - for JFS2 only, at AIX 5.2 ML1 or later
# mount -o cio <file system>
# chfs -a options=rw,cio <file system>
Assumes that the application ensures data integrity for multiple simultaneous IOs to a file
Changes to meta-data are still serialized
i-node locking: when two threads (one of which is a write) doing IO to the same file are at the file system layer of the IO stack, reads will be blocked while a write proceeds
Provides raw LV performance with file system benefits
j2_syncPageCount and j2_syncPageLimit (ioo tunables)
j2_syncPageCount: limits the number of modified pages of a file flushed per sync pass, so a busy file is synced in chunks rather than all at once (default 0 = off)
j2_syncPageLimit: overrides j2_syncPageCount when a threshold is reached, to guarantee that sync will eventually complete for a given file; not applied if j2_syncPageCount is off (Default: 16, Range: 1-65536, Type: Dynamic, Unit: Numeric)
If application response times are impacted by syncd, try j2_syncPageCount settings from 256 to 1024
Smaller values improve short-term response times, but still result in larger syncs that impact response times over longer intervals
These will likely require a lot of experimentation and detailed analysis of IO behavior
Does not apply to mmap() memory mapped files; may not apply to shmat() files (TBD)
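Example settings where these tunables are available (values illustrative):
# ioo -p -o j2_syncPageCount=256
  Sync flushes a busy file in chunks of 256 pages
# ioo -o j2_syncPageLimit
  Display the current limit that overrides the count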
IO Pacing
IO pacing causes the CPU to do something else after doing a specified amount of IO to a file
Turning it off (the default) improves backup times and thruput
Turning it on ensures that no process hogs the CPU for IO, and ensures good keyboard response on systems with a heavy IO workload
With N CPUs and N or more sequential IO streams, keyboard response can be sluggish
# chdev -l sys0 -a minpout=256 -a maxpout=513
Normally used to avoid HACMP's dead man switch
The old values of maxpout=33 and minpout=24 significantly inhibit thruput, but are reasonable for uniprocessors with non-cached disk
AIX 5.3 introduces IO pacing per file system via the mount command:
# mount -o minpout=256 -o maxpout=513 /myfs
AIX 6.1 uses minpout=4096 and maxpout=8193 by default; these values can also be used for earlier releases of AIX
Asynchronous IO tuning
New -A iostat option monitors AIO (or -P for POSIX AIO) at AIX 5.3 and 6.1
# iostat -A 1 1

System configuration: lcpu=4 drives=1 ent=0.50

aio:  avgc  avfc  maxg  maxf  maxr   avg-cpu:  %user  %sys  %idle  %iow  physc  %entc
        25     6    29    10  4096              30.7  36.3   15.1  17.9    0.0   81.9

Disks:     % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk0     100.0        61572.0   484.0  8192      53380
avgc - average global non-fastpath AIO request count per second for the specified interval
avfc - average AIO fastpath request count per second for the specified interval, for IOs to raw LVs (doesn't include CIO fast path IOs)
maxg - maximum non-fastpath AIO request count since the last time this value was fetched
maxf - maximum fastpath request count since the last time this value was fetched
maxr - maximum AIO requests allowed - the AIO device's maxreqs attribute
If maxg gets close to maxr or maxservers, then increase maxreqs or maxservers
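Examples of examining and raising the AIO limits - device names and values are illustrative:
At AIX 5.3, AIO settings are attributes of the aio0 (and posix_aio0) device:
# lsattr -El aio0
# chdev -l aio0 -P -a maxservers=40 -a maxreqs=8192
  -P defers the change to the next boot; maxservers is per logical CPU at 5.3
At AIX 6.1, AIO is tuned with (restricted) ioo tunables instead:
# ioo -a | grep aio
# ioo -p -o aio_maxreqs=65536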
[lsattr output fragment for the FC protocol device (fscsiX), showing: How this adapter is CONNECTED, Dynamic Tracking of FC Devices, FC Fabric Event Error RECOVERY Policy, Adapter SCSI ID, and FC Class for Fabric, with their user-settable flags]
sqfull = number of times the hdisk driver's service queue was full
At AIX 6.1 this is changed to a rate: the number of IOPS submitted to a full queue
device    %busy   avque   r+w/s   Kbs/s   avwait   avserv
hdisk0    100     36.1    363     46153   51.1     8.3
hdisk0    99      38.1    350     44105   58.0     8.5
hdisk0    99      37.1    356     45129   54.6     8.4
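Typical commands for checking and raising queue depths - hdisk/adapter names and values are illustrative:
# lsattr -El hdisk4 -a queue_depth
# lsattr -Rl hdisk4 -a queue_depth
  Shows the allowed range for this disk and driver
# chdev -l hdisk4 -a queue_depth=16 -P
  -P applies the change at the next boot (or reconfigure the disk while it is not in use)
# lsattr -El fcs0 -a num_cmd_elems
# chdev -l fcs0 -a num_cmd_elems=400 -P
  FC adapter command queue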
VIO
The VIO Server (VIOS) uses multi-path IO code for the attached disk subsystems
The VIO client (VIOC) always uses SCSI MPIO if accessing storage thru two VIOSs
  In this case only entire LUNs are served to the VIOC
At AIX 5.3 TL5 and VIO 1.3, hdisk queue depths are user-settable attributes (up to 256)
  Prior to these levels, the VIOC hdisk queue_depth=3
Set the queue_depth at the VIOC to that at the VIOS for the LUN
Set MPIO hdisks' hcheck_interval attribute to some nonzero value, e.g. 60, when using multiple paths for at least one hdisk
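A sketch for the VIO client side, assuming hdisk0 is a LUN served through two VIOSs and its queue_depth at the VIOS is 20:
# chdev -l hdisk0 -a queue_depth=20 -a hcheck_interval=60 -P
# lspath -l hdisk0
  Verify that the path through each VIOS is Enabled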