GC Tuning Strategies for Java Performance
Memory Profiling
Jeff Taylor
Sun Microsystems
1
Agenda
• Case for GC and Tuning
• Object Lifecycle
• Generational Collection
• Garbage Collectors
• JVM GC Observability Tools
2
Tools
• Verbose GC
• PrintGCStats
• jps, jstack, jinfo
• jconsole
• jstat
• VisualVM
• GCHisto
• ps
• mdb
• pmap
• sar
• JVM Ergonomics
• VisualGC
• hprof
3
Sun JVM Options
• Standard options
> All platforms
• -X options
> Not all platforms
• -XX options
> Not all platforms
> May need additional privileges to use
4
A Multitude of Options
JRE Version   Number of Options   Options Added   Options Removed
1.4           159                 —               —
1.4.1         224                 70              5
1.4.2         260                 44              8
5             343                 98              10
6             427                 102             35
5
On the Shoulders of Giants
• Material in the presentation is
based on previous work,
including:
• JavaOne Online Technical
Sessions
> TS-4887
> TS-6500
6
Objects Need Storage Space
• Age-old problems
> How to allocate space efficiently
> How to reclaim unused space (garbage)
efficiently and reliably
• C (malloc and free)
• C++ (new and delete)
• Java (new and Garbage Collection)
7
Why is GC Necessary?
• Alternative to manual deallocation
• Errors in storage deallocation are hard to debug
• Deallocation management may lead to a tight
binding between supposedly independent
modules
• Manual memory management breaks
encapsulation
• Message passing leads to dynamic execution
paths
• Garbage collection works
8
Why is GC Tuning Necessary?
• GC is not “one size fits all”
• Best GC characteristics depend on
requirements of the application and
deployment
> Some applications need best throughput
> Other applications need low pause time
> Variations in deployment size, deployment
hardware, and shared resources
9
GC Tuning is an Art
• Unfortunately, we can't give you a flawless
recipe or a flowchart that will apply to all
your GC tuning scenarios
• GC tuning involves a lot of common
pattern recognition
• This pattern recognition requires
experience
10
Object Realities
• Generational hypothesis
• Most objects are very short-lived
> 80-98% of all newly allocated objects
die within a few million instructions
> 80-98% of all newly allocated objects
die before another megabyte has been
allocated
• This heavily influences the choice of
GC algorithm
11
Why Generational?
• Most Java applications
> Conform to the weak generational hypothesis
> Really benefit from generational GC
> Performance-wise, generational GC is hard to
beat in most cases
• All GCs in the HotSpot JVM are
generational
12
Generational Garbage Collectors
• Driven by the weak generational hypothesis
• Split the heap into “generations”
> Usually two: young generation / old generation
• Concentrate collection effort on the young generation
> Good payoff (a lot of space reclaimed) for your collection
effort
> Lower GC overhead
> Most pauses are short
• Reduced allocation rate into the old generation
> Young generation acts as a “filter”
13
GC Vocabulary: Object Lifecycle
[Diagram: new() allocates objects into Eden (Young); incremental GCs
copy survivors between Survivor 1 and Survivor 2 and release garbage;
long-lived objects are tenured into the Tenured (Old) generation,
which is released by a Full GC]
14
Metrics for Collection
• Heap population (aka live set)
> How much of your heap is alive
• Allocation rate
> How fast you allocate
• Mutation rate
> How fast your program updates references in memory
• Heap shape
> The shape of the live object graph
> * Hard to quantify as a metric...
• Object lifetime
> How long objects live
• Cycle time
> How long it takes the collector to free up memory
• Marking time
> How long it takes the collector to find all live objects
• Sweep time
> How long it takes to locate dead objects
> * Relevant for mark-sweep
• Compaction time
> How long it takes to free up memory by relocating objects
> * Relevant for mark-compact
15
Jconsole
[Slides 16-20: Jconsole screenshots]
20
Which JVM options should be used
for large scale applications?
• Answer: It depends
> Hardware
> Application
> Usage patterns
• One of the fundamental questions that
needs to be answered by every
administrator is: “Is memory being used
efficiently?”
• Solaris has a significant advantage for 32-bit
Java
21
Heap Sizing Trade-Offs
• Generally, the larger the heap space, the better
> For both young and old generation
> Larger space: less frequent GCs, lower GC
overhead, objects more likely to become garbage
> Smaller space: faster GCs (not always! see later)
• Sometimes max heap size is dictated by available
memory and/or max space the JVM can address
> You have to find a good balance between young
and old generation size
22
Sizing Heap Spaces
• -Xmx<size> : max heap size
> young generation + old generation
• -Xms<size> : initial heap size
> young generation + old generation
• -Xmn<size> : young generation size
• Applications with emphasis on
performance tend to set -Xms and -Xmx to
the same value
• When -Xms != -Xmx, heap growth or
shrinking requires a Full GC
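• For example, a minimal sketch of a fixed-size heap (the jar name and
sizes are illustrative assumptions, not recommendations):
> java -Xms2g -Xmx2g -Xmn512m -jar MyApp.jar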
23
Should -Xms == -Xmx?
• Set -Xms to what you think would be your
desired heap size
> It's expensive to grow the heap
• If memory allows, set -Xmx to something
larger than -Xms “just in case”
> Maybe the application is hit with more load
> Maybe the DB gets larger over time
• In most cases, it's better to do a Full
GC and grow the heap than to get an OOM
and crash
24
Sizing Heap Spaces (ii)
• -XX:PermSize=<size> : permanent generation initial
size
• -XX:MaxPermSize=<size> : permanent generation
max size
• Applications with emphasis on performance almost
always set -XX:PermSize and -XX:MaxPermSize to the
same value
> Growing or shrinking the permanent generation
requires a Full GC too
• Unfortunately, the permanent generation occupancy
is hard to predict
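• For example, a sketch with a fixed permanent generation (the 256m
value is an illustrative assumption):
> java -XX:PermSize=256m -XX:MaxPermSize=256m -Xms2g -Xmx2g -jar MyApp.jar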
25
Priority 1: Eden Heap
• When a well written Java program performs poorly,
there are 3 typical causes:
> The Java heap size is too small, causing an
excessive amount of garbage collection
– CPU test (SPARC vs. Intel/AMD vs. CMT)
> A Java heap size that is so big that portions are
paged to virtual memory
– Out of RAM
> GC pauses can be too long with 64-bit Java heaps
– Long pauses
• Eden heap sizing is critical
26
Young Generation Sizing
• Eden size determines
> The frequency of minor GCs
> Which objects will be reclaimed at age 0
• Increasing the size of the Eden will not
always affect minor GC times
> Remember: minor GC times are proportional to
the number of objects they copy (i.e., the live
objects), not to the young generation size
27
Sizing Heap Spaces
• -XX:NewSize=<size> : initial young
generation size
• -XX:MaxNewSize=<size> : max young
generation size
• -XX:NewRatio=<ratio> : young
generation to old generation ratio
• Applications with emphasis on
performance tend to use -Xmn to size the
young generation since it combines the use
of -XX:NewSize and -XX:MaxNewSize
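• A sketch of explicit young generation sizing (values are illustrative
assumptions; this is equivalent to -Xmn512m here):
> java -Xms2g -Xmx2g -XX:NewSize=512m -XX:MaxNewSize=512m -jar MyApp.jar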
28
Priority 2: Other critical parameters
• Survivor ratio
• Collector algorithm
> Concurrent Mark Sweep
> ParNewGC
> ParallelGC
• Ergonomics
> Auto-tuning in Java 1.5+
> Good for single-JRE sizing; needs burn-in
> Use the final values
29
A Reasonable Goal
• Try to keep garbage collection at 5% or
less of the JVM’s CPU time.
• If you can’t accomplish this by adjusting
the JVM parameters, consider purchasing
additional RAM.
• The information presented here is
intended to help you understand how to
measure the current status to achieve this
goal.
30
Technique 1: Minimize Full GCs
• Old generation collections use more resources
> Long pauses
> More CPU cycles
• 1. Increase the size of the old generation space
> Full GCs occur when the old generation is
nearly full (while respecting the “new
generation guarantee”)
• 2. Minimize the rate at which objects are tenured
> My goal as a tuner is to stop objects from being
unnecessarily tenured
31
Object Lifecycle
[Diagram: the object lifecycle annotated for Technique 1 — Full GCs of
the Tenured (Old) generation are expensive, so increase the old
generation size and minimize promotions]
32
Technique 2: Slow the tenuring process
• Why objects are promoted:
• They survive a certain number of young
generation garbage collections
> Therefore, increasing the time interval between
young generation collections means that an
object will be older before being tenured
• The survivor space is too small to contain all of
the objects which survive a young
generation collection, in which case the
survivors spill into the old generation
> Therefore, a large survivor space is a good
thing
33
Spilling & Pumping
[Diagram: the object lifecycle annotated for Technique 2 — a slow
tenuring process keeps objects “pumping” between the survivor spaces;
if there is no space in a survivor, objects “spill” into the old
generation]
34
Technique 3: Keep objects in Young
• The rate at which new Java objects are
created cannot be controlled by the
administrator
> It is a function of the application, usage pattern,
and the algorithms implemented by the
programmers
• The interval between young generation
collections is simply the size of Eden
divided by the creation rate
> Increasing the size of Eden makes the time
interval between new generation collections
longer
35
Keep objects in Young
[Diagram: the object lifecycle annotated for Technique 3 — the
interval between incremental GCs is simply the size of Eden divided by
the creation rate; doubling the size of Eden halves the collection
rate]
36
VisualGC – Old is growing, spilling
[Screenshot: the survivor space is full, the old generation is
growing, and no tenuring is taking place]
37
VisualGC – Well behaved GC
[Screenshot: gradual slope in the old generation graph, survivor space
not full, old generation stable, 10 tenuring generations in use]
38
VisualGC – Xms default (too small)
[Screenshot: Eden is very small; the greyed grid is unmapped RAM]
39
Tenuring
• -XX:SurvivorRatio=3
• -XX:TargetSurvivorRatio=<percent>, e.g., 50
> How much of the survivor space should be filled
– Typically leave extra space to deal with “spikes”
• -XX:InitialTenuringThreshold=<threshold>
• -XX:MaxTenuringThreshold=<threshold>
• -XX:+AlwaysTenure
> Never keep any objects in the survivor spaces
• -XX:+NeverTenure
> Very bad idea!
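• A sketch of survivor space tuning (the values shown are illustrative
assumptions, not recommendations):
> java -Xmn512m -XX:SurvivorRatio=6 -XX:TargetSurvivorRatio=50 -XX:MaxTenuringThreshold=8 -jar MyApp.jar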
40
Tenuring Threshold Trade-Offs
• Try to retain as many objects as possible in
the survivor spaces so that they can be
reclaimed in the young generation
> Less promotion into the old generation
> Less frequent old GCs
• But also, try not to unnecessarily copy very
long-lived objects between the survivors
> Unnecessary overhead on minor GCs
• Not always easy to find the perfect balance
> Generally: better to copy more than to promote
more
41
Tenuring Distribution
• Monitor the tenuring distribution with
-XX:+PrintTenuringDistribution
Desired survivor size 6684672 bytes, new threshold 8 (max 8)
age 1: 2315488 bytes, 2315488 total
age 2: 19528 bytes, 2335016 total
age 3: 96 bytes, 2335112 total
age 4: 32 bytes, 2335144 total
• Young generation seems well tuned here
> We can even decrease the survivor space size
42
Tenuring Distribution (ii)
Desired survivor size 3342336 bytes, new threshold 1 (max 6)
age 1: 3956928 bytes, 3956928 total
• Survivor space too small!
> Increase survivor space and/or eden size
43
Tenuring Distribution (iii)
Desired survivor size 3342336 bytes, new threshold 6 (max 6)
age 1: 2483440 bytes, 2483440 total
age 2: 501240 bytes, 2984680 total
age 3: 50016 bytes, 3034696 total
age 4: 49088 bytes, 3083784 total
age 5: 48616 bytes, 3132400 total
age 6: 50128 bytes, 3182528 total
• Might be able to do better
> Either increase max tenuring threshold
> Or even set max tenuring threshold to 2
– If ages > 6 still have around 50K of surviving bytes
44
VisualVM
[Slides 45-49: VisualVM screenshots]
49
Jstat: Time in GC & Generation Sizes
• VisualGC is pretty, but:
> It is interactive only
– You can't use it for historical analysis
> You can't put the results into a spreadsheet
• Instead, use jstat:
# pgrep java | xargs -n 1 /usr/jdk/jdk1.5.0_06/bin/jstat -gc
S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
273024.0 273024.0 0.0 1751.0 1092352.0 882496.2 2048000.0 805586.6 65536.0 31095.8 115 82.588 2 39.730 122.318
273024.0 273024.0 1745.9 0.0 1092352.0 1070228.3 2048000.0 923693.3 65536.0 30294.0 138 97.911 4 90.186 188.097
273024.0 273024.0 0.0 60328.1 1092352.0 892815.4 2048000.0 659197.6 98304.0 62288.3 361 387.152 2 29.291 416.443
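• jstat can also sample periodically for later analysis (the PID,
interval, and count below are illustrative):
# jstat -gcutil 12345 10s 360 > gc_samples.txt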
50
Jstat – survivor spaces
# jstat -gc
S0C S1C S0U S1U
273024.0 273024.0 0.0 1751.0
273024.0 273024.0 1745.9 0.0
273024.0 273024.0 0.0 60328.1
S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
273024.0 273024.0 0.0 1751.0 1092352.0 882496.2 2048000.0 805586.6 65536.0 31095.8 115 82.588 2 39.730 122.318
273024.0 273024.0 1745.9 0.0 1092352.0 1070228.3 2048000.0 923693.3 65536.0 30294.0 138 97.911 4 90.186 188.097
273024.0 273024.0 0.0 60328.1 1092352.0 892815.4 2048000.0 659197.6 98304.0 62288.3 361 387.152 2 29.291 416.443
51
Jstat – Eden and Old
# jstat -gc
EC EU OC OU
1092352.0 882496.2 2048000.0 805586.6
1092352.0 1070228.3 2048000.0 923693.3
1092352.0 892815.4 2048000.0 659197.6
S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
273024.0 273024.0 0.0 1751.0 1092352.0 882496.2 2048000.0 805586.6 65536.0 31095.8 115 82.588 2 39.730 122.318
273024.0 273024.0 1745.9 0.0 1092352.0 1070228.3 2048000.0 923693.3 65536.0 30294.0 138 97.911 4 90.186 188.097
273024.0 273024.0 0.0 60328.1 1092352.0 892815.4 2048000.0 659197.6 98304.0 62288.3 361 387.152 2 29.291 416.443
52
Jstat – Garbage Collection Times
# jstat -gc
YGC YGCT FGC FGCT GCT
115 82.588 2 39.730 122.318
138 97.911 4 90.186 188.097
361 387.152 2 29.291 416.443
S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
273024.0 273024.0 0.0 1751.0 1092352.0 882496.2 2048000.0 805586.6 65536.0 31095.8 115 82.588 2 39.730 122.318
273024.0 273024.0 1745.9 0.0 1092352.0 1070228.3 2048000.0 923693.3 65536.0 30294.0 138 97.911 4 90.186 188.097
273024.0 273024.0 0.0 60328.1 1092352.0 892815.4 2048000.0 659197.6 98304.0 62288.3 361 387.152 2 29.291 416.443
53
Limitations of jstat
• Jstat is sample based:
> “a finite part of a statistical population whose
properties are studied to gain information
about the whole”
• Details are smoothed over
• Jstat is not a good tool to answer questions
such as:
> “Were the user performance complaints that
came in after everyone returned from lunch
due to excessive garbage collections at this
particular time?” Or,
> “How much space is available after the garbage
collection completes?”
54
Monitoring the GC
• Online
> VisualVM: http://visualvm.dev.java.net/
> VisualGC: http://java.sun.com/performance/jvmstat/
– VisualGC is also available as a VisualVM plugin
– Can monitor multiple JVMs within the same tool
• Offline
> GC Logging
> PrintGCStats
> GChisto
55
GC Logging in Production
• Don't be afraid to enable GC logging in
production
> Very helpful when diagnosing production issues
• Extremely low / non-existent overhead
> Maybe some large files in your file system :)
> We are surprised that customers are still afraid
to enable it
• Real customer quote:
> “If someone doesn't enable GC logging in
production, I shoot them!”
56
Most Important GC Logging
Parameters
• You need at least:
> -XX:+PrintGCTimeStamps
– Add -XX:+PrintGCDateStamps if you must
> -XX:+PrintGCDetails
– Preferred over -verbose:gc as it's more detailed
• Also useful:
> -Xloggc:<file>
> Separates GC logging output from application
output
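• A minimal sketch of a production command line with GC logging
enabled (the jar and log path are illustrative assumptions):
> java -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/myapp/gc.log -jar MyApp.jar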
57
PrintGCStats
• Summarizes GC logs
• Downloadable script from
> http://java.sun.com/developer/technicalArticles/Programming/turbo/PrintGCStats.zip
• Usage
> PrintGCStats -v cpus=<num> <gc log file>
– Where <num> is the number of CPUs on the
machine where the GC log was obtained
58
PrintGCStats Parallel GC
what count total mean max stddev
gen0t(s) 193 11.470 0.05943 0.687 0.0633
gen1t(s) 1 7.350 7.34973 7.350 0.0000
GC(s) 194 18.819 0.09701 7.350 0.5272
alloc(MB) 193 11244.609 58.26222 100.875 18.8519
promo(MB) 193 807.236 4.18257 96.426 9.9291
used0(MB) 193 16018.930 82.99964 114.375 17.4899
used1(MB) 1 635.896 635.89648 635.896 0.0000
used(MB) 194 91802.213 473.20728 736.490 87.8376
commit0(MB) 193 17854.188 92.50874 114.500 9.8209
commit1(MB) 193 123520.000 640.00000 640.000 0.0000
commit(MB) 193 141374.188 732.50874 754.500 9.8209
alloc/elapsed_time = 11244.609 MB / 77.237 s = 145.586 MB/s
alloc/tot_cpu_time = 11244.609 MB / 1235.792 s = 9.099 MB/s
alloc/mut_cpu_time = 11244.609 MB / 934.682 s = 12.030 MB/s
promo/elapsed_time = 807.236 MB / 77.237 s = 10.451 MB/s
promo/gc0_time = 807.236 MB / 11.470 s = 70.380 MB/s
gc_seq_load = 301.110 s / 1235.792 s = 24.366%
gc_conc_load = 0.000 s / 1235.792 s = 0.000%
gc_tot_load = 301.110 s / 1235.792 s = 24.366%
59
PrintGCStats CMS
60
GChisto
• Graphical GC log visualizer
• Under development
> Currently, can only show pause times
• Open source at
> http://gchisto.dev.java.net/
61
GCHisto
63
How big can you make the heap?
• 1 GB is not the limit of the Java heap!
• 3 potential limits to the size of the Java heap:
> Physical memory
– Paging
– Add RAM or reduce usage
> Virtual memory
– JVM will not start
– malloc() failure
– Add swap
> Process address space
– Core dumps
– Unstable JRE exits
– Debug with pmap or mdb
64
Consumers of Address Space
• 3 consumers of address space:
> Java heap – a fixed address range
> Native heap
– malloc()
– sockets, windows, gzip, javac and JNI code
> Thread count and stack size (-Xss)
• Ensure a 200MB safety zone above the native heap
• Consider 64-bit Java
65
Threads & Address Space
• Figure out why threads are created!
> Is it one per session?
> Is it a fixed server thread count?
• Estimate the maximum thread count
• Win32 / Linux 32-bit will have a smaller thread limit
> Solaris 32-bit == 4GB address space
> Win32 / Linux32 == 2GB (even on a 64-bit OS)
> 32-bit Java on Win64 / Linux64 has a 2GB limit
• Consider 64-bit Java
66
SPECjbb2005: Out of Box Performance
[Chart: SPECjbb2005 scores, normalized to IBM SDK 5.0 32-bit Linux]
67
Pmap & Address Space
pmap -x `pgrep -n java`
java -server -Xms2600m -Xmx2600m -Xss1024k -XX:NewSize=1400
Address Kbytes RSS Anon Locked Mode Mapped File
..........
0x00400000 241664 241664 241664 rwx [ heap ]
0x38000000 64608 9352 r-x libociei.so
...........
• What is the address space for the heap?
> 0x38000000 – 0x00400000 = 0x37C00000 = 892 MB
• What heap is in use? 241664 KB = 236 MB
• Space for brk() to map == 892 – 236 = 656 MB before a SEGV
• Do a pmap on the core file if you have one
• Expert: preload libumem.so
> Use mdb on the core, then ::umastat
> We were able to pin down a memory leak with umem debugging
68
Using multiple JVMs
• The services of multiple 32-bit JVMs can
be used by most applications
> WebLogic
> Sun Java Web Server
> Oracle Application Server
– (Hint: use Solaris Containers)
> Websphere
– (Hint: use Solaris Containers)
69
Java Heap & Paging
• The heap sizes are too big if memory pressure is
causing excessive virtual memory activity
• Indicator: scan rate (“sr”) from “sar -g” or “vmstat -p”
• The scan rate should be at or close to zero
• If the scanner kicks in for a short time but returns to
zero, virtual memory pressure is not having a
significant impact on your performance
• If the system is always scanning, you need to kill
non-critical processes, reduce the size of your Java
heap, or add more RAM to the system
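• For example, on Solaris (illustrative checks, sampling every 5
seconds; watch the “sr” / page scan columns):
# vmstat 5
# sar -g 5 12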
70
Working Set, RSS and VSZ
• If you want to increase the size of your heaps, you
will need to determine how much RAM is available.
• It is important to differentiate between a process’s
working set, resident set size and virtual size.
> The working set is the set of memory addresses that
a program will need to use in the near future
> RSS, the resident set size of a process, is the portion
of the process's address space that is currently in RAM
> VSZ, the virtual size, is the total size of the process's
memory: pages that are currently in RAM, pages
that the operating system has paged out, and
addresses that have been allocated but not yet
been mapped
71
Resident Set Size and Virtual Size
• View the resident set size and virtual size:
> ps -e -o rss,vsz,args | grep java | sort -n
• Add the sizes of all of your processes' RSS:
> ps -e -o rss,vsz,args | awk '{printf( "%d+",
$1)}END{print 0}' | bc
> 12354976
– Will always be less than the size of RAM
• Add the sizes of all of your processes' VSZ:
> ps -e -o rss,vsz,args | awk '{printf( "%d+",
$2)}END{print 0}' | bc
> 19383912
– You must have this amount of swap
72
Finding memory allocation with mdb
# mdb -k
> ::memstat
Page Summary Pages MB %Tot
------------ ---------------- ---------------- ----
Kernel 89123 696 4%
Anon 589676 4606 28%
Exec and libs 3973 31 0%
Page cache 196119 1532 9%
Free (cachelist) 141024 1101 7%
Free (freelist) 1067444 8339 51%
73
Heap Conclusion
• Administrator needs to observe and minimize both
Java garbage collection and virtual memory pressure.
> Java garbage collection should take no more than
5% of the JVM’s CPU cycles.
• The virtual memory scan rate should remain at zero
most of the time.
> If the administrator can not accomplish both goals
on a given server, RAM needs to be added to the
server.
> Use a 32-bit JVM if possible
• Solaris allows you to go further with a 32-bit JVM
74
Garbage collection &
Memory Profiling
Jeff Taylor
[email protected]
75
Agenda
• Case for GC and Tuning
• Object Lifecycle
• Generational Collection
• Garbage Collectors
• JVM GC Observability Tools
76
Your Dream GC
• You would really like a GC that has
> Low GC overhead
> Low GC pause times, and
> Good space efficiency
• Unfortunately, you'll have to pick two (any
two!)
77
New Generation Collectors
• "Serial" GC is a stoptheworld, young
generation, copying collector which uses a
single GC thread.
• Parallel Scavenge" is a stoptheworld,
young generation, copying collector which
uses multiple GC threads.
• "ParNew" is a stoptheworld, young
generation, copying collector which uses
multiple GC threads. "ParNew" does the
synchronization needed so that it can run
during the concurrent phases of CMS.
78
Old Generation Collectors
• "Serial Old" is a stop-the-world, old
generation, mark-sweep-compact collector
that uses a single GC thread.
• "CMS" is a mostly concurrent, old
generation, low-pause collector.
• "Parallel Old" is an old generation,
compacting collector that uses multiple GC
threads.
79
GC Design Choices
• Serial vs. parallel
• Concurrent vs. stop-the-world
• Compacting vs. non-compacting vs. copying
80
Serial Collector
• Both young and old generation collections
performed serially
• Mark-sweep-compact algorithm for old and
permanent generations
• Suited to most desktop applications
• Standard option on non-server class
machines
> -XX:+UseSerialGC
81
Mark Sweep Compact: Old Generation
[Diagram: the old generation before and after collection — dead
objects (marked x) are swept and the surviving objects are compacted]
82
Parallel Collector
• Also known as throughput collector
• Most machines today have
> multiple cores/CPUs
> Large amounts of memory
• Compare to machines when Java launched
> Default heap size was 64 MB
83
Parallel Collector: Young Generation
• Parallel copy collector
> Still stoptheworld
• Allocates as many threads as CPUs
> Algorithm optimized to minimize contention
• Maximize work throughput
> Work stealing
• Potential locality of reference issue
> Each thread has separate destination in
tenured space
84
Parallel Copy Collector
• -XX:+UseParNewGC
> The default copy collector will be used on
single-CPU machines
> Only required on pre-Java SE 5 VMs
• -XX:ParallelGCThreads=n
> Default is the number of CPUs
> Reduce on machines running multiple applications
> Can be used to force the parallel copy collector
to be used on a single-CPU machine
85
Parallel Scavenge: Old Generation
• Mark-sweep-compact
• Order of objects is maintained
> No locality of reference issues
• Requires multiple passes
> Mark live data
> Compute new locations and move data
> Update all pointers
• Default collector on server class machines
> Java SE 5 onwards
• -XX:+UseParallelGC
86
Parallel Compacting Collector
• Introduced in Java SE 5 update 6
> -XX:+UseParallelOldGC
• Same parallel copy collector for young generation
• Three phase, sliding compaction algorithm
> marking, summary, compaction
• Not all phases are currently parallel
• Can be good for UltraSPARC T1
> CMS is single threaded
> CMS cannot keep up with mutator threads
> Use parallel old, assuming pause times are
acceptable
87
STW Parallel GC Threads
• The number of parallel GC threads is
controlled by -XX:ParallelGCThreads=<num>
• Default value assumes only one JVM per
system
• Set the parallel GC thread number according
to:
> Number of JVMs deployed on the system /
processor set / zone
> CPU chip architecture
– Multiple hardware threads per chip core, i.e.,
UltraSPARC T1 / T2
88
Parallel GC Tuning Advice
• Tune the young generation as described so far
• Try to avoid / decrease the frequency of major
GCs
• We know of customers who use the Parallel
GC in low-pause environments
> Avoid Full GCs by avoiding / minimizing
promotion
> Maximize heap size
> If the old generation is getting full, redirect load to
another machine while the Full GC is happening
– Mechanism should be there to deal with failures
89
Parallel GC Ergonomics
• The Parallel GC has ergonomics
> i.e., auto-tuning
• Ergonomics help in improving out-of-the-box
GC performance
• To get maximum performance, most customers
we know do manual tuning
90
JVM Ergonomics
• Ergonomics enables the following:
> Throughput garbage collector and adaptive sizing
– -XX:+UseParallelGC
– -XX:+UseAdaptiveSizePolicy
> Initial heap size of 1/64 of physical memory, up to
1 GB
> Maximum heap size of 1/4 of physical memory, up
to 1 GB
> Server runtime compiler (-server)
• To enable server ergonomics on 32-bit Windows, use
the following flags:
> -server -Xmx1g -XX:+UseParallelGC
> Varying the heap size
91
Using JVM Ergonomics
• Maximum pause time goal
> -XX:MaxGCPauseMillis=n
> This is a hint, not a guarantee
> GC will adjust parameters to try to meet the goal
> Can adversely affect application throughput
• Throughput goal
> -XX:GCTimeRatio=n
> GC time : application time = 1 / (1 + n)
> e.g. -XX:GCTimeRatio=19 (5% of time in GC)
• Footprint goal
> Only considered if the first two goals are met
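• A sketch of goal-based tuning with the throughput collector (the
goal values are illustrative assumptions):
> java -XX:+UseParallelGC -XX:MaxGCPauseMillis=200 -XX:GCTimeRatio=19 -Xmx2g -jar MyApp.jar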
92
Heap Tuning Beyond Ergonomics
• Increase heap size
> Ergonomics chooses up to 1 GB; some
applications need more memory for high
performance
> -Xms3g -Xmx3g
• Increase the size of the young generation
> Generally: ¼ to ½ the overall heap size
> Sizing above ½ the overall heap size is
supported
> Only makes sense with the throughput collector
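• For example (illustrative sizes, assuming a throughput-oriented
application on a machine with ample RAM):
> java -XX:+UseParallelOldGC -Xms3g -Xmx3g -Xmn1g -jar MyApp.jar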
93
Concurrent Mark Sweep Collector
• Low-pause or low-latency collector
• Parallel copy collector for the young generation
[Diagram: application threads running alongside the concurrent CMS
collection phases]
94
Concurrent Mark Sweep Collector
• -XX:+UseConcMarkSweepGC
• Concurrent marking phase is parallel in JDK 6
> -XX:ParallelCMSThreads=n
> Default is ¼ of available CPUs
• Scheduling of collection is handled by the GC
> Based on statistics in the JVM
> Or the occupancy level of the tenured generation
> -XX:CMSInitiatingOccupancyFraction
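• A sketch of a low-pause configuration (sizes and the initiating
threshold are illustrative assumptions):
> java -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -Xms2g -Xmx2g -Xmn512m -jar MyApp.jar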
95
Incremental CMS
• The 2-CPU problem: a concurrent marking thread can
occupy one of the two CPUs, reducing application
throughput to 50%
• Incremental CMS interleaves the marking work with
application work in small increments
96
Incremental CMS
• -XX:+CMSIncrementalMode (off)
• -XX:CMSIncrementalDutyCycle=n% (50)
• -XX:CMSIncrementalDutyCycleMin=n% (10)
• -XX:+CMSIncrementalPacing (on)
• A DutyCycle of 10 and a DutyCycleMin of 0
can help certain applications
97
CMS Tuning Advice
• Tune the young generation as described so
far
• Need to be even more careful about
avoiding premature promotion
> Originally we were using an +AlwaysTenure
policy
> We have since changed our mind :)
• Promotion in CMS is expensive (free lists)
• The more often promotion / reclamation
happens, the more likely fragmentation
will settle in
98
CMS Tuning Advice (ii)
• We know customers who tune their
applications to do mostly minor GCs, even
with CMS
> CMS is used as a “safety net”, when the
application's load exceeds what has been
provisioned for
> Schedule Full GCs at non-critical times (say,
late at night) to “tidy up” the heap and
minimize fragmentation
99
Fragmentation
• Two types
> External fragmentation
– No free chunk is large enough to satisfy an
allocation
> Internal fragmentation
– Allocator rounds up allocation requests
– Free space wasted due to this rounding up
• Related: dark matter
> Free chunks too small to allocate
100
Fragmentation (ii)
• The bad news: you can never eliminate it!
> It has been proven
• The good news: you can decrease its
likelihood
> Decrease promotion into the CMS old
generation
> Be careful when coding
– Large objects of various sizes are the main cause
• But, when is the heap fragmented anyway?
101
Concurrent CMS GC Threads
• The number of parallel CMS threads is
controlled by
-XX:ParallelCMSThreads=<num>
> Available in post-6 JVMs
• Trade-off:
> CMS cycle duration vs.
> Concurrent overhead during a CMS cycle
102
Permanent Generation and CMS
• To date, classes will not be unloaded by
default from the permanent generation
when using CMS
> Both -XX:+CMSClassUnloadingEnabled and
-XX:+PermGenSweepingEnabled need to be
set to enable class unloading in CMS
> The 2nd switch is not needed in post-6u4 JVMs
103
Setting CMS Initiating Threshold
• Again, a tricky trade-off!
• Starting a CMS cycle too early
> Frequent CMS cycles
> High concurrent overhead
• Starting a CMS cycle too late
> Chance of an evacuation failure / Full GC
• The initiating heap occupancy should be
(much) higher than the application's steady-
state live size
• Otherwise, CMS will constantly do CMS
cycles
104
Common CMS Scenarios
• Applications that promote non-trivial
amounts of objects to the old generation
> Old generation grows at a non-trivial rate
> Very frequent CMS cycles
> CMS cycles need to start relatively early
• Applications that promote very few or even
no objects to the old generation
> Old generation grows very slowly, if at all
> Very infrequent CMS cycles
> CMS cycles can start quite late
105
Initiating CMS Cycles
• CMS will try to automatically find the best
initiating occupancy
> It first does a CMS cycle early to collect stats
> Then, it tries to start cycles as late as possible,
but early enough not to run out of heap before
the cycle completes
> It keeps collecting stats and adjusting when to
start cycles
> Sometimes, the second cycle starts too late
106
Initiating CMS Cycles (ii)
• -XX:CMSInitiatingOccupancyFraction=<percent>
> Occupancy percentage of the CMS old generation
that triggers a CMS cycle
• -XX:+UseCMSInitiatingOccupancyOnly
> Don't use the ergonomic initiating occupancy
107
Initiating CMS Cycles (iii)
• -XX:CMSInitiatingPermOccupancyFraction=<percent>
> Occupancy percentage of the permanent
generation that triggers a CMS cycle
> Class unloading must be enabled
108
CMS Cycle Initiation Example
109
CMS Cycle Initiation Example (ii)
110
CMS Cycle Initiation Example (iii)
• This is better:
[ParNew 640710K->546360K(773376K), 0.1839508 secs]
[CMS-initial-mark 548460K(773376K), 0.0883685 secs]
[ParNew 651320K->556690K(773376K), 0.2052309 secs]
[CMS-concurrent-mark: 0.832/1.038 secs]
[CMS-concurrent-preclean: 0.146/0.151 secs]
[CMS-concurrent-abortable-preclean: 0.181/0.181 secs]
[CMS-remark 623877K(773376K), 0.0328863 secs]
[ParNew 655656K->561336K(773376K), 0.2088224 secs]
[ParNew 648882K->554390K(773376K), 0.2053158 secs]
...
[ParNew 489586K->395012K(773376K), 0.2050494 secs]
[ParNew 463096K->368901K(773376K), 0.2137257 secs]
[CMS-concurrent-sweep: 4.873/6.745 secs]
[CMS-concurrent-reset: 0.010/0.010 secs]
[ParNew 445124K->350518K(773376K), 0.1800791 secs]
[ParNew 455478K->361141K(773376K), 0.1849950 secs]
111
Start CMS Cycles Explicitly
• If relying on explicit GCs and you want them to
be concurrent, use:
> -XX:+ExplicitGCInvokesConcurrent
– Requires a post-6 JVM
> -XX:+ExplicitGCInvokesConcurrentAndUnloadsClasses
– Requires a post-6u4 JVM
• Useful when wanting to cause references /
finalizers to be processed
112
Consider Serial GC Too
• For small heaps, pause time requirements
may be achieved using the Serial GC
> Consider the Serial GC for heaps up to 128MB
to 256MB
> The Serial GC is easier to tune than CMS
> What is learned from tuning the Serial GC is
useful for the initial CMS tuning
113
Collector Summary
• Many factors affect Java performance
> Application code
> App server settings (Java EE only)
> JVM settings
• Understanding the JVM is essential to
improving performance
• You MUST profile your application!
• If possible, always upgrade to the latest
version of the JVM
114
Garbage collection &
Memory Profiling
Jeff Taylor
[email protected]
115
Agenda – Misc:
• jps, jinfo, jstack, jmap
• hprof
• Large Page Sizes
• Thread local allocation
• NUMA
• Tiered Compilation
• Sun Java Real-Time System
116
jps
• $ jps
• 7875 JConsole
• 20374 Jps
• 16287
• 7746 Main
• 20218 Main
• $ jps -l
• 7875 sun.tools.jconsole.JConsole
• 16287
• 20510 sun.tools.jps.Jps
• 7746 org.netbeans.Main
• 20218 twoday.Main
117
jmap -histo
• $ jmap -histo 20218 | head -10
•
• num #instances #bytes class name
• ----------------------------------------------
• 1: 25579 522962504 [I
• 2: 7698 929584 <methodKlass>
• 3: 7698 884104 <constMethodKlass>
• 4: 24916 797312
twoday.HeapExampleApp$Big
•
118
jinfo
• $ jinfo -flags 20218
• Attaching to process ID 20218, please wait...
• Debugger attached successfully.
• Server compiler detected.
• JVM version is 14.0-b16
•
• -Xms1g -Xmx1g -Xmn500m -XX:SurvivorRatio=3
-XX:TargetSurvivorRatio=90 -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-Xloggc:/home/user/Desktop/SoftwareAG/HeapExampleApp/log/20090907_191929/GC_log.txt -XX:
+PrintTenuringDistribution
•
119
jstack
• "main" prio=10 tid=0x0000000040113000 nid=0x4efb waiting on
condition [0x0000000041dd8000]
• java.lang.Thread.State: TIMED_WAITING (sleeping)
• at java.lang.Thread.sleep(Native Method)
• at twoday.HeapExampleApp.infiniteLoop(HeapExampleApp.java:83)
• at twoday.Main.main(Main.java:20)
•
• "Concurrent Mark-Sweep GC Thread" prio=10
tid=0x000000004016e000 nid=0x4efe runnable
120
jmap -heap
• $ jmap -heap 20218
• Attaching to process ID 20218, please wait...
• Debugger attached successfully.
• Server compiler detected.
• JVM version is 14.0-b16
•
• using parallel threads in the new generation.
• using thread-local object allocation.
• Concurrent Mark-Sweep GC
•
• Heap Configuration:
• MaxHeapSize = 1073741824 (1024.0MB)
121
HPROF
• Command line tool
• Supplied with JDK
• CPU usage
• Heap allocation statistics
• Monitor contention profiles
• Report complete heap dumps
• States of monitors and threads
122
HPROF
• $ java -agentlib:hprof=help
•
• HPROF: Heap and CPU Profiling Agent (JVMTI Demonstration
Code)
•
• hprof usage: java -agentlib:hprof=[help]|
[<option>=<value>, ...]
•
• Option Name and Value Description Default
• --------------------- ----------- -------
• heap=dump|sites|all heap profiling all
• cpu=samples|times|old CPU usage off
• monitor=y|n monitor contention n
123
HPROF Example
• $ java -server
-agentlib:hprof=heap=sites
-Xms1g -Xmx1g ...
• On exit:
> Dumping allocation sites ... done.
> Creates java.hprof.txt
> percent live alloc'ed stack class
> rank self accum bytes objs bytes objs trace name
> 1 99.60% 99.60% 307477984 14996 601710384 29346 301321 int[]
> 2 0.16% 99.76% 479776 14993 939072 29346 301320
twoday.HeapExampleApp$Big
> 3 0.03% 99.78% 80024 1 80024 1 301261 java.lang.Object[]
>
124
125
Large Page Sizes
• -XX:+UseLargePages
> Cross platform (only on by default for Solaris)
> Improves utilization of the TLB
> Kernel support in Linux 2.6 and Windows 2003
Server
> 8m on SPARC
> 4m on x86
> 2m on x86_64
> 256m supported for UltraSPARC T1
• -XX:LargePageSizeInBytes=n
> Set to 2m for AMD Opteron systems
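• For example (an illustrative command line; whether large pages help
depends on the platform and workload):
> java -XX:+UseLargePages -XX:LargePageSizeInBytes=2m -Xmx2g -jar MyApp.jar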
126
Multi-processors/cores and Eden Allocation
• Problem: multiple threads creating objects
> All trying to access eden simultaneously
> Multiple CPU machine: contention
• Solution: Thread local allocation
> -XX:+UseTLAB
127
Thread Local Allocation
[Diagram: Eden space divided into per-thread allocation buffers (T1,
T2, T3), each with its own allocation pointer; TLABs are resized as
needed]
• -XX:TLABSize=<size in bytes>
• -XX:+ResizeTLAB
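• For example (illustrative values; the default adaptive TLAB sizing
is usually sufficient, so explicit sizing is rarely needed):
> java -XX:+UseTLAB -XX:TLABSize=256k -XX:+ResizeTLAB -jar MyApp.jar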
128
NUMA
• Non-Uniform Memory Access
> Applicable to most SPARC, Opteron, and more
recently Intel platforms
• -XX:+UseNUMA
• Splits the young generation into partitions
> Each partition “belongs” to a CPU
• Allocates new objects into the partition
that belongs to the allocating CPU
• Big win for some applications
129
Random Things
• Consider disabling explicit GC
> -XX:+DisableExplicitGC
• Increase the size of the permanent generation
> If lots of classes are loaded at start
> Can improve startup time
> e.g. NetBeans sets the permanent heap size to
20MB
• -XX:+AggressiveOpts
> “Go fast” meta-option (can change between
releases)
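• A sketch combining these options (illustrative only;
-XX:+AggressiveOpts in particular should be tested before use):
> java -XX:+DisableExplicitGC -XX:MaxPermSize=128m -XX:+AggressiveOpts -jar MyApp.jar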
130
Tiered Compilation
• New in JRE 6
• HotSpot has two compilers, client & server
• JVM starts with client compiler
> Fast warmup
• Switches to the server compiler
> Better optimization
• -XX:+TieredCompilation
131
Sun Java Real-Time System
• The Real-Time Specification for Java
(RTSJ) – JSR-001
• RTSJ provides an API set, semantic JVM
enhancements, RTGC, and JVM-to-OS
layer modifications to satisfy real-time
requirements for Java application
development
132
Inside the Java Real-Time System
- Real-Time Garbage Collector
• Based on Roger Henriksson's PhD thesis,
Lund University, Sweden
• Operating principles:
> GC threads run at a lower priority than critical
real-time threads
> Real-time threads are unaffected by GC activity
> Non-critical (non-real-time) threads pay a GC
“tax”
• Pro: very deterministic, with little to no GC-borne
latency
• Con: non-real-time threads bear the GC burden, with a
loss of overall throughput
133
Resources
• java.sun.com/performance/reference
• visualvm.dev.java.net
• http://www.j2ee.me/developer/technicalArticles/Programming/HPROF.html
134
Why Solaris is a better OS
• What is the biggest -Xmx you can start?
• 32-bit Java on Solaris can use 3.6 GB for
> Java heap: 3.2 GB max
> Threads
> Native heap
• Windows 32-bit address space is limited to
2.0 GB, less mappings
> Java heap: 1.3 GB max
• Linux 32-bit address space is limited to 2.0
GB, less mappings
> Java heap: 1.6 GB max
135
Garbage collection &
Memory Profiling
Jeff Taylor