Sunday, April 13, 2014

HotSpot JVM GC and performance tuning


Performance tuning is a fascinating topic that offers a great deal to learn about the JVM and about your application.
Every Java application has its own behaviour and its own requirements, so tuning always provides scope to learn something new.
Most developers focus only on the package they are supposed to code, and while coding they commit many faux pas without realizing how these could bring down performance in production. In this post I would like to look at JVM tuning and at coding practices that are effective in minimizing performance glitches.


The heap is divided into two parts:



1. Young Generation

Young generation memory consists of two parts: eden space and two survivor spaces. Most objects are initially allocated in eden. One of the survivor spaces is always empty and serves as the destination of any live objects in eden and in the other survivor space during the next copying collection. Objects are copied between the survivor spaces in this way until they are old enough to be tenured (copied to the tenured generation). Thus short-lived objects spend their whole lives in eden: every object starts its life there, and when a GC happens, objects that are still alive are moved to a survivor space while dereferenced objects are removed.
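
As a quick illustration, here is a toy program of mine (not from any official guide) that allocates millions of short-lived objects; run it with -verbose:gc to watch eden fill up and get collected, and it also prints each collector's collection count via the standard management API:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class MinorGcDemo {
    public static void main(String[] args) {
        // Each array dies immediately, so eden fills up repeatedly
        // and the JVM performs a series of cheap minor collections.
        for (int i = 0; i < 10_000_000; i++) {
            byte[] garbage = new byte[128];
        }
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount() + " collections");
        }
    }
}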

2. Old Generation – Tenured and Perm Gen

Old generation memory has two parts, the tenured generation and the permanent generation (Perm Gen). Perm Gen is a familiar term: errors like "java.lang.OutOfMemoryError: PermGen space" come from this region.

GC moves live objects from survivor space to tenured generation. The permanent generation contains meta data of the virtual machine, class and method objects.


Performance criteria:

There are three primary measures of garbage collection performance:
  1. Throughput is the percentage of total time not spent in garbage collection, considered over long periods of time. Throughput includes time spent in allocation (but tuning for speed of allocation is generally not needed).
  2. Pauses are the times when an application appears unresponsive because garbage collection is occurring.
  3. Footprint is the working set of a process, measured in pages and cache lines. On systems with limited physical memory or many processes, footprint may dictate scalability. 


In simple terms, the goal of tuning is to get good performance with little or no manual adjustment of command-line options, by selecting the garbage collector, heap size, and runtime compiler at JVM startup instead of using fixed defaults.

Selecting a collector:

If the application has a small data set (up to approximately 100MB), then select the serial collector with -XX:+UseSerialGC.
If the application will be run on a single processor and there are no pause time requirements, then let the VM select the collector, or select the serial collector with -XX:+UseSerialGC.
If (a) peak application performance is the first priority and (b) there are no pause time requirements or pauses of one second or longer are acceptable, then let the VM select the collector, or select the parallel collector with -XX:+UseParallelGC and (optionally) enable parallel compaction with -XX:+UseParallelOldGC.
If response time is more important than overall throughput and garbage collection pauses must be kept shorter than approximately one second, then select the concurrent collector with -XX:+UseConcMarkSweepGC. If only one or two processors are available, consider using incremental CMS mode (-XX:+CMSIncrementalMode).
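
Whichever collector you pick, you can confirm at runtime which ones the VM actually selected by querying the standard GarbageCollectorMXBeans. A small sketch of mine:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class ShowCollectors {
    public static void main(String[] args) {
        // Prints e.g. "Copy" / "MarkSweepCompact" for the serial collector,
        // or "ParNew" / "ConcurrentMarkSweep" when CMS is enabled.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName());
        }
    }
}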

Reducing Garbage-Collection Pause Time:

There are two general ways to reduce garbage-collection pause time and the impact it has on application performance:

The garbage collection can itself leverage the existence of multiple CPUs and be executed in parallel. Although the application threads remain fully suspended during this time, the garbage collection can be done in a fraction of the time, effectively reducing the suspension time.
The second approach is to leave the application running, and execute garbage collection concurrently with the application execution.
These two approaches, together with the baseline single-threaded approach, have led to the development of serial, parallel, and concurrent garbage-collection strategies, which form the foundation of all existing Java garbage-collection implementations.


The serial collector suspends the application and executes the mark-and-sweep algorithm in a single thread. It is the simplest and oldest form of garbage collection in Java and is still the default on client-class machines in the Oracle HotSpot JVM.

The parallel collector uses multiple threads to do its work. It can therefore decrease the GC pause time by leveraging multiple CPUs. It is often the best choice for throughput applications.

The concurrent collector does the majority of its work concurrent with the application execution. It has to suspend the application for only very short amounts of time. This has a big benefit for response-time–sensitive applications, but is not without drawbacks.

(Mostly) Concurrent Marking and Sweeping
Concurrent garbage-collection strategies complicate the relatively simple mark-and-sweep algorithm a bit. The mark phase is usually sub-divided into some variant of the following:

In the initial marking, the GC root objects are marked as alive. During this phase, all threads of the application are suspended.
During concurrent marking, the marked root objects are traversed and all reachable objects are marked. This phase is fully concurrent with application execution, so all application threads are active and can even allocate new objects. For this reason there might be another phase that marks objects that have been allocated during the concurrent marking. This is sometimes referred to as pre-cleaning and is still done concurrent to the application execution.
In the final marking (often labeled re-mark), all threads are suspended and all remaining newly allocated objects are marked as alive.
The concurrent mark works mostly, but not completely, without pausing the application. The tradeoff is a more complex algorithm and an additional phase that is not necessary in a normal stop-the-world GC: the final marking.

The Oracle JRockit JVM improves this algorithm with the help of a keep area, which, if you’re interested, is described in detail in the JRockit documentation. New objects are kept separately and not considered garbage during the first GC. This eliminates the need for a final marking or re-mark.

In the sweep phase of the CMS, all memory areas not occupied by marked objects are found and added to the free list. In other words, the objects are swept by the GC. This phase can run at least partially concurrent to the application. For instance, JRockit divides the heap into two areas of equal size and sweeps one then the other. During this phase, no threads are stopped, but allocations take place only in the area that is not actively being swept.

The downsides of the CMS algorithm can be quickly identified:

Because the marking phase runs concurrently with the application's execution, the application can allocate objects faster than the collector reclaims them; if the heap fills up before the collection finishes, the result is a concurrent mode failure and a stop-the-world full collection.
Because freed memory is tracked in free lists rather than compacted, the heap fragments over time, with all that this entails.
The algorithm is more complicated than the other two and consequently requires more CPU cycles.
The algorithm requires more fine-tuning and has more configuration options than the other approaches.
These disadvantages aside, the CMS will nearly always lead to greater predictability and better application response time.

Reducing the Impact of Compacting
Modern garbage collectors execute their compacting processes in parallel, leveraging multiple CPUs. Nevertheless, nearly all of them have to suspend the application during this process; JVMs with several gigabytes of heap can be suspended for several seconds or more. To work around this, each JVM implements a set of parameters that can be used to compact memory in smaller, incremental steps instead of as a single big block. The parameters work as follows:

Compacting is executed not for every GC cycle, but only once a certain level of fragmentation is reached (e.g., if more than 50% of the free memory is not continuous).
One can configure a target fragmentation. Instead of compacting everything, the garbage collector compacts only until a designated percentage of the free memory is available as a continuous block.
This works, but the optimization process is tedious, involves a lot of testing, and needs to be done again and again for every application to achieve optimum results.

Sizing of Heap and Various Ratios:



A number of parameters affect generation size. There is a difference between committed space and virtual space in the heap. At initialization of the virtual machine, the entire space for the heap is reserved. The size of the reserved space can be specified with the -Xmx option. If the value of the -Xms parameter is smaller than the value of the -Xmx parameter, not all of the space that is reserved is immediately committed to the virtual machine; the uncommitted portion is called "virtual" space. The different parts of the heap (permanent generation, tenured generation and young generation) can grow to the limit of the virtual space as needed.
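
You can observe the committed-versus-virtual distinction from inside the application via java.lang.Runtime. A minimal sketch (the output labels are mine): totalMemory() reports the committed heap, maxMemory() the reserved limit set by -Xmx.

public class HeapSpaces {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long committed = rt.totalMemory();  // heap currently committed to the VM
        long reserved  = rt.maxMemory();    // total reserved heap (-Xmx)
        System.out.println("committed: " + committed / (1024 * 1024) + " MB");
        System.out.println("reserved : " + reserved / (1024 * 1024) + " MB");
        System.out.println("virtual  : " + (reserved - committed) / (1024 * 1024) + " MB");
    }
}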

Some of the parameters are ratios of one part of the heap to another.


Total Heap


Since collections occur when generations fill up, collection frequency is inversely proportional to the amount of memory available: the larger the heap, the less often collections occur, and the higher the throughput. Total available memory is the most important factor affecting garbage collection performance.

By default, the virtual machine grows or shrinks the heap at each collection to try to keep the proportion of free space to live objects at each collection within a specific range. This target range is set as a percentage by the parameters -XX:MinHeapFreeRatio=<minimum> and -XX:MaxHeapFreeRatio=<maximum>, and the total size is bounded below by -Xms<min> and above by -Xmx<max>. The default parameters for the 32-bit Solaris Operating System (SPARC Platform Edition) are shown in this table:

Parameter            Default Value
MinHeapFreeRatio     40
MaxHeapFreeRatio     70
-Xms                 3670k
-Xmx                 64m

Let's understand these parameters:

With MinHeapFreeRatio at 40 (%), if the percentage of free space in a generation falls below 40%, the generation is expanded to maintain 40% free space, up to the maximum allowed size of the generation. Similarly, with MaxHeapFreeRatio at 70 (%), if the free space exceeds 70%, the generation is contracted so that only 70% of the space is free, subject to the minimum size of the generation.

The initial heap size for an application is generally very small. The rule of thumb: unless you have problems with pauses, try granting as much memory as possible to the virtual machine; the default maximum (64MB) is often too small.

Setting -Xms and -Xmx to the same value increases predictability by removing the most important sizing decision from the virtual machine. However, the virtual machine is then unable to compensate if you make a poor choice.
In general, increase the memory as you increase the number of processors, since allocation can be parallelized.


The Young Generation


The second most influential knob is the proportion of the heap dedicated to the young generation. A larger young generation means fewer minor collections. However, for a fixed heap size, a larger young generation means a smaller tenured generation, and hence more frequent major collections.

By default, the young generation size is controlled by NewRatio. For example, setting -XX:NewRatio=3 means that the ratio between the young and tenured generation is 1:3. In other words, the combined size of the eden and survivor spaces will be one fourth of the total heap size.

The parameters NewSize and MaxNewSize bound the young generation size from below and above. Setting these to the same value fixes the young generation, just as setting -Xms and -Xmx to the same value fixes the total heap size. This is useful for tuning the young generation at a finer granularity than the integral multiples allowed by NewRatio.
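
To see how these flags play out, you can list the generational pools and their sizes with the standard MemoryPoolMXBean API. A small sketch of mine (the pool names vary by collector and JVM; getMax() may return -1 if undefined):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class ShowPools {
    public static void main(String[] args) {
        // Prints pools such as "Eden Space", "Survivor Space" and "Tenured Gen"
        // together with their committed and maximum sizes in bytes.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage();
            System.out.printf("%-20s committed=%,d max=%,d%n",
                    pool.getName(), u.getCommitted(), u.getMax());
        }
    }
}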

Survivor Space Sizing

Survivor space sizing can be configured, but from a performance perspective it is usually not that important. For example, -XX:SurvivorRatio=n sets the ratio between eden and a survivor space to 1:n. In other words, each survivor space will be one nth the size of eden, and thus one (n+2)th the size of the young generation (not one (n+1)th, because there are two survivor spaces). With -XX:SurvivorRatio=6, for instance, each survivor space is one eighth of the young generation.

If survivor spaces are too small, copying collection overflows directly into the tenured generation. If survivor spaces are too large, they will be uselessly empty.

Here are the default values for the 32-bit Solaris Operating System (SPARC Platform Edition); the default values on other platforms are different.

Parameter       Client JVM Default   Server JVM Default
NewRatio        8                    2
NewSize         2228K                2228K
MaxNewSize      not limited          not limited
SurvivorRatio   32                   32
The maximum size of the young generation will be calculated from the maximum size of the total heap and NewRatio. The "not limited" default value for MaxNewSize means that the calculated value is not limited by MaxNewSize unless a value for MaxNewSize is specified on the command line.



The steps for setting parameters for server applications are:

First decide the maximum heap size you can afford to give the virtual machine. Then tune the young generation size to give you the best results.
Note that the maximum heap size should always be smaller than the amount of memory installed on the machine, to avoid excessive page faults and thrashing.
If the total heap size is fixed, increasing the young generation size requires reducing the tenured generation size. Keep the tenured generation large enough to hold all the live data used by the application at any given time, plus some amount of slack space (10-20% or more).

Subject to the above constraint on the tenured generation:
Grant plenty of memory to the young generation.
Increase the young generation size as you increase the number of processors, since allocation can be parallelized.



Friday, April 11, 2014

HotSpot JVM Memory Optimization and Tools

In this blog post, I want to discuss optimization of Java memory usage. The Sun JDK has two simple but powerful tools for memory profiling: jmap and jhat.

jmap has two important capabilities for memory profiling. It can:
• create a heap dump file for any live java process
• show a heap distribution histogram

Neither of these capabilities requires any special parameters for the Java virtual machine (JVM). Below is a heap distribution histogram produced by jmap.
gyadav@gyadav-ubuntu:~$ jmap -histo:live 5241
num     #instances         #bytes  class name
----------------------------------------------
   1:        123406       19940016  <constMethodKlass>
   2:        198852       16388720  [C
   3:        123406       15831360  <methodKlass>
   4:         11531       14892200  <constantPoolKlass>
   5:         11531        8938112  <instanceKlassKlass>
   6:          9596        8641280  <constantPoolCacheKlass>
   7:         91819        4703840  [B
   8:        176758        4242192  java.lang.String
   9:         88520        2832640  java.util.HashMap$Entry
  10:         27069        1948968  java.lang.reflect.Field
  11:         12444        1577032  java.lang.Class
  12:         10453        1383296  [Ljava.util.HashMap$Entry;
  13:         17116        1321848  [S
  14:         29401        1318704  [Ljava.lang.Object;
  15:         24378        1143144  [I
  16:          1734         999016  <methodDataKlass>
  17:         18074         922552  [[I
  18:         15235         680328  [Ljava.lang.String;
  19:           896         480256  <objArrayKlassKlass>
  20:          9605         461040  org.eclipse.core.internal.registry.ReferenceMap$SoftRef
  21:          9394         450912  java.util.HashMap
  22:         10832         433280  java.util.LinkedHashMap$Entry
  23:         17229         413496  java.util.ArrayList
...
Total       1317091      118936448

For each class we can see the class name, the number of instances of the class in the heap, and the number of bytes used in the heap by all instances of the class. The table is sorted by consumed heap space. Names like [C and [B are the JVM's internal notation for char[] and byte[] arrays.

You can also try jhat. jhat can read and explore a heap dump. It has a web interface where you can click through your dumped object graph.

gyadav@gyadav-ubuntu:~$ jmap -dump:file=eclipse.heap 5241
Dumping heap to /home/gyadav/eclipse.heap ...
Heap dump file created

gyadav@gyadav-ubuntu:~$ jhat -J-Xmx512m  eclipse.heap 
Reading from eclipse.heap...
Dump file created Thu May 01 12:43:48 IST 2014
Snapshot read, resolving...
Resolving 1365807 objects...
Chasing references, expect 273 dots.................................................................................................................................................................................................................................................................................
Eliminating duplicate references.................................................................................................................................................................................................................................................................................
Snapshot resolved.
Started HTTP server on port 7000

Server is ready.


Now open http://localhost:7000 and you can see a summary of all classes. You can use the standard queries via links, or go straight to the “Execute Object Query Language (OQL) query” link at the bottom and type your own query. The query language is quite awkward (it is based on the JavaScript engine) and may not work well for large numbers of objects, but it is a very powerful tool.
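
For example, this is the sample query from jhat's built-in OQL help; it selects all strings of 100 characters or more (note it relies on the pre-JDK-7u6 count field of String):

select s from java.lang.String s where s.count >= 100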

Enterprise applications are 80% strings and maps


Let’s look at the jmap histogram again. Near the top, we can see the "[C" class consuming a large share of the heap. These char arrays are part of String objects, and we can see that String instances are also consuming considerable space. From my experience, 60-80% of the heap in an enterprise application is consumed by strings and hash maps.
Strings
Let's look at how the JVM stores strings. String objects are semantically immutable. Each instance has four fields (all except hash are marked final):
• Reference to a char array
• Integer offset
• Integer count of characters
• Integer string hash (lazily evaluated, and once evaluated never changes)

How much memory does one String instance consume?

Here and below are size calculations for the Sun JVM (32-bit); other vendors should be similar.
Object header (8 bytes) + 1 reference (4 bytes) + 3 ints (12 bytes) = 24 bytes. But the String instance is only a header; the actual text is stored in the char array (2 bytes per character + 12 bytes of array header).
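
Putting these numbers together, a rough footprint formula can be written down. The helper below is a hypothetical sketch of mine and assumes the 32-bit layout described above:

public class StringFootprint {
    // 24-byte String header + char array (12-byte header + 2 bytes per char),
    // with the array size rounded up to the JVM's 8-byte alignment.
    static long footprint(int chars) {
        long array = 12 + 2L * chars;
        array = (array + 7) / 8 * 8;
        return 24 + array;
    }

    public static void main(String[] args) {
        System.out.println("\"hello\" costs about " + footprint(5) + " bytes"); // ~48 bytes
    }
}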



String instances can share char arrays with each other. When you call substring(…) on a String instance, no new char array allocation happens; instead, a new String referencing a subrange of the existing char array is created. (This describes JDKs prior to 7u6; later JDKs copy the characters in substring().) Sometimes this can become a problem. Imagine you are loading a text file (e.g., CSV). First you load the entire line as a string, then you seek the position of the field and call substring(…). Now your small field-value string holds a reference to the char array of the entire line of text. Sometime later the string header for the text-line object is collected, but its characters are still in memory because they are referenced by the other string object.

If you create a String object with the String constructor, a new right-sized char array is allocated. To copy the content of a substring you can use the following construct (it looks a bit strange, but it works):
new String(a.substring(…))
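
Here is the CSV scenario as a small sketch (the file name and field layout are made up); the copy on the marked line is what releases the full line's char array for collection on pre-7u6 JDKs:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CsvFieldCopy {
    public static void main(String[] args) throws IOException {
        List<String> firstFields = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader("data.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma < 0) continue;  // skip malformed lines
                // Without new String(...), 'field' would pin the whole line's char[].
                firstFields.add(new String(line.substring(0, comma)));
            }
        }
        System.out.println(firstFields.size() + " fields kept");
    }
}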

String.intern() – think twice
The String class has a method intern() that guarantees the following:
a.equals(b)  => a.intern() == b.intern()

There is a table inside the JVM used to store the canonical forms of strings. If the text is already present in the table, the value from the table is returned by intern(); otherwise the string is added to the table. So a reference returned by intern() is always an object from the JVM intern table.

The string intern table keeps weak references to its objects, so unused strings can be collected as garbage when no references exist other than the one from the intern table itself. It looks like a great idea to use intern() to eliminate all duplicated strings in an application. Many have tried… and many have regretted the decision. I cannot say this is true for every JVM vendor, but if you are using Sun's JVM, you should never do this. Why?


  •   JVM string intern tables are stored in PermGen -- a separate region of the heap (not included in the -Xmx size) used for the JVM's internal needs. Garbage collection in this area is expensive, and its size is limited (though configurable).
  •   Lookup and insertion degrade to O(n) complexity, where n is the table size, because the table has a fixed number of buckets.
String intern tables in JVMs work perfectly for the JVM's own needs (new entries are added only while loading new classes, so insertion time is not a big issue) and they are very compact. But they are completely unsuitable for storing millions of application strings; they were simply not designed for such a use case.
Removing duplicates
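A safer alternative to intern() is to deduplicate strings with an ordinary map on the regular heap. A minimal sketch (the class name and API are mine):

import java.util.HashMap;
import java.util.Map;

public class StringDeduplicator {
    private final Map<String, String> pool = new HashMap<String, String>();

    // Returns a canonical instance equal to s, reusing a previously seen one if possible.
    public String dedup(String s) {
        String canonical = pool.get(s);
        if (canonical == null) {
            pool.put(s, s);
            canonical = s;
        }
        return canonical;
    }
}

Unlike the JVM intern table, this map lives in the ordinary heap, grows like any HashMap, and can simply be discarded once the loading phase is over.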
Maps and sets
Good old java.util.HashMap is used everywhere. The standard JDK implementation uses an open hash table (separate chaining) data structure.

References to the key and the value are stored in an Entry object (which also caches the key's hash for faster access). If several keys map to the same hash slot, they are stored as a linked list of entries. The size of each entry structure: object header + 3 references + int = 24 bytes. java.util.HashSet uses java.util.HashMap under the hood, so the overhead is the same. Can we store a map/set in more compact form? Sure.

Sorted array map
If the keys are comparable, we can store the map as a sorted array of interleaved key/value pairs (or, for a set, an array of keys). Binary search can be used for fast lookups, but insertion and deletion of entries have to shift elements. Due to the high cost of updates, such a structure is effective only for smaller collections, and for operation patterns that are mostly reads. Fortunately, this is usually what we have: a map/set of a few dozen objects that is read far more often than it is modified.
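
As a concrete illustration, here is a minimal sketch of a sorted-array set for primitive long keys (the class name and API are mine, not from any library):

import java.util.Arrays;

public class SortedArraySet {
    private long[] keys = new long[0];

    // O(log n) lookup via binary search; no per-entry objects are allocated.
    public boolean contains(long key) {
        return Arrays.binarySearch(keys, key) >= 0;
    }

    // O(n) insertion: elements after the insertion point have to shift.
    public void add(long key) {
        int pos = Arrays.binarySearch(keys, key);
        if (pos >= 0) return;                 // already present
        int insertAt = -pos - 1;              // binarySearch encodes the insertion point
        long[] next = new long[keys.length + 1];
        System.arraycopy(keys, 0, next, 0, insertAt);
        next[insertAt] = key;
        System.arraycopy(keys, insertAt, next, insertAt + 1, keys.length - insertAt);
        keys = next;
    }
}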

Closed hash table
If we have a collection of a larger size, we should use a hashtable. Can we make the hashtable more compact? Again, the answer is yes. There are data structures called closed hashtables that do not require entry objects.


In closed hashtables, references from the hashtable point directly to the object (key). What if we want to put in a reference to a key but its hash slot is already occupied? In that case, we find another slot (e.g., the next one). How do you look up a key? Search through all adjacent slots until the key or a null reference is found. As you can see from the algorithm, it is very important to have enough empty slots in the table: the density of a closed hash table should be kept below 0.5 to avoid performance degradation.
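
A minimal sketch of such a structure with linear probing, keeping the density below 0.5 (the class name and API are hypothetical; keys must be non-null):

public class ClosedHashSet {
    private Object[] slots = new Object[16];  // capacity stays a power of two
    private int size;

    public boolean add(Object key) {
        if ((size + 1) * 2 > slots.length) resize();
        int i = key.hashCode() & (slots.length - 1);
        while (slots[i] != null) {            // probe adjacent slots
            if (slots[i].equals(key)) return false;
            i = (i + 1) & (slots.length - 1);
        }
        slots[i] = key;                       // key stored directly: no Entry objects
        size++;
        return true;
    }

    public boolean contains(Object key) {
        int i = key.hashCode() & (slots.length - 1);
        while (slots[i] != null) {
            if (slots[i].equals(key)) return true;
            i = (i + 1) & (slots.length - 1);
        }
        return false;
    }

    private void resize() {
        Object[] old = slots;
        slots = new Object[old.length * 2];
        size = 0;
        for (Object k : old) if (k != null) add(k);
    }
}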
Conclusion
Optimization is an art. There are no magical data structures capable of solving every problem. As you can see, you have to fight for every byte. Memory optimization is a complex process. Remember that you should design your data so each object can be referenced from different collections (instead of having to copy data). It is usually better to use semantically immutable objects because you can easily share them instead of copying them. And from my experience, in a well-designed application, optimization and tuning can reduce memory usage by 30-50%.