Sunday, April 13, 2014

HotSpot JVM GC and performance tuning


Performance tuning is an interesting topic that offers a great deal to learn about both the JVM and your application.
Every Java application has its own behaviour and its own requirements, so every tuning exercise teaches something new.
Most developers focus only on the package they are supposed to code. While coding, they commit many small mistakes without realizing how these could degrade performance in production. In this post I want to look at JVM tuning and at coding practices that help minimize performance glitches.


The heap is divided into two parts:



1. Young Generation

Young generation memory consists of two parts: Eden space and two survivor spaces. Most objects are initially allocated in eden; every object starts its life there. One of the survivor spaces is always empty and serves as the destination for any live objects in eden and in the other survivor space during the next copying collection. Objects are copied between survivor spaces in this way until they are old enough to be tenured (copied to the tenured generation). Short-lived objects therefore live and die in the young generation. When a minor GC happens, objects that are still alive are moved to a survivor space, while dereferenced objects are removed.

2. Old Generation – Tenured and Perm Gen

Old generation memory has two parts: the tenured generation and the permanent generation (Perm Gen). Perm Gen is a familiar term; most of us have seen errors like "PermGen space" when it is not sufficient.

GC moves live objects from the survivor space to the tenured generation. The permanent generation contains metadata of the virtual machine, such as class and method objects.


Performance criteria:

There are three primary measures of garbage collection performance:
  1. Throughput is the percentage of total time not spent in garbage collection, considered over long periods of time. Throughput includes time spent in allocation (but tuning for speed of allocation is generally not needed).
  2. Pauses are the times when an application appears unresponsive because garbage collection is occurring.
  3. Footprint is the working set of a process, measured in pages and cache lines. On systems with limited physical memory or many processes, footprint may dictate scalability. 
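These measures can be observed directly by turning on GC logging. A minimal sketch for a JDK 7/8-era HotSpot JVM (MyApp is a placeholder for your main class):

```shell
# Log every collection with timestamps and per-generation details;
# throughput and pause times can then be computed from gc.log.
java -verbose:gc \
     -XX:+PrintGCDetails \
     -XX:+PrintGCTimeStamps \
     -Xloggc:gc.log \
     MyApp
```

Each log line reports a pause duration; summing the pauses over a run gives the time not spent in application work, from which throughput follows.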


In simple terms, the goal is to get good performance with little or no tuning of command-line options, by selecting the garbage collector, heap size, and runtime compiler at JVM startup instead of using fixed defaults.

Selecting a collector:

If the application has a small data set (up to approximately 100MB), then select the serial collector with -XX:+UseSerialGC.
If the application will be run on a single processor and there are no pause time requirements, then let the VM select the collector, or select the serial collector with -XX:+UseSerialGC.
If (a) peak application performance is the first priority and (b) there are no pause time requirements or pauses of one second or longer are acceptable, then let the VM select the collector, or select the parallel collector with -XX:+UseParallelGC and (optionally) enable parallel compaction with -XX:+UseParallelOldGC.
If response time is more important than overall throughput and garbage collection pauses must be kept shorter than approximately one second, then select the concurrent collector with -XX:+UseConcMarkSweepGC. If only one or two processors are available, consider using incremental mode, described below.
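As a sketch, the guidelines above map to launch commands like the following (the heap sizes and the MyApp class name are placeholders, not recommendations):

```shell
# Small data set, or single CPU with no pause-time requirements:
java -XX:+UseSerialGC -Xmx256m MyApp

# Peak throughput first; pauses of a second or more are acceptable:
java -XX:+UseParallelGC -XX:+UseParallelOldGC -Xmx2g MyApp

# Response time first; pauses must stay well under a second:
java -XX:+UseConcMarkSweepGC -Xmx2g MyApp
```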








Reducing Garbage-Collection Pause Time:

There are two general ways to reduce garbage-collection pause time and the impact it has on application performance:

The garbage collection can itself leverage the existence of multiple CPUs and be executed in parallel. Although the application threads remain fully suspended during this time, the garbage collection can be done in a fraction of the time, effectively reducing the suspension time.
The second approach is to leave the application running, and execute garbage collection concurrently with the application execution.
These two approaches have led to the development of the serial, parallel, and concurrent garbage-collection strategies, which represent the foundation of all existing Java garbage-collection implementations.





The serial collector suspends the application and executes the mark-and-sweep algorithm in a single thread. It is the simplest and oldest form of garbage collection in Java and is still the default for client-class machines in the Oracle HotSpot JVM.

The parallel collector uses multiple threads to do its work. It can therefore decrease the GC pause time by leveraging multiple CPUs. It is often the best choice for throughput applications.

The concurrent collector does the majority of its work concurrent with the application execution. It has to suspend the application for only very short amounts of time. This has a big benefit for response-time–sensitive applications, but is not without drawbacks.

(Mostly) Concurrent Marking and Sweeping
Concurrent garbage-collection strategies complicate the relatively simple mark-and-sweep algorithm a bit. The mark phase is usually sub-divided into some variant of the following:

In the initial marking, the GC root objects are marked as alive. During this phase, all threads of the application are suspended.
During concurrent marking, the marked root objects are traversed and all reachable objects are marked. This phase is fully concurrent with application execution, so all application threads are active and can even allocate new objects. For this reason there might be another phase that marks objects that have been allocated during the concurrent marking. This is sometimes referred to as pre-cleaning and is still done concurrent to the application execution.
In the final marking, all threads are suspended and all remaining newly allocated objects are marked as alive. This is indicated in Figure 2.6 by the re-mark label.
The concurrent mark works mostly, but not completely, without pausing the application. The tradeoff is a more complex algorithm and an additional phase that is not necessary in a normal stop-the-world GC: the final marking.

The Oracle JRockit JVM improves this algorithm with the help of a keep area, which, if you’re interested, is described in detail in the JRockit documentation. New objects are kept separately and not considered garbage during the first GC. This eliminates the need for a final marking or re-mark.

In the sweep phase of the CMS, all memory areas not occupied by marked objects are found and added to the free list. In other words, the objects are swept by the GC. This phase can run at least partially concurrent to the application. For instance, JRockit divides the heap into two areas of equal size and sweeps one then the other. During this phase, no threads are stopped, but allocations take place only in the area that is not actively being swept.

The downsides of the CMS algorithm can be quickly identified:

Because the marking phase runs concurrently with the application, the application can allocate memory faster than the collector reclaims it; if the old generation fills up before the concurrent cycle finishes, the result is an allocation failure and a costly stop-the-world collection.
The free lists immediately lead to memory fragmentation and all this entails.
The algorithm is more complicated than the other two and consequently requires more CPU cycles.
The algorithm requires more fine-tuning and has more configuration options than the other approaches.
These disadvantages aside, the CMS will nearly always lead to greater predictability and better application response time.
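Some of that fine-tuning is done with the standard HotSpot occupancy flags. A hedged example (the 70% threshold is an assumption to be validated against your application's allocation rate):

```shell
# Start a concurrent cycle once the old generation is 70% full, and
# forbid the JVM from second-guessing that threshold; starting earlier
# lowers the risk that the application out-allocates the collector.
java -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=70 \
     -XX:+UseCMSInitiatingOccupancyOnly \
     MyApp
```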

Reducing the Impact of Compacting
Modern garbage collectors execute their compacting processes in parallel, leveraging multiple CPUs. Nevertheless, nearly all of them have to suspend the application during this process. A JVM with several gigabytes of heap can be suspended for several seconds or more. To work around this, the various JVMs each implement a set of parameters that can be used to compact memory in smaller, incremental steps instead of as a single big block. The parameters work as follows:

Compacting is executed not for every GC cycle, but only once a certain level of fragmentation is reached (e.g., if more than 50% of the free memory is not continuous).
One can configure a target fragmentation. Instead of compacting everything, the garbage collector compacts only until a designated percentage of the free memory is available as a continuous block.
This works, but the optimization process is tedious, involves a lot of testing, and needs to be done again and again for every application to achieve optimum results.

Sizing of Heap and Various Ratios:



A number of parameters affect generation size. The following diagram illustrates the difference between committed space and virtual space in the heap. At initialization of the virtual machine, the entire space for the heap is reserved. The size of the space reserved can be specified with the -Xmx option. If the value of the -Xms parameter is smaller than the value of the -Xmx parameter, not all of the space that is reserved is immediately committed to the virtual machine. The uncommitted space is labeled "virtual" in this figure. The different parts of the heap (permanent generation, tenured generation and young generation) can grow to the limit of the virtual space as needed.
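The committed-versus-reserved split can be inspected on a running JVM with the jstat tool that ships with the JDK (&lt;pid&gt; is a placeholder for the target process id):

```shell
# NGCMN/NGCMX and OGCMN/OGCMX show the minimum and maximum (reserved)
# capacities of the young and old generations; NGC and OGC show what
# is currently committed.
jstat -gccapacity <pid>
```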

Some of the parameters are ratios of one part of the heap to another.


Total Heap


Total available memory is the most important factor affecting garbage collection performance. Since collections occur when generations fill up, collection frequency is inversely proportional to the amount of memory available: the more memory you grant, the less often generations fill up, and the higher the throughput.

By default, the virtual machine grows or shrinks the heap at each collection to try to keep the proportion of free space to live objects at each collection within a specific range. This target range is set as a percentage by the parameters -XX:MinHeapFreeRatio=<minimum> and -XX:MaxHeapFreeRatio=<maximum>, and the total size is bounded below by -Xms<min> and above by -Xmx<max>. The default parameters for the 32-bit Solaris Operating System (SPARC Platform Edition) are shown in this table:

Parameter             Default Value
MinHeapFreeRatio      40
MaxHeapFreeRatio      70
-Xms                  3670k
-Xmx                  64m

Let's understand these parameters:

If MinHeapFreeRatio is 40%, then whenever the percentage of free space in a generation falls below 40%, the generation is expanded to maintain 40% free space, up to the maximum allowed size of the generation. Similarly, if MaxHeapFreeRatio is 70% and free space exceeds 70%, the generation is contracted until only 70% of the space is free, subject to the minimum size of the generation.
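For example, the free-space targets and the overall heap bounds could be set explicitly like this (the sizes are arbitrary illustrations):

```shell
# Keep 40-70% of each generation free, inside a heap that may grow
# from 256 MB up to 1 GB.
java -XX:MinHeapFreeRatio=40 \
     -XX:MaxHeapFreeRatio=70 \
     -Xms256m -Xmx1g \
     MyApp
```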

Generally the initial heap size for an application is very small. The rule of thumb is: unless you have problems with pauses, grant as much memory as possible to the virtual machine. The default maximum (64 MB) is often too small.

Setting -Xms and -Xmx to the same value increases predictability by removing the most important sizing decision from the virtual machine. However, the virtual machine is then unable to compensate if you make a poor choice.
In general, increase the memory as you increase the number of processors, since allocation can be parallelized.
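Fixing the heap size, as described above, is just a matter of passing the same value to both options (2 GB is an arbitrary example):

```shell
# -Xms == -Xmx: the JVM never grows or shrinks the heap, trading
# adaptability for predictability.
java -Xms2g -Xmx2g MyApp
```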


The Young Generation


The second most influential knob is the proportion of the heap dedicated to the young generation. A larger young generation means fewer minor collections. However, for a fixed heap size, a larger young generation leaves less space for the tenured generation, and hence major GCs occur more frequently.

By default, the young generation size is controlled by NewRatio. For example, setting -XX:NewRatio=3 means that the ratio between the young and tenured generation is 1:3. In other words, the combined size of the eden and survivor spaces will be one fourth of the total heap size.

The parameters NewSize and MaxNewSize bound the young generation size from below and above. Setting these to the same value fixes the young generation, just as setting -Xms and -Xmx to the same value fixes the total heap size. This is useful for tuning the young generation at a finer granularity than the integral multiples allowed by NewRatio.
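A hypothetical launch line that pins the young generation at 512 MB inside a fixed 2 GB heap (the sizes are assumptions):

```shell
# NewSize == MaxNewSize fixes the young generation, overriding the
# size that NewRatio would otherwise compute; the remaining ~1.5 GB
# is left for the tenured generation.
java -Xms2g -Xmx2g \
     -XX:NewSize=512m -XX:MaxNewSize=512m \
     MyApp
```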

Survivor Space Sizing

Survivor space sizing can be configured, but from a performance perspective it is usually not that important. For example, -XX:SurvivorRatio=n sets the ratio between eden and a survivor space to 1:n. In other words, each survivor space will be one nth the size of eden, and thus one (n+2)th the size of the young generation (not one (n+1)th, because there are two survivor spaces).

If survivor spaces are too small, copying collection overflows directly into the tenured generation. If survivor spaces are too large, they will be uselessly empty.
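As an illustration of the arithmetic above (the ratio is an arbitrary example):

```shell
# SurvivorRatio=6: each survivor space is 1/6 the size of eden, so
# eden + two survivors = 8/6 of eden, and each survivor space is
# 1/(6+2) = 1/8 of the young generation.
java -XX:SurvivorRatio=6 -Xmx1g MyApp
```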

Here are the default values for the 32-bit Solaris Operating System (SPARC Platform Edition); the default values on other platforms are different.

Parameter        Default Value (Client JVM)    Default Value (Server JVM)
NewRatio         8                             2
NewSize          2228K                         2228K
MaxNewSize       not limited                   not limited
SurvivorRatio    32                            32
The maximum size of the young generation will be calculated from the maximum size of the total heap and NewRatio. The "not limited" default value for MaxNewSize means that the calculated value is not limited by MaxNewSize unless a value for MaxNewSize is specified on the command line.



The steps for setting parameters for a server application are:

First decide the maximum heap size you can afford to give the virtual machine. Then tune the young generation size for the best results.
Note that the maximum heap size should always be smaller than the amount of memory installed on the machine, to avoid excessive page faults and thrashing.
If the total heap size is fixed, increasing the young generation size requires reducing the tenured generation size. Keep the tenured generation large enough to hold all the live data used by the application at any given time, plus some amount of slack space (10-20% or more).

Subject to the above constraint on the tenured generation:
Grant plenty of memory to the young generation.
Increase the young generation size as you increase the number of processors, since allocation can be parallelized.
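Putting those steps together, a hypothetical starting point for a server application on a multi-core machine with 8 GB of RAM might look like this (every size here is an assumption, to be refined by measurement):

```shell
# 4 GB fixed heap, well below physical RAM to avoid paging; a generous
# fixed young generation; the parallel collector for throughput.
java -server \
     -Xms4g -Xmx4g \
     -XX:NewSize=1g -XX:MaxNewSize=1g \
     -XX:+UseParallelGC -XX:+UseParallelOldGC \
     MyApp
```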


Optimization:


Optimization is an art; there are no magical data structures capable of solving every problem, and you have to fight for every byte. Memory optimization is a complex process. Design your data so that each object can be referenced from different collections instead of having to copy it. It is usually better to use semantically immutable objects, because they can be shared instead of copied. In my experience, in a well-designed application, optimization and tuning can reduce memory usage by 30-50%.
