- Benchmarking G1 and other Java 7 Garbage Collectors
- Controlling GC pauses with the GarbageFirst Collector
- G1 Garbage Collector is mature in Java 9, finally
The G1 garbage collector is the default collector in Java 9, so it is time to reevaluate its performance. In 2013 I criticized it in a previous blog article that compared G1’s performance in late Java 7 and early Java 8 to the traditional collectors. The improvements achieved in the meantime are indeed very impressive, as I will show in this article.
First, let’s have an updated look at the situation with Java 8’s GC. The overall picture did not change very much with Java 8u144 when measured with the same benchmark program on the same hardware, and with the same JVM settings.
The following plot shows an update of the GC throughput as a function of the new generation size when measured using the following JVM configuration:
java -Xms6g -Xmx6g -XX:+Use<ConcMarkSweep/Parallel/G1>GC -XX:NewSize=x -XX:MaxNewSize=x de.am.gc.benchmarks.MixedRandomList 100 8 12500000
The dashed horizontal lines show the throughput achieved with an unspecified NewSize:
java -Xms6g -Xmx6g -XX:+Use<ConcMarkSweep/Parallel/G1>GC de.am.gc.benchmarks.MixedRandomList 100 8 12500000
These benchmark results still show two conspicuous features:
- The G1 collector (cyan lines) clearly trails the traditional collectors with respect to achievable throughput. Since Java 7u79 (grey lines), however, there has been considerable improvement: the throughput achieved by G1 has risen from 5000 to 6000 MB/s.
- The same serious bug is still present in the default Parallel GC collector (red line). It depresses throughput to a very low level beyond NewSize=1800m and hence also at the default new generation size of 2048 MB (= 1/3 × 6 GB, which follows from the default NewRatio of 2). This bug was not present in Java 6, but it is in Java 7, Java 8-ea and, as it turns out, still in Java 8u144.
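The 2048 MB figure can be reproduced with a little arithmetic. NewRatio is the ratio of old generation to new generation size, so the new generation gets heap / (NewRatio + 1). The following is a minimal sketch of that calculation, assuming the 6 GB heap of this benchmark and the default NewRatio of 2; the class and method names are made up for illustration and are not part of any JVM API.

```java
// Illustrative helper (hypothetical names): derives the new generation
// size implied by a given heap size and NewRatio.
final class YoungGenSizing {

    // NewRatio = old / young, hence young = heap / (NewRatio + 1).
    static long youngGenMb(long heapMb, int newRatio) {
        return heapMb / (newRatio + 1);
    }

    public static void main(String[] args) {
        long heapMb = 6L * 1024;   // -Xms6g -Xmx6g as in the benchmark
        int newRatio = 2;          // HotSpot default for NewRatio
        System.out.println(youngGenMb(heapMb, newRatio) + " MB");
    }
}
```

With adaptive sizing the JVM may still resize the generations at run time, so the result is best read as the nominal starting point rather than a fixed value.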
More improvements in Java 9
The following graph shows the corresponding results for Java 9.0.1 (again same benchmark, same hardware, same parameters):
The graph shows two interesting improvements:
First, the G1 benchmark performance has largely caught up and is now at 6800 MB/s, within reach of the two others. Note that the default performance of the G1 collector (dashed blue line) is better than anything achievable with a fixed value of NewSize (solid blue line). It has never been recommended to set NewSize with G1, because this impedes its inner workings and its adaptability to variable loads. But now, with any fixed value of NewSize, it is no longer possible to achieve the default throughput, even when the load is constant as in this benchmark. This was not the case with Java 8u144 (see above), and indicates that interesting optimizations have recently been added to G1.
In addition to lower throughput, setting NewSize to an inconvenient value might force G1 to fall back to Full GC with pauses which in fact are longer than with ParallelGC. Thus G1’s low-pause advantage would be lost entirely.
The second improvement is that the Parallel GC bug has been fixed in Java 9 (but this fix apparently was not backported to Java 8). With default settings (no NewSize configured) the Parallel GC collector now achieves the best performance, because the performance drop now occurs at much higher values of NewSize (when the shrinking old generation heap size approaches the size of the live heap). The optimal (but not the default) performance for both ParallelGC and CMS is now almost the same at about 7800 MB/s.
G1 has raced to catch up
The following graph shows the step-by-step improvements of the default performance (as measured with my synthetic MixedRandomList benchmark) of the G1 collector and compares it to the two classic collectors in Java 9:
As shown in the previous graph, both CMS and ParallelGC can be tuned, by setting NewSize to a convenient value, to about 114% of G1’s performance in Java 9. But G1 itself doesn’t need any tuning in order to reach its 100% performance: you get it safely out of the box. The slight disadvantage in throughput is no longer a serious drawback for G1 in real-world applications, as I will show further down.
The low-pause promise fulfilled
In Java 9 the G1 collector has reached rather competitive throughput, although this has never been its development focus. It was designed for short pauses and improved scalability on multi-CPU systems. So let’s look at the extent to which the low-pause goal has been reached.
The following graph shows heap usage (blue line) and GC pause duration (grey columns) for the ParallelGC collector when run with the default value of NewSize set to 2048 MB:
Note that an early Full GC pause takes almost 9 seconds and a later one 4.5 seconds before Full GC pauses eventually become as short as about 1 second. These pauses, however, are peculiar to this synthetic benchmark, which by construction exhibits a very simple reference graph: all live objects are contained in a single List. Real-world applications in a production environment with a 6 GB heap usually show Full GC pauses in the range of a few to many seconds all the time.
The G1 collector in contrast reaches very short pauses:
The longest pause is about 0.5 s, most are below 0.15 s. Note that G1 ran with the default value of the pause time target, which is 200 milliseconds. From this viewpoint, G1 has impressively fulfilled its low-pause promise.
G1 performance in a real application
In many respects, real-world applications place a heavier burden on the garbage collection than a synthetic benchmark: reference graphs are diverse, chaotic and deep; load varies as a function of time; there are many additional complications like Weak- and SoftReferences, unloading of classes and class loaders. In this section, I will discuss the performance of the three garbage collectors in a load test on a real application.
The load test is made up of three use cases which are executed with a load profile that increases the load in steps. As each use case comprises a flow of many HTTP requests this translates to a rather high overall request rate:
As can be seen from this graph, above the first load plateau at 15000 requests/min the application can no longer follow the load profile and reaches an overload situation with any of the available garbage collectors. At this point, it is unclear why throughput reaches this rather hard limit below 18000 requests/min = 300 requests/s on a 20-CPU system. I will discuss this further down.
Unlike the synthetic benchmark discussed above, the differences between garbage collectors are marginal in this load test with Java 8u144.
The following graph visualizes GC pauses in this load test with the ParallelGC collector:
Full GC pauses are in the range of up to 2 seconds with outliers beyond 3 seconds. The red colored key GC performance indicators point to a major problem: object creation rate / GC rate is about 2800 MB/s on average. On load level 1 (2000-4000 seconds) it is already 2700 MB/s. Is it possible that GC itself creates the bottleneck that limits throughput slightly above that level? The JVM is stopped for GC pauses for about 14.3% of the time, which is quite a lot.
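The pause-time percentage quoted here is simply accumulated GC time divided by elapsed time. The following is a rough sketch of how a similar figure could be obtained from inside a running JVM with the standard java.lang.management beans. It is not the tooling used for the graphs in this article, and the class and method names are made up for illustration. Note also that for collectors with concurrent phases the collection time reported by these beans may not be identical to pure stop-the-world time, so GC logs remain the more reliable source.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Illustrative helper (hypothetical names): estimates GC overhead
// as the share of JVM uptime spent in garbage collection.
final class GcOverhead {

    // Pure arithmetic: GC time as a percentage of elapsed time.
    static double overheadPercent(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillis / uptimeMillis;
    }

    // Sums the accumulated collection time over all collectors of this JVM.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if not supported
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC overhead: %.1f%%%n", overheadPercent(totalGcMillis(), uptime));
    }
}
```

As an example of the arithmetic, 143 ms of GC time within one second of elapsed time corresponds to the 14.3% seen with ParallelGC in this load test.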
Let’s now have a look at how G1 copes with that (over)load:
G1 delivers pauses of up to 0.5 seconds, an outlier reaches almost 0.9 seconds. And it is able to cut the GC overhead by half to about 7% pause time. The GC throughput is even slightly higher than with ParallelGC which reflects the slightly higher request throughput (see above).
In every respect, G1 outmatches ParallelGC for this application already with Java 8u144.
The CMS collector can compete with G1 with respect to pause duration, but outliers are more frequent and longer. GC overhead (10%) is between G1’s (7%) and ParallelGC’s (14%):
How is it possible that G1 outmatches ParallelGC and CMS in an overall perspective although in the benchmark shown above it still is considerably behind these two with Java 8u144?
The following plot may answer this question:
The first observation is that with any garbage collector the JVM is not able to make full use of available CPU power. With ParallelGC it uses no more than 69% when it hits the ceiling, with CMS 78% and with G1 84%. G1 can achieve ParallelGC’s throughput because it is able to use spare CPU power which the JVM cannot apply to useful work anyway. Unlike rather simple benchmarks, real-world applications with their many latencies, locks and similar obstacles frequently leave spare CPU power which G1 can use for GC without any loss in throughput.
Another conspicuous feature in the CPU plot is the overshoot of CMS and G1 when the JVM reaches the plateau of the first load level. This shows their design kinship in the sense that both work with concurrent GC threads whose CPU usage does not directly follow the load. G1, however, seems to work in a more cushioned manner than CMS.
The production challenge
The following plot of GC pauses shows the same application in production (where many more use cases are active) over an uptime of 6 weeks:
On average, the GC throughput (273.2 MB/s) is much lower than in the load test. Even in peak hours it reaches only around 1000 MB/s. To clean this up, the ParallelGC collector executes around 10 Full GCs every day which usually take 3.5 to 5 seconds. But sometimes there are outliers which stop the JVM for up to 11.5 seconds in this case. This means that normal Full GC pauses in production take twice as long as those seen in our load test on the same application, while outliers take more than three times as long.
The primary goal of our switch to G1 was to get rid of these long Full GC pauses while avoiding the configuration subtleties and instabilities of the CMS collector. This is the very purpose G1 was designed for. The first six weeks of G1 use in production were very promising:
Note that for seasonal reasons during this period GC load on average was about 1/4 lower than during the same period with ParallelGC above. Nevertheless, G1 has shown that it can deliver much shorter GC pauses without falling back into (potentially extremely long) Full GCs. As with ParallelGC, however, GC pauses are about twice as long in production as in our load tests and sometimes reach 1-1.5 seconds. This is of course much more than the (default) pause target of 200 milliseconds. Note also that after three weeks there was a significant change in GC pause distribution which we have not yet understood.
On another server instance of the same production portal we observe much shorter G1 pauses of no more than 0.4-0.7 seconds, which comes much closer to the 200 milliseconds pause time target:
This one indeed looks like an almost ideal world, except for that same sudden change after 21 days.
The difference between the two Tomcat instances is the number of virtual CPUs: the second one runs in a VM with only 10 instead of 20 virtual CPUs. This looks significant because the underlying hardware has 12 CPU cores with 2 hyperthreads each, and the JVM has so far been running with -XX:ParallelGCThreads=20. On a 10-CPU VM the OS can schedule all virtual CPUs on different cores, which means that two parallel GC threads never run on the same core at the same time and block each other during memory access. First tests suggest that reducing the number of parallel GC threads to 12 (= the number of cores) helps to achieve a similar result on a 20-CPU VM. We will bring that change to production in the next deployment.
Altogether, the production switch to G1 already looks very much like a success with Java 8u144. And the benchmark results from above promise more improvements to come with the update to Java 9.
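The reasoning in this paragraph boils down to a simple rule of thumb: do not run more parallel GC threads than there are physical cores behind the virtual CPUs. The following sketch expresses that rule. It is only an illustration of the heuristic described here, not a general recommendation; the class and method names are hypothetical, and the number of physical cores has to be supplied by hand because the JVM only sees the virtual CPUs of its VM.

```java
// Illustrative helper (hypothetical names): caps the number of parallel
// GC threads at the number of physical cores of the underlying host.
final class GcThreadCap {

    static int suggestedParallelGcThreads(int virtualCpus, int physicalCores) {
        return Math.min(virtualCpus, physicalCores);
    }

    // Renders the result as the corresponding HotSpot command line flag.
    static String asJvmFlag(int virtualCpus, int physicalCores) {
        return "-XX:ParallelGCThreads=" + suggestedParallelGcThreads(virtualCpus, physicalCores);
    }

    public static void main(String[] args) {
        int virtualCpus = Runtime.getRuntime().availableProcessors();
        int physicalCores = 12; // assumption: must be known from the host, as for the hardware above
        System.out.println(asJvmFlag(virtualCpus, physicalCores));
    }
}
```

For the 20-CPU VM on 12-core hardware this yields -XX:ParallelGCThreads=12, while a 10-CPU VM keeps all 10 threads.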
More than 10 years ago, the creator of the G1 collector, Dr. Antonios Printezis, came to Munich and gave a talk to our common customer about garbage collection in general and about his ideas for a new collector which he planned to name “Garbage First”. Now it looks like his ideas have come to fruition: for some if not many applications, G1 is the best available garbage collector, and it promises liberation from most of the subtle issues which sometimes made GC tuning and troubleshooting something of an art.
Four years ago, G1 was not there yet. The considerable work spent on it since then, however, has proved wrong the pessimism that many held at that time. Congratulations to Antonios Printezis and all those who made his ideas finally work in practice!
Update March 2018
After the change from 20 to 12 parallel GC threads, which we hoped would further reduce GC pauses, server 1 has now been running for more than 5 weeks (on 20 virtual CPUs on 12-core hardware). This graph again shows GC pauses (grey columns) and, in addition, heap usage (blue line):
The graph shows 2 things:
1. The jump in heap usage after 16 days is the result of a server reconfiguration, a feature built into our portal which allows reloading of configuration and JAR files by a custom class loader. This reconfiguration lifted permanent heap usage above the 45% threshold which by default triggers concurrent G1 activity. While before the reconfiguration this threshold was only reached during daytime activity, concurrent G1 threads are now running permanently, which has removed the clear day-night difference in GC pauses.
2. While in figure 12 quite a few GC pauses exceeded 1 s in duration, this has become very rare since the change to 12 parallel GC threads. Almost all pauses now stay below 1 s, and not a single one exceeds 1.5 s anymore. Before the change they sometimes exceeded 2 s. Therefore, reducing the ParallelGCThreads parameter to 12 was successful to some extent, but server 1 still has slightly longer pauses than server 2 (compare figure 13).
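The 45% threshold mentioned in the first point is the default of the HotSpot option -XX:InitiatingHeapOccupancyPercent. The following sketch shows the comparison in its simplest form. It is a coarse approximation only: G1 bases its own decision on internal occupancy figures, whereas this example just compares overall used heap with committed heap as reported by the MemoryMXBean. The class and method names are made up for illustration, and the heap sizes in the note after the code are example values, not measurements from our servers.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Illustrative helper (hypothetical names): checks whether heap occupancy
// is above the percentage that starts concurrent G1 marking.
final class IhopCheck {

    // Default of -XX:InitiatingHeapOccupancyPercent.
    static final int DEFAULT_IHOP_PERCENT = 45;

    static boolean aboveThreshold(long usedBytes, long capacityBytes, int ihopPercent) {
        // Integer arithmetic avoids rounding issues: used/capacity > percent/100.
        return usedBytes * 100 > capacityBytes * (long) ihopPercent;
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        boolean above = aboveThreshold(heap.getUsed(), heap.getCommitted(), DEFAULT_IHOP_PERCENT);
        System.out.println("Heap usage above " + DEFAULT_IHOP_PERCENT + "%: " + above);
    }
}
```

With a 6 GB heap, for example, the threshold lies at about 2765 MB, so 2700 MB of permanently used heap would stay below it and 2800 MB would keep the concurrent threads busy all the time.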