Drastically reducing GC pause times for YQL

Update 2: JRockit Real Time 3.1.2 using -XgcPrio:deterministic performed even better than this configuration in testing. It is not yet an approved VM at Yahoo!, but we will continue to test with it.

Update: One issue we still have is that after many hours of deployment with this configuration, the heap fragments and we start to get concurrent mode failures. We saw this only during periods of peak activity, though.

We were struggling with some long pause times due to GC for YQL that we couldn’t stomach for our internal property SLAs. The secret turned out to be a collection of parameters for the Java 6 garbage collector:

-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+CMSIncrementalPacing
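One way to confirm that the flags actually took effect is to ask the running JVM which collectors it registered. The following is a minimal sketch using the standard java.lang.management API; the class name GcInfo is made up for this example. On HotSpot with CMS enabled I would expect names along the lines of "ParNew" and "ConcurrentMarkSweep", but treat that as an assumption and check what your own VM reports.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class (not from the original post): lists the
// garbage collectors the current JVM has registered.
class GcInfo {

    // Returns the name of every collector exposed through the platform MXBeans.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Collection count and accumulated time (ms) since JVM start.
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
    }
}
```

Run it with the same flags as the service to see which collectors are in use and how much time they have accumulated.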

I have so far been blown away by the results of this combination of parameters. Not only does it work very efficiently under moderate load, under crushing load it rises to the occasion and aggressively keeps the heap down, so that we never hit the:

Concurrent Mode Failure
The concurrent collector uses one or more garbage collector threads that run simultaneously with the
application threads with the goal of completing the collection of the tenured and permanent generations
before either becomes full. As described above, in normal operation, the concurrent collector does most
of its tracing and sweeping work with the application threads still running, so only brief pauses are seen
by the application threads. However, if the concurrent collector is unable to finish reclaiming the
unreachable objects before the tenured generation fills up, or if an allocation cannot be satisfied with
the available free space blocks in the tenured generation, then the application is paused and the
collection is completed with all the application threads stopped. The inability to complete a collection
concurrently is referred to as concurrent mode failure and indicates the need to adjust the concurrent
collector parameters.
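Since -XX:+PrintGCDetails is already on, a quick way to check for this condition is to scan the GC output for the phrase itself. Here is a rough sketch, assuming the GC output has been captured to a file (for example via -Xloggc:<file>, which the configuration above does not set) and that the JVM writes the literal text "concurrent mode failure" into the log. The class and method names are made up for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper (not from the original post): counts GC log lines
// that mention a concurrent mode failure.
class CmsLogScan {

    // Assumed marker text; adjust it if your JVM words the message differently.
    static final String MARKER = "concurrent mode failure";

    // Counts the lines that contain the marker text.
    static int countFailures(Iterable<String> lines) {
        int count = 0;
        for (String line : lines) {
            if (line.contains(MARKER)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CmsLogScan <gc-log-file>");
            return;
        }
        // Reads the whole file into memory, which is fine for a sketch
        // but not ideal for very large logs.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        System.out.println(countFailures(lines) + " line(s) mention a concurrent mode failure");
    }
}
```

A count of zero after a long run under load is the result we are after.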

If you have any other GC secrets for the JVM, leave them here. Just as a heads up, I compared it with all the other GCs available for Java 6, including the experimental G1, and none of them were as effective.

  • I found some GC secrets that I want to share with you. Your article was the key to solving the out-of-memory errors on my Tomcat server. You can see my JVM settings on my blog: http://wp.me/pq115-1

    Thanks for sharing,

    Alberto
  • chirag
    You can probably remove -XX:+PrintGCDetails -XX:+PrintGCTimeStamps for prod environments
  • I often like to keep those in for diagnosis when Ops shows up and asks why something was unresponsive.
  • Uhm, our work with the JVM shows Concurrent collector highly unstable and several folks in high places who shall remain nameless say never to use it. Also, OS / platform is important. On Solaris / CMT, run LargePageSizes and about as many ParallelGC threads as you have CMT threads (32, 64, or 128).

    -XX:SurvivorRatio=8 – large survivor spaces for short-lived objects
    -XX:+UseParallelGC
    -XX:ParallelGCThreads=20
    -XX:+UseParallelOldGC
    -Xmn1g
    -server switch

    Last, if you insist on CMS, then we have used ParNewGC + CMS and -XX:CMSInitiatingOccupancyFraction to get the GC to kick off aggressively and keep down the full pauses.

    G1 is going to have to dig us all out of the hole that CMS is burying us in, but Parallel w/ LOTS of threads has proved way more stable for us w/o the stop-the-worlds from CMS that can be like 3 hours...
  • Could it be the type of workloads you are throwing at it? We've had 30 VMs under load in production for the last week with great, low pause times without a hint of instability. I'll take a crack at it with your options and see what we see. G1 so far has been useless with full pauses being long and frequent under moderate testing. Ran a 24 hour test last night of JRockit and their deterministic GC with excellent results for our workloads. We also don't use Solaris. It is all RHEL.
  • Which specific version of 1.6 are you using? We have seen some GC-related crashes with 1.6.0_06, _14, and _15 while using the CMS collector, so we switched back to using the throughput collector. Did you happen to run into this?
  • We are currently on java version 1.6.0_16. We haven't yet seen a crash with this configuration.
  • Tight. I love the black magic that you can get with some of the JVM parameters. It's a whole new world over them thar hills.