Rants about Java and other internet technologies by Sam Pullara

Drastically reducing GC pause times for YQL

was published on November 3rd, 2009 and is listed in Technology http://www.javarants.com/ytpkw

Update 2: JRockit Real Time 3.1.2 using -XgcPrio:deterministic performed even better than this configuration in testing. It is not yet an approved VM at Yahoo!, but we will continue to test with it.

Update: One issue we still have is that after many hours of deployment with this configuration, the heap fragments and we start to get concurrent mode failures. So far we have only seen this during periods of peak activity.

We were struggling with some long pause times due to GC for YQL that we couldn’t stomach for our internal property SLAs. The secret turned out to be a collection of parameters for the Java 6 garbage collector:

-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+CMSIncrementalPacing
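With -XX:+PrintGCDetails the collections show up in the GC log, but the same collectors can also be inspected from inside the process. Here is a minimal sketch, assuming the standard java.lang.management API; the class name GcStats and the summarize helper are made up for illustration and are not part of the YQL setup:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class GcStats {
    // Builds one line per collector with its cumulative count and elapsed time.
    static String summarize(List<GarbageCollectorMXBean> beans) {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : beans) {
            sb.append(gc.getName())
              .append(": collections=").append(gc.getCollectionCount())
              .append(", timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Run with the flags above to see which collectors the JVM registered.
        System.out.print(summarize(ManagementFactory.getGarbageCollectorMXBeans()));
    }
}
```

These numbers are cumulative totals per collector, not individual pauses, so the GC log is still the place to look for worst-case pause lengths.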

I have so far been blown away by the results of this combination of parameters. Not only does it work very efficiently under moderate load, but under crushing load it rises to the occasion and aggressively keeps the heap down in such a way as to never cause the:

Concurrent Mode Failure
The concurrent collector uses one or more garbage collector threads that run simultaneously with the
application threads with the goal of completing the collection of the tenured and permanent generations
before either becomes full. As described above, in normal operation, the concurrent collector does most
of its tracing and sweeping work with the application threads still running, so only brief pauses are seen
by the application threads. However, if the concurrent collector is unable to finish reclaiming the
unreachable objects before the tenured generation fills up, or if an allocation cannot be satisfied with
the available free space blocks in the tenured generation, then the application is paused and the
collection is completed with all the application threads stopped. The inability to complete a collection
concurrently is referred to as concurrent mode failure and indicates the need to adjust the concurrent
collector parameters.
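One way to check whether a run ever hit this condition is to scan the GC log for it. The sketch below assumes that the log produced by -XX:+PrintGCDetails marks these events with the text "concurrent mode failure" and that the log was written to a file (for example via -Xloggc); the class name CmsFailureScan is hypothetical:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

public class CmsFailureScan {
    // Counts log lines that mention a concurrent mode failure.
    // Assumption: the marker text matches what the JVM in use prints.
    static int countFailures(Reader log) throws IOException {
        BufferedReader in = new BufferedReader(log);
        int count = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.contains("concurrent mode failure")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the GC log file.
        Reader log = new FileReader(args[0]);
        try {
            System.out.println(countFailures(log));
        } finally {
            log.close();
        }
    }
}
```

A count of zero over a long, heavily loaded run is the outcome the parameters above are meant to produce.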

If you have any other GC secrets for the JVM, leave them here. Just as a heads up, I compared it with all the other GCs available for Java 6, including the experimental G1, and none of them were as effective.


  • Tight. I love the black magic that you can get with some of the JVM parameters. It's a whole new world over them thar hills.
  • Which specific version of 1.6 are you using? We have seen some GC-related crashes with 1.6_06, 14, and 15 while using the CMS collector, so we switched back to using the throughput collector. Did you happen to run into this?
  • We are currently on java version 1.6.0_16. We haven't yet seen a crash with this configuration.
  • Uhm, our work with the JVM shows the concurrent collector to be highly unstable, and several folks in high places who shall remain nameless say never to use it. Also, OS / platform is important. On Solaris / CMT, run LargePageSizes and about as many ParallelGC threads as you have CMT threads (32, 64, or 128).

    -XX:SurvivorRatio=8 – large survivor spaces for short-lived objects
    -XX:+UseParallelGC
    -XX:ParallelGCThreads=20
    -XX:+UseParallelOldGC
    -Xmn1g
    -server switch

    Last, if you insist on CMS, we have used ParNewGC + CMS with -XX:CMSInitiatingOccupancyFraction to get the GC to kick off aggressively and keep down the full pauses.

    G1 is going to have to dig us all out of the hole that CMS is burying us in, but Parallel w/ LOTS of threads has proved way more stable for us w/o the stop-the-worlds from CMS that can be like 3 hours...
  • Could it be the type of workloads you are throwing at it? We've had 30 VMs under load in production for the last week with great, low pause times without a hint of instability. I'll take a crack at it with your options and see what we see. G1 so far has been useless with full pauses being long and frequent under moderate testing. Ran a 24 hour test last night of JRockit and their deterministic GC with excellent results for our workloads. We also don't use Solaris. It is all RHEL.