
3.20.2012

Java Thread deadlock – Case Study

This article will describe the complete root cause analysis of a recent Java deadlock problem observed on a Weblogic 11g production system running on the IBM JVM 1.6.

This case study will also demonstrate the importance of mastering Thread Dump analysis skills, including for the IBM JVM Thread Dump format.

Environment specifications

-        Java EE server: Oracle Weblogic Server 11g & Spring 2.0.5
-        OS: AIX 5.3
-        Java VM: IBM JRE 1.6.0
-        Platform type: Portal & ordering application

Monitoring and troubleshooting tools

-        JVM Thread Dump (IBM JVM format)
-        Compuware Server Vantage (Weblogic JMX monitoring & alerting)

Problem overview

A major stuck Threads problem was observed & reported by Compuware Server Vantage, affecting 2 of our Weblogic 11g production managed servers and causing application impact and timeout conditions for our end users.

Gathering and validation of facts

As usual, a Java EE problem investigation requires the gathering of technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·        What is the client impact? MEDIUM (only 2 managed servers / JVM affected out of 16)
·        Recent change of the affected platform? Yes (new JMS related asynchronous component)
·        Any recent traffic increase to the affected platform? No
·        How does this problem manifest itself?  A sudden increase of Threads was observed leading to rapid Thread depletion
·        Did a Weblogic managed server restart resolve the problem? Yes, but the problem returns after a few hours (unpredictable & intermittent pattern)

-        Conclusion #1: The problem is related to an intermittent stuck Threads behaviour affecting only a few Weblogic managed servers at a time
-        Conclusion #2: Since the problem is intermittent, a global root cause such as a non-responsive downstream system is not likely

Thread Dump analysis – first pass

The first thing to do when dealing with stuck Thread problems is to generate a JVM Thread Dump. This is a golden rule regardless of your environment specifications & problem context. A JVM Thread Dump snapshot provides you with crucial information about the active Threads and what type of processing / tasks they are performing at that time.
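
For reference, on the IBM JVM a Thread Dump (javacore file) is typically requested by sending the QUIT signal to the Java process (e.g. kill -3 <PID> on AIX). If you need to trigger one from application code instead, the IBM JDK also exposes a small utility class for that purpose. Find below a minimal sketch, assuming the IBM JDK's com.ibm.jvm.Dump class is available at runtime:

public class JavaCoreTrigger {

    /**
     * Requests an IBM JVM Thread Dump (javacore) for the current JVM.
     * The javacore file is written to the JVM output directory (exact
     * location and naming depend on your IBM JVM configuration).
     */
    public static void requestJavaCore() {
        com.ibm.jvm.Dump.JavaDump();
    }
}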

Now back to our case study: an IBM JVM Thread Dump (javacore.xyz format) was generated, which revealed the following Java Thread deadlock condition:

1LKDEADLOCK    Deadlock detected !!!
NULL           ---------------------
NULL          
2LKDEADLOCKTHR  Thread "[STUCK] ExecuteThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012CC08B00)
3LKDEADLOCKWTR    is waiting for:
4LKDEADLOCKMON      sys_mon_t:0x0000000126171DF8 infl_mon_t: 0x0000000126171E38:
4LKDEADLOCKOBJ      weblogic/jms/frontend/FESession@0x07000000198048C0/0x07000000198048D8:
3LKDEADLOCKOWN    which is owned by:
2LKDEADLOCKTHR  Thread "[STUCK] ExecuteThread: '10' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012E560500)
3LKDEADLOCKWTR    which is waiting for:
4LKDEADLOCKMON      sys_mon_t:0x000000012884CD60 infl_mon_t: 0x000000012884CDA0:
4LKDEADLOCKOBJ      weblogic/jms/frontend/FEConnection@0x0700000019822F08/0x0700000019822F20:
3LKDEADLOCKOWN    which is owned by:
2LKDEADLOCKTHR  Thread "[STUCK] ExecuteThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012CC08B00)

This deadlock situation can be translated as per below:

-        Weblogic Thread #8 is waiting to acquire an Object monitor lock owned by Weblogic Thread #10
-        Weblogic Thread #10 is waiting to acquire an Object monitor lock owned by Weblogic Thread #8

Conclusion: both Weblogic Threads #8 & #10 are waiting on each other; forever!

Now before going any deeper into this root cause analysis, let me provide you with a high-level overview of Java Thread deadlocks.

Java Thread deadlock overview

Most of you are probably familiar with Java Thread deadlock principles, but have you ever experienced a true deadlock problem?

From my experience, true Java deadlocks are rare and I have only seen ~5 occurrences over the last 10 years. The reason is that most stuck Thread related problems are due to Thread hanging conditions (waiting on a remote IO call etc.) rather than a true deadlock condition with other Thread(s).

A Java Thread deadlock is a situation where, for example, Thread A is waiting to acquire an Object monitor lock held by Thread B, which is itself waiting to acquire an Object monitor lock held by Thread A. Both of these Threads will wait for each other forever. This situation can be visualized as per the simple example below:
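
Find below a minimal, self-contained Java program (purely illustrative; class, lock and Thread names are hypothetical and not taken from the affected application) that reproduces this classic deadlock by acquiring the same two Object monitors in opposite order:

public class DeadlockDemo {

    private static final Object LOCK_A = new Object();
    private static final Object LOCK_B = new Object();

    public static void main(String[] args) {

        // Thread A acquires LOCK_A first, then attempts to acquire LOCK_B
        Thread threadA = new Thread(new Runnable() {
            public void run() {
                synchronized (LOCK_A) {
                    pause(100); // give Thread B time to acquire LOCK_B
                    synchronized (LOCK_B) {
                        System.out.println("Thread A acquired both locks");
                    }
                }
            }
        }, "Thread-A");

        // Thread B acquires LOCK_B first, then attempts to acquire LOCK_A (reverse order)
        Thread threadB = new Thread(new Runnable() {
            public void run() {
                synchronized (LOCK_B) {
                    pause(100); // give Thread A time to acquire LOCK_A
                    synchronized (LOCK_A) {
                        System.out.println("Thread B acquired both locks");
                    }
                }
            }
        }, "Thread-B");

        threadA.start();
        threadB.start();
        // Both Threads are now waiting on each other forever; a JVM Thread Dump
        // taken at this point will flag the deadlock, just like the javacore above
    }

    private static void pause(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}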


Thread deadlock is confirmed…now what can you do?

Once the deadlock is confirmed (most JVM Thread Dump implementations will highlight it for you), the next step is to perform a deeper dive analysis by reviewing each Thread involved in the deadlock situation along with their current task & wait condition.

Find below the partial Thread Stack Trace from our problem case for each Thread involved in the deadlock condition:

** Please note that the real application Java package name was renamed for confidentiality purposes **

Weblogic Thread #8

"[STUCK] ExecuteThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)'" J9VMThread:0x000000012CC08B00, j9thread_t:0x00000001299E5100, java/lang/Thread:0x070000001D72EE00, state:B, prio=1
(native thread ID:0x111200F, native priority:0x1, native policy:UNKNOWN)
Java callstack:
       at weblogic/jms/frontend/FEConnection.stop(FEConnection.java:671(Compiled Code))
       at weblogic/jms/frontend/FEConnection.invoke(FEConnection.java:1685(Compiled Code))
       at weblogic/messaging/dispatcher/Request.wrappedFiniteStateMachine(Request.java:961(Compiled Code))
       at weblogic/messaging/dispatcher/DispatcherImpl.syncRequest(DispatcherImpl.java:184(Compiled Code))
       at weblogic/messaging/dispatcher/DispatcherImpl.dispatchSync(DispatcherImpl.java:212(Compiled Code))
       at weblogic/jms/dispatcher/DispatcherAdapter.dispatchSync(DispatcherAdapter.java:43(Compiled Code))
       at weblogic/jms/client/JMSConnection.stop(JMSConnection.java:863(Compiled Code))
       at weblogic/jms/client/WLConnectionImpl.stop(WLConnectionImpl.java:843)
       at org/springframework/jms/connection/SingleConnectionFactory.closeConnection(SingleConnectionFactory.java:342)
       at org/springframework/jms/connection/SingleConnectionFactory.resetConnection(SingleConnectionFactory.java:296)
       at org/app/JMSReceiver.receive()
……………………………………………………………………

Weblogic Thread #10

"[STUCK] ExecuteThread: '10' for queue: 'weblogic.kernel.Default (self-tuning)'" J9VMThread:0x000000012E560500, j9thread_t:0x000000012E35BCE0, java/lang/Thread:0x070000001ECA9200, state:B, prio=1
 (native thread ID:0x4FA027, native priority:0x1, native policy:UNKNOWN)
Java callstack:
       at weblogic/jms/frontend/FEConnection.getPeerVersion(FEConnection.java:1381(Compiled Code))
       at weblogic/jms/frontend/FESession.setUpBackEndSession(FESession.java:755(Compiled Code))
       at weblogic/jms/frontend/FESession.consumerCreate(FESession.java:1025(Compiled Code))
       at weblogic/jms/frontend/FESession.invoke(FESession.java:2995(Compiled Code))
       at weblogic/messaging/dispatcher/Request.wrappedFiniteStateMachine(Request.java:961(Compiled Code))
       at weblogic/messaging/dispatcher/DispatcherImpl.syncRequest(DispatcherImpl.java:184(Compiled Code))
       at weblogic/messaging/dispatcher/DispatcherImpl.dispatchSync(DispatcherImpl.java:212(Compiled Code))
       at weblogic/jms/dispatcher/DispatcherAdapter.dispatchSync(DispatcherAdapter.java:43(Compiled Code))
       at weblogic/jms/client/JMSSession.consumerCreate(JMSSession.java:2982(Compiled Code))
       at weblogic/jms/client/JMSSession.setupConsumer(JMSSession.java:2749(Compiled Code))
       at weblogic/jms/client/JMSSession.createConsumer(JMSSession.java:2691(Compiled Code))
       at weblogic/jms/client/JMSSession.createReceiver(JMSSession.java:2596(Compiled Code))
       at weblogic/jms/client/WLSessionImpl.createReceiver(WLSessionImpl.java:991(Compiled Code))
       at org/springframework/jms/core/JmsTemplate102.createConsumer(JmsTemplate102.java:204(Compiled Code))
       at org/springframework/jms/core/JmsTemplate.doReceive(JmsTemplate.java:676(Compiled Code))
       at org/springframework/jms/core/JmsTemplate$10.doInJms(JmsTemplate.java:652(Compiled Code))
       at org/springframework/jms/core/JmsTemplate.execute(JmsTemplate.java:412(Compiled Code))
       at org/springframework/jms/core/JmsTemplate.receiveSelected(JmsTemplate.java:650(Compiled Code))
       at org/springframework/jms/core/JmsTemplate.receiveSelected(JmsTemplate.java:641(Compiled Code))
       at org/app/JMSReceiver.receive()
……………………………………………………………

As you can see in the above Thread Stack Traces, the deadlock originated from our application code, which uses the Spring framework API for the JMS consumer implementation (very useful when not using MDB’s). The Stack Traces are quite revealing: both Threads are in a race condition against the same Weblogic JMS consumer session / connection, leading to a deadlock situation (a simplified sketch of this race pattern follows the list below):

-        Weblogic Thread #8 is attempting to reset and close the current JMS connection
-        Weblogic Thread #10 is attempting to use the same JMS Connection / Session in order to create a new JMS consumer
-        Thread deadlock is triggered!
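
Find below a simplified, hypothetical sketch of this race pattern built on the public Spring JmsTemplate and SingleConnectionFactory APIs; the wiring, destination name and message selector are purely illustrative and do not come from the affected application:

import javax.jms.ConnectionFactory;
import javax.jms.Message;

import org.springframework.jms.connection.SingleConnectionFactory;
import org.springframework.jms.core.JmsTemplate;

public class JmsReceiverSketch {

    // Single shared JMS Connection / Session wrapper around the Weblogic JMS ConnectionFactory
    private final SingleConnectionFactory singleConnectionFactory;
    private final JmsTemplate jmsTemplate;

    public JmsReceiverSketch(ConnectionFactory weblogicConnectionFactory) {
        this.singleConnectionFactory = new SingleConnectionFactory(weblogicConnectionFactory);
        this.jmsTemplate = new JmsTemplate(this.singleConnectionFactory);
    }

    // Equivalent of Weblogic Thread #10: creates a new JMS consumer on the shared connection
    public Message receive() {
        return jmsTemplate.receiveSelected("APP.ORDER.QUEUE", "JMSPriority > 4");
    }

    // Equivalent of Weblogic Thread #8: stops & closes the same shared JMS connection,
    // e.g. from an error-handling or connection refresh path
    public void refreshConnection() {
        singleConnectionFactory.resetConnection();
    }

    // If receive() and refreshConnection() execute concurrently on an unpatched Spring
    // version, the two Threads can end up blocked on each other's Weblogic FEConnection /
    // FESession monitors, as seen in the Thread Dump above
}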

Root cause: non-Thread-safe Spring JMS SingleConnectionFactory implementation

A code review and a quick search of the Spring JIRA bug database revealed the following Thread safety defect, which correlates perfectly with the above analysis:

# SingleConnectionFactory's resetConnection is causing deadlocks with underlying OracleAQ's JMS connection

A patch for the Spring SingleConnectionFactory was released back in 2009, adding a proper synchronized{} block in order to prevent a Thread deadlock in the event of a JMS Connection reset operation:

synchronized (connectionMonitor) {
  //if condition added to avoid possible deadlocks when trying to reset the target connection
  if (!started) {
    this.target.start();
    started = true;
  }
}

Solution

Our team is currently planning to integrate this Spring patch into our production environment shortly. The initial tests performed in our test environment are positive.

Conclusion

I hope this case study has helped you understand a real-life Java Thread deadlock problem and how proper Thread Dump analysis skills can allow you to quickly pinpoint the root cause of stuck Thread related problems at the code level. Please don’t hesitate to post any comment or question.

3.13.2012

Plumbr: Java memory leak detector


This article is to inform you that I will be performing a review and trial of the Plumbr Java memory leak detector tool. This review will be available from this Blog along with the sample (leaking) Java EE program I created for that purpose.

Plumbr tool creator’s mission

I can assure you that Java memory leaks can be quite complex to solve as there are different memory leak patterns and the analysis phase can be tricky; especially when you attempt to differentiate between leaking Java objects and a healthy memory footprint.

That being said, I agree with the Plumbr creators that Java memory leak analysis should not be reserved only for Java experts; this is one of the reasons I created this Blog in the first place, for the less experienced Java EE individuals who desire to learn key problem patterns such as Java memory leaks and how to resolve them.

Learning these advanced troubleshooting principles is always easier when we are equipped with the proper tools.

Plumbr’s primary objective is to assist you in your troubleshooting effort by learning about your Java / Java EE application and then detecting potential memory leak source(s) and location(s) within your implementation.

Now let’s begin...

Please bookmark this article and stay tuned for the full review and case study.

3.08.2012

IBM JVM tuning – gencon GC policy

This article will provide you with details on an important Java Heap space tuning consideration when migrating from a Java VM such as HotSpot or JRockit to the IBM JVM. This tuning recommendation is based on a recent troubleshooting and tuning mandate I performed for one of my IT clients.

IBM JVM overview

As you probably saw from my other articles, the IBM JVM differs from the HotSpot JVM in some aspects; it does not have a PermGen memory space, for example. From a garbage collection perspective, it does provide you with advanced algorithms that can take advantage of a machine with multiple physical cores, similar to the HotSpot JVM.

From a troubleshooting perspective, IBM provides you with many tools; including out-of-the-box Thread Dump and Heap Dump generation capabilities from its JVM implementation.

The IBM JVM Thread Dump, for example, is particularly powerful as it provides you with extra data on your JVM such as the active JVM environment variables, GC policies, loaded classes in each active class loader etc. We will explore this in more detail in part 4 of our Thread Dump Training plan.

IBM VM – default GC behaviour

Now back to our primary topic: it is very important that you understand the default behaviour of the IBM JVM garbage collector (version 1.5 & 1.6). By default, the Java Heap space is created using tenured memory only, i.e. it does not create a separate YoungGen (nursery) space. This means that any memory allocation goes to the tenured space (short-lived and long-lived objects) which later gets collected by the default collector (via a Full GC).

** Please note that the IBM VM default GC policy has now been changed to gencon for JDK 1.7 **

Find below a verbose GC snapshot showing you the default GC memory breakdown (tenured space only):

<af type="tenured" id="5" timestamp="Mar 01 13:40:30 2012" intervalms="0.000">
  <minimum requested_bytes="48" />
  <time exclusiveaccessms="0.106" meanexclusiveaccessms="0.106" threads="0" lastthreadtid="0x000000011A846B00" />
  <tenured freebytes="20131840" totalbytes="2013265920" percent="0" >
    <soa freebytes="0" totalbytes="1993134080" percent="0" />
    <loa freebytes="20131840" totalbytes="20131840" percent="100" />
  </tenured>
  <gc type="global" id="8" totalid="2492" intervalms="2017588.587">
    <finalization objectsqueued="199" />
    <timesms mark="808.286" sweep="9.341" compact="0.000" total="818.292" />
    <tenured freebytes="1362331024" totalbytes="2013265920" percent="67" >
      <soa freebytes="1344212368" totalbytes="1995147264" percent="67" />
      <loa freebytes="18118656" totalbytes="18118656" percent="100" />
    </tenured>
  </gc>
  <tenured freebytes="1362330976" totalbytes="2013265920" percent="67" >
    <soa freebytes="1344212320" totalbytes="1995147264" percent="67" />
    <loa freebytes="18118656" totalbytes="18118656" percent="100" />
  </tenured>
  <time totalms="818.750" />
</af>



Ok, default IBM JVM GC policy is different… what is the problem?

The problem with this default JVM policy is that all Java objects are allocated to the tenured space and collected via a global collection (Full GC). For many Java EE applications, short-lived objects greatly outnumber long-lived objects. This means that your JVM will need to perform quite a lot of major collections to clean up the short-lived objects; the results: increased Full GC frequency, increased JVM pause time, increased CPU utilization and performance degradation!
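
As a simple illustration (not taken from the affected application), the loop below simulates a typical Java EE request pattern where large numbers of temporary objects become garbage almost immediately; under the default tenured-only policy all of this garbage must be reclaimed by global collections, while a generational collector reclaims it cheaply from the nursery:

public class ShortLivedAllocationDemo {

    public static void main(String[] args) {
        long checksum = 0;

        // Most of these objects (Strings, buffers) die right after each iteration,
        // which is the allocation profile that benefits from a generational GC policy
        for (int i = 0; i < 10000000; i++) {
            String requestId = "request-" + i;   // short-lived String
            byte[] payload = new byte[1024];     // short-lived buffer
            checksum += requestId.length() + payload.length;
        }

        System.out.println("Processed requests, checksum: " + checksum);
    }
}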

This is exactly what we observed while performing load testing following a migration from the HotSpot JVM 1.5 (using the incremental & parallel GC) to the IBM JVM 1.6 with the default GC policy. Heavy GC activity was identified as the root cause, as per the above explanation.

Solution please!

The good news is that the IBM JVM has provided a generational & concurrent GC collector since version 1.5. This GC policy provides exactly what we want:

-        It splits the Java Heap space between nursery and tenured spaces
-        Nursery (YoungGen) space objects are collected separately via the scavenger GC collector
-        Tenured space objects are collected via the global GC collector
-        The GC collector is concurrent and takes advantage of machines with multiple physical cores

Results:

-        Reduced major collection frequency (Full GC)
-        Reduced Full GC elapsed time & pause time
-        Increased JVM throughput
-        Increased performance & capacity of your application

You can enable it by adding the JVM parameter below:

-Xgcpolicy:gencon
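
For example, combined with your existing Java Heap settings and verbose GC logging, a startup line could look as per below (the heap and nursery sizes shown are hypothetical and must be validated against your own capacity planning):

-Xms2048m -Xmx2048m -Xmn512m -Xgcpolicy:gencon -verbose:gc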

Find below what you can expect in your verbose GC log after enabling this GC policy:

<af type="nursery" id="15" timestamp="Mar 08 05:34:06 2012" intervalms="1289096.227">
  <minimum requested_bytes="40" />
  <time exclusiveaccessms="0.085" meanexclusiveaccessms="0.085" threads="0" lastthreadtid="0x0000000118113900" />
  <refs soft="18043" weak="204057" phantom="27" dynamicSoftReferenceThreshold="32" maxSoftReferenceThreshold="32" />
  <nursery freebytes="0" totalbytes="530716672" percent="0" />
  <tenured freebytes="1887422016" totalbytes="2013265920" percent="93" >
    <soa freebytes="1786758720" totalbytes="1912602624" percent="93" />
    <loa freebytes="100663296" totalbytes="100663296" percent="100" />
  </tenured>
  <gc type="scavenger" id="15" totalid="15" intervalms="1289097.271">
    <flipped objectcount="1486449" bytes="129908000" />
    <tenured objectcount="1176" bytes="184144" />
    <finalization objectsqueued="3082" />
    <scavenger tiltratio="73" />
    <nursery freebytes="364304408" totalbytes="495208448" percent="73" tenureage="10" />
    <tenured freebytes="1886766656" totalbytes="2013265920" percent="93" >
      <soa freebytes="1786103360" totalbytes="1912602624" percent="93" />
      <loa freebytes="100663296" totalbytes="100663296" percent="100" />
    </tenured>
    <time totalms="233.886" />
  </gc>
  <nursery freebytes="364238872" totalbytes="495208448" percent="73" />
  <tenured freebytes="1886766656" totalbytes="2013265920" percent="93" >
    <soa freebytes="1786103360" totalbytes="1912602624" percent="93" />
    <loa freebytes="100663296" totalbytes="100663296" percent="100" />
  </tenured>
  <refs soft="17992" weak="5344" phantom="27" dynamicSoftReferenceThreshold="32" maxSoftReferenceThreshold="32" />
  <time totalms="236.858" />
</af>


Please keep in mind that it is still possible that your application may not benefit from this GC policy (bigger footprint of long-lived objects etc.), so my recommendation is to always do your due diligence and perform proper capacity planning & load testing of your application before implementing any major tuning recommendation.

Conclusion

I hope this article has helped you understand the default IBM JVM 1.5/1.6 GC policy and how your Java EE application can benefit from the gencon GC policy tuning recommendation.

Please do not hesitate to post a comment or question at the end of this article. I’m also looking forward to hearing about your experience with the IBM JVM.

3.05.2012

OutOfMemoryError: Out of swap space - Problem Patterns

Today we will revisit a common Java HotSpot VM problem that you have probably already faced at some point in your JVM troubleshooting on the Solaris OS; especially on a 32-bit JVM.

This article will provide you with a description of this particular type of OutOfMemoryError, the common problem patterns and the recommended resolution approach.

If you are not familiar with the different HotSpot memory spaces, I recommend that you first review the article Java HotSpot VM Overview before going any further in this reading.

java.lang.OutOfMemoryError: Out of swap space? – what is it?

This error message is thrown by the Java HotSpot VM (native code) following a failure to allocate native memory from the OS to the HotSpot C-Heap or to dynamically expand the Java Heap etc. This problem is very different from a standard OutOfMemoryError (normally due to an exhaustion of the Java Heap or PermGen space).

A typical error found in your application / server logs is:

Exception in thread "main" java.lang.OutOfMemoryError: requested 53459 bytes for ChunkPool::allocate. Out of swap space?

Also, please note that depending on the OS that you use (Windows, AIX, Solaris etc.), some OutOfMemoryError occurrences due to C-Heap exhaustion may not give you detail such as “Out of swap space”. In this case, you will need to review the OOM error Stack Trace, identify the computing task that triggered the OOM and determine which OutOfMemoryError problem pattern your problem is related to (Java Heap, PermGen or Native Heap exhaustion).

Ok so can I increase the Java Heap via -Xms & -Xmx to fix it?

Definitely not! This is the last thing you want to do as it will make the problem worse. As you learned from my other article, the Java HotSpot VM is split between 3 memory spaces (Java Heap, PermGen, C-Heap). For a 32-bit VM, all these memory spaces compete with each other for memory. Increasing the Java Heap space will further reduce the capacity of the C-Heap and reserve more memory from the OS.
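
As a rough, illustrative budget (actual overheads vary by OS and JVM version): a 32-bit process is typically limited to somewhere between 2 GB and 4 GB of addressable space. With a 2 GB Java Heap, a few hundred MB of PermGen, plus the JVM code, Thread stacks and other native allocations, you can easily be left with well under 1 GB for the C-Heap; pushing -Xmx higher only shrinks that remaining room further.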

Your first task is to determine if you are dealing with a C-Heap depletion or OS physical / virtual memory depletion.

Now let’s see the most common patterns of this problem.

Common problem patterns

There are multiple scenarios which can lead to a native OutOfMemoryError. I will share with you what I have seen in my past experience as the most common patterns.

-        Native Heap (C-Heap) depletion due to too many Java EE applications deployed on a single 32-bit JVM (combined with large Java Heap e.g. 2 GB) * most common problem *
-        Native Heap (C-Heap) depletion due to a non-optimal Java Heap size e.g. Java Heap too large for the application(s) needs on a single 32-bit JVM
-        Native Heap (C-Heap) depletion due to too many created Java Threads e.g. allowing the Java EE container to create too many Threads on a single 32-bit JVM
-        OS physical / virtual memory depletion preventing the HotSpot VM from allocating native memory to the C-Heap (32-bit or 64-bit VM)
-        OS physical / virtual memory depletion preventing the HotSpot VM from expanding its Java Heap or PermGen space at runtime (32-bit or 64-bit VM)
-        C-Heap / native memory leak (third party monitoring agent / library, JVM bug etc.)

Troubleshooting and resolution approach

Please keep in mind that each HotSpot native memory problem can be unique and requires its own troubleshooting & resolution approach.

Find below a list of high level steps you can follow in order to further troubleshoot:

-        First, determine if the OOM is due to C-Heap exhaustion or OS physical / virtual memory depletion. For this task, you will need to perform close monitoring of your OS memory utilization and Java process size. For example, on Solaris a 32-bit JVM process size can grow to about 3.5 GB (technically a 4 GB limit) before you can expect some native memory allocation failures. The Java process size monitoring will also allow you to determine if you are dealing with a native memory leak (growing over time / several days…)

-        The OS vendor and version that you use is important as well. For example, some versions of Windows (32-bit) by default support a process size of up to 2 GB only (leaving you with minimal flexibility for Java Heap and Native Heap allocations). Please review your OS and determine what the maximum process size is, e.g. 2 GB, 3 GB, 4 GB or more (64-bit OS)

-        Like the OS, it is also important that you determine whether you are using a 32-bit or a 64-bit VM. Native memory depletion for a 64-bit VM typically means that your OS is running out of physical / virtual memory

-        Review your JVM memory settings. For a 32-bit VM, a Java Heap of 2 GB+ can really start to put pressure on the C-Heap, depending on how many applications you have deployed, how many Java Threads are created etc… In that case, please determine if you can safely reduce your Java Heap by about 256 MB (as a starting point) and see if it helps improve your JVM memory “balance”.

-        Analyze the verbose GC output or use a tool like JConsole to determine your Java Heap footprint. This will allow you to determine if you can reduce your Java Heap in a safe manner or not

-        When the OutOfMemoryError is observed, generate a JVM Thread Dump and determine how many Threads are active in your JVM; the more Threads, the more native memory your JVM will use (see the sketch below). You will then be able to combine this data with OS metrics, the Java process size and the verbose GC output, allowing you to determine where the problem is
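
Find below a small illustrative sketch, based on the standard java.lang.management API, that you could adapt and run inside the monitored JVM (e.g. started from a startup listener) in order to track the live Thread count over time; the sampling interval and alert threshold are arbitrary:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Runs inside the JVM you want to observe, e.g. started from a servlet context listener
public class ThreadCountMonitor implements Runnable {

    public void run() {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        try {
            while (!Thread.currentThread().isInterrupted()) {
                int liveThreads = threadBean.getThreadCount();
                int peakThreads = threadBean.getPeakThreadCount();

                // Each live Thread also consumes native (C-Heap) memory for its stack (-Xss)
                System.out.println("Live Threads: " + liveThreads + " | Peak: " + peakThreads);

                if (liveThreads > 500) { // arbitrary threshold, for illustration only
                    System.out.println("WARNING: high Thread count - please generate & review a Thread Dump");
                }
                Thread.sleep(30000L); // sample every 30 seconds
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}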

Once you have a clear view of the situation in your environment and root cause, you will be in a better position to explore potential solutions as per below:

-        Reduce the Java Heap (if possible / after close monitoring of the Java Heap) in order to give that memory back to the C-Heap / OS
-        Increase the physical RAM / virtual memory of your OS (only applicable if depletion of the OS memory is observed; especially for a 64-bit OS & VM)
-        Upgrade your HotSpot VM to 64-bit (for some Java EE applications, a 64-bit VM is more appropriate) or segregate your applications across different JVM’s (increasing demand on your hardware but reducing the C-Heap utilization per JVM)
-        Native memory leaks are trickier and require a deeper dive analysis, such as analysis of the Solaris pmap / AIX svmon data and a review of any third party library (e.g. monitoring agents). Please also review the Oracle Sun Bug database and determine if the HotSpot version you use is exposed to known native memory problems

Still struggling with this problem? Don’t worry, simply post a comment / question at the end of this article. I also encourage you to post your problem case to the root cause analysis forum.