This article provides a complete root cause analysis and solution for a java.util.HashMap Thread race condition affecting a Weblogic 10 Portal application.
This case study will also demonstrate how you can combine Solaris PRSTAT and Thread Dump analysis to pinpoint your top CPU contributor Threads within your Java VM(s).
Environment specifications
- Java EE server: Oracle Weblogic Portal 10
- Middleware OS: Solaris 10
- Java VM: Oracle JRockit 1.5.0
- Platform type: Portal application
Monitoring and troubleshooting tools
- Compuware Server Vantage (OS & Server level monitoring agent)
- Solaris PRSTAT (CPU and Thread breakdown command)
- JVM Thread Dump (thread analysis and PRSTAT data correlation)
Problem overview
- Problem type: Intermittent High CPU observed from our production environment
A high CPU problem was observed from Solaris server hosting a Weblogic portal 10 application.
Gathering and validation of facts
As usual, a Java EE problem investigation requires gathering of technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:
· What is the client impact? MEDIUM
· Recent change of the affected platform? No
· Any recent traffic increase to the affected platform? No, traffic also appears to be fairly well distributed across all Weblogic managed servers / JVMs
· How long has this problem been observed? A sudden CPU increase was observed from one of our JVMs
· Is the CPU problem global or isolated? High CPU utilization was observed from only one JVM (60%+ for this Java process), while our other JVMs are showing a healthy 5% average utilization
- Conclusion #1: The high CPU problem appears to be isolated to one JVM only vs. a global problem
CPU utilization analysis
Review of the Compuware Server Vantage CPU data revealed a sharp and sudden CPU increase from one of the 4 JVMs running our Weblogic 10 portal application: 60%+ utilization, sustained, with no sign of the CPU going back down.
Java process CPU breakdown by Thread
One of the best troubleshooting approaches for this type of issue is to generate a PRSTAT snapshot (if using the Solaris OS) combined with a Thread Dump. This is achieved by executing the command below:
prstat -L 0 1 | grep <Java PID>;kill -3 <Java PID>
For our case study, the following PRSTAT output data was generated:
bash-2.05$ prstat -L 0 1 | grep 13853
13853 bea 9982M 9909M cpu100 0 0 8:48:41 5.9% java/90
13853 bea 9982M 9909M cpu1 0 0 2:49:10 5.9% java/91
13853 bea 9982M 9909M cpu98 0 0 8:37:40 5.9% java/85
13853 bea 9982M 9909M cpu101 0 0 2:50:21 5.9% java/88
13853 bea 9982M 9909M cpu103 0 0 1:04:24 5.9% java/28
13853 bea 9982M 9909M cpu102 0 0 4:52:06 5.8% java/94
13853 bea 9982M 9909M cpu6 0 0 4:50:04 5.8% java/32
13853 bea 9982M 9909M sleep 40 0 0:01:11 1.2% java/12289
13853 bea 9982M 9909M sleep 59 0 0:05:36 1.1% java/89
13853 bea 9982M 9909M sleep 30 0 0:19:17 0.7% java/62
13853 bea 9982M 9909M sleep 40 0 0:05:54 0.7% java/93
As you can see in the above snapshot, several Threads were found consuming ~6% CPU each, ultimately contributing to the total ~50-60% CPU utilization.
Such Threads found were: 90, 91, 85, 88, 28, 94, 32
Thread Dump analysis and PRSTAT correlation
Once you obtain the list of culprit Threads consuming most of your CPU, the next step is to correlate this data with the Thread Dump data and identify the source / culprit at the code level.
For a JRockit VM Thread Dump, the PRSTAT Thread id corresponds exactly to the decimal format of the Thread Dump tid field, which is located on the first line of each Thread.
For our case study, a Thread Dump was generated and revealed the culprit Threads, as shown below.
Thread #90 Stack Trace can be found below:
"[STUCK] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'" id=81 idx=0x168 tid=90 prio=1 alive, daemon
at java/util/HashMap.put(HashMap.java:1665)
at <App>.execute()
............................................
at weblogic/servlet/internal/WebAppServletContext$ServletInvocationAction.run(WebAppServletContext.java:3393)
at weblogic/security/acl/internal/AuthenticatedSubject.doAs(AuthenticatedSubject.java:321)
at weblogic/security/service/SecurityManager.runAs(Lweblogic/security/acl/internal/AuthenticatedSubject;(Unknown Source)
at weblogic/servlet/internal/WebAppServletContext.securedExecute(WebAppServletContext.java:2140)
at weblogic/servlet/internal/WebAppServletContext.execute(WebAppServletContext.java:2046)
at weblogic/servlet/internal/ServletRequestImpl.run(ServletRequestImpl.java:1366)
at weblogic/work/ExecuteThread.execute(ExecuteThread.java:200)
at weblogic/work/ExecuteThread.run(ExecuteThread.java:172)
at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
-- end of trace
Thread Dump analysis – HashMap Thread Race condition!
As you can see from the above Stack Trace of Thread #90, the Thread is stuck in an infinite loop / Thread race condition over a java.util.HashMap object. This finding was quite interesting.
All the other culprit Threads were found in the exact same hanging condition, which confirmed the Thread race condition against this application HashMap.
Root cause: non Thread safe HashMap!
A review of the application code that creates and manages read and write operations against this HashMap confirmed that the HashMap was static and not Thread safe e.g. no synchronization of write operations and no usage of Collections.synchronizedMap().
As you may know, a HashMap is not Thread safe by default. This means that it is exposed to Thread race conditions when executing read and, especially, write operations. As per our case study, such a non Thread safe HashMap, under high concurrent load, can lead to an intermittent but severe Thread race condition, infinite looping and high CPU.
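For illustration purposes only, below is a simplified sketch of the kind of pattern that can trigger this problem. The class and field names are hypothetical and are not taken from the actual application code; the point is the shared static HashMap written to concurrently with no synchronization, which on the JDK 1.5 HashMap implementation can corrupt the internal bucket structure and degenerate into the infinite loop observed in HashMap.put().

import java.util.HashMap;
import java.util.Map;

// Hypothetical example of the problematic pattern: a static HashMap shared
// across all Weblogic execute Threads and updated with no synchronization.
public class SessionDataCache {

    private static final Map<String, Object> CACHE = new HashMap<String, Object>();

    // Called concurrently by many request Threads: unsynchronized writes can
    // corrupt the HashMap internal linked lists during a resize and, under
    // load, cause put() or get() to spin forever (high CPU, stuck Threads).
    public static void store(String key, Object value) {
        CACHE.put(key, value); // race condition: not Thread safe
    }

    public static Object lookup(String key) {
        return CACHE.get(key); // may also loop on a corrupted map
    }
}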
Solution: ConcurrentHashMap!
The solution to this problem involved replacing the original non Thread safe HashMap implementation with a java.util.concurrent.ConcurrentHashMap.
For a data structure accessed by highly concurrent Threads, a ConcurrentHashMap is more applicable since it provides Thread safe write operations and non-blocking get operations e.g. get() operations may overlap with update operations (including put and remove).
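Again as a hypothetical sketch (same illustrative class as above, not the actual application code), the change essentially amounts to swapping the backing implementation:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the corrected pattern: same accessors, but backed by
// a Thread safe ConcurrentHashMap (available since JDK 1.5).
public class SessionDataCache {

    private static final Map<String, Object> CACHE =
            new ConcurrentHashMap<String, Object>();

    public static void store(String key, Object value) {
        CACHE.put(key, value); // Thread safe write operation
    }

    public static Object lookup(String key) {
        return CACHE.get(key); // non-blocking read; may overlap with updates
    }
}

One small caveat: unlike HashMap, ConcurrentHashMap does not accept null keys or values, so calling code relying on nulls may need a minor adjustment.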
This implementation strategy eliminated 100% of the observed Thread race conditions, with no noticeable performance impact.
Conclusion
I hope this case study has helped you understand how to pinpoint culprit high CPU Threads at the code level, and the importance of proper Thread safe data structures for highly concurrent applications.
Please don’t hesitate to post any comment or question.
2 comments:
Thank you for the post!
We have the same problem in PROD:
"[STUCK] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)'" id=39 idx=0xdc tid=15362 prio=1 alive, native_blocked, daemon
at java/util/HashMap.put(HashMap.java:468)[optimized]
at java/util/HashSet.add(HashSet.java:200)[optimized]
..............
at weblogic/work/ExecuteThread.execute(ExecuteThread.java:201)[optimized]
at weblogic/work/ExecuteThread.run(ExecuteThread.java:173)
at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
-- end of trace
As we do not know which HashMap in our code is generating the problem, is the only solution to perform a code review of our application and replace the HashMap in question with a ConcurrentHashMap?
Thank you for your answer,
Ali
Hi anonymous and thanks for your comments,
You should be able to identify the HashMap from the above stuck thread. Perform a code review of your application code calling HashSet.add(); you should find the culprit.
Then review the code and see if the HashMap is "shared" across threads e.g. a static HashMap or one exposed via a Singleton, global cache etc. If it is, then yes, you may have to consider using the ConcurrentHashMap data structure, which is designed for concurrent read and thread safe write operations.
Please see my other article for performance differences.
http://javaeesupportpatterns.blogspot.com/2012/08/java-7-hashmap-vs-concurrenthashmap.html
Thanks.
P-H