This post describes the root cause analysis (RCA) approach I have developed and refined over the last 10 years of building, supporting and improving critical production Java EE enterprise systems.
My goal is to share how you should attack a major production problem and ultimately find the root cause along with a solution. It is important to note that the full RCA process must be done after the initial recovery of your environment, since the #1 priority is to recover your client's environment and business.
For this post, I will use the example of a sudden Oracle 10g SQL execution plan change as the root cause.
You can refer to the post below for a real-life example of how I applied the same RCA approach.
OutOfMemoryError using EJB3 and JBoss 5
Problem overview
The first step is to recap and create a brief summary of the problem. At this stage, you should at least know how the problem manifested itself, including any errors found in your server logs and the problem(s) reported by your end users.
Ask yourself and answer the following questions:
· What is the problem reported by the end users? Major clocking (hanging) of the ordering application
· What are the errors seen from your client applications? Browser clocking and a blank page when using the ordering application
· What are the errors found in your server logs? Several errors and timeouts observed when trying to extract data from our internal Oracle 10g database
Business impact
At the second step, it is important to understand and document the client impact.
Ask yourself and answer the following questions:
· What is the business impact level (low, medium or high)? HIGH
· What is the % of your end users affected (25%, 50% or 100%)? 100%
· Is there any possible revenue loss for your client(s)? Yes, users are unable to purchase online products from our client
Gathering and validation of facts
At this point, you should have a high level view of the problem along with its associated business impacts.
The next step is quite crucial. At this stage, you need to revisit and gather all the problem facts. Reproducing the problem may be required for some types of problems, especially if you are missing too much information or too many technical facts to move forward with your investigation.
For the next step, I apply some concepts of forward-chaining reasoning, similar to what is used by a rule engine system.
The idea is to build a list of technical and non-technical facts so you can either derive other facts and/or conclude on the root cause as per Java EE troubleshooting rules. By applying a forward-chaining type of reasoning and analysis, you will eventually derive the root cause from all of your technical and non-technical input facts.
Please note that once you gain more experience, you will also be able to do some backward-chaining RCA reasoning instead, i.e. quickly start with a short list of possible root cause candidates and verify them against your Java EE problem facts until you get a match. This process is normally faster but requires the application support person to have more experience and knowledge of a multitude of Java EE problems and patterns.
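To make the forward-chaining idea concrete, here is a minimal, hypothetical Java sketch; the facts and rules are illustrative only and not part of any real rule engine. Each rule fires when its input facts are present and contributes new derived facts, until no more rules apply or a conclusion is reached.

import java.util.LinkedHashSet;
import java.util.Set;

public class ForwardChainingRca {

    // A "fact" is simply a short statement captured during the investigation.
    public static void main(String[] args) {
        Set<String> facts = new LinkedHashSet<String>();
        facts.add("threads hanging in Socket.read() against Oracle");
        facts.add("no recent change of the affected platform");
        facts.add("no recent traffic increase");

        boolean derivedNewFact = true;
        while (derivedNewFact) {
            derivedNewFact = false;

            // Rule 1: hanging database reads + stable platform => suspect the database tier
            if (facts.contains("threads hanging in Socket.read() against Oracle")
                    && facts.contains("no recent change of the affected platform")) {
                derivedNewFact |= facts.add("problem isolated on the database side");
            }

            // Rule 2: database-side problem => next action is AWR report analysis
            if (facts.contains("problem isolated on the database side")) {
                derivedNewFact |= facts.add("action: analyse the Oracle AWR report");
            }
        }
        System.out.println("Derived facts & conclusions: " + facts);
    }
}

In practice the "rules" live in your head or in a troubleshooting checklist rather than in code, but the mechanics are the same: keep injecting facts until a conclusion can be derived.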
The following statements are examples of technical and non-technical facts captured during the RCA process:
· Recent change of the affected platform (yes or no)? No
· Any recent traffic increase to the affected platform (yes or no)? No
· Is this a new or existing problem and what is the frequency? New problem, first time observed
· What is the outcome of the JVM Thread Dump analysis (any pattern found such as a downstream system hanging, an internal deadlock etc.)? All threads are hanging and waiting for a response from our Oracle 10g database for a particular SQL, e.g. threads hanging in a Socket.read() operation (a detection sketch follows after this fact list)
· What is the outcome of the JVM Heap Dump analysis (Java Heap leak or footprint problem etc.)? No Heap Dump generated and no OutOfMemoryError found in logs
· Any other technical or non technical fact you can use as input? No other input fact collected
Facts derivation & conclusions:
· Problem appears to be isolated between our application server and the database server, and possibly more on the database side (as per the Thread Dump analysis); the next step will require analysis of the Oracle 10g AWR report.
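As a complement to taking full thread dumps (e.g. via kill -3 or jstack), below is a minimal, hypothetical Java sketch showing how the Socket.read() hang pattern described above could be flagged programmatically from inside the JVM. The class name and matching logic are assumptions for illustration, not part of the original troubleshooting.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class SocketReadHangDetector {

    public static void main(String[] args) {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();

        // Snapshot of all live threads, including their stack traces
        for (ThreadInfo info : threadMXBean.dumpAllThreads(false, false)) {
            for (StackTraceElement frame : info.getStackTrace()) {
                // Threads stuck waiting on the database show up blocked in SocketInputStream.read()
                if (frame.getClassName().endsWith("SocketInputStream")
                        && frame.getMethodName().startsWith("read")) {
                    System.out.println("Possible remote hang: thread '"
                            + info.getThreadName() + "' is waiting in Socket.read()");
                    break;
                }
            }
        }
    }
}

If many threads match at the same time, it correlates with the downstream-hang pattern above and points the investigation toward the database tier.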
Processing of facts (forward chaining reasoning)
At this point, the idea is to inject new facts along with the current derived facts and/or conclusions. This process should also allow you to isolate the problem further, gradually reducing the list of possible root cause candidates.
Find below some examples:
· What is the outcome of the Oracle 10g AWR report? The AWR report is showing a very long running SQL, which correlates with the JVM Thread Dump analysis. The next step is to analyse the Oracle execution plan of the affected SQL
· Analysis of the Oracle AWR report before and during the problem did reveal a sudden performance degradation of the affected SQL (a plan-history sketch follows below)
Facts derivation & conclusions:
· Problem appears to be isolated at the database level for one particular SQL
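For this AWR correlation step, here is a minimal, hypothetical JDBC sketch; it assumes the Oracle JDBC driver is on the classpath and that the AWR views are accessible, and the connection details and SQL_ID are placeholders. It lists the execution statistics of the affected SQL per AWR snapshot; a change of PLAN_HASH_VALUE between snapshots is the signature of a sudden execution plan change.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SqlPlanHistoryCheck {

    public static void main(String[] args) throws Exception {
        // Placeholder connection details; replace with your own environment
        Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@dbhost:1521:ORCL", "perf_user", "password");
        try {
            PreparedStatement ps = con.prepareStatement(
                  "SELECT snap_id, plan_hash_value, "
                + "       elapsed_time_delta / NULLIF(executions_delta, 0) AS avg_elapsed_us "
                + "FROM   dba_hist_sqlstat "
                + "WHERE  sql_id = ? "
                + "ORDER  BY snap_id");
            ps.setString(1, "f9u2k7wx8p1qz"); // hypothetical SQL_ID taken from the AWR report
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                System.out.printf("snap=%d plan_hash=%d avg_elapsed_us=%.0f%n",
                        rs.getLong(1), rs.getLong(2), rs.getDouble(3));
            }
        } finally {
            con.close();
        }
    }
}

Two different PLAN_HASH_VALUE entries around the incident window, combined with a jump in the average elapsed time, confirm the degradation observed from the thread dumps.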
Process the final facts and derive the root cause
Once you have isolated the problem to a certain level, you can now complete your RCA by deriving the root cause itself.
Find below some examples:
· What is the outcome of the Oracle SQL execution plan analysis? The analysis did reveal that Oracle decided to change its execution plan on the fly to a sub-optimal plan, causing a surge in the SQL elapsed time and CPU resource utilization (a sketch for pulling the historical plans follows below)
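To inspect the good and bad plans themselves, one hedged option, assuming DBMS_XPLAN.DISPLAY_AWR is available in your Oracle 10g release and reusing the same placeholder connection details and hypothetical SQL_ID as above, is to pull the historical plans captured by AWR:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AwrPlanDisplay {

    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@dbhost:1521:ORCL", "perf_user", "password");
        try {
            Statement stmt = con.createStatement();
            // DBMS_XPLAN.DISPLAY_AWR returns the formatted plan(s) stored in the AWR repository
            ResultSet rs = stmt.executeQuery(
                "SELECT plan_table_output "
              + "FROM   TABLE(DBMS_XPLAN.DISPLAY_AWR('f9u2k7wx8p1qz'))"); // hypothetical SQL_ID
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        } finally {
            con.close();
        }
    }
}

Comparing the output for each PLAN_HASH_VALUE quickly shows which operation changed, e.g. an index access replaced by a full table scan.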
Root cause
At this point, you should have derived one or more possible root causes for your problem.
Root cause:
· The root cause appears to be related to a sub-optimal execution plan activated by Oracle following dynamic query input parameter injection
Please keep in mind that you will face situations where you will never find the root cause of your problem. The reason is simple: a lack of proper input facts prevents you from deriving the exact root cause. This can happen if you fail to gather the proper performance data such as a JVM Thread Dump, a Heap Dump or proper application logs, and/or you fail to replicate the problem.
Don't be discouraged, as you can still narrow the problem down to a list of root cause candidates and improve your fact-gathering process in the event of any re-occurrence of the problem, and for future problems.
Solution
Once you have found the root cause, you are now at the point where you can explore and implement a solution. For our Oracle execution plan problem, the options include:
· Lock down the SQL execution plans by carefully controlling CBO statistics and using stored outlines (optimizer plan stability)
· Add detailed hints to the SQL (see the sketch after this list)
· Use Oracle 10g SQL Profiles
· Upgrade to Oracle 11g, which offers a more elegant execution plan management and evolution solution
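As a simple illustration of the hint-based option, here is a hypothetical example (the table, index and column names are made up) of pinning the access path directly in the application's JDBC query:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class HintedOrderQuery {

    // Hypothetical hinted query: the /*+ INDEX(...) */ hint directs the optimizer
    // to use the named index instead of the sub-optimal plan it picked on the fly.
    private static final String FIND_ORDERS_SQL =
          "SELECT /*+ INDEX(o ORDERS_STATUS_IDX) */ o.order_id, o.status, o.total "
        + "FROM   orders o "
        + "WHERE  o.status = ? AND o.customer_id = ?";

    public static PreparedStatement prepare(Connection con, String status, long customerId)
            throws SQLException {
        PreparedStatement ps = con.prepareStatement(FIND_ORDERS_SQL);
        ps.setString(1, status);
        ps.setLong(2, customerId);
        return ps;
    }
}

Hints are precise but intrusive, since they require an application change and a redeploy; stored outlines and SQL Profiles achieve similar plan stability from the database side.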
Conclusion
I hope this post has helped you understand how I normally approach and resolve complex Java EE production problems. Please feel free to post a comment about your own RCA process and experience, along with your own suggestions.