This case study describes the complete steps, from root cause analysis to resolution, of an intermittent Weblogic Connection Pool connectivity problem experienced between an Oracle Weblogic 8.1 application server and an Oracle 10g database.
It will also demonstrate why it is important for an application support person to master some basic network troubleshooting skills and techniques in order to perform proper problem isolation and root cause analysis for this type of problem.
· Java EE server: Oracle Weblogic Platform 8.1 SP6
· OS: AIX 5.3 TL9 64-bit
· JDK: IBM JRE 1.4.2 SR13 32-bit
· RDBMS: Oracle 10gr2
· Platform type: Ordering application
· AIX 5.3 PING command
· AIX 5.3 TRACEROUTE command
· Weblogic dbping utility
- Problem type: The DBMS driver exception was: Io exception: The Network Adapter could not establish the connection
An intermittent connectivity problem was observed in our production environment between our application server and database server. The above Weblogic error was spotted in our logs during problem reproduction.
Initial problem mitigation involved restarting the affected Weblogic managed server(s), almost on a daily basis, until a successful connection was established with the remote database server.
Gathering and validation of facts
A Java EE problem investigation requires the gathering of technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:
· What is the client impact? Low, problem was intermittent and our platform has proper load balancing and fail-over in place
· Recent change of the affected platform? No
· Any recent traffic increase to the affected platform? No
· Any recent activity or restart of the application server or database server? Yes, the application server is restarted on a daily basis. The remote database server was last physically restarted a few weeks ago, following a network incident in the server farm
· How long has this problem been observed? For a few weeks
· Is the JDBC Connection Pool connectivity problem consistent or intermittent? The problem is intermittent
· Did a restart of the Weblogic server resolve the problem? No, currently only used as a mitigation strategy
· Did the DBA team find any problem with the Oracle 10g database? No problem was found with the database itself
· Did the support team analyze the Weblogic logs for errors? Yes, a Weblogic JDBC error was found and, as per the Weblogic / JDBC Driver documentation, it indicates that the JDBC driver is unable to physically connect to the remote Oracle database
· Conclusion #1: The problem and error type appear to point to a network / connectivity problem between Weblogic application server and remote database server
· Conclusion #2: The recent network problem and physical restart of the Oracle database server are potential triggers
Weblogic error log review
The error below was found during problem reproduction. Such error prevented the initialization and deployment of our primary application JDBC Data Source and application.
<Warning> <JDBC> <BEA-001129> <Received exception while creating connection for pool "<App Conn Pool>": Io exception: The Network Adapter could not establish the connection
<Error> <JDBC> <BEA-001150> <Connection Pool "<App Conn Pool>" deployment failed with the following error: 0:Could not create pool connection. The DBMS driver exception was: Io exception: The Network Adapter could not establish the connection.>
<Error> <JDBC> <BEA-001151> <Data Source "<App DS>" deployment failed with the following error: DataSource(App DS) can't be created with non-existent Pool (connection or multi) (App Conn Pool).>
Network health check using PING, TRACEROUTE and other utilities
Given the intermittent behaviour of this problem, the support team decided to perform some additional analysis of the network situation between our application and database server. The AIX PING command was used for that purpose as per below.
# Send 5 packets of 64 bytes to the remote database IP address
ping -c 5 -q -s 64 <IP address>
# Validate the connectivity and route through the different hop(s) from the
# source server (Weblogic application server) to the destination server
# (Oracle 10g database server)
traceroute <IP address>
As per the above results, ~50% packet loss was observed between our application and database server. The intermittent connectivity problem was also reproduced using the traceroute command.
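When shell access to the application server is restricted, a similar reachability probe can be run from the JVM itself. Below is a minimal sketch using the JDK's InetAddress API (the host name is a placeholder; note that isReachable() may fall back to a TCP probe on port 7 when raw ICMP is not permitted to the process):

```java
import java.net.InetAddress;

public class HostReachability {

    // Probes the remote host, returning true if it answered within timeoutMs.
    // Without root privileges the JDK typically falls back from ICMP to a
    // TCP probe on the echo port; a refused connection still counts as "up".
    public static boolean probe(String host, int timeoutMs) {
        try {
            return InetAddress.getByName(host).isReachable(timeoutMs);
        } catch (Exception e) {
            return false; // unknown host or I/O failure
        }
    }

    public static void main(String[] args) {
        String dbHost = args.length > 0 ? args[0] : "db-server"; // placeholder
        System.out.println(dbHost + " reachable = " + probe(dbHost, 2000));
    }
}
```

Running a handful of such probes in a loop gives a rough packet-loss percentage comparable to the ping summary above.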
Please note that Weblogic also provides a database "ping" utility that you can use to test network connectivity and database listener availability from the WebLogic physical server to the remote DB server. This utility basically simulates the creation of a new JDBC Connection via java.sql.DriverManager.
DB Ping Usage
../<JAVA_HOME>/bin/java -classpath ../<WL_HOME>/<WL_SERVER_HOME>/server/lib/weblogic.jar utils.dbping ORACLE_THIN <dbUserName> <dbPassword> <dbURL>
DB Ping - Other RDBMS Provider Usage
java utils.dbping DB2B [-d dynamicSections] USER PASS HOST:PORT/DBNAME
java utils.dbping JCONN2 USER PASS HOST:PORT/DBNAME
java utils.dbping JCONN3 USER PASS HOST:PORT/DBNAME
java utils.dbping JCONNECT USER PASS HOST:PORT/DBNAME
java utils.dbping INFORMIXB USER PASS HOST:PORT/DBNAME/INFORMIXSERVER
java utils.dbping MSSQLSERVERB USER PASS HOST:PORT/[DBNAME]
java utils.dbping MYSQL USER PASS [HOST][:PORT]/[DBNAME]
java utils.dbping ORACLEB USER PASS HOST:PORT/DBNAME
java utils.dbping ORACLE_THIN USER PASS HOST:PORT:DBNAME
java utils.dbping POINTBASE USER PASS HOST[:PORT]/DBNAME
java utils.dbping SYBASEB USER PASS HOST:PORT/DBNAME
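Since dbping essentially asks java.sql.DriverManager for a new connection, the same check can be reproduced with a few lines of your own. Below is a rough, simplified sketch of that behaviour (class name, credentials and DB coordinates are illustrative placeholders, not WebLogic's actual code; the style is kept 1.4-compatible to match the JDK above):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ManualDbPing {

    // Builds an Oracle thin JDBC URL, mirroring the HOST:PORT:DBNAME
    // argument expected by "utils.dbping ORACLE_THIN".
    public static String buildThinUrl(String host, int port, String sid) {
        return "jdbc:oracle:thin:@" + host + ":" + port + ":" + sid;
    }

    public static void main(String[] args) {
        String url = buildThinUrl("db-server", 1521, "ORCL"); // placeholders
        Connection conn = null;
        try {
            // Explicit driver registration (required on a 1.4-era JDK)
            Class.forName("oracle.jdbc.driver.OracleDriver");
            conn = DriverManager.getConnection(url, "dbUserName", "dbPassword");
            System.out.println("Connection OK");
        } catch (Exception e) {
            // "The Network Adapter could not establish the connection"
            // surfaces here when the listener is unreachable
            System.out.println("Connection FAILED: " + e.getMessage());
        } finally {
            if (conn != null) {
                try { conn.close(); } catch (SQLException ignore) {}
            }
        }
    }
}
```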
# Validate the database listener availability at the TCP port level
telnet <DB hostname> <DB listener port>
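When telnet is not installed on the box, the same listener port health check can be done with a plain TCP socket from Java. A minimal sketch (host and port below are placeholders; 1521 is simply the conventional Oracle listener port):

```java
import java.net.InetSocketAddress;
import java.net.Socket;

public class ListenerPortCheck {

    // Returns true if a TCP connection to host:port completes within
    // timeoutMs, i.e. something is accepting connections on that port.
    public static boolean isListening(String host, int port, int timeoutMs) {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (Exception e) {
            return false; // refused, timed out or unroutable
        } finally {
            try { socket.close(); } catch (Exception ignore) {}
        }
    }

    public static void main(String[] args) {
        // Placeholder coordinates
        System.out.println("listener up = "
                + isListening("db-server", 1521, 2000));
    }
}
```

Note that a successful TCP connect only proves the port is open; dbping remains the better test for end-to-end JDBC login validation.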
Network sniffer analysis
Following these findings, our application support team engaged a network sniffer team to troubleshoot the problem further. The analysis was done by sniffing the inbound and outbound traffic packets generated by the ping and traceroute commands, at the network switch level, between the source and destination servers.
The sniffer team found that the lost packets were actually never coming out of the remote database server, which further isolated the problem to the remote database server itself.
Suspected root cause
The combination of the gathered facts, along with the analysis by the application and network support teams, led to the conclusion of a routing problem affecting the Oracle 10g database server and causing intermittent but consistent packet loss with our application server.
Given the recent network problem in the server farm and the physical reboot of the server, it was suspected that the root cause of the problem was an invalid ARP table at the network switch and/or Oracle database server level.
ARP stands for Address Resolution Protocol. An IP packet delivered to the next hop is encapsulated in an Ethernet frame. Such a frame must contain a destination MAC address, which is determined by inspecting the ARP cache table. If the table does not have an entry, the host or switch will issue an ARP request and wait for a response from the next hop. Any problem with this cached table can lead to routing and connectivity problems, requiring a reset.
Resolution and results
As per the root cause analysis, the physical server support team proceeded with a physical reboot of the affected database server, which reset/cleared the ARP tables at both the switch and server level.
The results were quite conclusive: post restart, the packet loss dropped from ~50% to 0%. The traceroute command also indicated fast connectivity with no delay.
Conclusion and recommendations
· When facing “The Network Adapter could not establish the connection” problem with Weblogic, do not assume that the database server is down; gather all the data and facts instead and proceed with simple network problem isolation using ping, traceroute and telnet (port health check)
· Perform your due diligence and problem isolation before engaging a network sniffer team; this will help you speed up the root cause analysis process
· Please make sure to keep track of key events/deployments inside your environment, including network problem episodes, as those types of events are often the trigger of Java EE related problems