/ ORA-01115 IO error reading solution ~ Java EE Support Patterns

7.24.2011

ORA-01115 IO error reading solution

This case study describes the complete root cause analysis and resolution of an Oracle 10gR2 database problem (ORA-01115) affecting a Weblogic Integration 9.2 Java EE application.

Environment specifications

-          Java EE server: Oracle Weblogic Integration 9.2 MP2
-          Database server: Oracle 10gR2
-          Middleware OS: Sun Solaris 5.10
-          Database server OS: IBM AIX 5.3 TL5
-          JDK: Sun Java HotSpot(TM) Server VM Version 1.5.0_10-b03
-          Platform type: BPM

Problem overview

-      Problem type: java.io.IOException: [BEA][Oracle JDBC Driver][Oracle]ORA-01115: IO error reading block from file 30 (block # 255462)

An IO Exception error was observed from the Weblogic Integration server logs. This Exception was thrown from the remote Oracle database when trying to insert data within WLI internal tables.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering of technical and non technical facts so we can either derived other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·       What is the client impact? HIGH (full outage of our application)
·       Recent change of the affected platform? No
·       Any recent traffic increase to the affected platform? No
·       Since how long this problem has been observed?  This error suddenly appeared for the first time early in the morning
·       What is the health of the Oracle database server? Our Oracle DBA did confirm a problem regarding access to one of our Oracle data file (/oracle/db/oracle_data01.dbf')
·       Did a restart of the Weblogic Integration server resolve the problem? No

·         Conclusion #1: The problem seems to be isolated on the Oracle database server side and related to a possible IO problem accessing one of the Oracle internal data files.

Weblogic log file analysis: ORA-01115!

A first analysis of the problem was done by reviewing the Weblogic Integration managed server log errors.

java.io.IOException: [BEA][Oracle JDBC Driver][Oracle]ORA-01115: IO error reading block from file 30 (block # 255462)
ORA-01110: data file 30: '/oracle/db/oracle_data01.dbf'
ORA-27091: unable to queue I/O
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 110: Media surface error
Additional information: 7
Additional information: 255461
Additional information: -1

        at weblogic.jdbc.base.BaseBlobOutputStream.write(Unknown Source)
        at weblogic.jdbc.base.BaseBlobOutputStream.write(Unknown Source)
        at com.bea.wli.knex.runtime.core.bean.BMPContainerBean$OracleTableAccess.doStoreByInsert(BMPContainerBean.java:860)
        at com.bea.wli.knex.runtime.core.bean.BMPContainerBean.doInsert(BMPContainerBean.java:1835)
        at com.bea.wli.knex.runtime.core.bean.BMPContainerBean.ejbStore(BMPContainerBean.java:1787)
        at com.bea.wli.bpm.runtime.ProcessContainerBean.ejbStore(ProcessContainerBean.java:204)
        at com.bea.wli.bpm.runtime.ProcessContainerBean.ejbStore(ProcessContainerBean.java:191)
        at com.bea.wlwgen.PersistentContainer_sq78z1_Impl.ejbStore(PersistentContainer_sq78z1_Impl.java:191)
        at weblogic.ejb.container.manager.ExclusiveEntityManager.beforeCompletion(ExclusiveEntityManager.java:609)
        at weblogic.ejb.container.manager.ExclusiveEntityManager.beforeCompletion(ExclusiveEntityManager.java:580)
        at weblogic.ejb.container.internal.TxManager$TxListener.beforeCompletion(TxManager.java:1028)
        at weblogic.transaction.internal.ServerSCInfo$CallBeforeCompletionsAction.run(ServerSCInfo.java:1529)
        at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:363)
        at weblogic.security.service.SecurityManager.runAs(SecurityManager.java:147)
        at weblogic.transaction.internal.ServerSCInfo.doBeforeCompletion(ServerSCInfo.java:1161)
        at weblogic.transaction.internal.ServerSCInfo.callBeforeCompletions(ServerSCInfo.java:1143)
        at weblogic.transaction.internal.ServerSCInfo.startPrePrepareAndChain(ServerSCInfo.java:115)
        at weblogic.transaction.internal.ServerTransactionImpl.localPrePrepareAndChain(ServerTransactionImpl.java:1288)
        at weblogic.transaction.internal.ServerTransactionImpl.globalPrePrepare(ServerTransactionImpl.java:2100)
        at weblogic.transaction.internal.ServerTransactionImpl.internalCommit(ServerTransactionImpl.java:259)
        at weblogic.transaction.internal.ServerTransactionImpl.commit(ServerTransactionImpl.java:227)
        at weblogic.ejb.container.internal.MDListener.execute(MDListener.java:463)
        at weblogic.ejb.container.internal.MDListener.transactionalOnMessage(MDListener.java:335)
        at weblogic.ejb.container.internal.MDListener.onMessage(MDListener.java:291)
        at weblogic.jms.client.JMSSession.onMessage(JMSSession.java:4072)
        at weblogic.jms.client.JMSSession.execute(JMSSession.java:3962)
        at weblogic.jms.client.JMSSession$UseForRunnable.run(JMSSession.java:4490)
        at weblogic.work.ServerWorkManagerImpl$WorkAdapterImpl.run(ServerWorkManagerImpl.java:518)
        at weblogic.work.ExecuteThread.execute(ExecuteThread.java:209)
        at weblogic.work.ExecuteThread.run(ExecuteThread.java:181)

As you can see, the ORA-01115 error was thrown (along with ORA-01110 ORA-27091 ORA-27072) during the insert operation of WLI data. An AIX related error was also thrown (Media surface error).

-          Conclusion #2: The problem seems to be related to a physical inaccessibility (OS and / or disk) of this critical Oracle data file. A logical file corruption problem was also kept in mind as a possible root cause.

OS problem replication: The media surface is damaged!

After the initial log review and problem isolation, an AIX system admin was engaged to further troubleshoot this problem.  A simple copy of the affected Oracle data file was attempted:

:> cp /oracle/db/oracle_data01.dbf .
cp: /oracle/db/oracle_data01.dbf: The media surface is damaged.


This simple problem replication step did confirm that the AIX OS also had lost control of the file.
The following AIX command was also executed in order to review the AIX OS error report:

errpt –a
…………………………………………………………………………………..
Description
DISK OPERATION ERROR

Probable Causes
MEDIA

User Causes
MEDIA DEFECTIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
MEDIA
DISK DRIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES
…………………………………………………………………………………..

-          Conclusion #3: At this point, it was determined that the problem was very likely related to a physical problem with the SAN / RAID cluster. A technician was dispatched to perform further analysis of our RAID (Redundant Array of Independent Disks) environment.

Root cause: RAID disk failure!

Health check of the RAID environment did confirm that one of the 7 available disks failed earlier in the morning; just a few minutes before the first occurrence of the ORA-01115 error in the Weblogic logs.

Unfortunately, the disk failure triggered corruption of the RAID parity; affecting one of our Oracle data file. New files could be written to the RAID (6 disks left) but some files were not available due to parity corruption.

A disk failure has negative impact on parity. A disk can fail for various reasons such as the disk may be overused and due to excess variation in the power supply, it may result in the failure of the disk. Failure to replace it immediately can cause adverse affects on the system performance and may even damage the disks which are used in backup. The RAID parity provides protection against drive failures but if it encounters a parity error, it will start writing a bad data on the drive.

In our case, the technician was able to resolve the parity corruption without having to restore any data from the backup tapes. Resolution of the parity automatically allowed Oracle to gain back control of the affected date file; without any Oracle listener or database restart.

A complete Weblogic restart was performed as the final resolution step in order to re-initialize the affected WLI JDBC Connection pools.

Conclusion and recommendations

-          When facing ORA-01115 with Oracle, please ensure to do a fast and complete root cause analysis
-          Avoid taking unnecessary and non efficient resolution steps such as an early Weblogic restart or Oracle database restart

0 comments:

Post a Comment