2011-05-06

database corruption after shutdown immediate in 11.2

ORA-600 or Data Corruption possible during shutdown normal/transactional/immediate of RAC instances in a rolling fashion [ID 1318986.1]

Applies to:

Oracle Server - Enterprise Edition - Version: 11.2.0.1 to 11.2.0.2 - Release: 11.2 to 11.2
Information in this document applies to any platform.

Description

Bug 10205230 has been identified that, under certain circumstances, could cause data corruption or ORA-600 errors in RAC environments. The issue can occur when Oracle RAC instances are shut down in "normal", "transactional", or "immediate" mode in a rolling fashion, i.e. while other instances continue to operate normally. The issue does not occur if the instances are shut down using "shutdown abort" or if the database is shut down as a whole; either approach serves as a workaround. Oracle ASM is not impacted by this issue.

This issue could occur in 11.2.0.1 and 11.2.0.2. It has been fixed in 11.2.0.1 Exadata BP9, 11.2.0.2 PSU 2, 11.2.0.2 Exadata BP2, and later. We strongly recommend that customers on earlier 11.2 releases proactively apply the patch for this fix as soon as possible.
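One rough way to check whether a given Oracle home already carries the fix is to search the OPatch inventory for the bug number. The sketch below assumes that the inventory text lists fixed bug numbers as plain digits. The helper name has_fix_10205230 is hypothetical, and a miss is not proof that the fix is absent, since it depends on how the bundle reports its contents.

```shell
#!/bin/sh
# Sketch: look for bug 10205230 in OPatch inventory text read from stdin.
# Assumption: fixed bugs appear in the inventory output as plain numbers.
# A non-match does not prove the fix is missing; confirm with Oracle Support.
has_fix_10205230() {
    # -w matches the number only as a whole word, -q suppresses output.
    grep -qw '10205230'
}
```

Possible usage, assuming ORACLE_HOME is set: "$ORACLE_HOME"/OPatch/opatch lsinventory | has_fix_10205230 && echo "bug 10205230 listed as fixed"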

Likelihood of Occurrence

The issue can occur under certain timing conditions when RAC instances are shut down in a rolling fashion with shutdown normal/transactional/immediate while the other instances are still under normal workload. Note that this issue does not happen with shutdown abort. Oracle ASM is not impacted by this issue.

Possible Symptoms

The known downstream effects include the following:

* Data corruption occurs around the time one or more of the RAC instances are shut down

* One of the following ORA-600 asserts:
- ORA-600 [kclchkblk_3]
- ORA-600 [kclwcrs_6]
- ORA-600 [ktubko_1]
- ORA-600 [kcratr_scan_lostwrt]
- ORA-600 [3020] on the standby database
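The asserts listed above can be searched for in an alert log with a small filter. This is a minimal sketch: it assumes the alert log writes the error as ORA-00600 (or ORA-600) with the first bracketed argument on the same line, and the function name scan_ora600 is hypothetical.

```shell
#!/bin/sh
# Sketch: print alert-log lines whose first ORA-600 argument is one of the
# asserts named in the note. Reads log text on stdin.
# Assumption: the error code and its first bracketed argument share one line.
scan_ora600() {
    # ORA-0*600 covers both ORA-600 and ORA-00600;
    # [^[]* restricts the match to the first bracketed argument.
    grep -E 'ORA-0*600[^[]*\[(kclchkblk_3|kclwcrs_6|ktubko_1|kcratr_scan_lostwrt|3020)\]'
}
```

Possible usage, with the log path left as a placeholder: scan_ora600 < /path/to/alert_<instance_name>.log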

Workaround or Resolution

1. Shut down the database as a whole using the abort option:
SQL> alter system checkpoint;
srvctl stop database -d <db_unique_name> -o abort -f
The first command writes out dirty buffers for all instances to minimize instance recovery. The srvctl command shuts down all instances, and their dependent resources, with the abort option. Shutdown abort completely bypasses the vulnerable code path of the bug.

2. If instances must be shut down in a rolling fashion, shut down each instance with the following two commands instead of shutdown normal/transactional/immediate:

SQL> alter system checkpoint local;
SQL> shutdown abort;

The first command writes out dirty buffers for this instance to minimize instance recovery. Shutdown abort completely bypasses the vulnerable code path of the bug. An instance can also be shut down with the abort option using "srvctl stop instance -d <db_unique_name> -i <instance_name> -o abort".
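The two per-instance statements can be wrapped in a small script so that they always run together and in the right order. The following is a sketch only: it assumes sqlplus is on the PATH, that ORACLE_SID and ORACLE_HOME point at the local instance, and that OS authentication ("/ as sysdba") is permitted. The function names and the DRY_RUN switch are hypothetical.

```shell
#!/bin/sh
# Sketch of the rolling workaround: local checkpoint, then shutdown abort.
# Assumptions: sqlplus on PATH, environment set for the local instance,
# OS authentication as SYSDBA allowed.

# Print the SQL*Plus script for one instance (the two statements from the note).
abort_sql() {
    cat <<'EOF'
whenever sqlerror exit failure
alter system checkpoint local;
shutdown abort;
exit
EOF
}

# Run the script against the local instance.
# With DRY_RUN=1, only print what would be sent to SQL*Plus.
stop_local_instance() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        abort_sql
    else
        abort_sql | sqlplus -S "/ as sysdba"
    fi
}
```

Running DRY_RUN=1 first shows the statements without touching the instance. The "whenever sqlerror exit failure" line makes the script stop if the checkpoint fails, which is a design choice of this sketch and not part of the note.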

Recovery

This bug may cause logical corruption in the redo stream. The standard methods Oracle supports for recovering from logical corruption are:
  • Drop and recreate the corrupt object. This may work for objects like indexes.
  • Use the DBMS_REPAIR package to repair and skip the corrupted blocks, and recreate the object with create table as select. This may work for objects like tables and partitioned tables, but may result in data loss.
  • Failover to a standby database that has not been affected by the data corruption. This may result in data loss.
  • Flashback or perform point-in-time recovery. This may also result in data loss.

If recovery on a standby database is stuck on ORA-600 [3020], you can invoke trial recovery to see if there are other corrupt blocks:

SQL> recover automatic standby database allow n corruption test;

Here n can be 1 or any other integer.

One can also use the following command to allow standby recovery to continue by marking the problematic block as corrupt:

SQL> recover automatic standby database allow n corruption;

For more details, see Oracle support note 1265884.1.

Depending on the nature of the corruption, there might be other recovery methods. Call Oracle support for other options.

Patches

If possible, shut down the database as a whole (Option 1 in the workaround section) to apply the patch. If not, the patch can still be applied in a rolling fashion, but with a slightly modified procedure, because the rolling shutdowns performed during a rolling upgrade would themselves expose the procedure to the bug. Instead of shutdown normal/transactional/immediate, shut down each instance with the following two commands:

SQL> alter system checkpoint local;
SQL> shutdown abort;

The first command writes out dirty buffers for this instance to minimize instance recovery. Shutdown abort completely bypasses the vulnerable code path of the bug. An instance can also be shut down with the abort option using "srvctl stop instance -d <db_unique_name> -i <instance_name> -o abort".

Apart from this special step, the normal rolling upgrade procedure applies.
