2013-02-07

Top issues for node eviction in Oracle RAC

Top 5 issues for Instance Eviction [ID 1374110.1]
1.
An ORA-29740 error occurs when an instance evicts another instance in a RAC database.  The evicted instance reports the ORA-29740 error in its alert.log.
Possible reasons include a communications error in the cluster and a failure to issue a heartbeat to the control file, among others.

Checking the lmon trace files of all instances is very important for determining the reason code.  Look for the line with "kjxgrrcfgchk: Initiating reconfig".
This line gives a reason code, for example "kjxgrrcfgchk: Initiating reconfig, reason 3".  Most ORA-29740 errors raised when an instance is evicted are due to reason 3, which means "Communications Failure".
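
As a rough illustration of the check described above, the following sketch scans the text of an lmon trace file for the "Initiating reconfig" line and extracts the reason code. The function name and the sample text are assumptions made for this example. Only reason 3 is named in this note, so other codes are simply referred back to Document 219361.1.

```python
import re

# Pattern for the lmon trace line quoted in this note; captures the reason number.
RECONFIG_RE = re.compile(r"kjxgrrcfgchk: Initiating reconfig, reason (\d+)")

# Only reason 3 is described in this note; other codes are left unnamed here.
REASON_TEXT = {3: "Communications Failure"}


def find_reconfig_reasons(trace_text):
    """Return (line_number, reason_code, description) for each matching line."""
    hits = []
    for lineno, line in enumerate(trace_text.splitlines(), start=1):
        match = RECONFIG_RE.search(line)
        if match:
            code = int(match.group(1))
            hits.append((lineno, code, REASON_TEXT.get(code, "see Document 219361.1")))
    return hits


# Hypothetical trace excerpt; in practice, read each instance's lmon trace file.
sample = "some other trace line\nkjxgrrcfgchk: Initiating reconfig, reason 3\n"
for lineno, code, text in find_reconfig_reasons(sample):
    print(f"line {lineno}: reason {code} ({text})")
```

Running the same function over the lmon trace of every instance makes it easier to compare which instance initiated the reconfiguration and with which reason.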

Document 219361.1 (Troubleshooting ORA-29740 in a RAC Environment) lists the following as the likely causes of the ORA-29740 error with reason 3:

a) Network Problems.
b) Resource Starvation (CPU, I/O, etc.)
c) Severe Contention in Database.
d) An Oracle bug.

2.
In RAC, processes such as lmon, lmd, and lms constantly talk to their counterparts in other instances.  The lmd0 process is responsible for managing enqueues, while the lms processes are responsible for managing data block resources and transferring data blocks to support cache fusion.  When one or more of these processes are stuck, spinning, or extremely busy with the load, they can cause the "IPC send timeout" error.

Another cause of the "IPC send timeout" error reported by the lmon, lms, and lmd processes is a network problem or a server resource (CPU and memory) issue.  Those processes may not get scheduled to run on CPU, or the network packets they send can get lost.

A communication problem involving the lmon, lmd, and lms processes causes an instance eviction.  The alert.log of the evicting instance shows messages similar to

IPC Send timeout detected.Sender: ospid 1519
Receiver: inst 8 binc 997466802 ospid 23309

If an instance is evicted, the "IPC Send timeout detected" message in the alert.log is normally followed by other messages such as ORA-29740 and "Waiting for clusterware split-brain resolution".
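
To see which processes are involved, the sender and receiver details can be pulled out of the alert.log. The sketch below assumes the two-line message layout quoted above; the function name and dictionary keys are made up for this example, and real alert logs may format the message differently.

```python
import re

# Patterns based on the two alert.log lines quoted in this note.
SENDER_RE = re.compile(r"IPC Send timeout detected\.\s*Sender: ospid (\d+)")
RECEIVER_RE = re.compile(r"Receiver: inst (\d+) binc (\d+) ospid (\d+)")


def parse_ipc_timeouts(alert_text):
    """Return one dict per 'IPC Send timeout detected' message."""
    events = []
    pending = None  # sender seen, receiver line not yet seen
    for line in alert_text.splitlines():
        sender = SENDER_RE.search(line)
        if sender:
            pending = {"sender_ospid": int(sender.group(1))}
            events.append(pending)
            continue
        receiver = RECEIVER_RE.search(line)
        if receiver and pending is not None:
            pending.update(
                receiver_inst=int(receiver.group(1)),
                binc=int(receiver.group(2)),
                receiver_ospid=int(receiver.group(3)),
            )
            pending = None
    return events


alert_sample = (
    "IPC Send timeout detected.Sender: ospid 1519\n"
    "Receiver: inst 8 binc 997466802 ospid 23309\n"
)
for event in parse_ipc_timeouts(alert_sample):
    print(event)
```

The receiver instance number and ospid indicate which remote process did not respond, which is where the investigation of stuck or starved lmon, lmd, or lms processes can start.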

3.
Processes such as lmon, lmd, and lms communicate with the corresponding processes on other instances, so when the instance or database hangs, those processes may be waiting for a resource such as a latch, an enqueue, or a data block.  Processes that are waiting cannot respond to the network ping or send any communication over the network to the remote instances.  As a result, the other instances evict the problem instance.

You may see a message similar to the following in the alert.log of the instance that is evicting another instance:
Remote instance kill is issued [112:1]: 8
or
Evicting instance 2 from cluster
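
A minimal sketch for finding these two messages in an alert.log and reading off the instance number is shown below. The function name and the labels "member kill" and "eviction" are assumptions for this example; the patterns cover only the two message forms quoted above.

```python
import re

# Patterns for the two alert.log messages quoted in this note.
KILL_RE = re.compile(r"Remote instance kill is issued \[[^\]]*\]: (\d+)")
EVICT_RE = re.compile(r"Evicting instance (\d+) from cluster")


def evicted_instances(alert_text):
    """Return (kind, instance_number) for each eviction-related message."""
    found = []
    for line in alert_text.splitlines():
        for kind, pattern in (("member kill", KILL_RE), ("eviction", EVICT_RE)):
            match = pattern.search(line)
            if match:
                found.append((kind, int(match.group(1))))
    return found


alert_sample = (
    "Remote instance kill is issued [112:1]: 8\n"
    "Evicting instance 2 from cluster\n"
)
print(evicted_instances(alert_sample))
```

The instance number returned for each message identifies the problem instance whose own alert.log and trace files should be checked next.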

4.
The lmon process sends a network ping to remote instances, and if the lmon processes on the remote instances do not respond, a split brain at the instance level has occurred.  Therefore, finding out why the lmon processes cannot communicate with each other is important in resolving this issue.

The common causes are:
1) The instance level split brain is frequently caused by a network problem, so checking the network settings and connectivity is important.  However, the clusterware (CRS) would have failed if the network were down, so as long as CRS and the database use the same network, the network is likely not down.
2) The server is very busy and/or the amount of free memory is low.  Heavy swapping and scanning of memory will prevent the lmon processes from getting scheduled.
3) The database or instance is hanging and the lmon process is stuck.
4) An Oracle bug.

The above causes are similar to the causes for issue #1 (the alert.log shows ORA-29740 as the reason for the instance crash/eviction).

5.
The alert.log of the instance that is asking CRS to kill the problem instance shows
Remote instance kill is issued [112:1]: 8

For example, the above message means that a member kill request to kill instance 8 was sent to CRS.

The problem instance is hanging for some reason and is not responsive.  This could be because the node has a CPU or memory problem and the processes of the problem instance are not getting scheduled to run on CPU.

The second common cause is severe contention in the database that prevents the problem instance from realizing that the remote instances have evicted it.

Another cause could be one or more processes surviving the "shutdown abort" when the instance tries to abort itself.  Unless all processes of the instance are killed, CRS does not consider the instance terminated and will not inform the other instances that the problem instance has aborted.  A common reason is that one or more processes become defunct and do not terminate.
This leads to a recycle of CRS, either through a node reboot or through a rebootless restart of CRS (the node does not get rebooted, but CRS gets restarted).
In this case, the alert.log of the problem instance shows
Instance termination failed to kill one or more processes
Instance terminated by LMON, pid = 23305
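
When this message appears, one way to look for leftover defunct processes is to inspect the process list on the node. The sketch below parses text in the shape produced by `ps -eo pid,stat,args` on Linux, where a process state starting with "Z" marks a defunct (zombie) process. The function name and the sample output are hypothetical, and matching the reported processes to the problem instance is left to the reader.

```python
def find_defunct(ps_output):
    """Return (pid, command) for every process whose state starts with 'Z'."""
    defunct = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # Skip the header and any line that does not start with a numeric pid.
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        pid, stat = int(parts[0]), parts[1]
        command = parts[2] if len(parts) == 3 else ""
        if stat.startswith("Z"):
            defunct.append((pid, command))
    return defunct


# Made-up sample in the layout of `ps -eo pid,stat,args`; not taken from a real system.
ps_sample = (
    "  PID STAT COMMAND\n"
    "23309 Z    [oracle] <defunct>\n"
    "23310 Ss   ora_pmon_DEMO\n"
)
print(find_defunct(ps_sample))
```

On a live node, the input text could be captured with `subprocess.run(["ps", "-eo", "pid,stat,args"], capture_output=True, text=True).stdout`.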
