2011-06-01

ASM Crashes as HAIP Does not Failover When Two or More Private Network Fails

Doc ID 1323995.1

Applies to:

Oracle Server - Enterprise Edition - Version: 11.2.0.2 to 11.2.0.2 - Release: 11.2 to 11.2
Information in this document applies to any platform.

Symptoms


When two or more cluster_interconnect networks fail, HAIP does not fail over because orarootagent.bin core dumps; as a result, the ASM/DB instance may crash. The private network failure does not have to be local. In other words, if two or more private networks fail on a remote node while the local node has no such issue, orarootagent can still core dump on the local node.


  • In this case, two private networks failed:
Apr 11 14:59:13 srac1 nxge: [ID 339653 kern.notice] NOTICE: nxge7: xcvr addr:0x0a - link is down
Apr 11 15:06:41 srac1 nxge: [ID 339653 kern.notice] NOTICE: nxge6: xcvr addr:0x0b - link is down
  • $GRID_HOME/log/<node>/ohasd/ohasd.log
..
2011-04-11 15:06:54.293: [ CRSCOMM][21][FFAIL] Ipc: Couldnt clscreceive message, no message: 11
2011-04-11 15:06:54.293: [ CRSCOMM][21] Ipc: Client disconnected.
2011-04-11 15:06:54.293: [ CRSCOMM][21][FFAIL] IpcL: Listener got clsc error 11 for memNum. 11
2011-04-11 15:06:54.293: [ CRSCOMM][21] IpcL: connection to member 11 has been removed
2011-04-11 15:06:54.293: [CLSFRAME][21] Removing IPC Member:{Relative|Node:0|Process:11|Type:3}
2011-04-11 15:06:54.293: [CLSFRAME][21] Disconnected from AGENT process: {Relative|Node:0|Process:11|Type:3}
2011-04-11 15:06:54.294: [   CRSPE][29] {0:0:1262} Disconnected from server:
2011-04-11 15:06:54.294: [    AGFW][24] {0:0:1264} Agfw Proxy Server received process disconnected notification, count=1
2011-04-11 15:06:54.294: [    AGFW][24] {0:0:1264} /ocw/grid/bin/orarootagent_root disconnected.
2011-04-11 15:06:54.294: [    AGFW][24] {0:0:1264} Agent /ocw/grid/bin/orarootagent_root[27021] stopped!
2011-04-11 15:06:54.294: [ CRSCOMM][24] {0:0:1264} IpcL: removeConnection: Member 11 does not exist.
2011-04-11 15:06:54.294: [    AGFW][24] {0:0:1264} Restarting the agent /ocw/grid/bin/orarootagent_root
2011-04-11 15:06:54.294: [    AGFW][24] {0:0:1264} Starting the agent: /ocw/grid/bin/orarootagent with user id: root and incarnation:6
2011-04-11 15:06:54.333: [    AGFW][24] {0:0:1264} Starting the HB [Interval =  30000, misscount = 6kill allowed=1] for agent: /ocw/grid/bin/orarootagent_root
  • $GRID_HOME/log/<node>/agent/ohasd/orarootagent_root/orarootagent_root.log
2011-04-11 15:06:42.047: [    AGFW][10] {0:0:902} Agent received the message: AGENT_HB[Engine] ID 12293:26724
2011-04-11 15:06:42.590: [ora.crf][43] {0:0:892} [check] clsdmc_respget return: status=0, ecode=0
2011-04-11 15:06:42.590: [ora.crf][43] {0:0:892} [check] Check return = 0, state detail = NULL

>> orarootagent terminated and restarted

2011-04-11 15:06:54.484: [    AGFW][1] Starting the agent: /ocw/grid/log/srac2/agent/ohasd/orarootagent_root/
2011-04-11 15:06:54.484: [   AGENT][1] Agent framework initialized, Process Id = 24308
2011-04-11 15:06:54.488: [ USRTHRD][1] Utils::getCrsHome crsHome /ocw/grid
  • orarootagent.bin call stack
mutex_lock_impl(0x40000000130, 0x0, 0xfffffd7fff7c0f30, 0x88, 0x0,
mutex_lock(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff2a37f8
lfiwr(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffc7df997
clsdf_nativewrite(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ff4b06647
clsdprln_native(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ff4b09b26
clsd_logThread(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ff4b0299c

  • The alert.log for ASM and DB may contain the following:
SKGXP: ospid 5553: network interface with IP address 169.254.164.135 no longer running (check cable)

IPC Send timeout detected. Receiver ospid 5551

Received an instance abort message from instance 3
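
The symptoms above can be checked for mechanically. The following is a minimal Python sketch, not an Oracle tool: the function name `summarize`, the `SAMPLE_*` constants, and the regular expressions are illustrative assumptions modelled only on the sample lines quoted in this note. Other NIC drivers, platforms, or releases may word these messages differently.

```python
import re

# Patterns modelled on the log lines quoted in this note (assumption:
# other drivers and releases may format these messages differently).
LINK_DOWN = re.compile(r"NOTICE: (\w+): .*link is down")
AGENT_RESTART = re.compile(r"Restarting the agent \S*orarootagent_root")
HAIP_DOWN = re.compile(r"SKGXP: .*IP address (169\.254\.\d+\.\d+) no longer running")

# Sample lines copied from the excerpts in this note.
SAMPLE_SYSLOG = [
    "Apr 11 14:59:13 srac1 nxge: [ID 339653 kern.notice] NOTICE: nxge7: xcvr addr:0x0a - link is down",
    "Apr 11 15:06:41 srac1 nxge: [ID 339653 kern.notice] NOTICE: nxge6: xcvr addr:0x0b - link is down",
]
SAMPLE_OHASD = [
    "2011-04-11 15:06:54.294: [    AGFW][24] {0:0:1264} Restarting the agent /ocw/grid/bin/orarootagent_root",
]
SAMPLE_ALERT = [
    "SKGXP: ospid 5553: network interface with IP address 169.254.164.135 no longer running (check cable)",
]


def summarize(syslog_lines, ohasd_lines, alert_lines):
    """Collect the symptoms described in this note from three log excerpts."""
    interfaces_down = []
    for line in syslog_lines:
        match = LINK_DOWN.search(line)
        if match and match.group(1) not in interfaces_down:
            interfaces_down.append(match.group(1))

    agent_restarted = any(AGENT_RESTART.search(line) for line in ohasd_lines)

    haip_addresses_down = []
    for line in alert_lines:
        match = HAIP_DOWN.search(line)
        if match:
            haip_addresses_down.append(match.group(1))

    return {
        "interfaces_down": interfaces_down,
        "agent_restarted": agent_restarted,
        "haip_addresses_down": haip_addresses_down,
        # The note's pattern: two or more links down plus an agent restart.
        "matches_note": len(interfaces_down) >= 2 and agent_restarted,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_SYSLOG, SAMPLE_OHASD, SAMPLE_ALERT))
```

A positive `matches_note` only indicates that the logs resemble the excerpts here. Because the failing networks may be on a remote node, the syslog from every cluster node would need to be checked, and confirming the bug itself still requires the orarootagent.bin core dump and call stack.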


Cause

This is caused by bug 12325672, which was closed as a duplicate of bug 12310608.

Solution

Bug 12310608 is fixed in 11.2.0.3. An interim patch may exist under patch 12310608 or patch 12546712.

If a one-off patch is not available for your platform/version, please engage Oracle Support to request one. Please note that the issue happens only when two or more private networks fail.
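
As a rough illustration of the version scope stated in this note (applies to 11.2.0.2, fixed in 11.2.0.3), the Python sketch below checks whether a release string falls in the affected release. The function names are hypothetical, and the check looks only at the version number: it cannot tell whether an interim patch has already been applied to an 11.2.0.2 home.

```python
def parse_version(text):
    """Turn a dotted release string such as '11.2.0.2.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_in_affected_release(version):
    """True when the first four components are 11.2.0.2, the only release this note lists."""
    return parse_version(version)[:4] == (11, 2, 0, 2)


if __name__ == "__main__":
    for release in ("11.2.0.2.0", "11.2.0.3.0"):
        print(release, is_in_affected_release(release))
```

How the release string is obtained is left out on purpose; any string in dotted numeric form will work with this sketch.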

References

BUG:12325672 - DISCONNECTING INTERCONNECT CAUSE INSTANCE EVICTIONS
