Sunday, October 15, 2006

Case Study: How to Diagnose a Node Reboot

The following article looks at how to diagnose a node reboot in a 10g RAC cluster.

Configuration:
2 node RAC on 10.2.0.1 on RH ES 4.0 (kernel 2.6.9-34.0.2.ELsmp)

To diagnose such issues it is very important to establish an accurate timeline of the reboot across all nodes. It is recommended practice to run the NTP daemon on every node in the cluster so that the wall clocks of all nodes stay in sync.

hippo >ps -ef | grep ntp | grep -v 'grep'
ntp 3349 1 0 Oct15 ? 00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
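Beyond checking that ntpd is running, it is worth verifying that the clocks are actually in sync by looking at each node's NTP peer offsets. A minimal sketch (the ntpq sample output and server name below are illustrative, since live output varies per site):

```shell
# On a live node run: ntpq -p
# The "offset" column is in milliseconds; it should be small and similar on
# every node for the log timestamps to line up across the cluster.
ntpq_output='     remote           refid      st t when poll reach   delay   offset  jitter
*timesrv1.corp   10.0.0.1         2 u   45   64  377    0.420    1.253   0.110'
echo "$ntpq_output" | awk 'NR>1 {print $1, "offset(ms)=" $9}'
# -> *timesrv1.corp offset(ms)=1.253
```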

1. Check the Linux log file (/var/log/messages) to determine the time of the reboot

In this case the reboot occurred at 02:09 hours on the node hippo

Oct 12 09:24:34 hippo sshd(pam_unix)[5885]: session opened for user frego by frego (uid=0)
Oct 15 02:09:17 hippo syslogd 1.4.1: restart.
Oct 15 02:09:17 hippo syslog: syslogd startup succeeded
Oct 15 02:09:17 hippo kernel: klogd 1.4.1, log source = /proc/kmsg started.

Check the Linux log file on the other node (elephant) at the same point in time.

Oct 14 17:42:01 elephant su(pam_unix)[18041]: session closed for user root
Oct 15 02:09:12 elephant syslogd 1.4.1: restart.
Oct 15 02:09:12 elephant syslog: syslogd startup succeeded
Oct 15 02:09:12 elephant kernel: klogd 1.4.1, log source = /proc/kmsg started.

From the above output it is clear that both the nodes rebooted at approximately the same time.
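The comparison above can be done with a simple grep on each node: the syslogd restart line is the first entry written after a boot, so its timestamp marks the reboot. A sketch using sample log lines (on a live node, run the grep against /var/log/messages directly):

```shell
# Extract the "syslogd ... restart" lines that mark each boot.
# On a live node: grep 'syslogd.*restart' /var/log/messages
grep 'syslogd.*restart' <<'EOF'
Oct 12 09:24:34 hippo sshd(pam_unix)[5885]: session opened for user frego
Oct 15 02:09:17 hippo syslogd 1.4.1: restart.
EOF
# -> Oct 15 02:09:17 hippo syslogd 1.4.1: restart.
```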

2. Check which process is responsible for the reboot.

Review the log file of the CSS daemon. The log file resides in $ORA_CRS_HOME/log/&lt;hostname&gt;/cssd

hippo >pwd
/u00/app/oracle/product/10.2.0/crs/log/hippo

hippo >cd cssd

hippo >pwd
/u00/app/oracle/product/10.2.0/crs/log/hippo/cssd

hippo >ls -ltr
total 9376
drwxr-xr-x 2 oracle dba 4096 Sep 8 18:08 oclsmon
-rw-r----- 1 oracle dba 105 Oct 15 02:02 ocssd.trc
-rw-r--r-- 1 oracle dba 5 Oct 15 02:11 hippo.pid
-rw-r--r-- 1 oracle dba 17 Oct 15 02:11 cssdOUT.log
-rw-r--r-- 1 oracle dba 9567227 Oct 16 15:13 ocssd.log

From the ocssd.log file

[ CSSD]2006-10-15 02:02:48.697 [50609072] >WARNING: clssnmDiskPMT: long disk latency (54990 ms) to voting disk (0//crs/voting.dsk)
[ CSSD]2006-10-15 02:02:50.719 [50609072] >ERROR: clssnmDiskPMT: 1 of 1 voting disks unavailable (0/0/1)
[ CSSD]2006-10-15 02:02:50.734 [50609072] >TRACE: clssscctx: dump of 0x0x810d500, len 3528


Hence the reboot was caused by latency in accessing the voting disk. The specific function involved is the disk ping monitor thread (clssnmDiskPMT), which monitors the duration of a single I/O to the voting disk; when that I/O takes too long it marks the voting disk as unavailable. With no voting disks available, the CSS daemon failed.
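When CSS reports long disk latency like this, a quick OS-level sanity check is to time a single small read against the voting disk. A rough sketch, using the path reported in the ocssd.log message above (note that a read served from the page cache can look artificially fast, so repeat it a few times on a live system):

```shell
# Time one 512-byte read from the voting disk; the elapsed time should be
# milliseconds, far below the CSS disk timeout. Path as reported in ocssd.log.
VOTING_DISK=/crs/voting.dsk
time dd if="$VOTING_DISK" of=/dev/null bs=512 count=1
```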

As seen from /etc/inittab,

h2:35:respawn:/etc/init.d/init.cssd fatal > /dev/null 2>&1 < /dev/null

the CSS daemon is run in 'fatal' mode, so a failure exit of the CSS daemon causes the node to reboot.

This is also logged in alertelephant.log under $ORA_CRS_HOME/log/elephant/, which is the alert log of the CRS stack.

2006-10-15 02:02:50.740
[cssd(8533)]CRS-1606:CSSD Insufficient voting files available [0 of 1]. Details in /u00/app/oracle/product/10.2.0/crs/log/elephant/cssd/ocssd.log.


Check the ocssd.log file on the other node (elephant) too

[ CSSD]2006-10-15 02:01:40.255 [50617264] >TRACE: clssgmClientConnectMsg: Connect from con(0xb7c3f580) proc(0xb7c67918) pid() proto(10:2:1:1)
[ CSSD]2006-10-15 02:02:50.724 [112925616] >ERROR: clssnmDiskPMT: 1 of 1 voting disks unavailable (0/0/1)
[ CSSD]2006-10-15 02:02:50.777 [112925616] >TRACE: clssscctx: dump of 0x0x810d500, len 3528


A couple of additional CSS-related parameters were introduced in the 10.2.0.2 patchset to accommodate the long I/O requirements of storage vendors such as EMC and NetApp.

• reboottime: (default 3 seconds)
The amount of time allowed for a node to complete a reboot after the CSS daemon has been evicted.
This parameter can be set via the command
crsctl set css reboottime R [-force] (R is seconds)

• disktimeout (default 200 seconds)
The maximum amount of time allowed for a voting file I/O to complete; if this time is exceeded the voting disk will be marked as unavailable
This parameter can be set via the command
crsctl set css disktimeout D [-force] (D is seconds)

These commands must be run as root on a node where the CRSD is up, unless
'-force' is specified, in which case the CRS stack must not be up on any other node.
After the commands are run, all CSS daemons must be restarted for the settings to take effect on all nodes.
You can verify the settings by doing an ocrdump and checking the values in the OCRDUMPFILE.
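The verification step can be sketched as follows. The OCR key names and value lines below are illustrative samples, not guaranteed output; check the actual dump produced on your cluster:

```shell
# As root, ocrdump writes OCRDUMPFILE in the current directory; then search it
# for the CSS timeout keys, e.g.:
#   ocrdump && grep -A1 'disktimeout\|reboottime' OCRDUMPFILE
# Sample dump lines used here for illustration:
grep -A1 'disktimeout\|reboottime' <<'EOF'
[SYSTEM.css.disktimeout]
UB4 (10) : 200
[SYSTEM.css.reboottime]
UB4 (10) : 3
EOF
```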
