Posted By Ritesh Chhajer on Saturday, August 21, 2010


Whenever a node has trouble rejoining the cluster after a reboot, here is a quick checklist I would suggest:
  • /var/log/messages
  • ifconfig
  • ip route
  • /etc/hosts
  • /etc/sysconfig/network-scripts/ifcfg-eth*
  • ethtool
  • mii-tool
  • cluvfy
  • $ORA_CRS_HOME/log
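
The OS-level items on this checklist can be gathered in one pass. Here is a minimal sketch, not part of the procedure itself: run_check and collect_node_checks are hypothetical helper names, and eth0 is an assumed interface name that you should replace with your own public or interconnect interface.

```shell
#!/bin/sh
# run_check: print a header, then run the command if it exists on PATH.
# Hypothetical helper for illustration; it only reads state.
run_check() {
    echo "=== $* ==="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "exit status: $?"
    else
        echo "skipped: $1 not found"
    fi
}

# collect_node_checks [IFACE]: the OS-level items from the checklist.
# IFACE defaults to eth0, which is an assumption.
collect_node_checks() {
    iface=${1:-eth0}
    run_check tail -n 50 /var/log/messages
    run_check ifconfig -a
    run_check ip route
    run_check cat /etc/hosts
    run_check cat "/etc/sysconfig/network-scripts/ifcfg-$iface"
    run_check ethtool "$iface"
    run_check mii-tool "$iface"
}
```

Running "collect_node_checks eth1 > /tmp/node_checks.txt" as root on both a healthy node and the problem node gives two files that can be compared side by side.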

Let us now take a closer look at specific issues, with examples and the steps taken to resolve them.
These were all tested on an Oracle 10.2.0.4 database on RHEL4 U8 x86-64.

1. srvctl is not able to start the Oracle instance, but sqlplus is
a. Check racg log for actual error message.
% more $ORACLE_HOME/log/`hostname -s`/racg/ora.{DBNAME}.{INSTANCENAME}.inst.log

b. Check if srvctl is configured to use the correct parameter file (pfile/spfile)
% srvctl config database -d {DBNAME} -a
You can also validate the parameter file by starting the instance with sqlplus to see the exact error message.

c. Check the ownership of $ORACLE_HOME/log
If this is owned by root, srvctl won't be able to start the instance as the oracle user.
# chown -R oracle:dba $ORACLE_HOME/log
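
Before running a recursive chown, it can help to see exactly which files have the wrong owner. A small sketch follows; list_not_owned_by is a hypothetical helper name, and oracle as the expected owner is taken from the chown command above.

```shell
#!/bin/sh
# list_not_owned_by DIR USER: print every path under DIR whose owner
# is not USER. USER may be a login name or a numeric UID.
list_not_owned_by() {
    find "$1" ! -user "$2" -print
}

# Example:
#   list_not_owned_by "$ORACLE_HOME/log" oracle
```

If the example prints nothing, ownership is not the reason srvctl is failing, and the racg log is the next place to look.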

2. VIP has failed over to another node but is not coming back to the original node
Fix: On the node that the VIP has failed over to, bring the VIP interface down manually as root.
Example: ifconfig eth0:2 down
PS: Be careful to bring down only the VIP. A small typo may bring down your public interface:)
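
One way to guard against that typo is to refuse anything that is not an alias interface. This is only a sketch under the assumption that the VIP is plumbed as an alias (base:label) such as eth0:2, as in the example; is_alias_interface and vip_down_command are hypothetical names.

```shell
#!/bin/sh
# is_alias_interface NAME: succeed only for alias names such as eth0:2.
is_alias_interface() {
    case "$1" in
        *:*:*) return 1 ;;   # more than one colon: not a valid alias
        ?*:?*) return 0 ;;   # base:label
        *)     return 1 ;;   # plain interface such as eth0
    esac
}

# vip_down_command NAME: print the command for an alias, refuse otherwise.
# It prints rather than executes, so the operator can review it first.
vip_down_command() {
    if is_alias_interface "$1"; then
        echo "ifconfig $1 down"
    else
        echo "refusing: $1 is not an alias interface" >&2
        return 1
    fi
}
```

The guard only checks the shape of the name. You still need to confirm with ifconfig that the alias really carries the VIP address before running the printed command.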

3. Moving OCR to a different location
PS: This can be done as root while CRS is up.
While trying to change the OCR or the OCR mirror to a new location, ocrconfig complains.
The fix is to touch the new file first.
Example:
# ocrconfig -replace ocrmirror /crs_new/cludata/ocrfile
PROT-21: Invalid parameter

# touch /crs_new/cludata/ocrfile
# chown root:dba /crs_new/cludata/ocrfile
# ocrconfig -replace ocrmirror /crs_new/cludata/ocrfile

Verify:
a. Validate using "ocrcheck". Device/File Name should point to the new location, and the integrity check should succeed.
b. Ensure the OCR inventory is updated correctly
# cat /etc/oracle/ocr.loc
ocrconfig_loc and ocrmirrorconfig_loc should point to correct locations.
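
To check the two entries without reading the file by eye, something like the following can be used. It is a sketch that assumes ocr.loc holds simple key=value lines; ocr_loc_value is a hypothetical helper name.

```shell
#!/bin/sh
# ocr_loc_value FILE KEY: print the value of KEY from an ocr.loc-style file.
# Assumes key=value lines, e.g. ocrmirrorconfig_loc=/crs_new/cludata/ocrfile
ocr_loc_value() {
    awk -F= -v key="$2" '$1 == key { print substr($0, length(key) + 2) }' "$1"
}

# Example:
#   ocr_loc_value /etc/oracle/ocr.loc ocrmirrorconfig_loc
```

Run it on every node; each one should report the same path for the same key.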

4. Moving Voting Disk to a different location
PS: CRS must be down while moving the voting disk.

The idea is to add new voting disks and delete the older ones.
Below is a sample error and its fix.
# crsctl add css votedisk /crs_new/cludata/cssfile_new
Cluster is not in a ready state for online disk addition

We need to use the force option. However, before using it, ensure CRS is down.
If CRS is up, DO NOT use the force option, or it may corrupt your OCR.

# crsctl add css votedisk /crs_new/cludata/cssfile_new -force
Now formatting voting disk: /crs_new/cludata/cssfile_new
successful addition of votedisk /crs_new/cludata/cssfile_new.

Verify using "crsctl query css votedisk" and then delete the old voting disks.
You'll need the force option for the deletion too.

Also verify the ownership of the voting disk files; it should be oracle:dba.
If the voting disks were added as root, change the ownership to oracle:dba.
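
Since the force option is only safe with CRS down, a quick process check before each forced add or delete is cheap insurance. The sketch below assumes the clusterware daemons show up in ps as crsd.bin, ocssd.bin and evmd.bin; count_crs_daemons and safe_to_force are hypothetical names.

```shell
#!/bin/sh
# count_crs_daemons: read `ps -ef` output on stdin and print how many
# clusterware daemon processes (crsd.bin, ocssd.bin, evmd.bin) it contains.
count_crs_daemons() {
    grep -E -c '/(crsd|ocssd|evmd)\.bin( |$)' || true
}

# safe_to_force: succeed only when no clusterware daemon is running
# on the local node.
safe_to_force() {
    [ "$(ps -ef | count_crs_daemons)" -eq 0 ]
}
```

For example: "safe_to_force && crsctl add css votedisk /crs_new/cludata/cssfile_new -force". The check covers only the local node, so it has to pass on every node of the cluster before you proceed.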

5. Manually registering listener resource to OCR
The listener was registered manually with the OCR, but srvctl was unable to bring it up.
Let us first see an example of how to register it manually.
From an existing available node, print the listener resource:
% crs_stat -p ora.test-server2.LISTENER_TEST-SERVER2.lsnr > /tmp/res
% cat /tmp/res
NAME=ora.test-server2.LISTENER_TEST-SERVER2.lsnr
TYPE=application
ACTION_SCRIPT=/orahome/ora10g/product/10.2.0/db_1/bin/racgwrap
ACTIVE_PLACEMENT=0
AUTO_START=1
CHECK_INTERVAL=600
DESCRIPTION=CRS application for listener on node
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=test-server2
OPTIONAL_RESOURCES=
PLACEMENT=restricted
REQUIRED_RESOURCES=ora.test-server2.vip
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=600
START_TIMEOUT=0
STOP_TIMEOUT=0
UPTIME_THRESHOLD=7d
USR_ORA_ALERT_NAME=
USR_ORA_CHECK_TIMEOUT=0
USR_ORA_CONNECT_STR=/ as sysdba
USR_ORA_DEBUG=0
USR_ORA_DISCONNECT=false
USR_ORA_FLAGS=
USR_ORA_IF=
USR_ORA_INST_NOT_SHUTDOWN=
USR_ORA_LANG=
USR_ORA_NETMASK=
USR_ORA_OPEN_MODE=
USR_ORA_OPI=false
USR_ORA_PFILE=
USR_ORA_PRECONNECT=none
USR_ORA_SRV=
USR_ORA_START_TIMEOUT=0
USR_ORA_STOP_MODE=immediate
USR_ORA_STOP_TIMEOUT=0
USR_ORA_VIP=

Modify the relevant parameters in the resource file to point to the target node (test-server1 in this example).
Rename it as resourcename.cap:
% mv /tmp/res /tmp/ora.test-server1.LISTENER_TEST-SERVER1.lsnr.cap
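
The edit itself can be scripted instead of done by hand. This is a sketch only: retarget_profile is a hypothetical helper, and it assumes the node name appears in lower case in resource and host names and in upper case in the listener name, as in the profile printed above.

```shell
#!/bin/sh
# retarget_profile OLD NEW: read a resource profile on stdin and write it
# to stdout with the node name OLD replaced by NEW, in both lower case
# (resource and host names) and upper case (listener name).
# Assumes the node names contain no regex metacharacters other than "-".
retarget_profile() {
    old_uc=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    new_uc=$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')
    sed -e "s/$1/$2/g" -e "s/$old_uc/$new_uc/g"
}
```

For example, "retarget_profile test-server2 test-server1 < /tmp/res > /tmp/ora.test-server1.LISTENER_TEST-SERVER1.lsnr.cap" writes the renamed profile directly. Diff the result against /tmp/res before registering it; paths such as ACTION_SCRIPT are left untouched and may still need a manual check.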

Register with OCR
% crs_register ora.test-server1.LISTENER_TEST-SERVER1.lsnr -dir /tmp/

Start listener
% srvctl start listener -d testdb -n test-server1

While trying to start the listener, srvctl throws errors like "Unable to read from listener log file", even though the listener log file exists.
If the resource was registered as root, srvctl won't be able to start it as the oracle user.
So all of the above operations for registering the listener manually should be performed as the oracle user.

6. Services
While checking the status of a service, srvctl says "not running".
If we try to start it using srvctl, the error message is "No such service exists" or "already running".
If we try to add a service with the same name, it says "already exists".
This happens because the service is in an "Unknown" state in the OCR.
Using crs_stat, check whether any related resource for the service (resource names ending with .srv and .cs) is still lying around.
In this case, srvctl remove service -f had already been tried and the issue persisted.
Here is the fix:
# crs_stop -f {resourcename}
# crs_unregister {resourcename}
Now the service can be added and started correctly.
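
To find the leftover resource names to feed into crs_stop and crs_unregister, the crs_stat output can be filtered. A minimal sketch, assuming the NAME=... lines that crs_stat prints; leftover_service_resources is a hypothetical helper name.

```shell
#!/bin/sh
# leftover_service_resources: read `crs_stat` output on stdin and print
# the names of resources ending in .srv or .cs.
leftover_service_resources() {
    sed -n -e 's/^NAME=\(.*\.srv\)$/\1/p' -e 's/^NAME=\(.*\.cs\)$/\1/p'
}

# Example:
#   crs_stat | leftover_service_resources
```

The list will also include resources of healthy services, so pick out only the ones that belong to the broken service before stopping or unregistering anything.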

7. Post host reboot, CRS is not starting
After a host reboot, CRS did not come up, and there were no CRS logs under $ORA_CRS_HOME.
Check /var/log/messages:
"Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.9559"
No logs were found in /tmp/crsctl.*

Run cluvfy to identify the issue
$ORA_CRS_HOME/bin/cluvfy stage -post crsinst -n {nodename}

Cause: /tmp was not writable.

Fix: /etc/fstab was incorrect; correcting it made /tmp available again.
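
A direct way to confirm this kind of problem is to try creating a file in the directory. The following is a small sketch; dir_writable is a hypothetical helper name.

```shell
#!/bin/sh
# dir_writable DIR: succeed if a file can actually be created in DIR.
# The probe runs in a subshell so a failed redirection cannot abort
# the calling script.
dir_writable() {
    probe="$1/.write_probe.$$"
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    return 1
}

# Example (run as the user that needs the directory):
#   dir_writable /tmp || echo "/tmp is not writable"
```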

If you see messages like "Shutdown CacheLocal. my hash ids don't match" in the CRS log, then
check whether /etc/oracle/ocr.loc is the same across all nodes of the cluster.
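
If you copy ocr.loc from each node into one directory (for example with scp), comparing checksums is quicker than reading the files. This is a sketch; files_identical is a hypothetical helper and the file names in the example are made up.

```shell
#!/bin/sh
# files_identical FILE...: succeed if every file has the same
# checksum and size as reported by cksum.
files_identical() {
    [ $(cksum "$@" | awk '{ print $1, $2 }' | sort -u | wc -l) -eq 1 ]
}

# Example, with copies fetched from three nodes:
#   files_identical ocr.loc.node1 ocr.loc.node2 ocr.loc.node3 || echo "ocr.loc differs"
```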

8. CRS binary restored by copying from existing node in the cluster
CRS does not start, with the following message in /var/log/messages:
"Id "h1" respawning too fast: disabled for 5 minutes"

The CRSD log shows "no listener".

If the CRS binaries were restored by copying them from an existing node in the cluster, then you need to ensure:
a. Hostnames are modified correctly in $ORA_CRS_HOME/log
b. You may need to clean up socket files from /var/tmp/.oracle

PS: Exercise caution while working with the socket files. If CRS is up, you should never touch those files; otherwise a reboot may be inevitable.

9. CRS rebooting frequently by oprocd
Check /etc/oracle/oprocd/ and grep for "Rebooting".
Check /var/log/messages and grep for "restart"
If the timestamps match, this confirms that the reboots are being initiated by the oprocd process.

% ps -ef | grep oprocd
root 10409 9937 0 Feb27 ? 00:00:00 /oracle/product/crs/bin/oprocd.bin run -t 1000 -m 500 -f

-t 1000 means oprocd wakes up every 1000 ms.
-m 500 means it allows up to 500 ms margin of error.
Basically, with these options, if oprocd wakes up after more than 1.5 seconds (1000 ms + 500 ms), it forces a reboot.
This is conceptually analogous to what the hangcheck timer did in Oracle releases before 10.2.0.4 on Linux.
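
The arithmetic can be read straight off the running process. Here is a small sketch that adds the -t and -m values from an oprocd command line; oprocd_threshold_ms is a hypothetical helper name.

```shell
#!/bin/sh
# oprocd_threshold_ms: read oprocd command lines on stdin and print,
# for each one, the -t interval plus the -m margin in milliseconds.
oprocd_threshold_ms() {
    awk '{
        t = ""; m = ""
        for (i = 1; i < NF; i++) {
            if ($i == "-t") t = $(i + 1)
            if ($i == "-m") m = $(i + 1)
        }
        if (t != "" && m != "") print t + m
    }'
}

# Example:
#   ps -ef | grep '[o]procd.bin' | oprocd_threshold_ms
```

With "-t 1000 -m 500" it prints 1500, which is the 1.5 seconds mentioned above.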

The fix is to set CSS diagwait to 13:
# crsctl set css diagwait 13 -force

# /oracle/product/crs/bin/crsctl get css diagwait
13

This changes the parameters that oprocd runs with:
% ps -ef | grep oprocd
root 10409 9937 0 Feb27 ? 00:00:00 /oracle/product/crs/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f

Note that the margin has now changed to 10000 ms, i.e. 10 seconds instead of the default 0.5 seconds.

PS: Setting diagwait requires a full shutdown of Oracle Clusterware on ALL nodes.

10. Cluster hung. All SQL queries on GV$ views are hanging.
The alert logs from all instances have messages like the ones below:
INST1: IPC Send timeout detected. Receiver ospid 1650

INST2: IPC Send timeout detected.Sender: ospid 24692
Receiver: inst 1 binc 150 ospid 1650

INST3: IPC Send timeout detected.Sender: ospid 12955
Receiver: inst 1 binc 150 ospid 1650

The receiver ospid (1650) reported by all instances belongs to LCK0, the lock process.
In case of inter-instance lock issues, it's important to identify the instance where the problem originates.
As seen from above, INST1 is the one that needs to be fixed.
Identify the process that is causing the row cache lock and kill it; otherwise, reboot node 1.
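
When there are many instances, tallying the receivers makes the culprit stand out. This is a sketch that assumes the "Receiver: inst N binc B ospid P" layout shown in the messages above; receiver_tally is a hypothetical helper name.

```shell
#!/bin/sh
# receiver_tally: read alert-log lines on stdin and count how often each
# (instance, ospid) pair appears as the receiver of an IPC send timeout.
# Output columns: count, instance number, ospid.
receiver_tally() {
    awk '/^Receiver: inst / {
        for (i = 1; i < NF; i++) {
            if ($i == "inst")  inst = $(i + 1)
            if ($i == "ospid") pid  = $(i + 1)
        }
        count[inst " " pid]++
    }
    END { for (k in count) print count[k], k }' | sort -rn
}
```

Feed it the alert logs from the sender instances, for example "cat alert_inst2.log alert_inst3.log | receiver_tally" (file names are made up). For the messages above it would report instance 1, ospid 1650 twice.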

11. Inconsistent OCR with invalid permissions
% srvctl add db -d testdb -o /oracle/product/10.2
PRKR-1005 : adding of cluster database testdb configuration failed, PROC-5: User does not have permission to perform a cluster registry operation on this key. Authentication error [User does not have permission to perform this operation] [0]

crs_stat doesn't show any trace of it, so utilities like crs_setperm/crs_unregister/crs_stop won't work in this case.

ocrdump shows:
[DATABASE.LOG.testdb]
UNDEF :
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_ALL_ACCESS, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}

[DATABASE.LOG.testdb.INSTANCE]
UNDEF :
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_ALL_ACCESS, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}

These log entries are owned by root, and that's the problem.
This means that the resource was probably added to the OCR as root.
Although root has since removed it, the oracle user cannot add it again until we get rid of these leftover entries.

Shut down the entire cluster and restore from a previous good backup of the OCR using:
ocrconfig -restore backupfilename

You can get list of backups using:
ocrconfig -showbackup

If you are not sure which backup is the last good one, you can do the following instead:
Take an export backup of the OCR using:
ocrconfig -export /tmp/export -s online

Edit /tmp/export and remove those 2 lines pointing to DATABASE.LOG.testdb and DATABASE.LOG.testdb.INSTANCE owned by root

Import it back now
ocrconfig -import /tmp/export

After starting the cluster, verify using ocrdump.
The OCRDUMPFILE should not have any trace of those leftover log entries owned by root.
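
Checking the dump can be scripted as well. The sketch below assumes the "[KEY]" and "SECURITY : {...}" layout shown in the ocrdump excerpt above; root_owned_keys is a hypothetical helper name.

```shell
#!/bin/sh
# root_owned_keys PREFIX: read an ocrdump text file on stdin and print each
# key starting with PREFIX whose SECURITY line names root as the owner.
root_owned_keys() {
    awk -v prefix="$1" '
        /^\[.*\]$/ { key = substr($0, 2, length($0) - 2); next }
        /^SECURITY/ && /USER_NAME : root[,}]/ {
            if (index(key, prefix) == 1) print key
        }'
}

# Example:
#   root_owned_keys DATABASE.LOG.testdb < OCRDUMPFILE
```

After a successful fix, the example should print nothing. The prefix matters because many other keys in the OCR are legitimately owned by root.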



This is it my friends. Hope this helps.
More issues...more troubleshooting...more fun...Bring them on...Cheers:)

Rants & Raves

  • Saurabh Sood Saturday, September 04, 2010 at 2:43 AM

    Hi,
    Good description of RAC troubleshooting.
    Just wanted to add to your first point, "srvctl not able to start instance but sqlplus is".
    I faced the same issue on one of my 10204 RAC databases, and the cause turned out to be that $ORACLE_HOME/bin/racgwrap had not been modified with the ORACLE_HOME value.
    Also, CRS_HOME was missing from $CRS_HOME/racg/admin/racgwrap and was set to %ORACLE_HOME%.

    After changing these two values, srvctl worked fine.

    This was concluded by setting SRVM_TRACE=true and running srvctl again; a log file gets generated at CRS_HOME/log/`hostname -s`/client.

    -Saurabh Sood

    • Ritzy Tuesday, October 05, 2010 at 4:04 PM
      racgwrap
      Thanks Saurabh for sharing the issue you faced along with the resolution. Looking forward to hearing more from you on askdbahappy.

  • Eric Thursday, October 21, 2010 at 1:00 PM
    Concepts
    A very comprehensive repository of troubleshooting cases with examples indeed. For understanding RAC concepts, what's a good source you would recommend? There are too many books and too much documentation, with lots of conflicting pieces.

    • Ritzy Thursday, October 21, 2010 at 1:05 PM

      For a basic understanding, just issue the "clscfg -concepts" command. Otherwise, the Oracle documentation is the best resource. You'll obviously learn more from experience, as no amount of reading is enough unless it is put into practice. Cache fusion is best explained in "Real Application Clusters Handbook" by K. Gopalakrishnan.

      • Shah Monday, October 25, 2010 at 6:15 AM

        hi Ritzy,
        nice to read your blogs.
        shall i have your personal id, I would like to discuss some dba related questions.
        Kindly send me test mail at shahhasan86@gmail.com
        awaiting for your reply..
        thanks
        shah

        • Ritzy Wednesday, November 10, 2010 at 10:49 PM

          Glad to know that you like my blogs.
          You can always post your question on my blogs and I'll be more than happy to respond to them.

          • DNash Thursday, October 13, 2011 at 6:28 AM

            Good work Ritzy. Thanks for sharing.

    Disclaimer:
    This posting is provided "AS IS" with no warranties, and confers no rights. You assume all risk for your use.

    This posting has nothing to do with my present or past employer.