Sunday, June 28, 2009

 

Are you sure you will be able to activate your standby??

A couple of weeks I faced a scenario where the standby database crashed

On looking at the alert.log I see the following message in the alert.log of the standby

***********************************************************
Sat Jun 6 06:48:52 2009
Recovery interrupted!
cannot find needed online log for redo thread 1
Some recovered datafiles maybe left media fuzzy
Media recovery may continue but open resetlogs may fail
Sat Jun 6 06:48:53 2009
Errors in file /u01/app/oracle/admin/TEST/bdump/test1_mrp0_24533.trc:
ORA-10576: Give up restoring recovered datafiles to consistent state: some error occurred
ORA-16037: user requested cancel of managed recovery operation
Some recovered datafiles maybe left media fuzzy
Media recovery may continue but open resetlogs may fail
Sat Jun 6 06:48:54 2009
Waiting for MRP0 pid 24533 to terminate
************************************************************

Hmmm... this means that if the standby does not have the redo and cannot get it from the primary you will not be able to online media fuzzy files using supported methods

The same issue is explained in Bug 5956646
as an architectural limitation.

This is a very unlikely scenario but a possibility none the less

Tuesday, January 27, 2009

 

Outbound_connect_timeout

Outbound_connect_timeout comes into play when nodes in a cluster are down and we cannot wait for the OS timeout as this causes long delays in connect time. For example on Solaris the value of tcp_ip_abort_interval = 180000 ==> which is 180 seconds ==> 3 mins

In this post I will demonstrate how outbound_connect_timeout (OCT) can effectively avoid timeouts experienced by clients connecting to RAC nodes

If we take an example of the following connect string

TEST =
(DESCRIPTION =
(ADDRESS_LIST =
(LOAD_BALANCE = OFF)
(ADDRESS = (PROTOCOL = TCP)(HOST = sdby1-vip)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = sdby2-vip)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = sdby3-vip)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = sdby4-vip)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = prim1-vip)(PORT = 1521))
)
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = TEST)
)
)


In the above alias the first 4 nodes are non existent nodes and the service TEST
runs on the node prim1 which is listed last in the above alias.

I have deliberately set load_balance to OFF so that the client has to traverse through all nodes serially.

If we set an OUTBOUND_CONNECT_TIMEOUT of 3 seconds in the client's sqlnet.ora
(client is 10.2.0.3 Solaris) then the time to establish the connection is around 12 seconds.

If we were to run a sqlnet trace on the connection we see that the connection starts at

Connection started at
[27-JAN-2009 22:52:33:741] --- TRACE CONFIGURATION INFORMATION FOLLOWS ---
[27-JAN-2009 22:52:33:741] New trace stream is /tmp/cli_262.trc

and the first address which is tried is

[27-JAN-2009 22:52:33:757] nsc2addr: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=sdb1-vip)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=TEST)(CID=(PROGRAM=sqlplus@bart)(HOST=bart)(USER=oracle))))

Moving on we can see that OCT is enabled due to the line

[27-JAN-2009 22:52:33:759] nstoSetupTimeout: entry
[27-JAN-2009 22:52:33:759] nstoSetupTimeout: ATO enabled for ctx=0x1001c9280, val=3000(millisecs)


After this we the following pattern is seen continuously

[27-JAN-2009 22:52:33:776] nsevwtsg: entry
[27-JAN-2009 22:52:33:776] nsevwtsg: cid=0
[27-JAN-2009 22:52:33:776] nsevwait: entry
[27-JAN-2009 22:52:33:776] nsevwait: 1 registered connection(s)
[27-JAN-2009 22:52:33:776] nsevwait: 0 pre-posted event(s)
[27-JAN-2009 22:52:33:776] nsevwait: waiting for transport event (0 thru 0)...
[27-JAN-2009 22:52:33:776] nsevwait: 0 newly-posted event(s)
[27-JAN-2009 22:52:33:776] nsevwait: 0 posted event(s)
[27-JAN-2009 22:52:33:776] nsevwait: exit (0)
[27-JAN-2009 22:52:33:776] nstoToqCheckSingle: entry
[27-JAN-2009 22:52:33:776] nstoToqCheckSingle: normal exit


and the timeout occurs in 3 seconds

[27-JAN-2009 22:52:36:771] nstoHandleEventTO: ATO occurred for ctx=0x1001c9280

Ergo the OCT seems to be working perfectly

The client then tries to establish a connection with sdby2-vip and experiences
the same timeout

The connection is finally established at

[27-JAN-2009 22:52:45:915] nscon: no connect data
[27-JAN-2009 22:52:45:915] nscon: connect handshake is complete
[27-JAN-2009 22:52:45:915] nscon: nsctxinf[0]=0x41, [1]=0x41
[27-JAN-2009 22:52:45:915] nscon: normal exit


which translates to around 12 seconds.

From my testing for clients on Windows

OUTBOUND_CONNECT_TIMEOUT does not work with 10.2.0.3 base windows client. (Clients may get errors)
It works with patch 21 on top of 10.2.0.3
OUTBOUND_CONNECT_TIMEOUT does work with 11g client

Without OCT if we trace the connection again we see that we wait for more than 3 mins


[28-JAN-2009 14:18:35:299] nttcni: entry
[28-JAN-2009 14:18:35:299] nttcni: trying to connect to socket 10.
[28-JAN-2009 14:22:19:915] ntt2err: entry

[28-JAN-2009 14:22:19:915] ntt2err: soc 10 error - operation=1, ntresnt[0]=505, ntresnt[1]=145, ntresnt[2]=0
[28-JAN-2009 14:22:19:915] ntt2err: exit
[28-JAN-2009 14:22:19:915] nttcni: exit
[28-JAN-2009 14:22:19:915] nttcon: exit
[28-JAN-2009 14:22:19:921] nserror: entry
[28-JAN-2009 14:22:19:921] nserror: nsres: id=0, op=65, ns=12535, ns2=12560; nt[0]=505, nt[1]=145, nt[2]=0; ora[0]=0, ora[1]=0, ora[2]=0
[28-JAN-2009 14:22:19:921] nsopen: unable to open transport
[28-JAN-2009 14:22:19:921] nsiocancel: entry

Sunday, November 30, 2008

 

Parallel Rollback

I had a user call up and ask to kill a session which was causing him grief and hence I killed it without much thought since it was not a Production system

A few hours later I noticed that SMON was doing parallel txn recovery. This was validated by the view
select * from x$ktprxrt;

Unfortunately I have lost the output but it did show that it would take eons

But this was also confirmed from the pstack of smon which included the function ktprbeg which I believe begins parallel rollback. (Snippet below)

[syd0904:oracle:OSTA1]/u01/app/oracle => pstack 11905
11905: ora_smon_OSTA1

0000000100d80868 kturax (fffffffffffffffe, 380017150, b, 380017, 105ebe510, 5) + 928
0000000100e15620 ktprbeg (106502000, 0, 1065033c8, 105400, 1056b5, 1065033c8) + 1a0 ===> Begin parallel rollback
00000001007e9238 ktmmon (ffffffffffffffff, ffffffff7fffdda8, 0, 1056b5000, 1, 106502000) + f58
000000010106e0dc ksbrdp (105f1b000, 38000e, 106505ce0, 105f1b000, 105f1b, 1007e8260) + 39c
00000001024efed8 opirip (106510000, 106400, 106518, 380007000, 106510, 1066a5650) + 338
000000010033f7b4 opidrv (106512a90, 0, 32, 10650f7c8, 32, 0) + 4b4
0000000100339c50 sou2o (ffffffff7ffff3e8, 32, 4, ffffffff7ffff410, 105de9000, 105de9) + 50
00000001002fc00c opimai_real (3, ffffffff7ffff4e8, 0, 0, 247e09c, 14800) + 10c
00000001002fbe38 main (1, ffffffff7ffff5f8, 0, ffffffff7ffff4f0, ffffffff7ffff600, ffffffff79d00140) + 98
00000001002fbd5c _start (0, 0, 0, 0, 0, 0) + 17c
----------------- lwp# 2 / thread# 2 --------------------

and also confirmed from the SMON trace file

*** 2008-11-28 12:03:16.828
Parallel Transaction recovery caught exception 30319
Parallel Transaction recovery caught error 30319
*** 2008-11-28 12:07:17.163
Parallel Transaction recovery caught exception 30319
Parallel Transaction recovery caught error 30319

So the first thing I did was to disable parallel recovery because of the issue documented in Metalink
using

SQL> ALTER SYSTEM SET fast_start_parallel_rollback='FALSE';

System altered.

IMHO (atleast from past experience) serial recovery is faster than parallel recovery atleast in the case where the transaction causes a lot of index block splits.

After this the row from x$ktprxrt disappeared and the following appeared in the SMON trace file

*** 2008-11-28 12:08:32.763
SMON: parallel recovery restart with degree=0 (!=16)
Parallel Transaction recovery caught exception 30312
Parallel Transaction recovery caught error 30312
*** 2008-11-28 12:08:33.039
SMON: parallel recovery restart with degree=0 (!=16)
Parallel Transaction recovery caught exception 30312
Parallel Transaction recovery caught error 30312

The following views agree on how much time it is going to take to complete the rollback

SQL> select * from x$ktuxe where KTUXECFL='DEAD' and KTUXESTA='ACTIVE';

ADDR INDX INST_ID KTUXEUSN KTUXESLT KTUXESQN KTUXERDBF KTUXERDBB KTUXESCNB KTUXESCNW KTUXESTA
---------------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------------
KTUXECFL KTUXEUEL KTUXEDDBF KTUXEDDBB KTUXEPUSN KTUXEPSLT KTUXEPSQN KTUXESIZ
------------------------ ---------- ---------- ---------- ---------- ---------- ---------- ----------
FFFFFFFF79969D78 114 1 1 16 174276 2 36154 1113256061 1021 ACTIVE
DEAD 13 0 0 0 0 0 639787


SQL> select * from GV$FAST_START_TRANSACTIONS ;

INST_ID USN SLT SEQ STATE UNDOBLOCKSDONE UNDOBLOCKSTOTAL PID CPUTIME PARENTUSN PARENTSLT
---------- ---------- ---------- ---------- ---------------- -------------- --------------- ---------- ---------- ---------- ----------
PARENTSEQ XID PXID RCVSERVERS
---------- ---------------- ---------------- ----------
1 1 16 174276 RECOVERING 122712 762423 6312
000100100002A8C4 0

As you can see the value of
KTUXESIZ matches (UNDOBLOCKSTOTAL - UNDOBLOCKSDONE)

So rollback will complete when KTUXESIZ in x$ktuxe drops down to 0 which looks like alot of time !!!
Dumping the redo confirmed that it was the same transaction which was killed

Surprisingly the value of "rollback changes - undo records applied" in v$sysstat was not increasing during this timeline. I have tested this again (kill a long running job and watch the rollback) and can confirm that the stat does get incremented.

Wednesday, July 30, 2008

 

11g SQLNet Client trace files

I spent half an hour wondering why I was not able to generate a SQL* Net client trace file from a 11g client to a 10.2 database.
Here are the contents of my client sqlnet.ora

bart:ELEV:/opt/oracle/product/11.1.0/db_1/network/admin> cat sqlnet.ora
#SQLNET.OUTBOUND_CONNECT_TIMEOUT = 3
TRACE_LEVEL_CLIENT= 16
TRACE_DIRECTORY_CLIENT = /tmp
TRACE_FILE_CLIENT= cli.trc
TRACE_TIMESTAMP_CLIENT = ON
TRACE_UNIQUE_CLIENT = ON

It took me a while to realize that thanks to ADR the client trace files were actually going to $ORACLE_BASE/diag/clients/user_oracle/host_xxx/trace

The only way to disable this is to add the following in the sqlnet.ora
DIAG_ADR_ENABLED=off

After setting the above line in the sqlnet.ora the client trace files were generated in /tmp

Monday, July 28, 2008

 

One off patches, conflicts and merges

So you are applying one of the zillion patches on top of 10.2.0.3 and you hit the following error

************************************************************************

ApplySession applying interim patch '6338357' to OH
'/opt/oracle/product/10.2.0/db_1'
Interim patch 6338357 has File Conflict with patch(es) [ 5399670 ]
in OH /opt/oracle/product/10.2.0/db_1

************************************************************************

Wouldn't it be nice if you could check for conflicts before you apply a patch
Here is one way to do the same

1. You can check which source code file is being modified by doing the following

cd 6338357/etc/config
cat actions | grep .o
oneoff_actions
oracle.rdbms version="10.2.0.3.0" opt_req="R"
archive name="libserver10.a" path="%ORACLE_HOME%/lib" object_name="lib/libserver10.a/kelt.o"
make change_dir="%ORACLE_HOME%/rdbms/lib" make_file="ins_rdbms.mk" make_target="ioracle"
oracle.rdbms
oneoff_actions

2. Then go to the $ORACLE_HOME/.patch_storage and see if any of the previous patches have modified the same file

bart:TEST:/opt/oracle/product/10.2.0/db_1/.patch_storage> find . -name 'kelt.o' -print
./verify/archive/lib/libserver10.a/kelt.o
./5399670_Mar_20_2008_20_30_34/backup/.patch_storage/5399670_Mar_20_2008_20_30_34/files/lib/libserver10.a/kelt.o
./5399670_Mar_20_2008_20_30_34/original_patch/files/lib/libserver10.a/kelt.o
./5399670_Mar_20_2008_20_30_34/scratch/kelt.o./5399670_Mar_20_2008_20_30_34/files/lib/libserver10.a/kelt.o

So this clearly shows that Patch 5399670 was previously applied which modified the same source code file and you need a merge patch for both bugs (5399670 and 6338357)

Labels:


Monday, February 04, 2008

 

Rolling invalidations Addendum

If you enable cursor trace for the testcase in my previous post using the method described in http://el-caro.blogspot.com/search?q=unshared

The generated trace in udump will have the following

************************************************************
kksCheckCursor: pinning child #0 in shared mode 7d42bd0e8 7d7e87600
Failed sharing : Rolling invalidation
kksUnlockChild: releasing child
Failed sharing : 800000000000
kksSearchChildList: no suitable child found
Creating new child object #1
kksLoadChild: reload 0 child_reload 0
kksLoadChild: reload 0 child_reload 0
Compilation environment difference Failed sharing : 0
Change in cursor environment
************************************************************

Also the parameter _optimizer_invalidation_period does not work for statements using PQ. This can be tested by changing the degree of the table to 4 (say) and running the same test. In this case a new child cursor is generated on the first re-execution of the statement after stats are gathered even when the auto_invalidate flag is used.

Wednesday, January 30, 2008

 

Rolling invalidations

There have been discussions which I have seen related to the feature of auto invalidation in dbms_stats. A couple of references are

http://forums.oracle.com/forums/thread.jspa?threadID=592771&tstart=30
and
http://www.orafaq.com/maillist/oracle-l/2006/10/10/0429.htm

I have tested the relevant parameter “_optimizer_invalidation_period” on 10.2.0.3 and believe that this is working as expected

Let us take the below testcase where the parameter (it is dynamic) is set to a value of 120

SQL> show parameter optimizer_inva

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
_optimizer_invalidation_period integer 120

We have the following sql statement

11:00:00 SQL> select * from source where rownum<2;

OBJ# LINE SOURCE
---------- ---------- ---------------------------------------------------------------------------
194107 171 -- *Action: Start a new job, or attach to an existing job that has a


1 row selected.

Elapsed: 00:00:00.12
11:00:00 SQL> select * from source where rownum<2;

OBJ# LINE SOURCE
---------- ---------- ---------------------------------------------------------------------------
194107 171 -- *Action: Start a new job, or attach to an existing job that has a


1 row selected.

Elapsed: 00:00:00.00
11:00:00 SQL> select a.child_number,LAST_LOAD_TIME, to_char(LAST_ACTIVE_TIME,'dd-mon-yyyy hh24:mi:ss') ,b.invalidations from
v$SQL_SHARED_CURSOR a, v$sql b where a.sql_id='954g5yyw5tn1s' and a.child_address=b.child_address ;

CHILD_NUMBER LAST_LOAD_TIME TO_CHAR(LAST_ACTIVE_ INVALIDATIONS
------------ ------------------- -------------------- -------------
0 2008-01-29/11:00:00 29-jan-2008 11:00:00 0

1 row selected.

Elapsed: 00:00:00.14
11:00:00 SQL>
11:00:00 SQL> select executions, invalidations,child_number from v$sql where sql_id='954g5yyw5tn1s';

EXECUTIONS INVALIDATIONS CHILD_NUMBER
---------- ------------- ------------
2 0 0

1 row selected.

Now we gather stats on the table with the auto_invalidate parameter passed to the API.

11:00:00 SQL> exec dbms_stats.gather_table_stats('REGOFA','SOURCE',no_invalidate => DBMS_STATS.AUTO_INVALIDATE);

PL/SQL procedure successfully completed.

Elapsed: 00:00:01.50

Then we keep executing the sql statement of interest to check when the new cursor will be generated.

Elapsed: 00:00:01.50
11:00:13 SQL> select a.child_number,LAST_LOAD_TIME, to_char(LAST_ACTIVE_TIME,'dd-mon-yyyy hh24:mi:ss') ,b.invalidations from
v$SQL_SHARED_CURSOR a, v$sql b where a.sql_id='954g5yyw5tn1s' and a.child_address=b.child_address ;

CHILD_NUMBER LAST_LOAD_TIME TO_CHAR(LAST_ACTIVE_ INVALIDATIONS
------------ ------------------- -------------------- -------------
0 2008-01-29/11:00:00 29-jan-2008 11:00:09 0

1 row selected.

Elapsed: 00:00:00.05
11:00:13 SQL> select executions, invalidations,child_number from v$sql where sql_id='954g5yyw5tn1s';

EXECUTIONS INVALIDATIONS CHILD_NUMBER
---------- ------------- ------------
3 0 0

1 row selected.

Elapsed: 00:00:00.00
11:00:13 SQL> select * from v$sql_shared_cursor where sql_id='954g5yyw5tn1s';

SQL_ID ADDRESS CHILD_ADDRESS CHILD_NUMBER U S O O S L S E B P
------------- ---------------- ---------------- ------------ - - - - - - - - - -
I S T A B D L T R I I R L I O S M U T N F A I T D L D B P C S R P T M B M R O P
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
M F L
- - -
954g5yyw5tn1s 00000007D3BBCBD8 00000007D5644028 0 N N N N N N N N N N
N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N
N N N


1 row selected.
…….
11:00:37 SQL> select * from source where rownum<2;

OBJ# LINE SOURCE
---------- ---------- ---------------------------------------------------------------------------
194107 171 -- *Action: Start a new job, or attach to an existing job that has a


1 row selected.

Elapsed: 00:00:00.01
11:00:39 SQL> select a.child_number,LAST_LOAD_TIME, to_char(LAST_ACTIVE_TIME,'dd-mon-yyyy hh24:mi:ss') ,b.invalidations from
v$SQL_SHARED_CURSOR a, v$sql b where a.sql_id='954g5yyw5tn1s' and a.child_address=b.child_address ;

CHILD_NUMBER LAST_LOAD_TIME TO_CHAR(LAST_ACTIVE_ INVALIDATIONS
------------ ------------------- -------------------- -------------
0 2008-01-29/11:00:00 29-jan-2008 11:00:25 0
1 2008-01-29/11:00:37 29-jan-2008 11:00:37 0

2 rows selected.

Elapsed: 00:00:00.04
11:00:39 SQL> select executions, invalidations,child_number from v$sql where sql_id='954g5yyw5tn1s';

EXECUTIONS INVALIDATIONS CHILD_NUMBER
---------- ------------- ------------
7 0 0
1 0 1

2 rows selected.

Elapsed: 00:00:00.00
11:00:39 SQL> select * from v$sql_shared_cursor where sql_id='954g5yyw5tn1s';

SQL_ID ADDRESS CHILD_ADDRESS CHILD_NUMBER U S O O S L S E B P
------------- ---------------- ---------------- ------------ - - - - - - - - - -
I S T A B D L T R I I R L I O S M U T N F A I T D L D B P C S R P T M B M R O P
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
M F L
- - -
954g5yyw5tn1s 00000007D3BBCBD8 00000007D5644028 0 N N N N N N N N N N
N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N
N N N

954g5yyw5tn1s 00000007D3BBCBD8 00000007D3753DC0 1 N N N N N N N N N N
N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N Y N N
N N N

So somewhere between 11:00:00 and 11:00:39 (within the 2 minute window) a new child cursor has been generated with roll_invalid_mismatch set to ‘Y”

I have tested for the following values of _optimizer_invalidation_period and I see consistent results

120
210
600
1800
18000

Hence this would be an ideal way to avoid a hard parse storm

This page is powered by Blogger. Isn't yours?

Subscribe to Posts [Atom]