Tuesday, January 27, 2009

Outbound_connect_timeout

Outbound_connect_timeout comes into play when nodes in a cluster are down and we cannot wait for the OS timeout as this causes long delays in connect time. For example on Solaris the value of tcp_ip_abort_interval = 180000 ==> which is 180 seconds ==> 3 mins

In this post I will demonstrate how outbound_connect_timeout (OCT) can effectively avoid timeouts experienced by clients connecting to RAC nodes

If we take an example of the following connect string

TEST =
(DESCRIPTION =
(ADDRESS_LIST =
(LOAD_BALANCE = OFF)
(ADDRESS = (PROTOCOL = TCP)(HOST = sdby1-vip)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = sdby2-vip)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = sdby3-vip)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = sdby4-vip)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = prim1-vip)(PORT = 1521))
)
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = TEST)
)
)


In the above alias the first 4 nodes are non existent nodes and the service TEST
runs on the node prim1 which is listed last in the above alias.

I have deliberately set load_balance to OFF so that the client has to traverse through all nodes serially.

If we set an OUTBOUND_CONNECT_TIMEOUT of 3 seconds in the client's sqlnet.ora
(client is 10.2.0.3 Solaris) then the time to establish the connection is around 12 seconds.

If we were to run a sqlnet trace on the connection we see that the connection starts at

Connection started at
[27-JAN-2009 22:52:33:741] --- TRACE CONFIGURATION INFORMATION FOLLOWS ---
[27-JAN-2009 22:52:33:741] New trace stream is /tmp/cli_262.trc

and the first address which is tried is

[27-JAN-2009 22:52:33:757] nsc2addr: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=sdb1-vip)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=TEST)(CID=(PROGRAM=sqlplus@bart)(HOST=bart)(USER=oracle))))

Moving on we can see that OCT is enabled due to the line

[27-JAN-2009 22:52:33:759] nstoSetupTimeout: entry
[27-JAN-2009 22:52:33:759] nstoSetupTimeout: ATO enabled for ctx=0x1001c9280, val=3000(millisecs)


After this we the following pattern is seen continuously

[27-JAN-2009 22:52:33:776] nsevwtsg: entry
[27-JAN-2009 22:52:33:776] nsevwtsg: cid=0
[27-JAN-2009 22:52:33:776] nsevwait: entry
[27-JAN-2009 22:52:33:776] nsevwait: 1 registered connection(s)
[27-JAN-2009 22:52:33:776] nsevwait: 0 pre-posted event(s)
[27-JAN-2009 22:52:33:776] nsevwait: waiting for transport event (0 thru 0)...
[27-JAN-2009 22:52:33:776] nsevwait: 0 newly-posted event(s)
[27-JAN-2009 22:52:33:776] nsevwait: 0 posted event(s)
[27-JAN-2009 22:52:33:776] nsevwait: exit (0)
[27-JAN-2009 22:52:33:776] nstoToqCheckSingle: entry
[27-JAN-2009 22:52:33:776] nstoToqCheckSingle: normal exit


and the timeout occurs in 3 seconds

[27-JAN-2009 22:52:36:771] nstoHandleEventTO: ATO occurred for ctx=0x1001c9280

Ergo the OCT seems to be working perfectly

The client then tries to establish a connection with sdby2-vip and experiences
the same timeout

The connection is finally established at

[27-JAN-2009 22:52:45:915] nscon: no connect data
[27-JAN-2009 22:52:45:915] nscon: connect handshake is complete
[27-JAN-2009 22:52:45:915] nscon: nsctxinf[0]=0x41, [1]=0x41
[27-JAN-2009 22:52:45:915] nscon: normal exit


which translates to around 12 seconds.

From my testing for clients on Windows

OUTBOUND_CONNECT_TIMEOUT does not work with 10.2.0.3 base windows client. (Clients may get errors)
It works with patch 21 on top of 10.2.0.3
OUTBOUND_CONNECT_TIMEOUT does work with 11g client

Without OCT if we trace the connection again we see that we wait for more than 3 mins


[28-JAN-2009 14:18:35:299] nttcni: entry
[28-JAN-2009 14:18:35:299] nttcni: trying to connect to socket 10.
[28-JAN-2009 14:22:19:915] ntt2err: entry

[28-JAN-2009 14:22:19:915] ntt2err: soc 10 error - operation=1, ntresnt[0]=505, ntresnt[1]=145, ntresnt[2]=0
[28-JAN-2009 14:22:19:915] ntt2err: exit
[28-JAN-2009 14:22:19:915] nttcni: exit
[28-JAN-2009 14:22:19:915] nttcon: exit
[28-JAN-2009 14:22:19:921] nserror: entry
[28-JAN-2009 14:22:19:921] nserror: nsres: id=0, op=65, ns=12535, ns2=12560; nt[0]=505, nt[1]=145, nt[2]=0; ora[0]=0, ora[1]=0, ora[2]=0
[28-JAN-2009 14:22:19:921] nsopen: unable to open transport
[28-JAN-2009 14:22:19:921] nsiocancel: entry