I had a user call up and ask to kill a session which was causing him grief and hence I killed it without much thought since it was not a Production system
A few hours later I noticed that SMON was doing parallel txn recovery. This was validated by the view
select * from x$ktprxrt;
Unfortunately I have lost the output but it did show that it would take eons
But this was also confirmed from the pstack of smon which included the function ktprbeg which I believe begins parallel rollback. (Snippet below)
[syd0904:oracle:OSTA1]/u01/app/oracle => pstack 11905
11905: ora_smon_OSTA1
0000000100d80868 kturax (fffffffffffffffe, 380017150, b, 380017, 105ebe510, 5) + 928
0000000100e15620 ktprbeg (106502000, 0, 1065033c8, 105400, 1056b5, 1065033c8) + 1a0 ===> Begin parallel rollback
00000001007e9238 ktmmon (ffffffffffffffff, ffffffff7fffdda8, 0, 1056b5000, 1, 106502000) + f58
000000010106e0dc ksbrdp (105f1b000, 38000e, 106505ce0, 105f1b000, 105f1b, 1007e8260) + 39c
00000001024efed8 opirip (106510000, 106400, 106518, 380007000, 106510, 1066a5650) + 338
000000010033f7b4 opidrv (106512a90, 0, 32, 10650f7c8, 32, 0) + 4b4
0000000100339c50 sou2o (ffffffff7ffff3e8, 32, 4, ffffffff7ffff410, 105de9000, 105de9) + 50
00000001002fc00c opimai_real (3, ffffffff7ffff4e8, 0, 0, 247e09c, 14800) + 10c
00000001002fbe38 main (1, ffffffff7ffff5f8, 0, ffffffff7ffff4f0, ffffffff7ffff600, ffffffff79d00140) + 98
00000001002fbd5c _start (0, 0, 0, 0, 0, 0) + 17c
----------------- lwp# 2 / thread# 2 --------------------
and also confirmed from the SMON trace file
*** 2008-11-28 12:03:16.828
Parallel Transaction recovery caught exception 30319
Parallel Transaction recovery caught error 30319
*** 2008-11-28 12:07:17.163
Parallel Transaction recovery caught exception 30319
Parallel Transaction recovery caught error 30319
So the first thing I did was to disable parallel recovery because of the issue documented in Metalink
using
SQL> ALTER SYSTEM SET fast_start_parallel_rollback='FALSE';
System altered.
IMHO (atleast from past experience) serial recovery is faster than parallel recovery atleast in the case where the transaction causes a lot of index block splits.
After this the row from x$ktprxrt disappeared and the following appeared in the SMON trace file
*** 2008-11-28 12:08:32.763
SMON: parallel recovery restart with degree=0 (!=16)
Parallel Transaction recovery caught exception 30312
Parallel Transaction recovery caught error 30312
*** 2008-11-28 12:08:33.039
SMON: parallel recovery restart with degree=0 (!=16)
Parallel Transaction recovery caught exception 30312
Parallel Transaction recovery caught error 30312
The following views agree on how much time it is going to take to complete the rollback
SQL> select * from x$ktuxe where KTUXECFL='DEAD' and KTUXESTA='ACTIVE';
ADDR INDX INST_ID KTUXEUSN KTUXESLT KTUXESQN KTUXERDBF KTUXERDBB KTUXESCNB KTUXESCNW KTUXESTA
---------------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------------
KTUXECFL KTUXEUEL KTUXEDDBF KTUXEDDBB KTUXEPUSN KTUXEPSLT KTUXEPSQN KTUXESIZ
------------------------ ---------- ---------- ---------- ---------- ---------- ---------- ----------
FFFFFFFF79969D78 114 1 1 16 174276 2 36154 1113256061 1021 ACTIVE
DEAD 13 0 0 0 0 0 639787
SQL> select * from GV$FAST_START_TRANSACTIONS ;
INST_ID USN SLT SEQ STATE UNDOBLOCKSDONE UNDOBLOCKSTOTAL PID CPUTIME PARENTUSN PARENTSLT
---------- ---------- ---------- ---------- ---------------- -------------- --------------- ---------- ---------- ---------- ----------
PARENTSEQ XID PXID RCVSERVERS
---------- ---------------- ---------------- ----------
1 1 16 174276 RECOVERING 122712 762423 6312
000100100002A8C4 0
As you can see the value of
KTUXESIZ matches (UNDOBLOCKSTOTAL - UNDOBLOCKSDONE)
So rollback will complete when KTUXESIZ in x$ktuxe drops down to 0 which looks like alot of time !!!
Dumping the redo confirmed that it was the same transaction which was killed
Surprisingly the value of "rollback changes - undo records applied" in v$sysstat was not increasing during this timeline. I have tested this again (kill a long running job and watch the rollback) and can confirm that the stat does get incremented.