Oracle database instance recovery is taking too much time after shutdown immediate

1.We were going to open DB after shutdown immediate but it went hang.In alert log , we could only notice following message for 40 minutes.

 

Completed: ALTER DATABASE MOUNT
Tue Oct 31 01:08:39 2017
ALTER DATABASE OPEN
Beginning crash recovery of 1 threadsparallel recovery started with 9 processes
Started redo scanTue Oct 31 01:08:58 2017
Completed redo scan
read 3733362 KB redo, 2824480 data blocks need recovery
Started redo application at
Thread 1: logseq 39596, block 3434354
Recovery of Online Redo Log: Thread 1 Group 10 Seq 39596 Reading mem 0
Mem# 0: /oracle/data/san_oradata01/eocmdb/redofiles/log10b.rdo
Recovery of Online Redo Log: Thread 1 Group 11 Seq 39597 Reading mem 0
Mem# 0: /oracle/data/san_oradata01/eocmdb/redofiles/log11b.rdo
Tue Oct 31 01:09:18 2017
Recovery of Online Redo Log: Thread 1 Group 13 Seq 39598 Reading mem 0
Mem# 0: /oracle/data/san_oradata01/eocmdb/redofiles/log13b.log
Tue Oct 31 01:09:28 2017
Completed redo application of 628.90MB

 

2.Troubleshooting using following queries:-

export NLS_DATE_FORMAT=’dd-mon-yyyy hh24:mi:ss’

set feedback on

set heading on

set echo on

set pagesize 100

set linesize 150

col checkpoint_change# format 999999999999999999999

set echo on

set linesize 400

set pagesize 20000

col name format a75

col MEMBER format a100

col file_name format a100

col PLATFORM_NAME format a30

col TABLESPACE_NAME format a30

set numformat 999999999999999

col hxfil format 9999

col fhsta format 9999

col fhscn format 999999999999999999999999

col fhtmn format a30

col fhrba_Seq format 9999999999999

col FHBCP_THR format 9999

col fhthr format 999

col fhrba_seq format 99999999

col fhscn format 999999999999999999

set linesize 175

set pagesize 50

alter session set nls_date_format = ‘DD-MON-RRRR HH24:MI:SS’;

Spool recover.lst

select name,platform_name,open_mode,controlfile_type,log_mode,flashback_on,RESETLOGS_TIME,RESETLOGS_CHANGE# from v$database;

select name, status from v$datafile;

select name, recover, fuzzy, checkpoint_change#,RESETLOGS_TIME,RESETLOGS_CHANGE# from v$datafile_header;

select * from v$backup;

select name, open_mode, checkpoint_change#, ARCHIVE_CHANGE# from v$database;

select GROUP#,THREAD#,SEQUENCE#,MEMBERS,ARCHIVED,STATUS,FIRST_CHANGE# from v$log;

select GROUP#,member from v$logfile;

select * from v$recover_file;

select * from v$recovery_log;

select hxfil FILE#, fhsta STATUS, fhscn SCN, fhrba_Seq SEQUENCE, fhtnm TABLESPACE, FHBCP_THR Thread, fhafs AFUZZY from x$kcvfh order by hxfil;

— Min PITR ABSSCN is minimun SCN needed to open

select min(FHSCN) “LOW FILEHDR SCN”, max(FHSCN) “MAX FILEHDR SCN”, max(FHAFS) “Min PITR ABSSCN” from X$KCVFH;

— Min max SEQUENCE on file headers

SELECT fhthr thread, fhrba_seq sequence, fhscn scn, fhsta status, count(*) FROM x$kcvfh group by fhthr,fhrba_seq,fhscn,fhsta;

select min(FHRBA_SEQ) “LOW FILEHDR SEQUENCE”, max(FHRBA_SEQ) “MAX FILEHDR SEQUENCE” from X$KCVFH;

select HXFIL File_num,substr(HXFNM,1,140) File_name,FHTYP Type,HXERR Validity, FHSCN SCN, FHTNM TABLESPACE_NAME,FHSTA status ,FHRBA_SEQ Sequence, FHTHR Thread from X$KCVFH;

SELECT fhthr thread, fhrba_seq sequence, fhscn scn, fhsta status, count(*) FROM x$kcvfh group by fhthr,fhrba_seq,fhscn,fhsta;

 

3. Please execute following command to check recovery.

RMAN>backup validate check logical database;

SQL>alter system dump logfile ‘/oracle/data/san_oradata01/eocmdb/redofiles/log10b.rdo’ validate;

RMAN>recover database test;

 

4.Re-creating control file and run recover from RMAN

connect target /;
alter session set tracefile_identifier = “10046_trace_file”;
alter session set events ‘10046 trace name context forever, level 3’;
set echo on;
debug on;
run {
allocate channel d1 type disk;
recover database;
}
debug off;
exit

But this was stuck with following:-

RMAN> recover database;

Starting recover at 31-OCT-17
using channel ORA_DISK_1

starting media recovery

unable to find archived log
archived log thread=1 sequence=39598
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 10/31/2017 06:29:09
RMAN-06054: media recovery requesting unknown archived log for thread 1 with sequence 39598 and starting SCN of 12444621699598

 

5.Now we need to validate archive log 39597 for consistency.

RMAN> validate archivelog sequence 39597 thread 1;

6.We need to do following if recover stuck

a)Open one more session and shutdown abort the instance

b)Startup mount

c)Try with no parallel recovery

Now, you are applying the redolog named /oracle/data/san_oradata01/eocmdb/redofiles/log13b.log

with sequence 39598

Which is the current redo log:

SQL> select GROUP#,THREAD#,SEQUENCE#,MEMBERS,ARCHIVED,STATUS,FIRST_CHANGE# from v$log;

GROUP# THREAD# SEQUENCE# MEMBERS ARC STATUS FIRST_CHANGE#
—————- —————- —————- —————- — —————- —————-
8 1 39593 1 YES INACTIVE 12444097046518
9 1 39595 1 YES INACTIVE 12444097257724
10 1 39596 1 YES ACTIVE 12444097354077
15 1 39592 1 YES INACTIVE 12444096949452
12 1 39594 1 YES INACTIVE 12444097140637
13 1 39598 1 NO CURRENT 12444621699598 >>>>>>>>>>>>>>>>>>
14 1 39591 1 YES INACTIVE 12444095658402
11 1 39597 1 YES ACTIVE 12444621392892

SQL> recover database noparallel using backup controlfile;

If it ask for log sequence 39598, then provide the full path as below and then press <Enter>
/oracle/data/san_oradata01/eocmdb/redofiles/log13b.log  (This is current redolog file)

If it says… “Media Recovery complete” then open the db with resetlogs:

SQL> alter database open resetlogs;

Note:-Following views are very useful to monitor instance recovery progress.

Reference:-Monitoring Restore/Recovery Progress (Doc ID 1335910.1)

 Session Waits:-

Are there any sessions in wait and what is it waiting for?

Run the following queries at least 3 times, at 5 minute intervals to see progress/change.

set linesize 200 trimspool oncol event form a25
col p1text form a15
col p1 form 999999
col p2text form a15
col p2 form 999999
col p3text form a10
col p3 form 9999
col waited form 9999
col waiting form 9999select sid, event, p1text, p1, p2text, p2, p3text, p3,
wait_time waited, seconds_in_wait waiting
from gv$session_wait
where event not like ‘SQL*Net%’
and event not like ‘%timer%’
and event not like ‘rdbms%’
and event not like ‘pipe%’
and event not like ‘DIAG%’
and event not like ‘Streams AQ%’
and event not like ‘VKTM%’
and state = ‘WAITING’
order by seconds_in_wait
/

Recovery Progress:-

What is the recovery progress?  V$RECOVERY_PROGRESS is only populated when RECOVERY is in progress. A restore operation will not populate this view. So if you think a recovery process is slow – is it really at the recovery phase, or still restoring from RMAN backuppieces?

This is an example of a recovery progress:

 

START_TIME ITEM UNITS SOFAR TOTAL TIMESTAMP
——————– ——————————– ——————————– ———- ———- ——————–
31-OCT-2017 15:52:22 Active Apply Rate KB/sec 164 0
31-OCT-2017 15:52:22 Average Apply Rate KB/sec 167 0
31-OCT-2017 15:52:22 Maximum Apply Rate KB/sec 5460 0
31-OCT-2017 15:52:22 Redo Applied Megabytes 330 0
31-OCT-2017 15:52:22 Last Applied Redo SCN+Time 0 0 30-OCT-2017 13:45:57
31-OCT-2017 15:52:22 Active Time Seconds 1963 0
31-OCT-2017 15:52:22 Elapsed Time Seconds 2018 0
31-OCT-2017 15:14:17 Log Files Files 1 1
31-OCT-2017 15:14:17 Active Apply Rate KB/sec 40 40
31-OCT-2017 15:14:17 Average Apply Rate KB/sec 45 45
31-OCT-2017 15:14:17 Maximum Apply Rate KB/sec 341 341START_TIME ITEM UNITS SOFAR TOTAL TIMESTAMP
——————– ——————————– ——————————– ———- ———- ——————–
31-OCT-2017 15:14:17 Redo Applied Megabytes 89 89
31-OCT-2017 15:14:17 Last Applied Redo SCN+Time 0 0 30-OCT-2017 13:39:19
31-OCT-2017 15:14:17 Active Time Seconds 1903 1903
31-OCT-2017 15:14:17 Apply Time per Log Seconds 1903 1903
31-OCT-2017 15:14:17 Elapsed Time Seconds 2005 2005

 

6.Root Cause of slowness:-

  • Hang analyzer log says:-

is waiting for ‘parallel recovery slave idle wait’ with wait info:
{
p1: ”=0x1001ffff
p2: ”=0x1
p3: ”=0x182f945d00
time in wait: 1.004650 sec
timeout after: never
wait id: 57508
blocking: 0 sessions
current sql: <none>

  • IO stat report:-

The all local disk (not in SAN) was showing await.

AWR report :-

it looks like local disk issue not SAN issue. If you look into Av rd(ms) ,it is 1 -2 ms for SAN and for local it is 20 ms to 60 ms. Oracle suggestion is it should be less than 10 ms. So all datafiles should immediately be moved to SAN from Local mount point (Local does not have good striping capacity compare to SAN which will raise bottleneck) .