1.We were going to open DB after shutdown immediate but it went hang.In alert log , we could only notice following message for 40 minutes.
Completed: ALTER DATABASE MOUNT
Tue Oct 31 01:08:39 2017 ALTER DATABASE OPEN Beginning crash recovery of 1 threadsparallel recovery started with 9 processes Started redo scanTue Oct 31 01:08:58 2017 Completed redo scan read 3733362 KB redo, 2824480 data blocks need recovery Started redo application at Thread 1: logseq 39596, block 3434354 Recovery of Online Redo Log: Thread 1 Group 10 Seq 39596 Reading mem 0 Mem# 0: /oracle/data/san_oradata01/eocmdb/redofiles/log10b.rdo Recovery of Online Redo Log: Thread 1 Group 11 Seq 39597 Reading mem 0 Mem# 0: /oracle/data/san_oradata01/eocmdb/redofiles/log11b.rdo Tue Oct 31 01:09:18 2017 Recovery of Online Redo Log: Thread 1 Group 13 Seq 39598 Reading mem 0 Mem# 0: /oracle/data/san_oradata01/eocmdb/redofiles/log13b.log Tue Oct 31 01:09:28 2017 Completed redo application of 628.90MB |
2.Troubleshooting using following queries:-
export NLS_DATE_FORMAT=’dd-mon-yyyy hh24:mi:ss’
set feedback on
set heading on
set echo on
set pagesize 100
set linesize 150
col checkpoint_change# format 999999999999999999999
set echo on
set linesize 400
set pagesize 20000
col name format a75
col MEMBER format a100
col file_name format a100
col PLATFORM_NAME format a30
col TABLESPACE_NAME format a30
set numformat 999999999999999
col hxfil format 9999
col fhsta format 9999
col fhscn format 999999999999999999999999
col fhtmn format a30
col fhrba_Seq format 9999999999999
col FHBCP_THR format 9999
col fhthr format 999
col fhrba_seq format 99999999
col fhscn format 999999999999999999
set linesize 175
set pagesize 50
alter session set nls_date_format = ‘DD-MON-RRRR HH24:MI:SS’;
Spool recover.lst
select name,platform_name,open_mode,controlfile_type,log_mode,flashback_on,RESETLOGS_TIME,RESETLOGS_CHANGE# from v$database;
select name, status from v$datafile;
select name, recover, fuzzy, checkpoint_change#,RESETLOGS_TIME,RESETLOGS_CHANGE# from v$datafile_header;
select * from v$backup;
select name, open_mode, checkpoint_change#, ARCHIVE_CHANGE# from v$database;
select GROUP#,THREAD#,SEQUENCE#,MEMBERS,ARCHIVED,STATUS,FIRST_CHANGE# from v$log;
select GROUP#,member from v$logfile;
select * from v$recover_file;
select * from v$recovery_log;
select hxfil FILE#, fhsta STATUS, fhscn SCN, fhrba_Seq SEQUENCE, fhtnm TABLESPACE, FHBCP_THR Thread, fhafs AFUZZY from x$kcvfh order by hxfil;
— Min PITR ABSSCN is minimun SCN needed to open
select min(FHSCN) “LOW FILEHDR SCN”, max(FHSCN) “MAX FILEHDR SCN”, max(FHAFS) “Min PITR ABSSCN” from X$KCVFH;
— Min max SEQUENCE on file headers
SELECT fhthr thread, fhrba_seq sequence, fhscn scn, fhsta status, count(*) FROM x$kcvfh group by fhthr,fhrba_seq,fhscn,fhsta;
select min(FHRBA_SEQ) “LOW FILEHDR SEQUENCE”, max(FHRBA_SEQ) “MAX FILEHDR SEQUENCE” from X$KCVFH;
select HXFIL File_num,substr(HXFNM,1,140) File_name,FHTYP Type,HXERR Validity, FHSCN SCN, FHTNM TABLESPACE_NAME,FHSTA status ,FHRBA_SEQ Sequence, FHTHR Thread from X$KCVFH;
SELECT fhthr thread, fhrba_seq sequence, fhscn scn, fhsta status, count(*) FROM x$kcvfh group by fhthr,fhrba_seq,fhscn,fhsta;
3. Please execute following command to check recovery.
RMAN>backup validate check logical database;
SQL>alter system dump logfile ‘/oracle/data/san_oradata01/eocmdb/redofiles/log10b.rdo’ validate;
RMAN>recover database test;
4.Re-creating control file and run recover from RMAN
connect target /;
alter session set tracefile_identifier = “10046_trace_file”;
alter session set events ‘10046 trace name context forever, level 3’;
set echo on;
debug on;
run {
allocate channel d1 type disk;
recover database;
}
debug off;
exit
But this was stuck with following:-
RMAN> recover database; Starting recover at 31-OCT-17 starting media recovery unable to find archived log |
5.Now we need to validate archive log 39597 for consistency.
RMAN> validate archivelog sequence 39597 thread 1;
6.We need to do following if recover stuck
a)Open one more session and shutdown abort the instance
b)Startup mount
c)Try with no parallel recovery
Now, you are applying the redolog named /oracle/data/san_oradata01/eocmdb/redofiles/log13b.log
with sequence 39598
Which is the current redo log:
SQL> select GROUP#,THREAD#,SEQUENCE#,MEMBERS,ARCHIVED,STATUS,FIRST_CHANGE# from v$log;
GROUP# THREAD# SEQUENCE# MEMBERS ARC STATUS FIRST_CHANGE#
—————- —————- —————- —————- — —————- —————-
8 1 39593 1 YES INACTIVE 12444097046518
9 1 39595 1 YES INACTIVE 12444097257724
10 1 39596 1 YES ACTIVE 12444097354077
15 1 39592 1 YES INACTIVE 12444096949452
12 1 39594 1 YES INACTIVE 12444097140637
13 1 39598 1 NO CURRENT 12444621699598 >>>>>>>>>>>>>>>>>>
14 1 39591 1 YES INACTIVE 12444095658402
11 1 39597 1 YES ACTIVE 12444621392892
SQL> recover database noparallel using backup controlfile;
If it ask for log sequence 39598, then provide the full path as below and then press <Enter>
/oracle/data/san_oradata01/eocmdb/redofiles/log13b.log (This is current redolog file)
If it says… “Media Recovery complete” then open the db with resetlogs:
SQL> alter database open resetlogs;
Note:-Following views are very useful to monitor instance recovery progress.
Reference:-Monitoring Restore/Recovery Progress (Doc ID 1335910.1)
Session Waits:-
Are there any sessions in wait and what is it waiting for?
Run the following queries at least 3 times, at 5 minute intervals to see progress/change.
col p1text form a15
col p1 form 999999
col p2text form a15
col p2 form 999999
col p3text form a10
col p3 form 9999
col waited form 9999
col waiting form 9999select sid, event, p1text, p1, p2text, p2, p3text, p3,
wait_time waited, seconds_in_wait waiting
from gv$session_wait
where event not like ‘SQL*Net%’
and event not like ‘%timer%’
and event not like ‘rdbms%’
and event not like ‘pipe%’
and event not like ‘DIAG%’
and event not like ‘Streams AQ%’
and event not like ‘VKTM%’
and state = ‘WAITING’
order by seconds_in_wait
/
Recovery Progress:-
What is the recovery progress? V$RECOVERY_PROGRESS is only populated when RECOVERY is in progress. A restore operation will not populate this view. So if you think a recovery process is slow – is it really at the recovery phase, or still restoring from RMAN backuppieces?
This is an example of a recovery progress:
START_TIME ITEM UNITS SOFAR TOTAL TIMESTAMP
——————– ——————————– ——————————– ———- ———- ——————– 31-OCT-2017 15:52:22 Active Apply Rate KB/sec 164 0 31-OCT-2017 15:52:22 Average Apply Rate KB/sec 167 0 31-OCT-2017 15:52:22 Maximum Apply Rate KB/sec 5460 0 31-OCT-2017 15:52:22 Redo Applied Megabytes 330 0 31-OCT-2017 15:52:22 Last Applied Redo SCN+Time 0 0 30-OCT-2017 13:45:57 31-OCT-2017 15:52:22 Active Time Seconds 1963 0 31-OCT-2017 15:52:22 Elapsed Time Seconds 2018 0 31-OCT-2017 15:14:17 Log Files Files 1 1 31-OCT-2017 15:14:17 Active Apply Rate KB/sec 40 40 31-OCT-2017 15:14:17 Average Apply Rate KB/sec 45 45 31-OCT-2017 15:14:17 Maximum Apply Rate KB/sec 341 341START_TIME ITEM UNITS SOFAR TOTAL TIMESTAMP ——————– ——————————– ——————————– ———- ———- ——————– 31-OCT-2017 15:14:17 Redo Applied Megabytes 89 89 31-OCT-2017 15:14:17 Last Applied Redo SCN+Time 0 0 30-OCT-2017 13:39:19 31-OCT-2017 15:14:17 Active Time Seconds 1903 1903 31-OCT-2017 15:14:17 Apply Time per Log Seconds 1903 1903 31-OCT-2017 15:14:17 Elapsed Time Seconds 2005 2005 |
6.Root Cause of slowness:-
- Hang analyzer log says:-
is waiting for ‘parallel recovery slave idle wait’ with wait info:
{
p1: ”=0x1001ffff
p2: ”=0x1
p3: ”=0x182f945d00
time in wait: 1.004650 sec
timeout after: never
wait id: 57508
blocking: 0 sessions
current sql: <none>
- IO stat report:-
The all local disk (not in SAN) was showing await.
AWR report :-
it looks like local disk issue not SAN issue. If you look into Av rd(ms) ,it is 1 -2 ms for SAN and for local it is 20 ms to 60 ms. Oracle suggestion is it should be less than 10 ms. So all datafiles should immediately be moved to SAN from Local mount point (Local does not have good striping capacity compare to SAN which will raise bottleneck) .