Padstow (10)

Monday 28 July, 2008 - 22:56

DB Recovery

The example RAC on the padstow cluster failed to start because the DATA disk group was not mounted. ASM had dismounted this disk group because of the following error:

WARNING: cache failed to read fn=348 indblk=0 from disk(s): 1 0
ORA-15196: invalid ASM block header [kfc.c:7910] [endian_kfbh] [348] [2147483648] [6 != 1]
NOTE: a corrupted block was dumped to the trace file

Since it was an invalid block header, I reasoned that the only way to recover these disks was to reformat them:

su -
/etc/init.d/oracleasm deletedisk DATA1
/etc/init.d/oracleasm deletedisk DATA2
/etc/init.d/oracleasm createdisk DATA1 /dev/sdf1
/etc/init.d/oracleasm createdisk DATA2 /dev/sdg1

The DATA disk group is automatically dropped by ASM once these commands are issued. I used the following command through the +ASM2 instance to recreate the DATA disk group:

CREATE DISKGROUP data
NORMAL REDUNDANCY
DISK 'ORCL:DATA1'
DISK 'ORCL:DATA2';

I now had a blank disk group: the SPFILE, one copy of the CONTROLFILE, and all of the data files were gone.

What I should have done at this stage was to obtain the DBID as outlined in padstow (4), and then use the following commands for example2 (because the latest backup was on the non-shared drive, /u00):

rman target / catalog rman@rmancat
SET DBID 657014536;
STARTUP NOMOUNT;
RESTORE SPFILE;

Although the init.ora in $ORACLE_HOME/dbs only had a SPFILE entry, RMAN complained bitterly but still started up an Oracle instance (called DUMMY) so that it had something to talk to.

Instead of doing that nice simple procedure, I had to construct a PFILE from the information in the alert log. This is a painful reminder for me to check to see if there is an easier way first.

The next problem was to restore the CONTROLFILE. On a real file system, there would be a copy command (e.g. cp) that I could use to copy the good CONTROLFILE over the missing one. Does a copy command exist in asmcmd? No! (It is only available in 11g.) Fortunately, RMAN has the following command:

RESTORE CONTROLFILE TO '+DATA' FROM AUTOBACKUP;

This command assumes that I had configured CONTROLFILE AUTOBACKUP to ON. I assumed wrong. Instead, I had to edit the PFILE to remove the CONTROLFILE that was on the DATA disk group, and run the following RMAN commands:

STARTUP MOUNT PFILE='/u00/backup/init00.ora';
BACKUP CONTROLFILE;
RESTORE CONTROLFILE TO '+DATA';
RESTORE CONTROLFILE TO '+FRA';
SHUTDOWN IMMEDIATE;

I then used ASMCMD to find out the new names of the control files in order to update the PFILE. The two (2) control file restorations are needed to ensure that the control files are in sync.
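
To avoid this detour next time, the autobackup setting can be checked and switched on ahead of time. The following is a minimal sketch, run from RMAN while connected to the target database. The format location under /u00/backup is my assumption for illustration only (any location works as long as the format contains %F); it is not what was configured on padstow.

# Show the current setting (the default is OFF)
SHOW CONTROLFILE AUTOBACKUP;
# Back up the control file and SPFILE automatically after each backup
CONFIGURE CONTROLFILE AUTOBACKUP ON;
# Optional: send autobackups to a non-ASM location (assumed path)
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '/u00/backup/%F';

With this in place, RESTORE CONTROLFILE ... FROM AUTOBACKUP has something to find.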

Once I had the SPFILE and CONTROLFILEs back, I could return to a normal restoration. (The SPFILE was built from the final init.ora via the CREATE SPFILE command.) I used CROSSCHECK to expire all of the ARCHIVELOG files that had been in the DATA disk group. The recovery failed because of a logical error in one of the ARCHIVELOG files. Fortunately, the error message gave an SCN, which I used in a RECOVER UNTIL SCN command. After that, it was just a simple point-in-time recovery (PITR).
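
For the record, the final steps can be sketched roughly as follows. This is a reconstruction under assumptions, not a transcript of what I typed: the SCN 1234567 is a made-up placeholder for the value reported in the error message, and the sketch sets the SCN limit up front rather than after a failed recovery.

# Mount using the restored SPFILE and control files
STARTUP MOUNT;
# Mark the archived logs lost with the old DATA disk group as EXPIRED
CROSSCHECK ARCHIVELOG ALL;
RUN {
  # Placeholder SCN: use the one from the recovery error message
  SET UNTIL SCN 1234567;
  RESTORE DATABASE;
  RECOVER DATABASE;
}
# An incomplete recovery must be opened with RESETLOGS
ALTER DATABASE OPEN RESETLOGS;

Since opening with RESETLOGS starts a new incarnation of the database, it is worth taking a fresh full backup straight afterwards.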