Extract(Read) > Transform(Test&Apply) > Load(Learn) Blog

May 5, 2008

Data Guard Redo Apply and Media Recovery Best Practices 10g

Filed under: Oracle Best Practices,Oracle Performance — H.Tonguç Yılmaz @ 7:11 am

This morning I was requested to check a 10gR2 physical standby instance which was slow to apply redo and the lag was growing rapidly, my initial check showed me 9 to 12 minutes duration for 250 MB redo log file to apply;


Mon May  5 06:03:17 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902209.arc
Mon May  5 06:13:20 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902210.arc
Mon May  5 06:23:07 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902211.arc
Mon May  5 06:32:51 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902212.arc
Mon May  5 06:42:40 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902213.arc
Mon May  5 06:52:01 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902214.arc
Mon May  5 07:00:52 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902215.arc
Mon May  5 07:09:58 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902216.arc
..
    

After a quick look from unix SQL*Plus as sysdba the Top 5 waits were;


PX Deq: Test for msg   (4,5x amount)                                            
free buffer waits                                                  
log file sequential read                                           
PX Deq Credit: send blkd                                           
checkpoint completed                               
     

And after going through the Note:387343.1 and Data Guard Redo Apply and Media Recovery Best Practices 10gR2 I recommended to increase the buffer cache from 512 MB to 6 GB, reduce the recovery parallelism from 20 to 8 and also some additional parameters like;


*.db_cache_size=6G               
*.log_buffer=78643200
*.large_pool_size=2G 
*.pga_aggregate_target=10G
*.workarea_size_policy='AUTO'
*.parallel_min_servers=360
*.parallel_max_servers=520
*.parallel_threads_per_cpu=2
*.optimizer_mode=ALL_ROWS 
*.optimizer_dynamic_sampling=4
*.cursor_sharing=EXACT

RECOVER MANAGED STANDBY DATABASE PARALLEL 8;  
     

After the restart as I monitored the apply process from the alert.log again, now the database was able to apply 250 MB redo log file within 30 seconds and 1 minutes;


Mon May  5 09:01:07 2008
ALTER DATABASE RECOVER  automatic from '/assos_arcdir' standby database PARALLEL 8  
Mon May  5 09:01:07 2008
Media Recovery Start
Managed Standby Recovery not using Real Time Apply
 parallel recovery started with 8 processes
Mon May  5 09:01:52 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902217.arc
Mon May  5 09:02:29 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902218.arc
Mon May  5 09:03:41 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902219.arc
Mon May  5 09:04:39 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902220.arc
Mon May  5 09:05:33 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902221.arc
..
     

And the Top 5 waits were now;


PX Deq: Test for msg   (x amount)
log file sequential read 
checkpoint completed 
db file parallel write 
latch: checkpoint queue latch
    

So as my lessons learned, I really want to highlight two things, yes again and again these two;

1- If you have enough physical memory, benefit from it to the ends, do not run your production databases with small SGA(critical for OLTP) or PGA(critical for OLAP) especially after 9iR2(nearly no latching problem like the old days with large SGAs). Come on even on laptops we use some GBs of sized SGAs nowadays.

2- Parallelism does not mean faster, I benefit from reducing the degree of parallelism most often than compared to increasing it, 4 or 8 is nearly anytime is enough for my needs. I hope Oracle invests more in automatic parallelism tuning feature in the near future which is FALSE by default on 10gR2 and like automatic undo or automatic pga management also automatic parallel tuning takes its place in my most liked automatic tuning Oracle features :)

Here are some of the sql scripts I used during above troubleshooting from sql*plus, like in above standby case sometimes we only have sql*plus and some simple v$ queries to gather required troubleshooting information.

About these ads

5 Comments »

  1. In the same document

    for wait : PX Deq: Par Recov Reply (Coordinator synchronous wait for Slave (wait for checkpoints))
    It says : Increase PARALLEL EXECUTION MESSAGE SIZE to 65535

    I had :

    EVENT TOTAL_WAITS TIME_WAITED AVERAGE_WAIT*10
    —————————————————————————————————————–
    PX Deq: Test for msg 164245401 102316646 6.2
    checkpoint completed 210190 59264214 2819.6
    db file parallel write 1705982 47579154 278.9
    PX Deq: Par Recov Reply 12940138 7260931 5.6
    control file sequential read 93846796 7252060 .8

    After that :

    ….

    to be continued

    Comment by dbahar — May 15, 2008 @ 9:18 am | Reply

  2. EVENT TOTAL_WAITS TIME_WAITED AVERAGE_WAIT*10
    ————————————————————————————————————————-PX Deq: Test for msg 1491772 3430776 23
    db file parallel write 71368 1917831 268.7
    checkpoint completed 6265 1858020 2965.7
    PX Deq: Par Recov Reply 168204 175118 10.4
    cf sequential read 1171026 93547 .8
    recovery read 9904 69418 70.1
    PX Deq Credit: send blkd 24983 63495 25.4
    latch: checkpoint queue latch 457797 40245 .9

    Most of the waits dramatically decreased. For ex. :

    Event old value new value
    —————————— —————- ———————-
    PX Deq: Test for msg 164245401 1491772
    db file parallel write 1705982 71368

    I will analyze redo apply rate after this modification. I hope it will increase…

    Comment by dbahar — May 15, 2008 @ 10:21 am | Reply

  3. On what hard ware did you run the dataguard.
    On a T2000 I go best with the parallel 32 for the recovery.

    Comment by archos2 — May 18, 2008 @ 9:30 pm | Reply

  4. @archos2

    uname -a
    OSF1 milasbcp V5.1 2650 alpha

    Memory: Real: 93G tot

    SQL> SELECT a.stat_name, a.VALUE
    2 FROM v$osstat a
    3 WHERE a.stat_name IN (‘NUM_CPUS’) ;

    STAT_NAME VALUE
    —————————————————————- ———-
    NUM_CPUS 32

    Comment by H.Tonguç Yılmaz — May 20, 2008 @ 7:26 am | Reply

  5. There is a shell script in this metalink note which may also assist in monitoring;

    Note:240875.1
    Oracle 9i Media Recovery Best Practices

    Comment by H.Tonguç Yılmaz — May 20, 2008 @ 10:56 am | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 80 other followers

%d bloggers like this: