This morning I was requested to check a 10gR2 physical standby instance which was slow to apply redo and the lag was growing rapidly, my initial check showed me 9 to 12 minutes duration for 250 MB redo log file to apply;
Mon May 5 06:03:17 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902209.arc Mon May 5 06:13:20 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902210.arc Mon May 5 06:23:07 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902211.arc Mon May 5 06:32:51 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902212.arc Mon May 5 06:42:40 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902213.arc Mon May 5 06:52:01 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902214.arc Mon May 5 07:00:52 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902215.arc Mon May 5 07:09:58 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902216.arc ..
After a quick look from unix SQL*Plus as sysdba the Top 5 waits were;
PX Deq: Test for msg (4,5x amount) free buffer waits log file sequential read PX Deq Credit: send blkd checkpoint completed
And after going through the Note:387343.1 and Data Guard Redo Apply and Media Recovery Best Practices 10gR2 I recommended to increase the buffer cache from 512 MB to 6 GB, reduce the recovery parallelism from 20 to 8 and also some additional parameters like;
*.db_cache_size=6G *.log_buffer=78643200 *.large_pool_size=2G *.pga_aggregate_target=10G *.workarea_size_policy='AUTO' *.parallel_min_servers=360 *.parallel_max_servers=520 *.parallel_threads_per_cpu=2 *.optimizer_mode=ALL_ROWS *.optimizer_dynamic_sampling=4 *.cursor_sharing=EXACT RECOVER MANAGED STANDBY DATABASE PARALLEL 8;
After the restart as I monitored the apply process from the alert.log again, now the database was able to apply 250 MB redo log file within 30 seconds and 1 minutes;
Mon May 5 09:01:07 2008 ALTER DATABASE RECOVER automatic from '/assos_arcdir' standby database PARALLEL 8 Mon May 5 09:01:07 2008 Media Recovery Start Managed Standby Recovery not using Real Time Apply parallel recovery started with 8 processes Mon May 5 09:01:52 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902217.arc Mon May 5 09:02:29 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902218.arc Mon May 5 09:03:41 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902219.arc Mon May 5 09:04:39 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902220.arc Mon May 5 09:05:33 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902221.arc ..
And the Top 5 waits were now;
PX Deq: Test for msg (x amount) log file sequential read checkpoint completed db file parallel write latch: checkpoint queue latch
So as my lessons learned, I really want to highlight two things, yes again and again these two;
1- If you have enough physical memory, benefit from it to the ends, do not run your production databases with small SGA(critical for OLTP) or PGA(critical for OLAP) especially after 9iR2(nearly no latching problem like the old days with large SGAs). Come on even on laptops we use some GBs of sized SGAs nowadays.
2- Parallelism does not mean faster, I benefit from reducing the degree of parallelism most often than compared to increasing it, 4 or 8 is nearly anytime is enough for my needs. I hope Oracle invests more in automatic parallelism tuning feature in the near future which is FALSE by default on 10gR2 and like automatic undo or automatic pga management also automatic parallel tuning takes its place in my most liked automatic tuning Oracle features :)
Here are some of the sql scripts I used during above troubleshooting from sql*plus, like in above standby case sometimes we only have sql*plus and some simple v$ queries to gather required troubleshooting information.