This morning I was requested to check a 10gR2 physical standby instance which was slow to apply redo and the lag was growing rapidly, my initial check showed me 9 to 12 minutes duration for 250 MB redo log file to apply;
Mon May 5 06:03:17 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902209.arc Mon May 5 06:13:20 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902210.arc Mon May 5 06:23:07 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902211.arc Mon May 5 06:32:51 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902212.arc Mon May 5 06:42:40 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902213.arc Mon May 5 06:52:01 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902214.arc Mon May 5 07:00:52 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902215.arc Mon May 5 07:09:58 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902216.arc ..
After a quick look from unix SQL*Plus as sysdba the Top 5 waits were;
PX Deq: Test for msg (4,5x amount) free buffer waits log file sequential read PX Deq Credit: send blkd checkpoint completed
And after going through the Note:387343.1 and Data Guard Redo Apply and Media Recovery Best Practices 10gR2 I recommended to increase the buffer cache from 512 MB to 6 GB, reduce the recovery parallelism from 20 to 8 and also some additional parameters like;
*.db_cache_size=6G *.log_buffer=78643200 *.large_pool_size=2G *.pga_aggregate_target=10G *.workarea_size_policy='AUTO' *.parallel_min_servers=360 *.parallel_max_servers=520 *.parallel_threads_per_cpu=2 *.optimizer_mode=ALL_ROWS *.optimizer_dynamic_sampling=4 *.cursor_sharing=EXACT RECOVER MANAGED STANDBY DATABASE PARALLEL 8;
After the restart as I monitored the apply process from the alert.log again, now the database was able to apply 250 MB redo log file within 30 seconds and 1 minutes;
Mon May 5 09:01:07 2008 ALTER DATABASE RECOVER automatic from '/assos_arcdir' standby database PARALLEL 8 Mon May 5 09:01:07 2008 Media Recovery Start Managed Standby Recovery not using Real Time Apply parallel recovery started with 8 processes Mon May 5 09:01:52 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902217.arc Mon May 5 09:02:29 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902218.arc Mon May 5 09:03:41 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902219.arc Mon May 5 09:04:39 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902220.arc Mon May 5 09:05:33 2008 Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902221.arc ..
And the Top 5 waits were now;
PX Deq: Test for msg (x amount) log file sequential read checkpoint completed db file parallel write latch: checkpoint queue latch
So as my lessons learned, I really want to highlight two things, yes again and again these two;
1- If you have enough physical memory, benefit from it to the ends, do not run your production databases with small SGA(critical for OLTP) or PGA(critical for OLAP) especially after 9iR2(nearly no latching problem like the old days with large SGAs). Come on even on laptops we use some GBs of sized SGAs nowadays.
2- Parallelism does not mean faster, I benefit from reducing the degree of parallelism most often than compared to increasing it, 4 or 8 is nearly anytime is enough for my needs. I hope Oracle invests more in automatic parallelism tuning feature in the near future which is FALSE by default on 10gR2 and like automatic undo or automatic pga management also automatic parallel tuning takes its place in my most liked automatic tuning Oracle features :)
Here are some of the sql scripts I used during above troubleshooting from sql*plus, like in above standby case sometimes we only have sql*plus and some simple v$ queries to gather required troubleshooting information.
In the same document
for wait : PX Deq: Par Recov Reply (Coordinator synchronous wait for Slave (wait for checkpoints))
It says : Increase PARALLEL EXECUTION MESSAGE SIZE to 65535
I had :
EVENT TOTAL_WAITS TIME_WAITED AVERAGE_WAIT*10
—————————————————————————————————————–
PX Deq: Test for msg 164245401 102316646 6.2
checkpoint completed 210190 59264214 2819.6
db file parallel write 1705982 47579154 278.9
PX Deq: Par Recov Reply 12940138 7260931 5.6
control file sequential read 93846796 7252060 .8
After that :
….
to be continued
Comment by dbahar — May 15, 2008 @ 9:18 am |
EVENT TOTAL_WAITS TIME_WAITED AVERAGE_WAIT*10
————————————————————————————————————————-PX Deq: Test for msg 1491772 3430776 23
db file parallel write 71368 1917831 268.7
checkpoint completed 6265 1858020 2965.7
PX Deq: Par Recov Reply 168204 175118 10.4
cf sequential read 1171026 93547 .8
recovery read 9904 69418 70.1
PX Deq Credit: send blkd 24983 63495 25.4
latch: checkpoint queue latch 457797 40245 .9
Most of the waits dramatically decreased. For ex. :
Event old value new value
—————————— —————- ———————-
PX Deq: Test for msg 164245401 1491772
db file parallel write 1705982 71368
I will analyze redo apply rate after this modification. I hope it will increase…
Comment by dbahar — May 15, 2008 @ 10:21 am |
On what hard ware did you run the dataguard.
On a T2000 I go best with the parallel 32 for the recovery.
Comment by archos2 — May 18, 2008 @ 9:30 pm |
@archos2
uname -a
OSF1 milasbcp V5.1 2650 alpha
Memory: Real: 93G tot
SQL> SELECT a.stat_name, a.VALUE
2 FROM v$osstat a
3 WHERE a.stat_name IN (‘NUM_CPUS’) ;
STAT_NAME VALUE
—————————————————————- ———-
NUM_CPUS 32
Comment by H.Tonguç Yılmaz — May 20, 2008 @ 7:26 am |
There is a shell script in this metalink note which may also assist in monitoring;
Note:240875.1
Oracle 9i Media Recovery Best Practices
Comment by H.Tonguç Yılmaz — May 20, 2008 @ 10:56 am |