Recently one of our Data Guard systems started to fall out of sync. When we checked the logs, we could see that Oracle was complaining that the archivelogs were corrupt. The standby alert log showed the following:
Fri Feb 22 08:51:16 2019
RFS: Assigned to RFS process 5352
RFS: Opened log for thread 1 sequence 21864 dbid -185466353 branch 97646777
CORRUPTION DETECTED: In redo blocks starting at block 135169count 2048 for thread 1 sequence 21864
Deleted Oracle managed file /u02/FAST_RECOVERY_AREA/CBIPROD/ARCHIVELOG/2019_22_02/O1_MF_1_7_DHD1GTSC_.ARC
RFS: Possible network disconnect with primary database
We manually copied over the affected archivelog from the live system and catalogued it in RMAN on the standby site:
CBIPRODDR:> rman target /
RMAN> catalog start with '/u02/temp/arch21864';
As soon as we did this, MRP applied the log without complaint. The logs therefore appear to be fine on the live site and must be getting corrupted in transit.
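One way to double-check that a manually copied archivelog really is identical to the original is to compare checksums. The following is a minimal sketch, assuming a Linux host with GNU `sha256sum` available. The helper name `verify_digest` is hypothetical and is not part of any Oracle tooling. The idea is to run `sha256sum` on the file at the live site, then pass the resulting digest to the helper on the standby.

```shell
#!/bin/sh
# verify_digest EXPECTED_HEX FILE   (hypothetical helper, for illustration only)
# Exit status: 0 if FILE's SHA-256 digest equals EXPECTED_HEX,
#              1 if the digests differ,
#              2 if FILE cannot be read.
verify_digest() {
    expected=$1
    file=$2
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 2
    fi
    # sha256sum prints "<digest>  <filename>"; keep only the digest
    actual=$(sha256sum "$file" | awk '{print $1}')
    [ "$actual" = "$expected" ]
}

# Example usage on the standby, with the digest taken from the live site:
#   verify_digest "<digest from live site>" /u02/temp/arch21864 && echo "copy is intact"
```

A matching digest only shows that this particular copy arrived intact. It says nothing about what happens to the same file when it is shipped over SQLNet.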
The next log that Data Guard tried to copy was also corrupt, so all the SQL*Net traffic between the two sites appeared to be suffering corruption, whereas FTP, for example, was not. This led me to research whether SQL*Net traffic is treated any differently from other traffic types. It turns out that many routers and firewalls enable, by default, a feature called deep packet inspection (DPI), under which traffic regarded as dangerous is inspected more closely. SQL*Net traffic is often identified as dangerous, and its packets can be corrupted by this inspection, which is very frustrating.

Unfortunately our customer was unable to turn this feature off, so we had to find a workaround. Further research showed that most routers only treat traffic as SQL*Net if it is running on the standard port, 1521. As a test we changed the standby listener port to 1522 and updated the config on the live site. Immediately the logs started being pulled over by FAL without corruption. We were fairly amazed but delighted!
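For anyone wanting to try the same thing, the port change might look roughly like the following in the Oracle Net configuration files. This is a sketch only: the host name `standby-host`, the listener name `LISTENER`, and the use of `CBIPRODDR` as both the net service alias and the service name are placeholders for illustration, so substitute whatever your own environment uses.

```
# listener.ora on the standby site (placeholder host name)
LISTENER =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = standby-host)(PORT = 1522))
    )
  )

# tnsnames.ora on the live site: the entry that points at the standby
CBIPRODDR =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = standby-host)(PORT = 1522))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = CBIPRODDR)
    )
  )
```

The listener needs restarting (`lsnrctl stop` followed by `lsnrctl start`) before the new port takes effect. If the standby database relies on dynamic registration rather than a static `SID_LIST_LISTENER` entry, the `local_listener` parameter will probably also need to point at the new port. Any firewall between the sites must allow the new port as well.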
Quick summary: if you are getting constantly or frequently corrupt archivelogs on your standby site in a Data Guard environment, check that the logs are OK on your live site. If they are, try changing the listener port on the standby site from 1521 to something else (we used 1522) to see if this quickly resolves the problem!