Failover issues with Active/Passive

FarzanaMustafa · ‎01-21-2019

Hello,

We are using a PA-3020 HA pair and are currently seeing two fail-over issues:

-Fail-over from primary to secondary takes about two minutes. Fail-over back to the primary takes 10 minutes on average. This seems excessive for a production environment.
-Once we have failed over from primary to secondary, our externally facing websites become inaccessible from the outside.

Both issues have been observed since PAN-OS 7.1.10. We have gone through several firmware upgrades and are currently on 8.1.3, but have noticed no change.

The next device on the network after the firewalls is a Cisco Nexus stack.

Monitor Fail Hold Down Time (min)=1

Monitor Hold Time (ms)=3000

Any idea what is going on here?


Brandon_Wertz · ‎01-22-2019

@FarzanaMustafa wrote:

The next device on the network after the firewalls is a Cisco Nexus stack.

Monitor Fail Hold Down Time (min)=1
Monitor Hold Time (ms)=3000

Any idea what is going on here?

Nexus "stack"? Can you elaborate on the network architecture and how your HA interfaces are incorporated into the network?

The HA interfaces should be in an L2 VLAN, with no other ports anywhere on your network in that VLAN. The ports for the HA interfaces themselves should just be normal access ports in that VLAN.

OtakarKlier · ‎01-22-2019

Hello,

How are you testing the failover? Are you putting the active firewall into the suspended state? Are you using ACI?

Please advise,

FarzanaMustafa · ‎01-23-2019

Hi @OtakarKlier and @Brandon_Wertz

HA is configured directly from one firewall to the other without any network devices in between. We have four cables between the firewalls: two are used as the primary HA links (Control + Data), and two Ethernet interfaces are configured as backup HA interfaces (one for Control backup and one for Data link backup; the interfaces are not tagged).

We manually suspend the primary firewall to fail over to the secondary, then make the first one active again, and suspend the secondary to fail over back to the primary.

OtakarKlier · ‎01-24-2019

Hello,

Are your Nexus switches in vPC? I have a similar setup and my failover is almost instantaneous. Maybe open a TAC case to make sure everything is running as it should?

Regards,

Brandon_Wertz · ‎01-24-2019

@FarzanaMustafa wrote:
Hi @OtakarKlier and @Brandon_Wertz

HA is configured directly from one firewall to the other without any network devices in between. We have four cables between the firewalls: two are used as the primary HA links (Control + Data), and two Ethernet interfaces are configured as backup HA interfaces (one for Control backup and one for Data link backup; the interfaces are not tagged).

We manually suspend the primary firewall to fail over to the secondary, then make the first one active again, and suspend the secondary to fail over back to the primary.

It makes no sense whatsoever that you would have anything other than a millisecond failover on firewalls that are directly connected to each other, let alone multi-minute outages.

My deployment is across an OTV WAN link hundreds of miles away and our failover is instantaneous.

FarzanaMustafa · ‎02-07-2019

Hi All,

Just wanted to let you all know that the TAC team has assisted with this issue. The files and information below were collected, and after analyzing them the conclusion was that this seems to be an external issue: the PA's ARP requests/replies are not being delivered to the end host.

-Packet capture for non-IP traffic on both firewalls, taken while performing the failover. We want to see whether the new primary firewall sends GARP immediately or not.
-Keep a continuous ping running through the HA pair and include it in the packet filter for the above capture. One filter captures all non-IP traffic and the other captures the ping.
-Perform a failover and write down the timestamp, the time required to recover, and the minutes of outage.
-Collect the packet captures, session output for the ping (from the host machine), and global counters.
-Tech Support files

Closing this post now. The client will check the connected switches and devices to understand why the ARP requests/replies are not delivered to the end host.
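
In case it helps someone repeating the test: the "time required to recover" can be worked out from the continuous ping instead of by stopwatch. This is a rough Python sketch, assuming you have already turned the ping log into time-ordered (timestamp in seconds, reply received) pairs. The function name `outage_windows` and the sample values are hypothetical.

```python
from typing import Iterable, List, Tuple


def outage_windows(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs for each run of lost pings.

    start is the timestamp of the first lost probe in a run; end is the
    timestamp of the first reply after it. If the log ends while still
    down, end is the timestamp of the last probe seen.
    """
    windows: List[Tuple[float, float]] = []
    start = None
    last_ts = None
    for ts, ok in samples:
        last_ts = ts
        if not ok and start is None:
            start = ts                    # outage begins
        elif ok and start is not None:
            windows.append((start, ts))   # outage ends at first reply
            start = None
    if start is not None:
        windows.append((start, last_ts))  # still down when the log ended
    return windows


# Hypothetical one-second ping log around a failover.
log = [(0.0, True), (1.0, False), (2.0, False), (3.0, True), (4.0, True)]
for begin, end in outage_windows(log):
    print(f"outage from t={begin}s to t={end}s, duration {end - begin}s")
```

The resolution is only as good as the ping interval, so a sub-second failover needs a much shorter interval than the default one second to show up at all.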
