Failover issues with Active/Passive

L4 Transporter

Hello,

 

We are using a PA-3020 HA pair and currently have two issues with failover:

  1. Failover from primary to secondary takes about two minutes. Failing back to the primary takes 10 minutes on average. This seems excessive for a production environment.
  2. Once failed over from primary to secondary, our externally facing websites become inaccessible from the outside.

 

Both issues have been observed since PAN-OS 7.1.10. We have gone through several firmware upgrades and are currently on 8.1.3, but have noticed no change.

 

The next device on the network after the firewalls is a Cisco Nexus stack.

 

Monitor Fail Hold Down Time (min)=1

Monitor Hold Time (ms)=3000

 

Any idea what is going on here?


Accepted Solutions
L4 Transporter

Hi All,

 

Just wanted to let you all know that the TAC team has assisted with this issue. The files and information below were collected, and after analyzing them the conclusion was that this appears to be an external issue: the PA's ARP requests/replies are not being delivered to the end host.

 

-Packet capture for non-IP traffic on both firewalls. Perform this capture during the failover; we want to see whether the new primary firewall sends GARP immediately or not.
-Keep a continuous ping running through the HA pair and include it in the packet filter for the capture above. One filter captures all non-IP traffic and the other captures the ping.
-Perform a failover and write down the timestamp, the time required to recover, and the minutes of outage.
-Collect the packet captures, the session output for the ping (from the host machine), and the global counters.
-Tech support files.
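
For anyone repeating the continuous-ping step, here is a rough sketch of how the outage time could be worked out from a ping log rather than estimated by hand. It is not something TAC provided. The function name `find_outages` and the input format (a chronological list of timestamp / replied pairs) are my own assumptions; you would need to produce those pairs from whatever your ping tool logs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


@dataclass
class Outage:
    start: datetime  # timestamp of the first lost ping
    end: datetime    # timestamp of the first reply after the loss

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def find_outages(samples: Iterable[Tuple[datetime, bool]]) -> List[Outage]:
    """Group consecutive lost pings into outage windows.

    `samples` is assumed to be in chronological order, one entry per ping:
    (time the ping was sent, whether a reply came back).
    """
    outages: List[Outage] = []
    start = None
    last_ts = None
    for ts, replied in samples:
        if not replied and start is None:
            start = ts                      # outage begins
        elif replied and start is not None:
            outages.append(Outage(start, ts))  # outage ends on first reply
            start = None
        last_ts = ts
    if start is not None and last_ts is not None:
        # Log ended while still down: report the outage up to the last sample.
        outages.append(Outage(start, last_ts))
    return outages
```

Comparing the start of each window with the timestamp you wrote down for the failover gives the recovery time for that test. The resolution is only as good as the ping interval.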

 

Closing this post now. The client will check the connected switches and devices to understand why ARP requests/replies are not being delivered to the end host.
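
If it helps with the switch-side check, below is a minimal sketch for deciding whether a captured frame is a gratuitous ARP, so a capture taken near the end host can be compared with the one taken on the firewall. It assumes you already have the raw Ethernet frame bytes (for example, exported from a capture) and that the frame is untagged; 802.1Q-tagged frames are not handled. The helper names `parse_arp` and `is_gratuitous_arp` are hypothetical, and the test for "gratuitous" is simply sender IP equal to target IP.

```python
import ipaddress
import struct
from typing import Optional

ETHERTYPE_ARP = 0x0806
ARP_FORMAT = "!HHBBH6s4s6s4s"  # 28-byte ARP body for Ethernet/IPv4


def _mac(raw: bytes) -> str:
    return ":".join(f"{b:02x}" for b in raw)


def parse_arp(frame: bytes) -> Optional[dict]:
    """Return the ARP fields of an untagged Ethernet frame, or None if it is not ARP."""
    if len(frame) < 42:  # 14-byte Ethernet header + 28-byte ARP body
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_ARP:
        return None
    htype, ptype, hlen, plen, oper, sha, spa, tha, tpa = struct.unpack(
        ARP_FORMAT, frame[14:42]
    )
    # Only Ethernet (htype 1) carrying IPv4 (ptype 0x0800) is handled here.
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None
    return {
        "operation": oper,  # 1 = request, 2 = reply
        "sender_mac": _mac(sha),
        "sender_ip": str(ipaddress.IPv4Address(spa)),
        "target_mac": _mac(tha),
        "target_ip": str(ipaddress.IPv4Address(tpa)),
    }


def is_gratuitous_arp(frame: bytes) -> bool:
    """True when the frame is ARP and announces its own address (sender IP == target IP)."""
    arp = parse_arp(frame)
    return arp is not None and arp["sender_ip"] == arp["target_ip"]
```

If the firewall-side capture shows such frames right after the failover but the host-side capture does not, that would be consistent with the conclusion above that something in between is not delivering them.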


All Replies
Cyber Elite


@FarzanaMustafa wrote:

 

 

The next device on the network after the firewalls is a Cisco Nexus stack.

 

Monitor Fail Hold Down Time (min)=1

Monitor Hold Time (ms)=3000

 

Any idea what is going on here?


 

Nexus "stack"? Can you elaborate on the network architecture and how your HA interfaces are incorporated into the network?

 

The HA interfaces should be in an L2 VLAN, with no other ports anywhere on your network in that VLAN. The HA interfaces themselves should just be on normal access ports in that VLAN.

Cyber Elite

Hello,

What are you using as your test? Are you putting the active into suspend? Are you using ACI?

 

Please advise,

L4 Transporter

Hi @OtakarKlier and @Brandon_Wertz

 

HA is configured directly from one firewall to another without any network devices in between. We have four cables between the firewalls - two of them are used as primary HA links (Control + Data), and two ethernet interfaces are configured as backup HA interfaces (one for Control backup, and one for Data link backup, interfaces are not tagged).

 

We manually suspend the primary firewall to fail over to the secondary, then make the first one active again, and suspend the secondary to fail back over to the primary.

Cyber Elite

Hello,

Are your Nexus switches in vPC? I have a similar setup and my failover is almost instantaneous. Maybe open a TAC case to make sure everything is running as it should?

 

Regards,

Cyber Elite


@FarzanaMustafa wrote:

Hi @OtakarKlier and @Brandon_Wertz

 

HA is configured directly from one firewall to another without any network devices in between. We have four cables between the firewalls - two of them are used as primary HA links (Control + Data), and two ethernet interfaces are configured as backup HA interfaces (one for Control backup, and one for Data link backup, interfaces are not tagged).

 

We manually suspend the primary firewall to fail over to the secondary, then make the first one active again, and suspend the secondary to fail back over to the primary.


 

It makes no sense whatsoever that you would have anything other than a millisecond failover on firewalls that are directly connected to each other, let alone multi-minute outages.

 

My deployment is across an OTV WAN link hundreds of miles away and our failover is instantaneous.
