Best config to speed up HA failover

DraganMilojevic · ‎07-29-2019

During the last PAN-OS upgrade we had to fail over between two firewalls in an HA configuration. The failover took an unusually long time, during which Internet access was unavailable. It took approximately 10-15 lost pings (to an Internet host) for the passive unit to become active. We had a case open with PAN support at the time, and our Zoom meeting dropped and reconnected automatically after about 15 seconds. At one of my previous jobs the failover was very quick; I would lose 1 or 2 pings to 8.8.8.8.

Our HA setup is like this:

HA1 - over aux-1

HA2 - over eth1/10

Mode is active-passive / config sync is enabled / passive link state is auto / preemption is not set up / LACP/LLDP is not configured / link and path monitoring are enabled.

Wondering if someone has had a similar experience and what the solution was to speed up the failover.

OtakarKlier · ‎07-29-2019

Hello,

This is the best practice for upgrades. Hopefully you didn't just reboot the firewall and instead used the 'Suspend' feature.

https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA10g000000ClRrCAK

Regards,

Retired Member · ‎07-29-2019

Do you use PPPoE for the Internet connection?

DraganMilojevic · ‎07-29-2019

No, it is a "normal" ISP connection, 50Mbps

DraganMilojevic · ‎07-29-2019

Appreciate your concern; I have been working with PAN for quite some time and never had an issue with an OS upgrade, but that was not my question...

Sec101 · ‎07-29-2019

This sounds like a spanning-tree issue: the time it takes for that port to come up could be STP going through the listening and learning stages.

BPry · ‎07-29-2019

@DraganMilojevic,

What @OtakarKlier was rightly pointing out is that the proper upgrade procedure would likely have prevented any extended failover outage, as it's a 'clean' way of switching active status to the other peer. If you simply restarted the firewall as part of the upgrade without suspending it first, there are a variety of settings that could cause an extended period to elapse before traffic starts flowing through the peer unit. If that is the case, we would actually recommend looking at different log files instead of looking for configuration issues that would cause an extended failover time.

There are a variety of settings that I would look at to narrow it down. The first being whether LACP aggregates are in use at all, then the HA timer settings deployed on the device, whether STP is set up correctly on your switches (took me too long to type this reply, +1 to @Sec101 for technically being the first person to bring up STP), and lastly what @Retired Member mentioned with his PPPoE suggestion. You've already said no to two of the four, so the remaining two are things you should look into.

FYI, your comment to @OtakarKlier came off a little rude. Please keep in mind that some suggestions or comments, when answered, will lead to your solution. As mentioned earlier, the order that you performed the upgrade is actually highly important in knowing where we should actually be looking for issues. So if you didn't actually follow recommended procedures, we kinda need to know about it so we don't send you down a rabbit hole troubleshooting the wrong thing.

Unless someone has the title 'Community Manager', everyone that comments on this post is devoting time out of our day to help you answer a question/problem you are having. Please consider that when responding to someone spending part of their day helping others on this forum.

DraganMilojevic · ‎07-30-2019

Thanks Sec101, appreciate your comment; I don't think the issue is STP since the firewalls are connected directly, with no switch in between.

DraganMilojevic · ‎07-30-2019

Thanks BPry, appreciate your comment and the time spent answering the question.
The 'power off' upgrade process is unlikely to apply in my case: as I mentioned in the original post, at my previous jobs I performed PAN-OS upgrades successfully, losing only 1 or 2 pings, which makes me believe I followed the upgrade process properly and did not just power off the firewall. I would expect powering off the firewall to cause more lost pings.
I am leaning more towards LACP settings at this moment.
I am a strong believer that this community is a great place to get answers and share ideas, best practices, tips and tricks, and that everyone's time is valuable, including mine. Having been in this line of work for quite some time, I understand the importance of the right information, so I try to put as much useful detail in the original post as possible, which should point people willing to assist in the right direction. If there is not enough information, it is much easier to ask a question than to make assumptions. I think everyone will benefit from this.

OtakarKlier · ‎07-30-2019

Hello,

Also, in version 8 you can modify the parameters that are used to speed this up a bit: Device tab -> High Availability -> General tab -> Election Settings. I would say set them to Aggressive and give it a test.

Regards,

DraganMilojevic · ‎08-07-2019

Thanks, I will try that in my next maintenance window.

Cheers

jeremy.larsen · ‎08-07-2019

I'm going to concur with Sec101 on this. If your switch ports are not set to go straight into the forwarding state, you may have delays while STP goes through its learning process. 15s sounds about right. If you are using Cisco equipment, that's the default learning timer. I'm not sure how this is affected by Passive Link State: auto, but you should probably have these ports in "portfast" mode (Cisco speak) or "edge" mode (Juniper speak) either way.

https://www.cisco.com/c/en/us/support/docs/lan-switching/spanning-tree-protocol/19120-122.html

forward delay—The forward delay is the time that is spent in the listening and learning state. This time is equal to 15 sec by default, but you can tune the time to be between 4 and 30 sec.
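To put numbers on the timer quoted above: in classic 802.1D spanning tree, a port that comes up spends one forward-delay interval in the listening state and another in the learning state before it forwards traffic, so the defaults work out to roughly 30 seconds unless the port is an edge/portfast port. Below is a minimal Python sketch of that arithmetic. The function names and the one-ping-per-second default are illustrative assumptions, and it does not model RSTP or any firewall-side HA timers.

```python
def stp_blocking_seconds(forward_delay: int = 15, portfast: bool = False) -> int:
    """Approximate seconds a classic 802.1D port waits after link-up
    before it starts forwarding traffic."""
    # The Cisco document quoted above gives the tunable range as 4 to 30 seconds.
    if not 4 <= forward_delay <= 30:
        raise ValueError("forward delay must be between 4 and 30 seconds")
    if portfast:
        # Edge/portfast ports skip listening and learning entirely.
        return 0
    # One forward-delay interval in listening plus one in learning.
    return 2 * forward_delay


def expected_lost_pings(outage_seconds: float, ping_interval: float = 1.0) -> int:
    """Rough count of echo requests that go unanswered during an outage,
    assuming one request is sent every ping_interval seconds."""
    return int(outage_seconds // ping_interval)


if __name__ == "__main__":
    delay = stp_blocking_seconds()  # default timers, no portfast
    print(delay, expected_lost_pings(delay))
```

If the observed loss is far from what this predicts for the switch's actual settings, STP is probably not the only contributor and the firewall-side timers deserve a look as well.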

PS - Also agree with using "suspend" instead of just reboot for HA failover. Upgrade the Passive node first thus reducing your failover to a single event as well.

dgsans · ‎03-29-2023

Dragan, wondering if you ever got a solution for this issue?

We are having the same issue. I have a new pair of PA-440s with 8 Ethernet ports plus one management port. There are no dedicated HA ports on this model.

Firewalls are currently running PAN-OS 10.2.3-h4.

Eth1/1 is the external interface

Eth1/2 is the internal network

Eth1/3 is the management network

Eth1/4 is unused

Eth1/5 is used for HA1

Eth1/6 is used for HA1 Backup

Eth1/7 is used for HA2

Eth1/8 is used for HA2 Backup

HA setup is active/passive

The management port goes to a switchport configured for the management VLAN.

There is a Cisco Catalyst 9300 switch. Three VLANs are configured for EXTERNAL, INTERNAL and MANAGEMENT.

All the switch ports that the PA firewalls are connected to are set to portfast mode.

I have a constant ping running from a computer behind the firewall to 8.8.8.8

When I go to the active firewall to Device, High Availability, Operational Commands and suspend it, I lose ping responses for close to 30 seconds and then it recovers.
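For comparing runs after each config change, it can help to turn the ping results into a single outage figure instead of eyeballing the terminal. Here is a minimal Python sketch, assuming each echo request has already been recorded as True (reply) or False (no reply) and that requests go out at a fixed interval. The function name is hypothetical, and collecting the results from the ping tool is left out.

```python
from typing import Iterable


def longest_outage_seconds(replies: Iterable[bool], interval: float = 1.0) -> float:
    """Length in seconds of the longest run of consecutive lost pings."""
    longest = 0
    current = 0
    for ok in replies:
        # A reply ends the current run of losses; a loss extends it.
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest * interval


if __name__ == "__main__":
    # Made-up sample: 5 replies, 30 losses, 5 replies, at one ping per second.
    sample = [True] * 5 + [False] * 30 + [True] * 5
    print(longest_outage_seconds(sample))  # 30.0
```

Using the longest consecutive run rather than the total loss count keeps an unrelated dropped ping elsewhere in the capture from inflating the failover figure.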

The HA election settings were initially set to Recommended. I set them to Aggressive on both, committed the changes, and suspended the current active one -- still 30 seconds of data loss. I then switched to Advanced mode to try setting some values lower, but several of them are apparently already at their minimum value.

No aggregate ports are configured on the firewalls, no LACP is in use -- all connections are singletons.

Config sync is on

Anyone have any suggestions?

Raido_Rattameister · ‎03-29-2023

30 seconds of data loss is definitely a configuration issue.

Check your switch configs (spanning tree, LACP etc).

HA default timers will give you 0-1 lost ping during failover. No need to adjust HA to aggressive mode.

By default the passive unit keeps its interfaces shut down, and it can take time for the switch to bring those interfaces up.

Auto mode gives faster failover because the switchport is already up, but without knowing the config on the switch side I can't give any recommendations.

Principal Architect @ Cloud Carib Ltd
Palo Alto Networks certified from 2011

dgsans · ‎03-30-2023

I did find one config setting to improve on the PAs. Under Device, High Availability, General, Active/Passive Settings, I had Passive Link State set to Shutdown -- I changed this to Auto on both firewalls, committed and tried again. Now it's about 12-13 seconds of data loss during the switchover.

I am not using LACP on the switch -- each firewall has one connection to one switch for each of the three VLANs.

All the switch ports that the firewalls are connected to have portfast enabled. Are there other spanning tree related configurations to check on a Cisco switch when set for portfast?

Thanks
