I am seeing an issue in an active/active setup. When I pull the plug on one member, OSPF does its job: a few packets are lost, but that is within an acceptable range. My issue arises when the inactive member comes back online and rejoins the HA pair. I can't confirm this, but it feels as if the OSPF routes are coming up before the FW services are ready to process packets. This causes an outage of a few seconds instead of a near-instantaneous failback. I have configured BFD in the hope that it would alleviate the issue, but I am still seeing drops of up to a few seconds. We are looking for near-instantaneous failover and failback (so please, no arguments about Active-Passive being easier, etc.). If this is the cause, I'm hoping I can somehow adjust how the failed member rejoins the active/active pair so that it fully syncs all data before the OSPF process even begins.
Would this have anything to do with the convergence time? Could any other routes be causing this, e.g. another dynamic routing protocol or static routes?
Just a few thoughts.
The OSPF process doesn't appear to start until the HA pair is active again. The problem happens after the OSPF routes come up and traffic starts flowing to the previously down firewall. It is this traffic that appears to be dumping into a black hole temporarily.
I do not know if the problem is still relevant for you, but I encountered the same issue and recently found a fix.
If a device switches into the state Tentative, we encounter almost no outage. But when failing back, we have about 5 seconds of connectivity loss. This happens right after the OSPF Link State Databases have been fully exchanged between the firewall and the connected router.
I found out that we don't simply encounter traffic loss, but a TTL exceeded error for packets routed towards the firewall.
Here is what happens:
When the firewall becomes active, it establishes OSPF connectivity with the router. After the LS Databases are fully exchanged, the router instantly recalculates the new best paths and installs them in its routing table. It learns that the Palo Alto is the new best next hop for certain networks and starts sending packets to the firewall.
The Palo Alto, on the other hand, does not instantly recalculate the best paths! By default, it waits 5 seconds before starting the SPF process. Within this time frame, the router routes packets towards the firewall, while the firewall routes them wherever it deems best based on the old routing table. In our case, this meant sending them back to the router, so the packets bounced between the two devices until their TTL expired.
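To make the loop easier to picture, here is a minimal Python sketch of what I believe happens during those 5 seconds. It is only a toy model, not anything taken from the firewall: the node names (`router`, `firewall`, `server`), the prefix label `net` and the starting TTL of 8 are all made up for illustration.

```python
# Toy model of the failback routing loop. All names and values below are
# hypothetical; they only illustrate the mechanism described above.

def forward(ttl, tables, start, dest):
    """Follow next hops from `start` until the packet is delivered
    or its TTL reaches zero. Returns (result, path_taken)."""
    node = start
    path = [node]
    while ttl > 0:
        next_hop = tables[node][dest]
        if next_hop == "deliver":      # destination is attached to this node
            return "delivered", path
        ttl -= 1                       # each hop decrements the TTL
        node = next_hop
        path.append(node)
    return "ttl_exceeded", path

# Router has already run SPF and points at the firewall.
# Firewall is still using its old table, which points back at the router.
stale = {
    "router":   {"net": "firewall"},
    "firewall": {"net": "router"},
    "server":   {"net": "deliver"},
}

# After the firewall finally runs SPF, it forwards towards the destination.
converged = {
    "router":   {"net": "firewall"},
    "firewall": {"net": "server"},
    "server":   {"net": "deliver"},
}

print(forward(8, stale, "router", "net"))      # packet ping-pongs until TTL runs out
print(forward(8, converged, "router", "net"))  # packet reaches the server
```

With the stale table the packet just alternates between router and firewall until the TTL is used up, which matches the TTL exceeded errors we saw. Once the firewall has recalculated, the same packet is delivered normally.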
You can change this SPF calculation delay in the OSPF settings:
With this setting, the failback only causes about 1 second of outage.
I know this is old, but thank you. I will lab that up when I get time. It was such an issue that we went to Active/Passive, where we only experience switchovers of 1.5s at most.