failing back to primary FW and short loss of ISP connection

arcnsparc · ‎12-28-2022

Good evening,

Tomorrow I'm cutting over a new pair of 3410s. I have 3 LAG connections (AE.1, AE.11, and AE.10). AE.11 is the physical connection to my ISP switch. There are two L3 subinterfaces (VLAN 800 & 801). VLAN 800 = ISP1 and VLAN 801 = ISP2. Both ISP routes are static and have the same metric / AD. I'm using ECMP and it works well in my testing.

When I fail over (using reboot to simulate power loss), the passive FW goes active immediately and if I'm lucky, I may see one PING packet drop on both ISP links. The failover is impressive.

When failing back to the primary FW, I lose ISP2 for approximately 12-20 seconds. I've configured the election settings to "standard" and tried 1 min, 2 min, and 5 min for the Preemption Hold Time. "Preemptive" is checked on both FWs, the primary's Device Priority is set to 100, and the secondary's is set to 200. It works well except that when failing back to the primary (preemption), the ISP2 circuit drops for 12-20 seconds. The ISP1 circuit may drop 1 packet, but is more consistent.
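For anyone who wants to measure the drop per ISP the same way, here is a minimal sketch of how the outage length can be computed from timestamped ping results. It assumes one probe per second logged as (timestamp, reply received) pairs; the function name and sample format are my own and not tied to any Palo Alto tool.

```python
from typing import List, Tuple

def outage_windows(samples: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (start, duration) for each run of consecutive lost probes.

    samples: (timestamp_seconds, reply_received) pairs in time order.
    Duration runs from the first lost probe to the next successful one.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts  # first lost probe of a new outage
        elif ok and start is not None:
            windows.append((start, ts - start))
            start = None
    if start is not None:
        # Still down when the capture ended: measure to the last sample.
        windows.append((start, samples[-1][0] - start))
    return windows

# Illustrative values only, not measurements from my firewalls.
example = [(0.0, True), (1.0, False), (2.0, False), (3.0, True)]
print(outage_windows(example))
```

Running one capture per ISP gateway during a preemption test makes it easy to compare the two circuits side by side.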

ISP1 has a /24 interface bound to VLAN 800 (I'm not crazy about this, but I was handed this situation).

ISP2 has a /30 interface bound to VLAN 801.

Both are L3 sub interfaces on the same LAG (AE11.801 and AE11.800).

I thought maybe LACP could be an issue, but if it were, it would impact both ISPs as they traverse the same LAG. The upstream ISP switches are a pair of Extreme 10/100/1G switches in an MLAG configuration using Extreme ELRP for loop detection and prevention.

Any advice on other timers, tweaks, troubleshooting steps, etc. is greatly appreciated. Both ISPs are static and not BGP. Both have the same AD and Metric (10 & 10).

Regards!

arcnsparc · ‎12-28-2022

I did test manual failover by using "suspend local device for high availability" and can go back and forth between both FWs and maybe lose 1 packet. It's only when using Preemption that this scenario occurs for ISP2.

I had a typo: the election settings are set to "recommended".

Thanks

OtakarKlier · ‎12-29-2022

Hello,

I "think" this is layer 2, e.g. STP, messing with settings and MAC address advertisement. Or it could be the hold timers on the PAN, etc. I'm not sure how the Extreme switches deal with failover and clearing MAC tables, etc., but I know Cisco has a lag and I've just come to accept ~1-2 minutes of downtime. This article kinda goes into how to prevent that by keeping the passive interfaces in an 'UP' state.

https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA10g000000ClcACAS

Also check the following to see if it applies to your scenario:

https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA10g000000ClHnCAK

Try to disable Spanning Tree on the switches if you can.
- When spanning tree is enabled on a switch port, it will not immediately start to forward data. It will instead go through a number of states while it determines the topology of the network. This can cause a delay of 30-50 seconds before traffic starts to be forwarded. This applies to the original Spanning Tree Protocol (STP) defined by IEEE 802.1D.
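The 30-50 second figure follows from the 802.1D default timers: a port spends one forward-delay interval (15 seconds) in each of the listening and learning states, and when a failure is only noticed through BPDU timeout, the max-age interval (20 seconds) has to expire first. A quick sketch of that arithmetic, assuming default timers (your switches may be configured differently, and the function name is just for illustration):

```python
# IEEE 802.1D default timer values, in seconds.
FORWARD_DELAY = 15  # spent once in listening and once in learning
MAX_AGE = 20        # how long stored BPDU information stays valid

def stp_forwarding_delay(forward_delay: int = FORWARD_DELAY,
                         max_age: int = MAX_AGE,
                         wait_for_max_age: bool = False) -> int:
    """Seconds before a classic STP port starts forwarding traffic."""
    delay = 2 * forward_delay  # listening + learning
    if wait_for_max_age:
        delay += max_age       # stale BPDU info must age out first
    return delay

print(stp_forwarding_delay())                       # 30
print(stp_forwarding_delay(wait_for_max_age=True))  # 50
```

A 12-20 second outage is shorter than either value, so classic STP alone may not be the whole story, but it is still worth ruling out.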

Here is a link to a bunch of HA articles:

https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA10g000000ClIbCAK

I hope this helps explain some of what you are seeing.

Regards,
