Multiple Virtual Routers in a single system - Issues with Failover in an Active/Standby setup



L2 Linker

Hello,

 

We have a pair of PA-5050 appliances running two virtual routers to provide dual-ISP connectivity, with eBGP peering between the VRs. We also have site-to-site and AWS VPNs terminating on VR1, and GlobalProtect VPNs terminating on VR2.

 

The issue we have seen is that when upgrading PAN-OS, the appliances fail over and services on VR1 follow, but services on VR2 do not work once traffic fails over to the standby appliance. A config comparison between the active and standby units shows an exact match. It should be noted that VR1 is not the default VR.

 

These appliances are managed by a Panorama management server, and I note that in the Panorama template VR1 is listed as part of "vsys1", whereas VR2 shows "none" in the vsys field.
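For what it's worth, the same association can be checked on the firewall itself from configure mode (paths below are from memory, so treat this as a sketch):

> configure
# show vsys vsys1 import network

If VR2 is genuinely not imported into any vsys, it should be absent from the virtual-router list in that output.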

 

Could this have something to do with the failure to restore services on VR2 when it fails over? I have not been able to find any Palo Alto documentation on this, and the Panorama side of the configuration is often overlooked in the PA configuration guides.

 

Alternatively, can anyone suggest another reason why failover for a particular vsys fails? Is there a CLI method to check replicated sessions and routing state between the active and standby appliances, to determine what is actually available on the standby after a failover?
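For reference, these are the only operational commands I have found so far (run on the passive unit); I am not sure they give the full picture of what is synchronised:

> show high-availability state
> show high-availability state-synchronization
> show session all
> show routing route virtual-router VR2

The first two show the HA role and the runtime state-sync counters, "show session all" should list any sessions synced from the active unit, and the last shows what the passive unit holds in its RIB for VR2.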

 

Thanks in advance for any support in this matter.

3 REPLIES

L3 Networker

Hi,

 

I can suggest using Device > Troubleshooting to test routing through each virtual router on both firewalls and compare the output.

From the information given, this looks like a route-leaking or inter-VR routing issue, so please check the routing in both directions, outside-to-inside and inside-to-outside.
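If you prefer the CLI, the equivalent FIB lookup can be run on both firewalls and the outputs compared (the destination IP below is just an example; use addresses relevant to your environment):

> test routing fib-lookup virtual-router VR1 ip 8.8.8.8
> test routing fib-lookup virtual-router VR2 ip 8.8.8.8

Each lookup should return the matching route, egress interface, and next hop for that virtual router; a mismatch between the active and passive units would point to the routing problem.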

 

(screenshot attached: test-config.PNG)

 

Best Regards,
Suresh

Hello,

 

Apologies for the delay in responding to this post; my Palo Alto topology has been causing nothing but grief these past weeks.

 

We discovered the following when attempting a failover once again:

 

The BGP peering between the virtual routers is (obviously) in a down state on the passive PA.

When failover occurs, BGP between the virtual routers and to any external BGP peers initiates and eventually establishes (provided the peer detects the session as down; see the note below about the Cisco router).

The only routing state replicated between the active and passive units is the forwarding table (FIB), preserved via graceful restart; the routing table (RIB) is not replicated.

When failover occurred, graceful restart was doing its job between the VRs, but because the downstream external BGP peers (Cisco routers) were not configured for graceful restart, they could not re-establish with the PA until the BGP hold timers had expired (default 120-180 seconds).

I don't think graceful restart would be a useful feature to enable on the Cisco, as the issue appeared to be more that the Cisco never recognised the BGP peer as lost: the MAC/IP was carried over to the passive (now active) unit, so as far as the Cisco was concerned everything was normal.

A manual clearing of the Cisco-PA BGP peering restored service (commands sketched below).
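For anyone hitting the same thing, the checks and the manual clear looked roughly like this (the peer address is illustrative):

On the PA, to confirm the state of the VR2 peerings after failover:

> show routing protocol bgp peer virtual-router VR2

On the Cisco, to tear down the stale peering and trigger an immediate re-establish:

clear ip bgp 203.0.113.1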

 

So, all in all, I think we are looking at bringing in faster BGP timers, but I cannot yet find an answer as to how the Cisco router can detect a failover to a PA that carries over the same MAC/IP and sends GARPs.
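In case it is useful to anyone else, the sort of Cisco-side change we are considering looks like this (the AS number and neighbour address are placeholders, and we have not yet tested it in our environment):

router bgp 65001
 ! reduce keepalive/hold from the 60/180-second defaults so a dead peer is detected sooner
 neighbor 203.0.113.2 timers 10 30

BFD ("neighbor 203.0.113.2 fall-over bfd") is another option for faster detection, though I have not verified how it behaves across a PA HA failover where the MAC/IP move with a GARP.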

 

Regards

Cyber Elite

Hi @GrantCampbell4 ,

 

BGP NSF Awareness is not enabled by default on Cisco routers. NSF awareness is needed for the router to recognize the BGP OPEN message sent by the newly active firewall, re-establish the session, and repopulate its BGP RIB. The "bgp graceful-restart" global command (it can also be configured per neighbor or per template) enables both the NSF-aware (assist a restarting neighbor) and NSF-capable (perform SSO if the hardware supports it) functions on the router.
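A minimal sketch (the AS number and neighbor address are placeholders):

router bgp 65001
 ! global: negotiate the graceful-restart capability with all neighbors
 bgp graceful-restart
 ! or, instead of the global command, enable it for a single neighbor:
 neighbor 203.0.113.2 ha-mode graceful-restart

Note that the capability is only exchanged in the BGP OPEN message, so the session has to be reset once (e.g. "clear ip bgp 203.0.113.2") after enabling it before GR takes effect.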

 

This configuration on the Cisco side should allow the BGP session to be re-established without the outage (neighbor down, routes withdrawn) caused by hold-timer expiry. With GR enabled, you should keep the long hold timers.
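Once the session re-forms, you can verify that the capability was negotiated (the neighbor address is a placeholder, and the exact output wording varies by IOS release):

show ip bgp neighbors 203.0.113.2 | include Graceful

You want to see the Graceful Restart capability listed as both advertised and received.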

 

Thanks,

 

Tom

Help the community: Like helpful comments and mark solutions.