Odd behavior around ISP Failover with Static Route Path Monitoring


L1 Bithead

Hi,

 

I recently ran into unexpected failover behavior with static route path monitoring. We have 3 ISPs, and this past weekend 2 of them went down at different times (hooray). For the purposes of this post, I will talk about just one of them.

 

Interestingly, path monitoring worked when the failure occurred: the routes failed over to the secondary (and eventually the tertiary) ISP. However, when the ISPs came back up, the routes did not recover. I'm specifically looking at the default route, which is defined as a static route for destination 0.0.0.0/0.

 

In my experience with our Internet providers, sometimes the ISP CPE device goes down, and sometimes the service upstream of it is down. To account for both cases, I configured 2 destinations to check in the path-monitoring box. One is the edge CPE IP (which can be pinged from both inside and outside the network under normal/specific circumstances) and the other is the public DNS resolver at 1.1.1.1.

 

The IP on the PA interface connecting to the ISP is on the same subnet as the IP on the ISP CPE.

 

While testing and troubleshooting, I could see inbound layer-2 traffic from the edge CPE all the way to the PA interface. However, I saw very little, maybe even no (I did not specifically note, unfortunately) outbound traffic from that very same PA interface. Pings from that interface to the CPE IP failed, while I could ping the CPE across the Internet from another PA interface used for a different ISP. It felt like very weird behavior.

 

The preemptive hold time is set to what I believe is the default of 2 minutes.
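
For reference, here is a rough Python sketch of how I understand the preemptive hold time to behave: the route is only put back once the monitored path has stayed up continuously for the whole hold period, and any failed probe resets the timer. This is my mental model, not PAN-OS code. The function name, the sample format, and the timestamps are all made up for illustration.

```python
def route_reinstalled_at(samples, hold_seconds=120):
    """Return the timestamp at which the route would be reinstalled, or None.

    samples: list of (timestamp_seconds, monitor_up) tuples in time order.
    hold_seconds: assumed preemptive hold time (2 minutes = 120 seconds).
    """
    up_since = None
    for timestamp, monitor_up in samples:
        if not monitor_up:
            up_since = None  # any failed probe resets the hold timer
            continue
        if up_since is None:
            up_since = timestamp  # first successful probe of this up streak
        if timestamp - up_since >= hold_seconds:
            return timestamp
    return None


# Hypothetical probe history: one flap at t=90 restarts the timer.
history = [(0, False), (30, True), (60, True), (90, False),
           (120, True), (180, True), (240, True)]
print(route_reinstalled_at(history))
```

If the model is right, a monitor that never reports up (my situation) means the route never comes back, no matter how long the ISP has been healthy.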

 

image.png

 

Theories I have based on guessing and reading around the Internet:

 

1. Since one of the monitored paths is not within the interface's subnet, it could not be determined to be in anything but a failed state. If this is the case, what is my best option for determining whether the ISP upstream from the connecting device is down?
2. Something strange is going on because the next-hop IP is the same as one of the destination IPs being monitored. I don't expect this to be the case, because if so, what is my option for determining whether the connected upstream device is connected, yet unavailable for some reason?
3. ...something to do with the interface on the PA having IPs without an explicit subnet defined? Thus, it couldn't know that the CPE IP was in its subnet to check? If this seems valid, then I double down on the question raised in point #1. Edit (#7? 8?): As I come back for yet another edit, this point #3 is seeming more and more valid. Neither of the ISPs that failed and did not recover successfully has a subnet defined on the IPs associated with its interface, while the ISP that didn't fail (or maybe did at some point, but recovered so smoothly that I never saw it) has a subnet attached to the IPs of the interface on the PA device. Hmm.
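
To illustrate theory #3: as far as I understand it, an interface address entered without a mask is treated as a /32 host address, which would mean no other IP (including the CPE) counts as being on that interface's subnet. Python's ipaddress module happens to default the same way, so it makes for a quick demonstration. The addresses below are documentation-range placeholders, not my real ones, and on_link is just a helper name I made up.

```python
import ipaddress


def on_link(interface_addr, target):
    """Return True if target falls inside the subnet implied by interface_addr."""
    iface = ipaddress.ip_interface(interface_addr)  # no mask given -> /32
    return ipaddress.ip_address(target) in iface.network


# Placeholder addresses: PA interface 203.0.113.2, CPE 203.0.113.1.
print(on_link("203.0.113.2", "203.0.113.1"))     # bare address, parsed as /32
print(on_link("203.0.113.2/30", "203.0.113.1"))  # explicit /30
```

The first call reports that the CPE is off-link and the second reports that it is on-link, which matches the difference between my two failed ISPs and the one that recovered.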

Any comments or insights would be immensely appreciated. Thanks!

2 REPLIES

Thanks for your post. I am seeing the exact same behavior both in 8.1.12 and now 9.0.6 on a PA-220. My workaround (I'm not sure why it works) was to add another static route on my primary ISP interface for 1.1.1.1/32 with a next hop of the ISP gateway. When I test with this route in place, the route monitor comes up. Without this additional route, it stays in a down state even when the interface is back up and functioning.

 

I've tested this before and used it in production with clients, and I feel like this is something that broke somewhere in the 8.1.x code and has remained broken in later releases.

In my case, this was the result of a Zone Protection Profile applied to my "untrust" zone, which included both ISP interfaces. It was dropping the ICMP replies due to the strict IP address check. I disabled this and now everything works as expected.
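
Here is a small Python sketch of why I think this also explains the 1.1.1.1/32 workaround above. It assumes the strict IP check only accepts a packet when its source address routes back out the same interface it arrived on; that rule is my reading of the feature, not something confirmed in this thread. The interface names and route tables are hypothetical.

```python
import ipaddress


def egress_interface(routes, address):
    """Longest-prefix match: return the interface of the most specific route."""
    addr = ipaddress.ip_address(address)
    best = None
    for prefix, interface in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, interface)
    return best[1] if best else None


def passes_strict_check(routes, source, ingress_interface):
    """Assumed rule: accept only if the source routes back out the ingress interface."""
    return egress_interface(routes, source) == ingress_interface


# Hypothetical table after failover: the default route now points at ISP 2.
failed_over = [("0.0.0.0/0", "isp2")]
# Same table plus the workaround host route via the primary ISP.
with_host_route = failed_over + [("1.1.1.1/32", "isp1")]

# ICMP reply from 1.1.1.1 arriving on the primary ISP interface.
print(passes_strict_check(failed_over, "1.1.1.1", "isp1"))
print(passes_strict_check(with_host_route, "1.1.1.1", "isp1"))
```

Under that assumption, once the default route has failed over, replies from 1.1.1.1 arriving on the primary interface get dropped, so the monitor can never see the path as up again. Either the /32 route or disabling the check lets the replies through.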
