Forwarding Decisions in PANOS

locampo · ‎11-11-2017

Hey guys. I'm fairly new to PANOS, coming from a long stint as an IT generalist with a strong interest in networking, and I've finally landed a dedicated SEM network engineering role. Having said that, we recently encountered a situation that confused me greatly.

Context: we’ve got several branch offices with Metro-Ethernet connections and routes back to our main datacenter. We have local DHCP relay agents that relay DHCP requests at these sites to our central DHCP server in our main datacenter. The DHCP traffic is relayed THROUGH the PA FW, which then routes it back to HQ via the Metro-E due to a static route with a metric of 10 for our infrastructure subnet (172.30.0.0/16). This has worked fine for years.

Last week, we saw DHCP relay traffic suddenly getting routed out the default route rather than the static route, and therefore failing. We did packet captures and could see the traffic egressing our outside interface and being NAT’d, even though the sessions for this traffic in the monitor continued to identify the Metro-E interface as the destination interface. We confirmed the static route existed in the forwarding table and that it should have matched, but something wasn’t right.

We worked for about 2 hours with PA support, and the tech finally told us that the default route’s metric needed to be changed from 10 to 1. Doing so fixed the issue and traffic again flowed as intended.

But this makes no sense to me.

First of all, this worked for years with no issue, so why would it suddenly be an issue? And at multiple locations/firewalls??

Second, my understanding of TCP/IP routing is that traffic gets routed to the most specific matching route, based on the mask. Thus, the default route (0.0.0.0/0) only wins if there’s no other match. Since the /16 route was there and matched, it should have taken precedence. Metrics only matter in the case of a tie... right??? The PA tech was suggesting that the default route would be selected over a more specific route if they had equal metrics. Is that accurate?
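To make the rule in question concrete, here is a minimal Python sketch of longest-prefix-match selection as conventional IP routing defines it. It is not a description of PAN-OS internals: the route list, the interface names (`outside`, `metro-e`), and the `lookup` helper are all hypothetical, and administrative distance is ignored for simplicity.

```python
import ipaddress

# Hypothetical routes modeled on the thread: (prefix, metric, egress interface).
# The default route deliberately has the LOWER metric.
ROUTES = [
    ("0.0.0.0/0", 1, "outside"),
    ("172.30.0.0/16", 10, "metro-e"),
]

def lookup(dest, routes):
    """Return the egress interface for dest.

    Longest prefix wins; metric is only a tie-breaker between
    routes with the same prefix length.
    """
    addr = ipaddress.ip_address(dest)
    candidates = [
        (ipaddress.ip_network(prefix), metric, iface)
        for prefix, metric, iface in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    if not candidates:
        return None
    # Sort key: prefix length first, then lower metric preferred.
    _, _, iface = max(candidates, key=lambda r: (r[0].prefixlen, -r[1]))
    return iface

print(lookup("172.30.5.10", ROUTES))  # metro-e
print(lookup("8.8.8.8", ROUTES))      # outside
```

Under this model the /16 wins for 172.30.5.10 even though the default route carries a lower metric, which is the behavior the question assumes.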

Perhaps my understanding of TCP/IP routing is fundamentally flawed. Can someone please either enlighten me or else confirm that my understanding is accurate?

If it’s not that I’m completely off base here, then perhaps PANOS behaves slightly differently in how it makes forwarding decisions. Can anyone shed some light? I tried looking for official documentation but couldn’t find a document on specifically how PA devices make forwarding decisions to verify or refute what I was told.

pulukas · ‎11-12-2017

There is nothing wrong with your understanding of the routing situation.

This sounds like a bug with a workaround. I suspect the trigger was one of the updates that automatically apply to either threats or application identification, and the route change is just a workaround that sidesteps the bug.

Steve Puluka BSEET - IP Architect - DQE Communications (Metro Ethernet/ISP)
ACE PanOS 6; ACE PanOS 7; ASE 3.0; PSE 7.0 Foundations & Associate in Platform; Cyber Security; Data Center

locampo · ‎11-13-2017

That was my thought, too.

The only thing I don't understand is how Threats and/or Applications would apply to the basic routing engine of the FW. My understanding of packet processing in PANOS is that the route lookup happens early in the process but actual forwarding is the last step, so I guess it is possible that a bug somewhere in the middle could cause the issue. That may even explain the disconnect between what we're observing in the monitor (which says the traffic should get routed out the intended/correct interface, following the specific route) vs. packet captures (which show traffic being forwarded out the wrong/unintended interface, following the default route). A route lookup from the CLI for the destination IP showed the traffic should still have been hitting the route that it used to hit, so the lookup process itself is working... something else is causing the actual forwarding decision to differ from the FIB lookup result. A bug in the application engine could also explain why this issue seems to affect only DHCP traffic and not other traffic. Pings, for example, were working and being routed normally at the same time the DHCP traffic was being mis-routed.

We have an open case with support, but they don't seem to think this is a bug or anything unusual. The tech was telling us that this is the expected behavior... (?!) ... and that if the default route has a lower metric it would win out over the more specific route with the higher metric. That's despite the fact that this was only affecting some (not all) traffic and that it had worked for years as configured. I'll have to try to get the case escalated.

I'd still like to see a formal document for how PANOS makes forwarding decisions. There's some very nice documentation on packet processing (https://live.paloaltonetworks.com/t5/Learning-Articles/Packet-Flow-Sequence-in-PAN-OS/ta-p/56081) but the part dealing with actual forwarding decisions / routing table lookups is glossed over quite non-specifically. Probably because basic TCP/IP routing is assumed to be standard and known information.

pulukas · ‎11-13-2017

I think you are on the right track; get the case escalated. The explanation does not make sense.

Yes, PA documentation is light on the routing features. But even so, I don't think you would find help there in this case anyway.

locampo · ‎11-14-2017

Thanks.

So, I got a response from a new engineer who confirmed the route metric was not involved. It turns out that, after changing the metric, the prior engineer also cleared all existing DHCP sessions from the FW. It looks like that was what fixed the issue, not changing metrics.

After looking closer at the logs, it appears we had layer 2 connectivity drops on our Metro-Ethernet links at multiple sites. This caused the static route to be removed from the forwarding table at that time (because the associated interface went offline). My understanding is that, when that happened, the existing sessions from the relay agents to the DHCP server were no longer valid and were rebuilt, which triggered a new route lookup to determine the destination zone for the sessions. The traffic was then routed (and NAT'd) out our outside interface via the default route. Once the Metro-Ethernet links came back up, the route was re-added to the forwarding table, but this did not cause the existing sessions to get rebuilt. The packets began to get routed appropriately, but since the session zone information did not change, the sessions were incorrect (inside>outside) and this was causing the traffic to still get NAT'd. Essentially, the observations that didn't make sense were due to the fact that the sessions were no longer valid for the actual path the traffic was taking. Clearing the sessions was what fixed the issue.
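The sequence described above can be sketched as a toy model in Python. This is only an illustration of the explanation given in this post, not of how PAN-OS is actually implemented: the `ToyFirewall` class, its methods, the zone and interface names, and the assumption that NAT applies only to sessions whose destination zone is `outside` are all hypothetical.

```python
import ipaddress

class ToyFirewall:
    """Toy model: a session caches the destination zone and NAT decision at setup."""

    METRO_ROUTE = ("172.30.0.0/16", ("metro-e-if", "metro"))

    def __init__(self):
        # prefix -> (egress interface, destination zone)
        self.fib = {"0.0.0.0/0": ("outside-if", "outside")}
        self.fib[self.METRO_ROUTE[0]] = self.METRO_ROUTE[1]
        self.sessions = {}  # (src, dest) -> cached session state

    def route(self, dest):
        """Longest-prefix match against the current forwarding table."""
        addr = ipaddress.ip_address(dest)
        nets = [ipaddress.ip_network(p) for p in self.fib]
        best = max((n for n in nets if addr in n), key=lambda n: n.prefixlen)
        return self.fib[str(best)]

    def forward(self, src, dest):
        """Return (egress interface, session zone, NAT applied?) for one packet."""
        iface, zone = self.route(dest)
        key = (src, dest)
        if key not in self.sessions:
            # Session setup: zone and NAT are decided once and cached.
            self.sessions[key] = {"to_zone": zone, "nat": zone == "outside"}
        session = self.sessions[key]
        return iface, session["to_zone"], session["nat"]

    def link_down(self):
        # Route withdrawn; sessions that used it are invalidated and rebuilt later.
        del self.fib[self.METRO_ROUTE[0]]
        self.sessions.clear()

    def link_up(self):
        # Route re-added, but existing sessions are NOT rebuilt.
        self.fib[self.METRO_ROUTE[0]] = self.METRO_ROUTE[1]

    def clear_sessions(self):
        self.sessions.clear()

fw = ToyFirewall()
flow = ("10.1.1.1", "172.30.1.5")
print(fw.forward(*flow))  # ('metro-e-if', 'metro', False)
fw.link_down()
print(fw.forward(*flow))  # ('outside-if', 'outside', True)
fw.link_up()
print(fw.forward(*flow))  # ('metro-e-if', 'outside', True)  stale session still NATs
fw.clear_sessions()
print(fw.forward(*flow))  # ('metro-e-if', 'metro', False)
```

The third line of output is the confusing state: the packet follows the restored route, yet the cached session still carries the old zone and NAT decision until the sessions are cleared.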

I understand this is how stateful firewalls function and is not a PANOS-specific problem, yet it does seem odd to me that traffic no longer matching its parent session would not trigger the FW to reassess or clear the session in some way. Stateful firewalls have been around a long time, but no one has come up with a good way to mitigate these kinds of issues? Wondering if anyone knows of any technologies and/or configurations that could have helped prevent this.

pulukas · ‎11-15-2017

Yes, this is a general weakness of the "fast path" for existing sessions. Since the traffic matches an existing session, the route lookups and full session evaluation are not performed, but in this case they were needed.

The generic way to avoid this particular type of issue is to have your static routes marked as "permanent", meaning they are not withdrawn even if the next hop or interface is not reachable. Unfortunately, I don't see this option in the PA interface for static routes.

Another approach is to create a discard route for this prefix at a less preferred metric than the active route. Then, when the active route is not available, the traffic hits the discard route instead of your default route, and thus no other session will be set up. This feature is available on the PA.
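As a rough illustration of why the floating discard route helps, here is a small Python sketch. The route entries, metrics (10 and 200), next-hop names, and the `select_next_hop` helper are assumptions made up for this example; it models generic longest-prefix-match behavior, not PAN-OS configuration syntax.

```python
import ipaddress

DISCARD = "discard"  # sentinel next hop meaning "drop the packet"

def select_next_hop(dest, routes):
    """Longest prefix wins; among equal prefixes, the lowest metric wins."""
    addr = ipaddress.ip_address(dest)
    matches = [r for r in routes if addr in ipaddress.ip_network(r["prefix"])]
    if not matches:
        return None
    best = max(
        matches,
        key=lambda r: (ipaddress.ip_network(r["prefix"]).prefixlen, -r["metric"]),
    )
    return best["next_hop"]

default = {"prefix": "0.0.0.0/0", "metric": 10, "next_hop": "outside"}
active = {"prefix": "172.30.0.0/16", "metric": 10, "next_hop": "metro-e"}
# Same prefix as the active route, but a less preferred (higher) metric.
floating_discard = {"prefix": "172.30.0.0/16", "metric": 200, "next_hop": DISCARD}

link_up = [default, active, floating_discard]
link_down = [default, floating_discard]  # active route withdrawn with the link

print(select_next_hop("172.30.1.5", link_up))    # metro-e
print(select_next_hop("172.30.1.5", link_down))  # discard
print(select_next_hop("8.8.8.8", link_down))     # outside
```

While the link is down, traffic for 172.30.0.0/16 matches the discard route rather than falling through to the default route, so no session toward the outside zone gets built for it.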