So I've got a curious little problem and wanted to get some opinions before possibly creating a feature request at PA.
We have a customer using a Palo Alto as their main firewall
and a certain cloud-based proxy as their proxy.
They connect to this proxy via an IPsec VPN tunnel.
Our customer noted that from time to time they lose connectivity to the internet (via the proxy) for up to an hour.
Troubleshooting with the proxy provider determined this happens when a failover occurs on the proxy side.
More in-depth questions revealed how this proxy provider handles failover.
The proxy is made redundant by means of nodes. Customers are load-balanced over the different nodes,
and to connect they build an IPsec VPN tunnel with the node directly.
Should this node go down, the public IP is moved to a different node: the IPsec VPN is terminated on the old node and rebuilt on the new one (the public peer IP moves with it).
However, as the public IP remains the same and the tunnel is re-established on a different node with the same IP, the Palo Alto keeps sending data over the (now old) IPsec tunnel.
The remote side now drops all traffic, as the Palo Alto is sending data over the old tunnel, using its outdated SA (old SPI and keys).
All traffic now times out (SYN sent, no SYN-ACK back).
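To make the failure mode concrete, here is a toy Python model of what seems to be happening. It is a sketch under assumptions, not how either vendor actually implements it: each node only accepts packets for SAs it negotiated itself, and the firewall keeps using the SPI it has cached. `ProxyNode`, `Firewall` and the SPI values are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ProxyNode:
    """Hypothetical proxy node: accepts only packets whose SPI it negotiated."""
    name: str
    known_spis: set = field(default_factory=set)

    def negotiate(self, spi: int) -> None:
        # Record the SA (identified by its SPI) on this node only.
        self.known_spis.add(spi)

    def receive(self, spi: int) -> str:
        return "forwarded" if spi in self.known_spis else "dropped"


@dataclass
class Firewall:
    """Hypothetical local peer: keeps sending with the SPI it has cached."""
    cached_spi: int

    def send(self, node: ProxyNode) -> str:
        return node.receive(self.cached_spi)


node_a = ProxyNode("node-a")
node_b = ProxyNode("node-b")

# Tunnel is built against node A.
node_a.negotiate(0x1001)
fw = Firewall(cached_spi=0x1001)
before = fw.send(node_a)

# Failover: the public IP now points at node B, which has no state for 0x1001.
after = fw.send(node_b)

# Only once the firewall renegotiates does traffic flow again.
node_b.negotiate(0x2002)
fw.cached_spi = 0x2002
recovered = fw.send(node_b)

print(before, after, recovered)
```

The point of the model is the middle step: nothing on the firewall side changes at failover, so nothing tells it that its cached SA is useless.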
The remote proxy expects DPD to kick in here: once DPD determines the tunnel is dead, a new tunnel is established.
In comes Palo Alto's way of not sending DPD continuously, but only during phase 2 rekey (hence the hour of downtime), with no option to enable persistent DPD.
Configuring tunnel monitoring could help here, as it is the closest Palo Alto seems willing to go to persistent DPD. However, this requires ICMP to a host through the tunnel to be allowed, and ICMP is not allowed towards the cloud proxies...
So, when a node fails over:
At the cloud proxy side: IPsec is terminated and rebuilt on a different node.
At the Palo Alto: this is not detected.
--> The Palo Alto sends traffic with the outdated SA (old SPI and keys); the cloud proxy firewalling drops this traffic.
--> The cloud proxy firewall expects the remote side (the Palo Alto) to rebuild the tunnel once it detects via DPD that the current one is unresponsive. The Palo Alto only sends DPD during phase 2 rekey or when signalled by tunnel monitoring.
Tunnel monitoring on the Palo Alto could resolve this. However, this monitoring only works with ICMP, which is dropped by the cloud proxy.
--> So far no solution.
I asked the cloud proxy provider whether allowing ICMP would be possible --> I got a no.
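With ICMP off the table, one stopgap would be a small watchdog on a host behind the firewall that probes through the tunnel over TCP instead. Below is a minimal Python sketch of that idea. Everything in it is an assumption: that the proxy answers a TCP handshake on some port, the names `tcp_probe` and `TunnelWatchdog`, the threshold of 3, and above all the `on_down` callback, which stands in for whatever would actually make the firewall rebuild the tunnel (that part is deliberately left open).

```python
import socket
from typing import Callable


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        return False


class TunnelWatchdog:
    """Calls on_down once after `threshold` consecutive failed probes."""

    def __init__(self, probe: Callable[[], bool],
                 on_down: Callable[[], None], threshold: int = 3) -> None:
        self.probe = probe
        self.on_down = on_down
        self.threshold = threshold
        self.failures = 0

    def tick(self) -> None:
        if self.probe():
            self.failures = 0  # any success resets the counter
            return
        self.failures += 1
        # Fire exactly once per outage, when the threshold is first reached.
        if self.failures == self.threshold:
            self.on_down()
```

In practice `tick()` would be called every few seconds with something like `probe=lambda: tcp_probe(proxy_host, 80)`, where `proxy_host` is a placeholder for an address that is only reachable through the tunnel. Requiring several consecutive failures keeps one lost handshake from tearing down a healthy tunnel.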
Possible other ideas:
Feature request at Palo Alto for a checkbox to enable persistent DPD messages.
Feature request at Palo Alto to allow tunnel monitoring over something other than ICMP (TCP 80?).
Workaround: reduce the phase 2 lifetime to a very short time in order to reduce downtime during failover...
Far from ideal, and during failover the downtime is still somewhere between 15 seconds (DPD) and 15 seconds + the minimum phase 2 lifetime.
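The downtime range for the short-lifetime workaround can be worked out with a few lines of Python. This is a simplified sketch: it assumes the failure is only noticed at the next phase 2 rekey, that the rekey happens exactly when the lifetime expires, and that DPD then takes the 15 seconds mentioned above. The function names and the 300-second lifetime are purely illustrative.

```python
def outage_seconds(phase2_lifetime: int, elapsed_since_rekey: int,
                   dpd_timeout: int = 15) -> int:
    """Outage if failover happens `elapsed_since_rekey` seconds into the SA."""
    if not 0 <= elapsed_since_rekey <= phase2_lifetime:
        raise ValueError("elapsed_since_rekey must lie within the lifetime")
    # Wait for the next rekey, then for DPD to declare the peer dead.
    return (phase2_lifetime - elapsed_since_rekey) + dpd_timeout


def outage_range(phase2_lifetime: int, dpd_timeout: int = 15) -> tuple:
    """Best case: failover right before a rekey. Worst case: right after one."""
    best = outage_seconds(phase2_lifetime, phase2_lifetime, dpd_timeout)
    worst = outage_seconds(phase2_lifetime, 0, dpd_timeout)
    return best, worst


for lifetime in (3600, 300):
    best, worst = outage_range(lifetime)
    print(f"lifetime {lifetime}s: outage between {best}s and {worst}s")
```

With these assumed numbers a one-hour lifetime gives an outage of 15 s to 3615 s, and a 300 s lifetime gives 15 s to 315 s, which is why shortening the lifetime only shrinks the problem rather than removing it.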
The problem in this case is that, depending on how you look at it, the proxy provider and Palo Alto are either both not at fault or both at fault:
The way DPD is implemented by Palo Alto is kind of iffy, and seems intended to promote their product's proprietary tunnel monitoring.
However, DPD has no binding standard (RFC 3706 is informational only), so they are/were free to implement DPD as they see fit. Still, I'm not a fan of security vendors giving their own twist to how they implement stuff (looking at you, Check Point IPsec VPN encryption domains) when everybody else does it differently (and more or less the same as each other), especially if no extra advantage can be seen from it.
The no-ICMP policy of the cloud proxy is bothersome, and the argument that ICMP is a security issue in and of itself is outdated.
It can be argued that permitting ICMP from any source might be abused,
but limiting ICMP to known sources/zones, say allowing it only from devices connected via VPN tunnel, should not be a major issue.
Am I missing something here?
Or has anybody else had similar issues, or know of possible workarounds I'm missing?
I'd have to check, but I suspect the tunnel is currently configured to support both.
But normally IKEv2 should be supported.
This is a good alternative workaround (as lowering the phase 2 rekey interval increases the load on the firewall).
However, the liveness check will also only trigger after 5+ minutes.
Whilst that is fairly short, some users will complain about losing internet access for 5 minutes.
But I will certainly try it.
Ever find a solution to this? We use Zscaler and our setup sounds very similar to this implementation. Our tunnels keep going down and I see DPD timeouts on our side in the debug logs.