AWS Tunnels Down when We make a Failover

Alpalo · ‎01-21-2022

Hello everyone,

I have observed that when a failover occurs on an active/passive cluster, all the IPsec tunnels to AWS go down and take some time to recover.

I have verified that traffic stops flowing through the tunnels for about 5-10 minutes.

Has anyone else seen this problem and do you know how I can fix it?

I would also like to mention that the tunnels are set up in pairs, with PBF and a failover next hop configured to avoid asymmetric routing problems.

Regards

SutareMayur · ‎02-07-2022

Hi @Alpalo ,

Normally the HA2 link is used to sync IPsec SAs from the active to the passive firewall. Can you check whether the passive firewall has the IPsec SAs synced?

I would also recommend checking the traffic logs and the system logs related to the tunnel traffic to see whether there are any unexpected entries.

M

Check out my YouTube channel - https://www.youtube.com/@NetworkTalks

MortenAug · ‎02-25-2022

Hi

We see the same thing.

If you configure tunnel monitoring - that seems to bring up the tunnel(s) again quickly. But they will disconnect - in our case approx. 60 seconds after failover.

BR,

Morten

abettencourt · ‎06-08-2022

I too observe the same behaviour with the same setup as yours, with the standard 2 tunnels that AWS has you set up. We lose about 60 seconds while they re-establish.

How would one check if IPsec is being synced properly, Sutare?

aleksandar.astardzhiev · ‎06-10-2022

Hi @abettencourt ,

To check whether the IPsec SAs were synced, just log in to the passive member and confirm that you see established SAs.

abettencourt · ‎06-10-2022

Yes, the IPsec SAs are all synced, but the IKE SAs are not, obviously.

This is the same for AWS and non-AWS tunnels; however, only the AWS tunnels go down during these failovers.

NathanielM · ‎08-03-2022

Not sure if you found a solution to this...

As stated here in the KB article:
https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA10g000000ClVGCA0

My reading of that article is that if the tunnel goes down (or, presumably, is being set up for the first time), it will only be negotiated when there is interesting traffic. So you have several options for generating that interesting traffic:

1. Use tunnel monitor on your tunnel interface. The route to the peer is via the tunnel, so the monitor's ICMP between the tunnel peers on the same /30 subnet counts as interesting traffic.
   On a side note, I'd use "path monitoring" instead of PBF if you have two static routes to the same destination, e.g. 10.1.0.0/24 via tunnel 1 and 10.1.0.0/24 via tunnel 2. Just set the preferred route's metric to 10 and the other to 20, and be sure to path monitor both routes.
2. If you have a monitoring tool such as SolarWinds, ping something on the remote end, even if it is just a dummy host; this will be ICMP to the remote end.
3. Set up a Lambda function to be triggered when the tunnel is down. It will then run the "test" commands on the PA:
   1. test vpn ike-sa gateway <gateway_name>
   2. test vpn ipsec-sa tunnel <tunnel_name>

https://docs.paloaltonetworks.com/pan-os/10-2/pan-os-admin/vpns/set-up-site-to-site-vpn/test-vpn-con...

aleksandar.astardzhiev · ‎02-05-2023

Hi @abettencourt , @Alpalo ,

It has been almost a year since this was posted. Have you found a solution?

Last week we did some failover tests related to other issues, and we experienced the same issue first hand.

I haven't completely figured it out, but it looks like it is related to how AWS handles phase2 when phase1 is down.

As already mentioned, HA syncs only phase2, which means that in the event of a failover the secondary member will have phase2 to AWS up and will try to use it, but there will be no phase1. The firewall will believe the tunnel is up and try to use the phase2 that it "inherited" from the primary peer, but I am guessing AWS rejects the traffic because it is using a phase2 for which there is no valid phase1.

It is interesting to note that when forcing phase1 to negotiate using the "test vpn ike-sa gateway .." command, the tunnel starts working immediately. In the logs I can see that after the phase1 negotiation, phase2 is also renewed.

This KB mentions an interesting solution - https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA14u000000HAuZCAW&lang=en_US%E2%80%A...

It briefly suggests creating a log forwarding action with an HTTP profile, so that when a failover event is logged, the firewall sends an API call to itself to test phase1. That should bring the AWS tunnel back to a working state immediately after failover.

I am still puzzled what exactly is causing this issue, but something with IKEv2 phase1 liveness check could be the explanation.

I want to run some more tests with the IKEv2 liveness check disabled or with tunnel monitor enabled.

abettencourt · ‎02-28-2023

Hi Astardzhiev!

Thanks so much for the reply.

No, I never did find a solution to the issue, so this is very interesting to see. I did, however, arrive at the same conclusion about the cause: I figured it was phase1, since you can see the rekeys pop up as soon as it starts (traffic continues to flow for a few seconds after failing over, before AWS notices that phase1 is different).

Your possible solution is going to the top of my list of things to test! Excited to see a possible workaround for this.

aleksandar.astardzhiev · ‎02-28-2023

Hi @abettencourt ,

I forgot to share my results...

TL;DR: enabling tunnel monitor against the AWS tunnel IP works perfectly.

I was able to reproduce the issue with a manual failover, and every time I noticed that although phase2 seems up, traffic is not passing. In our case we run BGP, and we noticed that BGP went down and stayed down for a long time.

As you know, AWS establishes two separate tunnels, so first I enabled tunnel monitor for one of those tunnels and performed another failover.

Tunnel monitor failed a couple of seconds after the failover, which seems to trigger a new tunnel negotiation (at least that is how I explain it to myself).

Since the AWS peer is working, the new tunnel is established almost immediately.

The default values for the tunnel monitor profile (3sec/5 threshold) were enough to keep the BGP peering up.

I am still puzzled about why exactly this is happening. During my initial setup I explicitly decided not to enable tunnel monitor, because we are using BGP, which should handle the dynamic switchover when tunnels are down.

Raido_Rattameister · ‎03-02-2023

The tunnel goes down because IKE is not synchronized between HA peers. Most likely AWS brings the tunnel down due to DPD failure.

https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA14u000000HAuZCAW

Try:
"To being up phase 1 automatically, use the HTTP Log Forwarding feature known as "log to action". If the firewall sees a HA event in the logs, configure "log to action" to trigger the command "test vpn ike-sa" to bring up phase 1 automatically in the event of a failover."

Principal Architect @ Cloud Carib Ltd
Palo Alto Networks certified from 2011

wvandriessche · ‎05-14-2024

Hi all,

Does anyone have an example of the payload needed to get this type of 'log to action' working?

Thanks a lot!

Greetings

Eric_B · ‎02-24-2025

You can have it detect the failover event and make an API call to itself to rekey the VPNs, either using XPath with a wildcard or being specific by IKE gateway name. I believe this will do what they were trying to say. I am not sure that I really want to do this in production, but I am going to lab it up and test it out.

Eric_B · ‎03-20-2025

So while it is possible to create a localhost server profile to trigger the API, operational commands do not work with XPath. You need a separate server profile for each s2s VPN you want to "test" to bring it back up, with the name of that gateway as a value. It is not ideal. Theoretically, you could script a query via config commands to list all the gateways, build an array, then send an op command to the API for each gateway. That way you are not adding or removing server profiles for each s2s.
