AWS Tunnels Down when We make a Failover

L4 Transporter

Hello everyone,

 

I have observed that when a failover occurs on an active/passive cluster, all the IPsec tunnels to AWS go down and take some time to recover.

 

I have verified that traffic stops flowing for about 5-10 minutes.

Has anyone else seen this problem and do you know how I can fix it?


I should also mention that the tunnels are set up in pairs (2 by 2), with PBF and a failover next hop configured to avoid asymmetric routing.

 

Regards

11 REPLIES

L6 Presenter

Hi @Alpalo ,

 

Normally the HA2 link is used to sync IPsec SAs from the active to the passive firewall. Can you check whether the passive firewall has the IPsec SAs synced?

I would also recommend checking the traffic and system logs related to the tunnel to see whether anything unexpected shows up there.

M

Check out my YouTube channel - https://www.youtube.com/@NetworkTalks

L0 Member

Hi

We see the same thing.

If you configure tunnel monitoring - that seems to bring up the tunnel(s) again quickly. But they will disconnect - in our case approx. 60 seconds after failover.

BR,

Morten

L1 Bithead

I see the same behaviour with the same setup as yours, using the standard two tunnels AWS has you set up. We lose about 60 seconds while they re-establish.

 

How would one check whether IPsec is being synced properly, Sutare?

Hi @abettencourt ,

To check whether the IPsec SAs were synced, just log in to the passive member and confirm that you see established SAs.

Yes, the IPsec SAs are all synced, but not IKE, obviously.

 

This is the same for AWS and non-AWS tunnels; however, only the AWS tunnels go down during these failovers.

L1 Bithead

Not sure if you found a solution to this...

 

As stated here in the KB article:
https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA10g000000ClVGCA0

My reading of it is that if the tunnel goes down (or, presumably, is being set up for the first time), it will only be negotiated when there is interesting traffic, so you have several options to keep interesting traffic flowing:

 

  1. Tunnel monitor using your tunnel interface. The route to the peer will be via the tunnel, hence it is interesting traffic (ICMP between tunnel peers on the same /30 subnet).
    1. On a side note, I'd use "path monitoring" instead of PBF if you have two static routes to the same destination, e.g. 10.1.0.0/24 via tunnel 1 and 10.1.0.0/24 via tunnel 2. Just set the metric of the preferred route to 10 and the other to 20, and be sure to path monitor both routes.
  2. If you have a monitoring tool such as SolarWinds, ping something on the remote end, even if it is just a dummy host; this will be ICMP to the remote end.
  3. Set up a Lambda function to be triggered when the tunnel is down. It will then run the "test" commands on the PA:
    1. test vpn ike-sa gateway <gateway_name>
    2. test vpn ipsec-sa tunnel <tunnel_name>

https://docs.paloaltonetworks.com/pan-os/10-2/pan-os-admin/vpns/set-up-site-to-site-vpn/test-vpn-con...

Hi @abettencourt , @Alpalo ,

 

It is almost a year since this was posted; have you found a solution?

Last week we did some failover tests related to other issues, and we experienced the same issue first hand.

 

I haven't completely figured it out, but it looks like it is related to how AWS handles phase 2 when phase 1 is down.

As already mentioned, HA syncs only phase 2, which means that in the event of a failover the secondary member will have phase 2 to AWS up and will try to use it, but there will be no phase 1. The firewall will believe the tunnel is up and try to use the phase 2 that it "inherited" from the primary peer, but I am guessing AWS will reject the traffic because it uses a phase 2 for which there is no valid phase 1.

 

It is interesting to note that when forcing phase 1 to negotiate using the "test vpn ike-sa gateway .." command, the tunnel starts working immediately. In the logs I can see that after the phase 1 negotiation, phase 2 is also renewed.

This KB mentions an interesting solution - https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA14u000000HAuZCAW&lang=en_US%E2%80%A...

There they briefly suggest creating a log forwarding action with an HTTP profile: when a failover event is triggered, the FW sends an API call to itself to test phase 1, which brings the AWS tunnel back to a functional state immediately after failover.

 

I am still puzzled about what exactly is causing this issue, but something to do with the IKEv2 phase 1 liveness check could be the explanation.

I want to run some more tests with the IKEv2 liveness check disabled or with tunnel monitor enabled.

 

L1 Bithead

Hi Astardzhiev!

 

Thanks so much for the reply.

 

No, I never did find a solution to the issue, so this is very interesting to see. I did arrive at the same conclusion about the cause, though - I figured it was phase 1, as you can see the rekeys pop up as soon as it starts (the traffic continues to flow for a few seconds after failing over, before AWS notices that phase 1 is different).

 

Your possible solution is going to the top of my list of things to test, however! Excited to see a possible workaround for this.

Hi @abettencourt ,

I forgot to share my results...

 

TL;DR: enabling tunnel monitor, monitoring the AWS tunnel IP, works perfectly.

I was able to reproduce the issue with a manual failover, and every time I noticed that although phase 2 seems up, traffic is not passing. In our case we run BGP, and we noticed that BGP went down and stays down for a long time.

 

As you know, AWS establishes two separate tunnels, so first I enabled tunnel monitor for one of those tunnels and performed another failover.

Tunnel monitor failed a couple of seconds after the failover, which seems to trigger a new tunnel negotiation (at least that is how I explain it to myself).

Since the AWS peer is working, the new tunnel is established almost immediately.

The default values for the tunnel monitor profile (3 sec interval / threshold 5) were enough to keep the BGP peering up.
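
As a rough sanity check on why those defaults can be enough, here is the arithmetic as a small sketch. It assumes the monitor declares the tunnel down only after `threshold` consecutive probes are lost, and it uses a 30-second BGP hold timer purely as an example value; check the hold timer actually negotiated on your own peering.

```python
def detection_window(interval_s: float, threshold: int) -> float:
    """Seconds of consecutive lost probes before the monitor declares failure."""
    return interval_s * threshold


# Values from the post: 3-second interval, threshold of 5.
monitor_s = detection_window(interval_s=3, threshold=5)

# Assumption for illustration only; not taken from the thread.
bgp_hold_s = 30

print(f"monitor reacts within about {monitor_s} s, BGP hold timer is {bgp_hold_s} s")
print("monitor reacts before the BGP session expires:", monitor_s < bgp_hold_s)
```

If the hold timer on your peering is shorter than the monitor window, lowering the interval or threshold in the monitor profile would be the obvious knob to try.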

 

I am still puzzled about why exactly this is happening. During my initial setup I explicitly decided not to enable tunnel monitor, because we are using BGP, which should handle the dynamic switchover when tunnels are down.

 

Cyber Elite

The tunnel goes down because IKE is not synchronized between HA peers. Most likely AWS brings the tunnel down due to a DPD failure.

 

https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA14u000000HAuZCAW

 

Try:
"To bring up phase 1 automatically, use the HTTP Log Forwarding feature known as "log to action". If the firewall sees a HA event in the logs, configure "log to action" to trigger the command "test vpn ike-sa" to bring up phase 1 automatically in the event of a failover."

Enterprise Architect, Security @ Cloud Carib Ltd
Palo Alto Networks certified from 2011

Hi all,

 

Does anyone have an example of the payload to get this type of 'log to action' working?

 

Thanks a lot!

Greeting
