Active/Passive HA with LAG to Virtual Chassis = Dropped Packets?

L1 Bithead

Good afternoon,

I tried to deploy an Active/Passive cluster yesterday, with only partial success!

Things didn't work as expected. Sessions were forming but servers would work intermittently. At times it would change so that what was working, stopped, and what wasn't, started. Some services worked fine for some people throughout. And for others nothing worked. After 45 minutes of trying various things we rolled back and I got to wondering what I'd missed...

In the lab I'd modelled the setup fairly closely on the real-world scenario:

PAN.png

  • I have a total of 4 BGP peers, 2 for each device, that sit in the "Internet" zone on interfaces 1/1, 1/2, 1/3, 1/4 (1/1 & 1/3 are up on the Active and 1/2 & 1/4 are ready to come up on the Passive in a failover scenario)
  • We accept only a default route (0/0) and announce only 1 prefix. BGP/Routing worked in both the lab and the real world.
  • 3 ports per device form part of an aggregated Ethernet bundle, "AE1", making up the "Trust" zone. (1/5, 1/7, 1/9)
  • The AE1 bundle mounts a number of L3 subnets that act as default gateways for downstream servers.
  • The AE1 bundle connects from each PAN device to an EX4200 virtual switch stack running a single AE bundle, "AE11". (Not modelled in lab)
  • There is no routing occurring on the switch fabric.
  • There are no "Deny" rules - only a default Any/Any/Any "Allow" rule.
  • There are no fail over rules enabled.

Things I tried

  1. Disabled Jumbo frames - didn't need them anyway, was a relic from an earlier Active/Active setup.
  2. Changed "Passive link state" to "Shutdown" from "Auto"

My current working theory is that having both PAN devices (even though one is shutdown/passive) connected to the switch fabric over a single AE bundle caused traffic to get lost at L2. Is this possible? Perhaps I've missed something else. Either way I'd love to know what I got wrong and how it can be fixed.
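
To make the theory above concrete, here is a minimal Python sketch of how a single six-member bundle could lose traffic at L2. It assumes (nothing in this thread confirms it) that the switch still treats the passive unit's three links as usable members of the bundle and picks one member per flow by hashing. The CRC32 hash, the port labels and the flow tuples are illustrative placeholders, not the EX4200's real algorithm or the real addressing.

```python
import zlib

def pick_member(flow, members):
    """Pick an egress LAG member by hashing the flow tuple.

    Illustrative hash only; a real switch uses its own hashing scheme.
    """
    key = "|".join(str(field) for field in flow).encode()
    return members[zlib.crc32(key) % len(members)]

def delivered_fraction(flows, members, forwarding_members):
    """Fraction of flows whose chosen member lands on a unit that forwards."""
    ok = sum(1 for f in flows if pick_member(f, members) in forwarding_members)
    return ok / len(flows)

# Hypothetical flows: (source IP, destination IP, source port, destination port).
flows = [("10.0.0.%d" % i, "192.0.2.10", 40000 + i, 443) for i in range(1, 201)]

# Hypothetical labels for the three AE1 ports on each firewall.
active = ["fw-active:1/5", "fw-active:1/7", "fw-active:1/9"]
passive = ["fw-passive:1/5", "fw-passive:1/7", "fw-passive:1/9"]

# One bundle spanning both firewalls: some flows hash to the passive unit.
single_lag = delivered_fraction(flows, active + passive, set(active))

# One bundle per firewall: only the active unit's members are candidates.
per_device_lag = delivered_fraction(flows, active, set(active))

print("single bundle delivered:", single_lag)
print("per-device bundle delivered:", per_device_lag)
```

Because the hash is per flow, the same client/server pairs would fail consistently until something changed the member set, which would be consistent with "some services worked fine for some people throughout". Treat this as a way to reason about the symptom, not as a diagnosis.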

Thanks for your time,

Simon


Accepted Solutions
L1 Bithead

Hello again,

To those of you interested - we successfully deployed these firewalls yesterday, after making a change to the topology.

We replaced the single LAG between the switch fabric and both firewalls with a separate LAG to each device.

For whatever reason this has solved the issue and we're no longer seeing dropped packets.

Thank you hshah for your help.

Here is the revised, working topology. I hope it helps someone else.

PAN-revised.png


All Replies
L6 Presenter

Hi Simon,

If a device is in the Passive state, it will not respond to any traffic; if it receives any traffic, it simply drops it. If you think the passive unit is receiving some of the traffic because of a switching issue, try the following:

1. Clear the MAC table on the switch. This removes any stale entry for the passive unit, if one exists.

2. If that doesn't fix the issue, run a packet capture on the passive unit. That will help you verify whether the firewall is receiving any traffic.

3. You may want to check hardware counters on switch connecting Passive unit, see if they are increasing.

If the issue is on the active unit and not on the passive one, then send me the output of the following command, taken every 5 minutes. I need 6 samples. This should show the precise reason for the drops:

show counter global filter packet-filter yes delta yes severity drop
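
Once the six samples are collected, one way to compare them is sketched below in Python. It assumes each sample has already been reduced by hand to a mapping of counter name to delta value; the counter names and numbers are made-up placeholders, not real PAN-OS counters or real output, and `rank_drop_counters` is a hypothetical helper.

```python
from collections import defaultdict

def rank_drop_counters(samples):
    """Rank drop counters across several delta samples.

    Returns (name, total_delta, samples_seen) tuples, ordered so that
    counters incrementing in the most samples come first.
    """
    totals = defaultdict(int)
    seen_in = defaultdict(int)
    for sample in samples:
        for name, delta in sample.items():
            if delta > 0:
                totals[name] += delta
                seen_in[name] += 1
    rows = [(name, totals[name], seen_in[name]) for name in totals]
    return sorted(rows, key=lambda r: (-r[2], -r[1], r[0]))

# Six hypothetical samples, one per 5-minute interval (placeholder data).
samples = [
    {"example_drop_a": 120, "example_drop_b": 0},
    {"example_drop_a": 95, "example_drop_b": 3},
    {"example_drop_a": 110},
    {"example_drop_a": 130, "example_drop_b": 0},
    {"example_drop_a": 105},
    {"example_drop_a": 90, "example_drop_b": 1},
]

ranking = rank_drop_counters(samples)
for name, total, seen in ranking:
    print(f"{name}: total delta {total}, incremented in {seen} of {len(samples)} samples")
```

A counter that increments in every sample is a better lead than one that shows up once, which is presumably why several samples are requested rather than a single snapshot.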

Regards,

Hardik Shah

L1 Bithead

Thanks Hardik,
So this architecture is valid?

You say "3. You may want to check hardware counters on switch connecting Passive unit, see if they are increasing."

It's the same logical switch that's connected to the Active unit, under the same LAG. Just 3/6 of the LAG members are up and 3 down.

L6 Presenter

Hello Suppliers,

If the switch is logical, then I don't have much information on troubleshooting it.

The architecture is correct; first try to find out the reason for the drops.

It could be the firewall, the switch, or the BGP routers.

Regards,

Hardik Shah

L1 Bithead

Thanks Hardik,

It won't be the BGP routers - they're 3rd party and are working currently.

It was the switch itself that seemed to be having trouble doing L2 when we ran tests, though I didn't know whether the LAG to the new firewalls had caused the issue...

I will try again in the next maintenance window and check the filters with the deltas.

If anyone else has any other ideas or suggestions, I'm all ears.

L6 Presenter

Hello Supplier,

The global counters are the best way to find the root cause.

Apart from that, you can run a packet capture on the firewall to troubleshoot a particular data stream. That is also effective.

I can offer further suggestions once I have the results of this output.

Regards,

Hardik Shah

L1 Bithead

Hello Hardik,

My name is Simon!

Aside from that - I've been reading up and found that the scenario "Layer 3 Active/Passive with Link Aggregation" on page 80 of this document - Designing Networks with Palo Alto Networks Firewalls makes use of MC-LAG. I'm only using LAG. Could this be the problem?

L6 Presenter

Hello Simmon,

The topologies are different, but both should be supported and neither should cause drops.

Global counter data would be really useful here.

Regards,

Hardik Shah

