Active/Passive HA with LAG to Virtual Chassis = Dropped Packets?

SimonBlackler · ‎07-12-2014

Good afternoon,

I tried to deploy an Active/Passive cluster yesterday with only partial success!

Things didn't work as expected. Sessions were forming but servers would work intermittently. At times it would change so that what was working, stopped, and what wasn't, started. Some services worked fine for some people throughout. And for others nothing worked. After 45 minutes of trying various things we rolled back and I got to wondering what I'd missed...

In the lab I'd modelled the setup fairly closely on the real-world scenario:

I have a total of 4 BGP peers - 2 for each device, sitting in the "Internet" zone on interfaces 1/1, 1/2, 1/3, 1/4 (1/1 & 1/3 are up on the Active and 1/2 & 1/4 are ready to come up on the Passive in case of a failover)
We accept only a default route (0/0) and announce only 1 prefix. BGP/Routing worked in both the lab and the real world.
3 ports per device form part of an aggregated Ethernet bundle, "AE1", making up the "Trust" zone. (1/5, 1/7, 1/9)
The AE1 bundle mounts a number of L3 subnets that act as default gateways for downstream servers.
The AE1 bundle connects from each PAN device to an EX4200 virtual switch stack running a single AE bundle, "AE11". (Not modelled in lab)
There is no routing occurring on the switch fabric.
There are no "Deny" rules - only a default Any/Any/Any "Allow" rule.
There are no fail over rules enabled.

Things I tried

Disabled Jumbo frames - didn't need them anyway, was a relic from an earlier Active/Active setup.
Changed "Passive link state" to "Shutdown" from "Auto"

My current working theory is that having both PAN devices (even though one is shutdown/passive) connected to the switch fabric over a single AE bundle caused traffic to get lost at L2. Is this possible? Perhaps I've missed something else. Either way I'd love to know what I got wrong and how it can be fixed.
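The theory above can be illustrated with a small sketch. It assumes, and the thread does not confirm this, that the switch treats all six members of the single bundle as eligible and picks one per flow with a deterministic hash. The link names, addresses and the CRC32 hash are hypothetical stand-ins, not the EX4200's actual algorithm.

```python
import zlib

# Hypothetical member names: three links to each firewall in ONE bundle.
ACTIVE_LINKS = ["fw-a:1/5", "fw-a:1/7", "fw-a:1/9"]
PASSIVE_LINKS = ["fw-b:1/5", "fw-b:1/7", "fw-b:1/9"]
ALL_LINKS = ACTIVE_LINKS + PASSIVE_LINKS


def pick_member(flow, members):
    """Pick a member link from a deterministic hash of the flow tuple."""
    key = "|".join(str(field) for field in flow).encode()
    return members[zlib.crc32(key) % len(members)]


def delivery_ratio(flows, members):
    """Fraction of flows that hash to a link on the active firewall."""
    delivered = sum(pick_member(f, members) in ACTIVE_LINKS for f in flows)
    return delivered / len(flows)


# Illustrative flows only: (src IP, dst IP, src port, dst port).
flows = [("10.0.0.%d" % i, "192.0.2.10", 40000 + i, 443) for i in range(1, 201)]

# Assumption: the switch still considers the passive-side members usable.
print("all six members eligible:", delivery_ratio(flows, ALL_LINKS))
# Only the active-side members are eligible.
print("active members only:    ", delivery_ratio(flows, ACTIVE_LINKS))
```

If that assumption held, each flow would consistently work or consistently fail, and the set of working flows would change whenever the member set changed, which resembles the symptoms described. It is a model of the hypothesis, not a diagnosis.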

Thanks for your time,

Simon

hshah · ‎07-12-2014

Hi Simon,

If a device is in the passive state, it will not respond to any traffic; if it receives any traffic, it simply drops it. If you think the passive unit is receiving some traffic because of a switching issue, try the following.

1. Clear the MAC table on the switch. This will clear any stale entry for the passive unit, if one exists.

2. If that doesn't fix the issue, run a packet capture on the passive unit. That can help you verify whether the firewall is receiving any traffic.

3. You may want to check hardware counters on switch connecting Passive unit, see if they are increasing.

If the issue is on the active unit and not on the passive unit, then send me the output of the following command, taken every 5 minutes. I need 6 samples. This will show the precise reason for the drops.

show counter global filter packet-filter yes delta yes severity drop
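Once the six samples are collected, they could be compared with a small script like the sketch below. It assumes each sample is saved as text with whitespace-separated rows in the order name, value, rate, severity, followed by further columns; that layout is an assumption, so adjust the parser to the real output. The counter names and values in the sample text are hypothetical placeholders, not real results.

```python
from collections import defaultdict


def parse_drop_counters(sample_text):
    """Return {counter_name: delta_value} for rows whose severity is 'drop'."""
    counters = {}
    for line in sample_text.splitlines():
        parts = line.split()
        # Assumed row layout: name value rate severity category aspect description
        if len(parts) >= 4 and parts[1].isdigit() and parts[3] == "drop":
            counters[parts[0]] = int(parts[1])
    return counters


def rank_drop_reasons(samples):
    """Sum each drop counter across samples and count the samples it grew in."""
    totals = defaultdict(int)
    seen_in = defaultdict(int)
    for text in samples:
        for name, value in parse_drop_counters(text).items():
            totals[name] += value
            if value > 0:
                seen_in[name] += 1
    # Largest total first; ties broken by counter name.
    return sorted(((n, totals[n], seen_in[n]) for n in totals),
                  key=lambda row: (-row[1], row[0]))


# Hypothetical sample text, for illustration only.
SAMPLE_1 = """\
name              value  rate severity category aspect  description
example_drop_a       12     0 drop     flow     parse   Hypothetical reason A
example_drop_b        3     0 drop     flow     forward Hypothetical reason B
"""
SAMPLE_2 = """\
name              value  rate severity category aspect  description
example_drop_a        8     0 drop     flow     parse   Hypothetical reason A
"""

for name, total, count in rank_drop_reasons([SAMPLE_1, SAMPLE_2]):
    print(name, total, "seen in", count, "sample(s)")
```

A counter that grows in every sample is a stronger lead than one that appears once.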

Regards,

Hardik Shah

SimonBlackler · ‎07-12-2014

Thanks Hardik,
So this architecture is valid?

You say "3. You may want to check hardware counters on switch connecting Passive unit, see if they are increasing."

It's the same logical switch that's connected to the Active unit, under the same LAG. Only 3 of the 6 LAG members are up; the other 3 are down.

hshah · ‎07-12-2014

Hello Suppliers,

If the switch is logical, then I don't have much information on troubleshooting it.

The architecture is correct. First, try to find the reason for the drops.

It could be the firewall, the switch, or the BGP routers.

Regards,

Hardik Shah

SimonBlackler · ‎07-12-2014

Thanks Hardik,

It won't be the BGP routers - they're 3rd party and are working currently.

It was the switch itself that seemed to be having trouble doing L2 when we ran tests, though I didn't know whether the LAG to the new firewalls had caused the issue...

I will try again in the next maintenance window and check the filters with the deltas.

If anyone else has any other ideas or suggestions, I'm all ears.

hshah · ‎07-12-2014

Hello Supplier,

The global counters are the best way to find the root cause.

Apart from that, you can run a packet capture on the firewall to troubleshoot a particular data stream. That is also effective.

I can make further suggestions once I have the results of this output.

Regards,

Hardik Shah

SimonBlackler · ‎07-13-2014

Hello Hardik,

My name is Simon!

Aside from that - I've been reading up and found that the scenario "Layer 3 Active/Passive with Link Aggregation" on page 80 of this document - Designing Networks with Palo Alto Networks Firewalls makes use of MC-LAG. I'm only using LAG. Could this be the problem?

hshah · ‎07-13-2014

Hello Simmon,

The topologies are different, but both should be supported and neither should drop packets.

Global counter data would be really useful here.

Regards,

Hardik Shah

SimonBlackler · ‎07-17-2014

Hello again,

To those of you interested - we successfully deployed these firewalls yesterday, after making a change to the topology.

We replaced the single LAG between the switch fabric and both firewalls with a separate LAG from the switch fabric to each device.

For whatever reason this has solved the issue and we're no longer seeing dropped packets.

Thank you hshah for your help.

Here is the revised, working topology. I hope it helps someone else.
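One plausible reading of why a LAG per device behaves better is sketched below. It assumes the switch learns the gateway MAC on exactly one logical interface and only hashes across that interface's members. The LAG names, port names, MAC address and CRC32 hash are hypothetical, and the thread does not confirm this as the root cause.

```python
import zlib

# Hypothetical switch-side bundles: one LAG per firewall.
LAGS = {
    "ae-fw-a": ["fw-a:1/5", "fw-a:1/7", "fw-a:1/9"],
    "ae-fw-b": ["fw-b:1/5", "fw-b:1/7", "fw-b:1/9"],
}
GATEWAY_MAC = "00:00:5e:00:53:01"  # documentation-range MAC, illustrative only


def egress_link(flow, mac_table):
    """Look up the LAG that learned the gateway MAC, then hash within it."""
    members = LAGS[mac_table[GATEWAY_MAC]]
    key = "|".join(str(field) for field in flow).encode()
    return members[zlib.crc32(key) % len(members)]


def firewalls_reached(flows, mac_table):
    """Set of firewalls that receive at least one of the flows."""
    return {egress_link(f, mac_table).split(":")[0] for f in flows}


# Illustrative flows only: (src IP, dst IP, src port, dst port).
flows = [("10.0.0.%d" % i, "192.0.2.10", 40000 + i, 443) for i in range(1, 201)]

normal = {GATEWAY_MAC: "ae-fw-a"}          # MAC learned on the active unit's LAG
after_failover = {GATEWAY_MAC: "ae-fw-b"}  # MAC relearned on the other LAG

print(firewalls_reached(flows, normal))
print(firewalls_reached(flows, after_failover))
```

In this model every flow lands on whichever firewall currently owns the gateway MAC, so no flow can hash onto a link of the passive unit.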
