Switch Redundancy at Access Layer

locampo · ‎10-16-2018

This is a bit off topic, but I thought some folks might have some knowledge and wisdom to offer.

Where I work, we're working diligently to provide robust resiliency and redundancy for our firewalls using dual power supplies, HA, and multiple ISP circuits with policy-based routing for failover. Our core switching (also our core router) is fully redundant as well: an IRF stack of two H3C 7506 chassis in physically separate locations on our main campus, connected via divergent fiber runs, with every distribution closet connected back to the core via an aggregation group consisting of a link to each half of the core. The weakest link in this scenario is down at layer 2: our leaf switches.

No matter what we think of, there's no getting around the fact that any single access switch failing is going to cause downtime for some group of users. Short of getting dual NICs in each machine and trying to build BAGs/LAGs for each workstation, what is the best way to mitigate the impact of this?

Obviously, invest in high-quality switches with fully redundant, hot-swappable PSUs and fans. I've thought that always keeping an unused spare switch in the stack would be a potential strategy for N+1: if a single switch dies, we can remotely assign the spare to replace the failed member and then have someone physically move cables (now a task fit for a help desk technician rather than requiring a network engineer to be available to go manually replace the stack member). But I was wondering if anyone knows of any technology that might automate this or provide this warm spare feature explicitly.

Or any other thoughts or ideas are welcome as well.

LukeBullimore · ‎10-16-2018

The best idea that I can think of would be to configure spanning tree, if you have that capability. Past that, you're really limited by the fact that the switches are Layer 2.

OtakarKlier · ‎10-17-2018

Hello,

In the past we had 'spare' switches that had the same code level and a base config on them. If a prod switch bit the dust, we could then take our config backup and dump it onto the spare, replace the dead one, and plug everything back in.

The other thing we did in the past was use our wifi as a backup, if the machines are capable.

Just some thoughts.

locampo · ‎10-17-2018

@LukeBullimore

Spanning Tree protects against loops and therefore allows multiple paths in a switch fabric, which gives redundancy between the access layer and the core. It does not, however, allow for redundancy at the access layer itself... you know, what endpoints actually physically connect to. Nothing I can think of really does or could, except dual connections between each endpoint and multiple switches, but that would require some kind of link aggregation and double the number of drops/patches/switches, and would therefore be cost-prohibitive to all but those with the heftiest of coffers.

But since this is the weakest link in the network's chain and the one that actually serves the users themselves, I was wondering whether there are solutions out there that work around this problem.

locampo · ‎10-17-2018

@OtakarKlier

Hello there,

Yes, this is essentially what we do now. We have a stock of spare switches for each make and model, and when one fails we load the matching firmware and then join it to the stack (on Comware this does not require restoring a config; you simply add it to the stack in place of the missing member). Our environment at the moment is all over the place in terms of code levels (bad, I know, but not uncommon), so it would not be possible to maintain a spare for each code level until we normalize our codebase across the organization.

Also, aside from needing to load up code, our current strategy relies on a network engineer being available to come into the office and physically install a switch. I was hoping that others might have found creative solutions to that requirement.

Your point about wireless is interesting. The only issue I see is that making each wired host also a wireless host would put a huge additional load on our wireless infrastructure, and wireless capacity is already an issue in most places. It would be cool if there were a solution that could maintain a preconfigured and tested wireless NIC that would remain cold but automatically turn itself on and connect if the wired network failed (say, if it were unable to ping some IP).
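To make the "cold until needed" idea concrete, here is a minimal sketch of the decision logic only. Everything in it is an assumption for illustration: the class name `StandbyWifiMonitor`, the thresholds, and the Linux-style `ping` flags are not from any product. Actually enabling or disabling the wireless NIC is OS-specific and left out, and a real deployment would need to force the probe out of the wired interface.

```python
import subprocess
from typing import Callable


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo (Linux ping flags assumed); True if it was answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class StandbyWifiMonitor:
    """Tracks whether the standby wireless NIC should be up.

    The probe is injected so the logic can be exercised without a network.
    Wi-Fi turns on after `fail_threshold` consecutive failed probes and
    turns off again after `recover_threshold` consecutive good probes,
    which avoids flapping on a single lost ping.
    """

    def __init__(self, probe: Callable[[], bool],
                 fail_threshold: int = 3, recover_threshold: int = 5) -> None:
        self.probe = probe
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.wifi_active = False
        self._fails = 0
        self._oks = 0

    def tick(self) -> bool:
        """Run one probe and return whether Wi-Fi should currently be active."""
        if self.probe():
            self._oks += 1
            self._fails = 0
            if self.wifi_active and self._oks >= self.recover_threshold:
                self.wifi_active = False
        else:
            self._fails += 1
            self._oks = 0
            if not self.wifi_active and self._fails >= self.fail_threshold:
                self.wifi_active = True
        return self.wifi_active
```

In use, something like `StandbyWifiMonitor(lambda: ping_once("192.0.2.1"))` would be ticked on a timer, with 192.0.2.1 standing in for whatever gateway address you choose to probe. This does nothing about the capacity concern: if a whole closet fails over at once, the WLAN still has to absorb those clients.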

RobinClayton · ‎10-18-2018

Spread "departments" across different switches. Then any one failure affects a few users in each of several departments rather than taking out a whole department, reducing the overall impact to the organisation.

Have ACTIVE on hand spares.

Reduce the number of users per switch.

Having dual NICs and LACP would work.
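As a rough illustration of the first point, the sketch below deals users out to switches round-robin and then reports the worst share of each department that a single switch failure would take down. The function names and sample data are made up for the example; a real patching plan is constrained by cable runs and closet layout.

```python
from collections import defaultdict
from itertools import cycle


def spread_users(users_by_dept, switches):
    """Assign users to switches round-robin, department by department."""
    assignment = defaultdict(list)
    next_switch = cycle(switches)
    for dept in sorted(users_by_dept):
        for user in users_by_dept[dept]:
            assignment[next(next_switch)].append((dept, user))
    return dict(assignment)


def worst_case_loss(assignment, users_by_dept):
    """For each department, the largest fraction of users on any one switch."""
    worst = {}
    for dept, users in users_by_dept.items():
        if not users:
            continue  # nothing to lose in an empty department
        per_switch = [
            sum(1 for d, _ in members if d == dept)
            for members in assignment.values()
        ]
        worst[dept] = max(per_switch) / len(users)
    return worst


# Hypothetical sample: two departments patched across two switches.
users = {"radiology": ["r1", "r2", "r3", "r4"], "billing": ["b1", "b2"]}
plan = spread_users(users, ["sw1", "sw2"])
print(worst_case_loss(plan, users))
```

With everything for one department on a single switch, the same function would report 1.0 for that department, which is the situation the advice is trying to avoid.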

But ultimately, who has dictated the need for a resilient access layer?

locampo · ‎10-18-2018

Nobody has dictated it, per se, but when a switch fails it's a "drop everything else" class emergency. When this happens on weekends or after hours, there's no one here to address it, and my employer is a hospital, so it's a 24x7x365 operation where any time something breaks the words "patient safety" quickly get used.

In our current strategy, one of the two network engineers working M-F, 9-5, supporting all (close to 200) of the switches (not to mention 30+ firewalls and ~300 access points across 28 or so physical locations) has to be available to physically replace the failed unit, console into the stack to run the commands necessary to join the new switch, pray that the operation doesn't impact the rest of the stack, confirm the switch came up as the correct member number with the right config, and then manually reconnect cables (being careful that each patch cable goes into the same port on the new switch as on the old, lest some device requiring special VLAN or port configuration end up on a standard port).

I was wondering if someone had built a stacking framework that allowed for running a warm spare switch in the stack that would automatically take over the place and config of a failed member. That would reduce the repair procedure to just the moving-cables part... which could easily be explained to and completed by a non-network engineer, say at 3am on a Sunday. N+1 is a concept that is ubiquitous in almost every other area of infrastructure, except for access layer switches (as far as I've seen).

RobinClayton · ‎10-18-2018

802.1X (dot1x) is potentially what you're looking for, then.

When devices connect to a switch port, they are authenticated and the correct VLAN is provisioned from the central configuration database.

That way you don't need to have "port security" or VLANs pre-assigned to any ports, so a quick cable swap to any other switch port is all that should be required.
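For anyone who hasn't seen dynamic VLAN assignment before, here is a small sketch of the server-side decision. The three reply attributes are the standard tunnel attributes used for this purpose (RFC 3580); the role table, VLAN numbers, MAC addresses, and the function name are all hypothetical, and a real deployment would do this inside the RADIUS server's own policy engine rather than in a script.

```python
# Hypothetical policy data: device identity -> role -> VLAN.
DEVICE_ROLES = {
    "aa:bb:cc:00:00:01": "workstation",
    "aa:bb:cc:00:00:02": "voip-phone",
}
VLAN_BY_ROLE = {"workstation": 110, "voip-phone": 120}
QUARANTINE_VLAN = 999  # assumed fallback for unknown devices


def radius_reply_for(identity: str) -> dict:
    """Return the VLAN-assignment attributes for an authenticated device.

    The switch applies the returned VLAN to whichever port the device
    is plugged into, so the port itself needs no per-device config.
    """
    role = DEVICE_ROLES.get(identity.lower())
    vlan = VLAN_BY_ROLE.get(role, QUARANTINE_VLAN)
    return {
        "Tunnel-Type": "VLAN",
        "Tunnel-Medium-Type": "IEEE-802",
        "Tunnel-Private-Group-Id": str(vlan),
    }


print(radius_reply_for("AA:BB:CC:00:00:02"))
```

Because the answer depends only on who the device is, moving its cable to any 802.1X-enabled port on any switch yields the same VLAN.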

OtakarKlier · ‎10-18-2018

Hello,

Coming from a person who has spent a lot of time in a hospital bed for one reason or another (me), I can appreciate the work you guys do. While there cannot be a warm switch, there is nothing that says you cannot have a switch in the stack that is for 'emergencies' only, meaning it's there, working, and ready for someone to just move patch cables to it.

However, I have used many different brands, and even when the vendor states that a failure in one switch won't affect the stack, I have seen otherwise :(. For that reason, at the last place I worked we didn't stack anything (300+ access switches). All the switches were standalone, and we separated phones and computers onto different switches. While we were not a 24/7 shop, we were a call center, and downtime meant tons of money lost and potentially contracts lost.

Getting funding for these items shouldn't be an issue. You already mentioned the argument to use: 'patient safety'. Use the same argument to get additional switching, in-house on-call admins, or an outsourced company that can come in within a short timeframe to assist.

But you are correct, the access layer is a single point of failure. I still suggest a wifi backup, and make sure the switches the APs connect to are not access layer switches.

Just some thoughts.

locampo · ‎10-18-2018

@RobinClayton

That's good advice. However, 802.1x and NAC wouldn't completely remove the need to occasionally have specific port configurations for things like, oh say, random medical devices that for some reason still can't do autonegotiation even with up to date firmware as of 2018 and so need static speed and duplex settings to come up. But it would eliminate 90+% of the port-specific configurations, making those edge cases a potentially more manageable problem.

Thanks for the thoughts.

locampo · ‎10-18-2018

@OtakarKlier

Yes, I have also seen instances where stacking switches introduces an additional logical single point of failure. It's tricky, because stacking switches has many other advantages (like easier troubleshooting, less management overhead, and fewer connections back to the core). But indeed, I've seen bugs where something in the management plane fails and then causes the data plane to progressively fail, and in a stack scenario that failure is stack-wide.

I'm surprised that we are still managing switches the way we are in this day and age. In the wireless arena, everything is now centrally managed in some kind of controller or virtual controller based system. For a long time now, deploying a new access point has been plug it in, let it get a DHCP address and find the controller, then you provision it in the controller and you're pretty much done. Why this has not been the case for switching is beyond me.

In any case, standalone switches would still need to be physically swapped, and the users would still be down in the meantime. I think the strategy we're leaning toward is to deploy every stack with an N+1 switch that we just manually TREAT as a spare. We'll shut down all its ports so it can't be used, and in the event of a single switch physically failing we can SSH into the stack, renumber the spare to the member number that failed, and let it pick up the config. Then we can tell someone on site to physically move cables 1:1 from the dead switch to the spare, and at that point we're back up and able to wait until we can find time to come in and properly remediate.
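Since the cable move would be done by someone who isn't a network engineer, it would be worth checking the result remotely afterwards. A minimal sketch of that check, assuming we keep a snapshot of which MAC address was learned on which port before the failure (how the snapshot is collected is not shown, and the port names and MACs below are made-up examples):

```python
def find_miscabled(before: dict, after: dict) -> list:
    """Compare port->MAC snapshots taken before the failure and after the move.

    Returns (expected_port, mac, actual_port) for every device that is not
    back on its original port; actual_port is None if the MAC is not seen.
    """
    problems = []
    for port, mac in sorted(before.items()):
        if after.get(port) != mac:
            actual_port = next((p for p, m in after.items() if m == mac), None)
            problems.append((port, mac, actual_port))
    return problems


# Hypothetical snapshots for failed member 3, after the spare took its number.
before = {
    "GigabitEthernet3/0/1": "aa:aa:aa:00:00:01",
    "GigabitEthernet3/0/2": "aa:aa:aa:00:00:02",
    "GigabitEthernet3/0/3": "aa:aa:aa:00:00:03",
}
after = {
    "GigabitEthernet3/0/1": "aa:aa:aa:00:00:01",
    "GigabitEthernet3/0/2": "aa:aa:aa:00:00:03",  # cables 2 and 3 crossed
    "GigabitEthernet3/0/3": "aa:aa:aa:00:00:02",
}
for expected, mac, actual in find_miscabled(before, after):
    print(f"{mac}: expected on {expected}, seen on {actual}")
```

Devices that are simply powered off will show up with `None` as the actual port, so the output is a list of things to look at rather than a definitive list of mistakes.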

The wireless backup option for the endpoints is a good idea... but again, I think the additional wireless noise would kill the entire network. We'd need some way to keep the wireless NICs off and on standby until needed, but even then if the density is high the WLAN might not support all those wired clients in addition to the normal load of wireless clients.

OtakarKlier · ‎10-18-2018

Hello,

Your switch vendor should have some type of central control that you can use to manage all the switches. If not, there are third-party tools you can use to back up configs, monitor uptime, and perform upgrades.

As for the rest, there are many ways to go about this; it just depends on what your management will approve. I say offer them 2-3 options and let them make the choice, so that it's on their heads. The exception is central control of devices, which they should simply approve.

Regards,

RobinClayton · ‎10-19-2018

Central cloud-managed switching has been around for a number of years. But you need to step away from the Tier 1 vendors and look at things like Ubnt, Draytek, Meraki, etc.

For the situations where you have quirky equipment, configure specific ports on every switch for those items. If they don't support MDI-X auto-negotiation, then they're not going to support twin Ethernet connections and LACP.

Rob
