This is a bit off topic, but I thought some folks might have some knowledge and wisdom to offer.
Where I work we're working diligently to provide robust resiliency and redundancy for our firewalls using dual power supplies, HA, and multiple ISP circuits with policy-based routing for failover. Our core switching (also our core router) is also fully redundant, with an IRF stack of two H3C 7506 chassis in physically disparate locations on our main campus, connected via divergent fiber runs. All the distribution closets connect back to the core via an aggregation group consisting of a link to each half of the core. The weakest link in this scenario is down at layer 2: our leaf switches.
No matter what we think of, there's no getting around the fact that any single access switch failing is going to cause downtime for some group of users. Short of getting dual NICs in each machine and trying to build BAGs/LAGs for each workstation, what is the best way to mitigate the impact of this?
Obviously, invest in high-quality switches with fully redundant, hot-swappable PSUs and fans. I've thought that always maintaining an unused spare switch in the stack would be a potential strategy for N+1, so that if a single switch dies we can simply add the spare to replace the failed member remotely and then have someone physically move cables (now a task fit for a help desk technician rather than requiring a network engineer to be available to go manually replace the stack member). But I was wondering if anyone knows of any technology that might automate this or provide this warm spare feature explicitly.
Or any other thoughts or ideas are welcome as well.
The best idea that I can think of would be to configure spanning tree if you have that capability. Past that, you're really limited by the fact that the switches are Layer 2.
In the past we had 'spare' switches that had the same code level and a base config on them. If a prod switch bit the dust, we could then take our config backup and dump it onto the spare, replace the dead one, and plug everything back in.
The other thing we did in the past was use our wifi as a backup, if the machines are capable.
Just some thoughts.
Spanning Tree will protect against loops and therefore allows multiple paths in a switch fabric, which gives redundancy between the access layer and the core. It does not, however, allow for redundancy at the access layer itself... you know, what endpoints actually physically connect to. Nothing I can think of really does or could, except dual connections between each endpoint and multiple switches, but that would require some kind of link aggregation and would double the number of drops/patches/switches, making it cost-prohibitive for all but those with the heftiest of coffers.
But since this is the weakest link in the network's chain, and the one that actually serves the users themselves, I was wondering if there were solutions out there that work around this problem.
Yes, this is essentially what we do now. We have a stock of spare switches for each make and model and when one fails we load up the matching firmware and then rejoin it to the stack (on Comware this does not require restoring a config, you simply add it to the stack in place of the missing member). Our environment at the moment is all over the place in terms of code levels (bad, I know, but not uncommon), so it would not be possible to maintain a spare for each code level until we normalize our codebase across the organization.
Also, aside from needing to load up code, our current strategy relies on a network engineer being available to come into the office and physically install a switch. I was hoping that others might have found creative solutions to that requirement.
Your point about wireless is interesting. The only issue I see there is that making each wired host also a wireless host would put a huge additional load on our wireless infrastructure, and wireless capacity is already an issue in most places. It would be cool if there were a solution that could maintain a preconfigured and tested wireless NIC that would remain cold but automatically turn itself on and connect if the wired network failed (for example, if it were unable to ping some IP).
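I don't know of a product that does this, but the "cold until the wired probe fails" logic could be sketched roughly as below. Everything here is an assumption for illustration: the class and function names are made up, the failure threshold of 3 is arbitrary, and `ping_via` assumes a Linux host whose `ping` supports `-I` to bind the probe to the wired interface (otherwise the probe would start succeeding over Wi-Fi once it comes up). How the wireless NIC is actually switched on is left to whatever callable you pass in.

```python
import subprocess
import time
from typing import Callable


def ping_via(interface: str, probe_ip: str) -> bool:
    """Send one ping out of the given interface (assumes Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-I", interface, "-c", "1", "-W", "1", probe_ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class WifiFailover:
    """Keeps the wireless NIC off until the wired probe fails repeatedly."""

    def __init__(
        self,
        probe: Callable[[], bool],          # returns True while the wired path works
        set_wifi: Callable[[bool], None],   # hypothetical hook that enables/disables Wi-Fi
        threshold: int = 3,                 # consecutive failures before failing over
    ) -> None:
        self.probe = probe
        self.set_wifi = set_wifi
        self.threshold = threshold
        self.failures = 0
        self.wifi_on = False

    def tick(self) -> bool:
        """Run one probe cycle and return whether Wi-Fi is now enabled."""
        if self.probe():
            self.failures = 0
            if self.wifi_on:
                # Wired path is back: go cold again to free wireless capacity.
                self.set_wifi(False)
                self.wifi_on = False
        else:
            self.failures += 1
            if self.failures >= self.threshold and not self.wifi_on:
                self.set_wifi(True)
                self.wifi_on = True
        return self.wifi_on


def run(failover: WifiFailover, interval_seconds: float = 5.0) -> None:
    """Poll forever; intended to be started as a background service."""
    while True:
        failover.tick()
        time.sleep(interval_seconds)
```

Requiring several consecutive failures before bringing Wi-Fi up is a deliberate choice so that one dropped ping does not push a wired host onto an already busy wireless network.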
Spread "Departments" across different switches. Then any one failure affects only a few users in each of several departments rather than taking out a whole department, reducing the overall impact to the organisation.
Have ACTIVE on-hand spares.
Reduce the number of users per switch.
Having dual NICs and LACP would work.
But ultimately, who has dictated the need for a resilient access layer?
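To put a number on the first suggestion above, here is a small sketch that patches each department's drops round-robin across the switches in a closet and reports the worst-case share of a department lost when any one switch fails. The department names, switch names, and port counts are invented for the example.

```python
from collections import Counter
from itertools import cycle


def spread(ports_needed: dict[str, int], switches: list[str]) -> dict[str, Counter]:
    """Assign each department's drops round-robin across the switches."""
    layout = {sw: Counter() for sw in switches}
    rotation = cycle(switches)
    for dept, count in ports_needed.items():
        for _ in range(count):
            layout[next(rotation)][dept] += 1
    return layout


def worst_case_loss(layout: dict[str, Counter], ports_needed: dict[str, int]) -> dict[str, float]:
    """Largest fraction of each department that a single switch failure removes."""
    return {
        dept: max(layout[sw][dept] for sw in layout) / total
        for dept, total in ports_needed.items()
    }


# Hypothetical closet: two departments patched across three switches.
needs = {"Radiology": 12, "Billing": 6}
layout = spread(needs, ["sw1", "sw2", "sw3"])
loss = worst_case_loss(layout, needs)
```

With three switches, any single failure takes out a third of each department instead of potentially all of one.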
Nobody has dictated it, per se, but when a switch fails it's a "drop everything else" class emergency. When this happens on weekends or after hours, there's no one here to address it, and my employer is a hospital, so it's a 24x7x365 operation where anytime something breaks the words "patient safety" quickly get used.
In our current strategy, one of the two network engineers working M-F, 9-5, supporting all of the switches (close to 200, not to mention 30+ firewalls and ~300 access points across 28 or so physical locations) has to be available to physically replace the failed unit. They then console into the stack to run the commands necessary to join the new switch, pray that the operation doesn't impact the rest of the stack, confirm the switch came up as the correct member number with the right config, and manually reconnect cables (being careful that each patch cable goes into the same port in the new switch as in the old, lest some device requiring special VLAN or port configurations end up on a standard port).
I was wondering if someone had built a stacking framework that allowed for running a warm spare switch in the stack that would automatically take over the place and config of a failed member. That would reduce the repair procedure to just the moving cables part... which could easily be explained to and completed by a non-network engineer, say at 3am on a Sunday. N+1 is a concept that is ubiquitous in almost every other area of infrastructure, except for access layer switches (as far as I've seen).
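For the "moving cables" step specifically, a printed move sheet would help whoever is on site at 3am. A rough sketch follows, assuming Comware-style interface names of the form `<type><member>/<slot>/<port>` and assuming the spare's ports have already been given the failed member's port configs one-for-one. The function name and the example member IDs are made up.

```python
import re

# Assumed naming: <type><member>/<slot>/<port>, e.g. GigabitEthernet3/0/17
IFACE = re.compile(r"^(?P<kind>[A-Za-z-]+)(?P<member>\d+)/(?P<slot>\d+)/(?P<port>\d+)$")


def move_sheet(used_ports: list[str], failed_member: int, spare_member: int) -> list[tuple[str, str]]:
    """Map each in-use port on the failed member to the same port number on the spare."""
    moves = []
    for name in used_ports:
        m = IFACE.match(name)
        if m is None:
            raise ValueError(f"unrecognised interface name: {name}")
        if int(m["member"]) != failed_member:
            continue  # port lives on a healthy member; nothing to move
        target = f"{m['kind']}{spare_member}/{m['slot']}/{m['port']}"
        moves.append((name, target))
    return moves


# Hypothetical stack: member 3 has failed, member 5 is the spare.
sheet = move_sheet(
    ["GigabitEthernet3/0/17", "GigabitEthernet2/0/1", "Ten-GigabitEthernet3/0/49"],
    failed_member=3,
    spare_member=5,
)
for old, new in sheet:
    print(f"move cable from {old} to {new}")
```

Keeping the port number identical on the spare preserves the "same port in the new switch as the old" rule, so devices with special VLAN or port settings land on a matching port.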
802.1X (dot1x) is potentially what you're looking for, then.
When devices connect to a switch port they are authenticated, and the correct VLAN is provisioned from the central configuration database.
That way you don't need to have "Port Security" or VLANs pre-assigned to any ports, so a quick cable swap to any other switch port is all that should be required.
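As a rough illustration of what "provisioned from the central database" means on the wire: with dynamic VLAN assignment, the RADIUS server's Access-Accept carries three tunnel attributes (per RFC 3580) and the switch places the port in the named VLAN. The lookup table, the quarantine VLAN, and keying on MAC address are all assumptions made to keep the example short; a real 802.1X deployment would normally key on the authenticated identity instead.

```python
# Hypothetical central table: device MAC address -> VLAN ID.
DEVICE_VLANS = {
    "00:11:22:33:44:55": 110,
    "66:77:88:99:aa:bb": 240,
}
QUARANTINE_VLAN = 999  # assumed fallback for unknown devices


def vlan_reply_attributes(mac: str) -> dict[str, str]:
    """Build the RADIUS reply attributes that tell the switch which VLAN to use."""
    normalised = mac.lower().replace("-", ":")
    vlan = DEVICE_VLANS.get(normalised, QUARANTINE_VLAN)
    return {
        "Tunnel-Type": "VLAN",
        "Tunnel-Medium-Type": "IEEE-802",
        "Tunnel-Private-Group-ID": str(vlan),
    }


reply = vlan_reply_attributes("00-11-22-33-44-55")
```

Because the VLAN follows the device rather than the port, the same reply is returned no matter which switch port the cable ends up in.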
Coming from a person who has spent a lot of time in a hospital bed for one reason or another (me), I can appreciate the work you guys do. While there cannot be a warm spare switch, there is nothing that says you cannot have a switch in the stack that is for 'emergencies' only, meaning it's there, working, and ready for someone to just move patch cables to it.
However, I have used many different brands, and even when the vendor states that a failure in one switch won't affect the stack, I have seen otherwise :(. For that reason, at the last place I worked we didn't stack anything (300+ access switches). All the switches were standalone and we separated phones and computers onto different switches. While we were not a 24/7 shop, we were a call center, and downtime meant tons of money lost and potential contracts lost.
Getting funding for these items shouldn't be an issue. You already mentioned the argument to use: 'patient safety'. Use the same argument to get additional switching, in-house on-call admins, or an outsourced company that can come in within a short timeframe to assist.
But you are correct, the access layer is a single point of failure. I still suggest a wifi backup, and make sure the switches the APs connect to are not access layer switches.
Just some thoughts.
That's good advice. However, 802.1x and NAC wouldn't completely remove the need to occasionally have specific port configurations for things like, oh say, random medical devices that for some reason still can't do autonegotiation even with up to date firmware as of 2018 and so need static speed and duplex settings to come up. But it would eliminate 90+% of the port-specific configurations, making those edge cases a potentially more manageable problem.
Thanks for the thoughts.