OSPF adjacency flapping - normal?

fjwcash · ‎03-14-2018

While trying to track down the cause of 3 recent Internet outages we've experienced at one of our schools (we still haven't determined the cause), we've noticed that our OSPF adjacencies are flapping up and down across the district. Multiple times per day, across multiple sites, going back to the beginning of last month (that's as far back as the logs go on the district core firewall).

Is this normal or something we should be concerned with? Could this be the reason we get hiccups in our connections to the schools (where you can be typing in an SSH session and suddenly all the characters stop appearing for 10 seconds then appear slowly then appear normally again) when network usage for the school is fairly low? Could this be the reason for 5-10 minute outages like we've experienced the past two days (nothing showing in the logs on the fibre switches, no links up/down, no STP outages, etc)? Could this get to the point where our entire WAN goes down?

I'm very new to OSPF and routing protocols in general, coming from a static routing background dealing only with the connections on the "inside" of the telco router at a remote site (each site with their own connection to the Internet). We've since migrated to a proper WAN setup using OSPF internally, with a single connection to the public Internet for the whole district.

Our WAN consists of 3 separate networks that all terminate at the district office: an MPLS link with the local telco for the out-of-town schools, a point-to-point fibre network in town, and a point-to-point wireless network for schools we can't reach with fibre yet. For the MPLS links, the OSPF is established between an L3 switch in the district office (upstream from the district firewall) and the PA firewall in the school. For the fibre and wireless networks, the OSPF is established between the PA firewall in the district office and the PA firewall in the school (we use layer 2 vlans across the fibre/wireless network terminating on the PA firewall). Other than the Router ID and neighbour config, the OSPF setup on all the firewalls is virtually identical (everything is in Area 0).

We haven't had any issues (that we know of) with the above setup, although we do understand that it's sub-optimal (we're looking at what it would take to have all of the OSPF links terminate on the L3 switch instead, such that the district firewall stops being a router too).

So, should I be worried about the OSPF adjacencies flapping? Should I spend time on figuring those out? Or are they a red herring to some other issue?

Most of the OSPF "outages" are under 10 seconds. The only ones that are longer (3-5 minutes) are for the school that lost connectivity completely 3 times in the last two days (but, not sure if that's the cause or just a symptom).

OtakarKlier · ‎03-14-2018

Hello,

Are you referring to the following messages as flaps?

OSPF adjacency with neighbor has gone down. interface tunnel.XX, neighbor router ID 1.2.3.4, neighbor IP address 1.2.3.5

OSPF doesn't flap on its own. In my experience on single links such as yours, the reason is an actual link issue and not OSPF if your settings are default. To mitigate this we have two ways to get back to our data centers: P2P WAN links as primary, and VPN tunnels over the internet to the data centers. So each of our sites has 2 connections to its 'closest' data center and 1 to the remote one for additional redundancy. I have a 100M fiber link that is just for internet access at one of my sites, but it also does this: for a few seconds, randomly, it blips, and that used to down the site.

Regards,

fjwcash · ‎03-14-2018

@OtakarKlier wrote:
Hello,
Are you referring to the following messages as flaps?

OSPF adjacency with neighbor has gone down. interface tunnel.XX, neighbor router ID 1.2.3.4, neighbor IP address 1.2.3.5

Yes, these are the messages I'm seeing, multiple times per day, going back at least two months. 3-10 seconds later will be a message that OSPF adjacency has been established again.

Just wondering if these are normal, or something to be concerned with?
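To put numbers on how long each flap lasts, the down/up messages can be paired per neighbour. Below is a minimal Python sketch of that idea. It assumes the log has already been read into (timestamp, message) pairs; the "gone down" pattern follows the message quoted above, but the wording of the "established" message and the helper name `flap_durations` are assumptions and would need adjusting to the real logs.

```python
import re

# Pattern for the "gone down" message quoted in this thread.
DOWN_RE = re.compile(r"adjacency with neighbor has gone down.*neighbor router ID (\S+?),")
# ASSUMPTION: the exact wording of the "established" message is not shown
# in the thread, so this pattern is a guess and should be adapted.
UP_RE = re.compile(r"adjacency.*established.*neighbor router ID (\S+?),")


def flap_durations(events):
    """Pair down/up events per neighbour router ID.

    events: iterable of (datetime, message) tuples in time order.
    Returns a list of (router_id, seconds_down) tuples.
    """
    down_since = {}
    flaps = []
    for ts, msg in events:
        m = DOWN_RE.search(msg)
        if m:
            # Remember only the first "down" until the neighbour comes back.
            down_since.setdefault(m.group(1), ts)
            continue
        m = UP_RE.search(msg)
        if m and m.group(1) in down_since:
            start = down_since.pop(m.group(1))
            flaps.append((m.group(1), (ts - start).total_seconds()))
    return flaps
```

Sorting the result by duration would separate the sub-10-second blips from the multi-minute outages, which may well have different causes.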

OtakarKlier · ‎03-14-2018

Hello,

In short, yes, be concerned, but it might not be your config. What is the latency across the circuits? Are they ever full, i.e. at 100% capacity? One thing you might try if the links are full or near capacity is to set up QoS and give priority to routing protocols, in this case OSPF, over all other traffic, even voice. If the circuits have low latency and are not near capacity, then I would call the provider and have them test the circuits to verify.

If you don't have monitoring software, you could download a free version and monitor the links with pings and watch drops/latency/jitter etc.
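As a rough illustration of what "watch drops/latency/jitter" means in practice, here is a small Python sketch that summarises a series of ping results. It is only an assumption about how one might post-process samples: the function name `summarize_pings` is hypothetical, lost probes are represented as `None`, and jitter is taken as the mean absolute difference between consecutive answered probes (one common definition, not the only one).

```python
from statistics import mean


def summarize_pings(rtts_ms):
    """Summarise ping samples.

    rtts_ms: list of round-trip times in milliseconds, None for a lost probe.
    Returns loss percentage, average RTT and a simple jitter estimate.
    """
    if not rtts_ms:
        raise ValueError("need at least one sample")
    answered = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(answered)) / len(rtts_ms)
    avg = mean(answered) if answered else None
    # Jitter: mean absolute change between consecutive answered probes.
    diffs = [abs(b - a) for a, b in zip(answered, answered[1:])]
    jitter = mean(diffs) if diffs else None
    return {"loss_pct": loss_pct, "avg_ms": avg, "jitter_ms": jitter}
```

Collecting samples every second or so, rather than every few minutes, is what makes short blips visible.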

Hope that helps.

fjwcash · ‎03-14-2018

We're the provider of the links. 🙂

We use Ubiquiti point-to-point and point-to-multi-point wireless connections between schools, connecting back to a site with a fibre link, that connects back to the central office. These are very low-latency links (generally under 5 ms, the longest link is 15 ms) and nowhere near saturated. Most schools have dedicated 100 Mbps links with 30-50 Mbps usage; a handful of schools share a 700 Mbps link that shows under 200 Mbps usage.

It could be that we're dropping packets on these wireless links, including the OSPF hello packets, which is causing the OSPF link to drop and re-establish. If I'm reading the OSPF docs right, though, that would require a 40 second network outage (or really bad luck to drop the 4 hello packets without affecting other traffic).
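The "really bad luck" case can be estimated with a back-of-the-envelope calculation. The Python sketch below assumes each hello is lost independently at a fixed rate, which wireless links usually violate (loss tends to come in bursts), so the numbers are likely an underestimate. The function names are hypothetical; the defaults are the 10 second hello interval and 4 missed hellos discussed here.

```python
def p_adjacency_drop(loss_rate, dead_counts=4):
    """Probability that `dead_counts` consecutive hellos are all lost,
    ASSUMING independent loss per packet (bursty loss makes this worse)."""
    return loss_rate ** dead_counts


def expected_drops_per_day(loss_rate, hello_interval=10, dead_counts=4):
    """Rough expected number of adjacency drops per day: the number of
    hellos sent per day times the chance that a run of losses starts there."""
    hellos_per_day = 86400 / hello_interval
    return hellos_per_day * p_adjacency_drop(loss_rate, dead_counts)
```

Under these assumptions, a sustained 10% loss rate gives roughly 0.86 drops per day, while 1% gives effectively zero. Several flaps per day at low average loss would therefore point towards short bursts of total loss rather than random drops.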

Would playing with the Hello Interval, Dead Counts, or similar timings make a difference in such an environment?

Everything is monitored via Nagios and LibreNMS, so we have graphs and alerts out the wazoo, but they aren't real-time (polling runs every 5 minutes and takes just about the full 5 minutes to query everything). Nothing has been flagged as "bad", although we do get the odd jump to 200 ms latency to some sites, and the occasional jump to 10 % packet loss. But that's every few days, not multiple times per day. And rarely for more than 1 polling cycle.

Or, is this barking up the wrong tree? Do we need to look deeper into the Ubiquiti links, which will "fix" the OSPF issues running on top?

fjwcash · ‎03-14-2018

As we are using layer 2 vlans between the district office and the remote school (vlan across the wireless links, too), would changing the OSPF Link Type from "broadcast" to "p2p" make a difference here?

All of the schools use the same vlan and connect back to the same switch (that the district firewall is plugged into, with the vlans terminating on the firewall). In essence, these are point-to-point links (the district firewall is the neighbour for each of the school firewalls). Would using the default "broadcast" type with all the multicast packets going out to all the sites be an issue?

OtakarKlier · ‎03-14-2018

Hello,

Sorry, I misread the provider part of it, but yes, wireless can be 'exciting' :). If you are going to change from the default values, just be careful and make the changes on the far side of the link first rather than the near/hub side, as doing it the other way round can cause OSPF issues and potentially drop routes.

•	Hello Interval (sec)—Interval, in seconds, at which the OSPF process sends hello packets to its directly connected neighbors (range is 0-3600; default is 10).

•	Dead Counts—Number of times the hello interval can occur for a neighbor without OSPF receiving a hello packet from the neighbor, before OSPF considers that neighbor down. The Hello Interval multiplied by the Dead Counts equals the value of the dead timer (range is 3-20; default is 4).

Try increasing the Hello Interval to 20 from the default 10 and see if the issues still appear. With Dead Counts left at 4, this will take the dead timer from 40 seconds to 80 seconds.
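The arithmetic behind that change can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the ranges are the ones quoted in the parameter descriptions above. The second helper reflects the standard OSPF rule that two neighbours only form an adjacency when their hello and dead intervals match, which is why timer changes have to be made on both ends of a link.

```python
def dead_timer(hello_interval=10, dead_counts=4):
    """Dead timer in seconds = Hello Interval x Dead Counts."""
    if not 0 <= hello_interval <= 3600:
        raise ValueError("hello interval outside the quoted 0-3600 range")
    if not 3 <= dead_counts <= 20:
        raise ValueError("dead counts outside the quoted 3-20 range")
    return hello_interval * dead_counts


def timers_compatible(a, b):
    """a, b: (hello_interval, dead_counts) for each end of a link.
    Both the hello interval and the resulting dead timer must match."""
    return a[0] == b[0] and dead_timer(*a) == dead_timer(*b)
```

The trade-off is that a longer dead timer tolerates more lost hellos but also takes longer to notice a link that is genuinely down.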

This article uses Cisco but the concept is the same for any OSPF configuration:

https://networklessons.com/ospf/ospf-hello-and-dead-interval/

Also check whether you can set up QoS on both the PANs and the Ubiquiti gear to prioritize the OSPF packets.

Regards,

OtakarKlier · ‎03-14-2018

So PAN would say that setting the Link type to P2P would be best since they are the only devices talking, however I have seen it work in broadcast without issue.

When you say all the schools use the same vlan, are you saying it's one big flat vlan, or does each school have its own vlan ID and subnet (I hope the latter, for security)? Unless the Ubiquiti is blocking some of the multicast packets, it shouldn't be an issue. But if you do make the change, make sure it's the far side first :).

fjwcash · ‎03-14-2018

We originally had each school on their own vlan and just did static routing from the district office through the fibre/wireless network. Each school had their own public subnet and FreeBSD firewall.

Then we got migrated over to an MPLS network for all the schools terminating at the district office, with two /24 subnets for the entire district (each school gets 5-8 IPs distributed via OSPF). The contractors that implemented that, including the initial installation and configuration of the Palo Alto firewalls, just mirrored the MPLS setup onto our vlan setup (1 MPLS tag for Internet traffic, 1 MPLS tag for in-district tech traffic, 1 MPLS tag for in-district management traffic became 1 vlan for Internet, 1 vlan for tech traffic, 1 vlan for management traffic).

So now we have 1 interface on the district firewall with 3 vlans terminated there. That connects to a switch that distributes fibre links to the secondary schools, that connect to the Ubiquiti wireless dishes to connect the elementary schools. And the 3 vlans are pushed through each of those links.

So, yes, 1 "flat" vlan setup to all the schools on our private network. With the firewalls in the schools handling all the traffic, security policies, NAT, etc. With OSPF between the district firewall and the school firewalls.

fjwcash · ‎03-22-2018

Okay, after some further digging and testing, it appears the OSPF setup we have may be sub-optimal, and cannot be (easily) switched to p2p/p2mp.

On the MPLS side of things, the school firewall has a separate /30 subnet for each of the vlans, and the telco router next to it is the neighbour (other end of the /30 subnet). Then the telco does their magic in the MPLS "cloud". The telco router in our data centre then has a /30 subnet on it for each vlan, with the district firewall being the OSPF neighbour, using the same /30 subnets for each vlan. IOW, there's a single IP/subnet on each firewall and router. And they're basically directly connected via Ethernet patch cables. So the school firewall sends a multicast/broadcast out one physical interface, to a directly-attached telco router. And the district firewall sends a multicast/broadcast out one physical interface, to a directly-attached telco router. And the telco routers do their magic behind the scenes to connect everything across the MPLS network.

On the fibre/wireless side of things, the contractor just emulated the same setup, making the district firewall take the place of the telco router, putting each of the /30 subnets from each school onto a single interface on the district firewall for the OSPF. So the school firewalls are configured the same as on the MPLS network (1 /30 subnet for each vlan). But the district firewall has all of the /30 subnets for all the schools on the same physical interface.

So, the OSPF setup on the fibre/wireless network (which is basically a single flat layer 2 network with the same 3 vlans connecting the district firewall and each of the school firewalls) is using multicast/broadcast across 84 separate subnets (28 schools x 3 vlans), to reach a neighbour that is logically 1 hop away. This seems ... very sub-optimal.

Unfortunately, moving to a (hopefully) better setup would require modifying the OSPF setup on all the school firewalls, switching to using a single /24 private subnet to connect the firewalls on the fibre/wireless network, and using only a single IP on the district firewall for the OSPF interfaces. Which means downtime for 28 schools simultaneously. 😞
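For a sense of scale, the two addressing plans can be compared with Python's `ipaddress` module. This is a sketch only: the `10.255.0.0` prefix and the function names are made up for illustration, and it assumes one address per firewall on the shared subnet (28 schools plus the district firewall).

```python
import ipaddress

SCHOOLS = 28
VLANS = 3


def current_plan_subnets(schools=SCHOOLS, vlans=VLANS):
    """One /30 per school per vlan, all on one district interface."""
    return schools * vlans


def shared_subnet_fits(prefix="10.255.0.0/24", schools=SCHOOLS):
    """Check whether one shared subnet has room for every school firewall
    plus the district firewall. The prefix is a hypothetical example."""
    net = ipaddress.ip_network(prefix)
    usable = net.num_addresses - 2  # minus network and broadcast addresses
    needed = schools + 1            # school firewalls + district firewall
    return usable, needed, usable >= needed
```

`current_plan_subnets()` gives the 84 subnets mentioned above, while a single /24 offers 254 usable addresses against the 29 that would be needed.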

Is this a correct assumption? Is this something that would be worthwhile pursuing, or is the current setup "okay"?

OtakarKlier · ‎03-22-2018

Hello,

Do all your PANs see each other as OSPF neighbors, or does each school location see only the district fw? If they are all neighbors, then I would really worry. If they are not, and the district fw is a neighbor to the rest of them, there could be minor tweaks, but I don't think it's the cause of your original issue.

BTW did increasing the hello timer help out?

Regards,

fjwcash · ‎03-22-2018

Each firewall sees only a single neighbour.

For the telco network, the school firewalls only see the telco router as a neighbour, and the district firewall sees only the telco router (via the Internet interface).

For the fibre/wireless network, the school firewalls only see the district firewall as a neighbour, using their own /30 subnet. The district firewall sees all of the school firewalls as neighbours. So it's a star configuration, using a separate fibre interface on the district firewall.

The school firewalls do not see the other school firewalls as neighbours.

We haven't made any changes to the OSPF setup as yet. We're still investigating the setup, figuring out how it all works.

OtakarKlier · ‎03-22-2018

Hello,

I think I might have gotten confused along the way somewhere. Would this be a simplified representation of the network between the district and the schools?

If yes, then I think the setup is OK.

fjwcash · ‎03-22-2018

Not really. There's only a single link into any school. Below is a crude ASCII drawing for it.

Internet <--------> Telco router <--------> District firewall <--------> fibre/wireless <---------> School firewall
                         |                                                             \ \--------> School firewall
                         |                                                              \---------> School firewall
                         V
                    Telco router <-----> MPLS <------> Telco router <----> School firewall
                         |                ^
                  School firewall         |
                                          V
                                    Telco router <-------> School firewall

School firewall neighbours with telco router.    \
Telco router neighbours with telco router.        |---> all of these are done with 1:1 connections
Telco router neighbours with district firewall.  /

School firewall A neighbours with district firewall  \
School firewall B neighbours with district firewall   |---> schools use 1 IP, district firewall has separate IPs for each school
School firewall C neighbours with district firewall  /      (but the district firewall uses only a single physical interface for this)

Hope that makes things a little clearer. 🙂

jandreini · ‎03-22-2018

So each school firewall has an OSPF adjacency with both a telco router and the district firewall (via wireless)? Or is the MPLS network layer 2 and transparent to your firewalls? If there are two adjacencies, which ones are flapping?

I apologize if I missed it, but what version of PAN-OS are you running, and how is BFD configured?
