Frequent Running Of AddrObjRefresh Blocking Commits?


L2 Linker

Hi!

I'm new here, so will probably be asking lots of dumb questions ... but hopefully relatively interesting dumb questions.

Currently running PAN-OS 6.0.7 on a pair of PA-5060s (active-passive). Not currently live (switchover day is 17th January), and we're considering upgrading to 6.1.1 next week.

Recently (last few days) ran into an issue whereby configuration commits are being blocked ("Another commit/validate is in progress. Please try again later"). I can work around this by doing a graceful reboot of our pair of PA-5060s but this seems a little extreme and I thought I would dig into it a bit deeper (and raise a support call).

Only significant change recently (did I hear "Yeah! Right!" from the hecklers at the back?), was the addition of one of our Active Directory servers to Device-> User Identification -> Group Map Settings.

What I've been able to figure out on my own was that there seems to be a job (show jobs all) called "AddrObjRefresh" that seems to be kicking off every 10 seconds or so :-

Enqueued                     ID             Type    Status Result Completed
--------------------------------------------------------------------------
2015/01/10 14:54:20        8290   AddrObjRefresh       ACT   PEND        0%
2015/01/10 14:54:09        8289   AddrObjRefresh       FIN     OK 14:54:19
2015/01/10 14:53:59        8288   AddrObjRefresh       FIN     OK 14:54:08
2015/01/10 14:53:48        8287   AddrObjRefresh       FIN     OK 14:53:58
2015/01/10 14:53:38        8286   AddrObjRefresh       FIN     OK 14:53:47
2015/01/10 14:53:27        8285   AddrObjRefresh       FIN     OK 14:53:37
2015/01/10 14:53:17        8284   AddrObjRefresh       FIN     OK 14:53:26
2015/01/10 14:53:06        8283   AddrObjRefresh       FIN     OK 14:53:16
2015/01/10 14:52:56        8282   AddrObjRefresh       FIN     OK 14:53:05
2015/01/10 14:52:45        8281   AddrObjRefresh       FIN     OK 14:52:55
2015/01/10 14:52:35        8280   AddrObjRefresh       FIN     OK 14:52:44
2015/01/10 14:52:24        8279   AddrObjRefresh       FIN     OK 14:52:34
2015/01/10 14:52:13        8278   AddrObjRefresh       FIN     OK 14:52:23
2015/01/10 14:52:03        8277   AddrObjRefresh       FIN     OK 14:52:12
2015/01/10 14:51:52        8276   AddrObjRefresh       FIN     OK 14:52:02
2015/01/10 14:51:42        8275   AddrObjRefresh       FIN     OK 14:51:51
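
To put a number on "every 10 seconds or so", here is a minimal Python sketch (not a PAN-OS tool) that parses a few rows copied from the listing above and works out the gap between consecutive enqueue times. The names parse_jobs and enqueue_gaps are made up for illustration, and the column layout is assumed to match the listing exactly.

```python
import re
from datetime import datetime

# A few rows copied from the `show jobs all` listing in this post.
SAMPLE = """\
2015/01/10 14:54:20        8290   AddrObjRefresh       ACT   PEND        0%
2015/01/10 14:54:09        8289   AddrObjRefresh       FIN     OK 14:54:19
2015/01/10 14:53:59        8288   AddrObjRefresh       FIN     OK 14:54:08
2015/01/10 14:53:48        8287   AddrObjRefresh       FIN     OK 14:53:58
"""

# Assumed layout: enqueue timestamp, job ID, type, status, result, completed.
ROW = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)$"
)


def parse_jobs(text):
    """Return one dict per job row; header and separator lines are skipped."""
    jobs = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue
        enqueued, job_id, job_type, status, result, completed = match.groups()
        jobs.append({
            "enqueued": datetime.strptime(enqueued, "%Y/%m/%d %H:%M:%S"),
            "id": int(job_id),
            "type": job_type,
            "status": status,
            "result": result,
            "completed": completed,
        })
    return jobs


def enqueue_gaps(jobs, job_type="AddrObjRefresh"):
    """Seconds between consecutive enqueue times for one job type."""
    times = sorted(job["enqueued"] for job in jobs if job["type"] == job_type)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


jobs = parse_jobs(SAMPLE)
gaps = enqueue_gaps(jobs)
print(gaps)
```

Each finished job in the listing also completes about a second before the next one is enqueued, so the queue is effectively never free of an AddrObjRefresh job.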

Now for some questions :-

  1. Would you think that this might be what is interfering with our commit issue?
  2. Is running AddrObjRefresh so frequently a normal thing? If not, is there anything I can do to diagnose what it is trying to do?
  3. Is it possible/sensible to try and temporarily stop the AddrObjRefresh job from being scheduled?

Please be gentle: I'm a clueless newbie :-)

6 REPLIES

L4 Transporter

MikeMeredith,

There are no 'dumb' questions. We all start somewhere, so no worries!

You're on the right path with checking the jobs. Good work.

Could you run the following command in the CLI, via SSH session, and log the session output to a file?

'tail follow yes mp-log ms.log'

This will show you the output from the ms.log management log file. This may show us some more details about what is going on when these AddrObjRefresh jobs kick off.


I'm referencing a very similar case here, with a pair of 5000 series having the exact same issue. (225414)

They also had set up User-ID recently and experienced the same behavior, so this could be the same issue, or very close.


How many profiles were set up for User-ID (LDAP, etc.)?


In their scenario, the firewall was having trouble communicating with the LDAP server due to other network issues, so User-ID was taking up more CPU/RAM than expected and constantly refreshing as it attempted to pull the group information.


I will bet you $100 that if you were to disable the recent changes with User-ID and the server profiles, the issue would stop occurring. If you get the chance, please try this out and verify while it is not in production. This will help narrow down the issue.


Let me know and we can go from there,


Thanks!








Thanks for the reply.

I'm also pretty sure that disabling the "AD" profile will sort things out, but that's a test for Monday unless there's a CLI command along the lines of "commit when you next have a chance to do so" 🙂

There are two LDAP server profiles, but only one is enabled (the Active Directory one). I'd be mildly surprised if there are network issues between either PA-5060 and the relevant AD server: although they're on different VLANs, there's not a great deal of network between them. But I'd not be very surprised to find our Active Directory is in a bit of a state - a certain well known Python script that exports the binary database (1.6Gbytes!) to a text version took over a week to run just before Xmas.

I'll raise a support call with our reseller and mention 225414.

A short extract from the output of tail follow yes mp-log ms.log:-

2015-01-11 07:38:41.515 +0000 device server refresh triggered via sysd
2015-01-11 07:38:41.515 +0000 Aborting. Another refresh in progress
2015-01-11 07:38:42.162 +0000 client useridd disabled/restarted
2015-01-11 07:38:44.171 +0000 client useridd enabled
2015-01-11 07:38:44.171 +0000 device server refresh triggered via sysd
2015-01-11 07:38:44.172 +0000 Aborting. Another refresh in progress
2015-01-11 07:38:44.823 +0000 client useridd disabled/restarted
2015-01-11 07:38:45.579 +0000 client device reported Phase 1 was SUCCESSFUL
2015-01-11 07:38:46.842 +0000 client useridd enabled
2015-01-11 07:38:46.843 +0000 device server refresh triggered via sysd
2015-01-11 07:38:46.843 +0000 dnscfgmod: Main refresh function: (unknown)
2015-01-11 07:38:46.849 +0000 dnscfgmod:Fqdn refresh job 14006 scheduled
2015-01-11 07:38:47.503 +0000 client useridd disabled/restarted
2015-01-11 07:38:49.510 +0000 client useridd enabled
2015-01-11 07:38:49.511 +0000 device server refresh triggered via sysd
2015-01-11 07:38:49.511 +0000 Aborting. Another refresh in progress
2015-01-11 07:38:50.187 +0000 client useridd disabled/restarted
2015-01-11 07:38:52.189 +0000 client useridd enabled
2015-01-11 07:38:52.190 +0000 device server refresh triggered via sysd
2015-01-11 07:38:52.190 +0000 Aborting. Another refresh in progress
2015-01-11 07:38:52.844 +0000 client useridd disabled/restarted
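
For a longer capture, a quick tally of the messages makes the useridd restart loop easier to see than reading the log by eye. This is a rough Python sketch, assuming every entry starts with a timestamp in the same format as the extract above. It also copes with two entries landing on one physical line, which can happen when this output is pasted. The names split_entries and counts are hypothetical.

```python
import re
from collections import Counter

# Lines copied from the ms.log extract; the second line deliberately carries
# two entries run together, as can happen in pasted output.
SAMPLE = """\
2015-01-11 07:38:41.515 +0000 device server refresh triggered via sysd
2015-01-11 07:38:41.515 +0000 Aborting. Another refresh in progress2015-01-11 07:38:42.162 +0000 client useridd disabled/restarted
2015-01-11 07:38:44.171 +0000 client useridd enabled
2015-01-11 07:38:44.171 +0000 device server refresh triggered via sysd
"""

# Assumed timestamp layout: date, time with milliseconds, UTC offset, space.
STAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{4} ")


def split_entries(text):
    """Return (timestamp, message) pairs, splitting lines that hold several."""
    entries = []
    for line in text.splitlines():
        marks = list(STAMP.finditer(line))
        for index, mark in enumerate(marks):
            # An entry runs until the next timestamp or the end of the line.
            if index + 1 < len(marks):
                end = marks[index + 1].start()
            else:
                end = len(line)
            entries.append((mark.group().strip(), line[mark.end():end].strip()))
    return entries


entries = split_entries(SAMPLE)
counts = Counter(message for _, message in entries)
for message, count in counts.most_common():
    print(count, message)
```

A high count of "client useridd disabled/restarted" relative to the length of the capture would back up the impression that useridd is restarting every few seconds.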

MikeMeredith,

You're welcome.

Your log output matches the same errors referenced from the ms.log file in the case I mentioned, so it looks like this may be the same issue after all.

I'm trying to relate their solution to our situation.

While raising a support call, please reference the output from ms.log, along with generating a TechSupport file for them to look at.

Specifically we are looking to see if the RAM/CPU usage of the process 'useridd' is abnormally high.

Could you do a 'show system info' and paste the output? What is the uptime?

I know this sounds like a cliché response, but have you rebooted the device since this started happening? Perhaps something is just hanging up in the management plane, but that would be too easy. :-)

You mentioned this is not in production yet, but have you been authenticating many users across the firewall?

Are you pulling all the groups from the AD or just some of them? How many groups are we talking about the firewall pulling/mapping ?

In the previous scenario, the useridd process was being 'oversubscribed'.

This can be potentially mitigated by limiting what groups the firewall will try to map and configuring the access list for the zone.

We are headed in the right direction here.

Let me know,

Thanks

EDIT: I'm unaware of a commit type where we can do that, but it would be highly convenient, right? :-)

Thanks again.

I'll get onto all the necessary details tomorrow when I log a call with support. But some quick responses :-

  1. Yes, the useridd process is hogging CPU at times ("show system resources follow", or "top" as it's quicker to type, often shows useridd at 99%).
  2. Can't list the groups through the firewall presently (I get an error), but there's a fair few - I recall seeing a figure of 3,000 but don't quote me on that.
  3. Yes the firewalls have been rebooted after the problem arose; they operated fine for a day and then we started being unable to commit changes (as you can imagine when migrating a large ruleset from an old firewall there's a few changes to be made!).
  4. No users authenticating (or being identified) as yet. There's no real traffic passing through, although we've done some "bench tests" - before adding in the AD details!
  5. And lastly for now :-

msm@Hula(active)> show system info

hostname: Hula
ip-address: 10.14.4.64
netmask: 255.255.254.0
default-gateway: 10.14.5.254
ipv6-address: unknown
ipv6-link-local-address: fe80::290:bff:fe37:dd0c/64
ipv6-default-gateway:
mac-address: 00:90:0b:37:dd:0c
time: Sun Jan 11 20:19:24 2015
uptime: 2 days, 13:34:16
family: 5000
model: PA-5060
serial: 001901000769
sw-version: 6.0.7
global-protect-client-package-version: 0.0.0
app-version: 480-2519
app-release-date: 2015/01/06  14:56:48
av-version: 1459-1932
av-release-date: 2015/01/08  04:00:01
threat-version: 480-2519
threat-release-date: 2015/01/06  14:56:48
wildfire-version: 0
wildfire-release-date: unknown
url-filtering-version: 0000.00.00.000
global-protect-datafile-version: 0
global-protect-datafile-release-date: unknown
logdb-version: 6.0.6
platform-family: 5000
logger_mode: False
vpn-disable-mode: off
operational-mode: normal
multi-vsys: off

MikeMeredith,

Do you need the firewall to pull all groups from the AD, or could we get granular and only have it pull a set of specific ones? That should keep the useridd process CPU/memory usage down.

That has to be the issue then... useridd is running above its means, so we need to optimize it I suppose.

Let me know how it goes!

Don't forget to mark any answers here as 'correct' or 'helpful'. :-)

Thanks!

L3 Networker

Do you use FQDN Objects or dynamic Objects?

What do you see if you use the following CLI command "request system fqdn show"?

You could try to stop the last commit and then commit your configuration.

>show jobs all

-remember the job id for the AddrObjRefresh commit

>delete job id <id you want to delete>

>commit
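
If the job list is long, picking out the job ID in the second step can be scripted. This is a minimal Python sketch, assuming the same "show jobs all" column layout posted earlier in the thread. The helper active_job_ids is hypothetical, and it only reads pasted text; it does not talk to the firewall.

```python
# Rows in the column layout of the `show jobs all` output posted in this thread.
SAMPLE = """\
2015/01/10 14:54:20        8290   AddrObjRefresh       ACT   PEND        0%
2015/01/10 14:54:09        8289   AddrObjRefresh       FIN     OK 14:54:19
2015/01/10 14:53:59        8288   AddrObjRefresh       FIN     OK 14:54:08
"""


def active_job_ids(text, job_type="AddrObjRefresh"):
    """Return IDs of jobs of the given type whose status column is ACT."""
    ids = []
    for line in text.splitlines():
        # Assumed fields: date, time, job ID, type, status, result, completed.
        fields = line.split()
        if (len(fields) >= 5 and fields[2].isdigit()
                and fields[3] == job_type and fields[4] == "ACT"):
            ids.append(int(fields[2]))
    return ids


pending = active_job_ids(SAMPLE)
print(pending)
```

With jobs being re-enqueued roughly every ten seconds, the ID found this way may already be stale by the time it is used, so it is worth re-checking just before acting on it.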
