02-20-2013 07:07 AM
After my Palo Alto PA-2050 pair in HA active/passive has been up for about a week, I begin to get errors when committing policies.
Management server failed to send phase 1 abort to client logrcvr
Management server failed to send phase 1 abort to client sslvpn
Management server failed to send phase 1 abort to client websrvr
commit failed
This gets worse as uptime increases. The problem existed in 4.1.9 and 4.1.8 as well. It began when I started using FQDN objects, although I do not know if that is related.
If I retry the commit, it eventually takes. This occurs from both the web GUI and the CLI.
I contacted support, and they simply suggested a commit force.
Can anyone shed any additional insight into this problem?
Thank you.
02-20-2013 07:31 AM
The short answer is that the management process is running high and consuming all of the memory/CPU on the management plane. The command "debug software restart management-server" will solve the problem temporarily. It will take 10-15 minutes before you can log back into the firewall to make the commit.
PAN Support will need your tech-support files to see why it is running so high.
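As a sketch, the temporary workaround described above would look something like this from the CLI (the prompt and hostname here are illustrative, not from the original post):

```
admin@PA-2050> show system resources | match mgmtsrvr
admin@PA-2050> debug software restart management-server
... wait 10-15 minutes for the management plane to come back up ...
admin@PA-2050> commit
```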
02-20-2013 11:11 AM
Hi,
There can be a few reasons why the commit is failing.
As mentioned above, it is quite possible that the management plane is running high.
High CPU can be caused by any single process (user-id, reporting, etc.).
You can run either of the following commands on the device to see if your management plane is running high:
show system resources
show system resources follow
Here is a sample output
admin@> show system resources
top - 11:00:16 up 22 days, 18:12, 1 user, load average: 0.29, 0.08, 0.11
Tasks: 101 total, 2 running, 99 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.2%us, 1.4%sy, 1.5%ni, 94.1%id, 0.6%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 995872k total, 915016k used, 80856k free, 4836k buffers
Swap: 2212876k total, 863728k used, 1349148k free, 165548k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20344 30 10 47116 7336 4144 S 46 0.7 0:00.24 pan_logdb_index
20345 30 10 26984 4260 1948 R 15 0.4 0:00.08 sdb
19778 30 10 3896 1284 1112 S 4 0.1 0:02.05 genindex.sh
20338 20 0 4468 1028 800 R 4 0.1 0:00.05 top
1 20 0 1836 564 536 S 0 0.1 0:02.48 init
2 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
3 RT 0 0 0 0 S 0 0.0 0:08.24 migration/0
4 20 0 0 0 0 S 0 0.0 0:00.17 ksoftirqd/0
5 RT 0 0 0 0 S 0 0.0 0:08.23 migration/1
6 20 0 0 0 0 S 0 0.0 0:00.07 ksoftirqd/1
7 20 0 0 0 0 S 0 0.0 1:46.39 events/0
8 20 0 0 0 0 S 0 0.0 0:40.84 events/1
9 20 0 0 0 0 S 0 0.0 0:00.02 khelper
12 20 0 0 0 0 S 0 0.0 0:00.00 async/mgr
112 20 0 0 0 0 S 0 0.0 0:00.00 sync_supers
114 20 0 0 0 0 S 0 0.0 0:00.00 bdi-default
115 20 0 0 0 0 S 0 0.0 0:11.58 kblockd/0
116 20 0 0 0 0 S 0 0.0 0:04.81 kblockd/1
125 20 0 0 0 0 S 0 0.0 0:00.00 ata/0
126 20 0 0 0 0 S 0 0.0 0:00.00 ata/1
127 20 0 0 0 0 S 0 0.0 0:00.00 ata_aux
132 20 0 0 0 0 S 0 0.0 0:00.00 khubd
135 20 0 0 0 0 S 0 0.0 0:00.00 kseriod
156 20 0 0 0 0 S 0 0.0 0:00.00 rpciod/0
157 20 0 0 0 0 S 0 0.0 0:00.00 rpciod/1
172 20 0 0 0 0 S 0 0.0 36:30.13 kswapd0
173 20 0 0 0 0 S 0 0.0 0:00.00 aio/0
174 20 0 0 0 0 S 0 0.0 0:00.00 aio/1
175 20 0 0 0 0 S 0 0.0 0:00.00 nfsiod
732 20 0 0 0 0 S 0 0.0 0:00.04 octeon-ethernet
760 20 0 0 0 0 S 0 0.0 0:00.00 scsi_eh_0
765 20 0 0 0 0 S 0 0.0 0:00.99 mtdblockd
793 20 0 0 0 0 S 0 0.0 0:00.00 usbhid_resumer
833 20 0 0 0 0 S 0 0.0 0:41.97 kjournald
886 16 -4 1996 404 400 S 0 0.0 0:01.23 udevd
1867 20 0 0 0 0 S 0 0.0 0:02.78 kjournald
1868 20 0 0 0 0 S 0 0.0 0:00.00 kjournald
1997 20 0 0 0 0 S 0 0.0 1:23.86 flush-8:0
2060 20 0 2008 620 572 S 0 0.1 0:10.14 syslogd
2063 20 0 1892 332 328 S 0 0.0 0:00.02 klogd
2072 20 0 1872 332 236 S 0 0.0 0:04.56 irqbalance
2080 rpc 20 0 2084 492 488 S 0 0.0 0:00.00 portmap
2098 20 0 2116 652 648 S 0 0.1 0:00.05 rpc.statd
2167 20 0 6868 584 500 S 0 0.1 0:02.87 sshd
2215 20 0 6804 388 384 S 0 0.0 0:00.00 sshd
2224 20 0 3280 620 616 S 0 0.1 0:00.03 xinetd
2243 20 0 0 0 0 S 0 0.0 0:00.00 lockd
2244 20 0 0 0 0 S 0 0.0 2:02.54 nfsd
2245 20 0 0 0 0 S 0 0.0 2:01.79 nfsd
2246 20 0 0 0 0 S 0 0.0 2:10.59 nfsd
2247 20 0 0 0 0 S 0 0.0 2:05.77 nfsd
2248 20 0 0 0 0 S 0 0.0 2:09.80 nfsd
2249 20 0 0 0 0 S 0 0.0 2:03.58 nfsd
2250 20 0 0 0 0 S 0 0.0 2:01.06 nfsd
2251 20 0 0 0 0 S 0 0.0 2:07.07 nfsd
2254 20 0 2488 672 580 S 0 0.1 0:01.55 rpc.mountd
2312 0 -20 65136 4624 1888 S 0 0.5 42:19.94 masterd_core
2315 20 0 1888 456 452 S 0 0.0 0:00.01 agetty
2322 0 -20 27864 1412 1036 S 0 0.1 7:29.42 masterd_manager
2329 15 -5 36656 2008 1216 S 0 0.2 254:45.16 sysd
2331 0 -20 32224 5084 1068 S 0 0.5 69:50.52 masterd_manager
2337 20 0 91984 2988 1676 S 0 0.3 1:53.24 dagger
2338 30 10 40568 3624 1656 S 0 0.4 59:15.43 python
2339 20 0 84284 3664 1644 S 0 0.4 0:39.58 cryptod
2340 20 0 166m 1760 1196 S 0 0.2 2:15.46 sysdagent
2354 20 0 7212 612 608 S 0 0.1 0:00.07 tscat
2357 20 0 71580 1056 928 S 0 0.1 0:09.70 brdagent
2358 20 0 31912 1084 928 S 0 0.1 0:25.94 ehmon
2359 20 0 47496 1036 908 S 0 0.1 0:01.15 chasd
2451 20 0 0 0 0 S 0 0.0 0:11.75 kjournald
2492 20 0 2900 628 572 S 0 0.1 0:03.45 crond
2503 20 0 646m 64m 63m S 0 6.7 150:48.80 useridd
2525 20 0 223m 71m 8864 S 0 7.3 45:02.84 devsrvr
2534 20 0 90584 1980 1520 S 0 0.2 0:16.09 ikemgr
2535 20 0 267m 4532 1832 S 0 0.5 2:28.18 logrcvr
2536 20 0 99744 2272 1520 S 0 0.2 0:05.44 rasmgr
2537 20 0 97720 1144 968 S 0 0.1 0:00.84 keymgr
2538 20 0 247m 2172 1532 S 0 0.2 102:25.79 varrcvr
2539 17 -3 56464 1716 1300 S 0 0.2 0:24.44 ha_agent
2540 20 0 112m 7096 1524 S 0 0.7 0:34.04 satd
2541 20 0 102m 1972 1300 S 0 0.2 0:04.48 sslmgr
2542 20 0 57136 1820 1392 S 0 0.2 0:02.48 dhcpd
2543 20 0 74708 2404 1440 S 0 0.2 0:03.97 dnsproxyd
2544 20 0 74392 1736 1356 S 0 0.2 0:04.19 pppoed
2546 20 0 141m 2708 1832 S 0 0.3 0:14.46 routed
2547 20 0 138m 4704 3540 S 0 0.5 2:16.44 authd
3796 20 0 27260 2100 1316 S 0 0.2 0:23.37 snmpd
5184 nobody 20 0 155m 6052 1552 S 0 0.6 1:49.99 appweb3
5190 nobody 20 0 122m 2208 1672 S 0 0.2 1:07.90 appweb3
16879 20 0 3744 3624 2756 S 0 0.4 0:00.02 ntpd
19653 20 0 21340 2448 2016 S 0 0.2 0:00.16 sshd
19664 admin 20 0 21476 1504 1044 S 0 0.2 0:00.03 sshd
19665 admin 20 0 97744 22m 10m S 0 2.3 0:02.97 cli
19695 20 0 2964 496 412 S 0 0.0 0:00.00 crond
19698 20 0 3720 1116 988 S 0 0.1 0:00.02 genindex_batch.
19702 20 0 33704 5176 3020 S 0 0.5 0:00.28 masterd_batch
20335 admin 20 0 2976 668 564 S 0 0.1 0:00.03 less
20337 20 0 3832 1192 1056 S 0 0.1 0:00.08 sh
20339 20 0 1940 536 464 S 0 0.1 0:00.00 sed
22550 20 0 616m 272m 3492 S 0 28.0 57:29.90 mgmtsrvr
26791 nobody 20 0 201m 24m 4296 S 0 2.5 6:35.70 appweb3
Check whether these processes are running high:
mgmtsrvr (if it is at 900m or above, that is not good)
devsrvr
If either of the above processes is high, you may have to restart it.
Also check whether %wa is high.
A high %wa indicates that you have too much logging, and you can try to reduce the logging.
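To illustrate the %wa check above, here is a minimal sketch that pulls the I/O-wait figure out of a top-style "Cpu(s):" line, assuming the exact field format shown in the sample output (the function name is my own, not a PAN tool):

```python
import re

def iowait_pct(cpu_line):
    """Extract the %wa (I/O wait) figure from a top-style Cpu(s) line,
    e.g. 'Cpu(s): 2.2%us, 1.4%sy, ..., 0.6%wa, ...'."""
    m = re.search(r'([\d.]+)%wa', cpu_line)
    if m is None:
        raise ValueError("no %wa field found")
    return float(m.group(1))

sample = "Cpu(s): 2.2%us, 1.4%sy, 1.5%ni, 94.1%id, 0.6%wa, 0.0%hi, 0.1%si, 0.0%st"
print(iowait_pct(sample))  # 0.6
```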
Hopefully this helps.
Thank you
02-20-2013 12:04 PM
Just to clarify the previous comment: it's the 5th column (although it will look like the 4th, since the USER field is blank for these processes) that you want to check for values over 900m. It's easier to read if you perform a "show system resources | match srvr".
I work by the rule of thumb that if either the management server or device server is over 850m, consider restarting it when you have a chance; if either is over 950m, restart it as soon as you have a maintenance window; and if either is over 1000m, restart it ASAP.
Be aware that a restart of the management server should not impact traffic throughput, although you will lose the ability to manage the PAN device while the management server restarts. A restart of the device server will not impact existing sessions; however, new sessions will not match any policy with users or groups in it, because there will be no user-to-IP cache or user-group cache until the device server has restarted. It's always a good idea to do this out of hours or in a maintenance window to reduce the impact of an unforeseen event.
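The rule-of-thumb thresholds above can be encoded in a small sketch. This assumes top's usual convention of printing VIRT as plain KiB numbers or with an 'm'/'g' suffix, as in the sample output; the function names are illustrative only:

```python
def virt_to_mb(field):
    """Convert a top VIRT field to megabytes.

    top prints plain numbers in KiB and suffixes larger values with
    'm' (MiB) or 'g' (GiB), e.g. '651m', '47116', '1.2g'."""
    field = field.lower()
    if field.endswith('g'):
        return float(field[:-1]) * 1024
    if field.endswith('m'):
        return float(field[:-1])
    return float(field) / 1024  # plain number: KiB

def restart_advice(virt_field):
    """Apply the rule-of-thumb thresholds from the post (850m/950m/1000m)."""
    mb = virt_to_mb(virt_field)
    if mb > 1000:
        return "restart ASAP"
    if mb > 950:
        return "restart in the next maintenance window"
    if mb > 850:
        return "restart when convenient"
    return "OK"

print(restart_advice("651m"))   # OK
print(restart_advice("980m"))   # restart in the next maintenance window
```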
02-20-2013 12:09 PM
Also, on some models, anything that relies on SSL termination will be affected during a restart of the management plane.
02-20-2013 12:46 PM
Thank you.
dogbert@PA-2050-Trailer-Rebuild(active)> show system resources | match srvr
2360 20 0 651m 106m 3192 S 0 10.9 535:48.96 mgmtsrvr
2381 20 0 422m 106m 9912 S 0 11.0 610:47.25 devsrvr
It sounds like this means the PA-2050 is underpowered for my needs. Since I have an HA pair, perhaps the better approach would be to completely reboot the active unit so that the passive one takes over. That way I wouldn't lose the SSL terminations or user-to-IP mappings. My shop is heavy on inbound and outbound SSL termination, and just about every outbound allow rule is based on Active Directory user mapping.
Thank you.
02-20-2013 01:01 PM
You should only really need to restart the management and device servers if their respective memory values are high or you've been directed to by support. Those values look OK, but keep an eye on them to see whether they increase over time. Another thing you can check for is the presence of backtraces and core files by performing a "show system files"; if anything has a timestamp close to a time you had issues, you may want to have PAN support investigate. It could be an as-yet-undiscovered bug.
If you feel that your 2050s aren't cutting it, you may wish to approach your account manager's SE about upgrading to 3020s; while they are more expensive, I believe they were doing some deals for people wanting to upgrade.
02-20-2013 01:57 PM
+1 on the PA-3000 series if you are doing a lot of QoS and SSL Decryption.
If memory serves, the PA-2000 series does QoS and SSL decryption in software, whereas the PA-3000 and PA-5000 handle these in hardware.
02-20-2013 02:11 PM
The PA-2000 series implements SSL decryption in hardware, while QoS and decompression are implemented in software. If you watch the video below you will see the layout of components for both the 3020 and 3050 about 3 minutes in. The PA-3020 does not have a network processor, so all of the routing, QoS and NAT is done in software. The PA-3050 does have the network processor and implements it all in hardware.
I would reiterate that I would monitor the PA-2000 to see whether there is a problem with its memory usage, and raise a support case if it continues to increase, as this could potentially be a bug. I would only suggest discussing a replacement PA-3000 with your SE if you feel the PA-2000 is not meeting your requirements, after establishing whether this is normal behaviour for it.