SMC SSD Failure Detection : Press F1 to Resume


L1 Bithead

I recently ran into this error and found very little documentation on it, and what I did find, I found only after the fact. The SMC of the 7000 series firewalls contains an SSD, and this SSD will fail over time. I found out when rebooting after an upgrade in a remote DC, where it took hours to get someone on site with a console cable. The chassis was sitting at the following prompt:

 

Press F1 to Resume...

 

Pressing F1 immediately allowed the chassis to boot. The full console message reads as follows, but unless you are connected to the console at boot time, you will not see it.

 

S.M.A.R.T Status Bad, Backup and Replace.

Press F1 to Resume...

 

This indicates that the SSD on the SMC has failed or is about to fail.

Running the following command on the device shows more detail.

 

> debug system disk-smart-info disk-1

smartctl 6.4 2015-06-04 r4109 [x86_64-linux-3.10.88-8.1.6.0.44] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Intel 530 Series SSDs
Device Model: INTEL SSDMCEAW080A4
Serial Number: ################
LU WWN Device Id: 5 5cd2e4 000347e58
Firmware Version: DC03
User Capacity: 80,026,361,856 bytes [80.0 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Jul 27 15:32:09 2020 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status: (0x05) Offline data collection activity
was aborted by an interrupting command from host.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 33) The self-test routine was interrupted
by the host with a hard or soft reset.
Total time to complete Offline
data collection: ( 1953) seconds.
Offline data collection
capabilities: (0x7f) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Abort Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 48) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0025) SCT Status supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
9 Power_On_Hours_and_Msec 0x0032 100 100 000 Old_age Always - 11183h+44m+27.390s
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 10
170 Available_Reservd_Space 0x0033 001 001 010 Pre-fail Always FAILING_NOW 0
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 2
183 SATA_Downshift_Count 0x0032 100 100 000 Old_age Always - 11
184 End-to-End_Error 0x0033 100 100 090 Pre-fail Always - 0
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 038 061 000 Old_age Always - 38 (Min/Max -21/61)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 2
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1079041
226 Workld_Media_Wear_Indic 0x0032 100 100 000 Old_age Always - 65535
227 Workld_Host_Reads_Perc 0x0032 100 100 000 Old_age Always - 0
228 Workload_Minutes 0x0032 100 100 000 Old_age Always - 65535
232 Available_Reservd_Space 0x0033 001 001 010 Pre-fail Always FAILING_NOW 0
233 Media_Wearout_Indicator 0x0032 040 040 000 Old_age Always - 0
241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1079041
242 Host_Reads_32MiB 0x0032 100 100 000 Old_age Always - 4721
249 NAND_Writes_1GiB 0x0032 100 100 000 Old_age Always - 186817

SMART Error Log not supported

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Offline Interrupted (host reset) 10% 11171 -
# 2 Offline Interrupted (host reset) 10% 11171 -
# 3 Offline Interrupted (host reset) 10% 3036 -
# 4 Offline Interrupted (host reset) 10% 4 -
# 5 Offline Interrupted (host reset) 10% 4 -
# 6 Offline Interrupted (host reset) 10% 4 -
# 7 Offline Interrupted (host reset) 10% 1 -
# 8 Offline Interrupted (host reset) 10% 0 -
# 9 Offline Interrupted (host reset) 10% 0 -
#10 Offline Interrupted (host reset) 10% 0 -
#11 Offline Interrupted (host reset) 10% 0 -

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

According to TAC, failure is imminent and the SMC needs to be replaced ASAP.

Shorter commands that show the same information:

 

> debug system disk-smart-info disk-1 | match 232
232 Available_Reservd_Space 0x0033 001 001 010 Pre-fail Always FAILING_NOW 0


> debug system disk-smart-info disk-1 | match FAIL
SMART overall-health self-assessment test result: FAILED!
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
170 Available_Reservd_Space 0x0033 001 001 010 Pre-fail Always FAILING_NOW 0
232 Available_Reservd_Space 0x0033 001 001 010 Pre-fail Always FAILING_NOW 0
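
If you capture the command output as text, the attribute table can also be checked programmatically rather than by eye. The following is a minimal Python sketch, not anything PAN-OS provides: the names `SmartAttr`, `parse_attributes`, and `failing` are made up for illustration, and it assumes the rows keep the ten-column layout shown in the output above.

```python
import re
from typing import List, NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int        # normalized VALUE column
    worst: int
    thresh: int
    attr_type: str    # "Pre-fail" or "Old_age"
    when_failed: str  # "-" when the attribute has not failed
    raw: str


# Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(.*)$"
)


def parse_attributes(output: str) -> List[SmartAttr]:
    """Extract every attribute row; lines that do not match are skipped."""
    attrs = []
    for line in output.splitlines():
        m = ROW.match(line)
        if m:
            attrs.append(SmartAttr(
                int(m.group(1)), m.group(2), int(m.group(3)),
                int(m.group(4)), int(m.group(5)), m.group(6),
                m.group(8), m.group(9).strip(),
            ))
    return attrs


def failing(attrs: List[SmartAttr]) -> List[SmartAttr]:
    """Attributes whose WHEN_FAILED column is anything other than '-'."""
    return [a for a in attrs if a.when_failed != "-"]


# Rows copied from the output in this post.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
170 Available_Reservd_Space 0x0033 001 001 010 Pre-fail Always FAILING_NOW 0
232 Available_Reservd_Space 0x0033 001 001 010 Pre-fail Always FAILING_NOW 0
233 Media_Wearout_Indicator 0x0032 040 040 000 Old_age Always - 0
"""

for attr in failing(parse_attributes(SAMPLE)):
    print(attr.attr_id, attr.name, attr.value, attr.thresh, attr.when_failed)
```

On the rows from this post, the sketch reports attributes 170 and 232, the same two that `match FAIL` surfaces.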

 

I found the following in the Upgrade/Downgrade Considerations section of the PAN-OS New Features Guide, which conveys no sense of urgency about the SSD failure.

 

Upgrading a PA-3200, PA-5200, or PA-7000 Series Firewall

If the value for attribute ID #232, Available_Reservd_Space 0x0000, is greater than 20, then proceed with the downgrade.
If the value is less than 20, then contact support for assistance.
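
The check described in that excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions: I am assuming "the value" means the normalized VALUE column of attribute 232 (001 in my output), the guide does not say what to do at exactly 20, so the sketch conservatively treats 20 as "contact support", and the function names are hypothetical.

```python
import re
from typing import Optional

# Threshold taken from the guide excerpt above.
SUPPORT_THRESHOLD = 20

# Captures the normalized VALUE column of attribute 232.
ATTR_232 = re.compile(
    r"^\s*232\s+\S+\s+0x[0-9a-fA-F]{4}\s+(\d+)\s", re.MULTILINE
)


def reserved_space_value(output: str) -> Optional[int]:
    """Return the VALUE column for attribute 232, or None if it is absent."""
    m = ATTR_232.search(output)
    return int(m.group(1)) if m else None


def safe_to_proceed(output: str) -> bool:
    """True only when attribute 232 is present and strictly above 20.

    Assumption: a missing attribute or a value of exactly 20 is treated
    as 'contact support', since the guide covers neither case.
    """
    value = reserved_space_value(output)
    return value is not None and value > SUPPORT_THRESHOLD


# Row copied from the output in this post.
line = "232 Available_Reservd_Space 0x0033 001 001 010 Pre-fail Always FAILING_NOW 0"
print(reserved_space_value(line), safe_to_proceed(line))
```

For the row from my chassis the value is 1, so the check says not to proceed.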

 

Apparently this could be an issue on the PA-3200 and PA-5200 platforms as well.

From the examples above, we can see that TAC intervention is needed and the SMC should be RMA'd.


4 REPLIES 4

Community Team Member

Hi @Kocian ,

 

Thanks for this information.  Much appreciated !!

 

Cheers,

-Kiwi.

 

L1 Bithead

Thanks for the info, very useful. Thinking further: how do we proactively monitor the SSD or the SMC, so that we do not have to wait for a failure? Is there an OID for this? Any thoughts?

When I wrote this up, there was no proactive monitoring available via SNMP. My customer at the time wrote a couple of expect scripts to periodically log in and check using the commands above. There may be OIDs available on the SMC-B cards now, but I can't be certain.
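
I don't have the customer's expect scripts, but the checking half of that approach might look roughly like the Python sketch below. Everything here is an assumption for illustration: `fetch_output` stands in for whatever logs in and runs `debug system disk-smart-info disk-1` (SSH, expect, or similar, not shown), `alert` stands in for your notification method, and the string checks rely on the smartctl wording seen in the output above.

```python
import time
from typing import Callable, List


def check_disk(output: str) -> List[str]:
    """Return the lines of SMART output that indicate a problem."""
    problems = []
    for line in output.splitlines():
        # A healthy drive reports "... test result: PASSED".
        if "self-assessment test result" in line and "PASSED" not in line:
            problems.append(line.strip())
        # Any attribute currently below its threshold.
        elif "FAILING_NOW" in line:
            problems.append(line.strip())
    return problems


def poll(
    fetch_output: Callable[[], str],      # hypothetical: returns the CLI output
    alert: Callable[[List[str]], None],   # hypothetical: sends a notification
    interval_s: float,
    max_checks: int,
) -> None:
    """Run the check max_checks times, sleeping interval_s between runs."""
    for i in range(max_checks):
        problems = check_disk(fetch_output())
        if problems:
            alert(problems)
        if i < max_checks - 1:
            time.sleep(interval_s)
```

A usage example would pass something like `poll(my_ssh_fetch, send_email, 3600, 24)`, where `my_ssh_fetch` and `send_email` are functions you would have to write for your own environment.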

L1 Bithead

Thanks for the response. See the output below:

233 Media_Wearout_Indicator 0x0032 040 040 000 Old_age Always - 0

Can we automate monitoring of this value? It should be close to 100 for a healthy disk. And if the disk is from, say, Seagate or Intel, don't they publish the OIDs? This is a common issue, so there must be a way to monitor it, by hook or by crook.
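
Absent an OID, one way to watch this value is to pull the normalized VALUE column out of the captured command output and classify it. The sketch below is a rough illustration only: the function names and the warning and critical levels (50 and 20) are placeholders I picked, not vendor guidance, and it assumes the row layout shown above.

```python
import re
from typing import Optional

# Captures the normalized VALUE column of attribute 233.
WEAROUT_ROW = re.compile(
    r"^\s*233\s+Media_Wearout_Indicator\s+0x[0-9a-fA-F]{4}\s+(\d+)\s",
    re.MULTILINE,
)


def wearout_value(output: str) -> Optional[int]:
    """Return the VALUE column for attribute 233, or None if it is absent."""
    m = WEAROUT_ROW.search(output)
    return int(m.group(1)) if m else None


def wearout_status(
    output: str,
    warn_below: int = 50,       # placeholder level, not a vendor figure
    critical_below: int = 20,   # placeholder level, not a vendor figure
) -> str:
    """Classify the wearout value as 'ok', 'warning', 'critical', or 'unknown'."""
    value = wearout_value(output)
    if value is None:
        return "unknown"
    if value < critical_below:
        return "critical"
    if value < warn_below:
        return "warning"
    return "ok"


# Row copied from the output above.
row = "233 Media_Wearout_Indicator 0x0032 040 040 000 Old_age Always - 0"
print(wearout_value(row), wearout_status(row))
```

With the placeholder levels, the value of 40 in the row above would be classed as a warning.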
