Opened 13 months ago

Last modified 6 months ago

#1715 new enhancement

Allow to ignore certain bits of NVMe Critical Warning byte

Reported by: BradGeeeeeeeeeeeeeeeeeee
Owned by:
Priority: minor
Milestone: unscheduled
Component: smartd
Version: 7.2
Keywords: nvme
Cc:

Description

I have a Samsung SSD 960 EVO 250GB with 424TB written. The drive stores large amounts of RRD files for an SNMP-type monitoring system and gets re-written constantly.

The manufacturer warranty for this drive covers 100TB written, so we are at roughly 424% of the warranted endurance, which explains the "Percentage Used" value. However, the drive works fine and shows no other signs of wearing out.
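The arithmetic behind that figure can be reproduced from the smartctl output further down. This is only a sketch: the constant and function names are made up for illustration, and it assumes the usual NVMe convention that one "data unit" is 1000 blocks of 512 bytes. The 100 TB rating is the warranty figure quoted above. Note that the "Percentage Used" field itself cannot show this ratio, since the NVMe log reports anything above 254 as 255.

```python
# Figures copied from the smartctl output in this ticket.
DATA_UNITS_WRITTEN = 830_060_020
# Assumption: one NVMe data unit = 1000 blocks of 512 bytes.
BYTES_PER_DATA_UNIT = 1000 * 512
# Warranty endurance quoted by the reporter, in decimal terabytes.
RATED_TB_WRITTEN = 100


def terabytes_written(data_units: int) -> float:
    """Convert NVMe data units to decimal terabytes (1 TB = 10**12 bytes)."""
    return data_units * BYTES_PER_DATA_UNIT / 1e12


def percent_of_rating(data_units: int, rated_tb: float) -> float:
    """Return written volume as a percentage of the rated endurance."""
    return 100.0 * terabytes_written(data_units) / rated_tb


if __name__ == "__main__":
    tb = terabytes_written(DATA_UNITS_WRITTEN)
    pct = percent_of_rating(DATA_UNITS_WRITTEN, RATED_TB_WRITTEN)
    print(f"{tb:.2f} TB written, {pct:.0f}% of the rated endurance")
```

smartctl appears to truncate rather than round the terabyte figure, which would account for the "424 TB" it prints.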

I recently did an OS update on this Debian host, from Debian 10 to 11, and along with it came a new version of smartmontools. Unfortunately it now complains every 24 hours with the following error:

The following warning/error was logged by the smartd daemon:

Device: /dev/nvme0, Critical Warning (0x04): Reliability

I have been forced to add a "/dev/nvme0 -d ignore" to my /etc/smartd.conf file, but this prevents me from being alerted to any other possible problems, including any reduction of the "Available Spare" value, or thermal warnings.

With ATA drives, it's possible to ignore certain specific attributes with the -i or -I arguments, but I'm not aware of any similar feature which might be helpful here.

Would it be possible to ignore such warnings, while still monitoring the device for other problems? What do you advise?

I suspect this problem will occur more regularly as time goes on and more too-reliable-for-their-own-damned-good drives will begin to annoy their administrators.

The only crime this drive has committed is its failure to fail! Please end this unfair persecution of my poor, abused but reliable, NVMe drive!

This request on Serverfault is similar to mine and might be worth a read (I am not the author):

https://serverfault.com/questions/1118718/how-to-add-to-excludes-alerts-on-smartmontool

My searching found that this request is somewhat similar to bug 1434:

https://www.smartmontools.org/ticket/1434

Below is full output from a smartctl -a.

> sudo smartctl -a /dev/disk/by-id/nvme-Samsung_SSD_960_EVO_250GB_xxxxxxxxxxxxxxx

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-21-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 960 EVO 250GB
Serial Number:                      xxxxxxxxxxxxxxx
Firmware Version:                   2B7QCXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 250,059,350,016 [250 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          250,059,350,016 [250 GB]
Namespace 1 Utilization:            191,818,444,800 [191 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            xxxxxxxxxxxxxxxxx
Local Time is:                      Thu Apr  6 20:23:49 2023 MST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     79 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.04W       -        -    0  0  0  0        0       0
 1 +     5.09W       -        -    1  1  1  1        0       0
 2 +     4.08W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04
Temperature:                        28 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    255%
Data Units Read:                    151,762,677 [77.7 TB]
Data Units Written:                 830,060,020 [424 TB]
Host Read Commands:                 6,877,731,354
Host Write Commands:                51,000,719,462
Controller Busy Time:               79,390
Power Cycles:                       35
Power On Hours:                     31,418
Unsafe Shutdowns:                   21
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               28 Celsius
Temperature Sensor 2:               35 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Change History (3)

comment:1 by Christian Franke, 13 months ago

Keywords: nvme added; ignore Critical Warning removed
Milestone: unscheduled
Summary: changed from "NVME Critical Warning: 0x04 due to excessive Data Units Written" to "Allow to ignore certain bits of NVMe Critical Warning byte"

For NVMe devices, smartd only sends warnings if Critical Warning is nonzero (-H directive), if the Error Information Log Entries count has changed (-l error directive), or if a Temperature [Sensor N] value reaches the critical limit (-W ... directive).

If no directive or -a is specified, the default for NVMe is -H -l error.

I agree that it would be useful to optionally ignore certain Critical Warning bits. For example by adding an optional argument to the -H directive: -H 0xfb should ignore 0x04. Changing ticket summary accordingly.
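To illustrate the proposed semantics, here is a minimal sketch of the masking step. It is not smartd code (smartd is written in C++), and the `-H 0xfb` argument is only the proposal made in this comment, not an option that exists in version 7.2. All function names are hypothetical. The bit names follow the NVMe SMART/Health log definition of the Critical Warning byte.

```python
# Critical Warning bits as defined for the NVMe SMART/Health log (bits 0-4).
CRITICAL_WARNING_BITS = {
    0x01: "Available Spare below threshold",
    0x02: "Temperature above or below threshold",
    0x04: "NVM subsystem reliability degraded",
    0x08: "Media placed in read-only mode",
    0x10: "Volatile memory backup device failed",
}


def apply_mask(critical_warning: int, mask: int = 0xFF) -> int:
    """Keep only the bits that the (hypothetical) -H mask leaves enabled."""
    return critical_warning & mask & 0xFF


def describe(value: int) -> list[str]:
    """List the names of all known bits set in value."""
    return [name for bit, name in CRITICAL_WARNING_BITS.items() if value & bit]


def should_warn(critical_warning: int, mask: int = 0xFF) -> bool:
    """Warn only if a non-ignored bit is set."""
    return apply_mask(critical_warning, mask) != 0


if __name__ == "__main__":
    # The drive in this ticket reports 0x04; a mask of 0xfb clears that bit.
    print(should_warn(0x04))          # default mask: every bit counts
    print(should_warn(0x04, 0xFB))    # reliability bit ignored
    print(describe(apply_mask(0x05, 0xFB)))
```

With a mask of 0xfb the reliability bit is dropped, while a later Available Spare warning (0x01) on the same drive would still get through.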

> ... including any reduction of the "Available Spare" value, or thermal warnings.

Thermal warnings require -W ... directive, see man smartd.conf.

Available Spare is not monitored, but should be, because Available Spare Threshold is provided. Please create a separate ticket for this feature.
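Until such a feature exists, the comparison could be done outside smartd. The sketch below is an assumption-laden illustration, not part of smartmontools: it parses the text format of `smartctl -a` as shown in this ticket, and the function name is hypothetical. It treats a spare value below the threshold as the alarm condition, matching the meaning of Critical Warning bit 0.

```python
import re
from typing import Optional, Tuple


def available_spare_status(smartctl_text: str) -> Optional[Tuple[int, int, bool]]:
    """Return (spare %, threshold %, below_threshold), or None if not found.

    Expects lines in the format printed by `smartctl -a` for NVMe devices,
    e.g. "Available Spare:                    100%".
    """
    spare = re.search(r"^Available Spare:\s+(\d+)%", smartctl_text, re.MULTILINE)
    threshold = re.search(
        r"^Available Spare Threshold:\s+(\d+)%", smartctl_text, re.MULTILINE
    )
    if spare is None or threshold is None:
        return None
    spare_pct = int(spare.group(1))
    threshold_pct = int(threshold.group(1))
    return spare_pct, threshold_pct, spare_pct < threshold_pct


if __name__ == "__main__":
    # Two lines copied from the smartctl output in this ticket.
    sample = (
        "Available Spare:                    100%\n"
        "Available Spare Threshold:          10%\n"
    )
    print(available_spare_status(sample))
```

Parsing the human-readable output is fragile; it is used here only because that is the format shown in the ticket.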

comment:2 by mhaamann, 6 months ago

We now have the same issue on most of our production servers. Posting a comment here to subscribe for updates.

comment:3 by BradGeeeeeeeeeeeeeeeeeee, 6 months ago (in reply to: comment:2)

Replying to mhaamann:

> We have the same issue now on most of our production servers. Posting comment here to subscribe for updates

See https://www.smartmontools.org/ticket/1716
