Opened 6 months ago
Last modified 6 months ago
#1715 new enhancement
Allow to ignore certain bits of NVMe Critical Warning byte
Reported by: | BradGeeeeeeeeeeeeeeeeeee | Owned by: | |
---|---|---|---|
Priority: | minor | Milestone: | unscheduled |
Component: | smartd | Version: | 7.2 |
Keywords: | nvme | Cc: |
Description
I have a Samsung SSD 960 EVO 250GB with 424TB written. The drive stores large amounts of RRD files for an SNMP-type monitoring system and gets re-written constantly.
The manufacturer warranty for this drive covers 100TB written, so we are at 424% of the warranted endurance, which explains the "Percentage Used" value. However, the drive works fine and shows no other signs of wearing out.
I recently did an OS update on this Debian host, from Debian 10 to 11, and along with it came a new version of smartmontools. Unfortunately it now complains every 24 hours with the following error:
The following warning/error was logged by the smartd daemon: Device: /dev/nvme0, Critical Warning (0x04): Reliability
I have been forced to add a "/dev/nvme0 -d ignore" to my /etc/smartd.conf file, but this prevents me from being alerted to any other possible problems, including any reduction of the "Available Spare" value, or thermal warnings.
With ATA drives, it's possible to ignore certain specific attributes with the -i or -I arguments, but I'm not aware of any similar feature which might be helpful here.
Would it be possible to ignore such warnings, while still monitoring the device for other problems? What do you advise?
I suspect this problem will occur more regularly as time goes on and more too-reliable-for-their-own-damned-good drives will begin to annoy their administrators.
The only crime this drive has committed is its failure to fail! Please end this unfair persecution of my poor, abused but reliable, NVMe drive!
This request on Serverfault is similar to mine and might be worth a read (I am not the author):
https://serverfault.com/questions/1118718/how-to-add-to-excludes-alerts-on-smartmontool
My searching found that this request is somewhat similar to bug 1434:
Below is full output from a smartctl -a.
> sudo smartctl -a /dev/disk/by-id/nvme-Samsung_SSD_960_EVO_250GB_xxxxxxxxxxxxxxx
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-21-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 960 EVO 250GB
Serial Number:                      xxxxxxxxxxxxxxx
Firmware Version:                   2B7QCXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 250,059,350,016 [250 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          250,059,350,016 [250 GB]
Namespace 1 Utilization:            191,818,444,800 [191 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            xxxxxxxxxxxxxxxxx
Local Time is:                      Thu Apr 6 20:23:49 2023 MST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning Comp. Temp. Threshold:      77 Celsius
Critical Comp. Temp. Threshold:     79 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.04W       -        -    0  0  0  0        0       0
 1 +     5.09W       -        -    1  1  1  1        0       0
 2 +     4.08W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04
Temperature:                        28 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    255%
Data Units Read:                    151,762,677 [77.7 TB]
Data Units Written:                 830,060,020 [424 TB]
Host Read Commands:                 6,877,731,354
Host Write Commands:                51,000,719,462
Controller Busy Time:               79,390
Power Cycles:                       35
Power On Hours:                     31,418
Unsafe Shutdowns:                   21
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning Comp. Temperature Time:     0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               28 Celsius
Temperature Sensor 2:               35 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
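For readers unfamiliar with the Critical Warning byte, here is a minimal Python sketch of how the 0x04 value in the output above breaks down into individual bits. The bit meanings follow the NVMe SMART / Health Information log page as I understand it; the names `CRITICAL_WARNING_BITS` and `decode_critical_warning` are hypothetical helpers for illustration and are not part of smartmontools.

```python
# Illustrative only: bit names paraphrase the NVMe SMART / Health
# Information log page (Log 0x02, byte 0). Not smartmontools code.
CRITICAL_WARNING_BITS = {
    0x01: "available spare below threshold",
    0x02: "temperature above or below threshold",
    0x04: "NVM subsystem reliability degraded",
    0x08: "media placed in read-only mode",
    0x10: "volatile memory backup device failed",
}


def decode_critical_warning(value: int) -> list[str]:
    """Return the descriptions of all bits set in the Critical Warning byte."""
    return [name for bit, name in CRITICAL_WARNING_BITS.items() if value & bit]


# The drive in this ticket reports 0x04, i.e. only the reliability bit.
print(decode_critical_warning(0x04))
```

Only bit 2 is set here, which matches the "NVM subsystem reliability has been degraded" line; the spare and temperature bits are clear.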
Change History (1)
comment:1 Changed 6 months ago by
Keywords: | nvme added; ignore Critical Warning removed |
---|---|
Milestone: | → unscheduled |
Summary: | NVME Critical Warning: 0x04 due to excessive Data Units Written → Allow to ignore certain bits of NVMe Critical Warning byte |
For NVMe devices, smartd only sends warnings if Critical Warning is non zero (-H directive), the Error Information Log Entries count has changed (-l error directive) or a Temperature [Sensor N] reaches the critical limit (-W ... directive).
If no directive or -a is specified, the default for NVMe is -H -l error.
I agree that it would be useful to optionally ignore certain Critical Warning bits. For example by adding an optional argument to the -H directive: -H 0xfb should ignore 0x04. Changing ticket summary accordingly.
Thermal warnings require -W ... directive, see man smartd.conf.
Available Spare is not monitored, but should be, because Available Spare Threshold is provided. Please create a separate ticket for this feature.
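To make the mask proposed in this comment concrete, here is a small Python sketch of the intended semantics: the optional -H argument is treated as a bitmask that is ANDed with the Critical Warning byte before the non-zero check. This only illustrates the proposal; it is not how smartd 7.2 behaves, and the function names `masked_warning` and `should_warn` are hypothetical.

```python
def masked_warning(critical_warning: int, mask: int = 0xFF) -> int:
    """Keep only the Critical Warning bits that the mask leaves enabled."""
    return critical_warning & mask


def should_warn(critical_warning: int, mask: int = 0xFF) -> bool:
    """Assumed rule: warn if any unmasked Critical Warning bit is set."""
    return masked_warning(critical_warning, mask) != 0


# With no mask (0xFF), the reliability bit 0x04 triggers a warning.
print(should_warn(0x04))
# With the proposed mask 0xfb, bit 0x04 is cleared and ignored.
print(should_warn(0x04, 0xFB))
# Other bits, e.g. available spare (0x01), still get through the mask.
print(should_warn(0x05, 0xFB))
```

Because 0xfb is 0xff with only bit 2 cleared, every other Critical Warning bit would continue to raise an alert under this scheme.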