Opened 18 months ago
Last modified 3 months ago
#1715 new enhancement
Allow to ignore certain bits of NVMe Critical Warning byte
Reported by: | BradGeeeeeeeeeeeeeeeeeee | Owned by: | |
---|---|---|---|
Priority: | minor | Milestone: | unscheduled |
Component: | smartd | Version: | 7.2 |
Keywords: | nvme | Cc: | Felix E. |
Description
I have a Samsung SSD 960 EVO 250GB with 424TB written. The drive stores large amounts of RRD files for an SNMP-type monitoring system and gets re-written constantly.
The manufacturer warranty for this drive covers 100TB written, so we are at 424% of the warranted endurance, which is what drives the "Percentage Used" value. However, the drive works fine and shows no other signs of wearing out.
I recently did an OS update on this Debian host, from Debian 10 to 11, and along with it came a new version of smartmontools. Unfortunately it now complains every 24 hours with the following error:
The following warning/error was logged by the smartd daemon: Device: /dev/nvme0, Critical Warning (0x04): Reliability
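For readers unfamiliar with the byte in that message, here is a minimal sketch of how the Critical Warning bits are laid out in the NVMe SMART/Health log. The labels are short paraphrases for illustration, not smartd's exact message strings, and `decode_critical_warning` is a hypothetical helper name.

```python
# Bit layout of the Critical Warning byte in the NVMe SMART/Health log (Log 0x02).
# The labels are illustrative paraphrases, not smartd's exact message strings.
CRITICAL_WARNING_BITS = {
    0x01: "available spare below threshold",
    0x02: "temperature outside threshold",
    0x04: "NVM subsystem reliability degraded",
    0x08: "media placed in read-only mode",
    0x10: "volatile memory backup device failed",
    0x20: "persistent memory region read-only or unreliable",
}


def decode_critical_warning(value):
    """Return a label for every bit that is set in the Critical Warning byte."""
    return [label for bit, label in CRITICAL_WARNING_BITS.items() if value & bit]


if __name__ == "__main__":
    # 0x04 is the value reported for the drive in this ticket.
    print(decode_critical_warning(0x04))
```

Only bit 0x04 is set on this drive, so the spare and temperature bits are still clear.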
I have been forced to add a "/dev/nvme0 -d ignore" to my /etc/smartd.conf file, but this prevents me from being alerted to any other possible problems, including any reduction of the "Available Spare" value, or thermal warnings.
With ATA drives, it's possible to ignore certain specific attributes with the -i or -I arguments, but I'm not aware of any similar feature which might be helpful here.
Would it be possible to ignore such warnings, while still monitoring the device for other problems? What do you advise?
I suspect this problem will occur more regularly as time goes on and more too-reliable-for-their-own-damned-good drives will begin to annoy their administrators.
The only crime this drive has committed is its failure to fail! Please end this unfair persecution of my poor, abused but reliable, NVMe drive!
This request on Serverfault is similar to mine and might be worth a read (I am not the author):
https://serverfault.com/questions/1118718/how-to-add-to-excludes-alerts-on-smartmontool
My searching found that this request is somewhat similar to bug 1434:
Below is full output from a smartctl -a.
> sudo smartctl -a /dev/disk/by-id/nvme-Samsung_SSD_960_EVO_250GB_xxxxxxxxxxxxxxx
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-21-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 960 EVO 250GB
Serial Number:                      xxxxxxxxxxxxxxx
Firmware Version:                   2B7QCXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 250,059,350,016 [250 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          250,059,350,016 [250 GB]
Namespace 1 Utilization:            191,818,444,800 [191 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            xxxxxxxxxxxxxxxxx
Local Time is:                      Thu Apr 6 20:23:49 2023 MST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning Comp. Temp. Threshold:      77 Celsius
Critical Comp. Temp. Threshold:     79 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.04W       -        -    0  0  0  0        0       0
 1 +     5.09W       -        -    1  1  1  1        0       0
 2 +     4.08W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04
Temperature:                        28 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    255%
Data Units Read:                    151,762,677 [77.7 TB]
Data Units Written:                 830,060,020 [424 TB]
Host Read Commands:                 6,877,731,354
Host Write Commands:                51,000,719,462
Controller Busy Time:               79,390
Power Cycles:                       35
Power On Hours:                     31,418
Unsafe Shutdowns:                   21
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning Comp. Temperature Time:     0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               28 Celsius
Temperature Sensor 2:               35 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
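As a rough cross-check of the figures in this output, the sketch below converts Data Units Written to bytes and shows why Percentage Used reads 255%. It relies on two facts from the NVMe SMART/Health log definition: one data unit is 1000 blocks of 512 bytes, and Percentage Used is a single byte that saturates at 255. Treating 424% as the drive's true wear figure is an assumption taken from the 100TB warranty number; the drive's own endurance estimate is vendor-specific. The helper names are hypothetical.

```python
# One NVMe "data unit" is 1000 blocks of 512 bytes.
DATA_UNIT_BYTES = 512 * 1000


def data_units_to_bytes(units):
    """Convert the Data Units Written/Read counter to bytes."""
    return units * DATA_UNIT_BYTES


def percentage_used_field(estimated_percent):
    """Percentage Used saturates at 255; larger estimates are reported as 255."""
    return min(estimated_percent, 255)


if __name__ == "__main__":
    written = data_units_to_bytes(830_060_020)
    print(written // 10**12)           # whole terabytes written: 424
    # Assumption: wear is estimated against the 100TB warranty figure.
    print(percentage_used_field(424))  # 255
```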
Change History (5)
comment:1 by , 18 months ago
Keywords: | nvme added; ignore Critical Warning removed |
---|---|
Milestone: | → unscheduled |
Summary: | NVME Critical Warning: 0x04 due to excessive Data Units Written → Allow to ignore certain bits of NVMe Critical Warning byte |
follow-up: 3 comment:2 by , 12 months ago
We have the same issue now on most of our production servers. Posting comment here to subscribe for updates
comment:3 by , 12 months ago
Replying to mhaamann:
We have the same issue now on most of our production servers. Posting comment here to subscribe for updates
follow-up: 5 comment:4 by , 3 months ago
Cc: | added |
---|
Hi,
I was trying my hand at this and came to more "architectural" questions:
- Do we want to ignore the selected critical_warning bit(s) just for the result of the pass/fail check (i.e. exit code, "self-assessment test result" & json smart_status.passed), or do we want the bit to seemingly vanish from all related values (long-form output of set bits, related json smart_status.nvme subkey(s) and smart_status.value)?
I think this heavily depends on what the use case is:
- If it's to work around a device bug, we probably do not want dependent applications to know that the bit was set in the first place. In that case, the second group of values named above would need to be redacted.
- If it's just a case of "I don't want a full smartd alarm if the TBW value is over 100%" (critical_warning bit 0x04), it might still be relevant for monitoring purposes, so those downstream tools should still be able to get the actual value (but not the FAILED state).
- Are there other fields/values except for critical_warning with NVMes that we might want to ignore in the future (in --health)? If so, maybe a bitmask directly as -H param isn't the best idea?
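The two behaviours discussed above could be sketched roughly as follows. This is only an illustration of the design question, not smartmontools code: `apply_ignore_mask`, its `redact` flag and the returned dictionary keys are hypothetical names that loosely mirror the `passed` and `value` fields mentioned above.

```python
def apply_ignore_mask(critical_warning, ignore_mask, redact=False):
    """Illustrate the two options for ignored Critical Warning bits.

    redact=False: only the pass/fail verdict ignores the masked bits;
                  the raw value is still reported to downstream tools.
    redact=True:  the masked bits also vanish from the reported value.
    """
    # Bits that remain relevant after dropping the ignored ones.
    effective = critical_warning & ~ignore_mask & 0xFF
    reported = effective if redact else critical_warning
    return {"passed": effective == 0, "value": reported}


if __name__ == "__main__":
    # Verdict-only: PASSED, but the 0x04 bit stays visible.
    print(apply_ignore_mask(0x04, ignore_mask=0x04))
    # Redacted: PASSED, and the bit is hidden as well.
    print(apply_ignore_mask(0x04, ignore_mask=0x04, redact=True))
```

In both modes a bit that is not covered by the mask still produces a failed verdict.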
Also partially related: is the output of smartctl supposed to be machine-readable?
I was thinking of explicitly mentioning set critical_warning bits when they are ignored, but with a special message behind them, maybe like this:
smartctl pre-7.5 2024-05-08 r5613 [x86_64-linux-6.8.8-2-pve] (local build)
Copyright (C) 2002-24, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
- NVM subsystem reliability has been degraded (ignored by -H bitmask)
And one last (meta-) question: What exactly is the preferred way for contribution here? A patch/diff and a maintainer applies that? Mailinglist? I really am not familiar with SVN (sorry) :)
Really sorry for all the questions. I am a beginner with C/C++ and this project.
comment:5 by , 3 months ago
Replying to Felix E.:
...
- Are there other fields/values except for critical_warning with NVMes that we might want to ignore in the future (in --health)? If so, maybe a bitmask directly as -H param isn't the best idea?
This ticket is for smartd and its evaluation of Critical Warning byte only. If you have any suggestion for more complex functionality, please create a new ticket and provide detailed suggestions there.
Also partially related: is the output of smartctl supposed to be machine-readable?
Yes, with --json[=cgiosuvy] option, available since smartmontools 7.0 (Dec 2018).
I was thinking of explicitly mentioning set critical_warning bits when they are ignored, but with a special message behind them, maybe like this:
If you want this functionality also for smartctl, please create a new ticket.
And one last (meta-) question: What exactly is the preferred way for contribution here? A patch/diff and a maintainer applies that? Mailinglist? I really am not familiar with SVN (sorry) :)
Attach a patch file to a new ticket or create a PR in our GitHub R/O mirror. Or wait until we have moved the project from SF to GitHub (which will possibly happen by the end of this year) and create a PR then.
For NVMe devices, smartd only sends warnings if Critical Warning is non zero (-H directive), the Error Information Log Entries count has changed (-l error directive) or a Temperature [Sensor N] reaches the critical limit (-W ... directive).
If no directive or -a is specified, the default for NVMe is -H -l error.
I agree that it would be useful to optionally ignore certain Critical Warning bits. For example by adding an optional argument to the -H directive: -H 0xfb should ignore 0x04. Changing ticket summary accordingly.
Thermal warnings require -W ... directive, see man smartd.conf.
Available Spare is not monitored, but should be, because Available Spare Threshold is provided. Please create a separate ticket for this feature.
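To make the proposed semantics concrete: with an argument such as 0xfb, the value would act as a mask of the bits that are still evaluated, so 0x04 is dropped while every other bit keeps triggering a warning. The sketch below only illustrates that arithmetic. The optional -H argument is the proposal from this ticket, not an existing directive form, and `should_warn` is a hypothetical helper.

```python
def should_warn(critical_warning, keep_mask=0xFF):
    """Warn only if a bit that survives the mask is set.

    keep_mask=0xFF corresponds to today's plain -H behaviour (any bit warns).
    keep_mask=0xFB corresponds to the proposed "-H 0xfb" (bit 0x04 ignored).
    """
    return (critical_warning & keep_mask) != 0


if __name__ == "__main__":
    print(should_warn(0x04))        # True: current behaviour, reliability bit warns
    print(should_warn(0x04, 0xFB))  # False: reliability bit masked out
    print(should_warn(0x05, 0xFB))  # True: spare bit 0x01 still warns
```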