Opened 4 years ago
Last modified 11 days ago
#1222 new enhancement
Smartd should ignore non-error entries from NVMe Error Information log
Reported by: | Christian Franke | Owned by: | |
---|---|---|---|
Priority: | minor | Milestone: | unscheduled |
Component: | smartd | Version: | 7.0 |
Keywords: | nvme | Cc: | Gerald Turner, Adam Piggott |
Description
Some drives frequently add entries to the NVMe Error Information log which do not reflect an actual error.
Smartd issues a LOG_CRIT message if the Number of Error Information Log Entries from the SMART/Health Information log has increased since the last check. This is misleading for such drives.
Smartd should check whether the number of actual errors has increased since the last check.
For original report and sample outputs see Debian Bug 900244.
Change History (8)
comment:1 Changed 3 years ago by
Cc: | Gerald Turner added |
---|
comment:2 Changed 3 years ago by
comment:3 Changed 2 years ago by
Cc: | Adam Piggott added |
---|
comment:4 Changed 7 months ago by
As more and more computers are equipped with NVMe SSDs, and as we would like people to run quality operating systems with proper disk surveillance, these spurious errors (mostly upon resuming from standby) are a growing UX defect. Here's a sample specimen:
    ................. Entry[42] .................
    error_count     : 0
    sqid            : 0
    cmdid           : 0
    status_field    : 0(Successful Completion: The command completed without error)
    phase_tag       : 0
    parm_err_loc    : 0
    lba             : 0
    nsid            : 0
    vs              : 0
    trtype          : The transport type is not indicated or the error is not transport related.
    cs              : 0
    trtype_spec_info: 0
    .................
Thus, I'd propose this simple solution: do not raise a critical alert for NVMe storage devices if both error_count == 0 and status_field == 0. Does anyone see a potential downside of this?
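For illustration only, the proposed condition could be sketched like this. It is not smartd code: the `ErrorLogEntry` class and `is_spurious` function are hypothetical names, and the second sample entry is made up to show the non-matching case.

```python
from dataclasses import dataclass


@dataclass
class ErrorLogEntry:
    # Hypothetical container; field names follow the nvme-cli output above.
    error_count: int
    status_field: int


def is_spurious(entry: ErrorLogEntry) -> bool:
    """Return True if the entry matches the proposed 'no real error' condition."""
    return entry.error_count == 0 and entry.status_field == 0


entries = [
    ErrorLogEntry(error_count=0, status_field=0),       # like Entry[42] above
    ErrorLogEntry(error_count=7, status_field=0x2002),  # made-up non-zero entry
]

# Only entries that fail the condition would count towards an alert.
real_errors = [e for e in entries if not is_spurious(e)]
```

With this rule, an alert would only be considered when `real_errors` is non-empty.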
comment:5 Changed 3 weeks ago by
Results from various sources suggest that smartd should ignore error information log entries with ((status_field >> 1) & 0xfff) <= 0x002:
SCT/SC 0x0/0x00: Generic Command Status / Successful Completion,
SCT/SC 0x0/0x01: Generic Command Status / Invalid Command Opcode,
SCT/SC 0x0/0x02: Generic Command Status / Invalid Field in Command.
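A rough sketch of that filter, to make the bit arithmetic concrete. The function names and the sample values are assumptions for illustration, not taken from smartd; the samples are built as `(SCT << 8 | SC) << 1` so that the expression above recovers the SCT/SC pair.

```python
IGNORED_MAX = 0x002  # highest SCT/SC value treated as "not an actual error"


def sct_sc(status_field: int) -> int:
    """Shift out bit 0 and keep the next 12 bits, as in the expression above."""
    return (status_field >> 1) & 0xfff


def is_actual_error(status_field: int) -> bool:
    """True if the entry's SCT/SC value is above the ignored range."""
    return sct_sc(status_field) > IGNORED_MAX


def count_actual_errors(status_fields) -> int:
    """Count the entries that would still be reported."""
    return sum(1 for s in status_fields if is_actual_error(s))


# Illustrative raw values, built as (SCT << 8 | SC) << 1:
samples = [
    (0x0 << 8 | 0x00) << 1,  # Successful Completion     -> ignored
    (0x0 << 8 | 0x01) << 1,  # Invalid Command Opcode    -> ignored
    (0x0 << 8 | 0x02) << 1,  # Invalid Field in Command  -> ignored
    (0x0 << 8 | 0x03) << 1,  # next generic status code  -> counted
    (0x2 << 8 | 0x81) << 1,  # a non-generic status type -> counted
]

actual = count_actual_errors(samples)
```

Comparing the masked value with `<=` only works here because all three ignored codes sit at the very bottom of the Generic Command Status range.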
comment:8 Changed 11 days ago by
@Christian Franke:
I checked my nvme error-log output.
Excerpt:
status_field : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
And error_count is increasing with each error log entry.
See here for full details: ticket:1722#comment:5
I also created a Linux kernel ticket: https://bugzilla.kernel.org/show_bug.cgi?id=217445
And I guess these can be related too:
- https://bugzilla.kernel.org/show_bug.cgi?id=211573
- https://unix.stackexchange.com/questions/721157/what-cmdid-0x10-and-status-field-0x2002-mean-from-nvme-cli
- https://groups.google.com/g/uk.comp.homebuilt/c/sQoIFy-JYl0/m/Z2EqBinCAAAJ
- https://www.truenas.com/community/threads/970-evo-plus-error-logs-need-help-reading-these.108150/
NVMe 1.4a section 5.14.1.1 - Error Information: The controller should clear this log page by removing all entries on power cycle and Controller Level Reset.
Therefore the Error Information log may be empty after an increase of Number of Error Information Log Entries has been detected. Smartd is unable to check for new actual errors then.
See Ubuntu Bug 1878264 for an example.
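One way to keep that case distinct is sketched below. This is only an assumption about how a check could be structured: `classify_increase` and its return values are hypothetical, and how smartd should react to the "unknown" result is not settled in this ticket. The status test reuses the expression from comment:5.

```python
def classify_increase(prev_count: int, cur_count: int, new_status_fields) -> str:
    """Classify a change in Number of Error Information Log Entries.

    new_status_fields holds the status_field values of the log entries
    that could be read back; it may be empty if the log was cleared.
    Returns 'none', 'unknown', 'error' or 'ignored'.
    """
    if cur_count <= prev_count:
        return "none"      # counter did not increase
    if not new_status_fields:
        return "unknown"   # counter increased, but no entries left to inspect
    if any(((s >> 1) & 0xfff) > 0x002 for s in new_status_fields):
        return "error"     # at least one entry outside the ignored range
    return "ignored"       # only entries from the ignored range


# Counter went from 10 to 12, but the log is empty (e.g. after a reset):
result = classify_increase(10, 12, [])
```

Here `result` is "unknown", which shows that the filter alone cannot tell spurious entries from real ones once the log has been cleared.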