Opened 4 years ago

Closed 6 months ago

Last modified 3 hours ago

#1222 closed enhancement (fixed)

Smartd should ignore non-error entries from NVMe Error Information log

Reported by: Christian Franke
Owned by: Christian Franke
Priority: minor
Milestone: Release 7.4
Component: smartd
Version: 7.0
Keywords: nvme
Cc: Gerald Turner, Adam Piggott, Patrick Decat, Peter Nowee

Description

Some drives frequently add entries to the NVMe Error Information log which do not reflect an actual error.

Smartd issues a LOG_CRIT message if the Number of Error Information Log Entries from the SMART/Health Information log has increased since the last check. This is misleading for such drives.

Smartd should check whether the number of actual errors has increased since the last check.

For original report and sample outputs see Debian Bug 900244.

Change History (16)

comment:1 Changed 4 years ago by Gerald Turner

Cc: Gerald Turner added

comment:2 Changed 4 years ago by Christian Franke

NVMe 1.4a section 5.14.1.1 - Error Information: The controller should clear this log page by removing all entries on power cycle and Controller Level Reset.

Therefore the Error Information log may be empty after an increase of Number of Error Information Log Entries has been detected. Smartd is unable to check for new actual errors then.
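To illustrate the limitation, here is a minimal sketch in Python (not smartd's actual code). It assumes each log entry carries an error_count that identifies the error, so entries numbered above the previously seen counter value are the new ones. The function name, the tuple layout and the three result strings are hypothetical.

```python
def classify_increase(old_count, new_count, entries):
    """Classify a change of 'Number of Error Information Log Entries'.

    entries is a list of (error_count, status_field) tuples read from the
    Error Information log (hypothetical layout for this sketch).
    """
    if new_count <= old_count:
        return "no_change"
    # Entries whose error_count lies in (old_count, new_count] are new.
    new_entries = [e for e in entries if old_count < e[0] <= new_count]
    if not new_entries:
        # The log was cleared (e.g. by a power cycle) or never held the entry,
        # so the increase cannot be attributed to an actual error or a non-error.
        return "unverifiable"
    return "new_entries_found"
```

With an empty log, classify_increase(348, 349, []) returns "unverifiable", which is the case described above.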

See Ubuntu Bug 1878264 for an example.

comment:3 Changed 3 years ago by Adam Piggott

Cc: Adam Piggott added

comment:4 Changed 13 months ago by Marcel Partap

As more and more computers are equipped with NVMe SSDs, and as we would like people to run quality operating systems with proper disk surveillance, these spurious errors (mostly upon resuming from standby) are a growing UX defect. Here's a sample specimen:

.................
 Entry[42]
.................
error_count     : 0
sqid            : 0
cmdid           : 0
status_field    : 0(Successful Completion: The command completed without error)
phase_tag       : 0
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................

Thus, I'd propose this simple solution: do not raise a critical alert for NVMe storage devices if both error_count == 0 and status_field == 0. Does anyone see a potential downside to this?
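The proposed rule can be sketched in Python as follows. This is only an illustration of the condition stated above, not smartd code; the ErrorLogEntry class and both function names are hypothetical, and only the two fields used by the rule are modelled.

```python
from dataclasses import dataclass


@dataclass
class ErrorLogEntry:
    # Only the two fields the proposed rule looks at (hypothetical type).
    error_count: int
    status_field: int


def is_empty_entry(entry: ErrorLogEntry) -> bool:
    """True if the entry matches the proposed 'not an error' condition."""
    return entry.error_count == 0 and entry.status_field == 0


def should_alert(entries: list[ErrorLogEntry]) -> bool:
    """Raise a critical alert only if at least one entry is not empty."""
    return any(not is_empty_entry(e) for e in entries)
```

An entry like Entry[42] above would not trigger an alert under this rule, while any entry with a nonzero error_count or status_field still would.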

comment:5 Changed 7 months ago by Christian Franke

Results from various sources suggest that smartd should ignore error information log entries with ((status_field >> 1) & 0xfff) <= 0x002:
SCT/SC 0x0/0x00: Generic Command Status / Successful Completion,
SCT/SC 0x0/0x01: Generic Command Status / Invalid Command Opcode,
SCT/SC 0x0/0x02: Generic Command Status / Invalid Field in Command.
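A minimal Python sketch of this check, for illustration only. The expression is taken from the comment above and applied to the raw 16-bit status field, where the shift by one drops the phase tag in bit 0. The function names are hypothetical, and the split into SCT (bits 10:8) and SC (bits 7:0) of the shifted value is an assumption based on the NVMe status field layout.

```python
def status_bits(status_field: int) -> int:
    """Drop the phase tag (bit 0) and keep the low 12 status bits."""
    return (status_field >> 1) & 0xfff


def is_ignorable(status_field: int) -> bool:
    """True for SCT/SC 0x0/0x00, 0x0/0x01 and 0x0/0x02."""
    return status_bits(status_field) <= 0x002


def split_sct_sc(status_field: int) -> tuple[int, int]:
    """Return (SCT, SC), assuming SCT in bits 10:8 and SC in bits 7:0."""
    bits = status_bits(status_field)
    return (bits >> 8) & 0x7, bits & 0xff
```

For example, a raw value of 0x0004 yields SCT/SC 0x0/0x02 and is ignored, while 0x0006 yields 0x0/0x03 and is not. Tools that print an already shifted status value would need the shift removed before applying the same comparison.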

comment:6 Changed 7 months ago by Christian Franke

Related: ticket #1663.

comment:7 Changed 7 months ago by Christian Franke

Ticket #1722 has been marked as a duplicate of this ticket.

comment:8 Changed 6 months ago by kolAflash

@Christian Franke:
I checked my nvme error-log output.
Excerpt:
status_field : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
And error_count is increasing with each error log entry.
See here for full details: ticket:1722#comment:5

Also I created a Linux kernel ticket.
https://bugzilla.kernel.org/show_bug.cgi?id=217445

And I guess these can be related too:

comment:9 Changed 6 months ago by Christian Franke

Milestone: unscheduled → Release 7.4
Owner: set to Christian Franke
Status: new → accepted

comment:10 Changed 6 months ago by Christian Franke

Resolution: fixed
Status: accepted → closed

comment:11 Changed 6 months ago by Patrick Decat

Cc: Patrick Decat added

comment:12 Changed 3 months ago by Christian Franke

GH issues/208 has been marked as a duplicate of this ticket.

comment:13 Changed 2 months ago by Peter Nowee

Cc: Peter Nowee added

comment:14 Changed 8 days ago by ThoughtPolice84

I have this error on a Samsung NVMe SSD drive in my openmediavault NAS system.
Smartd daemon error output:

The following warning/error was logged by the smartd daemon:

Device: /dev/disk/by-id/nvme-SAMSUNG_MZVLB256HAHQ-000H1, number of Error Log entries increased from 348 to 349

Device info:
SAMSUNG MZVLB256HAHQ-000H1, S/N:*, FW:EXD70H1Q, NSID:1, 256 GB

I see this issue has been marked as resolved (r5472). Is there anything I should do in the configuration of smartmontools?

comment:15 in reply to: 14 Changed 7 days ago by Christian Franke

Replying to ThoughtPolice84:

I have this error on a Samsung NVMe SSD drive in my openmediavault NAS system.
Smartd daemon error output: ...

Please check the syslog for a LOG_INFO message like

Device: /dev/disk/by-id/nvme-SAMSUNG_MZVLB256HAHQ-000H1, NVMe error [INDEX], count 349, status 0xSTATUS: MESSAGE

and report it here including the version of smartd.

The message should precede this LOG_CRIT message:

Device: /dev/disk/by-id/nvme-SAMSUNG_MZVLB256HAHQ-000H1, number of Error Log entries increased from 348 to 349

comment:16 Changed 3 hours ago by ThoughtPolice84

Version:

smartd 7.2 2020-12-30 r5155 [x86_64-linux-6.1.0-0.deb11.11-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

The error:

Device: /dev/disk/by-id/nvme-SAMSUNG_MZVLB256HAHQ-000H1_S425NX1M782076, number of Error Log entries increased from 348 to 349