#1828 closed defect (duplicate)
smartctl -x creates new NVMe errors
Reported by: | rzsn | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | Release 7.5 |
Component: | smartctl | Version: | 7.4 |
Keywords: | nvme | Cc: | rzsn |
Description
I was given a drive to check its condition, but the error count kept increasing. After a while I noticed that the count increased by 1 with each smartctl -x call (I was repeatedly checking the state of the SSD while ext4 lazy inode init was doing its job after formatting the partition).
Since the errors are not about a failure but about a malformed command, could something be done in smartctl to prevent this from happening?
My expectation: an error checking tool shall not generate errors while doing so.
I see multiple reasons to fix this. If somebody calls smartctl repeatedly, the error count increases without representing any actual error condition of the drive.
If somebody calls smartctl too aggressively, actual errors might get lost (if the NVMe error log is a circular buffer, which AFAIK it is; 64 entries on mine).
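To illustrate the concern about lost entries, here is a minimal sketch that models the error log as a fixed-size ring keeping only the most recent entries. This is an assumption about the drive's behavior, not something confirmed in this ticket; the function name and the entry labels are hypothetical.

```python
from collections import deque


def surviving_real_errors(capacity, real_errors, bogus_calls):
    """Count real entries still present after bogus entries are appended.

    Assumes the log is a ring buffer that silently drops its oldest
    entry once `capacity` entries are stored.
    """
    log = deque(maxlen=capacity)
    for i in range(real_errors):
        log.append(("real", i))
    # Each smartctl call that triggers the bug adds one bogus entry.
    for i in range(bogus_calls):
        log.append(("bogus", i))
    return sum(1 for kind, _ in log if kind == "real")
```

Under this model, with a 64-entry log holding 3 real errors, 61 bogus calls leave all real entries intact, while 64 bogus calls push every real entry out.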
Ticket #1222 was about something similar: "Smartd should ignore non-error entries from NVMe Error Information log".
But hiding such entries is one thing, while creating actual error entries is NOT desirable.
So the drive in question is a WD SN640 NVMe:
=== START OF INFORMATION SECTION ===
Model Number:                       WUS4BB076D7P3E3
Serial Number:                      ********
Firmware Version:                   R111000L
PCI Vendor/Subsystem ID:            0x1b96
IEEE OUI Identifier:                0x0014ee
Total NVM Capacity:                 7,681,501,126,656 [7.68 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          7,681,501,126,656 [7.68 TB]
Namespace 1 Formatted LBA Size:     4096
Namespace 1 IEEE EUI-64:            0014ee 83066cbb80
Local Time is:                      Tue Apr 30 20:30:15 2024 CEST
Firmware Updates (0x19):            4 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x001f):   Security Format Frmw_DL NS_Mngmt Self_Test
Optional NVM Commands (0x005e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Warning  Comp. Temp. Threshold:     70 Celsius
Critical Comp. Temp. Threshold:     80 Celsius
Namespace 1 Features (0x02):        NA_Fields
Current state at end of smartctl -x call:
Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0         12     0  0xd009  0xc004      -            0     1     -  Invalid Field in Command
  1         11     0  0xc008  0xc004      -            0     1     -  Invalid Field in Command
  2         10     0  0xa00b  0xc004      -            0     1     -  Invalid Field in Command
  3          9     0  0x900a  0xc004      -            0     1     -  Invalid Field in Command
  4          8     0  0x8009  0xc004      -            0     1     -  Invalid Field in Command
  5          7     0  0xa00e  0xc004      -            0     1     -  Invalid Field in Command
  6          6     0  0x900d  0xc004      -            0     1     -  Invalid Field in Command
  7          5     0  0x7008  0xc004      -            0     1     -  Invalid Field in Command
  8          4     0  0x800c  0xc004      -            0     1     -  Invalid Field in Command
  9          3     0  0x600c  0xc004      -            0     1     -  Invalid Field in Command
 10          2     0  0x100a  0xc004      -            0     1     -  Invalid Field in Command
 11          1     0  0x300e  0xc004  0x028            0     0     -  Invalid Field in Command
And using nvme-cli error-log, I see that all these errors (except the oldest) are of this kind:
................. Entry[ 0] .................
error_count     : 12
sqid            : 0
cmdid           : 0xd009
status_field    : 0x6002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag       : 0
parm_err_loc    : 0xffff
lba             : 0
nsid            : 0x1
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
csi             : 0
opcode          : 0
cs              : 0
trtype_spec_info: 0
log_page_version: 0
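The two tools print the same status differently: smartctl shows 0xc004, while nvme-cli shows 0x6002 plus a separate phase_tag. The sketch below shows how the two values appear to relate, assuming the 16-bit status word layout from the NVMe base specification (bit 0 is the phase tag, bits 15:1 are the status field holding SC, SCT, More and DNR). The names `NvmeStatus` and `decode_error_log_status` are hypothetical and not part of smartctl or nvme-cli.

```python
from typing import NamedTuple


class NvmeStatus(NamedTuple):
    phase_tag: int  # bit 0 of the raw 16-bit word
    status: int     # bits 15:1, the value nvme-cli prints as status_field
    sc: int         # Status Code (status bits 7:0)
    sct: int        # Status Code Type (status bits 10:8)
    more: bool      # More bit (status bit 13)
    dnr: bool       # Do Not Retry bit (status bit 14)


def decode_error_log_status(raw: int) -> NvmeStatus:
    """Split the raw 16-bit status word into its fields (assumed layout)."""
    status = (raw >> 1) & 0x7FFF
    return NvmeStatus(
        phase_tag=raw & 0x1,
        status=status,
        sc=status & 0xFF,
        sct=(status >> 8) & 0x7,
        more=bool(status & (1 << 13)),
        dnr=bool(status & (1 << 14)),
    )
```

With this layout, `decode_error_log_status(0xC004)` yields status 0x6002 with phase tag 0, SCT 0 (generic command status) and SC 0x02, which is consistent with the "Invalid Field in Command" text printed by both tools.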
How can we trace this to an exact query which smartctl does?
Change History (5)
comment:1 by , 5 months ago
Keywords: | error log removed |
---|---|
Milestone: | → undecided |
comment:2 by , 5 months ago
On my gentoo installed version 7.4:
- errors are NOT generated when using /dev/nvme0
- errors are GENERATED with /dev/nvme0n1 as an argument
- errors are NOT generated when -l selftest is omitted (with either nvme0 or nvme0n1)
Now with CI build version:
smartctl pre-7.5 2024-04-26 r5612 [x86_64-linux-6.8.1-gentoo-x86_64] (CircleCI)
- no errors are generated, for all 4 combinations (-l selftest/without)*(nvme0/nvme0n1)
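The four combinations tested above can be enumerated with a short script. This is a sketch only: the option list is the expansion of smartctl -x quoted in this ticket, the device paths are the ones from this report, and the "Error Information Log Entries:" line format parsed from smartctl -A output is an assumption. The function names are hypothetical, and nothing here executes smartctl.

```python
import itertools
import re

# Expansion of "smartctl -x" as quoted in this ticket, minus the self-test log.
BASE_ARGS = ["-H", "-i", "-c", "-A", "-l", "error"]
SELFTEST_ARGS = ["-l", "selftest"]


def build_test_matrix(devices=("/dev/nvme0", "/dev/nvme0n1")):
    """Return one smartctl argv list per (device, with/without selftest) pair."""
    matrix = []
    for device, with_selftest in itertools.product(devices, (True, False)):
        args = BASE_ARGS + (SELFTEST_ARGS if with_selftest else [])
        matrix.append(["smartctl", *args, device])
    return matrix


def parse_error_log_entries(smartctl_output):
    """Extract the error counter from smartctl -A text, or None if absent.

    Assumes a line such as 'Error Information Log Entries:   12',
    possibly with thousands separators.
    """
    match = re.search(
        r"^Error Information Log Entries:\s*([\d,]+)",
        smartctl_output,
        re.MULTILINE,
    )
    return int(match.group(1).replace(",", "")) if match else None
```

Running each command from `build_test_matrix()` as root and comparing the parsed counter before and after would show which combination adds an entry to the log.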
I declare that your hint was correct and spot on, and that this bug is resolved in the CI build.
The wording in bug #1741 is different from mine, but the root cause of both tickets is the same. I was more worried about having the drive's error log filled with bogus entries than about your sub-command actually failing. But yes, after further review of the outputs, I now see the message that I had overlooked before:
Read Self-test Log failed: Invalid Field in Command (0x6002)
Thanks very much for excellent support!
comment:3 by , 5 months ago
Resolution: | → duplicate |
---|---|
Status: | new → closed |
This ticket is a duplicate of ticket #1741.
comment:5 by , 5 months ago
Milestone: | undecided → Release 7.5 |
---|
smartctl -x is the same as smartctl -H -i -c -A -l error -l selftest. Please try the latter without -l selftest, or use the broadcast namespace instead of namespace 1, or use a recent CI build from https://builds.smartmontools.org/. If this works, this ticket is a duplicate of ticket #1741.