Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#1404 closed defect (fixed)

"smartctl -l error" on NVMe device crashes the Linux kernel

Reported by: Jerome Kieffer Owned by: Christian Franke
Priority: major Milestone: Release 7.2
Component: all Version: 7.1
Keywords: linux nvme Cc:

Description (last modified by Christian Franke)

Reading the error logged in the drive crashes the computer. The NVMe drive can be found (again) only after power cycle.

Device:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.9.0-3-amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Micron_2200_MTFDHBA1T0TCK
Serial Number:                      1941246F5E81
Firmware Version:                   P1MU003
PCI Vendor/Subsystem ID:            0x1344
IEEE OUI Identifier:                0x00a075
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            00a075 01246f5e81
Local Time is:                      Mon Nov 30 21:32:25 2020 CET

Linux kernel log (version 5.9) at the time of the crash:

Nov 30 20:59:31 antarctica kernel: [ 1068.251549] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:31 antarctica kernel: [ 1068.262228] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:31 antarctica kernel: [ 1068.273520] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:31 antarctica kernel: [ 1068.284264] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.294988] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.305695] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.316401] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.327219] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.337824] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.348643] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.359339] amd_iommu_report_page_fault: 4 callbacks suppressed
Nov 30 20:59:32 antarctica kernel: [ 1068.359342] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.369938] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.380664] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.391459] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.402169] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.412776] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.424438] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.435706] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.435722] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0380 flags=0x0000]
Nov 30 20:59:32 antarctica kernel: [ 1068.446479] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0034 address=0xf88b0000 flags=0x0000]

Change History (15)

comment:1 by Christian Franke, 3 years ago

Description: modified (diff)

comment:2 by Christian Franke, 3 years ago

Keywords: linux nvme added; NVMe removed
Milestone: undecided
Priority: minor → major

See Debian Bug 947803 for a related report for a Micron 2200S with firmware 22001030 and 22001040 under Linux.

How many error log pages does this device support?

If possible, please run (from package nvme-cli):

# nvme id-ctrl /dev/nvme... | grep elpe

The value elpe+1 (Error Log Page Entries, zero-based) specifies the number of error log entries the device supports.

Does the crash also occur with nvme-cli and if only a subset of all pages is read?

nvme error-log --log-entries=1 /dev/nvme...
nvme error-log --log-entries=256 /dev/nvme...
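
As a rough illustration of the arithmetic behind this question: ELPE is zero-based, and each Error Information log entry occupies 64 bytes according to the NVMe specification. The sketch below turns an elpe value into an entry count and a total log size. The function name is hypothetical and this is not code from smartmontools or nvme-cli.

```python
# Size of one Error Information log entry (NVMe Log 0x01), in bytes.
ERROR_LOG_ENTRY_SIZE = 64


def error_log_size(elpe: int) -> tuple[int, int]:
    """Return (entry count, total log size in bytes) for a zero-based ELPE value."""
    if not 0 <= elpe <= 255:
        raise ValueError("ELPE is a single byte (0..255)")
    entries = elpe + 1  # ELPE is zero-based
    return entries, entries * ERROR_LOG_ENTRY_SIZE


# Example values only: elpe = 63 fits in one 4 KiB transfer,
# elpe = 255 needs 16 KiB for the whole log.
print(error_log_size(63))   # (64, 4096)
print(error_log_size(255))  # (256, 16384)
```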

comment:3 by Christian Franke, 3 years ago

Summary: "smartctl -l error" on NVMe device crashes the computer → "smartctl -l error" on NVMe device crashes the Linux kernel

comment:4 by Jerome Kieffer, 3 years ago

Dear Christian,

Thanks for taking this bug seriously. I checked whether Micron has released new firmware, without success. Dell and HP, who ship this kind of SSD in their laptops, did upgrade their firmware.

~$ sudo nvme  id-ctrl /dev/nvme0n1 |grep elp
elpe      : 255

The first entry could be read without error:

~$ sudo nvme error-log --log-entries=1 /dev/nvme0n1
Error Log Entries for device:nvme0n1 entries:1
.................
 Entry[ 0]   
.................
error_count	: 0
sqid		: 0
cmdid		: 0
status_field	: 0(SUCCESS: The command completed successfully)
parm_err_loc	: 0
lba		: 0
nsid		: 0
vs		: 0
trtype		: The transport type is not indicated or the error is not transport related.
cs		: 0
trtype_spec_info: 0
.................

And the last one:

~$ sudo nvme error-log --log-entries=256 /dev/nvme0n1
NVMe status: INVALID_FIELD: A reserved coded value or an unsupported value in a defined field(0x2)

Apparently only 64 entries are readable, unlike the 256 advertised.
None of the suggested commands triggered any warnings in the logs, nor any crashes.

comment:5 by Christian Franke, 3 years ago

The root of the problem is possibly the transfer size. nvme error-log ... limits a single transfer to 4KiB (64 entries) since commit 465a4d (requires NVMe 1.2.1+).

Please check the MDTS (Maximum Data Transfer Size):

$ sudo smartctl -c /dev/nvme0n1 | grep Maximum
$ sudo nvme id-ctrl /dev/nvme0n1 | grep mdts

comment:6 by Jerome Kieffer, 3 years ago

$ sudo smartctl -c /dev/nvme0n1 | grep Maximum
Maximum Data Transfer Size:         128 Pages
$ sudo nvme id-ctrl /dev/nvme0n1 | grep mdts
mdts      : 7

Debian testing provides (currently) nvme-cli version 1.12-5

comment:7 by Christian Franke, 3 years ago

Milestone: undecided → Release 7.2
Owner: set to Christian Franke
Status: new → accepted

nvme-cli 1.12-... includes this change.

With MDTS 7 (128 pages, 512KiB), reading the whole 16KiB error log in one transfer, as smartctl currently does, should work. There might be another restriction (or bug) in the NVMe pass-through I/O-control of the kernel. Similar problems were not reported for other platforms.
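
The MDTS arithmetic can be sketched as follows. MDTS is a power-of-two exponent in units of the controller's minimum memory page size, and a value of 0 means no limit is reported. The 4 KiB page size is an assumption here (it matches the "128 Pages" / 512KiB figures above), and the function name is hypothetical.

```python
def max_transfer_bytes(mdts: int, min_page_size: int = 4096):
    """Return the maximum data transfer size in bytes, or None if MDTS is 0.

    min_page_size is assumed to be 4 KiB; the real value comes from the
    controller's CAP.MPSMIN field.
    """
    if mdts < 0:
        raise ValueError("MDTS cannot be negative")
    if mdts == 0:
        return None  # controller reports no transfer size limit
    return (1 << mdts) * min_page_size


limit = max_transfer_bytes(7)   # mdts value reported in comment 6
print(limit // 1024)            # 512 (KiB)
print(16 * 1024 <= limit)       # True: a 16 KiB error log fits within the limit
```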

I will change smartctl such that log transfers are limited to 4KiB. This should prevent such kernel or device crashes. Drawback: Error log entries > 64 could not be read from older (NVMe 1.2 or earlier) drives or if the NVMe pass-through layer could not pass CDW12 (-d sntrealtek).
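
A minimal sketch of how such a limit splits a log read into transfers, assuming 64-byte entries and a 4KiB limit per transfer. Every transfer after the first starts at a non-zero byte offset, which is why it depends on the device and the pass-through layer accepting a Log Page Offset. This is an illustration with hypothetical names, not the actual change made in smartctl.

```python
ENTRY_SIZE = 64      # bytes per Error Information log entry
CHUNK_LIMIT = 4096   # per-transfer limit described above


def plan_error_log_reads(entries: int, chunk_limit: int = CHUNK_LIMIT):
    """Return a list of (byte_offset, byte_count) transfers covering `entries` entries."""
    if entries < 1:
        raise ValueError("need at least one entry")
    if chunk_limit < ENTRY_SIZE or chunk_limit % ENTRY_SIZE:
        raise ValueError("chunk limit must be a positive multiple of the entry size")
    total = entries * ENTRY_SIZE
    plan = []
    offset = 0
    while offset < total:
        count = min(chunk_limit, total - offset)
        plan.append((offset, count))
        offset += count
    return plan


# 256 entries need four transfers; only the first one has offset 0.
for offset, count in plan_error_log_reads(256):
    print(offset, count)
```

With 64 entries or fewer, the plan contains a single transfer at offset 0, so no offset support is needed.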

comment:8 by Christian Franke, 3 years ago

comment:9 by Christian Franke, 3 years ago

If possible, please test r5123 or later. Binaries are available here: https://builds.smartmontools.org/.

comment:10 by Jerome Kieffer, 3 years ago

Thank you for your quick response. I tested r5123 and the logs can be read without issue:

local/sbin$ sudo ./smartctl -a /dev/nvme0n1
[sudo] password for jerome: 
smartctl 7.2 2020-12-03 r5123 [x86_64-linux-5.9.0-3-amd64] (CircleCI)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Micron_2200_MTFDHBA1T0TCK
Serial Number:                      1941246F5E81
Firmware Version:                   P1MU003
PCI Vendor/Subsystem ID:            0x1344
IEEE OUI Identifier:                0x00a075
Controller ID:                      0
NVMe Version:                       1.2.1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            00a075 01246f5e81
Local Time is:                      Thu Dec  3 21:39:40 2020 CET
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0017):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.25W       -        -    0  0  0  0        0       0
 1 +     2.40W       -        -    1  1  1  1        0       0
 2 +     1.90W       -        -    2  2  2  2        0       0
 3 -   0.0800W       -        -    3  3  3  3    10000    2500
 4 -   0.0050W       -        -    4  4  4  4     5000   44000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        44 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    64,222 [32.8 GB]
Data Units Written:                 296,424 [151 GB]
Host Read Commands:                 692,480
Host Write Commands:                986,574
Controller Busy Time:               24
Power Cycles:                       34
Power On Hours:                     8
Unsafe Shutdowns:                   12
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               44 Celsius
Temperature Sensor 2:               47 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

comment:11 by Christian Franke, 3 years ago

Thanks! Please also test smartctl -l error,256 /dev/nvme0n1. I don't have a device with > 64 error log entries, so reads with LPO (Log Page Offset) are never needed on my hardware.

comment:12 by Jerome Kieffer, 3 years ago

The 192 missing entries are reported as follows:

local/sbin$ sudo ./smartctl -l error,256 /dev/nvme0n1
smartctl 7.2 2020-12-03 r5123 [x86_64-linux-5.9.0-3-amd64] (CircleCI)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
Read Error Information Log failed, 192 entries missing: NVMe Status 0x02
Error Information (NVMe Log 0x01, 64 of 256 entries)
No Errors Logged

comment:13 by Christian Franke, 3 years ago

NVMe Status 0x02 means Invalid Field in Command. This suggests that this device does not implement LPO (Log Page Offset) support. It IMO should as it advertises NVMe 1.2.1 compatibility. So the full error log could neither be read in one step nor in smaller chunks.

I consider this ticket as fixed, although the actual reason for the crash could not be fully determined (bug in device firmware, bug in Linux kernel, still hidden bug in smartmontools, ...).

Thanks for this bug report and testing.

comment:14 by Christian Franke, 3 years ago

Resolution: fixed
Status: accepted → closed

comment:15 by Christian Franke, 3 years ago

... this device does not implement LPO (Log Page Offset) support. It IMO should as it advertises NVMe 1.2.1 compatibility.

I take that back, LPO support is optional and indicated in the LPA field of the Identify Controller data structure. Fixed in r5124. With r5125, the LPA field is printed as Log Page Attributes ....
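
A small sketch of such a check, assuming the bit assignment from the NVMe specification where bit 2 of LPA indicates extended Get Log Page data (including the Log Page Offset fields). The LPA values below are examples only, since the ticket does not show this device's value, and the names are hypothetical rather than taken from smartmontools.

```python
# Bit 2 of the LPA (Log Page Attributes) byte: extended data for Get Log Page,
# which includes the Log Page Offset fields (assumed per the NVMe specification).
LPA_EXTENDED_DATA = 1 << 2


def supports_log_page_offset(lpa: int) -> bool:
    """Return True if the LPA byte advertises Log Page Offset support."""
    return bool(lpa & LPA_EXTENDED_DATA)


# Example values only, not read from the device in this ticket.
print(supports_log_page_offset(0x03))  # False: bit 2 clear
print(supports_log_page_offset(0x07))  # True: bit 2 set
```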
