Opened 3 months ago

Last modified 3 months ago

#1215 new defect

smartctl selects the wrong NVMe device

Reported by: bendreth Owned by:
Priority: major Milestone: undecided
Component: all Version:
Keywords: nvme linux Cc:

Description (last modified by Christian Franke)

In the current nightly, smartctl selects the wrong device when it is run on /dev/nvmeX rather than /dev/nvmeXn1 and two identical NVMe modules are installed. Example:

# ls -l nvme-Samsung_SSD_970_EVO_2TB_S464NB0M200088N | cut -c40-
nvme-Samsung_SSD_970_EVO_2TB_S464NB0M200088N -> ../../nvme0n1
# ls -l nvme-Samsung_SSD_970_EVO_2TB_S464NB0M200161Y | cut -c40-
nvme-Samsung_SSD_970_EVO_2TB_S464NB0M200161Y -> ../../nvme1n1

# ./smartctl -a /dev/nvme0n1 | sed -n '1p;5,7p'
smartctl 7.1 2019-07-01 r4934 [x86_64-linux-4.18.0-24-generic] (local build)
Model Number:                       Samsung SSD 970 EVO 2TB
Serial Number:                      S464NB0M200088N
Firmware Version:                   2B2QEXE7
# ./smartctl -a /dev/nvme1n1 | sed -n '1p;5,7p'
smartctl 7.1 2019-07-01 r4934 [x86_64-linux-4.18.0-24-generic] (local build)
Model Number:                       Samsung SSD 970 EVO 2TB
Serial Number:                      S464NB0M200161Y
Firmware Version:                   2B2QEXE7

# ./smartctl -a /dev/nvme0 | sed -n '1p;5,7p'
smartctl 7.1 2019-07-01 r4934 [x86_64-linux-4.18.0-24-generic] (local build)
Model Number:                       Samsung SSD 970 EVO 2TB
Serial Number:                      S464NB0M200161Y
Firmware Version:                   2B2QEXE7
# ./smartctl -a /dev/nvme1 | sed -n '1p;5,7p'
smartctl 7.1 2019-07-01 r4934 [x86_64-linux-4.18.0-24-generic] (local build)
Model Number:                       Samsung SSD 970 EVO 2TB
Serial Number:                      S464NB0M200088N
Firmware Version:                   2B2QEXE7

Change History (4)

comment:1 Changed 3 months ago by Christian Franke

Description: modified (diff)
Keywords: nvme linux added
Milestone: undecided

Did it work with an older release on same machine?

Smartctl does not explicitly select a device; it simply accesses the pass-through I/O control behind the specified device node. Are the major/minor device numbers set correctly for these nodes?

Please provide output of the following commands:

ls -l /dev/nvme*
egrep ':|nvme' /proc/devices

./smartctl -d nvme,0x1 -a /dev/nvme0 | sed -n '1p;5,7p'
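
For reference, the major/minor check asked about above could be scripted roughly as follows. This is only a sketch: it assumes GNU coreutils stat (for the %t and %T format specifiers) and a Linux sysfs layout with /sys/dev/char/MAJOR:MINOR symlinks. The function names devnum and chardev_sysfs_path are made up for this illustration and are not part of smartmontools.

```shell
# Hypothetical helper: print "major:minor" (decimal) of a device node.
# Assumes GNU stat, which reports both numbers in hex via %t and %T.
devnum() {
    hex=$(stat -c '%t:%T' "$1") || return 1
    printf '%d:%d\n' "0x${hex%%:*}" "0x${hex##*:}"
}

# Hypothetical helper: print the sysfs directory a character device node
# resolves to. The second argument (sysfs root, default /sys) exists only
# so the sketch can be tried against a scratch directory tree.
chardev_sysfs_path() {
    root="${2:-/sys}"
    readlink -f "$root/dev/char/$(devnum "$1")"
}
```

Running chardev_sysfs_path /dev/nvme0 and chardev_sysfs_path /dev/nvme1 would show which PCI device each controller node actually leads to.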

comment:2 Changed 3 months ago by bendreth

It looks the same on older versions. Here is the output:

# smartctl -? | head -n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.18.0-24-generic] (local build)
# smartctl -a /dev/nvme0n1 | sed -n '6p'
Serial Number:                      S464NB0M200088N
# smartctl -a /dev/nvme1n1 | sed -n '6p'
Serial Number:                      S464NB0M200161Y
# smartctl -a /dev/nvme0 | sed -n '6p'
Serial Number:                      S464NB0M200161Y
# smartctl -a /dev/nvme1 | sed -n '6p'
Serial Number:                      S464NB0M200088N

# ls -l /dev/nvme*
crw------- 1 root root 241, 0 Jun 25 02:29 /dev/nvme0
brw-rw---- 1 root disk 259, 1 Jun 25 02:29 /dev/nvme0n1
brw-rw---- 1 root disk 259, 4 Jun 25 02:29 /dev/nvme0n1p1
brw-rw---- 1 root disk 259, 5 Jun 25 02:29 /dev/nvme0n1p9
crw------- 1 root root 241, 1 Jun 25 02:29 /dev/nvme1
brw-rw---- 1 root disk 259, 0 Jun 25 02:29 /dev/nvme1n1
brw-rw---- 1 root disk 259, 2 Jun 25 02:29 /dev/nvme1n1p1
brw-rw---- 1 root disk 259, 3 Jun 25 02:29 /dev/nvme1n1p9
# egrep ':|nvme' /proc/devices
Character devices:
241 nvme
Block devices:
# ./smartctl -d nvme,0x1 -a /dev/nvme0 | sed -n '1p;5,7p'
smartctl 7.1 2019-07-01 r4934 [x86_64-linux-4.18.0-24-generic] (local build)
Model Number:                       Samsung SSD 970 EVO 2TB
Serial Number:                      S464NB0M200161Y
Firmware Version:                   2B2QEXE7

This is on Ubuntu 18.04.2.

comment:3 Changed 3 months ago by bendreth

259 is:

Block devices:
259 blkext

The kernel version is:

$ uname -a
Linux neptune 4.18.0-24-generic #25~18.04.1-Ubuntu SMP Thu Jun 20 11:13:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

According to https://people.canonical.com/~kernel/info/kernel-version-map.html this corresponds to mainline kernel 4.18.20.

comment:4 Changed 3 months ago by bendreth

The entries are mixed up in /sys:

$ ls -l /sys/dev/block/259* /sys/dev/char/241* | cut -c39-
/sys/dev/block/259:0 -> ../../devices/pci0000:00/0000:00:1b.0/0000:01:00.0/nvme/nvme0/nvme1n1
/sys/dev/block/259:1 -> ../../devices/pci0000:00/0000:00:1b.4/0000:02:00.0/nvme/nvme1/nvme0n1
/sys/dev/block/259:2 -> ../../devices/pci0000:00/0000:00:1b.0/0000:01:00.0/nvme/nvme0/nvme1n1/nvme1n1p1
/sys/dev/block/259:3 -> ../../devices/pci0000:00/0000:00:1b.0/0000:01:00.0/nvme/nvme0/nvme1n1/nvme1n1p9
/sys/dev/block/259:4 -> ../../devices/pci0000:00/0000:00:1b.4/0000:02:00.0/nvme/nvme1/nvme0n1/nvme0n1p1
/sys/dev/block/259:5 -> ../../devices/pci0000:00/0000:00:1b.4/0000:02:00.0/nvme/nvme1/nvme0n1/nvme0n1p9
/sys/dev/char/241:0 -> ../../devices/pci0000:00/0000:00:1b.0/0000:01:00.0/nvme/nvme0
/sys/dev/char/241:1 -> ../../devices/pci0000:00/0000:00:1b.4/0000:02:00.0/nvme/nvme1

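The character device nvme0 points to 0000:01:00.0, but the namespace names look crossed: nvme0n1 sits under controller nvme1 (0000:02:00.0), and nvme1n1 sits under controller nvme0 (0000:01:00.0). So /dev/nvme0 and /dev/nvme0n1 refer to different physical devices.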
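
The controller-to-namespace pairing shown in the listing above can also be read directly from sysfs. The following is a minimal sketch, assuming the layout seen in this ticket, where each controller directory under /sys/class/nvme contains its namespaces as subdirectories. The function name nvme_ctrl_map is hypothetical, and the optional root argument is there only so the sketch can be tried on a scratch tree. On kernels with NVMe multipath enabled the layout may differ, so treat the output as a hint rather than as authoritative.

```shell
# Hypothetical helper: print one "controller namespace" pair per line,
# e.g. "nvme0 nvme1n1", by walking <root>/class/nvme/<ctrl>/<namespace>.
nvme_ctrl_map() {
    root="${1:-/sys}"
    for ns in "$root"/class/nvme/nvme*/nvme*n*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$ns" ] || continue
        ctrl=$(basename "$(dirname "$ns")")
        printf '%s %s\n' "$ctrl" "$(basename "$ns")"
    done
}
```

On the machine described here, calling nvme_ctrl_map without arguments would be expected to pair nvme0 with nvme1n1 and nvme1 with nvme0n1, matching the symlinks above.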
