Opened 17 months ago

Last modified 4 months ago

#1670 new defect

smartd: per device rules don't match for nvmes

Reported by: calestyo
Owned by:
Priority: minor
Milestone: undecided
Component: smartd
Version:
Keywords: nvme linux
Cc: cemysce, aetf

Description

Hey.

Not sure if this is a bug or I just misunderstand something:

Previously, with SATA HDDs, I had e.g. the following in smartd.conf to specifically set temperature ranges, etc.:

/dev/disk/by-id/ata-Samsung_SSD_850_PRO_1TB     -d auto -d removable -n standby,4 -a -W 0,60,70 -m root,calestyo,mail@example.org -M exec /usr/share/smartmontools/smartd-runner
DEVICESCAN                                                      -d auto -d removable -n standby,4 -a -W 0,45,50 -m root,calestyo,mail@example.org -M exec /usr/share/smartmontools/smartd-runner

with the last one being a catch-all rule.

For my new NVMe I've added:

/dev/disk/by-id/nvme-SAMSUNG_MZVL22T0HBLB-00B07_S63JNX0T475928  -d auto -d removable -n standby,4 -a -W 0,60,70 -m root,calestyo,mail@example.org -M exec /usr/share/smartmontools/smartd-runner

at the top.

Yet I still get notifications like:

This message was generated by the smartd daemon running on:

   host name:  heisenberg
   DNS domain: scientia.org

The following warning/error was logged by the smartd daemon:

Device: /dev/nvme0, Temperature 55 Celsius reached critical limit of 50 Celsius (Min/Max 32/72)

Device info:
SAMSUNG MZVL22T0HBLB-00B07, S/N:S63JNX0T475928, FW:GXB7602Q, 2.04 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.

I guess the reason might be that:
/dev/disk/by-id/nvme-SAMSUNG_MZVL22T0HBLB-00B07_S63JNX0T475928 is a symlink to ../../nvme0n1
whereas smartd seems to report the temperature warning for /dev/nvme0.

First, that's a bit of a surprise to me, since nvme0n1 seems to be the storage block device, e.g.:

# blockdev --getsize64 /dev/nvme0
blockdev: ioctl error on BLKGETSIZE64: Inappropriate ioctl for device
# blockdev --getsize64 /dev/nvme0n1
2048408248320

Second, udev doesn't create any stable name link for /dev/nvme0 in /dev/disk/by-*/ so I'd have to use /dev/nvme0 in the config, I guess(?), which is a bit ugly IMO.
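
For reference, the relationship between the by-id link target and the controller node can be sketched as below. This is a minimal Python illustration, not part of smartmontools: the helper names `nvme_controller_of` and `controller_for_link` are hypothetical, and the code assumes the usual Linux naming where namespace `nvmeXnY` (optionally with a partition suffix `pZ`) belongs to controller `nvmeX`. That assumption may not hold on systems with native NVMe multipath enabled.

```python
import os
import posixpath
import re

# Matches Linux NVMe namespace nodes such as nvme0n1 or nvme0n1p2.
_NVME_NS = re.compile(r"^(nvme\d+)n\d+(p\d+)?$")

def nvme_controller_of(dev_path):
    """Return the controller node (e.g. /dev/nvme0) for an NVMe namespace
    path, or None if the name does not look like an NVMe namespace.

    Hypothetical helper; assumes nvmeXnY belongs to controller nvmeX.
    """
    directory, name = posixpath.split(dev_path)
    match = _NVME_NS.match(name)
    if match is None:
        return None
    return posixpath.join(directory, match.group(1))

def controller_for_link(link):
    """Resolve a /dev/disk/by-id/... symlink and map it to its controller."""
    return nvme_controller_of(os.path.realpath(link))
```

Since the by-id link in this report points to ../../nvme0n1, `controller_for_link` would be expected to map it to /dev/nvme0 on that system.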

Any ideas?

Thanks,
Chris.

Change History (10)

comment:1 by Christian Franke, 17 months ago

Component: all → smartd
Keywords: nvme linux added
Milestone: undecided

> Device: /dev/nvme0, Temperature 55 Celsius reached critical limit of 50 Celsius (Min/Max 32/72)

/dev/nvme0 is possibly the result of DEVICESCAN because duplicate detection does not work. Possibly we need some enhancement here.

Please provide contents of smartd.conf and output of smartd -q onecheck. Remove device serial numbers, WWN, ... if desired.

> First, that's a bit of a surprise to me, since nvme0n1 seems to be the storage block device, e.g.: ...

NVMe SMART/Health and Error Information are not namespace specific, so /dev/nvme0 should be an appropriate device name. Even with /dev/nvme0n1, smartctl and smartd would use the broadcast namespace for these logs. DEVICESCAN does not return the actual block devices, so that identical problems are not reported once per namespace.

> Second, udev doesn't create any stable name link for /dev/nvme0 in /dev/disk/by-*/ so I'd have to use /dev/nvme0 in the config, I guess(?), which is a bit ugly IMO.

Yes. I have no idea why udev is unable to create stable links.

comment:2 by calestyo, 17 months ago

smartd.conf is really just that (comments removed):

/dev/disk/by-id/nvme-SAMSUNG_MZVL22T0HBLB-00B07_S63JNX0T475928	-d auto -d removable -n standby,4 -a -W 0,60,70 -m root,calestyo,mail@example.org -M exec /usr/share/smartmontools/smartd-runner
/dev/disk/by-id/ata-Samsung_SSD_850_PRO_1TB_S252NXAG910017F	-d auto -d removable -n standby,4 -a -W 0,60,70 -m root,calestyo,mail@example.org -M exec /usr/share/smartmontools/smartd-runner
DEVICESCAN							-d auto -d removable -n standby,4 -a -W 0,45,50 -m root,calestyo,mail@example.org -M exec /usr/share/smartmontools/smartd-runner

and:

# smartd -q onecheck
smartd 7.3 2022-02-28 r5338 [x86_64-linux-6.0.0-4-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

Opened configuration file /etc/smartd.conf
Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Device: /dev/disk/by-id/nvme-SAMSUNG_MZVL22T0HBLB-00B07_S63JNX0T475928, unique name: /dev/nvme0n1
Device: /dev/disk/by-id/nvme-SAMSUNG_MZVL22T0HBLB-00B07_S63JNX0T475928, opened
Device: /dev/disk/by-id/nvme-SAMSUNG_MZVL22T0HBLB-00B07_S63JNX0T475928, SAMSUNG MZVL22T0HBLB-00B07, S/N:S63JNX0T475928, FW:GXB7602Q, NSID:1, 2.04 TB
Device: /dev/disk/by-id/nvme-SAMSUNG_MZVL22T0HBLB-00B07_S63JNX0T475928, is SMART capable. Adding to "monitor" list.
Device: /dev/disk/by-id/nvme-SAMSUNG_MZVL22T0HBLB-00B07_S63JNX0T475928, state read from /var/lib/smartmontools/smartd.SAMSUNG_MZVL22T0HBLB_00B07-S63JNX0T475928-n1.nvme.state
Device: /dev/disk/by-id/ata-Samsung_SSD_850_PRO_1TB_S252NXAG910017F, unable to autodetect device type
Device: /dev/disk/by-id/ata-Samsung_SSD_850_PRO_1TB_S252NXAG910017F, not available
Device: /dev/nvme0, opened
Device: /dev/nvme0, SAMSUNG MZVL22T0HBLB-00B07, S/N:S63JNX0T475928, FW:GXB7602Q, 2.04 TB
Device: /dev/nvme0, is SMART capable. Adding to "monitor" list.
Device: /dev/nvme0, state read from /var/lib/smartmontools/smartd.SAMSUNG_MZVL22T0HBLB_00B07-S63JNX0T475928.nvme.state
Monitoring 0 ATA/SATA, 0 SCSI/SAS and 2 NVMe devices
Device: /dev/disk/by-id/nvme-SAMSUNG_MZVL22T0HBLB-00B07_S63JNX0T475928, opened NVMe device
Device: /dev/disk/by-id/nvme-SAMSUNG_MZVL22T0HBLB-00B07_S63JNX0T475928, initial Temperature is 36 Celsius (Min/Max 32/73)
Device: /dev/nvme0, opened NVMe device
Device: /dev/nvme0, initial Temperature is 36 Celsius (Min/Max 32/72)
Device: /dev/disk/by-id/nvme-SAMSUNG_MZVL22T0HBLB-00B07_S63JNX0T475928, state written to /var/lib/smartmontools/smartd.SAMSUNG_MZVL22T0HBLB_00B07-S63JNX0T475928-n1.nvme.state
Device: /dev/nvme0, state written to /var/lib/smartmontools/smartd.SAMSUNG_MZVL22T0HBLB_00B07-S63JNX0T475928.nvme.state
Started with '-q onecheck' option. All devices successfully checked once.
smartd is exiting (exit status 0)

So /dev/nvme0 is basically the controller when more than one would be connected?

> Yes. I have no idea why udev is unable to create stable links.

I guess because from the kernel PoV it's not a storage block device...?

comment:3 by Christian Franke, 17 months ago (in reply to comment:2)

> # smartd -q onecheck
> ...
> Device: /dev/disk/by-id/nvme-SAMSUNG_MZVL22T0HBLB-00B07_S63JNX0T475928, unique name: /dev/nvme0n1
> Device: /dev/disk/by-id/nvme-SAMSUNG_MZVL22T0HBLB-00B07_S63JNX0T475928, opened
> Device: /dev/disk/by-id/nvme-SAMSUNG_MZVL22T0HBLB-00B07_S63JNX0T475928, SAMSUNG MZVL22T0HBLB-00B07, S/N:S63JNX0T475928, FW:GXB7602Q, NSID:1, 2.04 TB
> ...
> Device: /dev/nvme0, opened
> Device: /dev/nvme0, SAMSUNG MZVL22T0HBLB-00B07, S/N:S63JNX0T475928, FW:GXB7602Q, 2.04 TB

DEVICESCAN returns /dev/nvme0 and duplicate detection does not work here. This needs some improvement for NVMe.
Please try DEVICESCAN -d by-id (hidden experimental feature).
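
Until duplicate detection covers NVMe, a duplicate can be spotted from the `smartd -q onecheck` output itself, because both entries report the same serial number. The following is a minimal Python sketch, assuming the identification line format shown in the log above (`Device: NAME, MODEL, S/N:SERIAL, FW:...`). The function name `find_duplicates` is hypothetical, and this is not how smartd handles duplicates internally.

```python
import re
from collections import defaultdict

# Identification line as printed in the log above, e.g.
# "Device: /dev/nvme0, SAMSUNG ..., S/N:S63JNX0T475928, FW:GXB7602Q, 2.04 TB"
_IDENT = re.compile(r"^Device: (?P<name>[^,]+), .*?S/N:(?P<serial>[^,\s]+),")

def find_duplicates(onecheck_output):
    """Map each serial number seen under more than one device name
    to the list of those names (in order of first appearance)."""
    names_by_serial = defaultdict(list)
    for line in onecheck_output.splitlines():
        match = _IDENT.match(line.strip())
        if match is None:
            continue  # not an identification line (e.g. "..., opened")
        names = names_by_serial[match["serial"]]
        if match["name"] not in names:
            names.append(match["name"])
    return {s: n for s, n in names_by_serial.items() if len(n) > 1}
```

Feeding it the output quoted above should report the one serial number under both the by-id name and /dev/nvme0.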

> So /dev/nvme0 is basically the controller when more than one would be connected?

Yes, but this does not make much difference for smartmontools because the NVMe pass-through I/O-control always addresses the physical device and SMART/Health info is always read from the broadcast namespace.
The outputs of smartctl -x /dev/nvme0 and smartctl -x /dev/nvme0n1 should be similar.

PS: -d auto is redundant because it is the default. Common settings could be moved to a DEFAULT directive. This example should be equivalent to your smartd.conf:

DEFAULT -d removable -n standby,4 -a -W 0,60,70 -m root,calestyo,mail@example.org -M exec /usr/share/smartmontools/smartd-runner
/dev/disk/by-id/nvme-SAMSUNG_MZVL22T0HBLB-00B07_S63JNX0T475928
/dev/disk/by-id/ata-Samsung_SSD_850_PRO_1TB_S252NXAG910017F
DEVICESCAN -W 0,45,50

comment:4 by calestyo, 16 months ago

Hey.

So I guess you will sooner or later improve duplicate detection? :-)

What exactly does -d by-id do?

Cheers,
Chris.

PS: Thanks for the tips in your PS.

comment:5 by Christian Franke, 16 months ago (in reply to comment:4)

> So I guess you will sooner or later improve duplicate detection? :-)

Yes. But I'm still not sure what the best way is. Keep /dev/nvme0 or /dev/nvme0n1?

If possible, please check whether the NVMe pass-through I/O-control behaves differently for the two devices, for example:

smartctl -x /dev/nvme0 > nvme0.txt
smartctl -x /dev/nvme0n1 > nvme0n1.txt
diff -u nvme0.txt nvme0n1.txt

> What exactly does -d by-id do?

It first scans /dev/disk/by-id/* before the remaining devices, resolves the symlinks and removes duplicates. But this currently ignores all links not leading to /dev/sdX.
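
Read literally, that description amounts to something like the following Python sketch. It only illustrates the behaviour described above and is not smartd's actual code: `scan_by_id` and its `links` argument (a mapping from link path to resolved target) are hypothetical, and the /dev/sdX filter is modelled as a simple regular expression.

```python
import re

# Only targets like /dev/sda or /dev/sdab are considered, per the description.
_SD_DEVICE = re.compile(r"^/dev/sd[a-z]+$")

def scan_by_id(links, other_devices):
    """Return device names to monitor: by-id links first (one per resolved
    /dev/sdX target), then the remaining devices not already covered."""
    result = []
    seen_targets = set()
    for link in sorted(links):
        target = links[link]
        if not _SD_DEVICE.match(target):
            continue  # e.g. links leading to /dev/nvme0n1 are ignored
        if target not in seen_targets:
            seen_targets.add(target)
            result.append(link)
    for dev in other_devices:
        if dev not in seen_targets:
            result.append(dev)
    return result
```

In this model a link leading to /dev/nvme0n1 is skipped and /dev/nvme0 from the regular scan is kept, so the by-id name would not replace /dev/nvme0 for an NVMe drive.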

You could possibly try this if the duplicate NVMe device is always /dev/nvme0:

DEFAULT ...
/dev/disk/by-id/nvme-SAMSUNG_MZVL22T0HBLB-00B07_S63JNX0T475928
/dev/disk/by-id/ata-Samsung_SSD_850_PRO_1TB_S252NXAG910017F
/dev/nvme0 -d ignore
DEVICESCAN

comment:6 by calestyo, 16 months ago

Well I don't know that. I mean from the (end-)user-perspective it should probably be nvme0n1 because that's closest to sda, which people would have used previously.

But I don't understand enough about NVMe to tell what's best. I mean, what if there were more SSDs attached to nvme0 ... or would that even work (n1, n2)? And if so, would they share the SMART data? Or would one see differences when specifically targeting n1 or n2?

diff <(smartctl -x /dev/nvme0) <(smartctl -x /dev/nvme0n1) gives no differences in my case.

> /dev/nvme0 -d ignore

I also had that idea, but feared I'd forget about it and that it might remove the device completely once duplicate detection is in place.

comment:7 by cemysce, 15 months ago

Cc: cemysce added

comment:8 by aetf, 14 months ago

Cc: aetf added

comment:9 by calestyo, 4 months ago

Anything new here, with respect to duplicate detection? :-) At least as of 7.4, I still get "both" devices detected :-(

comment:10 by Christian Franke, 4 months ago

Sorry no, otherwise this ticket would have been closed.
