Opened 19 months ago

Closed 18 months ago

Last modified 9 months ago

#1722 closed enhancement (duplicate)

error_log on Seagate FireCuda 510 NVMe (Debian 11 and 12)

Reported by: kolAflash
Owned by:
Priority: minor
Milestone: Release 7.4
Component: smartd
Version: 7.3
Keywords: nvme
Cc:

Description

I'm getting error_log entries for my Seagate FireCuda 510 SSD ZP2000GM30001.
See the attachment.
OS: Debian 11 and 12 (Linux 5.10 and 6.1)

I doubt there is an actual hardware defect, because the system is running totally fine (and yes, I do daily backups).

Maybe it's some bad NVMe commands like in #1663 !?
If yes, how can I find out which commands are being sent, and why?
Is opening a ticket on https://bugzilla.kernel.org/ a good idea to get those bad NVMe commands fixed?

P.S.
With Debian-12 this became more prominent, because now there's a graphical notification (smart-notifier).

Hard Disk Health Warning
The hard disk health status has changed. This could mean that hard drive failure is imminent. It is always a good idea to have up to date backups.
This message was generated by the smartd daemon running on:

   host name:  myhost
   DNS domain: mydomain

The following warning/error was logged by the smartd daemon:

Device: /dev/nvme0, number of Error Log entries increased from 343 to 345

Device info:
Seagate FireCuda 510 SSD ZP2000GM30001, S/N:XXXXXXXX, FW:STES1024, 2.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Fri Dec 25 13:31:06 2020 CET
Another message will be sent in 24 hours if the problem persists.

Attachments (2)

Change History (9)

comment:1 by kolAflash, 19 months ago

P.S.
The error log count increases by one every time I put the notebook into standby (S3).
So this could really be caused by a bad NVMe command issued on standby or wakeup.
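To confirm the "one new entry per suspend" pattern, the counter can be read before and after a suspend cycle. The following is a minimal sketch, not part of smartmontools: `nvme_error_count` is a hypothetical helper name, and it assumes that `smartctl -A` prints a line of the form `Error Information Log Entries: <number>` for NVMe devices.

```shell
# Print the NVMe error log entry count found in `smartctl -A` output (read from stdin).
# Assumption: the output contains a line "Error Information Log Entries: <number>".
nvme_error_count() {
    # Keep only the number and strip thousands separators such as "1,234".
    sed -n 's/^Error Information Log Entries:[[:space:]]*\([0-9,]*\).*$/\1/p' | tr -d ','
}

# Example usage on real hardware (needs root, therefore left commented out):
#   before=$(smartctl -A /dev/nvme0 | nvme_error_count)
#   systemctl suspend            # wake the machine up again, then:
#   after=$(smartctl -A /dev/nvme0 | nvme_error_count)
#   echo "new error log entries: $(( after - before ))"
```

Reading the counter this way avoids waiting for the next smartd polling interval.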

Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        348     0  0x800e  0x4004  0x028            0     0     -
  1        347     0  0x1018  0x4004  0x028            0     0     -
  2        346     0  0x4000  0x4004  0x028            0     0     -

See attachment:smart_Seagate-FireCuda-510-SSD.txt​ for previous entries.
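The Status and PELoc columns above can be decoded by hand. The sketch below assumes that the Status value printed by smartctl still contains the phase tag in bit 0, which would explain why `nvme error-log` reports 0x2002 for the same entries, and that PELoc holds a byte offset in bits 7:0 and a bit offset in bits 10:8. The function names are hypothetical helpers, not part of any tool.

```shell
# Decode the Status column of an NVMe error log entry as printed by smartctl.
# Assumption: bit 0 is the phase tag, bits 15:1 are the status field.
decode_nvme_status() {
    raw=$(( $1 ))
    field=$(( raw >> 1 ))            # drop the phase tag bit
    sc=$(( field & 0xff ))           # Status Code (0x02 = Invalid Field in Command)
    sct=$(( (field >> 8) & 0x7 ))    # Status Code Type (0 = generic command status)
    more=$(( (field >> 13) & 0x1 ))  # More: details are available in the error log
    dnr=$(( (field >> 14) & 0x1 ))   # Do Not Retry
    printf 'status_field=0x%04x sct=%d sc=0x%02x more=%d dnr=%d\n' \
        "$field" "$sct" "$sc" "$more" "$dnr"
}

# Decode the PELoc (parameter error location) column.
decode_nvme_peloc() {
    raw=$(( $1 ))
    byte=$(( raw & 0xff ))           # byte offset within the 64-byte command
    bit=$(( (raw >> 8) & 0x7 ))      # bit offset within that byte
    printf 'byte=%d bit=%d dword=%d\n' "$byte" "$bit" $(( byte / 4 ))
}

decode_nvme_status 0x4004   # prints: status_field=0x2002 sct=0 sc=0x02 more=1 dnr=0
decode_nvme_peloc 0x028     # prints: byte=40 bit=0 dword=10
```

Under these assumptions, all three entries report "Invalid Field in Command", with the offending field located at byte 40 of the command, which corresponds to command dword 10.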

Notebook: HP EliteBook 735 G6 (Ryzen 3500U)

P.S. (2)
Some Linux-6.1.25 dmesg messages from bootup and when entering suspend (S3).

2023-05-09T10:25:55.800449+02:00 myhost kernel: [    2.933308] nvme nvme0: pci function 0000:04:00.0
2023-05-09T10:25:55.800454+02:00 myhost kernel: [    2.939022] nvme nvme0: missing or invalid SUBNQN field.
2023-05-09T10:25:55.800459+02:00 myhost kernel: [    2.939073] nvme nvme0: Shutdown timeout set to 10 seconds
2023-05-09T10:25:55.800464+02:00 myhost kernel: [    2.941185] nvme nvme0: 8/0/0 default/read/poll queues
2023-05-09T10:25:55.800468+02:00 myhost kernel: [    2.941736] nvme nvme0: ctrl returned bogus length: 16 for NVME_NIDT_EUI64
2023-05-09T10:25:55.800470+02:00 myhost kernel: [    2.943756]  nvme0n1: p1 p2 p3 p4
[...]
2023-05-09T10:37:48.685064+02:00 myhost kernel: [  735.372074] PM: suspend entry (deep)
[...]
2023-05-09T11:06:25.945097+02:00 myhost kernel: [  737.706880] nvme nvme0: Shutdown timeout set to 10 seconds
2023-05-09T11:06:25.945100+02:00 myhost kernel: [  737.708557] nvme nvme0: 8/0/0 default/read/poll queues
2023-05-09T11:06:25.945103+02:00 myhost kernel: [  737.708901] nvme nvme0: ctrl returned bogus length: 16 for NVME_NIDT_EUI64
[...]
2023-05-09T11:06:26.000696+02:00 myhost kernel: [  739.403068] PM: suspend exit

comment:2 by kolAflash, 18 months ago

I've got another notebook (HP EliteBook 845 G8, Ryzen-5650U) with the same SSD model (Seagate FireCuda 510) showing the same behavior.
The error log count increases by one on every standby (S3).

P.S.
Found this in the EliteBook 845 G8 dmesg at boot:

2023-05-09T10:39:48.899014+02:00 myhost kernel: [    0.915830][  T449] nvme 0000:03:00.0: platform quirk: setting simple suspend
2023-05-09T10:39:48.899014+02:00 myhost kernel: [    0.915931][  T449] nvme nvme0: pci function 0000:03:00.0
[...]
2023-05-09T10:39:48.899018+02:00 myhost kernel: [    0.919920][   T89] nvme nvme0: missing or invalid SUBNQN field.
2023-05-09T10:39:48.899019+02:00 myhost kernel: [    0.919939][   T89] nvme nvme0: Shutdown timeout set to 10 seconds
2023-05-09T10:39:48.899020+02:00 myhost kernel: [    0.921188][   T89] nvme nvme0: 8/0/0 default/read/poll queues
2023-05-09T10:39:48.899020+02:00 myhost kernel: [    0.922707][   T90]  nvme0n1: p1 p2 p3 p4

But there's no nvme error when entering suspend (standby with s2idle a.k.a. S0ix).


comment:3 by Christian Franke, 18 months ago (in reply to: description)

> Maybe it's some bad NVMe commands like in #1663 !?

Yes.

> If yes, how can I find out which commands and why they are being sent?

Unlike the ATA error log, the NVMe error information log does not contain information about the command codes used.

> Is opening a ticket on https://bugzilla.kernel.org/ a good idea to get those bad NVMe commands fixed?

Yes. The kernel should not normally issue unsupported commands. It possibly does this to probe whether certain optional commands are supported.

comment:4 by Christian Franke, 18 months ago

Component: all → smartd
Keywords: nvme added
Resolution: → duplicate
Status: new → closed
Type: task → enhancement

See ticket #1222.

comment:5 by kolAflash, 18 months ago

Linux Kernel Ticket

Created: https://bugzilla.kernel.org/show_bug.cgi?id=217445



nvme error-log

Uploaded the output of nvme error-log /dev/nvme0:
attachment:nvme-error-log_Seagate-FireCuda-510-SSD_HP-EliteBook-735-G6-Ryzen3500U-Debian-12.txt

Excerpt:
status_field : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
And error_count is increasing with each error log entry.

So Christian's suggestion in ticket:1222#comment:5 to look at these values probably won't help here.

INTERESTING:
Nearly every one of the 63 log entries has a different cmdid.
Only the following appear more than once; most cmdids seem random:
0x4, 0x8014, 0xc012, 0xd00e, 0xe
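A tally like the one above can be produced directly from the `nvme error-log` output. This is a rough sketch: `count_cmdids` is a hypothetical helper, and it assumes that each log entry contains a line of the form `cmdid : 0x800e`, which may differ between nvme-cli versions.

```shell
# Count how often each cmdid occurs in `nvme error-log` output (read from stdin).
# Assumption: every entry has a line "cmdid : <hex value>".
count_cmdids() {
    awk -F: '
        # Match only lines whose field name is exactly "cmdid".
        $1 ~ /^[ \t]*cmdid[ \t]*$/ { gsub(/[ \t]/, "", $2); n[$2]++ }
        END { for (id in n) print n[id], id }
    ' | sort -k1,1nr -k2,2   # most frequent first, ties ordered by cmdid
}

# Example usage on real hardware (needs root, therefore left commented out):
#   nvme error-log /dev/nvme0 | count_cmdids
```

Each output line holds the number of occurrences followed by the cmdid, so repeated values show up at the top.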



some notes

# show more log entries
smartctl --log=error,256 /dev/nvme0n1

# Print more details about log entries.
# See: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900244#50
nvme error-log /dev/nvme0

Copied from #1222:
Debian Bug 900244
Ubuntu Bug 1878264



Workarounds / silence smart-notifier popup

Any good ideas on how to silence the log messages and graphical popups for the moment,
ideally without risking missing something important?

smart-notifier: https://packages.debian.org/de/bookworm/smart-notifier (screenshot)

Workaround 1:
(silence graphical popups, Debian-12)
apt remove smart-notifier

Workaround 2:
/etc/smartd.conf (may be /etc/smartmontools/smartd.conf on some systems)
Add BEFORE DEVICESCAN line: /dev/nvme0 -d ignore
systemctl restart smartmontools.service
systemctl status smartmontools.service
Will report Unable to monitor any SMART enabled devices. all existing devices are ignored.
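The edit for Workaround 2 can also be scripted. The sketch below is only an illustration: `add_before_devicescan` is a hypothetical helper that prints a copy of the given config file with one extra line placed before the first DEVICESCAN directive. It writes to stdout so the result can be reviewed before the real file is replaced.

```shell
# Print FILE with LINE inserted before the first DEVICESCAN directive.
# Usage: add_before_devicescan FILE 'LINE'
add_before_devicescan() {
    awk -v line="$2" '
        !seen && /^[ \t]*DEVICESCAN/ { print line; seen = 1 }
        { print }
        END { if (!seen) print line }   # no DEVICESCAN found: append at the end
    ' "$1"
}

# Example usage (the config path may differ, see Workaround 2; left commented out):
#   add_before_devicescan /etc/smartd.conf '/dev/nvme0 -d ignore' > /tmp/smartd.conf.new
#   diff /etc/smartd.conf /tmp/smartd.conf.new   # review, then copy into place
#   systemctl restart smartmontools.service
```

Commented-out DEVICESCAN lines are not matched, because the pattern only allows leading whitespace before the directive.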

Workaround 3:
systemctl disable --now smartmontools.service

https://unix.stackexchange.com/questions/80894/how-to-get-smartd-to-ignore-an-hdd
https://askubuntu.com/questions/1051710/how-to-disable-smart-checks-for-removable-drives


comment:6 by Christian Franke, 18 months ago

Milestone: Release 7.4

comment:7 by kolAflash, 18 months ago

@Christian
I am having a little discussion about how the kernel could support a solution for this.
Would you mind having a look?
https://bugzilla.kernel.org/show_bug.cgi?id=217445#c7
