Opened 3 years ago

Closed 3 years ago

Last modified 13 months ago

#1434 closed defect (invalid)

NVMe SMART/Health Critical Warning Bit 3 (0x4) not necessarily an error

Reported by: ThomasH
Owned by:
Priority: minor
Milestone:
Component: all
Version:
Keywords: nvme
Cc:

Description (last modified by Christian Franke)

Hello,
We are using an NVMe drive, a "SAMSUNG MZVLB512HAJQ-00000".
smartmontools reports the following error:

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04

The "Critical Warning" attribute consists of several bits, each signaling a different status.
According to the documentation, the flag for bit 4 means:
"If set to ‘1’, then the volatile memory backup device has failed. This field is only valid if the controller has a volatile memory backup solution"

The referenced documentation is at https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf (page 122).

Currently, bit 4 is always reported as an error, but whether it is meaningful depends on the model (that is, whether it has a capacitor for power-loss protection).

Either this should be detected automatically, or a parameter could be added to specify whether the bit is treated as an error.
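
For reference, here is a minimal Python sketch of how the Critical Warning byte can be split into its individual bits. The bit descriptions are paraphrased from the NVMe 1.4 SMART / Health Information log page; the names `CRITICAL_WARNING_BITS` and `decode_critical_warning` are made up for this example and are not part of smartmontools. Bits are counted from 0 (least significant), so the value 0x04 is bit 2, which matches the message smartctl printed above.

```python
# Bit meanings of the Critical Warning byte, paraphrased from NVMe 1.4.
# Bit 0 is the least significant bit; bits 6 and 7 are reserved.
CRITICAL_WARNING_BITS = {
    0: "available spare capacity has fallen below the threshold",
    1: "temperature is outside the configured thresholds",
    2: "NVM subsystem reliability has been degraded",
    3: "media has been placed in read only mode",
    4: "volatile memory backup device has failed",
    5: "persistent memory region has become read-only or unreliable",
}


def decode_critical_warning(value):
    """Return a list of (bit, description) pairs for every bit set in value."""
    if not 0 <= value <= 0xFF:
        raise ValueError("Critical Warning is a single byte")
    result = []
    for bit in range(8):
        if value & (1 << bit):
            result.append((bit, CRITICAL_WARNING_BITS.get(bit, "reserved")))
    return result


if __name__ == "__main__":
    for bit, text in decode_critical_warning(0x04):
        mask = 1 << bit
        print(f"bit {bit} (mask 0x{mask:02x}): {text}")
    # prints: bit 2 (mask 0x04): NVM subsystem reliability has been degraded
```

A value with bit 4 set would be 0x10 in this numbering.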

Attachments (2)

nvme.png (39.4 KB ) - added by ThomasH 3 years ago.
nvme-structure.txt (1.8 KB ) - added by ThomasH 3 years ago.

Change History (13)

by ThomasH, 3 years ago

Attachment: nvme.png added

comment:1 by Christian Franke, 3 years ago

Description: modified

comment:2 by Christian Franke, 3 years ago

Keywords: nvme added
Milestone: undecided
Summary: Smart-Attribute Critical Warning Bit 4 not necessarily an error → NVMe SMART/Health Critical Warning Bit 4 not necessarily an error

If a device has no volatile memory backup solution, then it should never set this bit. Everything else is IMO a firmware bug.

Does any bit in Identify Controller structure report whether bit 4 of Critical Warning is valid?

PS: Please don't send smartctl output as screenshots. Use plain-text attachments or wiki markup instead.

comment:3 by ThomasH, 3 years ago

Hello,
thanks for the quick reply.
The specification only states when this field is considered valid and how to interpret it in that case. If there is no "volatile memory backup solution", the meaning of the bit is not defined. So I would not consider it a bug if this bit is set in the absence of a volatile memory backup solution.

I checked the Identify Controller structure, as you recommended. I found two elements in the structure which might be related to it:

vwc       : 0x1
awupf     : 0

For completeness I have attached the full structure.

awupf is described in the specification as:
"This field indicates the size of the write operation guaranteed to be written atomically to the NVM across all namespaces with any supported namespace format during a power fail or error condition."

Maybe bit 3 of "Critical Warning" should be cleared or ignored when awupf is zero?
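
The idea in the last sentence could look roughly like the following Python sketch. It only illustrates the proposal and does not describe how smartmontools behaves. The function name `effective_critical_warning` and the default mask 0x04 (the value reported in this ticket) are assumptions made for the example.

```python
def effective_critical_warning(critical_warning, awupf, ignore_mask=0x04):
    """Return critical_warning with the ignore_mask bits cleared if awupf is zero.

    critical_warning: raw Critical Warning byte from the SMART/Health log.
    awupf:            awupf value from the Identify Controller structure.
    ignore_mask:      bits to drop (hypothetical default: 0x04, as in this ticket).
    """
    if awupf == 0:
        # Clear only the masked bits and keep every other warning bit.
        return critical_warning & ~ignore_mask & 0xFF
    return critical_warning


if __name__ == "__main__":
    # Values from this ticket: Critical Warning 0x04, awupf 0.
    print(hex(effective_critical_warning(0x04, 0)))
```

Keeping the mask as a parameter means other warning bits are never hidden by accident.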

by ThomasH, 3 years ago

Attachment: nvme-structure.txt added

comment:4 by ThomasH, 3 years ago

Summary: NVMe SMART/Health Critical Warning Bit 4 not necessarily an error → NVMe SMART/Health Critical Warning Bit 3 (0x4) not necessarily an error

comment:5 by Alex Samorukov, 3 years ago

This does not seem to be a healthy drive to me.

  1. None of the other reports we have seen had this bit set to 1. Also, unused fields are typically zeroed.
  2. There is no clear definition of what a "Volatile Memory Backup System" is. However, an Intel datasheet describes it as "Bit 4: Volatile Memory Backup System has failed (e.g., enhanced power loss capacitor test failure)". In this case it refers only to the capacitor that gives the SSD a chance to shut down correctly after a power loss.
  3. I found at least one report on the net for a similar drive where this bit is set to 0.

My suggestion is to try to RMA this drive, or at least to contact the vendor about it. I am not sure that adding logic to hide this warning makes sense, at least until we are 100% confident that doing so is correct.

comment:6 by Christian Franke, 3 years ago

I agree. I don't remember any similar report since we added NVMe support in 6.5 (2016, #657).

  • Setting an invalid error bit to 1 or (rand() & 1) is at least bad software engineering.
  • There might also be an unrelated error but the firmware reports it with the wrong bit.
  • NVMe 1.4 does not specify anything like: "Bit 4 shall be ignored if the controller does not indicate a volatile memory backup solution in the FOOBAR field of the Identify Controller data structure."

Leaving ticket open as undecided for now.

comment:7 by ThomasH, 3 years ago

Hello,
thanks for your input and thoughts on this topic.
We will open a ticket with Samsung and try to get further information.
We have two servers, each with a software RAID 1, and the second NVMe shows this entry 0x04.
The first NVMe drives in the RAIDs each show 0x00. Maybe this is random, or maybe it is related.

comment:8 by Christian Franke, 3 years ago

Please report the result here (if possible).

comment:9 by ThomasH, 3 years ago

Hello,
Samsung didn't offer any support because it is an OEM product.

So we contacted the hosting provider.
Our provider seems to be aware of this problem and recommended cleaning the slot of the NVMe.
Today the provider carried out the cleaning, and the error went away from the first nvme.
So it seems to be an electrical issue, not a software issue.

Thanks for your input and your suggestions!

comment:10 by Christian Franke, 3 years ago

Milestone: undecided
Resolution: invalid
Status: new → closed

This is likely a hardware or firmware issue and not a smartmontools bug.

comment:11 by Christian Franke, 13 months ago

Related: ticket #1715.