Opened 5 years ago

Closed 3 years ago

Last modified 3 years ago

#1227 closed defect (fixed)

Crucial/Micron client SSDs: Don't interpret attribute 197 as Current_Pending_Sector

Reported by: hse
Owned by: Christian Franke
Priority: minor
Milestone: Release 7.2
Component: drivedb
Version: 7.0
Keywords: ssd
Cc: sersorrel

Description

Hello.

On some servers we use CRUCIAL CT2000MX500SSD1 SSDs. We run self-tests on them every day. The tests usually finish without problems, but sometimes we receive the following notification:

This message was generated by the smartd daemon running on:

   host name:  example
   DNS domain: example.com

The following warning/error was logged by the smartd daemon:

Device: /dev/sdx [SAT], 1 Currently unreadable (pending) sectors

Device info:
CT2000MX500SSD1, S/N:00000000000, WWN:000000000000000, FW:M3CR023, 2.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.

But when I check the SMART self-test log, no issue is reported:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       199         -
# 2  Short offline       Completed without error       00%       191         -
# 3  Short offline       Completed without error       00%       182         -
# 4  Short offline       Completed without error       00%       174         -
# 5  Extended offline    Completed without error       00%       167         -
# 6  Short offline       Completed without error       00%       166         -
# 7  Short offline       Completed without error       00%       159         -
# 8  Short offline       Completed without error       00%       152         -
# 9  Short offline       Completed without error       00%       145         -
#10  Short offline       Completed without error       00%       139         -
#11  Short offline       Completed without error       00%       132         -
#12  Short offline       Completed without error       00%       125         -
#13  Extended offline    Completed without error       00%       119         -
#14  Short offline       Completed without error       00%       118         -
#15  Short offline       Completed without error       00%       111         -
#16  Short offline       Completed without error       00%       104         -
#17  Short offline       Completed without error       00%        96         -
#18  Short offline       Completed without error       00%        89         -
#19  Short offline       Completed without error       00%        82         -
#20  Short offline       Completed without error       00%        75         -
#21  Extended offline    Completed without error       00%        68         -

The SMART attributes give no indication that anything is wrong with the device. No pending sectors have been logged and the drive is completely new:

SMART Attributes Data Structure revision number: 16                           
Vendor Specific SMART Attributes with Thresholds:                             
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       203
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   100   100   000    Old_age   Always       -       4
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       0
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       83
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   059   046   000    Old_age   Always       -       41 (Min/Max 0/54)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   100   100   001    Old_age   Offline      -       0
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       2966529424
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       56912802
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       149379986

smartctl -i /dev/sdx:

# smartctl -i /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron BX/MX1/2/3/500, M5/600, 1100 SSDs
Device Model:     CT2000MX500SSD1
Serial Number:    000000000000
LU WWN Device Id: 0 000000 000000000
Firmware Version: M3CR023
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Aug  7 13:09:20 2019 -00
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

So I am filing this as a false-positive defect.

Thank you and have a nice day.

Attachments (1)

smart_drivedb.h (1.2 KB ) - added by Christian Franke 4 years ago.
Local drive database entry for MX500 with firmware M3CR023


Change History (31)

comment:1 by Christian Franke, 5 years ago

Component: smartctl → smartd
Milestone: undecided

Please provide configuration line for this device from smartd.conf file.

Please check syslog for smartd ... Currently unreadable (pending) sectors messages.

comment:2 by Sjonny, 4 years ago

Chiming in on this ticket. I have 2 seemingly bogus reports about a CurrentPendingSector error. I do not run SMART tests regularly. I only ran a short one after I got the first report. I have no clue what it all means, but it seems a bit contradictory (output at the end of this comment).

System: Debian 10
Package: smartmontools 6.6-1

1st email was on Fri, 4 Oct 2019 23:32:47 +0200
2nd email was on Mon, 7 Oct 2019 08:02:47 +0200

System startup messages concerning harddisk:

Oct  3 19:02:47 kirika kernel: [    2.567604] sd 0:0:0:0: [sda] 976773168 512-byte logical blocks: (500 GB/466 GiB)
Oct  3 19:02:47 kirika kernel: [    2.567606] sd 0:0:0:0: [sda] 4096-byte physical blocks
Oct  3 19:02:47 kirika kernel: [    2.567613] sd 0:0:0:0: [sda] Write Protect is off
Oct  3 19:02:47 kirika kernel: [    2.567614] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Oct  3 19:02:47 kirika kernel: [    2.567624] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Oct  3 19:02:47 kirika kernel: [    2.568219]  sda: sda1
Oct  3 19:02:47 kirika kernel: [    2.568941] sd 0:0:0:0: [sda] supports TCG Opal
Oct  3 19:02:47 kirika kernel: [    2.568944] sd 0:0:0:0: [sda] Attached SCSI disk
Oct  3 19:02:47 kirika kernel: [    3.734781] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
Oct  3 19:02:47 kirika kernel: [    4.070924] EXT4-fs (sda1): re-mounted. Opts: errors=remount-ro
Oct  3 19:02:47 kirika smartd[465]: Device: /dev/sda [SAT], CT500MX500SSD1, S/N:1927E2119CC7, WWN:5-00a075-1e2119cc7, FW:M3CR023, 500 GB
Oct  3 19:02:47 kirika smartd[465]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
Oct  3 19:02:47 kirika smartd[465]: Device: /dev/sda [SAT], not found in smartd database.
Oct  3 19:02:47 kirika smartd[465]: Device: /dev/sda [SAT], opened
Oct  3 19:02:47 kirika smartd[465]: Device: /dev/sda [SAT], state read from /var/lib/smartmontools/smartd.CT500MX500SSD1-1927E2119CC7.ata.state
Oct  3 19:02:47 kirika smartd[465]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.CT500MX500SSD1-1927E2119CC7.ata.state
Oct  3 19:02:47 kirika smartd[465]: Device: /dev/sda, type changed from 'scsi' to 'sat'

Syslogs without temperature lines (temp changes between 65 and 67 values)

for 1st email:

Oct  4 23:32:47 kirika smartd[465]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
Oct  5 00:02:47 kirika smartd[465]: Device: /dev/sda [SAT], No more Currently unreadable (pending) sectors, warning condition reset after 1 email

for 2nd email:

Oct  7 08:02:47 kirika smartd[465]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
Oct  7 08:32:47 kirika smartd[465]: Device: /dev/sda [SAT], No more Currently unreadable (pending) sectors, warning condition reset after 1 email

I upgraded /var/lib/smartmontools/drivedb/drivedb.h in between the reports hoping it would fix something. I see more names in smartctl since but the second email still arrived.

$ ls -l /var/lib/smartmontools/drivedb
total 380
-rw-r--r-- 1 root root 207995 okt  5 09:41 drivedb.h
-rw-r--r-- 1 root root 179580 okt 15  2018 drivedb.h.old

Here, the .old file is the one shipped by Debian. I did not restart the smartd daemon though, it's only just now that I did that. Maybe that would help?

The /etc/smartd.conf is the original Debian one, a single line when comments are stripped:

DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner

Current output of smartctl -a /dev/sda:

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.2.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron BX/MX1/2/3/500, M5/600, 1100 SSDs
Device Model:     CT500MX500SSD1
Serial Number:    1927E2119CC7
LU WWN Device Id: 5 00a075 1e2119cc7
Firmware Version: M3CR023
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Oct  7 23:10:39 2019 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0031) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       206
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   100   100   000    Old_age   Always       -       3
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       0
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       42
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   067   049   000    Old_age   Always       -       33 (Min/Max 0/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   100   100   001    Old_age   Offline      -       0
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       501960200
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       9157853
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       25021935

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 1

ATA Error Count: 0
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 0 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 ec 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  c8 00 00 00 00 00 00 00      00:00:00.000  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       160         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The drive is barely two weeks old, yet the short offline error report talks about 0 days/hours. Also, the "device was in an unknown state" message suggests that there is a compatibility issue.

comment:3 by Sorin Sbarnea, 4 years ago

Can we do something about the CurrentPendingSector errors? I used to get one per week and now it is 1-2 per day. It was never more than one sector.

Clearly Crucial is never going to fix this issue with new firmware, but can we at least silence this particular warning? How?

Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

comment:4 by Christian Franke, 4 years ago

Add -C 0 to smartd.conf line. See smartd.conf man page for details.

comment:5 by offsides, 4 years ago

This appears to be an issue with the Crucial MX500 firmware. I just got one and it's generating 5-10 of these errors daily, but there's never a corresponding increase in Reallocated_Event_Count, nor can I find the "bad" sector in question, as it goes away without any other errors. The one fix I can think of, other than totally disabling the current pending sectors check, is to allow smartd to complain only if the value is greater than 1, rather than just non-zero. Obviously this should be configurable, as for drives without this bug a value of 1 may indicate a real problem, but a configurable threshold would solve this problem and make it easier to handle future buggy drives as well...
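The threshold idea above can be sketched in a few lines of Python. This is only an illustration of the proposal, not an existing smartd option: the function name `should_warn` and its `threshold` parameter are hypothetical.

```python
def should_warn(raw_pending: int, threshold: int = 0) -> bool:
    """Return True when the raw value of attribute 197 exceeds `threshold`.

    threshold=0 models the behaviour described in this ticket (any non-zero
    value triggers a warning); threshold=1 models the proposed relaxation.
    """
    return raw_pending > threshold


# Behaviour reported in this ticket: a raw value of 1 triggers a warning.
print(should_warn(1))
# Proposed behaviour for affected drives: a transient 1 is tolerated,
# but anything larger still warns.
print(should_warn(1, threshold=1), should_warn(2, threshold=1))
```

With a per-drive threshold, unaffected drives would keep the strict default while affected drives could ignore the transient value of 1.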

by Christian Franke, 4 years ago

Attachment: smart_drivedb.h added

Local drive database entry for MX500 with firmware M3CR023

comment:6 by Christian Franke, 4 years ago

Component: smartd → drivedb
Keywords: ssd added
Summary: SMART test generates possible false positives → Crucial MX500 firmware M3CR023 returns bogus attribute 197
Type: defect → enhancement

The attached local drive database entry should suppress Currently unreadable (pending) sectors reports from smartd for this specific drive and firmware only.

Copy the file to the default location of the local(!) drive database or append it to the existing file. See -B option on smartctl man page or smartctl -h output for the configured default location (usually /etc/smart_drivedb.h).

Please report the test result.

comment:7 by Sjonny, 4 years ago

startup shows the warning

Dec  9 18:51:18 kirika smartd[29114]: smartd 6.6 2017-11-05 r4594 [x86_64-linux-5.2.0-0.bpo.2-amd64] (local build)
Dec  9 18:51:18 kirika smartd[29114]: Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Dec  9 18:51:18 kirika smartd[29114]: Opened configuration file /etc/smartd.conf
Dec  9 18:51:18 kirika smartd[29114]: Drive: DEVICESCAN, implied '-a' Directive on line 21 of file /etc/smartd.conf
Dec  9 18:51:18 kirika smartd[29114]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Dec  9 18:51:18 kirika smartd[29114]: Device: /dev/sda, type changed from 'scsi' to 'sat'
Dec  9 18:51:18 kirika smartd[29114]: Device: /dev/sda [SAT], opened
Dec  9 18:51:18 kirika smartd[29114]: Device: /dev/sda [SAT], CT500MX500SSD1, S/N:1927E2119CC7, WWN:5-00a075-1e2119cc7, FW:M3CR023, 500 GB
Dec  9 18:51:18 kirika smartd[29114]: Device: /dev/sda [SAT], found in smartd database: Crucial/Micron MX500 SSDs
Dec  9 18:51:18 kirika smartd[29114]: Device: /dev/sda [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Dec  9 18:51:18 kirika smartd[29114]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
Dec  9 18:51:18 kirika smartd[29114]: Device: /dev/sda [SAT], state read from /var/lib/smartmontools/smartd.CT500MX500SSD1-1927E2119CC7.ata.state

smartctl -a /dev/sda shows the bogus line.

the comment in the file mentions ticket #1237, but that should be ticket #1227

comment:8 by Christian Franke, 4 years ago

the comment in the file mentions ticket #1237, but that should be ticket #1227

Thanks for catching.

Please check whether the new entry actually suppresses the Currently unreadable (pending) sectors messages and emails.

If -R 197 (or -R 197!, see man page) is temporarily added to smartd.conf, changes of this attribute are logged, for example ... Attribute: 197 ... changed from 100 [Raw 0] to 100 [Raw 1].

comment:9 by Telsin, 4 years ago

This seems to be happening with earlier firmware versions too. I can confirm it for "M3CR020" and "M3CR022", which your current update doesn't address. I'm testing with:

"M3CR02[0-3]", Firmware with bogus attribute 197 (see ticket #1237)

which also matches my drives with the other firmware versions. I will report on proper message suppression in a few days.
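As a quick check of which firmware strings the pattern above covers, here is a small Python sketch. It assumes the drive database matches the pattern against the whole firmware string and that Python's `re` syntax behaves the same as the database's regex flavour for a pattern this simple. The helper name `firmware_matches` is made up, and "M3CR024" is a string chosen only to show a non-match.

```python
import re

# Pattern taken from the entry quoted above. Whole-string matching via
# re.fullmatch is an assumption about how the drive database applies it.
FIRMWARE_PATTERN = re.compile(r"M3CR02[0-3]")


def firmware_matches(firmware: str) -> bool:
    """Return True if the firmware version string is covered by the pattern."""
    return FIRMWARE_PATTERN.fullmatch(firmware) is not None


for fw in ("M3CR020", "M3CR022", "M3CR023", "M3CR024"):
    print(fw, firmware_matches(fw))
```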

comment:10 by Telsin, 4 years ago

I can confirm it suppresses the messages. I saw these in the logs (with -R 197) and was not notified by email:

Dec 27 11:31:51 spire smartd[13210]: Device: /dev/sdi [SAT], SMART Usage Attribute: 197 Bogus_Current_Pend_Sect changed from 100 [Raw 0] to 100 [Raw 1]
Dec 27 12:01:52 spire smartd[13210]: Device: /dev/sdi [SAT], SMART Usage Attribute: 197 Bogus_Current_Pend_Sect changed from 100 [Raw 1] to 100 [Raw 0]

sdi is a MX500.
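To see how often the attribute flips, log lines like the two above can be extracted from syslog with a short Python script. This is a minimal sketch: the regular expression is derived only from these two sample lines, not from a documented log format, and the function name `parse_change` is hypothetical.

```python
import re

# Assumption: the line layout matches the two "-R 197" samples quoted above.
CHANGE_RE = re.compile(
    r"Device: (?P<dev>\S+) \[SAT\], SMART Usage Attribute: (?P<id>\d+) (?P<name>\S+) "
    r"changed from (?P<old_norm>\d+) \[Raw (?P<old_raw>\d+)\] "
    r"to (?P<new_norm>\d+) \[Raw (?P<new_raw>\d+)\]"
)


def parse_change(line):
    """Return (device, attribute id, old raw, new raw), or None if no match."""
    m = CHANGE_RE.search(line)
    if m is None:
        return None
    return (m["dev"], int(m["id"]), int(m["old_raw"]), int(m["new_raw"]))


SAMPLE = [
    "Dec 27 11:31:51 spire smartd[13210]: Device: /dev/sdi [SAT], SMART Usage Attribute: "
    "197 Bogus_Current_Pend_Sect changed from 100 [Raw 0] to 100 [Raw 1]",
    "Dec 27 12:01:52 spire smartd[13210]: Device: /dev/sdi [SAT], SMART Usage Attribute: "
    "197 Bogus_Current_Pend_Sect changed from 100 [Raw 1] to 100 [Raw 0]",
]

changes = [parse_change(line) for line in SAMPLE]
for change in changes:
    print(change)
```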

comment:11 by Christian Franke, 4 years ago

Milestone: undecided → Release 7.1
Owner: set to Christian Franke
Status: new → accepted

Thanks for testing. Then we could add the (enhanced) entry to the drive database.

comment:12 by Christian Franke, 4 years ago

Resolution: fixed
Status: accepted → closed

comment:13 by Christian Franke, 4 years ago

Related: ticket #1294.

comment:14 by Thomas Guyot, 3 years ago

FYI, the attribute returned by this SSD is not bogus, but it shouldn't be used for monitoring (unless, maybe, it stays in this state steadily without writes...).

I obtained the controller's technical documentation directly from Crucial. According to them it applies to all their consumer drives, including the BX300 I was specifically inquiring about, even though it is not listed. You should be able to find it online too: search for tnfd22_client_ssd_smart_attributes.pdf

According to this document, 197 "Current pending ECC count" is:

"This value represents the total number of ECC events found as a result of host commands (for example, READ commands) or during background operations."

An older version of the M500 firmware has its own doc with a slightly different description for that attribute, yet that could still fall within the newer firmware's description:

"This value gives the number of blocks waiting to be remapped"

This really is a number of current, in-flight events on the disk. I would expect uncorrectable errors to be counted in "187 Reported Uncorrectable Errors", and since such an error is not likely to be recoverable upon later reads, as it can be on magnetic media, I would assume these to be remapped almost immediately. In fact, testing has shown that I occasionally get a 1 during heavy writes, but it never lasts.

This is clearly different from the traditional spinning disk's meaning: the number of unreadable sectors that haven't yet been remapped (which can be remapped by writing over full sectors, i.e. 4k; smaller writes require the sectors to be readable, as disks have been using 4k internally long before AF). The disk will let you retry reads as many times as you want, then it will remap or fix these upon a successful read or a full write, but it will not fix them otherwise.

comment:15 by Christian Franke, 3 years ago

Milestone: Release 7.1 → Release 7.2
Resolution: fixed
Status: closed → reopened
Type: enhancement → defect

Reopen because new info is available, see above.

comment:16 by Christian Franke, 3 years ago

Thanks for the info. I found revision E of tnfd22_client_ssd_smart_attributes.pdf. The same search also found revision D in ticket #812 :-). The documentation of attributes 197/198 has been changed (fixed?):

Revision C (2014-12-19: M500 (FW>=MU03), M510, M550, MX100, M600, MX200):
197: Current Pending Sector Count - This value gives the number of blocks waiting to be remapped.
198: SMART offline scan uncorrectable sector count - This value is the cumulative number of unrecoverable read errors found in a background media scan. If no background media scan has been run, a value of 0 will be returned.

Revisions D (2016-09-23: +1100, +MX300) and E (2018-09-28: +1300):
197: Current pending ECC count - This value represents the total number of ECC events found as a result of host commands (for example, READ commands) or during background operations.
198: SMART offline scan uncorrectable error count - This value is the cumulative number of unrecoverable read errors (UECC) found in the most recent media scan triggered by a SMART EXECUTE OFF-LINE IMMEDIATE command. At the beginning of each media scan, this value shall reset to zero. If no media scan has been previously run, this field will be zero.

Conclusion: Attribute 197 is not bogus but has a different meaning and should not be monitored as Current_Pending_Sector. Attribute 198 could be left as Offline_Uncorrectable.

comment:17 by Christian Franke, 3 years ago

Summary: Crucial MX500 firmware M3CR023 returns bogus attribute 197 → Crucial/Micron client SSDs: Don't interpret attribute 197 as Current_Pending_Sector

Change summary accordingly.

comment:18 by Christian Franke, 3 years ago

Ticket #1294 has been marked as a duplicate of this ticket.

comment:19 by sersorrel, 3 years ago

Cc: sersorrel added

comment:20 by Christian Franke, 3 years ago

Tickets #1311 and #1336 have been marked as a duplicate of this ticket.

comment:21 by Thomas Guyot, 3 years ago

I don't know whether it's been fixed in the doc or not, but I seem to recall it missing on some of my Crucial drives, or that changing the definition to a different one caused it to start throwing alerts (there seem to be multiple definitions for drives using the same chipset/SMART specs, at least in the version of drivedb I have; some cleanup would be needed...)

I'll have to look into what causes this error, since I get it on my system. But I just realized that, for some reason I don't recall, I had commented out my own drive's entry, so it used the other definition, which should have the same SMART specs regardless (the regexes match both; I had commented out the one that comes first)

This message was generated by the smartd daemon running on:

   host name:  debian
   DNS domain: local

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

Device info:
CT250MX500SSD1, S/N:XXXXXXXXXXXX, WWN:5-00a075-XXXXXXXXX, FW:M3CR023, 250 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.

If this message comes from smartmontools (and not some Debian log monitoring add-on), then besides correcting drivedb.h we should make sure it doesn't alert if that attribute is the ECC one or the drive isn't in drivedb.h (is that already the case?)

Here's the diff between the two definitions (same drive, I just commented out one regex to match the other definition instead):

 diff -u1 1 2
--- 1	2020-10-03 19:27:52.922133699 -0400
+++ 2	2020-10-03 19:26:30.927444584 -0400
@@ -4,3 +4,3 @@
 === START OF INFORMATION SECTION ===
-Model Family:     Crucial/Micron BX/MX1/2/3/500, M5/600, 1100 SSDs
+Model Family:     Crucial/Micron MX500 SSDs
 Device Model:     CT250MX500SSD1
@@ -17,2 +17,5 @@
 Local Time is:    Sat Oct  3 19:26:30 2020 EDT
+
+==> WARNING: This firmware returns bogus raw values in attribute 197
+
 SMART support is: Available - device has SMART capability.
@@ -73,3 +76,3 @@
 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
-197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
+197 Current_Pend_ECC_Ct     0x0032   100   100   000    Old_age   Always       -       0
 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0

Oddly the warning is on the definition that also has the proper name!

comment:22 by Christian Franke, 3 years ago

This should be only logged if the drive database entry does NOT override the name of attribute 197:

Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

This should be only printed if the drive database entry contains this warning string:

==> WARNING: This firmware returns bogus raw values in attribute 197
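The two rules above can be restated as a toy model in Python. This is only a paraphrase of the rules as written in this comment, not smartd's actual source code. The function names are hypothetical, and passing `None` is an assumption used here to represent an entry that leaves the default name of attribute 197 in place.

```python
from typing import Optional

DEFAULT_197_NAME = "Current_Pending_Sector"
WARNING_TEXT = "This firmware returns bogus raw values in attribute 197"


def logs_pending_sectors(attr_197_name: Optional[str], raw_value: int) -> bool:
    """True if the 'Currently unreadable (pending) sectors' line is expected.

    attr_197_name is the name the matching drive database entry assigns to
    attribute 197, or None if the entry does not override it.
    """
    overridden = attr_197_name is not None and attr_197_name != DEFAULT_197_NAME
    return raw_value > 0 and not overridden


def prints_warning(entry_warning: Optional[str]) -> bool:
    """True if the matching entry carries a (non-empty) warning string."""
    return bool(entry_warning)


# Entry that keeps the default name: a raw value of 1 is logged.
print(logs_pending_sectors(None, 1))
# Entry that renames attribute 197: the same raw value is not logged.
print(logs_pending_sectors("Bogus_Current_Pend_Sect", 1))
# The WARNING line depends only on the entry's warning string.
print(prints_warning(WARNING_TEXT), prints_warning(None))
```

In this model the log message and the WARNING line are independent: one depends on whether the entry renames attribute 197, the other on whether the entry carries a warning string.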

comment:23 by Thomas Guyot, 3 years ago

To be clear, my drive can really match two drive definitions, both describing attributes for the same chipset / based on the same technical document. If I "comment out" the first regex, the 2nd one matches, but I also had an older drivedb.h...

So the first definition (Model Family: Crucial/Micron MX500 SSDs):

  • had Current_Pend_ECC_Ct, now renamed to Bogus_Current_Pend_Sect (I think this is wrong)
  • shows the warning text on attr 197

The 2nd definition (Model Family: Crucial/Micron BX/MX1/2/3/500, M5/600, 11/1300 SSDs):

  • has Current_Pending_Sector
  • doesn't show the warning about attr 197

What I need to test now:

I realize I may not have restarted smartctl after changing the drivedb.h, so if it's loaded only once my test didn't work. I will check if the alert stops now that the name is Bogus_Current_Pend_Sect, and will try again with the older name, which I believe should be used.

Also, unrelated to 197, the name of attr 202 is wrong for both: it reports lifetime used, not lifetime remaining.

Once we get it right, both entries should be merged. To make it clean, this will require an in-depth analysis of the regexes for both and merging them. I'm fluent with regex, so I could submit a PR for review with clear comments on what is being changed/merged and how.

comment:24 by Christian Franke, 3 years ago

The MX500 entry intentionally matches a proper subset of the ...BX/MX1/2/3/500... entry. I will simply remove the MX500 entry (this will remove the WARNING: This firmware...) and change attribute 197 in the remaining entry accordingly (this will suppress the smartd Currently unreadable ... logs).

Thanks for the hint about attribute 202.

comment:25 by Thomas Guyot, 3 years ago

You're right, I overlooked it: the MX500 entry matches only one model, and the 2nd regex is the firmware match? So it applies only to specific firmware versions on top of that.

I'm curious about the origin of that warning. It appears these drives (mine included) are plagued by a pretty insane write amplification "issue" (Crucial wouldn't call it that...). Maybe it's better in later firmware, but one thing is for sure: when the drive writes 10 times as many cells as the host wrote, it's 10 times more likely to be caught with a pending ECC operation when smartd polls it (I see one every week or so on average, and it also varies with the write load on my desktop). Based on current estimates my SSD will last just about 5 years, which I believe is the warranty period of that drive, provided the write amplification doesn't get any worse... so I might be at the higher end of desktop write loads too.
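For a rough idea of the write amplification on the drives in this ticket, the raw values of attributes 247 (Host_Program_Page_Count) and 248 (FTL_Program_Page_Count) from the smartctl outputs in the description and in comment 2 can be combined. The formula used here, (host pages + FTL pages) / host pages, is an assumption about how these two counters relate and is not stated anywhere in this ticket. The function name is made up for the sketch.

```python
def write_amplification(host_pages: int, ftl_pages: int) -> float:
    """Estimate the write amplification factor from attributes 247 and 248.

    Assumed formula: (host pages + FTL pages) / host pages.
    """
    if host_pages <= 0:
        raise ValueError("host_pages must be positive")
    return (host_pages + ftl_pages) / host_pages


# Raw values copied from the attribute tables earlier in this ticket.
drives = {
    "CT2000MX500SSD1 (description)": (56912802, 149379986),
    "CT500MX500SSD1 (comment 2)": (9157853, 25021935),
}

for name, (host, ftl) in drives.items():
    print(f"{name}: estimated WAF {write_amplification(host, ftl):.2f}")
```

Both drives had only about 200 power-on hours when these values were read, so the estimate says little about long-term behaviour.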

I have a BX300 I could monitor too if smartd runs on Windows. I snuck it into a pseudo-desktop-TV-appliance that I pretty much use only as a TV... and Windows can run sshd nowadays :)

comment:26 by Christian Franke, 3 years ago

Status: reopened → accepted

comment:27 by Christian Franke, 3 years ago

Resolution: fixed
Status: accepted → closed

comment:28 by sadifika, 3 years ago

Dear developers,

I see these errors are deemed bogus and will be ignored, but they may actually be a valid warning rather than a bogus one.

I'm quoting Lucretia19 from the tomshardware forum below. He was having issues getting the confirmation email and kindly asked me to post this:

"By logging SMART data at a high rate (every second) using smartctl.exe, I established that the Bogus_Current_Pending_Sectors bug correlates perfectly with the Crucial MX500's excessive write amplification bug. Specifically, Current_Pending_Sectors changes to 1 when the ssd's FTL controller begins writing a multiple of about 37000 NAND pages (37000 NAND pages is approximately 1 GByte) and changes back to 0 when the FTL write burst ends. Although the correlation is perfect, it's unknown which is more closely related to the cause and which is more closely related to the effect. (Crucial presumably knows.) Fortunately, the excessive write amplification can be largely tamed by running ssd selftests nearly nonstop. (I insert a 30 seconds pause between 19.5 minutes selftests as a precaution, just in case the ssd's health depends on occasional FTL write bursts.) My logs show that the FTL write bursts occur only during the pauses between selftests, presumably because an FTL write burst is a lower priority process than a selftest. Selftests appear not to slow the ssd performance, presumably because a selftest is a lower priority process than host reads and writes. The only known downside is that the ssd appears to consume about 1 watt extra while running a selftest. The selftests raise the ssd temperature by a few degrees Celsius and keep the ssd temperature more stable."

His thread about the write amplification issue is at https://forums.tomshardware.com/threads/crucial-mx500-500gb-sata-ssd-remaining-life-decreasing-fast-despite-few-bytes-being-written.3571220/

A lot of other users on other forums are also facing the WAF issue so Current_Pending_Sector may be a useful warning about it.
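The correlation described in the quote could be checked from a per-second log along the following lines. This is only a sketch under stated assumptions: the log is taken to hold, per sample, the raw value of attribute 197 and a cumulative FTL page counter (assumed here to be attribute 248), the burst threshold is an arbitrary illustrative value, and the sample numbers are made up rather than taken from any real drive.

```python
def burst_flags(ftl_pages, threshold=1000):
    """Flag each sampling interval whose FTL page-count increase
    is at least `threshold` pages (threshold is an assumed value)."""
    return [(b - a) >= threshold for a, b in zip(ftl_pages, ftl_pages[1:])]


def agreement(pending, ftl_pages, threshold=1000):
    """Fraction of intervals where 'attribute 197 is nonzero at the end of
    the interval' agrees with 'an FTL write burst happened in the interval'."""
    bursts = burst_flags(ftl_pages, threshold)
    if not bursts:
        raise ValueError("need at least two samples")
    pending_flags = [p > 0 for p in pending[1:]]
    matches = sum(b == p for b, p in zip(bursts, pending_flags))
    return matches / len(bursts)


# Made-up samples, one per second: cumulative FTL pages and attribute 197 raw value.
ftl = [0, 10, 20, 9020, 18020, 18030]
attr197 = [0, 0, 0, 1, 1, 0]

print(agreement(attr197, ftl))
```

An agreement close to 1.0 over a long log would match the "perfect correlation" claim; it would still say nothing about which of the two is cause and which is effect.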

comment:29 by Christian Franke, 3 years ago

Thanks for the info. All we could possibly do is to re-add the MX500 specific entry with an updated (which?) warning and attribute 197 unchanged such that the smartd Current_Pending_Sector warning is no longer suppressed. Please create a new ticket and suggest details. Don't reopen this ticket.

comment:30 by Thomas Guyot, 3 years ago

I have one of these "high WAF" drives, and the warnings currently in place are sufficient: the drive's pct used clearly indicates whether you're going to hit a wall in two years or whether your drive will last for many years to come. Mine, if WAF doesn't get worse, is going to last about 5 years in total, with almost one year in and no particularly intensive write load. It's disappointing, but I can get a better drive by then.

I've written a lot more on my previous Crucial M4 128G, running write-heavy stuff for well over a year, and yet it seems that under the same load as my MX500 500GB it could outlast it! It went to 35 pct used from Dec 2013 to Dec 2019, with my oldest smartd logs showing it was already at 27 pct in July 2015 due to its early life hammering. By comparison the MX500 is at 14 pct since Dec 2019!

Calculating the WAF from attr 247-248 is another way to see something is odd (yet afaik Crucial considers it normal behavior...). Mine is at 7.94 avg since the beginning, going up and down a lot, but maybe more up than down...
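For reference, a rough sketch of that calculation. It assumes attribute 247 counts NAND pages programmed on behalf of the host and attribute 248 counts pages programmed by the FTL in the background, which is how these attributes are commonly interpreted for these drives; the input numbers are made up, chosen only so that the result comes out at 7.94.

```python
def write_amplification(host_pages, ftl_pages):
    """WAF = total NAND pages programmed / pages programmed for the host.

    Assumes host_pages is the raw value of attribute 247 and
    ftl_pages the raw value of attribute 248.
    """
    if host_pages <= 0:
        raise ValueError("host page count must be positive")
    return (host_pages + ftl_pages) / host_pages


# Illustrative raw values, not taken from a real drive.
print(round(write_amplification(1_000_000, 6_940_000), 2))
```

To get the WAF over a recent period instead of the lifetime average, the same formula can be applied to the differences between two readings of each attribute.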

If you wish to give people life expectancy pre-warnings, i.e. "at the current usage rate your SSD will stop working in X years/days...", that might actually help users determine if something is wrong. But sending random warnings about an attribute that goes to 1 as part of normal SSD function (albeit quite a bit more often when said SSD writes GBs in batches in the background) is, I think, not very helpful.
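Such a pre-warning could be based on a simple linear extrapolation of the percentage-used value. The sketch below is an assumption about how one might do it, not something smartd implements; it presumes wear grows at a constant rate, and the example numbers are illustrative.

```python
def projected_total_life_years(pct_used, years_in_service):
    """Linear extrapolation: total lifetime if wear keeps its average rate."""
    if pct_used <= 0 or years_in_service <= 0:
        raise ValueError("pct_used and years_in_service must be positive")
    return years_in_service * 100.0 / pct_used


def remaining_years(pct_used, years_in_service):
    """Years left until pct_used reaches 100 at the same average rate."""
    return projected_total_life_years(pct_used, years_in_service) - years_in_service


# Illustrative: 20 pct used after 1 year of service.
print(projected_total_life_years(20, 1.0), remaining_years(20, 1.0))
```

Because write amplification can drift over time, an estimate over a recent window (the change in pct used over the last few months) may be more telling than the lifetime average.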
