Opened 3 months ago

Last modified 3 months ago

#1218 new defect

Self tests pass under Windows 10 1809 but get stuck at 90% under Linux

Reported by: Artem S. Tashkinov
Owned by:
Priority: minor
Milestone: undecided
Component: smartctl
Version: 7.0
Keywords: ata
Cc:

Description

I've got a strange issue with my SSD: all self-tests pass under Windows 10 1809 but get stuck at 90% under Linux, no matter how long I wait for them to complete. I barely have any disk activity, so there's no reason for the tests to get stuck.

Under both OSes I use the same smartmontools version - 7.0 release.

In the attached log, the 90% "aborted by host" entries are Linux results. 00% completed

Attachments (1)

smartmontools-x.log (13.8 KB) - added by Artem S. Tashkinov 3 months ago.


Change History (8)

Changed 3 months ago by Artem S. Tashkinov

Attachment: smartmontools-x.log added

comment:1 Changed 3 months ago by Artem S. Tashkinov

... 00% "completed without error" are Windows results.

comment:2 Changed 3 months ago by Artem S. Tashkinov

This has been happening for quite some time under Linux kernels 4.18, 4.19, 5.0 and 5.1. Haven't tested Linux 5.2 because it hasn't been made available in Fedora 30 yet. I'm not sure any previous kernels worked at all.

This looks like a weird bug in the Linux kernel SATA layer but I'm an absolute newbie in this area, so it's just a wild guess.

comment:3 Changed 3 months ago by Artem S. Tashkinov

Also, I'm curious why I see so many errors.

comment:4 Changed 3 months ago by Christian Franke

Component: all → smartctl
Keywords: ata added
Milestone: undecided

> ... I barely have any disk activity, so there's no reason for the tests to get stuck.

The missing disk activity may actually be the reason for aborted self-tests. See the FAQ for explanation and possible workaround.

comment:5 Changed 3 months ago by Christian Franke

> Also, I'm curious why I see so many errors.

This is unrelated to the above.

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
...
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   092   092   001    -    4987
...
187 Uncorrectable_Error_Cnt -O--CK   100   100   000    -    5
...
196 Reallocated_Event_Count -O--CK   253   253   001    -    0
198 Offline_Uncorrectable   ----CK   100   100   001    -    0

No reallocated sectors, no sectors pending for reallocation, ...

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
...
Error 5 [0] occurred at disk power-on lifetime: 3665 hours (152 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 40 00 00 00 00 00 14 40 00  Error: UNC at LBA = 0x00000014 = 20

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 01 00 40 00 00 00 00 00 14 40 08     00:02:31.000  READ FPDMA QUEUED
...
Error 4 [3] occurred at disk power-on lifetime: 3525 hours (146 days + 21 hours)
...
  40 -- 51 00 20 00 00 00 00 00 10 40 00  Error: UNC at LBA = 0x00000010 = 16
...
Error 3 [2] occurred at disk power-on lifetime: 3483 hours (145 days + 3 hours)
...
  40 -- 51 00 40 00 00 00 00 00 14 40 00  Error: UNC at LBA = 0x00000014 = 20
...
Error 2 [1] occurred at disk power-on lifetime: 3348 hours (139 days + 12 hours)
...
  40 -- 51 00 40 00 00 00 00 00 14 40 00  Error: UNC at LBA = 0x00000014 = 20
...

... very few transient read errors on LBA 16 and 20 occurred more than 1300 power on hours ago, ...
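The "more than 1300 power on hours ago" figure can be checked from the numbers quoted in this comment. The following is a minimal Python sketch, not part of smartmontools: the function name `hours_since_errors` is hypothetical, and the sample text only repeats the error log header lines and the Power_On_Hours raw value (4987) shown above.

```python
import re

# Header lines of the error log entries, copied from the output quoted above.
SAMPLE_LOG = """\
Error 5 [0] occurred at disk power-on lifetime: 3665 hours (152 days + 17 hours)
Error 4 [3] occurred at disk power-on lifetime: 3525 hours (146 days + 21 hours)
Error 3 [2] occurred at disk power-on lifetime: 3483 hours (145 days + 3 hours)
Error 2 [1] occurred at disk power-on lifetime: 3348 hours (139 days + 12 hours)
"""

ERROR_LINE = re.compile(
    r"^Error (\d+) \[\d+\] occurred at disk power-on lifetime: (\d+) hours"
)

def hours_since_errors(log_text, current_power_on_hours):
    """Map each error number to the power-on hours elapsed since it was logged."""
    ages = {}
    for line in log_text.splitlines():
        m = ERROR_LINE.match(line.strip())
        if m:
            ages[int(m.group(1))] = current_power_on_hours - int(m.group(2))
    return ages

if __name__ == "__main__":
    # 4987 is the raw value of attribute 9 (Power_On_Hours) in the table above.
    for number, age in sorted(hours_since_errors(SAMPLE_LOG, 4987).items()):
        print(f"Error {number}: {age} power-on hours ago")
```

With the values from this ticket, the most recent entry (Error 5) is 4987 - 3665 = 1322 power-on hours old, which matches the statement above.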

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      4987         -

... but a recent extended test succeeded. Conclusion: No sign of trouble. The possible weak sectors have been fixed by new write commands.

PS: This is a bug tracker, not a support forum. For future support questions, please use the smartmontools-support mailing list instead.

comment:6 Changed 3 months ago by Artem S. Tashkinov

> The missing disk activity may actually be the reason for aborted self-tests. See the FAQ for explanation and possible workaround.

This is definitely not the case. It's a system disk, so there's some activity.

What I meant by "next to no activity" is that I'm not copying huge files between partitions or anything like that.

comment:7 Changed 3 months ago by Christian Franke

Check the kernel log for any disk-related messages that occur during the self-test.
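One way to scan a saved kernel log (for example, text captured from `dmesg`) for such messages is sketched below. This is only an illustration: the function name `suspicious_ata_lines`, the keyword list, and the sample lines are assumptions made for the example and do not come from the reporter's system. It relies on libata prefixing its messages with the port name, such as `ata1:` or `ata1.00:`.

```python
import re

# libata messages carry a port prefix such as "ata1: " or "ata1.00: ".
ATA_MSG = re.compile(r"\bata\d+(?:\.\d+)?: (.*)$")

# Hypothetical keyword list; adjust to taste.
SUSPECT = ("reset", "link", "exception", "error", "timeout")

def suspicious_ata_lines(kernel_log_text):
    """Return kernel log lines from ATA ports that mention a suspect keyword."""
    hits = []
    for line in kernel_log_text.splitlines():
        m = ATA_MSG.search(line)
        if m and any(word in m.group(1).lower() for word in SUSPECT):
            hits.append(line.strip())
    return hits

# Illustrative sample only, not taken from the ticket.
SAMPLE_KERNEL_LOG = """\
[ 1234.000000] ata1: hard resetting link
[ 1234.500000] usb 1-1: new high-speed USB device number 2 using xhci_hcd
[ 1235.000000] ata1.00: configured for UDMA/133
"""

if __name__ == "__main__":
    for hit in suspicious_ata_lines(SAMPLE_KERNEL_LOG):
        print(hit)
```

Any hit whose timestamp falls inside the self-test window is worth comparing against the moment the test was aborted.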

Check these counters before and after self-test:

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
...
0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET

If there is any increase, there must have been an event which reset the SATA link or the device. This typically aborts any self-test.
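The before/after comparison can be sketched in Python as follows. This is a minimal sketch under stated assumptions: it parses text in the layout shown above (as printed by `smartctl -l sataphy`), the function names are hypothetical, the "after" sample values are invented for illustration, and the optional `+` after a value is assumed to mark an overflowed counter.

```python
import re

# Matches data rows such as "0x0009  2            2  Transition from ...".
ROW = re.compile(r"^(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\+?\s+(.*)$")

def parse_phy_counters(text):
    """Return {counter_id: (value, description)} from Phy event counter text."""
    counters = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            counters[m.group(1).lower()] = (int(m.group(3)), m.group(4).strip())
    return counters

def increased_counters(before, after):
    """Return {counter_id: (old, new, description)} for counters that grew."""
    changes = {}
    for cid, (new, desc) in after.items():
        old = before.get(cid, (0, desc))[0]
        if new > old:
            changes[cid] = (old, new, desc)
    return changes

# Values before the self-test, as quoted in this ticket.
BEFORE = """\
ID      Size     Value  Description
0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
"""

# Hypothetical values after the self-test, for illustration only.
AFTER = """\
ID      Size     Value  Description
0x0009  2            3  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
"""

if __name__ == "__main__":
    diff = increased_counters(parse_phy_counters(BEFORE), parse_phy_counters(AFTER))
    for cid, (old, new, desc) in diff.items():
        print(f"{cid}: {old} -> {new}  {desc}")
```

An empty result would mean that none of the listed counters grew during the test, so a link or device reset becomes a less likely explanation for the abort.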
