Opened 6 years ago

Closed 6 years ago

#466 closed defect (invalid)

FreeBSD 9.3: extended self-test log lifetime timestamp does not match reality

Reported by: Jeremy Chadwick Owned by:
Priority: major Milestone:
Component: all Version: 6.3
Keywords: freebsd Cc:

Description

At some point in time (I'm not sure when), it looks like either smartmontools or FreeBSD (I'm not sure which) appears to have begun using the wrong time when doing self-tests.

More precisely: when a self-test is run (ex. smartctl -t short /dev/ada0), the completion time as shown in the SMART self-test log is extremely distant/far off compared to attribute 9 Power_On_Hours.

This problem is reproducible across multiple drive models (Samsung SSD 840 EVO, WD3003FZEX, WD1003FZEX).

I cannot reproduce this problem on Windows XP Pro SP3 (32-bit) using smartmontools 6.3.

I'll provide some hard data, including -r ataioctl,2 output, from multiple drives on a system, and show actual proof of this problem.

I keep going back and forth trying to figure out if this is a smartmontools bug or a FreeBSD CAM/kernel bug (and if it's the latter, it manifests itself in an extremely bizarre way).

Rolling back to smartmontools 6.2 does not fix the problem.

Attachments (7)

ada0.txt (24.0 KB) - added by Jeremy Chadwick 6 years ago.
ada0, Samsung SSD 840 EVO
ada1.txt (24.5 KB) - added by Jeremy Chadwick 6 years ago.
ada1, WDC WD3003FZEX
ada2.txt (24.5 KB) - added by Jeremy Chadwick 6 years ago.
ada2, WDC WD1003FZEX
ada0_test_1.txt (52.2 KB) - added by Jeremy Chadwick 6 years ago.
test phase 1: smartctl -r ataioctl,2 -x /dev/ada0
ada0_test_2.txt (6.4 KB) - added by Jeremy Chadwick 6 years ago.
test phase 2: smartctl -r ataioctl,2 -t short /dev/ada0
ada0_test_3.txt (52.3 KB) - added by Jeremy Chadwick 6 years ago.
test phase 3: smartctl -r ataioctl,2 -x /dev/ada0
ada0_test_3.2.txt (52.3 KB) - added by Jeremy Chadwick 6 years ago.
test phase 3: smartctl -r ataioctl,2 -x /dev/ada0

Download all attachments as: .zip

Change History (16)

Changed 6 years ago by Jeremy Chadwick

Attachment: ada0.txt added

ada0, Samsung SSD 840 EVO

Changed 6 years ago by Jeremy Chadwick

Attachment: ada1.txt added

ada1, WDC WD3003FZEX

Changed 6 years ago by Jeremy Chadwick

Attachment: ada2.txt added

ada2, WDC WD1003FZEX

comment:1 Changed 6 years ago by Jeremy Chadwick

ada0_test_{1,2,3}.txt are results from the following commands:

  • smartctl -r ataioctl,2 -x /dev/ada0
  • smartctl -r ataioctl,2 -t short /dev/ada0
  • smartctl -r ataioctl,2 -x /dev/ada0

Relevant points:

ada0_test_1.txt shows Power_On_Hours at 3278, and the last run test at 3021 hours:

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  9 Power_On_Hours          -O--CK   099   099   000    -    3278

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3021         -

ada0_test_2.txt shows command 0xd0 returning attributes, where attribute 9's data bytes decode into 0x000000000cce which is 3278:

REPORT-IOCTL: Device=/dev/ada0 Command=SMART READ ATTRIBUTE VALUES
 Input:   FR=0xd0, SC=0x01, LL=...., LM=0x4f, LH=0xc2, DEV=...., CMD=0xb0 IN
 [Duration: 0.004s]
 Output: ERR=0x00, SC=0x60, LL=0x00, LM=0x37, LH=0x86, DEV=...., STS=....
REPORT-IOCTL: Device=/dev/ada0 Command=SMART READ ATTRIBUTE VALUES returned 0

===== [SMART READ ATTRIBUTE VALUES] DATA START (BASE-16) =====
000-015: 01 00 05 33 00 64 64 00 00 00 00 00 00 00 09 32 |...3.dd........2|
016-031: 00 63 63 ce 0c 00 00 00 00 00 -- -- -- -- -- -- |.cc........2.cc.|

000 = SMART attribute revision = 01 00 = revision 1
002 = SMART attribute entry #0 = 05 33 00 64 64 00 00 00 00 00 00 00
014 = SMART attribute entry #1 = 09 32 00 63 63 ce 0c 00 00 00 00 00

      - Attribute ID number: 09    : Power_On_Hours
      - Attribute flags    : 32 00 : 0x0032
      - Attribute CURRENT  : 63    : adjusted 099
      - Attribute WORST    : 63    : adjusted 099
      - Attribute data     : ce 0c 00 00 00 00 : 0x000000000cce : 3278
      - reserved           : 00

ada0_test_3.txt shows the resulting extended self-test log entry reporting that the test completed at 3035 power-on hours, which doesn't make any sense: it should be 3278 (the test only took a few seconds):

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3035         -

This issue is 100% reproducible.

Changed 6 years ago by Jeremy Chadwick

Attachment: ada0_test_1.txt added

test phase 1: smartctl -r ataioctl,2 -x /dev/ada0

Changed 6 years ago by Jeremy Chadwick

Attachment: ada0_test_2.txt added

test phase 2: smartctl -r ataioctl,2 -t short /dev/ada0

Changed 6 years ago by Jeremy Chadwick

Attachment: ada0_test_3.txt added

test phase 3: smartctl -r ataioctl,2 -x /dev/ada0

comment:2 Changed 6 years ago by Jeremy Chadwick

Summary: FreeBSD 9.3: self-test log lifetime timestamp does not match realityFreeBSD 9.3: extended self-test log lifetime timestamp does not match reality

comment:3 Changed 6 years ago by Jeremy Chadwick

Summary: FreeBSD 9.3: self-test log lifetime timestamp does not match realityFreeBSD 9.3: extended self-test log lifetime timestamp does not match reality

Changed 6 years ago by Jeremy Chadwick

Attachment: ada0_test_3.2.txt added

test phase 3: smartctl -r ataioctl,2 -x /dev/ada0

comment:4 Changed 6 years ago by Alex Samorukov

Thank you for the report. Could you please also check it with linux livecd to make sure that problem is not related to the drive fw?

comment:5 Changed 6 years ago by Jeremy Chadwick

Couple things:

  1. It's not related to drive firmware because the issue is repeatable on 3 different models of drives: Samsung SSD 840 EVO (SSD), WD3003FZEX (MHDD), and WD1003FZEX (MHDD).
  1. I can't reproduce the problem on Windows (against a Samsung SSD 840 EVO (SSD)) using smartmontools 6.3.
  1. I'm not aware of any bootable Linux live CD/USB stick that has smartmontools 6.3 on it. They all tend to have extremely old smartmontools (5.xx or sometimes 6.1).

I'm really split between thinking this is some weird smartmontools bug (affecting FreeBSD only?) or a weird CAM bug on FreeBSD. I haven't enabled CAM debugging yet (to actually examine what the ATA CDBs are that are getting sent and received between the SATA/AHCI controller and the drives), but I can do that if push comes to shove. I've been avoiding filing a FreeBSD PR (problem report) for the time being.

One thing I will point out is that with the release of smartmontools 6.3, in FreeBSD ports, a significant portion of os_freebsd.cpp (which was being patched) became part of 6.3 natively:

http://www.freshbsd.org/commit/freebsd-ports/r363690

Part of me wonders if somehow (not sure how) some of that got messed up / contains a bug.

comment:6 Changed 6 years ago by Alex Samorukov

Rolling back to smartmontools 6.2 does not fix the problem.

So i don't think that it is related to the 6.3 changes

comment:7 Changed 6 years ago by Alex Samorukov

BTW, list of the LiveCD with 6.3 could be found here:

https://www.smartmontools.org/wiki/LiveCDs

comment:8 Changed 6 years ago by Jeremy Chadwick

I've filed a FreeBSD PR for investigation: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=194522

I'll try out one of the LiveCDs soon, possibly later tonight.

comment:9 Changed 6 years ago by Jeremy Chadwick

Resolution: invalid
Status: newclosed

LiveCD in question did not show any anomalies. Attributes correct before (short) test and after test, i.e. self-test log timestamp matched Power_On_Hours. I took shots of the console with my mobile phone just in case.

Booted into FreeBSD and attempted to reproduce the problem -- lo and behold, it's gone (!!!).

Only thing different about this situation from previous attempts: the system had to be power-cycled for an SSD firmware update.

So my guess is that the AHCI controller, and/or its option ROM, must have some kind of weird/odd bug that I alleviated by power-cycling the system. Soft reboot (ACPI reboot) didn't change anything, only an actual power-cycle; did not test hard reset.

I should note the AHCI controller in question, just in case someone else encounters this someday:

ahci0@pci0:0:31:2:      class=0x010601 card=0xd38015d9 chip=0x29228086 rev=0x02 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller'
    class      = mass storage
    subclass   = SATA
    bar   [10] = type I/O Port, range 32, base 0x1c80, size  8, enabled
    bar   [14] = type I/O Port, range 32, base 0x1c74, size  4, enabled
    bar   [18] = type I/O Port, range 32, base 0x1c78, size  8, enabled
    bar   [1c] = type I/O Port, range 32, base 0x1c70, size  4, enabled
    bar   [20] = type I/O Port, range 32, base 0x18e0, size 32, enabled
    bar   [24] = type Memory, range 32, base 0xde605000, size 2048, enabled
    cap 05[80] = MSI supports 16 messages enabled with 1 message
    cap 01[70] = powerspec 3  supports D0 D3  current D0
    cap 12[a8] = SATA Index-Data Pair
    cap 13[b0] = PCI Advanced Features: FLR TP
Note: See TracTickets for help on using tickets.