Opened 3 years ago

Last modified 4 days ago

#1245 reopened defect

smartd continues to write identical disk data for a failed drive

Reported by: Ulrich
Owned by:
Priority: minor
Milestone: undecided
Component: smartd
Version:
Keywords: scsi
Cc:

Description

We had a SCSI drive failure in an HP Smart Array (cciss) during a long self-test. The start of the test was logged, and a temperature change was logged too, but from then on each S.M.A.R.T. request failed (because the disk died during the self-test).
Amazingly, smartd continues to write identical SMART values to the CSV file, and there is no indication that the drive is actually dead.
The version of smartmontools being used is 6.6 of SLES 12 SP4.

Attachments (2)

csv.gz (837 bytes) - added by Ulrich 3 years ago.
Compressed extract of the CSV file
smartd.log (160.2 KB) - added by Ulrich 8 days ago.
Output of "smartd -r ioctl,2 -q onecheck"


Change History (11)

comment:1 Changed 3 years ago by Christian Franke

Keywords: scsi added
Milestone: undecided

Please provide the related syslog and CSV outputs of smartd (from around the time of the failure). If the drive is still accessible, please also provide the output of smartd -r ioctl,2 -q onecheck.

Changed 3 years ago by Ulrich

Attachment: csv.gz added

Compressed extract of the CSV file

comment:2 Changed 3 years ago by Ulrich

For the easier part, I'm afraid this is not the information you are after:

"smartd -r ioctl,2 -q onecheck" seems to output an endless number of zeros:

smartd 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-95.32-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org 

Opened configuration file /etc/smartd.conf
Configuration file /etc/smartd.conf parsed.

===== [LUN DATA] DATA START (BASE-16) =====
000-015: 00 00 00 18 00 00 00 00 00 00 00 c0 00 00 00 01
016-031: 00 00 00 c0 00 00 01 01 00 00 00 c0 00 00 fa 01
032-047: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
048-063: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
064-079: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
080-095: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
096-111: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
112-127: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
128-143: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
144-159: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
160-175: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
176-191: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
192-207: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
208-223: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
224-239: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
240-255: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
256-271: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
272-287: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
288-303: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
304-319: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
320-335: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
336-351: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
352-367: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
368-383: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
384-399: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
400-415: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
416-431: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
432-447: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
448-463: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
464-479: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
480-495: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
496-511: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
512-527: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
528-543: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
544-559: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
560-575: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
...
7184-7199: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7200-7215: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7216-7231: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7232-7247: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7248-7263: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7264-7279: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7280-7295: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7296-7311: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7312-7327: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7328-7343: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
7344-7359: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
...

Note on the CSV: The failure occurred on 2019-10-01 between 5 and 6 o'clock. After a reboot of the server on the morning of 2019-10-08, the disk became "alive" again.

Syslog:

2019-10-01T05:48:25.750970+02:00 h06 smartd[3212]: Device: /dev/cciss/c0d0 [cciss_disk_01] [SCSI], self-test in progress
2019-10-01T05:48:25.751243+02:00 h06 smartd[3212]: Device: /dev/cciss/c0d0 [cciss_disk_01] [SCSI], Temperature changed +2 Celsius to 28 Celsius (Min/Max 22/29)
2019-10-01T06:18:25.793639+02:00 h06 smartd[3212]: Device: /dev/cciss/c0d0 [cciss_disk_01] [SCSI], failed to read SMART values
2019-10-01T06:18:25.793958+02:00 h06 smartd[3212]: Device: /dev/cciss/c0d0 [cciss_disk_01] [SCSI], failed to read Temperature
2019-10-01T06:48:26.020312+02:00 h06 smartd[3212]: Device: /dev/cciss/c0d0 [cciss_disk_01] [SCSI], failed to read Temperature

# after reboot

2019-10-08T09:57:26.433922+02:00 h06 smartd[2932]: Device: /dev/cciss/c0d0 [cciss_disk_01] [SCSI], initial Temperature is 24 Celsius (Min/Max 22/29)
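
To pick out the failed reads from a longer syslog, a small filter along these lines might help. It is only a sketch: the regular expression is modelled on the excerpt above and assumes this exact line layout (ISO timestamp, host, smartd[PID], "Device: ..."), and the function name read_failures is made up for illustration.

```python
import re
from datetime import datetime

# Assumption: lines look like the syslog excerpt above. Other syslog
# formats (e.g. classic "Oct  1 06:18:25") need a different pattern.
LINE_RE = re.compile(
    r"^(?P<ts>\S+) \S+ smartd\[\d+\]: "
    r"Device: (?P<dev>\S+)(?: \[[^\]]*\])*, (?P<msg>.*)$"
)


def read_failures(lines):
    """Return (timestamp, device, message) for smartd entries mentioning a failure."""
    failures = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a smartd device message in the assumed layout
        message = match.group("msg")
        if "failed" in message.lower():
            timestamp = datetime.fromisoformat(match.group("ts"))
            failures.append((timestamp, match.group("dev"), message))
    return failures
```

Calling read_failures(open("/var/log/messages")) would then list the entries to compare against the timestamps in the CSV file; the log path is an assumption and depends on the local syslog setup.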

comment:3 Changed 3 years ago by Christian Franke

Milestone: undecided
Resolution: worksforme
Status: new → closed

Root of the problem is unknown. Problem could not be reproduced.

comment:4 Changed 9 days ago by Ulrich

The problem happened again: one of two SSDs on a MegaRAID controller failed after a successful short self-test.
Messages in syslog say "failed to read SMART Attribute Data" and "Read SMART Selftest Log Failed" about every 30 minutes.
Still, the attrlog*ata.csv file is written with new data lines, and from that data everything looks OK.

OS is SLES12 x86_64 SP5 using smartmontools-6.6-6.6.3.x86_64

The SSD had failed before 2023-01-20T3:49:41, but the last entry in the corresponding attrlog file is 2023-01-26T12:49:41 at the time of writing, and no value indicates that there might be any problem. Likewise, the ata.state file looks sound too.
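
One way to spot such a frozen attrlog from the outside is to count how many of the most recent rows repeat the same values. The sketch below assumes each row has the form "timestamp;values..." and compares only the part after the first semicolon. The function name stale_run_length is hypothetical, and the row layout should be checked against the actual file.

```python
def stale_run_length(rows):
    """Count how many trailing rows carry the same values as the last row.

    Assumption: each row is 'timestamp;payload'. Only the payload is
    compared, so rows that differ in timestamp alone count as identical.
    """
    payloads = [row.split(";", 1)[1].strip() for row in rows if ";" in row]
    if not payloads:
        return 0
    last = payloads[-1]
    count = 0
    for payload in reversed(payloads):
        if payload != last:
            break
        count += 1
    return count
```

A long run is only a hint, since a healthy drive can also report unchanged values for a while. It becomes meaningful when syslog shows read failures over the same period.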

comment:5 Changed 9 days ago by Ulrich

Resolution: worksforme
Status: closed → reopened

comment:6 Changed 9 days ago by Ulrich

A side-note: "smartd -r ioctl,2 -q onecheck" complains in the output:
..., please try adding '-d megaraid,N'

But according to the manual page of smartd, "-d" turns on debugging and takes no arguments.
Also, when trying that, I get a syntax error.

comment:7 Changed 8 days ago by Christian Franke

Milestone: undecided

Device types like -d megaraid,N need to be specified in smartd.conf, see man smartd.conf.

comment:8 Changed 8 days ago by Ulrich

/etc/smartd.conf contains basically only two lines, like this:
DEFAULT -d removable -s (...some complex regex...)
DEVICESCAN

So I'm not sure how to apply "-d megaraid,N" as only two SSDs are attached to a "megaraid"; the other disks are SAN SCSI disks. I'm mentioning this, because I can imagine that "-d removable" might have to do with the effect I see.

Changed 8 days ago by Ulrich

Attachment: smartd.log added

Output of "smartd -r ioctl,2 -q onecheck"

comment:9 Changed 4 days ago by Christian Franke

smartd 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-122.144-default] (SUSE RPM)

This version is 5+ years old. If possible, please retry with a newer version.

'Device: /dev/bus/0 [megaraid_disk_00] [SAT], SSDSC2KG240G7R, S/N:PHYM8183018R240AGN, WWN:5-5cd2e4-14f3ad4b0, FW:SCV1DL5C, 240 GB'

-d megaraid,0 is automatically set and therefore not needed in smartd.conf.
