Opened 9 years ago

Closed 9 years ago

Last modified 9 years ago

#149 closed defect (fixed)

It's not possible to run self-test on SATA disks with megaraid sas controller (9280)

Reported by: art9
Owned by: Christian Franke
Priority: major
Milestone: Release 5.41
Component: all
Version: 5.40
Keywords: megaraid linux
Cc:

Description

version 5.41, 3256

If I run a self-test on a SATA drive, smartctl reports that testing has begun, but the test never actually starts. Please ignore the "interrupted self-test" status in the logs; that entry dates from a long time ago.

If I run a self-test on a SAS disk, the self-test starts and finishes successfully.

The problem can be:

  1. Incorrect call from smartctl
  2. Driver bug
  3. Firmware bug

If you think the problem lies outside smartmontools, could you write the simplest possible code that only starts a self-test?
I could use it to file a request with the LSI support team. LSI doesn't want to deal directly with third-party software such as smartmontools.

PS. Is it possible to check that these ioctls are correct?

open("/dev/megaraid_sas_ioctl_node", O_RDWR) = 4
ioctl(4, MTRRIOC_SET_ENTRY, 0x7fffe4265050) = 0
ioctl(4, MTRRIOC_SET_ENTRY, 0x7fffe4264e30) = 0
write(1, "/dev/sg0 [megaraid_disk_19] [SAT]: Device open changed type from 'megaraid' to 'sat'\n", 85) = 85
ioctl(4, MTRRIOC_SET_ENTRY, 0x7fffe4261c20) = 0
ioctl(4, MTRRIOC_SET_ENTRY, 0x7fffe4261c30) = 0
write(1, "=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===\n", 57) = 57
write(1, "Sending command: \"Execute SMART Extended self-test routine immediately in off-line mode\".\n", 90) = 90
write(1, "Drive command \"Execute SMART Extended self-test routine immediately in off-line mode\" successful.\n", 98) = 98
write(1, "Testing has begun.\n", 19) = 19
write(1, "Please wait 4 minutes for test to complete.\n", 44) = 44

=====================================================================================================

smartctl -t long -dmegaraid,19 /dev/sg0
smartctl 5.41 2011-02-08 r3256M [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sg0 [megaraid_disk_19] [SAT]: Device open changed type from 'megaraid' to 'sat'

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===

Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 4 minutes for test to complete.
Test will complete after Mon Feb 14 19:58:34 2011

Use smartctl -X to abort test.

smartctl -a -dmegaraid,19 /dev/sg0

[...]

Self-test execution status: ( 32) The self-test routine was interrupted by the host with a hard or soft reset.

[...]

smartctl -l selftest -dmegaraid,19 /dev/sg0
smartctl 5.41 2011-02-08 r3256M [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sg0 [megaraid_disk_19] [SAT]: Device open changed type from 'megaraid' to 'sat'

=== START OF READ SMART DATA SECTION ===

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

=====================================================================================================

Attachments (1)

megaraid.patch (1.3 KB) - added by Christian Franke 9 years ago.
Proposed patch


Change History (11)

comment:1 Changed 9 years ago by Christian Franke

Keywords: megaraid linux added

See also ticket #87

comment:2 Changed 9 years ago by art9

Hmm.. "Rejecting SMART/ATA command to controller". It looks like smartctl does not treat that as an error in non-debug mode. Is that a bug?

If this is a FW/driver issue, could you help me phrase the question to LSI correctly?

smartctl -r ioctl,2 /dev/sg0 -t long -dmegaraid,19

[...]

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===

Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".

REPORT-IOCTL: Device=/dev/sg0 Command=SMART IMMEDIATE OFFLINE InputParameter=2

Input: FR=0xd4, SC=...., LL=0x02, LM=0x4f, LH=0xc2, DEV=...., CMD=0xb0
[ata pass-through(16): 85 06 0c 00 d4 00 00 00 02 00 4f 00 c2 00 b0 00 ]

Rejecting SMART/ATA command to controller
REPORT-IOCTL: Device=/dev/sg0 Command=SMART IMMEDIATE OFFLINE returned 0
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 4 minutes for test to complete.
Test will complete after Tue Feb 15 00:43:59 2011

comment:3 in reply to:  2 Changed 9 years ago by Christian Franke

The megaraid code apparently ignores (and returns success for) all non-data SAT ATA PASS-THROUGH(16) commands. This has been the case since the first commit (r2650). It is probably a bug, because a comment in the code suggests that only the SMART ENABLE command should be ignored.

Could you please test whether it works if the opcode check is disabled by this patch:

diff --git a/os_linux.cpp b/os_linux.cpp
--- a/os_linux.cpp
+++ b/os_linux.cpp
@@ -1049,6 +1049,7 @@ bool linux_megaraid_device::scsi_pass_through(scsi_cmnd_io *iop)
   /* Controller rejects Enable SMART and Test Unit Ready */
   if (iop->cmnd[0] == 0x00)
     return true;
+#if 0
   if (iop->cmnd[0] == 0x85 && iop->cmnd[1] == 0x06) {
     if(report > 0)
       pout("Rejecting SMART/ATA command to controller\n");
@@ -1064,6 +1065,7 @@ bool linux_megaraid_device::scsi_pass_through(scsi_cmnd_io *iop)
     }
     return true;
   }
+#endif

   if (pt_cmd == NULL)
     return false;
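
To illustrate the difference between what the disabled check matches and what the code comment seems to intend, here is a minimal sketch. It assumes the SAT ATA PASS-THROUGH(16) CDB layout (protocol in bits 4..1 of byte 1, FEATURES in byte 4, COMMAND in byte 14). The function names are hypothetical and this is not the content of the attached patch.

```cpp
#include <cstdint>

// Field values for a SAT ATA PASS-THROUGH(16) CDB.
constexpr std::uint8_t kAtaPassThrough16 = 0x85;  // SCSI opcode, CDB[0]
constexpr std::uint8_t kProtocolNonData  = 3;     // CDB[1] bits 4..1
constexpr std::uint8_t kAtaCmdSmart      = 0xb0;  // ATA SMART, CDB[14]
constexpr std::uint8_t kSmartEnableOps   = 0xd8;  // SMART ENABLE OPERATIONS, CDB[4]

// True for a non-data ATA PASS-THROUGH(16) CDB. The check in the diff
// compares cmnd[1] with 0x06, which is protocol 3 with the remaining
// bits of that byte clear; this helper only looks at the protocol field.
bool is_nondata_ata16(const std::uint8_t *cdb)
{
  return cdb[0] == kAtaPassThrough16 &&
         ((cdb[1] >> 1) & 0x0f) == kProtocolNonData;
}

// Hypothetical narrower filter: true only for SMART ENABLE OPERATIONS.
bool is_smart_enable(const std::uint8_t *cdb)
{
  return is_nondata_ata16(cdb) &&
         cdb[14] == kAtaCmdSmart &&
         cdb[4] == kSmartEnableOps;
}
```

With the CDB from comment 2 (85 06 0c 00 d4 ... b0 00), is_nondata_ata16() is true, so the old check swallowed the self-test command, while is_smart_enable() is false because the feature code is 0xd4 rather than 0xd8.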

comment:4 Changed 9 years ago by art9

I love you :) Self-test started.

./smartctl -t long /dev/sg0 -d megaraid,12
smartctl 5.41 2011-02-08 r3256M [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sg0 [megaraid_disk_12] [SAT]: Device open changed type from 'megaraid' to 'sat'

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===

Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 255 minutes for test to complete.
Test will complete after Tue Feb 15 07:15:28 2011

In "-a" mode an error appeared:

./smartctl -a /dev/sg0 -d megaraid,12
smartctl 5.41 2011-02-08 r3256M [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sg0 [megaraid_disk_12] [SAT]: Device open changed type from 'megaraid' to 'sat'

=== START OF INFORMATION SECTION ===

Model Family: Hitachi Ultrastar A7K2000
Device Model: Hitachi HUA722020ALA330
Serial Number: JK1130YAGKDSWT
Firmware Version: JKAOA20N
User Capacity: 2,000,398,934,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Tue Feb 15 03:01:28 2011 NOVT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

Error SMART Status command failed
Please get assistance from http://smartmontools.sourceforge.net/
Register values returned from SMART Status command are:

ERR=...., SC=...., LL=...., LM=...., LH=...., DEV=...., STS=....

SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:

ps. Is smartmontools or FW/driver responsible for this error?

./smartctl -l scterc /dev/sg0 -d megaraid,12
smartctl 5.41 2011-02-08 r3256M [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sg0 [megaraid_disk_12] [SAT]: Device open changed type from 'megaraid' to 'sat'
Error SMART WRITE LOG does not return COUNT and LBA_LOW register
Warning: device does not support SCT (Get) Error Recovery Control command

Changed 9 years ago by Christian Franke

Attachment: megaraid.patch added

Proposed patch

comment:5 in reply to:  4 Changed 9 years ago by Christian Franke

Milestone: Release 5.41
Owner: changed from somebody to Christian Franke
Status: new → accepted

Please test attached patch. Please test also whether SMART ENABLE (smartctl -s on) works or not.

> Error SMART Status command failed
> Please get assistance from http://smartmontools.sourceforge.net/
> Register values returned from SMART Status command are:
>
> ERR=...., SC=...., LL=...., LM=...., LH=...., DEV=...., STS=....
>
> SMART overall-health self-assessment test result: PASSED
> Warning: This result is based on an Attribute check.
>
> ps. Is smartmontools or FW/driver responsible for this error?

FW/driver do not properly return the ATA output registers in SCSI sense data. The patch should handle this.
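
To show why the missing registers matter: the ATA SMART RETURN STATUS command reports its result only through the LBA mid and LBA high output registers (LM and LH in the output above). The following is a minimal sketch of that decoding. The names are hypothetical and this is not the smartmontools code.

```cpp
#include <cstdint>

enum class SmartStatus { Passed, Failed, Unknown };

// Interpret the LBA mid/high registers returned by SMART RETURN STATUS.
// 0x4f/0xc2 means no threshold was exceeded, 0xf4/0x2c means a
// threshold was exceeded. Anything else cannot be interpreted.
SmartStatus decode_smart_status(std::uint8_t lba_mid, std::uint8_t lba_high)
{
  if (lba_mid == 0x4f && lba_high == 0xc2)
    return SmartStatus::Passed;
  if (lba_mid == 0xf4 && lba_high == 0x2c)
    return SmartStatus::Failed;
  return SmartStatus::Unknown;
}
```

When the controller returns no output registers at all, neither pattern can be matched, which is consistent with smartctl falling back to an Attribute check as the warning says.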

comment:6 Changed 9 years ago by art9

Looks like patch is fine. Smart enable/disable looks fine too.

> FW/driver do not properly return the ATA output registers in SCSI sense data. The patch should handle this.

scttemp is available... Could you help me write a correct request to LSI support for this issue? I could also post to the linux-ide/scsi mailing lists for help, but I cannot describe the issue in detail.

comment:7 in reply to:  6 Changed 9 years ago by Christian Franke

> Looks like patch is fine. Smart enable/disable looks fine too.

Thanks for testing. I will commit the patch soon.

> ... Could you help me write a correct request to LSI support for this issue? I could also post to the linux-ide/scsi mailing lists for help, but I cannot describe the issue in detail.

AFAICS the issue can be described as follows:

The ATA PASS-THROUGH(16) implementation in the SAT layer of the megaraid driver or firmware does not return the ATA output registers when they are requested. This violates the SAT standard.

Expected: If CK_COND (bit 5 of CDB[2]) is set, ATA PASS-THROUGH(16) (CDB[0] = 0x85) shall return a CHECK CONDITION even if the ATA command completed successfully, and return the ATA output registers in the sense data using ATA return descriptor format (descriptor code 0x09).

Observed: If CK_COND is set and the ATA command completed successfully, ATA PASS-THROUGH(16) does not return a CHECK CONDITION or the sense data does not contain an ATA return descriptor.

See also the smartmontools SAT implementation for further info.
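
As a small test aid for such a report, the expected behaviour can be checked with a sketch like the following. It assumes descriptor-format sense data (response code 0x72 or 0x73, descriptors starting at byte 8) and the 14-byte ATA return descriptor layout from SAT. The struct and function names are hypothetical and this is not the smartmontools implementation.

```cpp
#include <cstddef>
#include <cstdint>

// ATA output registers carried by the ATA return descriptor (low bytes only).
struct AtaOutRegs {
  std::uint8_t error, count, lba_low, lba_mid, lba_high, device, status;
};

// Search descriptor-format sense data for an ATA return descriptor
// (descriptor code 0x09). Returns false if the sense data is in another
// format, is truncated, or contains no such descriptor.
bool find_ata_return_descriptor(const std::uint8_t *sense, std::size_t len,
                                AtaOutRegs &out)
{
  if (len < 8)
    return false;
  const std::uint8_t code = sense[0] & 0x7f;
  if (code != 0x72 && code != 0x73)
    return false;

  std::size_t end = 8 + static_cast<std::size_t>(sense[7]);  // additional sense length
  if (end > len)
    end = len;

  for (std::size_t pos = 8; pos + 2 <= end; ) {
    const std::size_t dlen = 2 + static_cast<std::size_t>(sense[pos + 1]);
    if (pos + dlen > end)
      break;  // truncated descriptor
    if (sense[pos] == 0x09 && dlen >= 14) {
      const std::uint8_t *d = sense + pos;
      out = AtaOutRegs{d[3], d[5], d[7], d[9], d[11], d[12], d[13]};
      return true;
    }
    pos += dlen;
  }
  return false;
}
```

On the affected controller, a command sent with CK_COND set would be expected to make this function return true. The observed behaviour corresponds to it returning false, or to no CHECK CONDITION being reported at all.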

comment:8 Changed 9 years ago by art9

From the description, it's a firmware issue ... the driver tends to
return exactly what the firmware tells it. There's definitely no
special processing for ATA_12 or ATA_16 commands, they're treated as
normal SCSI ones, so if there's no SAM_STAT_CHECK_CONDITION, the
firmware isn't sending one.

James

comment:9 Changed 9 years ago by Christian Franke

Resolution: fixed
Status: accepted → closed

comment:10 Changed 9 years ago by Christian Franke

Please test current SVN if possible (rerun ./autogen.sh && ./configure).
