Opened 9 years ago

Closed 8 years ago

#214 closed defect (fixed)

Freeze with Intel 320: exception Emask 0x2 SAct 0x0 SErr 0x3000400 action 0x6 frozen

Reported by: andrew-stewart
Owned by: Christian Franke
Priority: major
Milestone: Release 6.0
Component: all
Version: 5.42
Keywords: firmwarebug
Cc:

Description

This issue is similar to ticket #137, but we are already running the latest firmware version for the Intel 320 Series, 4PC10362. We just upgraded to smartctl 5.42; I cannot reproduce the issue with smartctl 5.40. This seems likely to be a firmware issue, but I need to know which change in smartctl is triggering it now, so that I can open a ticket with Intel.

Also, UDMA_CRC_Error_Count sometimes increments during the test.

This is the kernel error:

ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
ata5.00: cmd 61/40:00:2d:c1:03/00:00:00:00:00/40 tag 0 ncq 32768 out
         res 40/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)

ata5.00: status: { DRDY }
ata5: hard resetting link
ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata5.00: configured for UDMA/133
ata5: EH complete
SCSI device sdb: 156301488 512-byte hdwr sectors (80026 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata5.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:04:de:e3:50/00:00:09:00:00/40 Emask 0x4 (timeout)

ata5.00: status: { DRDY }
ata5: hard resetting link
ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata5.00: configured for UDMA/133
sd 4:0:0:0: timing out command, waited 60s
ata5: EH complete
sd 4:0:0:0: timing out command, waited 60s
SCSI device sdb: 156301488 512-byte hdwr sectors (80026 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
ata5.00: cmd 61/30:00:0e:e4:af/00:00:01:00:00/40 tag 0 ncq 24576 out
         res 40/00:0c:5e:70:ac/00:00:06:00:00/40 Emask 0x4 (timeout)

ata5.00: status: { DRDY }
ata5: hard resetting link
ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata5.00: configured for UDMA/133
ata5: EH complete
SCSI device sdb: 156301488 512-byte hdwr sectors (80026 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata6.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)

ata6.00: status: { DRDY }
ata6: hard resetting link
ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata6.00: configured for UDMA/133
sd 5:0:0:0: timing out command, waited 60s
ata6: EH complete
SCSI device sdc: 156301488 512-byte hdwr sectors (80026 MB)
sdc: Write Protect is off
sdc: Mode Sense: 00 3a 00 00
SCSI device sdc: drive cache: write back
sd 5:0:0:0: timing out command, waited 60s
ata6: EH complete
SCSI device sdc: 156301488 512-byte hdwr sectors (80026 MB)
sdc: Write Protect is off
sdc: Mode Sense: 00 3a 00 00
SCSI device sdc: drive cache: write back
sd 5:0:0:0: timing out command, waited 60s
ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
ata5.00: cmd 61/08:00:f6:1d:b0/00:00:01:00:00/40 tag 0 ncq 4096 out
         res 40/00:0c:9e:04:ae/00:00:06:00:00/40 Emask 0x4 (timeout)

ata5.00: status: { DRDY }
ata5: hard resetting link
ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata5.00: configured for UDMA/133
ata5: EH complete
SCSI device sdb: 156301488 512-byte hdwr sectors (80026 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
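
To compare runs, the libata timeouts in a log like the one above can be tallied per port. This is only a sketch that assumes the log format shown in this ticket; the `count_freezes` helper name is hypothetical and not part of smartmontools.

```shell
#!/bin/bash
# Count "exception ... frozen" lines per ATA port in kernel log text on stdin.
# An optional "[timestamp] " prefix, as printed by newer kernels, is dropped.
count_freezes() {
    awk '/exception .* frozen$/ {
             port = $0
             sub(/^\[[^]]*\] */, "", port)   # strip "[44167.071327] " if present
             sub(/:.*/, "", port)            # keep only "ataN.NN"
             n[port]++
         }
         END { for (p in n) print p, n[p] }' | sort
}

# Example input: lines copied from the kernel log in this ticket.
# On a live system the input would typically come from dmesg.
count_freezes <<'EOF'
ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
ata5.00: cmd 61/40:00:2d:c1:03/00:00:00:00:00/40 tag 0 ncq 32768 out
ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
EOF
```

Running `dmesg | count_freezes` before and after a test loop gives a rough per-port count of new freezes.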

I can reproduce this much faster by running the following script:

#!/bin/bash

# Query both drives in a tight loop and print a timestamp around each pass.
while true; do
    date
    time ./smartctl -a -d ata /dev/sdb > /dev/null
    time ./smartctl -a -d ata /dev/sdc > /dev/null
    date
    echo ""
done

./smartctl -a -d ata /dev/sdb
smartctl 5.42 2011-10-20 r3458 [i686-linux-2.6.18.solos38] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===

Model Family: Intel 320 Series SSDs
Device Model: INTEL SSDSA2CW080G3
Serial Number: CVPR112401SR080BGN
LU WWN Device Id: 5 001517 959516ee6
Firmware Version: 4PC10362
User Capacity: 80,026,361,856 bytes [80.0 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Thu Feb 16 11:23:03 2012 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:        ( 0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                   ( 1) seconds.
Offline data collection
capabilities:                    (0x75) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:          ( 1) minutes.
Extended self-test routine
recommended polling time:          ( 1) minutes.
Conveyance self-test routine
recommended polling time:          ( 1) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0020   100   100   000    Old_age   Offline      -       0
  4 Start_Stop_Count        0x0030   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2662
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       504
170 Reserve_Block_Count     0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0030   100   100   000    Old_age   Offline      -       0
184 End-to-End_Error        0x0032   100   100   090    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       499
199 UDMA_CRC_Error_Count    0x0030   100   100   000    Old_age   Offline      -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       16940
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       257
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       59
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       159763
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       16940
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       24461

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Vendor (0x45)       Completed without error       00%      2662         -

Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

./smartctl -a -d ata /dev/sdc
smartctl 5.42 2011-10-20 r3458 [i686-linux-2.6.18.solos38] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===

Model Family: Intel 320 Series SSDs
Device Model: INTEL SSDSA2CW080G3
Serial Number: CVPR112407BG080BGN
LU WWN Device Id: 5 001517 959518c16
Firmware Version: 4PC10362
User Capacity: 80,026,361,856 bytes [80.0 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Thu Feb 16 11:25:21 2012 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:        ( 0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                   ( 1) seconds.
Offline data collection
capabilities:                    (0x75) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:          ( 1) minutes.
Extended self-test routine
recommended polling time:          ( 1) minutes.
Conveyance self-test routine
recommended polling time:          ( 1) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0020   100   100   000    Old_age   Offline      -       0
  4 Start_Stop_Count        0x0030   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2665
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       505
170 Reserve_Block_Count     0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0030   100   100   000    Old_age   Offline      -       0
184 End-to-End_Error        0x0032   100   100   090    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       500
199 UDMA_CRC_Error_Count    0x0030   100   100   000    Old_age   Offline      -       2
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       36272
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       429
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       15
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       159917
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       36272
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       6808

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Vendor (0x4d)       Completed without error       00%      2665         -

Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Change History (18)

comment:1 Changed 9 years ago by Christian Franke

Keywords: linux added; Freeze removed
Milestone: Release 5.43
Summary: Freeze while running smartctl: exception Emask 0x2 SAct 0x0 SErr 0x3000400 action 0x6 frozen → Freeze with Intel 320: exception Emask 0x2 SAct 0x0 SErr 0x3000400 action 0x6 frozen

AFAICS there are no related changes between the final versions of smartctl 5.40 (r3189) and 5.42 (r3458). There were changes in 5.40 pre-releases which affected the number and order of SMART commands issued.

The root of the problem from ticket #137 was a (confirmed) firmware bug: Freezes occurred on command SMART READ LOG for self-test log.

Option "-a" is equivalent to:

# smartctl -i -H -c -A -l error -l selftest -l selective /dev/sdX

Which subset of these options can be used to reproduce the problem?

# smartctl -i /dev/sdX  (IDENTIFY only)
# smartctl -c /dev/sdX  (..., SMART READ DATA)
# smartctl -A /dev/sdX  (..., ..., SMART READ THRESHOLDS)
# smartctl -H /dev/sdX  (..., ..., ..., SMART RETURN STATUS)
# smartctl -l selftest /dev/sdX  (IDENTIFY, SMART READ DATA, SMART READ LOG)
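
One way to work through this list is to run each option in its own loop and check the kernel log in between. The sketch below is only an illustration: the `try_options` helper, the `SMARTCTL` variable and the default iteration count are assumptions, while the option list is the one given above.

```shell
#!/bin/bash
# Path to the smartctl binary under test; override to compare builds.
SMARTCTL=${SMARTCTL:-smartctl}

# Run smartctl with each option separately, ITERATIONS times each, so that a
# freeze in the kernel log can be attributed to a single option.
# Usage: try_options DEVICE [ITERATIONS]
try_options() {
    dev=$1
    count=${2:-100}
    for opts in "-i" "-c" "-A" "-H" "-l selftest"; do
        echo "== testing: $opts =="
        i=0
        while [ "$i" -lt "$count" ]; do
            # $opts is left unquoted on purpose so "-l selftest" splits in two.
            "$SMARTCTL" $opts "$dev" > /dev/null
            i=$((i + 1))
        done
    done
}
```

For example, `SMARTCTL=./smartctl try_options /dev/sdX 1000` runs each option 1000 times; checking `dmesg` after each banner shows which option preceded a freeze.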

comment:2 Changed 9 years ago by andrew-stewart

I tried all the options individually, and only "-l selftest" caused problems.

comment:3 Changed 9 years ago by Christian Franke

Milestone: Release 5.43

Same behavior as with X25M firmware 2CV102HD (ticket #137). This was fixed in 2CV102M3. Please report this to Intel support.

comment:4 Changed 9 years ago by andrew-stewart

Yes, I will report this to Intel. Any idea why we are only seeing this with 5.42 and not with 5.40?

comment:5 Changed 9 years ago by Christian Franke

No - for 5.40 final (r3189). Note that some distributions (e.g. Debian, Ubuntu) provided 5.40 prereleases which may have used the old ATA command sequence from 5.39.

comment:6 Changed 9 years ago by andrew-stewart

We compiled from src downloaded from:

http://iweb.dl.sourceforge.net/project/smartmontools/smartmontools/5.40/smartmontools-5.40.tar.gz

I will recompile and retest with 5.40, but so far I cannot reproduce with this version.

comment:7 Changed 9 years ago by andrew-stewart

I recompiled 5.40 and was still not able to reproduce the issue, so I downloaded the 5.41 source and tested it. I can reproduce the issue with 5.41.

I did some debugging to find the difference in the SMART commands used and found the following:

5.40:
ATA_IDENTIFY_DEVICE
ATA_SMART_READ_VALUES
ATA_SMART_READ_LOG_SECTOR

whereas 5.41:
ATA_IDENTIFY_DEVICE
ATA_SMART_READ_VALUES
ATA_SMART_READ_LOG_SECTOR
ATA_SMART_READ_LOG_SECTOR
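
The difference between the two traces can also be extracted mechanically. This is a minimal sketch that assumes each release's command trace has been saved one command per line; the `extra_commands` helper and the temporary files are hypothetical and not part of smartmontools.

```shell
#!/bin/bash
# Print the commands that the newer trace issues in addition to the older one.
# Both arguments are files holding one ATA command name per line, in issue order.
extra_commands() {
    diff "$1" "$2" | sed -n 's/^> //p'
}

# Traces copied from the debugging output above.
old_trace=$(mktemp)
new_trace=$(mktemp)
printf '%s\n' ATA_IDENTIFY_DEVICE ATA_SMART_READ_VALUES \
    ATA_SMART_READ_LOG_SECTOR > "$old_trace"
printf '%s\n' ATA_IDENTIFY_DEVICE ATA_SMART_READ_VALUES \
    ATA_SMART_READ_LOG_SECTOR ATA_SMART_READ_LOG_SECTOR > "$new_trace"

extra_commands "$old_trace" "$new_trace"
rm -f "$old_trace" "$new_trace"
```

For these two traces the only line printed is the additional ATA_SMART_READ_LOG_SECTOR issued by 5.41.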

comment:8 Changed 9 years ago by andrew-stewart

It looks like the extra ATA_SMART_READ_LOG_SECTOR is from ataReadLogDirectory, because need_smart_logdir is now true starting in 5.41

comment:9 Changed 9 years ago by andrew-stewart

I should point out that the hard drive was not in the database in 5.40, but it is in the database for 5.41.

comment:10 Changed 9 years ago by Christian Franke

Thanks for testing this. Sorry, I missed the extra READ LOG SECTOR command. The database entry only affects attribute names and print format (You could use current database with smartctl 5.40).

The first READ LOG SECTOR of Log Directory is done intentionally, see ticket #89 for details. This specific log address (0x00) is probably the root of the problem. Then the issue should also be reproducible with:

smartctl -l directory ... 

ATA commands from 5.42: IDENTIFY, SMART READ LOG (LBA LOW=0x00). Testcase should also "work" with 5.40 then.

comment:11 Changed 9 years ago by andrew-stewart

Yes, smartctl -l directory with smartctl 5.40 also reproduces the issue, and if I take out the directory read in 5.41 the issue goes away, so it looks like a firmware issue when reading the log directory.

comment:12 Changed 9 years ago by Christian Franke

This thread on the smartmontools-support ML contains a report about a 120GB Intel 320 with the same FW 4PC10362. Reading the log directory and device statistics works there.

The problem with your device might be controller/driver specific. Could you possibly test on another machine and/or with another OS version?

comment:13 Changed 9 years ago by andrew-stewart

I have retested on a different chipset and Linux kernel and can confirm that it passes. I will see if I can narrow the problem down more.

comment:14 Changed 9 years ago by andrew-stewart

I take that back. After running for 12 hours in a tight loop, I got the following error (not the same, but possibly related) on a different chipset (motherboard Dell 0Y2MRG) and kernel version (3.1.0-7.fc16.x86_6):

[44167.071327] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[44167.071341] ata1.00: failed command: FLUSH CACHE
[44167.071355] ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[44167.071356] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[44167.071378] ata1.00: status: { DRDY }
[44167.071393] ata1: hard resetting link
[44167.376169] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[44167.376889] ata1.00: configured for UDMA/133
[44167.376894] ata1.00: retrying FLUSH 0xe7 Emask 0x4
[44167.376961] ata1.00: device reported invalid CHS sector 0
[44167.376967] ata1: EH complete

comment:15 Changed 9 years ago by andrew-stewart

The system on which I can reproduce it much more quickly is:

Intel S5520HC, version: E26045-454

Kernel: kernel-2.6.18-274.17.1

comment:16 Changed 8 years ago by Christian Franke

Keywords: firmwarebug added; linux removed
Milestone: Release 5.44
Owner: changed from somebody to Christian Franke
Status: new → accepted

comment:17 Changed 8 years ago by Christian Franke

Please update to r3591 and try '-F nologdir' option. This should prevent the freeze.

Option will be added to drive database entry for affected Intel SSDs in the near future. Ticket will be closed then. Will take some time because this breaks backward compatibility and therefore drivedb.h branches are required for smartmontools <= 5.43.

A local /etc/smart_drivedb.h could be used until then, see -B option on smartctl man page.
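
Until the database entry is in place, a wrapper script could add the option for affected drives only. The following is a sketch under assumptions: `extra_flags_for_model` is a hypothetical helper, and the model pattern simply matches the drive from this ticket (INTEL SSDSA2CW080G3) rather than an authoritative list of affected devices. The `-F nologdir` option itself requires r3591 or later, as noted above.

```shell
#!/bin/bash
# Print the extra smartctl options to use for a given "Device Model" string.
extra_flags_for_model() {
    case $1 in
        # Assumption: pattern derived from the single model in this ticket.
        "INTEL SSDSA2CW"*) echo "-F nologdir" ;;
        *) echo "" ;;
    esac
}

extra_flags_for_model "INTEL SSDSA2CW080G3"
```

A caller could then run something like `smartctl $(extra_flags_for_model "$model") -l selftest /dev/sdX`, where `$model` holds the model string reported by `smartctl -i`.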

comment:18 Changed 8 years ago by Christian Franke

Resolution: fixed
Status: accepted → closed