Version 7 (modified by Gabriele Pohl, 3 years ago) (diff)

Repairs at the disk level

Bad block HOWTO for smartmontools

This article describes what actions might be taken when smartmontools detects a bad block on a disk. It demonstrates how to identify the file associated with an unreadable disk sector, and how to force that sector to reallocate.

Introduction

Handling bad blocks is a difficult problem as it often involves decisions about losing information. Modern storage devices tend to handle the simple cases automatically, for example by writing a disk sector that was read with difficulty to another area on the media. Even though such a remapping can be done by a disk drive transparently, there is still a lingering worry about media deterioration and the disk running out of spare sectors to remap.

Can smartmontools help? As the SMART [1] acronym suggests, the smartctl command and the smartd daemon concentrate on monitoring and analysis. So apart from changing some reporting settings, smartmontools will not modify the raw data in a device. Also smartmontools only works with physical devices, it does not know about partitions and file systems. So other tools are needed. The job of smartmontools is to alert the user that something is wrong and user intervention may be required.

When a bad block is reported one approach is to work out the mapping between the logical block address used by a storage device and a file or some other component of a file system using that device. Note that there may not be such a mapping reflecting that a bad block has been found at a location not currently used by the file system. A user may want to do this analysis to localize and minimize the number of replacement files that are retrieved from some backup store. This approach requires knowledge of the file system involved and this document uses the Linux ext2/ext3 and ReiserFS file systems for examples. Also the type of content may come into play. For example if an area storing video has a corrupted sector, it may be easiest to accept that a frame or two might be corrupted and instruct the disk not to retry as that may have the visual effect of causing a momentary blank into a 1 second pause (while the disk retries the faulty sector, often accompanied by a telltale clicking sound).

Another approach is to ignore the upper level consequences (e.g. corrupting a file or worse damage to a file system) and use the facilities offered by a storage device to repair the damage. The SCSI disk command set is used elaborate on this low level approach.

Repairs in a file system

This section contains examples of what to do at the file system level when smartmontools reports a bad block. These examples assume the Linux operating system and either the ext2/ext3 or ReiserFS file system. The various Linux commands shown have man pages and the reader is encouraged to examine these. Of note is the dd command which is often used in repair work [2] and has a unique command line syntax.

The authors would like to thank Sergey Vlasov, Theodore Ts'o, Michael Bendzick, and others for explaining this approach. The authors would like to add text showing how to do this for other file systems, in particular XFS, and JFS: please email if you can provide this information.

ext2/ext3 first example

In this example, the disk is failing self-tests at Logical Block Address LBA = 0x016561e9 = 23421417. The LBA counts sectors in units of 512 bytes, and starts at zero.

root]# smartctl -l selftest /dev/hda:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%       217         0x016561e9

Note that other signs that there is a bad sector on the disk can be found in the non-zero value of the Current_Pending_Sector count:

root]# smartctl -A /dev/hda
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       1

First Step: We need to locate the partition on which this sector of the disk lives:

root]# fdisk -lu /dev/hda
Disk /dev/hda: 123.5 GB, 123522416640 bytes
255 heads, 63 sectors/track, 15017 cylinders, total 241254720 sectors
Units = sectors of 1 * 512 = 512 bytes
   Device Boot    Start       End    Blocks   Id  System
/dev/hda1   *        63   4209029   2104483+  83  Linux
/dev/hda2       4209030   5269319    530145   82  Linux swap
/dev/hda3       5269320 238227884 116479282+  83  Linux
/dev/hda4     238227885 241248104   1510110   83  Linux

The partition /dev/hda3 starts at LBA 5269320 and extends past the problem LBA. The problem LBA is offset 23421417 - 5269320 = 18152097 sectors into the partition /dev/hda3.

To verify the type of the file system and the mount point, look in /etc/fstab:

root]# grep hda3 /etc/fstab
/dev/hda3 /data ext2 defaults 1 2

You can see that this is an ext2 file system, mounted at /data.

Second Step: we need to find the block size of the file system (normally 4096 bytes for ext2):

root]# tune2fs -l /dev/hda3 | grep Block
Block count:              29119820
Block size:               4096

In this case the block size is 4096 bytes. Third Step: we need to determine which File System Block contains this LBA. The formula is:

  b = (int)((L-S)*512/B)
where:
b = File System block number
B = File system block size in bytes
L = LBA of bad sector
S = Starting sector of partition as shown by fdisk -lu
and (int) denotes the integer part.

In our example, L=23421417, S=5269320, and B=4096. Hence the problem LBA is in block number

   b = (int)18152097*512/4096 = (int)2269012.125
so b=2269012.

Note: the fractional part of 0.125 indicates that this problem LBA is actually the second of the eight sectors that make up this file system block.

Fourth Step: we use debugfs to locate the inode stored in this block, and the file that contains that inode:

root]# debugfs
debugfs 1.32 (09-Nov-2002)
debugfs:  open /dev/hda3
debugfs:  testb 2269012
Block 2269012 not in use

If the block is not in use, as in the above example, then you can skip the rest of this step and go ahead to Step Five.

If, on the other hand, the block is in use, we want to identify the file that uses it:

debugfs:  testb 2269012
Block 2269012 marked in use
debugfs:  icheck 2269012
Block   Inode number
2269012 41032
debugfs:  ncheck 41032
Inode   Pathname
41032   /S1/R/H/714197568-714203359/H-R-714202192-16.gwf

In this example, you can see that the problematic file (with the mount point included in the path) is: /data/S1/R/H/714197568-714203359/H-R-714202192-16.gwf

When we are working with an ext3 file system, it may happen that the affected file is the journal itself. Generally, if this is the case, the inode number will be very small. In any case, debugfs will not be able to get the file name:

debugfs:  testb 2269012
Block 2269012 marked in use
debugfs:  icheck 2269012
Block   Inode number
2269012 8
debugfs:  ncheck 8
Inode   Pathname
debugfs:

To get around this situation, we can remove the journal altogether:

tune2fs -O ^has_journal /dev/hda3

and then start again with Step Four: we should see this time that the wrong block is not in use any more. If we removed the journal file, at the end of the whole procedure we should remember to rebuild it:

tune2fs -j /dev/hda3

Fifth Step NOTE: This last step will permanently and irretrievably destroy the contents of the file system block that is damaged: if the block was allocated to a file, some of the data that is in this file is going to be overwritten with zeros. You will not be able to recover that data unless you can replace the file with a fresh or correct version.

To force the disk to reallocate this bad block we'll write zeros to the bad block, and sync the disk:

root]# dd if=/dev/zero of=/dev/hda3 bs=4096 count=1 seek=2269012
root]# sync

Now everything is back to normal: the sector has been reallocated. Compare the output just below to similar output near the top of this article:

root]# smartctl -A /dev/hda
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       1
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       1

Note: for some disks it may be necessary to update the SMART Attribute values by using smartctl -t offline /dev/hda

We have corrected the first errored block. If more than one blocks were errored, we should repeat all the steps for the subsequent ones. After we do that, the disk will pass its self-tests again:

root]# smartctl -t long /dev/hda  [wait until test completes, then]
root]# smartctl -l selftest /dev/hda
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       239         -
# 2  Extended offline    Completed: read failure       90%       217         0x016561e9
# 3  Extended offline    Completed: read failure       90%       212         0x016561e9
# 4  Extended offline    Completed: read failure       90%       181         0x016561e9
# 5  Extended offline    Completed without error       00%        14         -
# 6  Extended offline    Completed without error       00%         4         -

and no longer shows any offline uncorrectable sectors:

root]# smartctl -A /dev/hda
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       1
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0

ext2/ext3 second example

On this drive, the first sign of trouble was this email from smartd:

    To: ballen
    Subject: SMART error (selftest) detected on host: medusa-slave166.medusa.phys.uwm.edu
    This email was generated by the smartd daemon running on host:
    medusa-slave166.medusa.phys.uwm.edu in the domain: master001-nis
    The following warning/error was logged by the smartd daemon:
    Device: /dev/hda, Self-Test Log error count increased from 0 to 1

Running smartctl -a /dev/hda confirmed the problem:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       80%       682         0x021d9f44
Note that the failing LBA reported is 0x021d9f44 (base 16) = 35495748 (base 10)
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       3

and one can see above that there are 3 sectors on the list of pending sectors that the disk can't read but would like to reallocate.

The device also shows errors in the SMART error log:

Error 212 occurred at disk power-on lifetime: 690 hours
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 12 46 9f 1d e2  Error: UNC 18 sectors at LBA = 0x021d9f46 = 35495750
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  25 00 12 46 9f 1d e0 00 2485545.000  READ DMA EXT

Signs of trouble at this LBA may also be found in SYSLOG:

[root]# grep LBA /var/log/messages | awk '{print $12}' | sort | uniq
 LBAsect=35495748
 LBAsect=35495750

So I decide to do a quick check to see how many bad sectors there really are. Using the bash shell I check 70 sectors around the trouble area:

[root]# export i=35495730
[root]# while [ $i -lt 35495800 ]
        > do echo $i
        > dd if=/dev/hda of=/dev/null bs=512 count=1 skip=$i
        > let i+=1
        > done
<SNIP>
35495734
1+0 records in
1+0 records out
35495735
dd: reading `/dev/hda': Input/output error
0+0 records in
0+0 records out
<SNIP>
35495751
dd: reading `/dev/hda': Input/output error
0+0 records in
0+0 records out
35495752
1+0 records in
1+0 records out
<SNIP>

which shows that the seventeen sectors 35495735-35495751 (inclusive) are not readable.

Next, we identify the files at those locations. The partitioning information on this disk is identical to the first example above, and as in that case the problem sectors are on the third partition /dev/hda3. So we have:

     L=35495735 to 35495751
     S=5269320
     B=4096

so that b=3778301 to 3778303 are the three bad blocks in the file system.

[root]# debugfs
debugfs 1.32 (09-Nov-2002)
debugfs:  open /dev/hda3
debugfs:  icheck 3778301
Block   Inode number
3778301 45192
debugfs:  icheck 3778302
Block   Inode number
3778302 45192
debugfs:  icheck 3778303
Block   Inode number
3778303 45192
debugfs:  ncheck 45192
Inode   Pathname
45192   /S1/R/H/714979488-714985279/H-R-714979984-16.gwf
debugfs:  quit

Note that the first few steps of this procedure could also be done with a single command, which is very helpful if there are many bad blocks (thanks to Danie Marais for pointing this out):

debugfs: icheck 3778301 3778302 3778303

And finally, just to confirm that this is really the damaged file:

[root]# md5sum /data/S1/R/H/714979488-714985279/H-R-714979984-16.gwf
md5sum: /data/S1/R/H/714979488-714985279/H-R-714979984-16.gwf: Input/output error

Finally we force the disk to reallocate the three bad blocks:

[root]# dd if=/dev/zero of=/dev/hda3 bs=4096 count=3 seek=3778301
[root]# sync

We could also probably use:

[root]# dd if=/dev/zero of=/dev/hda bs=512 count=17 seek=35495735

At this point we now have:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0

which is encouraging, since the pending sectors count is now zero. Note that the drive reallocation count has not yet increased: the drive may now have confidence in these sectors and have decided not to reallocate them..

A device self test:

  [root#] smartctl -t long /dev/hda
(then wait about an hour) shows no unreadable sectors or errors:
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       692         -
# 2  Extended offline    Completed: read failure       80%       682         0x021d9f44

Unassigned sectors

This section was written by Kay Diederichs. Even though this section assumes Linux and the ext2/ext3 file system, the strategy should be more generally applicable.

I read your badblocks-howto at and greatly benefited from it. One thing that's (maybe) missing is that often the smartctl -t long scan finds a bad sector which is not assigned to any file. In that case it does not help to run debugfs, or rather debugfs reports the fact that no file owns that sector. Furthermore, it is somewhat laborious to come up with the correct numbers for debugfs, and debugfs is slow ...

So what I suggest in the case of presence of Current_Pending_Sector/Offline_Uncorrectable errors is to create a huge file on that file system.

  dd if=/dev/zero of=/some/mount/point bs=4k

creates the file. Leave it running until the partition/file system is full. This will make the disk reallocate those sectors which do not belong to a file. Check the smartctl -a output after that and make sure that the sectors are reallocated. If any remain, use the debugfs method. Of course the usual caveats apply - back it up first, and so on.

ReiserFS example

This section was written by Joachim Jautz with additions from Manfred Schwarb.

The following problems were reported during a scheduled test:

smartd[575]: Device: /dev/hda, starting scheduled Offline Immediate Test.
[... 1 hour later ...]
smartd[575]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
smartd[575]: Device: /dev/hda, 1 Offline uncorrectable sectors

[Step 0] The SMART selftest/error log (see smartctl -l selftest) indicated there was a problem with block address (i.e. the 512 byte sector at) 58656333. The partition table (e.g. see sfdisk -luS /dev/hda or fdisk -ul /dev/hda) indicated that this block was in the /dev/hda3 partition which contained a ReiserFS file system. That partition started at block address 54781650.

While doing the initial analysis it may also be useful to take a copy of the disk attributes returned by smartctl -A /dev/hda. Specifically the values associated with the Reallocated_Sector_Ct and Reallocated_Event_Count attributes (for ATA disks, the grown list (GLIST) length for SCSI disks). If these are incremented at the end of the procedure it indicates that the disk has re-allocated one or more sectors.

[Step 1] Get the file system's block size:

# debugreiserfs /dev/hda3 | grep '^Blocksize'
Blocksize: 4096

[Step 2] Calculate the block number:

# echo "(58656333-54781650)*512/4096" | bc -l
484335.37500000000000000000

It is re-assuring that the calculated 4 KB damaged block address in /dev/hda3 is less than Count of blocks on the device shown in the output of debugreiserfs shown above.

[Step 3] Try to get more info about this block => reading the block fails as expected but at least we see now that it seems to be unused. If we do not get the Cannot read the block error we should check if our calculation in [Step 2] was correct ;)

# debugreiserfs -1 484335 /dev/hda3
debugreiserfs 3.6.19 (2003 http://www.namesys.com)
484335 is free in ondisk bitmap
The problem has occurred looks like a hardware problem.

If you have bad blocks, we advise you to get a new hard drive, because once you get one bad block that the disk drive internals cannot hide from your sight, the chances of getting more are generally said to become much higher (precise statistics are unknown to us), and this disk drive is probably not expensive enough for you to risk your time and data on it. If you don't want to follow that advice then if you have just a few bad blocks, try writing to the bad blocks and see if the drive remaps the bad blocks (that means it takes a block it has in reserve and allocates it for use for of that block number). If it cannot remap the block, use badblock option (-B) with reiserfs utils to handle this block correctly.

bread: Cannot read the block (484335): (Input/output error).
Aborted

So it looks like we have the right (i.e. faulty) block address.

[Step 4] Try then to find the affected file [3]:

tar -cO /mydir | cat >/dev/null

If you do not find any unreadable files, then the block may be free or located in some metadata of the file system.

[Step 5] Try your luck: bang the affected block with badblocks -n (non-destructive read-write mode, do unmount first), if you are very lucky the failure is transient and you can provoke reallocation [4]:

# badblocks -b 4096 -p 3 -s -v -n /dev/hda3 `expr 484335 + 100` `expr 484335 - 100`

[5]

check success with debugreiserfs -1 484335 /dev/hda3. Otherwise:

[Step 6] Perform this step only if Step 5 has failed to fix the problem: overwrite that block to force reallocation:

# dd if=/dev/zero of=/dev/hda3 count=1 bs=4096 seek=484335
1+0 records in
1+0 records out
4096 bytes transferred in 0.007770 seconds (527153 bytes/sec)

[Step 7] If you can't rule out the bad block being in metadata, do a file system check:

reiserfsck --check

This could take a long time so you probably better go for lunch ...

[Step 8] Proceed as stated earlier. For example, sync disk and run a long selftest that should succeed now.

Repairs at the disk level

This section first looks at a damaged partition table. Then it ignores the upper level impact of a bad block and just repairs the underlying sector so that defective sector will not cause problems in the future.

Footnotes

[1] Self-Monitoring, Analysis and Reporting Technology -> SMART

[2] Starting with GNU coreutils release 5.3.0, the dd command in Linux includes the options 'iflag=direct' and 'oflag=direct'. Using these with the dd commands should be helpful, because adding these flags should avoid any interaction with the block buffering IO layer in Linux and permit direct reads/writes from the raw device. Use dd --help to see if your version of dd supports these options. If not, the latest code for dd can be found at https://www.gnu.org/software/coreutils/.

[3] Do not use tar -c -f /dev/null or tar -cO /mydir >/dev/null. GNU tar does not actually read the files if /dev/null is used as archive path or as standard output, see info tar.

[4] Important: set blocksize range is arbitrary, but do not only test a single block, as bad blocks are often social. Not too large as this test probably has not 0% risk.

[5] The rather awkward expr 484335 + 100 (note the back quotes) can be replaced with $((484335+100)) if the bash shell is being used. Similarly the last argument can become $((484335-100)).