Changes between Version 10 and Version 11 of BadBlockHowto

03/25/2017 08:55:20 PM (3 years ago)
Gabriele Pohl



Here is an alternate brute force technique to consider: if the data on the SCSI or ATA disk has all been backed up (e.g. is held on the other disks in a RAID 5 enclosure), then simply reformatting the disk may be the least cumbersome approach.

==== Example ====

Given a ''bad block'', it may still be useful to run the `fdisk` command (if the disk has multiple partitions) to find out which partition is involved, then use `debugfs` (or a similar tool for the file system in question) to find out which file or other part of the file system, if any, may have been damaged. This is discussed in section [#Repairsinafilesystem Repairs in a file system].

Then a program that can execute the SCSI `REASSIGN BLOCKS` command is required. In Linux (2.4 and 2.6 series), FreeBSD, Tru64 (OSF) and Windows the author's `sg_reassign` utility in the `sg3_utils` package can be used. Also found in that package is `sg_verify`, which can be used to check that a block is readable.

Assume that logical block address `1193046` (which is `123456` in hex) is corrupt [#footnote10 [10]] on the disk at `/dev/sdb`. A long self-test started with `smartctl -t long /dev/sdb` may produce log entries like this:

# smartctl -l selftest /dev/sdb
smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is
SMART Self-test log
Num  Test              Status            segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                         number   (hours)
# 1  Background long   Failed in segment      -     354           1193046 [0x3 0x11 0x0]
# 2  Background short  Completed              -     323                 - [-   -    -]
# 3  Background short  Completed              -     194                 - [-   -    -]
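
When several self-tests have been logged, the failing LBA can be pulled out of the log with a small filter. The following is only a sketch: the function name `failed_lbas` is made up for illustration, and it assumes the SCSI log layout shown above, in which the `LBA_first_err` column is the fourth field from the end of each entry.

```shell
# Print the LBA_first_err value of every failed self-test entry on stdin.
# Assumption: entries look like the log above, so the LBA is the fourth
# field from the end (it is followed by the three [SK ASC ASQ] fields).
failed_lbas() {
    awk '/^# *[0-9]+ / && /Failed/ { print $(NF-3) }'
}

# Typical use (needs a real device, so it is left commented out):
#   smartctl -l selftest /dev/sdb | failed_lbas

# Demonstration on two entries copied from the log above:
failed_lbas <<'EOF'
# 1  Background long   Failed in segment      -     354           1193046 [0x3 0x11 0x0]
# 2  Background short  Completed              -     323                 - [-   -    -]
EOF
```

Only entries whose status contains `Failed` are reported; completed tests show `-` in the LBA column and are skipped.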

The `sg_verify` utility can be used to confirm that there is a problem at that address:

# sg_verify --lba=1193046 /dev/sdb
verify (10):  Fixed format, current;  Sense key: Medium Error
 Additional sense: Unrecovered read error
  Info fld=0x123456 [1193046]
  Field replaceable unit code: 228
  Actual retry count: 0x008b
medium or hardware error, reported lba=0x123456
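
The sense data reports the LBA in hex (`0x123456`) while the self-test log reports it in decimal (`1193046`). As a quick cross-check, the shell's `printf` can convert between the two forms. The helper names below are illustrative only.

```shell
# Convert an LBA between decimal and hexadecimal notation.
lba_to_hex() { printf '0x%x\n' "$1"; }
lba_to_dec() { printf '%d\n' "$1"; }

lba_to_hex 1193046    # prints 0x123456
lba_to_dec 0x123456   # prints 1193046
```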

Now the GLIST length is checked before the block reassignment:

# sg_reassign --grown /dev/sdb
>> Elements in grown defect list: 0

And now for the actual reassignment followed by another check of the GLIST length:

# sg_reassign --address=1193046 /dev/sdb
# sg_reassign --grown /dev/sdb
>> Elements in grown defect list: 1
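
If the before and after check is to be scripted, the element count can be extracted from the `sg_reassign --grown` output and compared. This is a minimal sketch that assumes the output line has exactly the form shown above; `glist_len` is a hypothetical helper name, and the demonstration feeds it the two sample lines rather than talking to a disk.

```shell
# Extract the number of elements from "sg_reassign --grown" output on stdin.
# Assumption: the line reads ">> Elements in grown defect list: N".
glist_len() {
    awk -F': *' '/Elements in grown defect list/ { print $2 }'
}

# Typical use (needs a real device, so it is left commented out):
#   before=$(sg_reassign --grown /dev/sdb | glist_len)

# Demonstration using the two sample output lines from above:
before=$(printf '>> Elements in grown defect list: 0\n' | glist_len)
after=$(printf '>> Elements in grown defect list: 1\n' | glist_len)
echo "GLIST grew by $((after - before))"   # prints: GLIST grew by 1
```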

The GLIST length has grown by one as expected. If the disk was unable to recover any data, then the ''new'' block at LBA `0x123456` has vendor-specific data in it. The `sg_reassign` utility can also do bulk reassigns; see `man sg_reassign` for more information.

The `dd` command could be used to read the contents of the ''new'' block:

# dd if=/dev/sdb iflag=direct skip=1193046 of=blk.img bs=512 count=1

and a hex editor [#footnote11 [11]] used to view and potentially change the `blk.img` file. An altered `blk.img` file (or `/dev/zero`) could be written back with:

# dd if=blk.img of=/dev/sdb seek=1193046 oflag=direct bs=512 count=1
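
With `bs=512`, the `skip` and `seek` values are counted in 512-byte blocks, so they correspond directly to the LBA (assuming the disk uses 512-byte logical blocks, as in this example). The sketch below shows the byte-offset arithmetic and rehearses the read and write-back on a scratch file instead of a real disk. The function names and the small block number are hypothetical, and the `iflag=direct` and `oflag=direct` options are left out because a scratch file may live on a file system that does not support direct I/O.

```shell
# Byte offset of a logical block; block size defaults to 512 bytes.
lba_byte_offset() { echo $(( $1 * ${2:-512} )); }

# Rehearse the dd write-back and read on a 200-block scratch image.
# $1 is a block number inside the image (a stand-in for the real LBA).
roundtrip_demo() {
    img=$(mktemp) || return 1   # stands in for /dev/sdb
    blk=$(mktemp) || return 1   # stands in for blk.img
    dd if=/dev/zero of="$img" bs=512 count=200 2>/dev/null
    # Write a recognisable pattern at block $1 without truncating the image.
    printf 'patched' | dd of="$img" bs=512 seek="$1" conv=notrunc 2>/dev/null
    # Read that single block back, as in the dd read command above.
    dd if="$img" of="$blk" bs=512 skip="$1" count=1 2>/dev/null
    head -c 7 "$blk"; echo
    rm -f "$img" "$blk"
}

lba_byte_offset 1193046   # prints 610839552
roundtrip_demo 100        # prints patched
```

Rehearsing on a scratch file first is a cheap way to confirm that `seek` and `skip` are not swapped before writing to a real device.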

More work may be needed at the file system level, especially if the reassigned block held critical file system information such as a superblock or a directory.

Even if a full backup of the disk is available, or the disk has been ''ejected'' from a RAID, it may still be worthwhile to reassign the bad block(s) that caused the problem (or simply format the disk; see `sg_format` in the `sg3_utils` package) and re-use the disk later, much as a replacement disk from a manufacturer might be used.

== Footnotes ==

[=#footnote9 [9]] Often disks inside a hardware RAID have the ARRE and AWRE bits cleared (disabled) so the RAID controller can do things manually or flag the disk for replacement.

[=#footnote10 [10]] In this case the corruption was manufactured by using the SCSI `WRITE LONG` command. See `sg_write_long` in `sg3_utils`.

[=#footnote11 [11]] Most window managers have a handy calculator that will do hex to decimal conversions.