Opened 3 years ago

Last modified 3 years ago

#1443 new enhancement

[smartd] fix for otherwise always aborted selftests

Reported by: Ch.Ris Owned by:
Priority: minor Milestone: undecided
Component: smartd Version:
Keywords: smartd.conf Cc:

Description

Bug:

smartd is often unable to complete scheduled selftests on temporarily idle disks (especially at night, when selftests should run).

This often happens because external disk adapters abort the test with power saving features, but it is also the case when the OS or a utility like hd-idle is configured to issue spin-down commands on idle disks when the disk's own power saving features are not available or accessible.

FAQ item:

https://www.smartmontools.org/wiki/FAQ#Whyarelongself-testskeepgettinginterrupted

Problems:

  • Most often the adapter's power saving features can neither be configured nor disabled.
  • It's not trivial to externally generate artificial disk activity or temporarily disable OS power saving features to allow the smartd scheduled selftests to finish.

Proposed solution:

Make the smartd scheduled selftest work, by letting smartd poll some disk information after triggering the selftest, as long as the selftest is running.

Proof of concept:

A working workaround is to start a test manually and then run "smartctl -a" in a while loop:

while smartctl -a /dev/sdX | grep -q "in progress" ; do sleep 60; done

Change History (24)

comment:1 by Christian Franke, 3 years ago

Keywords: smartd.conf added
Milestone: unscheduled
Priority: major → minor
Type: defect → enhancement

Makes sense. Would require a new smartd.conf directive, an additional internal time interval independent of -i option and separate implementations for ATA/SATA and (possibly later) SCSI/SAS.

Please report details about your proof of concept (platform, adapter, disk, ...).

Please test which of the following smartctl commands are sufficient (or not) to prevent the spin down:

smartctl -d TYPE -n never /dev/sdX
smartctl -d TYPE -i /dev/sdX
smartctl -d TYPE -H /dev/sdX
smartctl -d TYPE -c /dev/sdX
smartctl -d scsi -i /dev/sdX

Make sure to disable device type auto-detection by specifying the appropriate device TYPE (sat, usb*) or scsi in the last testcase. Without -d, additional commands (e.g. SCSI INQUIRY) may be issued and may produce misleading results.

comment:2 by Ch.Ris, 3 years ago

Identified some problems and a solution.

I could test with a directly connected SATA drive under hd-idle power management and one connected to a VIA USB3-to-SATA adapter (all x86_64).

In the end I found that my initial "proof of concept" and all the other smartctl keep-awake command candidates only worked by accident: namely when the selftest was started while the disk was already in standby and there was absolutely no other disk activity (smartctl readouts don't count) during the entire selftest. The power management would then simply not issue a new power saving command during the selftest, assuming the disk to be spun down the whole time.

(Additional confusion was caused by the Seagate Firecuda drive, as it does not always seem to have to spin up when going from standby to active, due to its SSD cache.)

But here are the conclusions about what worked:

  • Only keep-awake commands doing disk writes worked reliably for all setups, e.g. a simple touch $MOUNTPOINT/smartd-selftest-keep-awake.file. (Reads could come from a buffer and not reach the adapter.)
  • The first such keep-awake command MUST be executed before starting the selftest, to ensure the disk is active and the host power management won't spin down the disk and stop the selftest before smartd's first keep-awake.
  • From the given smartctl commands I only saw info about the still running selftest in the output of "smartctl -a" (to properly stop the keep-awakes after the selftest is done). So smartd would probably have to poll the drive for the selftest progress with the shorter interval, and issue a separate keep-awake command as long as the selftest is still running.
  • I guess the implementation could be the same for ATA/SATA/SCSI/SAS etc., but it may need a keep-awake "mountpoint" or "touchpoint" path in smartd.conf, or is that detectable?
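
The conclusions above could be scripted roughly as follows. This is only a sketch under assumptions, not part of smartd: the function names are made up for illustration, the disk is assumed to carry a mounted, writable filesystem, and the "in progress" match is taken from the proof of concept in the ticket description.

```sh
# Succeeds while smartctl -a still reports a selftest in progress.
selftest_running() {
    smartctl -a "$1" | grep -q "in progress"
}

# Write to the filesystem so the host or adapter sees real disk activity.
keep_awake() {
    touch "$1/smartd-selftest-keep-awake.file" && sync
}

# Wake the disk first, start the long selftest, then keep writing until
# the selftest is no longer reported as running.
run_selftest_awake() {
    device=$1 mountpoint=$2 interval=${3:-60}
    keep_awake "$mountpoint" || return 1
    smartctl --test=long "$device" > /dev/null || return 1
    while selftest_running "$device"; do
        keep_awake "$mountpoint"
        sleep "$interval"
    done
    rm -f "$mountpoint/smartd-selftest-keep-awake.file"
}
```

A call could look like: run_selftest_awake /dev/sdX /srv/data 60 (device and mountpoint are placeholders). The interval has to be shorter than the spin-down timer of the adapter or OS.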

=> Is there already a way (command) to start smartd to do a single selftest on demand?
It would be a great advantage to have a reliable way to start selftests that don't abort due to adapter or OS power management.

comment:3 by Alex Samorukov, 3 years ago

I am thinking that maybe it's a good idea to add a flag to the -c command which would exit when the self-test is complete and poll once per interval otherwise. This should fix this problem.

Writing something in smartctl is a very bad idea, as this tool does not work with filesystems and can be used with drives that are not mounted at all.

comment:4 by Alex Samorukov, 3 years ago

Also, tbh, in my experience stopping disks is always an evil. I did a few tests, and disks with hd-idle were dying much faster than drives that were always running. Of course your mileage may vary.

comment:5 by Ch.Ris, 3 years ago

Hello
it seems the captive tests are having their own problems, though:
https://www.smartmontools.org/ticket/303
https://www.smartmontools.org/ticket/1153

With devices that are not mounted the problem may be even more apparent. Maybe there is still a better keep-awake command to solve this.

Without any disk activity the background selftests would get aborted even more regularly when an adapter is used or OS power management is active (common default).

But if the device is not mounted one could, e.g., do a small MBR or GPT flag or label adjustment as keep-awake command. Without any partition table one could use dd.

I think keep-awakes are needed to allow testing of temporarily attached devices and idling backup or spare drives, not so for always-on server or busy system drives.

comment:6 by Alex Samorukov, 3 years ago

I was not referring to the captive test (-C) (which is more or less broken on most Linux systems) but to "-c", which shows self-test progress. Adding a flag to wait until the test is completed and poll once per interval could be an option. As for writing _anything_ to disk - sorry, no. However, smartctl has JSON output, so nothing should stop you from scripting this if needed.

comment:7 by Alex Samorukov, 3 years ago

P.S. I am not sure there is any "one size fits all" solution here at all. As controller power saving logic is very different, one may require some writes, another could be happy with any smart command, etc. So maybe it's best to leave it as is.

comment:8 by Alex Samorukov, 3 years ago

You can do something like

 smartctl /dev/disk0 -c --json |jq .ata_smart_data.self_test.status.value

and do whatever is needed if the value is > 0. This could be added to a cron task, for example. Together with the "-n" switch you can avoid excessive wake-ups.
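
For a cron job, the check above could be wrapped like this. It is a sketch only: the function names are made up for illustration, jq has to be installed, and the "value > 0" rule is taken from this comment as is (a non-zero status may also stand for states other than a running self-test).

```sh
# Print the raw self-test status value from smartctl's JSON output.
selftest_status() {
    smartctl -c --json "$1" | jq '.ata_smart_data.self_test.status.value'
}

# Run the given command only if the status value is greater than 0.
# Usage: keep_awake_if_testing DEVICE COMMAND [ARGS...]
keep_awake_if_testing() {
    device=$1; shift
    status=$(selftest_status "$device")
    case "$status" in
        ''|null|*[!0-9]*) return 1 ;;   # no usable value, e.g. not an ATA device
    esac
    if [ "$status" -gt 0 ]; then
        "$@"
    fi
}
```

Example call with placeholder device and path: keep_awake_if_testing /dev/sdX touch /srv/data/keep-awake.file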

comment:9 by Ch.Ris, 3 years ago

It depends much on the hardware. On many laptops, other low-power devices and enclosures the selftest options are not usable as is.

> As controller power saving logic is very different, one may require some writes, another could be happy with any smart command, etc.


That's a valid point. An option to do minimal meta-data updates should prevent a spin-down for all devices, though. (And also be safe for production-grade filesystems/partition-tables/unallocated-space on devices one wants to run internal tests on anyway.)

A straightforward --keep-awake option and a --keep-awake-command customization option would be a great help to make it work for the affected cases.

comment:10 by Ch.Ris, 3 years ago

Err, maybe block device reading could work as keep-awake with the proper direct/sync/no-buffer I/O control function or option?

Does anyone know about those?
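
One candidate to try is a direct read with GNU dd, which bypasses the OS page cache via iflag=direct. The sketch below is untested with regard to spin-down: the function name is made up for illustration, and the drive or adapter may still answer from its own cache without spinning the disk.

```sh
# Read one 4 KiB block from the raw device, bypassing the OS page cache.
# The optional second argument selects the block number, so that
# successive calls do not have to hit the same sector.
direct_read() {
    dd if="$1" of=/dev/null bs=4096 count=1 skip="${2:-0}" iflag=direct 2>/dev/null
}
```

Example call with a placeholder device: direct_read /dev/sdX 12345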

comment:11 by Ch.Ris, 3 years ago

To ease the tedious keep-awake command testing a bit I was using this script btw:

#!/bin/bash
#
# selftest-keep-awake-test.sh




LOG=selftest-keep-awake-test.log

DEVICES="/dev/sdb /dev/sdc"
DEVICE_MOUNTPOINTS="/srv/data"

# minutes to wait before asking user for disk spinning state evaluation
# set it longer than disk spin down timers
CHECK_TIME=5




# keep-awake commands to test
TYPES="sat"

# not working
TYPED_KEEP_AWAKE_COMMANDS="
smartctl -d \$TYPE -n never \$DEVICE
smartctl -d \$TYPE -i \$DEVICE
smartctl -d \$TYPE -H \$DEVICE
smartctl -d \$TYPE -c \$DEVICE"

###################################################### overriding above (not working) again
TYPED_KEEP_AWAKE_COMMANDS=""

for MOUNTPOINT in $DEVICE_MOUNTPOINTS; do break; done

# not working
SIMPLE_KEEP_AWAKE_COMMANDS="# (noop for comparison)
smartctl -a \$DEVICE > /dev/null
smartctl -d scsi -i \$DEVICE"

###################################################### overriding above (not working) again
SIMPLE_KEEP_AWAKE_COMMANDS="touch \$MOUNTPOINT/selftest-keep-alive-test.file"



function write_to_devices() {
	for MOUNT_POINT in $DEVICE_MOUNTPOINTS
	do
  	  echo -e "\n * Ensure consistent active device state by creating
and removing an empty file in $MOUNT_POINT"
  	  touch $MOUNT_POINT/selftest-keep-alive-test.file
  	  sync
  	  rm $MOUNT_POINT/selftest-keep-alive-test.*
	done
}

function finish() {

  # stop remaining selftests
  for DEVICE in $DEVICES
  do
  	CLEANUP_COMMANDS="smartctl --abort \$DEVICE"
    echo -e "\n * $DEVICE cleanup: $CLEANUP_COMMANDS"
    eval $CLEANUP_COMMANDS
  done

  # Write to the filesystems of the devices,
  # to ensure that host power management will actively spin down the disk again afterwards,
  # i.e. if it had already been switched to standby state before the selftest.
  write_to_devices
  
  exit
}

[ "$1" = "finish" ] && finish





# init log
for DEVICE in $DEVICES
do
  echo "${DEVICE} $(smartctl -i $DEVICE | grep Family) ($(uname -m))" >> $LOG
done


# Access filesystems on the device
# Need to ensure devices are actually active before the keep-awake test.
# Otherwise the power management of the OS or adapter won't spin-down drives anyway...
# * not during selftest
# * and not until after a regular drive access has happened (if ever on a mostly idle backup or spare drive)
# Writing is needed to avoid only reading buffered data without any adapter activity.
write_to_devices
echo -e "\n * Listen: devices should now be spinning idle, unless SSD cached."
beep
sleep 15


beep
echo -e "\n * Starting selftests for keep-alive testing:"


for TYPE in $TYPES
do
  prev_IFS=$IFS
  IFS='
'
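  # Note: "\n" inside double quotes is a literal backslash-n, not a newline, and a
  # quoted list is never split; eval then turned it into the wrongly named "...filen".
  # Join the lists with real newlines and leave the expansion unquoted, so that
  # IFS (newline only at this point) yields one command per iteration.
  for COMMAND in $(printf '%s\n%s\n' "$SIMPLE_KEEP_AWAKE_COMMANDS" "$TYPED_KEEP_AWAKE_COMMANDS")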
  do
    IFS=$prev_IFS
    i=0
    run="true"
    
    for DEVICE in $DEVICES
    do
      INIT_COMMANDS="smartctl --abort \$DEVICE > /dev/null; smartctl --test=long \$DEVICE > /dev/null"
      echo -e "\n-------- Starting new long selftest on $DEVICE.\n"
      eval $INIT_COMMANDS

    done
    
    while [ "$run" = "true" ]
    do
      echo " -- At minute: $i"

            
      for DEVICE in $DEVICES
      do
        echo -e "\n * $DEVICE command: $COMMAND"
        eval $COMMAND
      done
      
      if [ "$i" -lt "$CHECK_TIME" ]
      then
        echo -n "<Press any letter key to stop keep-awake testing.>"
        read -t 60 -n 1 INPUT
        
        if [ "${INPUT:-undefined}" != "undefined" ]
        then
          echo -e '\r  exiting...                                             \n'
          finish
        fi
        echo -e '\r                                                         \n'
        
      else
        beep
        echo -ne "\n -> Enter list of halted devices (from \"$DEVICES\")\n or \"no\" if all are still spinning:"
        read -t 60 INPUT
        DEVICE="DEVICE"
        if [ "${INPUT:-undefined}" != "undefined" ]
        then
          echo "$COMMAND (aborted tests: $INPUT)" >> $LOG
          run="false"
        fi
      fi

      i=$((i+1))
    done
  done
done

finish

comment:12 by Christian Franke, 3 years ago

I'm not willing to add much complexity to smartd for this topic. A possible solution would be: First implement device specific check intervals (#336). Then add an extension to the (currently ATA-only) -l selfteststs directive like -l selfteststs,SECONDS. If specified, the check interval for an individual device is reduced to SECONDS while a self-test is in progress.

Whether this would prevent spin down of a device could easily be tested by running smartd with the -i SECONDS option and starting a self-test.

comment:13 by Ch.Ris, 3 years ago

I like your idea, that syntax also makes it much more consistent.

Could you see an optional addition like -l selftest,SECONDS,KEEP-AWAKE-COMMAND?

Specifying a command to execute could make it work with whatever sync-read, or -touch a particular device may need.

comment:14 by Ch.Ris, 3 years ago

It would even nicely allow for this:

I had wondered about a good way to schedule trim and scrub without colliding with selftests.

The keep-awake-command could be a script that, in addition to keeping the disk awake, starts a timer to start an fstrim after a short selftest and a full scrub after a long selftest, waiting interval+x seconds (and renewing that timer as long as the selftest is still running).
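
A much simpler variant of this idea can be sketched without timers: block until smartctl no longer reports a running selftest, then start the maintenance command. The function names are made up for illustration, and the blocking wait is a swapped-in simplification of the renewed timer described above.

```sh
# Succeeds while smartctl -a still reports a selftest in progress.
selftest_running() {
    smartctl -a "$1" | grep -q "in progress"
}

# Wait until no selftest is running on DEVICE, then run the given command.
# Usage: run_after_selftest DEVICE INTERVAL_SECONDS COMMAND [ARGS...]
run_after_selftest() {
    device=$1 interval=$2
    shift 2
    while selftest_running "$device"; do
        sleep "$interval"
    done
    "$@"
}
```

Example call with placeholder device and mountpoint: run_after_selftest /dev/sdX 60 fstrim /srv/data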

comment:15 by Christian Franke, 3 years ago

Before requesting further complex enhancements, please perform the test suggested in comment 12: Does some small -i SECONDS setting in smartd command line prevent spin down during self-test?

comment:16 by Ch.Ris, 3 years ago

Oh, sorry, only now do I realize you were still wondering whether smartd could work as is, with just a reduced interval, i.e. without calling an explicit keep-awake action in the intervals during the selftests.

My fault, I tested that too and should have elaborated my conclusions.

None of the smartd or smartctl actions wake the disks up or keep them awake here. The selftests were also aborted while running smartd -i 30. I think that is not unfortunate, because otherwise the current smartd would prevent power saving.

The disks I tested here don't need to spin to answer the smartctl (and smartd) selftest queries. I guess that could depend on internal SSD storage. Some older disks may have to spin up for returning the selftest history, for example, but certainly not these.

That's why I had concluded a more specific and effective keep-awake action would be needed anyway, to allow the selftests to complete on modern (non-24x7-server) hardware.

If we can find some direct-sync-io read function that wakes up all drives, that could be nice to use as default and call from smartd.

But lacking that, the least complex solution seems to only execute an externally specified keep-awake command during selftests.

comment:17 by Christian Franke, 3 years ago

Milestone: unscheduled → undecided

Some disks spin up (otherwise -n standby wouldn't exist), others don't. Leaving ticket open as undecided for now.

comment:18 by Ch.Ris, 3 years ago

Ok thanks, I can recommend a dock with a nice new usb backup disk connected, to make it itch. ;-)

comment:19 by Ch.Ris, 3 years ago

For those affected, maybe try risking a device specific workaround:

In my setup, it seems that the "power-on in standby" feature (hdparm's "VERY DANGEROUS" -s option) alters the system's runtime behaviour in such a way that smartd -i 60 interval checks are enough to prevent the selftests from getting aborted, even with only a basic smartd.conf (DEVICESCAN -H -l error -f).

No idea why it seems to work, maybe the kernel prepends a "wake-up" command to the smartd query or something like that.

As it's not a boot device I didn't worry about BIOS support for "power-on in standby" drives.

comment:20 by Christian Franke, 3 years ago

Actually this is "VERY DANGEROUS" unless all systems you want to connect this drive are able to identify and then spin up drives with "power-on in standby" enabled. This is possibly the case for most RAID controller firmware and possibly not for most regular PC BIOS.

comment:21 by Ch.Ris, 3 years ago

Would you have a pointer to why, or to what could happen? I could not find any info. I notice that the (data) drives (1x SATA, 1x USB-to-SATA) now only spin up later during boot (when they get mounted). The BIOS detects them OK and does not need to boot from them.

comment:22 by Christian Franke, 3 years ago

AFAICS from ATA ACS-4, full support of the Power-Up In Standby (PUIS) feature set requires that the BIOS or controller firmware properly handles these conditions:

If the PUIS feature set enabled bit is set (by command or jumper), an IDENTIFY DEVICE command must not spin up the disk. A drive may have only tiny firmware in ROM which later boots the actual firmware from the disk. Therefore the drive may return incomplete IDENTIFY information, indicated by the Response incomplete bit. This information may be mostly empty; in particular it may report no model name and zero drive capacity. Then the BIOS must repeat the IDENTIFY DEVICE command later after spin up.

A drive may not spin up on I/O or other commands. This is indicated by the SET FEATURES subcommand required to spin-up bit. Then the BIOS must issue the SET FEATURES: PUIS feature set device spin-up command to spin up the drive. Otherwise the drive will be virtually bricked until connected to a controller with a PUIS compatible BIOS.

See smartctl --identify=wb ... for the mentioned bits.

comment:23 by Ch.Ris, 3 years ago

Thank you, that was the technical information I was missing.

Seems to confirm my assumption that it's dangerous neither for the drive nor for the data itself. The drive may just not spin up to allow booting with an incompatible BIOS, or not be mountable after a device scan with an old incompatible OS (Linux < 2.6.22).

comment:24 by Christian Franke, 3 years ago

I guess that is correct.
