Opened 6 years ago

Closed 6 years ago

#301 closed defect (worksforme)

please improve Selective Self-Tests using smartd/smartctl

Reported by: alzheimer
Owned by: Christian Franke
Priority: minor
Milestone:
Component: smartd
Version: 6.2
Keywords:
Cc:

Description

I love the selective self tests feature (-t select,min-max/next/cont[+SIZE]). The long tests just take too long. However I can't get it to work right with smartd.

My configuration is this:

  • smartmontools version 6.2
  • smartd started with --savestates=/var/lib/smartd/ and verified that state files actually appear in that location
  • smartd.conf with -a -s c/../.././17 for all disks
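For readers unfamiliar with the -s directive: smartd matches its regular expression against strings of the form T/MM/DD/d/HH (test type, month, day of month, day of week with 1 = Monday, hour), and the lowercase type c requests a selective "continue" test. The pattern c/../.././17 therefore asks for one such test every day at 17:00. The following is only a rough Python sketch of that matching. The helper names schedule_string and is_due are hypothetical, Python's re module stands in for the POSIX extended regex that smartd uses, and requiring the whole string to match is an assumption.

```python
import re
from datetime import datetime


def schedule_string(test_type: str, when: datetime) -> str:
    """Build a T/MM/DD/d/HH string for one test type and one point in time."""
    return (
        f"{test_type}/{when.month:02d}/{when.day:02d}/"
        f"{when.isoweekday()}/{when.hour:02d}"
    )


def is_due(pattern: str, test_type: str, when: datetime) -> bool:
    """Return True if the -s pattern selects this test type at this hour.

    Assumption: the pattern has to match the entire string (fullmatch).
    """
    return re.fullmatch(pattern, schedule_string(test_type, when)) is not None


PATTERN = "c/../.././17"  # the pattern from the configuration above

print(is_due(PATTERN, "c", datetime(2014, 3, 15, 17)))  # selective cont at 17:00
print(is_due(PATTERN, "c", datetime(2014, 3, 15, 18)))  # same test, one hour later
print(is_due(PATTERN, "L", datetime(2014, 3, 15, 17)))  # long test at 17:00
```

Only the first call reports a match: the other two differ in the hour and in the test type respectively.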

My issues are these:

  • when I -t select,min-max a disk (to get things started), this info usually does not make it into the state files. And without the state files nothing works as my WDC disks don't seem to keep the data across reboots
  • after adding selective-test-last-start/end manually to the state files, things *seem* to work for a while. But it seems to forget the range size after a couple of days, probably when it reaches the end of the disk?
  • one of the disks is a SSD (Crucial M4 64G), and I did a -t select,0-max on it in an attempt to make the cont check always check the entire thing. What happens instead is that smartd actually checks nothing (0-0) every time and the last start/end vanishes from the state file. (I'm aware I should just make an extra config entry for the SSDs that use long tests as they don't take long on SSDs.)
  • If possible I'd like to be able to specify the SIZE directly in smartd.conf the same way it's possible to do on the command line (-t select,cont+SIZE). Ideally this should also remove the need of running the first select test manually.
  • It would be cool if SIZE could be specified as a percentage, so for example a DEFAULT entry that checks 5% of a disk could be made regardless of disk sizes.

Change History (8)

comment:1 in reply to: description Changed 6 years ago by Christian Franke

Component: changed from all to smartd
Owner: changed from somebody to Christian Franke
Priority: changed from major to minor
Status: changed from new to accepted
Type: changed from enhancement to defect

  • when I -t select,min-max a disk (to get things started), this info usually does not make it into the state files.

It should appear in the state file after smartd has started the first selective test.

  • after adding selective-test-last-start/end manually to the state files, things *seem* to work for a while. But it seems to forget the range size after a couple of days, probably when it reaches the end of the disk?

Could not reproduce this. I use a similar configuration running 8x5 which has worked for 2 years now. It uses two WDC WD2003FYYS disks which do not keep test spans across power cycles.

  • one of the disks is a SSD (Crucial M4 64G), and I did a -t select,0-max on it in an attempt to make the cont check always check the entire thing. What happens instead is that smartd actually checks nothing (0-0) every time and the last start/end vanishes from the state file.

Possibly the next span calculation does not work in this corner case. Or the SSD never keeps the previous span in the selective self-test log page. Please test what happens if the second test is started by "smartctl -t select,cont ..." instead of smartd.

  • If possible I'd like to be able ...
  • It would be cool ...

Both make sense. Please create a separate ticket for these feature requests because these are independent from the possible bugs above.

comment:2 in reply to:  1 Changed 6 years ago by alzheimer

Replying to chrfranke:

it seems to forget the range size after a couple of days, probably when it reaches the end of the disk?

Could not reproduce this.

How does the end-of-disk case work? The states only store the start and end, not the size itself, so maybe there is a problem when the select size is not a divisor of the total size? Or would it store a larger end than max in the state file for the last step in order to remember the size?

I'll do some more testing regarding this and your other suggestions but it takes a bit of time.

  • one of the disks is a SSD (Crucial M4 64G), and I did a -t select,0-max

Please test what happens if the second test is started by "smartctl -t select,cont ..." instead of smartd.

While the test is running the smartctl testlog output looks like this:

 SPAN  MIN_LBA  MAX_LBA    CURRENT_TEST_STATUS
    1        0  125045423  Self_test_in_progress [70% left] (30418936-30484471)

When the test is finished:

 SPAN  MIN_LBA  MAX_LBA    CURRENT_TEST_STATUS
    1        0  125045423  Not_testing

A select,cont issued by smartctl seems to be doing the right thing:

START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION

Continue Selective Self-Test: Start next span
Sending command: "Execute SMART Selective self-test routine immediately in off-line mode".
SPAN  STARTING_LBA  ENDING_LBA
   0             0   125045423

So I don't know why the same does not work for smartd. I'll test what happens when I make it test half of the SSD instead.

Please create a separate ticket for these feature requests

Created a separate ticket as requested in #302

Thank you

comment:3 Changed 6 years ago by alzheimer

Testing the SSD with an odd range (0-124045423 instead of the max 125045423) causes smartd to run the next test as (62522712-125045423), which is exactly the second half of the disk. So it seems smartd does try to be smart about adapting testing ranges; maybe that's why my ranges changed on the HDDs as well?

The next thing I tested was (123045423-124045423). smartd followed that up with (124053000-125045423). Running it manually results in:

Continue Selective Self-Test: Start next span
Span 0 changed from 124045424-125045424 (1000001 sectors)
                 to 124053000-125045423 (992424 sectors) (126 spans)

And when it wraps around it's:

SPAN  STARTING_LBA  ENDING_LBA
   0             0      992423

So it seems to be possible for the size to change on wraparounds when your sector ranges are odd enough...? 992424 seems to fit the disk size perfectly though (992424*126 = 125045424, which is max+1, the total number of sectors). So that seems to make a whole lot of sense, if the cont tests are supposed to auto-adapt odd ranges...

Also testing with other ranges,

Span 0 changed from 125042038-166722714 (41680677 sectors)
                 to 93784068-125045423 (31261356 sectors) (4 spans)

So there can be changes on wraparounds, maybe that's all I was seeing.
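The numbers quoted in this comment are all consistent with one simple rule: advance by the previous span size, wrap to LBA 0 once the start passes the end of the disk, and if the new span would overrun the disk, split the disk into the smallest number of equal spans of at most the old size and test the last of them. Below is a minimal Python sketch of that rule. It is inferred from the output shown here rather than copied from the smartmontools source, and the function name next_span is hypothetical.

```python
def next_span(start: int, end: int, num_sectors: int) -> tuple[int, int]:
    """Guess the span that follows (start, end) on a disk of num_sectors.

    Sketch only: reconstructed from the smartctl/smartd output in this ticket.
    """
    old_size = end - start + 1
    max_lba = num_sectors - 1

    # Continue right after the previous span, wrapping at the end of the disk.
    new_start = end + 1
    if new_start > max_lba:
        new_start = 0
    new_end = new_start + old_size - 1

    if new_end > max_lba:
        # The span would overrun the disk: use ceil(num_sectors / old_size)
        # equally sized spans and schedule the last one.
        spans = -(-num_sectors // old_size)   # ceiling division
        new_size = -(-num_sectors // spans)   # ceiling division
        new_start = num_sectors - new_size
        new_end = max_lba
    return new_start, new_end


NUM_SECTORS = 125045424  # max LBA 125045423 plus one (Crucial M4 64G above)

print(next_span(0, 124045423, NUM_SECTORS))          # second half of the disk
print(next_span(123045423, 124045423, NUM_SECTORS))  # shrinks to 992424 sectors
print(next_span(124053000, 125045423, NUM_SECTORS))  # wraps around to LBA 0
print(next_span(0, 125045423, NUM_SECTORS))          # 0-max stays 0-max
```

Under this rule the span size can only shrink, and only when a span would otherwise cross the end of the disk, which would explain a range size that appears to change after a few days of testing.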

It also does appear in the state file once smartd runs the test, as you said. This was my mistake, I started a selective test and expected smartd to run its own test the next day.

Probably everything works as intended, and I'm just too stupid for this feature ;)

comment:4 Changed 6 years ago by alzheimer

It also does appear in the state file once smartd runs the test, as you said.

As an afterthought, shouldn't it still store the ranges as soon as they appear in the log of the HDD, seeing as how the next test may be days off? Or is this a feature so you can run arbitrary tests without smartctl immediately picking up on them? Like if you do select tests on Sundays, you can manually run any test you like on a Tuesday and thanks to reboot it will never affect the Sunday-test range?

comment:5 Changed 6 years ago by alzheimer

One thing I noticed about the select,0-max corner case, the state file contains only this:

selective-test-last-end = 125045423

There is no entry for selective-test-last-start anymore. Maybe that is why it fails after a reboot?

comment:6 Changed 6 years ago by alzheimer

...and I can't reproduce the SSD 0-max issue anymore either. I don't know what I did wrong before, but it seems to work now. The only other thing I changed is that I updated the firmware of the SSD.

I'm happy now, sorry for the fuss. #302 would still be a great feature, though. Thanks for your help, ~

comment:7 in reply to:  5 Changed 6 years ago by Christian Franke

One thing I noticed about the select,0-max corner case, the state file contains only this:

selective-test-last-end = 125045423

Entries with value 0 are not written. Nonexistent entries are read as 0. There is no distinction between "0" and "unset".
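The rule described here (zero values are skipped on write, missing keys read back as 0) can be illustrated with a short Python sketch. This is not smartd's actual code: the function names dump_state and load_state are hypothetical, and the "key = value" line format is assumed from the single state file line quoted above.

```python
STATE_KEYS = ("selective-test-last-start", "selective-test-last-end")


def dump_state(state: dict[str, int]) -> str:
    """Serialize the state, omitting every entry whose value is 0."""
    lines = [f"{key} = {value}" for key, value in state.items() if value != 0]
    return "".join(line + "\n" for line in lines)


def load_state(text: str, keys=STATE_KEYS) -> dict[str, int]:
    """Parse 'key = value' lines; keys that are absent default to 0."""
    state = {key: 0 for key in keys}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key in state:
            state[key] = int(value)
    return state


# A 0-max span: the start is 0, so only the end is written to the file.
saved = dump_state({
    "selective-test-last-start": 0,
    "selective-test-last-end": 125045423,
})
print(saved, end="")
print(load_state(saved))
```

With this scheme a start of 0 survives the round trip even though its line is missing from the file, so the absent selective-test-last-start entry by itself does not lose information.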

comment:8 Changed 6 years ago by Christian Franke

Resolution: worksforme
Status: changed from accepted to closed

SSD 0-max issue could not be reproduced. See #302 for requested features.
