Opened 6 years ago

Last modified 5 years ago

#1017 closed enhancement

make smartd usable as temperature monitor / replace hddtemp — at Initial Version

Reported by: calestyo Owned by:
Priority: major Milestone:
Component: smartd Version: 6.6
Keywords: Cc: Nathan Stratton Treadway

Description

Hi.

It seems to me that hddtemp is more or less dead and unmaintained, and since (AFAIK) its temperature reading is also based on SMART, there is not much sense in having both smartd and hddtemp.

For many of my newer devices like SSDs, hddtemp reports "no temp sensor found" or so, while smartmontools works perfectly on them (and displays the temperature).

smartd already seems to have some limited functionality to monitor a device's temperature, namely via:
-W DIFF[,INFO[,CRIT]]

There are a number of problems with it:

1) Most importantly, it seems that warnings are not re-sent as they occur (but only once a day?).
For example, I have had a line in smartd.conf like:
/dev/disk/by-id/ata-Samsung_SSD_850_PRO_1TB_S252NXAG910017F -d auto -d removable -n standby,4 -a -W 0,50,55 -m root -M exec /usr/share/smartmontools/smartd-runner
For testing purposes I changed that to -W 0,20,25 and got an alert. Changed it back to -W 0,50,55 (which is fine for that device) and restarted... and then I repeated this (i.e. going back to something that should trigger a warning).
However, no further warning.

This behaviour may be reasonable for other smart values, e.g. things like:

  • Wear_Leveling_Count
  • Uncorrectable_Error_Cnt

would typically get only worse and not better again.
And things like:

  • ECC_Error_Rate

may increase pretty fast on some devices (one such value does on Seagate) and it's perfectly fine for them.

But for temperature monitoring it's IMHO bad:
My Samsung SSD, for example, supports up to 70°C, I think.
So I'd like to get a warning at say 50°C ... but not only the first time per day, because the temperature may decrease again then (or I just decrease the IO load on the device)... only to rise again shortly after (which I wouldn't notice anymore, as no further warning is sent).

Especially on mobile devices like laptops, temperatures can easily go up and down quite regularly.
Therefore it makes sense to send temperature warnings every time they occur (i.e. once per check interval).
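
To make the behaviour requested in 1) concrete, here is a minimal Python sketch. It is not smartd's actual code: the names Thresholds, classify and alerts_per_interval are made up for illustration, and treating 0 as "limit disabled" is an assumption of the sketch. It takes one temperature reading per check interval and reports every interval in which the INFO or CRIT limit is reached, rather than only the first one of the day.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Thresholds:
    """Hypothetical container for the INFO and CRIT limits (in °C)."""
    info: int  # assumption: 0 means "disabled"
    crit: int  # assumption: 0 means "disabled"


def classify(temp_c: int, th: Thresholds) -> str:
    """Return "CRIT", "INFO" or "OK" for a single temperature reading."""
    if th.crit and temp_c >= th.crit:
        return "CRIT"
    if th.info and temp_c >= th.info:
        return "INFO"
    return "OK"


def alerts_per_interval(readings: List[int], th: Thresholds) -> List[Tuple[int, str, int]]:
    """Return (interval index, level, temperature) for every check
    interval whose reading reaches a limit. Nothing is suppressed
    because an earlier interval already produced a warning."""
    alerts = []
    for i, temp_c in enumerate(readings):
        level = classify(temp_c, th)
        if level != "OK":
            alerts.append((i, level, temp_c))
    return alerts


if __name__ == "__main__":
    # One reading per check interval; the temperature dips and rises again.
    for alert in alerts_per_interval([45, 52, 48, 51, 56], Thresholds(info=50, crit=55)):
        print(alert)
```

With readings that cross 50°C, fall back and cross it again, both crossings show up, which is exactly the case that currently goes unnoticed.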

2) Devices typically also have a minimum operating temperature
This is typically pretty low, so I'm not sure if it can even be monitored properly (=> do the temp sensors of the disks give reasonable values for such low temps?)... but if it can, it would be nice if smartd would also monitor for a minimum temperature.

3) smartmontools should know the max/min temperatures of the devices
*If* smartd were to become a replacement / alternative to hddtemp, it would of course be nice if it came with a database of max/min temperatures for known devices.
For example, my Samsung SSD (according to Samsung) operates in a range of 0-70°C. My HDDs take much less (~50°C or so? would need to look it up).
So it would be nice, if there'd be a DB, that automatically selects reasonable values, like for the SSD in my case: INFO at 60°C, CRIT at 70°C
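
A rough Python sketch of what such a lookup could look like, covering 2) and 3) together. Everything here is hypothetical: TEMP_DB, TempRange, lookup_range, default_thresholds and below_minimum are illustrative names, not part of smartmontools. The single entry uses the 0-70°C range quoted above, and the 10°C margin between INFO and CRIT is only an assumption chosen to reproduce the 60/70 example.

```python
import re
from typing import NamedTuple, Optional, Tuple


class TempRange(NamedTuple):
    """Hypothetical operating range of a device in °C."""
    min_c: int
    max_c: int


# Hypothetical table keyed by a model-name regex. The only entry uses
# the 0-70°C range mentioned in this ticket for the Samsung SSD.
TEMP_DB = [
    (re.compile(r"Samsung SSD 850 PRO"), TempRange(min_c=0, max_c=70)),
]


def lookup_range(model: str) -> Optional[TempRange]:
    """Return the operating range for a model string, or None if unknown."""
    for pattern, temp_range in TEMP_DB:
        if pattern.search(model):
            return temp_range
    return None


def default_thresholds(temp_range: TempRange, margin: int = 10) -> Tuple[int, int]:
    """Derive (INFO, CRIT): CRIT at the maximum, INFO `margin` degrees below.
    The margin is an assumption of this sketch."""
    return temp_range.max_c - margin, temp_range.max_c


def below_minimum(temp_c: int, temp_range: TempRange) -> bool:
    """True if the reading is under the minimum operating temperature."""
    return temp_c < temp_range.min_c


if __name__ == "__main__":
    rng = lookup_range("Samsung SSD 850 PRO 1TB")
    if rng is not None:
        print(default_thresholds(rng))  # INFO and CRIT derived from the table
        print(below_minimum(-5, rng))
```

Devices without an entry return None, so the limits configured by hand via -W would still apply to them.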

Cheers,
Chris.

