Opened 7 years ago

Closed 7 years ago

#788 closed task (worksforme)

Add temperature raw value in syslog, only log if normalized "health" value is below 100%

Reported by: thomas303
Owned by: Christian Franke
Priority: minor
Milestone:
Component: all
Version: 6.5
Keywords:
Cc:

Description (last modified by Christian Franke)

Forwarding from https://bugs.launchpad.net/ubuntu/+source/smartmontools/+bug/1653560

syslog entries like

Jan 2 20:22:27 server smartd[876]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 110 to 112

should be less confusing, and by default logging should only take place if there is something worth warning about.

That said, a "health" value below 100% (e.g. 98%) should trigger logging, because the health status as specified by the vendor is then no longer perfect.

And the output could be more verbose and less confusing. I suggest:

Jan 2 20:22:27 server smartd[876]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius: Thermal health changed from 110% (40°C) to 112% (38°C)

Given that normalization is specified by vendors, smartmontools could also take into account that e.g. health below 90% is critical (for WD drives that would be 60°C) and should also be reported as critical (WARNING, etc.).

Change History (3)

comment:1 by Christian Franke, 7 years ago

Description: modified (diff)

comment:2 by Christian Franke, 7 years ago

Owner: set to Christian Franke
Status: new → accepted

comment:3 by Christian Franke, 7 years ago

Resolution: worksforme
Status: accepted → closed

Use the -W DIFF[,INFO[,CRIT]] directive to track temperature (it also works with SCSI/SAS and NVMe devices). Add -I 194 to suppress the above messages.

For example -W 2,50,60 would result in LOG_INFO messages like:

Jan 2 20:22:27 server smartd[876]: Device: /dev/sda [SAT], Temperature changed -2 Celsius to 38 Celsius (Min/Max 30/45)
Jan 2 21:22:27 server smartd[876]: Device: /dev/sda [SAT], Temperature 52 Celsius reached limit of 50 Celsius (Min/Max 30/52)

and LOG_CRIT messages (and warning emails) like:

Jan 2 22:22:27 server smartd[876]: Device: /dev/sda [SAT], Temperature 61 Celsius reached critical limit of 60 Celsius (Min/Max 30/61)
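
To illustrate how the three -W values relate to the messages above, here is a minimal Python sketch. It is not smartd's actual implementation: the function name classify, the order of the checks, the use of >= comparisons, and treating 0 as "disabled" are assumptions chosen to be consistent with the example messages. See man smartd.conf for the authoritative behavior.

```python
from typing import Optional


def classify(temp: int, last_reported: int,
             diff: int, info: int = 0, crit: int = 0) -> Optional[str]:
    """Hypothetical model of a -W DIFF,INFO,CRIT style check.

    Returns a short label for the message that would be logged,
    or None if nothing would be logged. A limit of 0 is treated
    as disabled (assumption).
    """
    if crit and temp >= crit:
        return "LOG_CRIT: reached critical limit"
    if info and temp >= info:
        return "LOG_INFO: reached limit"
    if diff and abs(temp - last_reported) >= diff:
        return "LOG_INFO: changed"
    return None


# Readings taken from the example messages, with -W 2,50,60
print(classify(38, 40, 2, 50, 60))  # LOG_INFO: changed
print(classify(52, 38, 2, 50, 60))  # LOG_INFO: reached limit
print(classify(61, 52, 2, 50, 60))  # LOG_CRIT: reached critical limit
print(classify(39, 38, 2, 50, 60))  # None
```

With these settings a 1-degree change below both limits produces no log entry, which is the noise reduction the ticket asks for.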

Alternatively, add -r 194 to log the raw value along with the normalized value:

Jan 2 20:22:27 server smartd[876]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 110 [Raw 40 (Min/Max 30/45)] to 112 [Raw 38 (Min/Max 30/45)]

See man smartd.conf for details.

Note that the mapping of raw temperature value to the normalized value is vendor and device specific. Various (undocumented) mappings exist in practice:
100-temperature, 150-temperature (above), temperature unchanged, something else.
The normalized value should not be interpreted as a health percentage.
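
As a small illustration of why the normalized value is not a percentage, the following Python sketch checks which of the mappings listed above reproduces the values from the example log line (raw 40 gives normalized 110, raw 38 gives normalized 112). The function names are hypothetical, and the list of mappings is only the subset named in this comment; real drives may use something else.

```python
def candidate_mappings(raw_celsius: int) -> dict:
    """Normalized value under each mapping named in the comment above."""
    return {
        "100 - temperature": 100 - raw_celsius,
        "150 - temperature": 150 - raw_celsius,
        "temperature unchanged": raw_celsius,
    }


def matching_mappings(raw_celsius: int, normalized: int) -> list:
    """Names of the candidate mappings that reproduce the observed pair."""
    return [name
            for name, value in candidate_mappings(raw_celsius).items()
            if value == normalized]


# Values from the example syslog entry for /dev/sda
print(matching_mappings(40, 110))  # ['150 - temperature']
print(matching_mappings(38, 112))  # ['150 - temperature']
```

For this drive the normalized value is simply 150 minus the temperature, so "110" and "112" say nothing about health and merely restate 40 and 38 degrees Celsius.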
