Opened 7 years ago

Closed 3 years ago

Last modified 3 years ago

#800 closed enhancement (wontfix)

"can't get bus number" issue with MegaRAID on ESXi

Reported by: Simone Giordano
Owned by:
Priority: major
Milestone:
Component: all
Version: 6.5
Keywords: megaraid esxi linux
Cc: Bruno da Costa, Deepali

Description

There is an issue using smartctl on ESXi to monitor disks behind the RAID.
Example:

smartctl -a /dev/disks/naa.6c81f660d2aeab001fd4153f9ba416c5 -d sat+megaraid,12

Smartctl open device: /dev/disks/naa.6c81f660d2aeab001fd4153f9ba416c5 [megaraid_disk_12] [SAT] failed: can't get bus number

I've compiled a static version of smartctl from updated sources (6.6 r4384) and the issue still exists. Because ESXi differs from a normal Linux distribution, I tried patching os_linux.cpp to force linux_megaraid_device::open to use the right device:

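  // Open the ESXi ioctl node directly instead of the Linux-specific nodes.
  if ((m_fd = ::open("/dev/megaraid_sas_ioctl", O_RDWR)) >= 0) {
    m_hba = 1;  // ? (bus number hard-coded as a guess)
    pt_cmd = &linux_megaraid_device::megasas_cmd;  // use the megaraid_sas pass-through path
    set_fd(m_fd);
    return true;
  }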

After this patch, the device opens, but I get "INQUIRY FAILED".

On ESXi the MegaCli utility works correctly, so I think there are no issues with driver or ioctl support.

I can run any test you want or apply a particular patch.

Monitoring the disks behind the RAID is important because the SMART indicators reported by the controller are very poor.

Thank you.
Simone

Attachments (1)

storcli_strace_output.txt (185.2 KB) - added by Bruno da Costa 5 years ago.
STrace output of 'storcli' command on ESXi


Change History (24)

in reply to:  description comment:1 by Christian Franke, 7 years ago

Component: smartctl → all
Keywords: linux added
Milestone: undecided
Priority: minor → major

comment:2 by Alex Samorukov, 6 years ago

Please try to run smartctl --scan-open.

comment:3 by Alex Samorukov, 6 years ago

You can also get statically built smartmontools binaries from the builds.smartmontools.org website.

comment:4 by Simone Giordano, 6 years ago

Even with the latest version, 6.6 2017-09-20 r4440, the error is the same:

./smartctl --scan-open
Segmentation fault

./smartctl -a /dev/disks/naa.6c81f660d2aeab001fd4153f9ba416c5 -d sat+megaraid,12
smartctl 6.6 2017-09-20 r4440 [x86_64-linux-6.0.0] (daily-20170920)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/disks/naa.6c81f660d2aeab001fd4153f9ba416c5 [megaraid_disk_12] [SAT] failed: can't get bus number

comment:5 by Alex Samorukov, 6 years ago

Hi, do you think it would be possible to provide temporary ssh access for further debugging, or at least a core dump? It would be very interesting to see where smartctl crashed. Please also try --scan-open with -r ioctl,3

comment:6 by Alex Samorukov, 6 years ago

No reply within 5 months, closing ticket

comment:7 by Alex Samorukov, 6 years ago

Resolution: wontfix
Status: new → closed

comment:8 by Christian Franke, 6 years ago

Milestone: undecided

comment:9 by chris watts, 5 years ago

Alex Samorukov,
I have exactly the same problem and can provide temporary ssh access to diagnose the fault. We need to be able to get SMART data from drives behind an LSI card in an ESXi machine.

./smartctl -a /dev/disks/naa.6782bcb05a114e00233c51f30afd396d -d megaraid,0
smartctl 6.6 2017-08-08 r4433 [x86_64-linux-6.7.0] (daily-20170808)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/disks/naa.6782bcb05a114e00233c51f30afd396d [megaraid_disk_00] failed: can't get bus number

./smartctl --scan-open
Segmentation fault

./smartctl --scan-open -r ioctl,3
glob(3) found no matches for pattern /dev/hd[a-t]
glob(3) found no matches for pattern /dev/sd[a-z]
glob(3) found no matches for pattern /dev/sd[a-c][a-z]
Segmentation fault

comment:10 by Christian Franke, 5 years ago

Milestone: undecided
Resolution: wontfix
Status: closed → reopened

Reopening ticket because new info is available.

in reply to:  9 ; comment:11 by Christian Franke, 5 years ago

... behind an LSI card in an ESXi machine.

Which "LSI card" (chip) ?

Smartctl open device: /dev/disks/naa.6782bcb05a114e00233c51f30afd396d [megaraid_disk_00] failed: can't get bus number

This means that SG_GET_SCSI_ID and SCSI_IOCTL_GET_BUS_NUMBER on this path failed for some unknown reason (e.g. not supported).

Does /proc/devices exist? If yes, please examine its contents and provide all lines which contain megaraid or megadev.

Does /dev/megaraid_sas_ioctl_node exist?

Does /dev/megadev0 exist?

Is the ESXi LSI driver actually similar to the Linux one (i.e. same ioctl()s supported) ?

Is the source code of this driver publicly available?

./smartctl --scan-open -r ioctl,3
glob(3) found no matches for pattern /dev/hd[a-t]
glob(3) found no matches for pattern /dev/sd[a-z]
glob(3) found no matches for pattern /dev/sd[a-c][a-z]
Segmentation fault

This segfault occurs if /proc/devices does not exist. The related bug was fixed in r4723. If possible, please test the current SVN version of smartctl.

A closer look reveals that a similar bug still exists in linux_megaraid_device::open().
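
The failure mode described above can be sketched as follows. This is only a minimal illustration, not the actual r4723 change: `find_major` is a hypothetical helper name, and the lookup logic is an assumption about how a major number could be read from a /proc/devices-style file. The point is that a failed open is reported as "not found" instead of the missing file handle being used.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical helper: return the major number listed for `driver` in a
// /proc/devices-style file, or -1 if the file is missing or has no match.
int find_major(const char * path, const char * driver)
{
  FILE * fp = std::fopen(path, "r");
  if (!fp)
    return -1;  // e.g. no /proc/devices on ESXi: report "not found", do not crash

  int result = -1;
  char line[128];
  while (std::fgets(line, sizeof(line), fp)) {
    int major = 0;
    char name[64] = "";
    // Entries look like "  10 misc"; section headers and blank lines do not match.
    if (std::sscanf(line, "%d %63s", &major, name) == 2 && !std::strcmp(name, driver)) {
      result = major;
      break;
    }
  }
  std::fclose(fp);
  return result;
}
```

A caller would skip the device node creation when the function returns -1 and continue with whatever nodes already exist.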

comment:12 by chris watts, 5 years ago

Thanks for getting back to me.
No /proc/devices on ESXi machines.

/dev/megaraid_sas_ioctl exists.

lrwxrwxrwx    1 root     root          33 Oct 11 02:28 /dev/megaraid_sas_ioctl -> char/vmkdriver/megaraid_sas_ioctl

/dev/megadev0 does not exist.

The card is an LSI MegaRAID SAS 9261-8i.

Driver source is not available.

[root@SAU-A625C-OR:/opt/lsi/storcli] ./storcli show all
CLI Version = 007.0709.0000.0000 Aug 14, 2018
Operating system = VMkernel 6.7.0
Status Code = 0
Status = Success
Description = None

Number of Controllers = 2
Host Name = SAU-A625C-OR
Operating System  = VMkernel 6.7.0
Store Lib IT Version = 07.0705.0200.0000
Store Lib IR3 Version = 16.02-0

----------------------------------------------------------------------------------
Ctl Model                 Ports PDs DGs DNOpt VDs VNOpt BBU  sPR DS EHS ASOs Hlth
----------------------------------------------------------------------------------
  0 LSIMegaRAIDSAS9261-8i     8   2   1     0   1     0 Msng On  -  Y      2 Opt
----------------------------------------------------------------------------------

-------------------------------------------------------------------------
Ctl Model      Adapter-Type   Vend-Id Dev-Id Sub-Vend-Id Sub-Dev-Id PCI Address
-------------------------------------------------------------------------
  1 SAS9300-8i   SAS3008(C0) 0x1000  0x97    0x1000   0x30E0 00:81:00:00
-------------------------------------------------------------------------

ASO :
----------------------------------------------------
Ctl Cl SAS MD R6 WC R5 SS FP Re CR RF CO CW HA SSHA
----------------------------------------------------
  0 X  U   X  U  U  U  X  X  X  X  X  X  X  X  X
----------------------------------------------------

Cl=Cluster|MD=Max Disks|WC=Wide Cache|SS=Safe Store|FP=Fast Path|Re=Recovery
CR=Cache-Cade(Read)|RF=Reduced Feature Set|CO=Cache Offload
CW=Cache-Cade(Read / Write)|X=Not Available / Not Installed|U=Unlimited|T=Trial
|HA=High Availability |SSHA=Single server High Availability

Last edited 5 years ago by Christian Franke

in reply to:  12 comment:13 by Christian Franke, 5 years ago

No /proc/devices on ESXi machines.

This explains the segfault. Smartctl cannot create the missing nodes without info from /proc/devices.

/dev/megaraid_sas_ioctl exists.

This does not help, as /dev/megaraid_sas_ioctl_node is required by -d megaraid code.

/dev/megadev0 does not exist.
...
Driver source is not available.

Conclusion: The ESXi MegaRAID driver is different from the Linux driver which is currently supported by smartmontools. More info (documentation, sample source code, reverse engineering result, ...) is required.

If no info could be provided, this ticket will be resolved as wontfix again.

in reply to:  11 comment:14 by Christian Franke, 5 years ago

Replying to Christian Franke:

A closer look reveals that a similar bug still exists in linux_megaraid_device::open().

Fixed in r4809. This fixes the possible crash but not the -d megaraid functionality under ESXi as a required device node is missing.

comment:15 by Christian Franke, 5 years ago

The current implementation of -d megaraid device type in os_linux.cpp works as follows:

  1. Detect bus (HBA) number as follows: If device path matches /dev/bus/N* use N as number or else try ioctl SG_GET_SCSI_ID or else try SCSI_IOCTL_GET_BUS_NUMBER or else fail.
  2. Create possibly missing device nodes /dev/megaraid_sas_ioctl_node and /dev/megadev0 based on major device numbers listed in /proc/devices.
  3. Open /dev/megaraid_sas_ioctl_node or else /dev/megadev0 or else fail.
  4. For pass-through access, use ioctl MEGASAS_IOC_FIRMWARE for /dev/megaraid_sas_ioctl_node or else use MEGAIOCCMD for /dev/megadev0.
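
The detection order in the first step can be sketched as below. This is a simplified illustration and not the code from os_linux.cpp: `detect_bus_number` and `BusProbe` are hypothetical names, and the two ioctl calls are replaced by injected callbacks so that the fallback order can be exercised without a real device.

```cpp
#include <cstdio>
#include <functional>

// Hypothetical stand-in for an ioctl-based probe:
// returns a bus number, or -1 if the ioctl failed.
using BusProbe = std::function<int()>;

// Path pattern first, then SG_GET_SCSI_ID, then SCSI_IOCTL_GET_BUS_NUMBER.
// Returns -1 if all three fail ("can't get bus number").
int detect_bus_number(const char * dev_path,
                      const BusProbe & sg_get_scsi_id,
                      const BusProbe & scsi_get_bus_number)
{
  unsigned n = 0;
  if (std::sscanf(dev_path, "/dev/bus/%u", &n) == 1)
    return static_cast<int>(n);  // bus number taken from the path itself

  int bus = sg_get_scsi_id();
  if (bus < 0)
    bus = scsi_get_bus_number();
  return bus;
}
```

With a path such as /dev/disks/naa.* and both probes failing, the function returns -1, which corresponds to the error reported in this ticket.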

Observations on ESXi collected from above comments:

  1. Neither SG_GET_SCSI_ID nor SCSI_IOCTL_GET_BUS_NUMBER works. Do /dev/bus/N* nodes exist on ESXi?
  2. /proc/devices does not exist.
  3. Neither /dev/megaraid_sas_ioctl_node nor /dev/megadev0 exists; /dev/megaraid_sas_ioctl exists instead.
  4. /dev/megaraid_sas_ioctl could be opened instead, but MEGASAS_IOC_FIRMWARE does not work then. Does another ioctl with the same functionality exist on ESXi?

comment:16 by Christian Franke, 5 years ago

Milestone: undecided
Resolution: wontfix
Status: reopened → closed

The ESXi MegaRAID driver is different from the Linux driver which is currently supported by smartmontools. More info (documentation, sample source code, reverse engineering result, ...) is required.

Please reopen this ticket if (and only if) more info is available.

by Bruno da Costa, 5 years ago

Attachment: storcli_strace_output.txt added

STrace output of 'storcli' command on ESXi

comment:17 by Bruno da Costa, 5 years ago

Resolution: wontfix
Status: closed → reopened

Hello,

I'm re-opening this ticket with some (hopefully) useful data about how a MegaRAID controller works on an ESXi 6.5 box. I have attached the output of an strace taken from storcli (LSI's/Broadcom's native ESXi tool) while it lists information about all of the physical devices attached to the controller. Here are some snippets:

# strace /opt/lsi/storcli/storcli /call /eall /sall show
execve("/opt/lsi/storcli/storcli", ["/opt/lsi/storcli/storcli", "/call", "/eall", "/sall", "show"], [/* 17 vars */]) = 0
[ Process PID=163823 runs in 32 bit mode. ]
[... loading libraries...]
uname({sys="VMkernel", node="hypervisor", ...}) = 0
access("/etc/vmware/hostd/mockupEsxHost.txt", F_OK) = -1 ENOENT (No such file or directory)
open("/etc/lsi/storelibconf.ini", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/dev/megaraid_sas_ioctl", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/dev/megaraid_perc9_ioctl", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/vmfs/devices/char/vmkdriver/vmwMgmtInfo", O_RDWR|O_LARGEFILE) = 3
ioctl(3, 0x800, 0xff939894)             = 0
close(3)                                = 0
open("/vmfs/devices/char/vmkdriver/vmwMgmtNode2", O_RDWR|O_LARGEFILE) = 3
ioctl(3, 0x100, 0x8bb7b10)              = 0
pipe([4, 5])                            = 0
mmap2(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xa4f6000
mprotect(0xa4f6000, 4096, PROT_NONE)    = 0
clone(child_stack=0xa576484, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|CLONE_DETACHED, parent_tidptr=0xa576bd8, tls=0xa576bd8, child_tidptr=0xff939c40) = 163824
futex(0x8bb8098, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x8bb80b4, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x8bb8098, FUTEX_WAKE_PRIVATE, 1) = 0
ioctl(3, 0x200, 0x8bb82b0)              = 0
open("/vmfs/devices/char/vmkdriver/vmwMgmtInfo", O_RDWR|O_LARGEFILE) = 6
ioctl(6, 0x800, 0xff939894)             = 0
close(6)                                = 0
open("/vmfs/devices/char/vmkdriver/vmwMgmtInfo", O_RDWR|O_LARGEFILE) = 6
ioctl(6, 0x800, 0xff939894)             = 0
close(6)                                = 0
[... this repeats a lot ...]
ioctl(3, 0x200, 0x8bb82b0)              = 0
open("/dev/megaraid_swr_ioctl_node", O_RDONLY) = -1 ENOENT (No such file or directory)
ioctl(3, 0x200, 0x8bb7980)              = 0
uname({sys="VMkernel", node="hypervisor", ...}) = 0
uname({sys="VMkernel", node="hypervisor", ...}) = 0
ioctl(3, 0x200, 0x8bb7980)              = 0
ioctl(3, 0x200, 0x8bb79c8)              = 0
ioctl(3, 0x200, 0x8bb8388)              = 0
ioctl(3, 0x200, 0x8bb8610)              = 0
brk(0x8bfb000)                          = 0x8bfb000
ioctl(3, 0x200, 0x8bb8700)              = 0
ioctl(3, 0x200, 0x8bb96d0)              = 0
brk(0x8beb000)                          = 0x8beb000
ioctl(3, 0x200, 0x8bb9f78)              = 0
brk(0x8c0c000)                          = 0x8c0c000
ioctl(3, 0x200, 0x8bb9f88)              = 0
brk(0x8bfc000)                          = 0x8bfc000
ioctl(3, 0x200, 0x8bdb3c8)              = 0
ioctl(3, 0x200, 0x8bca8f8)              = 0
ioctl(3, 0x200, 0x8bcab00)              = 0
ioctl(3, 0x200, 0x8bcad08)              = 0
ioctl(3, 0x200, 0x8bcaf10)              = 0
ioctl(3, 0x200, 0x8bcb118)              = 0
ioctl(3, 0x200, 0x8bcb320)              = 0
ioctl(3, 0x200, 0x8bcb528)              = 0
ioctl(3, 0x200, 0x8bcb730)              = 0
ioctl(3, 0x200, 0x8bcb938)              = 0
ioctl(3, 0x200, 0x8bcbb40)              = 0
ioctl(3, 0x200, 0x8bcbd48)              = 0
ioctl(3, 0x200, 0x8bcbf50)              = 0
ioctl(3, 0x200, 0x8bcc158)              = 0
ioctl(3, 0x200, 0x8bcc360)              = 0
ioctl(3, 0x200, 0x8bcc568)              = 0
ioctl(3, 0x200, 0x8bcc770)              = 0
ioctl(3, 0x200, 0x8bcc978)              = 0
ioctl(3, 0x200, 0x8bccb80)              = 0
ioctl(3, 0x200, 0x8bccd88)              = 0
ioctl(3, 0x200, 0x8bccf90)              = 0
ioctl(3, 0x200, 0x8bcd198)              = 0
ioctl(3, 0x200, 0x8bcd3a0)              = 0
ioctl(3, 0x200, 0x8bcd5a8)              = 0
ioctl(3, 0x200, 0x8bcd7b0)              = 0
ioctl(3, 0x200, 0x8bcd5d0)              = 0
ioctl(3, 0x200, 0x8bcdaa8)              = 0
ioctl(3, 0x200, 0x8bcdca8)              = 0
ioctl(3, 0x200, 0x8bce288)              = 0
ioctl(3, 0x200, 0x8bced20)              = 0
ioctl(3, 0x200, 0x8bcee08)              = 0
ioctl(3, 0x200, 0x8bcef18)              = 0
ioctl(3, 0x200, 0x8bcf078)              = 0
ioctl(3, 0x200, 0x8bcf078)              = 0
ioctl(3, 0x200, 0x8bcfae8)              = 0
ioctl(3, 0x200, 0x8bd0208)              = 0
ioctl(3, 0x200, 0x8bd08a8)              = 0
ioctl(3, 0x200, 0x8bd0f48)              = 0
ioctl(3, 0x200, 0x8bd15e8)              = 0
ioctl(3, 0x200, 0x8bd1c88)              = 0
ioctl(3, 0x200, 0x8bd2328)              = 0
ioctl(3, 0x200, 0x8bd29c8)              = 0
ioctl(3, 0x200, 0x8bd3068)              = 0
ioctl(3, 0x200, 0x8bd3708)              = 0
ioctl(3, 0x200, 0x8bd3da8)              = 0
ioctl(3, 0x200, 0x8bd4448)              = 0
ioctl(3, 0x200, 0x8bd4ae8)              = 0
ioctl(3, 0x200, 0x8bd5188)              = 0
ioctl(3, 0x200, 0x8bd5828)              = 0
ioctl(3, 0x200, 0x8bd5ec8)              = 0
ioctl(3, 0x200, 0x8bd6568)              = 0
ioctl(3, 0x200, 0x8bd6c08)              = 0
ioctl(3, 0x200, 0x8bd72c0)              = 0
ioctl(3, 0x200, 0x8bd7978)              = 0
ioctl(3, 0x200, 0x8bd8018)              = 0
ioctl(3, 0x200, 0x8bd86b8)              = 0
ioctl(3, 0x200, 0x8bd8d58)              = 0
ioctl(3, 0x200, 0x8bd93f8)              = 0
fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(0, 0), ...}) = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B9600 opost isig icanon echo ...}) = 0
mmap2(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xa577000
[...command writes result to stdout...]

Of interest, it looks like, instead of using /proc/devices and /dev/megaraid*, storcli uses /vmfs/devices/char/vmkdriver/vmwMgmtInfo and /vmfs/devices/char/vmkdriver/vmwMgmtNode2 on ESXi.

I have an ESXi 6.5u2 box with a MegaRAID 9265-8i connected to it and I'm available to run commands and provide any information I can to help make smartctl work on ESXi. Let me know how I can help.

Thanks!

Last edited 5 years ago by Bruno da Costa

comment:18 by Bruno da Costa, 5 years ago

Cc: Bruno da Costa added

comment:19 by Christian Franke, 5 years ago

Milestone: undecided

comment:20 by Christian Franke, 5 years ago

Milestone: undecided
Resolution: wontfix
Status: reopened → closed

The info from above comment is not sufficient, sorry.

There is no information about size and contents of the structures passed to the various ioctl(3, 0x?00, 0x????????) calls.
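
To illustrate why the request codes alone do not help: on Linux, the asm-generic _IOC layout packs a direction and an argument size into the request number. The sketch below decodes a code under that layout. It is an assumption that this layout applies here at all, since ESXi may encode its requests differently, and `decode_ioctl` and `IoctlFields` are hypothetical names.

```cpp
#include <cstdint>

struct IoctlFields { unsigned dir, size, type, nr; };

// Split a request code using the Linux asm-generic _IOC layout:
// nr in bits 0-7, type in bits 8-15, size in bits 16-29, dir in bits 30-31.
constexpr IoctlFields decode_ioctl(std::uint32_t req)
{
  return { (req >> 30) & 0x3u, (req >> 16) & 0x3fffu, (req >> 8) & 0xffu, req & 0xffu };
}
```

Under this layout, 0x100, 0x200 and 0x800 all decode to a size of 0 and a direction of 0, so the codes seen in the strace carry no hint about the structures behind the pointer arguments.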

comment:21 by Deepali, 3 years ago

Cc: Deepali added
Resolution: wontfix
Status: closed → reopened

Hi
I really need to get a powerful tool like smartmontools to work on ESXi. It works very well on Linux, but on ESXi I run into one error after another. I took the latest build, smartmontools-linux-x86_64-static-7.3-r5227.tar.gz, created a VIB from it, and installed it on my VMware host running ESXi 6.5. The host has 8 disks behind a PERC RAID controller.
I get a "Function not implemented" error when trying to read the SMART parameters of one of them.
I am attaching the strace of the command here.


[root@Poweredge-R720-ESXi6:/usr/local/sbin] strace ./smartctl  -d sat --all /dev/disks/naa.6c81f660f18d100021b289ce0c3cf070
execve("./smartctl", ["./smartctl", "-d", "sat", "--all", "/dev/disks/naa.6c81f660f18d10002"...], [/* 19 vars */]) = 0
geteuid()                               = 0
getuid()                                = 0
getegid()                               = 0
getgid()                                = 0
brk(0)                                  = 0x661db64000
brk(0x661db651c0)                       = 0x661db651c0
arch_prctl(ARCH_SET_FS, 0x661db64880)   = 0
uname({sys="VMkernel", node="Poweredge-R720-ESXi6.5.hsd1.ca.comcast.net", ...}) = 0
readlink("/proc/self/exe", 0x3080eb55b40, 4096) = -1 ENOENT (No such file or directory)
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
brk(0x661db861c0)                       = 0x661db861c0
brk(0x661db87000)                       = 0x661db87000
uname({sys="VMkernel", node="Poweredge-R720-ESXi6.5.hsd1.ca.comcast.net", ...}) = 0
fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(0, 0), ...}) = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
write(1, "smartctl 7.3 2021-06-26 r5227 [x"..., 62smartctl 7.3 2021-06-26 r5227 [x86_64-linux-6.5.0] (CircleCI)) = 62
write(1, "Copyright (C) 2002-21, Bruce All"..., 76Copyright (C) 2002-21, Bruce Allen, Christian Franke, www.smartmontools.org) = 76
write(1, "\n", 1)                       = 1
access("/usr/local/etc/smart_drivedb.h", F_OK) = -1 ENOENT (No such file or directory)
access("/usr/local/share/smartmontools/drivedb.h", F_OK) = 0
openat(AT_FDCWD, "/usr/local/share/smartmontools/drivedb.h", O_RDONLY) = -1 ENOSYS (Function not implemented)
write(1, "/usr/local/share/smartmontools/d"..., 74/usr/local/share/smartmontools/drivedb.h: cannot open drive database file) = 74
exit_group(1)                           = ?

I get a similar error for directly attached (non-RAID) disks as well.
Is there any support for running smartctl on ESXi?
What does the error " /usr/local/share/smartmontools/d"..., 74/usr/local/share/smartmontools/drivedb.h: cannot open drive database file " mean?

Can you help to get it to work?
Would appreciate your response.
Thank you
-Deepali

Last edited 3 years ago by Deepali

in reply to:  18 comment:22 by Christian Franke, 3 years ago

Replying to Deepali:
Smartctl fails early, before any device access, because the file open function is not implemented:

openat(AT_FDCWD, "/usr/local/share/smartmontools/drivedb.h", O_RDONLY) = -1 ENOSYS (Function not implemented)
write(1, "/usr/local/share/smartmontools/d"..., 74/usr/local/share/smartmontools/drivedb.h: cannot open drive database file) = 74

An ESXi-compatible C runtime may be required. Please see the mail thread mentioned in the related FAQ entry.

Last edited 3 years ago by Christian Franke

comment:23 by Christian Franke, 3 years ago

Resolution: wontfix
Status: reopened → closed
Type: defect → enhancement

The info from above comment does not help.

Please do not reopen this ticket unless you can provide a working patch.
