Opened 2 years ago

Last modified 4 weeks ago

#800 reopened defect

"can't get bus number" issue with MegaRAID on ESXi

Reported by: Simone Giordano
Owned by:
Priority: major
Milestone: undecided
Component: all
Version: 6.5
Keywords: megaraid esxi linux
Cc: Bruno da Costa

Description

There is an issue using smartctl on ESXi to monitor disks behind the RAID.
Example:

smartctl -a /dev/disks/naa.6c81f660d2aeab001fd4153f9ba416c5 -d sat+megaraid,12

Smartctl open device: /dev/disks/naa.6c81f660d2aeab001fd4153f9ba416c5 [megaraid_disk_12] [SAT] failed: can't get bus number

I've compiled a static version of smartctl from updated sources (6.6 r4384) and the issue still exists. Because ESXi differs from a normal Linux distribution, I tried patching os_linux.cpp to force linux_megaraid_device::open to use the device node that exists there:

  if ((m_fd = ::open("/dev/megaraid_sas_ioctl", O_RDWR)) >= 0) {
    m_hba = 1;  // ?
    pt_cmd = &linux_megaraid_device::megasas_cmd;
    set_fd(m_fd);
    return true; 
  }

After this patch the device opens, but I get "INQUIRY FAILED".

On ESXi the MegaCli utility works correctly, so I think there are no issues with driver or ioctl support.

I can run any test you want or apply a particular patch.

It's important to monitor the disks behind the RAID controller, because the SMART indicators reported by the controller itself are very poor.

Thank you.
Simone

Attachments (1)

storcli_strace_output.txt (185.2 KB) - added by Bruno da Costa 4 weeks ago.
STrace output of 'storcli' command on ESXi


Change History (20)

comment:1 in reply to:  description Changed 2 years ago by Christian Franke

Component: smartctl → all
Keywords: linux added
Milestone: undecided
Priority: minor → major

comment:2 Changed 16 months ago by Alex Samorukov

Please try to run smartctl --scan-open.

comment:3 Changed 16 months ago by Alex Samorukov

You can also get statically built smartmontools binaries from the builds.smartmontools.org website.

comment:4 Changed 16 months ago by Simone Giordano

With the latest version (6.6 2017-09-20 r4440) the error is the same:

./smartctl --scan-open
Segmentation fault

./smartctl -a /dev/disks/naa.6c81f660d2aeab001fd4153f9ba416c5 -d sat+megaraid,12
smartctl 6.6 2017-09-20 r4440 [x86_64-linux-6.0.0] (daily-20170920)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/disks/naa.6c81f660d2aeab001fd4153f9ba416c5 [megaraid_disk_12] [SAT] failed: can't get bus number

comment:5 Changed 16 months ago by Alex Samorukov

Hi, would it be possible to provide temporary ssh access for further debugging, or at least a core dump? It would be very interesting to see where smartctl crashed. Please also try --scan-open with -r ioctl,3.

comment:6 Changed 11 months ago by Alex Samorukov

No reply within 5 months, closing ticket

comment:7 Changed 11 months ago by Alex Samorukov

Resolution: wontfix
Status: new → closed

comment:8 Changed 11 months ago by Christian Franke

Milestone: undecided

comment:9 Changed 4 months ago by chris watts

Alex Samorukov,
I have exactly the same problem and can provide you with temporary ssh access to diagnose the fault. We need to be able to get SMART data from drives behind an LSI card in an ESXi machine.

./smartctl -a /dev/disks/naa.6782bcb05a114e00233c51f30afd396d -d megaraid,0
smartctl 6.6 2017-08-08 r4433 [x86_64-linux-6.7.0] (daily-20170808)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/disks/naa.6782bcb05a114e00233c51f30afd396d [megaraid_disk_00] failed: can't get bus number

./smartctl --scan-open
Segmentation fault

./smartctl --scan-open -r ioctl,3
glob(3) found no matches for pattern /dev/hd[a-t]
glob(3) found no matches for pattern /dev/sd[a-z]
glob(3) found no matches for pattern /dev/sd[a-c][a-z]
Segmentation fault

comment:10 Changed 4 months ago by Christian Franke

Milestone: undecided
Resolution: wontfix deleted
Status: closed → reopened

Reopening ticket because new info is available.

comment:11 in reply to:  9 Changed 4 months ago by Christian Franke

... behind an LSI card in an ESXi machine.

Which "LSI card" (chip)?

Smartctl open device: /dev/disks/naa.6782bcb05a114e00233c51f30afd396d [megaraid_disk_00] failed: can't get bus number

This means that SG_GET_SCSI_ID and SCSI_IOCTL_GET_BUS_NUMBER on this path failed for some unknown reason (e.g. not supported).

Does /proc/devices exist? If yes, please examine its contents and provide all lines which contain megaraid or megadev.

Does /dev/megaraid_sas_ioctl_node exist?

Does /dev/megadev0 exist?

Is the ESXi LSI driver actually similar to the Linux one (i.e. are the same ioctl()s supported)?

Is the source code of this driver publicly available?

./smartctl --scan-open -r ioctl,3
glob(3) found no matches for pattern /dev/hd[a-t]
glob(3) found no matches for pattern /dev/sd[a-z]
glob(3) found no matches for pattern /dev/sd[a-c][a-z]
Segmentation fault

This segfault occurs if /proc/devices does not exist. The related bug was fixed in r4723. If possible, please test the current SVN version of smartctl.

A closer look reveals that a similar bug still exists in linux_megaraid_device::open().
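
For illustration only, the kind of guard that avoids a crash when /proc/devices is missing might look like the sketch below. It is not the code from r4723; the helper name find_char_major and its signature are hypothetical, and the only assumption about the file is the usual "<major> <driver name>" line format.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical helper (not taken from smartmontools): look up the major
// number registered under `name` in a /proc/devices-style file.
// Returns false if the file cannot be opened or has no matching entry.
bool find_char_major(const char* devices_path, const char* name, int& major_out)
{
  std::FILE* fp = std::fopen(devices_path, "r");
  if (!fp)
    return false;  // e.g. no /proc/devices: never pass a null FILE* on

  char line[128];
  bool found = false;
  while (!found && std::fgets(line, sizeof(line), fp)) {
    int mjr = 0;
    char dev[64] = "";
    // Entries look like "<major> <driver name>"; headers and blank lines
    // do not match both conversions and are skipped.
    if (std::sscanf(line, "%d %63s", &mjr, dev) == 2 && std::strcmp(dev, name) == 0) {
      major_out = mjr;
      found = true;
    }
  }
  std::fclose(fp);
  return found;
}
```

A caller would try find_char_major("/proc/devices", "megaraid_sas_ioctl", mjr) and only attempt to create a device node when it returns true.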

comment:12 Changed 3 months ago by chris watts

Thanks for getting back to me.
No /proc/devices on ESXi machines.

/dev/megaraid_sas_ioctl exists.

lrwxrwxrwx    1 root     root          33 Oct 11 02:28 /dev/megaraid_sas_ioctl -> char/vmkdriver/megaraid_sas_ioctl

/dev/megadev0 does not exist.

The card is LSI Mega RAID SAS 9261-8i

Driver source is not available.

[root@SAU-A625C-OR:/opt/lsi/storcli] ./storcli show all
CLI Version = 007.0709.0000.0000 Aug 14, 2018
Operating system = VMkernel 6.7.0
Status Code = 0
Status = Success
Description = None

Number of Controllers = 2
Host Name = SAU-A625C-OR
Operating System  = VMkernel 6.7.0
Store Lib IT Version = 07.0705.0200.0000
Store Lib IR3 Version = 16.02-0

----------------------------------------------------------------------------------
Ctl Model                 Ports PDs DGs DNOpt VDs VNOpt BBU  sPR DS EHS ASOs Hlth
----------------------------------------------------------------------------------
  0 LSIMegaRAIDSAS9261-8i     8   2   1     0   1     0 Msng On  -  Y      2 Opt
----------------------------------------------------------------------------------

-------------------------------------------------------------------------
Ctl Model      Adapter-Type   Vend-Id Dev-Id Sub-Vend-Id Sub-Dev-Id PCI Address
-------------------------------------------------------------------------
  1 SAS9300-8i   SAS3008(C0) 0x1000  0x97    0x1000   0x30E0 00:81:00:00
-------------------------------------------------------------------------

ASO :
----------------------------------------------------
Ctl Cl SAS MD R6 WC R5 SS FP Re CR RF CO CW HA SSHA
----------------------------------------------------
  0 X  U   X  U  U  U  X  X  X  X  X  X  X  X  X
----------------------------------------------------

Cl=Cluster|MD=Max Disks|WC=Wide Cache|SS=Safe Store|FP=Fast Path|Re=Recovery
CR=Cache-Cade(Read)|RF=Reduced Feature Set|CO=Cache Offload
CW=Cache-Cade(Read / Write)|X=Not Available / Not Installed|U=Unlimited|T=Trial
|HA=High Availability |SSHA=Single server High Availability
Last edited 3 months ago by Christian Franke

comment:13 in reply to:  12 Changed 3 months ago by Christian Franke

No /proc/devices on ESXi machines.

This explains the segfault. Smartctl cannot create the missing nodes without info from /proc/devices.

/dev/megaraid_sas_ioctl exists.

This does not help, as /dev/megaraid_sas_ioctl_node is required by -d megaraid code.

/dev/megadev0 does not exist.
...
Driver source is not available.

Conclusion: The ESXi MegaRAID driver is different from the Linux driver which is currently supported by smartmontools. More info (documentation, sample source code, reverse engineering result, ...) is required.

If no info could be provided, this ticket will be resolved as wontfix again.

comment:14 in reply to:  11 Changed 3 months ago by Christian Franke

Replying to Christian Franke:

A closer look reveals that a similar bug still exists in linux_megaraid_device::open().

Fixed in r4809. This fixes the possible crash but not the -d megaraid functionality under ESXi as a required device node is missing.

comment:15 Changed 3 months ago by Christian Franke

The current implementation of -d megaraid device type in os_linux.cpp works as follows:

  1. Detect bus (HBA) number as follows: If device path matches /dev/bus/N* use N as number or else try ioctl SG_GET_SCSI_ID or else try SCSI_IOCTL_GET_BUS_NUMBER or else fail.
  2. Create possibly missing device nodes /dev/megaraid_sas_ioctl_node and /dev/megadev0 based on major device numbers listed in /proc/devices.
  3. Open /dev/megaraid_sas_ioctl_node or else /dev/megadev0 or else fail.
  4. For pass-through access, use ioctl MEGASAS_IOC_FIRMWARE for /dev/megaraid_sas_ioctl_node or else use MEGAIOCCMD for /dev/megadev0.
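
The bus number detection in the first step is a plain fallback chain. A minimal sketch of that chain, with the two ioctl probes replaced by injected callbacks so that it compiles on any platform, could look like this. The names BusProbe and detect_bus_number are hypothetical and are not the identifiers used in os_linux.cpp.

```cpp
#include <cstdio>
#include <functional>

// Stand-in for one ioctl-based probe (SG_GET_SCSI_ID or
// SCSI_IOCTL_GET_BUS_NUMBER): returns true and sets `bus` on success.
// Hypothetical abstraction for this sketch only.
using BusProbe = std::function<bool(unsigned& bus)>;

// Try the /dev/bus/N path pattern first, then each probe in order.
// `bus` is only written when detection succeeds.
bool detect_bus_number(const char* dev_path,
                       const BusProbe& sg_get_scsi_id,
                       const BusProbe& get_bus_number,
                       unsigned& bus)
{
  unsigned n = 0;
  if (std::sscanf(dev_path, "/dev/bus/%u", &n) == 1) {
    bus = n;
    return true;
  }
  if (sg_get_scsi_id && sg_get_scsi_id(n)) {
    bus = n;
    return true;
  }
  if (get_bus_number && get_bus_number(n)) {
    bus = n;
    return true;
  }
  return false;  // the caller would report "can't get bus number"
}
```

With a /dev/disks/naa.* path and both probes failing, as reported for ESXi in this ticket, the function returns false, which corresponds to the error message seen by the reporters.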

Observations on ESXi collected from above comments:

  1. Neither SG_GET_SCSI_ID nor SCSI_IOCTL_GET_BUS_NUMBER works. Do /dev/bus/N* nodes exist on ESXi?
  2. /proc/devices does not exist.
  3. Neither /dev/megaraid_sas_ioctl_node nor /dev/megadev0 exists; /dev/megaraid_sas_ioctl exists instead.
  4. /dev/megaraid_sas_ioctl could be opened instead, but MEGASAS_IOC_FIRMWARE does not work then. Does another ioctl with same functionality exist on ESXi?
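
To make the mismatch concrete, here is a small sketch of the node selection described above (open /dev/megaraid_sas_ioctl_node, or else /dev/megadev0, or else fail). The open attempt is injected as a callback so the logic can be exercised without the hardware. The type and function names are hypothetical; only the device paths and ioctl names come from this ticket.

```cpp
#include <functional>
#include <string>

// Which pass-through ioctl would be used for the selected node.
enum class MegaPassThrough { none, megasas_ioc_firmware, megaioccmd };

struct MegaNodeChoice {
  std::string path;        // empty if no usable node was found
  MegaPassThrough method;  // none if no usable node was found
};

// Pick the control node in the order described above.
// `can_open` stands in for a real ::open() attempt on the given path.
MegaNodeChoice choose_megaraid_node(
    const std::function<bool(const std::string&)>& can_open)
{
  if (can_open("/dev/megaraid_sas_ioctl_node"))
    return {"/dev/megaraid_sas_ioctl_node", MegaPassThrough::megasas_ioc_firmware};
  if (can_open("/dev/megadev0"))
    return {"/dev/megadev0", MegaPassThrough::megaioccmd};
  return {"", MegaPassThrough::none};
}
```

On a host where only /dev/megaraid_sas_ioctl exists, as observed on ESXi, neither candidate opens and the result is MegaPassThrough::none.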

comment:16 Changed 3 months ago by Christian Franke

Milestone: undecided
Resolution: wontfix
Status: reopened → closed

The ESXi MegaRAID driver is different from the Linux driver which is currently supported by smartmontools. More info (documentation, sample source code, reverse engineering result, ...) is required.

Please reopen this ticket if (and only if) more info is available.

Changed 4 weeks ago by Bruno da Costa

Attachment: storcli_strace_output.txt added

STrace output of 'storcli' command on ESXi

comment:17 Changed 4 weeks ago by Bruno da Costa

Resolution: wontfix deleted
Status: closed → reopened

Hello,

I'm re-opening this ticket with some (hopefully) useful data about how a MegaRAID controller works on an ESXi 6.5 box. I have attached the strace output of storcli (LSI's/Broadcom's native ESXi tool) listing information about all of the physical devices attached to the controller. Here are some snippets:

# strace /opt/lsi/storcli/storcli /call /eall /sall show
execve("/opt/lsi/storcli/storcli", ["/opt/lsi/storcli/storcli", "/call", "/eall", "/sall", "show"], [/* 17 vars */]) = 0
[ Process PID=163823 runs in 32 bit mode. ]
[... loading libraries...]
uname({sys="VMkernel", node="hypervisor", ...}) = 0
access("/etc/vmware/hostd/mockupEsxHost.txt", F_OK) = -1 ENOENT (No such file or directory)
open("/etc/lsi/storelibconf.ini", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/dev/megaraid_sas_ioctl", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/dev/megaraid_perc9_ioctl", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/vmfs/devices/char/vmkdriver/vmwMgmtInfo", O_RDWR|O_LARGEFILE) = 3
ioctl(3, 0x800, 0xff939894)             = 0
close(3)                                = 0
open("/vmfs/devices/char/vmkdriver/vmwMgmtNode2", O_RDWR|O_LARGEFILE) = 3
ioctl(3, 0x100, 0x8bb7b10)              = 0
pipe([4, 5])                            = 0
mmap2(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xa4f6000
mprotect(0xa4f6000, 4096, PROT_NONE)    = 0
clone(child_stack=0xa576484, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|CLONE_DETACHED, parent_tidptr=0xa576bd8, tls=0xa576bd8, child_tidptr=0xff939c40) = 163824
futex(0x8bb8098, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x8bb80b4, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x8bb8098, FUTEX_WAKE_PRIVATE, 1) = 0
ioctl(3, 0x200, 0x8bb82b0)              = 0
open("/vmfs/devices/char/vmkdriver/vmwMgmtInfo", O_RDWR|O_LARGEFILE) = 6
ioctl(6, 0x800, 0xff939894)             = 0
close(6)                                = 0
open("/vmfs/devices/char/vmkdriver/vmwMgmtInfo", O_RDWR|O_LARGEFILE) = 6
ioctl(6, 0x800, 0xff939894)             = 0
close(6)                                = 0
[... this repeats a lot ...]
ioctl(3, 0x200, 0x8bb82b0)              = 0
open("/dev/megaraid_swr_ioctl_node", O_RDONLY) = -1 ENOENT (No such file or directory)
ioctl(3, 0x200, 0x8bb7980)              = 0
uname({sys="VMkernel", node="hypervisor", ...}) = 0
uname({sys="VMkernel", node="hypervisor", ...}) = 0
ioctl(3, 0x200, 0x8bb7980)              = 0
ioctl(3, 0x200, 0x8bb79c8)              = 0
ioctl(3, 0x200, 0x8bb8388)              = 0
ioctl(3, 0x200, 0x8bb8610)              = 0
brk(0x8bfb000)                          = 0x8bfb000
ioctl(3, 0x200, 0x8bb8700)              = 0
ioctl(3, 0x200, 0x8bb96d0)              = 0
brk(0x8beb000)                          = 0x8beb000
ioctl(3, 0x200, 0x8bb9f78)              = 0
brk(0x8c0c000)                          = 0x8c0c000
ioctl(3, 0x200, 0x8bb9f88)              = 0
brk(0x8bfc000)                          = 0x8bfc000
ioctl(3, 0x200, 0x8bdb3c8)              = 0
ioctl(3, 0x200, 0x8bca8f8)              = 0
ioctl(3, 0x200, 0x8bcab00)              = 0
ioctl(3, 0x200, 0x8bcad08)              = 0
ioctl(3, 0x200, 0x8bcaf10)              = 0
ioctl(3, 0x200, 0x8bcb118)              = 0
ioctl(3, 0x200, 0x8bcb320)              = 0
ioctl(3, 0x200, 0x8bcb528)              = 0
ioctl(3, 0x200, 0x8bcb730)              = 0
ioctl(3, 0x200, 0x8bcb938)              = 0
ioctl(3, 0x200, 0x8bcbb40)              = 0
ioctl(3, 0x200, 0x8bcbd48)              = 0
ioctl(3, 0x200, 0x8bcbf50)              = 0
ioctl(3, 0x200, 0x8bcc158)              = 0
ioctl(3, 0x200, 0x8bcc360)              = 0
ioctl(3, 0x200, 0x8bcc568)              = 0
ioctl(3, 0x200, 0x8bcc770)              = 0
ioctl(3, 0x200, 0x8bcc978)              = 0
ioctl(3, 0x200, 0x8bccb80)              = 0
ioctl(3, 0x200, 0x8bccd88)              = 0
ioctl(3, 0x200, 0x8bccf90)              = 0
ioctl(3, 0x200, 0x8bcd198)              = 0
ioctl(3, 0x200, 0x8bcd3a0)              = 0
ioctl(3, 0x200, 0x8bcd5a8)              = 0
ioctl(3, 0x200, 0x8bcd7b0)              = 0
ioctl(3, 0x200, 0x8bcd5d0)              = 0
ioctl(3, 0x200, 0x8bcdaa8)              = 0
ioctl(3, 0x200, 0x8bcdca8)              = 0
ioctl(3, 0x200, 0x8bce288)              = 0
ioctl(3, 0x200, 0x8bced20)              = 0
ioctl(3, 0x200, 0x8bcee08)              = 0
ioctl(3, 0x200, 0x8bcef18)              = 0
ioctl(3, 0x200, 0x8bcf078)              = 0
ioctl(3, 0x200, 0x8bcf078)              = 0
ioctl(3, 0x200, 0x8bcfae8)              = 0
ioctl(3, 0x200, 0x8bd0208)              = 0
ioctl(3, 0x200, 0x8bd08a8)              = 0
ioctl(3, 0x200, 0x8bd0f48)              = 0
ioctl(3, 0x200, 0x8bd15e8)              = 0
ioctl(3, 0x200, 0x8bd1c88)              = 0
ioctl(3, 0x200, 0x8bd2328)              = 0
ioctl(3, 0x200, 0x8bd29c8)              = 0
ioctl(3, 0x200, 0x8bd3068)              = 0
ioctl(3, 0x200, 0x8bd3708)              = 0
ioctl(3, 0x200, 0x8bd3da8)              = 0
ioctl(3, 0x200, 0x8bd4448)              = 0
ioctl(3, 0x200, 0x8bd4ae8)              = 0
ioctl(3, 0x200, 0x8bd5188)              = 0
ioctl(3, 0x200, 0x8bd5828)              = 0
ioctl(3, 0x200, 0x8bd5ec8)              = 0
ioctl(3, 0x200, 0x8bd6568)              = 0
ioctl(3, 0x200, 0x8bd6c08)              = 0
ioctl(3, 0x200, 0x8bd72c0)              = 0
ioctl(3, 0x200, 0x8bd7978)              = 0
ioctl(3, 0x200, 0x8bd8018)              = 0
ioctl(3, 0x200, 0x8bd86b8)              = 0
ioctl(3, 0x200, 0x8bd8d58)              = 0
ioctl(3, 0x200, 0x8bd93f8)              = 0
fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(0, 0), ...}) = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B9600 opost isig icanon echo ...}) = 0
mmap2(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xa577000
[...command writes result to stdout...]

Of interest: instead of using /proc/devices and /dev/megaraid*, storcli appears to use /vmfs/devices/char/vmkdriver/vmwMgmtInfo and /vmfs/devices/char/vmkdriver/vmwMgmtNode2 on ESXi.
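
In case it helps with reading the full attachment, a rough way to summarize which ioctl request codes are issued is to tally them per file descriptor. The sketch below assumes the strace line format shown in the snippets above; the function name tally_ioctls is hypothetical. Note that file descriptors are reused in the trace (fd 3 is first vmwMgmtInfo and later vmwMgmtNode2), so the fd alone does not identify the device.

```cpp
#include <cstdio>
#include <map>
#include <sstream>
#include <string>
#include <utility>

// Count ioctl() calls per (fd, request code) in strace text.
// Only lines of the form "ioctl(<fd>, 0x<hex>, ..." are counted; calls
// that strace decoded symbolically (e.g. TCGETS) are skipped.
std::map<std::pair<int, unsigned>, int> tally_ioctls(const std::string& strace_text)
{
  std::map<std::pair<int, unsigned>, int> counts;
  std::istringstream in(strace_text);
  std::string line;
  while (std::getline(in, line)) {
    int fd = 0;
    unsigned request = 0;
    if (std::sscanf(line.c_str(), "ioctl(%d, 0x%x", &fd, &request) == 2)
      ++counts[std::make_pair(fd, request)];
  }
  return counts;
}
```

Feeding the snippets above through such a tally would show request 0x800 on the vmwMgmtInfo descriptors and requests 0x100 and 0x200 on the vmwMgmtNode2 descriptor, with 0x200 by far the most frequent.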

I have an ESXi 6.5u2 box with a MegaRAID 9265-8i connected to it and I'm available to run commands and provide any information I can to help make smartctl work on ESXi. Let me know how I can help.

Thanks!

Last edited 4 weeks ago by Bruno da Costa

comment:18 Changed 4 weeks ago by Bruno da Costa

Cc: Bruno da Costa added

comment:19 Changed 4 weeks ago by Christian Franke

Milestone: undecided