nvidia-smi shows fewer GPUs than installed
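Problem description
After installing the NVIDIA driver on Ubuntu, the number of GPUs reported by nvidia-smi does not match the number of GPUs physically present (lspci |grep -i nvidia).
Number of GPUs reported by nvidia-smi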
nvidia-smi
Mon Oct 21 18:13:11 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.256.02   Driver Version: 470.256.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:A1:00.0 Off |                    0 |
|  0%   34C    P8    28W / 300W |      0MiB / 45634MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
nvidia-smi -L
GPU 0: NVIDIA A40 (UUID: GPU-558024f3-9a49-b2d6-7420-ad3d6a4537de)
Actual number of GPUs
lspci |grep -i nvidia
61:00.0 3D controller: NVIDIA Corporation Device 2235 (rev a1)
a1:00.0 3D controller: NVIDIA Corporation Device 2235 (rev a1)
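The comparison above can be scripted. The following is a minimal sketch, not part of the original troubleshooting: the function names count_pci_gpus and count_smi_gpus are hypothetical, and it assumes GPUs appear in lspci as "3D controller" or "VGA compatible controller" entries and that nvidia-smi -L prints one "GPU N:" line per device.

```shell
#!/bin/sh
# Hypothetical helper: count NVIDIA GPU entries in `lspci` output read from stdin.
# Assumption: GPUs are listed as "3D controller" or "VGA compatible controller".
count_pci_gpus() {
    grep -i 'nvidia' | grep -Eic '3D controller|VGA compatible controller'
}

# Hypothetical helper: count GPUs in `nvidia-smi -L` output read from stdin.
# Assumption: each GPU is printed on a line starting with "GPU <index>".
count_smi_gpus() {
    grep -c '^GPU [0-9]'
}

# Run the live comparison only when both tools are available.
if command -v lspci >/dev/null 2>&1 && command -v nvidia-smi >/dev/null 2>&1; then
    pci=$(lspci | count_pci_gpus)
    smi=$(nvidia-smi -L | count_smi_gpus)
    echo "PCI bus: $pci GPU(s), driver: $smi GPU(s)"
    [ "$pci" -eq "$smi" ] || echo "Mismatch: check the kernel log for NVRM errors"
fi
```

On the machine described here, this would be expected to report 2 GPUs on the PCI bus and 1 GPU seen by the driver.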
Troubleshooting
Check the kernel log for NVIDIA driver messages
cat /var/log/dmesg.0 |grep -i nvidia
[ 11.640635] kernel: nvidia: module license 'NVIDIA' taints kernel.
[ 11.747846] kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[ 11.756368] kernel: nvidia 0000:61:00.0: enabling device (0000 -> 0002)
[ 11.760365] kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 11.761942] kernel: nvidia: probe of 0000:61:00.0 failed with error -1
[ 11.763189] kernel: nvidia 0000:a1:00.0: enabling device (0000 -> 0002)
[ 11.812764] kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
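In the log, the device at 0000:61:00.0 is the one whose probe failed, which matches the GPU missing from nvidia-smi. To pull out just the failing PCI addresses, a small filter like the one below may help. It is a sketch: the function name failed_probes is hypothetical, and it assumes the log lines have the same "nvidia: probe of <address> failed with error" form shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the PCI address of every device whose
# NVIDIA driver probe failed, reading kernel log text from stdin.
# Assumption: lines look like
#   "... nvidia: probe of 0000:61:00.0 failed with error -1"
failed_probes() {
    sed -n 's/.*nvidia: probe of \([0-9a-f:.]*\) failed with error.*/\1/p'
}

# Scan the previous boot log used in this post, if it is readable.
if [ -r /var/log/dmesg.0 ]; then
    failed_probes < /var/log/dmesg.0
fi
```

The printed address can then be matched against the lspci listing to identify which physical card is affected.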
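Solution
The log shows that the PCI I/O region assigned to the GPU at 0000:61:00.0 is invalid, so the driver fails to probe that device. Set pci=realloc=off in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub so that the kernel does not reallocate PCI bridge resources, then regenerate the GRUB configuration and reboot. After the edit, the file contains: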
cat /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="pci=realloc=off"
sudo update-grub
reboot
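After the reboot, it is worth confirming that the parameter actually reached the running kernel. The check below is a minimal sketch: the function name has_realloc_off is hypothetical, and it only matches pci=realloc=off as a standalone token, so a comma-combined form such as pci=realloc=off,<other option> would not be detected.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given kernel command line contains
# "pci=realloc=off" as a standalone, space-separated token.
# Simplification: comma-combined pci= options are not recognised.
has_realloc_off() {
    case " $1 " in
        *" pci=realloc=off "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the command line of the currently running kernel.
if [ -r /proc/cmdline ]; then
    if has_realloc_off "$(cat /proc/cmdline)"; then
        echo "pci=realloc=off is active"
    else
        echo "pci=realloc=off is NOT active; re-check /etc/default/grub and update-grub"
    fi
fi
```

If the parameter is active, running nvidia-smi -L again should show whether both GPUs are now visible.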
Reference
NVRM: This PCI I/O region assigned to your NVIDIA device is invalid