Problem
As the title says: on an Alibaba Cloud GPU server, a container was created from an image that had previously run without issues, but the GPU driver inside the container stopped working. Running nvidia-smi reports: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Attempting to reinstall the driver with the official .run installer fails with: ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if a driver such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA graphics device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.
Going by the error message, I suspected a kernel-version or gcc-version mismatch. I switched between several kernel versions and gcc versions and tried many of the version-related fixes found online, but none of them worked, and I was stuck.
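The checks implied by the installer's error message can be sketched as follows (standard Linux tooling; exact output varies by distribution, and these commands only diagnose, they do not fix anything):

```shell
# Compare the running kernel with the gcc on PATH; /proc/version records
# the gcc version the kernel was built with, and a mismatch between the
# two is one common cause of "Unable to load the kernel module 'nvidia.ko'".
uname -r                  # running kernel release
cat /proc/version         # kernel build string, including its gcc version
gcc --version | head -n 1 # gcc currently on PATH

# The installer also blames conflicting drivers such as nouveau;
# check whether it is currently loaded.
lsmod | grep -i nouveau || echo "nouveau not loaded"

# Recent kernel messages often show exactly why nvidia.ko failed to load.
dmesg 2>/dev/null | grep -i nvidia | tail -n 20
```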
I gave up on the original image and created a fresh, empty container, but the empty container reported the same NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver error, and, surprisingly, the driver could not be installed in the empty container either. That pointed to a problem with the container itself.
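The same isolation step can be sketched generically (assuming a Docker-based setup with the NVIDIA Container Toolkit installed; the image tag is illustrative): run nvidia-smi in a known-good minimal container. If it works there, the host driver and GPU passthrough are fine and the problem lies in the original container's configuration.

```shell
# Guarded so the script is harmless on machines without Docker, and the
# trailing || keeps the exit status clean when the GPU is unreachable.
if command -v docker >/dev/null 2>&1; then
  docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi \
    || echo "GPU not reachable from a fresh container"
else
  echo "docker not installed"
fi
```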
Solution
It turned out the container's own configuration was the issue: with the instance type set to GPU Compute (GPU计算型), the driver installed normally, but with GPU Compute Visualization (GPU计算可视化型) it failed with the errors above.
After contacting Alibaba Cloud, I learned that GPU Compute Visualization instances require a specific compatible driver obtained by submitting a support ticket; only GPU Compute instances can use the driver downloaded from the official NVIDIA site. After obtaining the compatible driver through a ticket, it installed normally and the problem was solved.
Reflection
If even a fresh, empty container on a cloud server cannot install the driver, don't keep fighting it yourself: the container itself is most likely misconfigured, so contact your cloud provider.