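Notes on Installing the Nvidia T4 GPU Driver on an Azure Ubuntu 20.04 VM
This post records the process of getting a GPU working on an Azure Ubuntu 20.04 x64 VM. The first step is installing the Nvidia GPU driver, but sources disagree on exactly which packages are needed. Since I already had a set of install commands that had worked with a GPU before, this post uses those commands as the starting point and records how the problems that came up were solved. As VBird's (鳥哥) tutorial points out, the command dpkg -l 'nvidia*' gives an overview of every Nvidia GPU related package currently installed, so we can use it to see what actually got installed.
Below are the commands that had previously worked for setting up the GPU. Once they finish, nvidia-smi should show the GPU's status, and the nvidia-docker command should be able to run a docker container.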
curl https://get.docker.com | sh
sudo systemctl start docker && sudo systemctl enable docker
packages=(
apt-transport-https
curl
ca-certificates
software-properties-common
python3-pip
python3-venv
)
sudo apt-get -y update
sudo apt-get install -y --no-install-recommends "${packages[@]}"
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo add-apt-repository -y ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt install -y xserver-xorg-video-nvidia-510-server
sudo apt install -y libnvidia-cfg1-510-server
sudo apt-get install -y nvidia-driver-510-server
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
Using dpkg -l 'nvidia*' to record the Nvidia packages installed at this point:
root@fe27aea210c744e9afe9ca458b03806e000000:/home/admin# dpkg -l 'nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===============================-===========================-============-=================================================
un nvidia-384 <none> <none> (no description available)
un nvidia-390 <none> <none> (no description available)
un nvidia-compute-utils <none> <none> (no description available)
ii nvidia-compute-utils-525-server 525.147.05-0ubuntu0.20.04.1 amd64 NVIDIA compute utilities
un nvidia-container-runtime <none> <none> (no description available)
un nvidia-container-runtime-hook <none> <none> (no description available)
ii nvidia-container-toolkit 1.13.5-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.13.5-1 amd64 NVIDIA Container Toolkit Base
ii nvidia-dkms-525-server 525.147.05-0ubuntu0.20.04.1 amd64 NVIDIA DKMS package
un nvidia-dkms-kernel <none> <none> (no description available)
un nvidia-docker <none> <none> (no description available)
ii nvidia-docker2 2.13.0-1 all nvidia-docker CLI wrapper
ii nvidia-driver-510-server 515.105.01-0ubuntu0.20.04.1 amd64 Transitional package for nvidia-driver-515-server
ii nvidia-driver-515-server 525.147.05-0ubuntu0.20.04.1 amd64 Transitional package for nvidia-driver-525-server
ii nvidia-driver-525-server 525.147.05-0ubuntu0.20.04.1 amd64 NVIDIA Server Driver metapackage
un nvidia-driver-binary <none> <none> (no description available)
un nvidia-kernel-common <none> <none> (no description available)
ii nvidia-kernel-common-525-server 525.147.05-0ubuntu0.20.04.1 amd64 Shared files used with the kernel module
un nvidia-kernel-source <none> <none> (no description available)
ii nvidia-kernel-source-525-server 525.147.05-0ubuntu0.20.04.1 amd64 NVIDIA kernel source package
un nvidia-opencl-icd <none> <none> (no description available)
un nvidia-persistenced <none> <none> (no description available)
un nvidia-prime <none> <none> (no description available)
un nvidia-settings <none> <none> (no description available)
un nvidia-smi <none> <none> (no description available)
un nvidia-utils <none> <none> (no description available)
ii nvidia-utils-525-server 525.147.05-0ubuntu0.20.04.1 amd64 NVIDIA Server Driver support binaries
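The full listing is noisy because it also includes packages that dpkg merely knows about (un) alongside the ones that are really installed (ii). As a minimal sketch, the filter below keeps only the ii rows and prints each package name with its version. The function name list_installed is made up for illustration and is not part of dpkg.

```shell
#!/bin/sh
# list_installed: read `dpkg -l` output on stdin and print "name version"
# for rows whose status column is exactly "ii" (installed and configured).
# Hypothetical helper name, for illustration only.
list_installed() {
    awk '$1 == "ii" { print $2, $3 }'
}

# Only query dpkg when it exists, so the snippet is harmless elsewhere.
if command -v dpkg >/dev/null 2>&1; then
    dpkg -l 'nvidia*' 2>/dev/null | list_installed
fi
```

Applied to the listing above, the filtered output would make it easier to see that although nvidia-driver-510-server was requested, it is a transitional package and the 525-server packages at version 525.147.05 are what ended up installed.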
However, running nvidia-smi produced the following error message instead of the expected output:
root@kernel-cudf-build:/home# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Make sure that the latest NVIDIA driver is installed and running.
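A common cause of this message is that the nvidia kernel module is not loaded, even though the packages are installed. One quick check, sketched below, is whether a module named nvidia appears in /proc/modules. The helper name nvidia_module_loaded is hypothetical; it only inspects text in the /proc/modules format and does not explain why loading failed.

```shell
#!/bin/sh
# nvidia_module_loaded: read /proc/modules-style text on stdin and succeed
# (exit 0) only if a module named exactly "nvidia" is listed.
# Hypothetical helper name, for illustration only.
nvidia_module_loaded() {
    awk '$1 == "nvidia" { found = 1 } END { exit !found }'
}

# Run the check only where /proc/modules is available (Linux).
if [ -r /proc/modules ]; then
    if nvidia_module_loaded < /proc/modules; then
        echo "nvidia kernel module is loaded"
    else
        echo "nvidia kernel module is NOT loaded"
    fi
fi
```

If the module is missing even though the dkms package is installed, the kernel log (for example sudo dmesg | grep -i nvidia) is a reasonable next place to look.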
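Searching on this message turns up many different fixes. For example, the first and second linked posts both install nvidia-driver through dkms. After trying that approach, the list of nvidia packages on the VM changed to the one shown below, but nvidia-smi still failed with the same error message.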
root@kernel-cudf-build:/home# dpkg -l 'nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-================================-===========================-============-===============================================
un nvidia-384 <none> <none> (no description available)
un nvidia-390 <none> <none> (no description available)
un nvidia-compute-utils <none> <none> (no description available)
ii nvidia-compute-utils-525 525.147.05-0ubuntu0.20.04.1 amd64 NVIDIA compute utilities
rc nvidia-compute-utils-525-server 525.147.05-0ubuntu0.20.04.1 amd64 NVIDIA compute utilities
un nvidia-container-runtime <none> <none> (no description available)
un nvidia-container-runtime-hook <none> <none> (no description available)
ii nvidia-container-toolkit 1.13.5-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.13.5-1 amd64 NVIDIA Container Toolkit Base
ii nvidia-cuda-dev 10.1.243-3 amd64 NVIDIA CUDA development files
ii nvidia-cuda-doc 10.1.243-3 all NVIDIA CUDA and OpenCL documentation
ii nvidia-cuda-gdb 10.1.243-3 amd64 NVIDIA CUDA Debugger (GDB)
ii nvidia-cuda-toolkit 10.1.243-3 amd64 NVIDIA CUDA development toolkit
ii nvidia-dkms-525 525.147.05-0ubuntu0.20.04.1 amd64 NVIDIA DKMS package
rc nvidia-dkms-525-server 525.147.05-0ubuntu0.20.04.1 amd64 NVIDIA DKMS package
un nvidia-dkms-kernel <none> <none> (no description available)
un nvidia-docker <none> <none> (no description available)
ii nvidia-docker2 2.13.0-1 all nvidia-docker CLI wrapper
un nvidia-driver <none> <none> (no description available)
ii nvidia-driver-525 525.147.05-0ubuntu0.20.04.1 amd64 NVIDIA driver metapackage
un nvidia-driver-binary <none> <none> (no description available)
un nvidia-kernel-common <none> <none> (no description available)
ii nvidia-kernel-common-525 525.147.05-0ubuntu0.20.04.1 amd64 Shared files used with the kernel module
rc nvidia-kernel-common-525-server 525.147.05-0ubuntu0.20.04.1 amd64 Shared files used with the kernel module
un nvidia-kernel-source <none> <none> (no description available)
ii nvidia-kernel-source-525 525.147.05-0ubuntu0.20.04.1 amd64 NVIDIA kernel source package
un nvidia-kernel-source-525-server <none> <none> (no description available)
un nvidia-legacy-304xx-vdpau-driver <none> <none> (no description available)
un nvidia-legacy-340xx-vdpau-driver <none> <none> (no description available)
un nvidia-libopencl1 <none> <none> (no description available)
un nvidia-libopencl1-dev <none> <none> (no description available)
ii nvidia-opencl-dev:amd64 10.1.243-3 amd64 NVIDIA OpenCL development files
un nvidia-opencl-icd <none> <none> (no description available)
un nvidia-persistenced <none> <none> (no description available)
ii nvidia-prime 0.8.16~0.20.04.2 all Tools to enable NVIDIA's Prime
ii nvidia-profiler 10.1.243-3 amd64 NVIDIA Profiler for CUDA and OpenCL
ii nvidia-settings 470.57.01-0ubuntu0.20.04.3 amd64 Tool for configuring the NVIDIA graphics driver
un nvidia-settings-binary <none> <none> (no description available)
un nvidia-smi <none> <none> (no description available)
un nvidia-tesla-418-driver <none> <none> (no description available)
Solution:
After more searching, I found a post in which someone mentioned "I disabled the Secure Boot and it worked pretty fine". Following that lead, I turned off Enable secure boot under Configuration for the Azure VM, and after that the GPU was usable. According to the documentation: "If you’re running certain PC graphics cards, hardware, or operating systems such as Linux or previous version of Windows you may need to disable Secure Boot."
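With Secure Boot enabled, the kernel can refuse to load modules that are not signed with a trusted key, which is a plausible explanation for why the DKMS-built nvidia module never loaded here. As a rough sketch, the state can be checked from inside the VM with mokutil --sb-state, assuming the mokutil package is installed. The helper name sb_state is made up for illustration.

```shell
#!/bin/sh
# sb_state: read the output of `mokutil --sb-state` on stdin and print
# "enabled", "disabled", or "unknown". Hypothetical helper name.
sb_state() {
    case "$(cat)" in
        *"SecureBoot enabled"*)  echo enabled ;;
        *"SecureBoot disabled"*) echo disabled ;;
        *)                       echo unknown ;;
    esac
}

# Assumption: mokutil is installed; skip the check quietly if it is not.
if command -v mokutil >/dev/null 2>&1; then
    mokutil --sb-state 2>&1 | sb_state
fi
```

Running this before and after changing the VM setting gives a quick confirmation that the change actually took effect inside the guest.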
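Other references:
Along the way I also tried running nvidia-docker directly to start a container, with nvidia-docker run -it image:tag. At that point the error message changed to an initialization failure reporting that nvidia-container-toolkit was missing. I looked at the linked page, but it was not much help.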