Notes on Installing the Nvidia T4 GPU Driver on an Azure Ubuntu 20.04 VM

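This post records how we got a GPU working on an Azure Ubuntu 20.04 x64 VM. The first step is installing the Nvidia GPU driver, but sources disagree on exactly which packages are required. Since we already had a set of install commands that had worked for GPU use before, this post starts from those commands and documents how we solved the problems we ran into. As the 鳥哥 (VBird) tutorial teaches, dpkg -l 'nvidia*' lists every Nvidia-related package that dpkg knows about along with its install status, so we use it throughout to see what actually got installed.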
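
Because that listing also includes packages that were never installed (status un), it can help to filter it down to the installed ones. A minimal sketch, assuming the standard dpkg -l column layout (status, name, version); the function name installed_only is made up for illustration:

```shell
# Print "name version" for every package whose dpkg status is "ii"
# (desired=install, status=installed). Reads `dpkg -l` output on stdin.
installed_only() {
  awk '$1 == "ii" { print $2, $3 }'
}

# Usage (on the VM):
#   dpkg -l 'nvidia*' | installed_only
```

The header lines and the un rows fall away because their first field is not ii.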

The commands below are the ones that previously worked. Normally, once they finish, nvidia-smi shows the GPU's status, and the nvidia-docker command can run a Docker container.

# Install Docker and make sure it starts on boot
curl https://get.docker.com | sh
sudo systemctl start docker && sudo systemctl enable docker
# Base packages
packages=(
  apt-transport-https
  curl
  ca-certificates
  software-properties-common
  python3-pip
  python3-venv
)
sudo apt-get -y update
sudo apt-get install -y --no-install-recommends "${packages[@]}"
# Add the nvidia-docker apt repository for this distribution (e.g. ubuntu20.04)
distribution=$(. /etc/os-release; echo "$ID$VERSION_ID")
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L "https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list" | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Add the graphics-drivers PPA (-y skips the confirmation prompt)
sudo add-apt-repository -y ppa:graphics-drivers/ppa
sudo apt-get update
# Install the 510-server driver packages, then nvidia-docker2
sudo apt-get install -y xserver-xorg-video-nvidia-510-server
sudo apt-get install -y libnvidia-cfg1-510-server
sudo apt-get install -y nvidia-driver-510-server
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
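
One thing worth checking after these commands is whether Docker knows about the nvidia runtime. The sketch below assumes that nvidia-docker2 registers the runtime in Docker's usual config file, /etc/docker/daemon.json, which is our understanding of the package and not something verified in this post; the helper name has_nvidia_runtime is hypothetical:

```shell
# Return 0 if the given Docker daemon config mentions an "nvidia" runtime.
# The path defaults to /etc/docker/daemon.json (an assumption about where
# nvidia-docker2 writes its runtime entry); pass another file to override.
has_nvidia_runtime() {
  local cfg="${1:-/etc/docker/daemon.json}"
  [ -r "$cfg" ] && grep -q '"nvidia"' "$cfg"
}

# Usage (on the VM):
#   has_nvidia_runtime && echo "nvidia runtime registered" || echo "not registered"
```

This is only a text match on the config file, so it says nothing about whether the driver itself works.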

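We recorded the Nvidia packages installed at this point with dpkg -l 'nvidia*'. Note that nvidia-driver-510-server is now a transitional package, so the driver that actually got installed is 525-server: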

root@fe27aea210c744e9afe9ca458b03806e000000:/home/admin# dpkg -l 'nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                            Version                     Architecture Description
+++-===============================-===========================-============-=================================================
un  nvidia-384                      <none>                      <none>       (no description available)
un  nvidia-390                      <none>                      <none>       (no description available)
un  nvidia-compute-utils            <none>                      <none>       (no description available)
ii  nvidia-compute-utils-525-server 525.147.05-0ubuntu0.20.04.1 amd64        NVIDIA compute utilities
un  nvidia-container-runtime        <none>                      <none>       (no description available)
un  nvidia-container-runtime-hook   <none>                      <none>       (no description available)
ii  nvidia-container-toolkit        1.13.5-1                    amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base   1.13.5-1                    amd64        NVIDIA Container Toolkit Base
ii  nvidia-dkms-525-server          525.147.05-0ubuntu0.20.04.1 amd64        NVIDIA DKMS package
un  nvidia-dkms-kernel              <none>                      <none>       (no description available)
un  nvidia-docker                   <none>                      <none>       (no description available)
ii  nvidia-docker2                  2.13.0-1                    all          nvidia-docker CLI wrapper
ii  nvidia-driver-510-server        515.105.01-0ubuntu0.20.04.1 amd64        Transitional package for nvidia-driver-515-server
ii  nvidia-driver-515-server        525.147.05-0ubuntu0.20.04.1 amd64        Transitional package for nvidia-driver-525-server
ii  nvidia-driver-525-server        525.147.05-0ubuntu0.20.04.1 amd64        NVIDIA Server Driver metapackage
un  nvidia-driver-binary            <none>                      <none>       (no description available)
un  nvidia-kernel-common            <none>                      <none>       (no description available)
ii  nvidia-kernel-common-525-server 525.147.05-0ubuntu0.20.04.1 amd64        Shared files used with the kernel module
un  nvidia-kernel-source            <none>                      <none>       (no description available)
ii  nvidia-kernel-source-525-server 525.147.05-0ubuntu0.20.04.1 amd64        NVIDIA kernel source package
un  nvidia-opencl-icd               <none>                      <none>       (no description available)
un  nvidia-persistenced             <none>                      <none>       (no description available)
un  nvidia-prime                    <none>                      <none>       (no description available)
un  nvidia-settings                 <none>                      <none>       (no description available)
un  nvidia-smi                      <none>                      <none>       (no description available)
un  nvidia-utils                    <none>                      <none>       (no description available)
ii  nvidia-utils-525-server         525.147.05-0ubuntu0.20.04.1 amd64        NVIDIA Server Driver support binaries

However, running nvidia-smi produced the following error instead of the expected output:

root@kernel-cudf-build:/home# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. 
Make sure that the latest NVIDIA driver is installed and running.
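
This message often means that the user-space tool is present but the kernel module is not loaded. One quick check, sketched under the assumption that the module is named nvidia as in the stock driver packages, is to look for it in /proc/modules; the helper name nvidia_module_loaded is hypothetical:

```shell
# Return 0 if a module named exactly "nvidia" appears in the module list.
# The list defaults to /proc/modules; another file can be passed for testing.
nvidia_module_loaded() {
  local modules="${1:-/proc/modules}"
  grep -q '^nvidia ' "$modules"
}

# Usage (on the VM):
#   nvidia_module_loaded && echo "module loaded" || echo "module NOT loaded"
```

If the module is missing, dkms status and dmesg | grep -i nvidia are the usual next places to look.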

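Searching on this message turns up many different fixes. For example, link 1 and link 2 both install nvidia-driver through dkms. After we tried that approach, the list of Nvidia packages on the VM became the following, but nvidia-smi still printed the same error.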

root@kernel-cudf-build:/home# dpkg -l 'nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                             Version                     Architecture Description
+++-================================-===========================-============-===============================================
un  nvidia-384                       <none>                      <none>       (no description available)
un  nvidia-390                       <none>                      <none>       (no description available)
un  nvidia-compute-utils             <none>                      <none>       (no description available)
ii  nvidia-compute-utils-525         525.147.05-0ubuntu0.20.04.1 amd64        NVIDIA compute utilities
rc  nvidia-compute-utils-525-server  525.147.05-0ubuntu0.20.04.1 amd64        NVIDIA compute utilities
un  nvidia-container-runtime         <none>                      <none>       (no description available)
un  nvidia-container-runtime-hook    <none>                      <none>       (no description available)
ii  nvidia-container-toolkit         1.13.5-1                    amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base    1.13.5-1                    amd64        NVIDIA Container Toolkit Base
ii  nvidia-cuda-dev                  10.1.243-3                  amd64        NVIDIA CUDA development files
ii  nvidia-cuda-doc                  10.1.243-3                  all          NVIDIA CUDA and OpenCL documentation
ii  nvidia-cuda-gdb                  10.1.243-3                  amd64        NVIDIA CUDA Debugger (GDB)
ii  nvidia-cuda-toolkit              10.1.243-3                  amd64        NVIDIA CUDA development toolkit
ii  nvidia-dkms-525                  525.147.05-0ubuntu0.20.04.1 amd64        NVIDIA DKMS package
rc  nvidia-dkms-525-server           525.147.05-0ubuntu0.20.04.1 amd64        NVIDIA DKMS package
un  nvidia-dkms-kernel               <none>                      <none>       (no description available)
un  nvidia-docker                    <none>                      <none>       (no description available)
ii  nvidia-docker2                   2.13.0-1                    all          nvidia-docker CLI wrapper
un  nvidia-driver                    <none>                      <none>       (no description available)
ii  nvidia-driver-525                525.147.05-0ubuntu0.20.04.1 amd64        NVIDIA driver metapackage
un  nvidia-driver-binary             <none>                      <none>       (no description available)
un  nvidia-kernel-common             <none>                      <none>       (no description available)
ii  nvidia-kernel-common-525         525.147.05-0ubuntu0.20.04.1 amd64        Shared files used with the kernel module
rc  nvidia-kernel-common-525-server  525.147.05-0ubuntu0.20.04.1 amd64        Shared files used with the kernel module
un  nvidia-kernel-source             <none>                      <none>       (no description available)
ii  nvidia-kernel-source-525         525.147.05-0ubuntu0.20.04.1 amd64        NVIDIA kernel source package
un  nvidia-kernel-source-525-server  <none>                      <none>       (no description available)
un  nvidia-legacy-304xx-vdpau-driver <none>                      <none>       (no description available)
un  nvidia-legacy-340xx-vdpau-driver <none>                      <none>       (no description available)
un  nvidia-libopencl1                <none>                      <none>       (no description available)
un  nvidia-libopencl1-dev            <none>                      <none>       (no description available)
ii  nvidia-opencl-dev:amd64          10.1.243-3                  amd64        NVIDIA OpenCL development files
un  nvidia-opencl-icd                <none>                      <none>       (no description available)
un  nvidia-persistenced              <none>                      <none>       (no description available)
ii  nvidia-prime                     0.8.16~0.20.04.2            all          Tools to enable NVIDIA's Prime
ii  nvidia-profiler                  10.1.243-3                  amd64        NVIDIA Profiler for CUDA and OpenCL
ii  nvidia-settings                  470.57.01-0ubuntu0.20.04.3  amd64        Tool for configuring the NVIDIA graphics driver
un  nvidia-settings-binary           <none>                      <none>       (no description available)
un  nvidia-smi                       <none>                      <none>       (no description available)
un  nvidia-tesla-418-driver          <none>                      <none>       (no description available)
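
In the listing above, the rc rows are packages that were removed but left their configuration files behind (here the -server variants). A small sketch for collecting those names from dpkg -l output, assuming the same column layout; residual_packages is an illustrative name:

```shell
# Print the names of packages in state "rc" (removed, config files remain).
# Reads `dpkg -l` output on stdin.
residual_packages() {
  awk '$1 == "rc" { print $2 }'
}

# Usage (on the VM), purging the leftovers in one go:
#   dpkg -l 'nvidia*' | residual_packages | xargs -r sudo apt-get purge -y
```

This is housekeeping and not a fix for the nvidia-smi error.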

Solution:

After more searching, we found a post that mentions "I disabled the Secure Boot and it worked pretty fine". We followed suit and turned off Enable secure boot under Configuration for the Azure VM, and after that the GPU worked. According to the documentation: "If you're running certain PC graphics cards, hardware, or operating systems such as Linux or previous version of Windows you may need to disable Secure Boot."
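
Secure Boot matters here because, with it enabled, the kernel typically refuses to load modules that are not signed with an enrolled key, and a DKMS-built nvidia module is usually in that situation. Whether it is on can be checked from inside the VM. A minimal sketch, assuming mokutil is installed and prints a line such as "SecureBoot enabled"; the wrapper name secure_boot_state is made up:

```shell
# Classify the text printed by `mokutil --sb-state` (read from stdin).
# Prints "enabled", "disabled", or "unknown".
secure_boot_state() {
  local text
  text=$(cat)
  case "$text" in
    *"SecureBoot enabled"*)  echo enabled ;;
    *"SecureBoot disabled"*) echo disabled ;;
    *)                       echo unknown ;;
  esac
}

# Usage (on the VM):
#   mokutil --sb-state | secure_boot_state
```

Anything the wrapper does not recognize, such as a message about missing EFI variables, is reported as unknown.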

Other references:

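Along the way we also tried running nvidia-docker directly to start a container (nvidia-docker run -it image:tag). At that point the error changed to an initialization failure complaining that nvidia-container-toolkit was missing. We looked at a link on this, but it was not much help.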

Using root privileges with a GitHub self-hosted runner