Installing the NVIDIA Driver & CUDA on Manjaro

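With the basic Manjaro environment sorted out, the next step is setting up the GTX 1660 Super for the compute tasks ahead... and, well, that ate up quite a bit of time again, so I'd better write down the process and the problems I hit.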

Manjaro freezing at boot screen after NVIDIA driver installed

Install the non-free driver through Manjaro's hardware detection tool:

$ sudo mhwd -a pci nonfree 0300

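After rebooting, great, the screen just sat there black and I couldn't reach the login screen (speechless). It turns out that after installing the NVIDIA driver you also have to edit the hardware configuration by hand; otherwise the display server misbehaves and the system looks like it has hung. So the right way to install the NVIDIA driver is to edit the config file first and reboot afterwards. To recover, press Ctrl + Alt + F3 at the black screen to switch to a terminal: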

Look up the GPU BusID

$ lspci | grep -E "VGA|3D"

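The first three numbers in the output form the BusID (drop the leading zeros); for example, "01:00.0" gives the BusID "1:0:0". Note that lspci prints these numbers in hexadecimal while Xorg expects decimal, so a bus such as "0a" has to be written as "10".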
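The conversion is easy to get wrong once the bus number goes past 9, so here is a minimal sketch of it in Python. The function name lspci_to_xorg_busid is my own invention for illustration, not part of any tool mentioned in this post; it only assumes the usual lspci address layout of bus:device.function, optionally prefixed by a four-digit domain.

```python
import re

def lspci_to_xorg_busid(address: str) -> str:
    """Convert an lspci address such as '01:00.0' to an Xorg BusID string."""
    # lspci prints bus, device and function in hexadecimal; an optional
    # 4-digit domain prefix (e.g. '0000:') is accepted and ignored here.
    pattern = r"(?:[0-9a-fA-F]{4}:)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])"
    match = re.fullmatch(pattern, address.strip())
    if match is None:
        raise ValueError(f"unrecognized lspci address: {address!r}")
    # Xorg wants the same three numbers in decimal, without leading zeros.
    bus, device, function = (int(part, 16) for part in match.groups())
    return f"PCI:{bus}:{device}:{function}"

print(lspci_to_xorg_busid("01:00.0"))  # PCI:1:0:0
print(lspci_to_xorg_busid("0a:00.0"))  # PCI:10:0:0
```

The returned string can be pasted directly as the BusID value in the Device section.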

Back up the existing config

$ sudo mv /etc/X11/xorg.conf.d/90-mhwd.conf /etc/X11/xorg.conf.d/90-mhwd.conf.bak

Write the following to /etc/X11/xorg.conf.d/90-mhwd.conf, changing BusID to that of the GPU you want to configure; I set mine to the integrated AMD GPU.

Section "Module"
    Load "modesetting"
EndSection

Section "Device"
    Identifier "nvidia"
    Driver "nvidia"
    BusID "PCI:1:0:0"
    Option "AllowEmptyInitialConfiguration"
EndSection

Once the file is saved, reboot.

Reference: https://blog.csdn.net/baidu_33340703/article/details/103977592

CUDA & cuDNN

Before installing CUDA, confirm that the NVIDIA driver is installed correctly:

$ nvidia-smi
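If you would rather run this check from a script, something like the following sketch might do. It relies on nvidia-smi's --query-gpu and --format=csv,noheader options; the helper names parse_gpu_rows and detect_gpus are hypothetical, and the parser assumes the GPU name itself contains no comma.

```python
import shutil
import subprocess

# Ask nvidia-smi for one CSV row per GPU: name, driver version, total memory.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]

def parse_gpu_rows(csv_text):
    """Turn the CSV text into a list of dicts, one per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, driver, memory = (field.strip() for field in line.split(","))
        gpus.append({"name": name, "driver": driver, "memory": memory})
    return gpus

def detect_gpus():
    """Return GPU info, or an empty list if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_rows(result.stdout)

if __name__ == "__main__":
    print(detect_gpus() or "No working NVIDIA driver detected")
```

An empty list means either the nvidia-smi binary is not on the PATH or the driver is not loaded, which is the cue to revisit the driver setup before touching CUDA.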

Install CUDA, cuDNN, and the Python libraries needed later:

$ sudo pacman -Syu tensorflow-cuda cuda cudnn python-pycuda python-tensorflow-cuda python-matplotlib

* To avoid conflicts with the versions provided by the pacman repositories, Arch/Manjaro dropped tensorflow-gpu from its pip packages and replaced it with tensorflow-cuda.

Copy the samples from the CUDA install directory into your home directory, compile them, and test whether CUDA was installed successfully:

$ cp -r /opt/cuda/samples ~
$ cd ~/samples
$ sudo make -k

Compilation takes a while, roughly 30 minutes. Once it finishes, run deviceQuery:

$ cd ~/samples/1_Utilities/deviceQuery
$ ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1660 SUPER"
 CUDA Driver Version / Runtime Version          10.2 / 10.2
 CUDA Capability Major/Minor version number:    7.5
 Total amount of global memory:                 5945 MBytes (6233391104 bytes)
 (22) Multiprocessors, ( 64) CUDA Cores/MP:     1408 CUDA Cores
 
(omitted...)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS  # PASS means CUDA was installed successfully
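When the output is long, it can be handy to pull out just the verdict. Here is a small sketch that reads the last two lines shown above; the function name summarize_device_query is made up for illustration, and the regular expressions assume the output keeps the "Result = ..." and "NumDevs = ..." wording seen in this run.

```python
import re

def summarize_device_query(output):
    """Extract the pass/fail status and device count from deviceQuery output."""
    result = re.search(r"^Result = (\w+)", output, flags=re.MULTILINE)
    devices = re.search(r"NumDevs = (\d+)", output)
    return {
        "passed": result is not None and result.group(1) == "PASS",
        "num_devices": int(devices.group(1)) if devices else 0,
    }

# The final two lines of the run shown above, used as sample input.
sample = """deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS"""

print(summarize_device_query(sample))
```

To use it on a real run, capture the output of ./deviceQuery (for example with subprocess.run) and pass the text to the function.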

Test

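As a test I ran MNIST handwritten digit recognition, and each batch finished in 1 second, though in reality it probably took less than 1 second XD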

Keras could not create cudnn handle: cudnn_status_alloc_failed

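Oh well. Just when I thought everything was configured and hurried back to the CIFAR-10 model training experiment I had shelved for half a month (Google CoLab's free resources are so popular that sessions kept disconnecting and wasting my work, so I had to build my own compute environment 😂), a rather impressive-looking error message showed up (speechless x2). After some searching, it turned out to be a GPU memory allocation issue: to avoid memory fragmentation, TensorFlow by default maps nearly all of the visible GPU memory to the current process: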

By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to more efficiently use the relatively precious GPU memory resources on the devices by reducing memory fragmentation.

As a result, even a workload the size of CIFAR-10 ran out of memory and raised the error, so we need to have TensorFlow allocate GPU memory on demand:

import tensorflow as tf

config = tf.compat.v1.ConfigProto()  # tensorflow-gpu 2.1.0
config.gpu_options.allow_growth = True  # grow GPU memory on demand instead of grabbing it all
tf.compat.v1.keras.backend.set_session(tf.compat.v1.Session(config=config))

Solved! Each batch now runs 10 seconds faster than it did for me on CoLab, so the money wasn't wasted QQ
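For the record, TensorFlow 2.x also offers ways to request on-demand allocation without going through the compat.v1 session API. The sketch below is not what I ran for the numbers above; it shows the TF_FORCE_GPU_ALLOW_GROWTH environment variable and tf.config.experimental.set_memory_growth, and the wrapper name enable_memory_growth is my own.

```python
import os

# Option 1: environment variable; it must be set before TensorFlow
# initializes the GPU, so do it before importing tensorflow.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

def enable_memory_growth():
    """Option 2: enable on-demand GPU memory through the TF 2.x config API.

    Returns the number of GPUs configured. Call it before any GPU work
    starts, because TensorFlow raises a RuntimeError if the devices have
    already been initialized.
    """
    import tensorflow as tf  # imported lazily so option 1 takes effect first

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)

# Usage, at the very top of a training script:
#   enable_memory_growth()
```

Either option alone should be enough; the environment variable is convenient when you cannot edit the training script itself.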
