CUDA Installation Tutorial
Folder organization
cuda_cudnn/
    cuda/
        installer packages for each CUDA version
    cudnn/
        cuDNN installer packages matching each CUDA version
    README.md
How cuDNN relates to CUDA
cuda: the GPU-computing acceleration platform, usable for rendering, neural networks, and more
cudnn: a neural-network acceleration library built on top of CUDA; it heavily optimizes neural-network computation on the GPU, giving more than a 2x speedup over plain CUDA
nccl: multi-GPU communication acceleration, which speeds up multi-GPU neural-network training
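To see which of these are already present on a machine, the quick checks below can help; this is only a sketch, assuming the default /usr/local/cuda install path for the cuDNN header and an existing PyTorch install for the NCCL line:
nvcc -V                                                            # CUDA toolkit version
grep -A 2 "#define CUDNN_MAJOR" /usr/local/cuda/include/cudnn.h    # cuDNN version (cuDNN 7.x header layout)
python -c 'import torch; print(torch.cuda.nccl.version())'         # NCCL version as seen by PyTorch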
Installing directly with apt
Manual installation tutorial
References:
- http://www.rignitc.com/2018/12/29/install-cuda-10-with-ubuntu-16-04/
- https://blog.csdn.net/qq_32408773/article/details/84112166
- https://zhuanlan.zhihu.com/p/47330858
- https://blog.csdn.net/qq_38231807/article/details/83780336
Tips for downloading installers from the NVIDIA website
- You need a VPN in global-proxy mode; otherwise the download links/buttons do not work.
- Option 1: log in and download with a local browser (the w3m terminal browser does not work), then copy the file to the server.
- Option 2: download directly onto the server (see the sketch after this list):
  - You cannot simply copy the download button's link from the page and wget it on the server; that download will fail.
  - You must open the download page in a (local) browser (again, w3m does not work) and log in to your account.
  - Click the download button on the page so the download starts, then copy the link from the browser's download bar.
  - On the server run wget [that link]. The downloaded file's name ends in a long token, e.g. 'cudnn-10.1-linux-x64-v7.6.3.30.tgz?ziMqZ3giGG1v5v90de6Of-_NpBtGVRLIR4O7hkSQ4Hu5RE_Qr-qxE98NFILK6B89iL1xitgZGQMy1ZH_o3ayiKsoYVbK1K3GmYUbNkFKUTn-jDCpEv726d61fCYT5SC6rI17tKt8hDVHxC-4zDH5XJtEjSovwJn5obx_04zS72ohX8HvmNEI8MxqTrq97Jq55krHcKU5l'
  - Just rename it: mv cudnn-10.1-linux-x64-v7.6.3.30.tgz?xxxxxxx cudnn-10.1-linux-x64-v7.6.3.30.tgz
  - Afterwards you must extract it with tar -xzvf cudnn-x.x-linux-x64-vxx.tgz; extracting with other tools such as 7z yields a single file instead of a folder.
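A minimal sketch of the server-side part of this flow; the URL and its token below are purely hypothetical placeholders for whatever you copied from the browser's download bar:
wget 'https://developer.nvidia.com/.../cudnn-10.1-linux-x64-v7.6.3.30.tgz?AUTH_TOKEN'     # hypothetical link copied from the browser
mv 'cudnn-10.1-linux-x64-v7.6.3.30.tgz?AUTH_TOKEN' cudnn-10.1-linux-x64-v7.6.3.30.tgz     # strip the ?token suffix from the filename
tar -xzvf cudnn-10.1-linux-x64-v7.6.3.30.tgz                                              # gives a cuda/ folder, not a single file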
Installing CUDA
sudo sh cuda_x.x.x_xxx.xx_linux.run
Press Ctrl+C to jump to the end of the license text (or scroll through it), then answer the prompts as follows:
Do you accept the previously read EULA?
accept/decline/quit: accept
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81?
(y)es/(n)o/(q)uit: n        # n, because the NVIDIA driver is already installed
Install the CUDA 9.0 Toolkit?
(y)es/(n)o/(q)uit: y
Enter Toolkit Location
[ default is /usr/local/cuda-9.0 ]:        # just press Enter
Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y
Install the CUDA 9.0 Samples?
(y)es/(n)o/(q)uit: y
Enter CUDA Samples Location
[ default is /home/xxxx ]:        # just press Enter
Installing the CUDA Toolkit in /usr/local/cuda-9.0 ...
When the installation finishes, it prints:
Installing the CUDA Toolkit in /usr/local/cuda-9.0 ...
Installing the CUDA Samples in /home/haoyu ...
Copying samples to /home/haoyu/NVIDIA_CUDA-9.0_Samples now...
Finished copying samples.

===========
= Summary =
===========

Driver:   Not Selected        # because we answered n to "Install NVIDIA Accelerated Graphics Driver" above
Toolkit:  Installed in /usr/local/cuda-9.0
Samples:  Installed in /home/haoyu

Please make sure that
 -   PATH includes /usr/local/cuda-9.0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-9.0/lib64, or, add /usr/local/cuda-9.0/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.0/bin

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-9.0/doc/pdf for detailed information on setting up CUDA.

***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 384.00 is required for CUDA 9.0 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run -silent -driver

Logfile is /tmp/cuda_install_51161.log
Signal caught, cleaning up
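One way to satisfy the PATH / LD_LIBRARY_PATH requirement from the summary above is to append the exports to your shell profile; a minimal sketch, assuming a bash user and CUDA 9.0 in the default location:
echo 'export PATH=/usr/local/cuda-9.0/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc        # reload the profile in the current shell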
Testing CUDA
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make clean && sudo make
./deviceQuery
Output:
> Peer access from GeForce GTX TITAN X (GPU0) -> GeForce GTX TITAN X (GPU1) : Yes
> Peer access from GeForce GTX TITAN X (GPU0) -> GeForce GTX TITAN X (GPU2) : No
> Peer access from GeForce GTX TITAN X (GPU0) -> GeForce GTX TITAN X (GPU3) : No
> Peer access from GeForce GTX TITAN X (GPU1) -> GeForce GTX TITAN X (GPU0) : Yes
> Peer access from GeForce GTX TITAN X (GPU1) -> GeForce GTX TITAN X (GPU2) : No
> Peer access from GeForce GTX TITAN X (GPU1) -> GeForce GTX TITAN X (GPU3) : No
> Peer access from GeForce GTX TITAN X (GPU2) -> GeForce GTX TITAN X (GPU0) : No
> Peer access from GeForce GTX TITAN X (GPU2) -> GeForce GTX TITAN X (GPU1) : No
> Peer access from GeForce GTX TITAN X (GPU2) -> GeForce GTX TITAN X (GPU3) : Yes
> Peer access from GeForce GTX TITAN X (GPU3) -> GeForce GTX TITAN X (GPU0) : No
> Peer access from GeForce GTX TITAN X (GPU3) -> GeForce GTX TITAN X (GPU1) : No
> Peer access from GeForce GTX TITAN X (GPU3) -> GeForce GTX TITAN X (GPU2) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 9.0, NumDevs = 4
If the last line of the output contains Result = PASS, the test succeeded.
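A quick way to check just that line (a small convenience check, not part of the sample itself):
./deviceQuery | grep "Result = PASS" && echo "CUDA runtime OK"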
Installing cuDNN
Reference:
- You need to register an NVIDIA account before you can enter the download page.
Method 1: install with dpkg
Download [cuDNN Runtime Library for UbuntuXX.X (Deb)], then run
sudo dpkg -i libcudnnX_X.X.X.XX-X+cudaX.X_amd64.deb
and the installation is done.
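To confirm the package is actually registered with dpkg, one simple check is:
dpkg -l | grep -i cudnn        # the runtime library package should be listed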
Method 2: install by copying files
- On the download page pick
  Download cuDNN vx.x.x (Month Day, Year), for CUDA x.x
  and download:
  - cuDNN vx.x.x Library for Linux -- the cuDNN shared-library files
  - cuDNN vx.x.x Developer Library for Ubuntu16.04 (Deb) -- the cuDNN test code
- The download is a .tgz archive; extract it with
  tar -xzvf cudnn-x.x-linux-x64-vxx.tgz
  and cd into the extracted folder.
Install cuDNN by copying the files into the CUDA tree, then fix the permissions:
CUDA_TO_INSTALL=cuda-x.x        # set the target CUDA version
sudo cp cuda/include/cudnn.h /usr/local/$CUDA_TO_INSTALL/include/
sudo cp cuda/lib64/libcudnn* /usr/local/$CUDA_TO_INSTALL/lib64/
sudo chmod a+r /usr/local/$CUDA_TO_INSTALL/include/cudnn.h
sudo chmod a+r /usr/local/$CUDA_TO_INSTALL/lib64/libcudnn*
sudo ldconfig /usr/local/$CUDA_TO_INSTALL/lib64
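To verify which cuDNN version the copied headers declare (for cuDNN 7.x the version macros live in cudnn.h; newer releases moved them to cudnn_version.h):
grep -A 2 "#define CUDNN_MAJOR" /usr/local/$CUDA_TO_INSTALL/include/cudnn.h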
Testing cuDNN
Get the test code (the cuDNN v7.0 Code Samples and User Guide for Ubuntu16.04 (Deb) package):
- Method 1: download it from the official site; see [Installing cuDNN] - [cuDNN test code] above for how to obtain it, then install the package:
sudo dpkg -i cudnn7.1.4_for_cuda9.0_sample_and_doc.deb
Output:
(Reading database ... 263943 files and directories currently installed.)
Preparing to unpack cudnn7.1.4_for_cuda9.0_sample_and_doc.deb ...
Unpacking libcudnn7-doc (7.1.4.18-1+cuda9.0) over (7.1.4.18-1+cuda9.0) ...
dpkg: dependency problems prevent configuration of libcudnn7-doc:
 libcudnn7-doc depends on libcudnn7-dev; however:
  Package libcudnn7-dev is not installed.
dpkg: error processing package libcudnn7-doc (--install):
 dependency problems - leaving unconfigured
Errors were encountered while processing:
 libcudnn7-doc
Although dpkg reports an error above, it does not matter: /usr/src/cudnn_samples_v7/ has already been created and the tests can be run.
Copy the cuDNN sample to a writable path.
sudo cp -r /usr/src/cudnn_samples_v7/ /home/$USER
- Method 2: use this repository's ./cudnn/cudnn_samples_v7 directory, which works for any cudnn7.x.x_for_cudax.x.
Copy the cuDNN sample to a writable path.
cp -r ./cudnn/cudnn_samples_v7/ /home/$USER
Run the test
Go to the writable path.
cd /home/$USER/cudnn_samples_v7/mnistCUDNN
Compile the mnistCUDNN sample.
sudo make clean && sudo make
Run the mnistCUDNN sample
./mnistCUDNN
Test passed!

Testing half precision (math in single precision)
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.024896 time requiring 3464 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.030912 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.039712 time requiring 28800 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.098592 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.149504 time requiring 203008 memory
Resulting weights from Softmax: 0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001
Loading image data/three_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax: 0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000714 0.0000000 0.0000000 0.0000000 0.0000000
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax: 0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006

Result of classification: 1 3 5
Test passed!
Output like this means the cuDNN test succeeded.
Installing NCCL
NCCL (the NVIDIA Collective Communications Library) implements multi-GPU collective-communication routines; NVIDIA has optimized it heavily to achieve high communication speed over PCIe, NVLink, and InfiniBand.
- Visit the official download page, register and log in, pick the package that matches your CUDA version and OS, and download it.
  - The local NCCL repository package must be downloaded with a browser after logging in (w3m does not work) and then copied to the server; copying the download button's link and running wget on the server does not work.
  - The network repository package can be fetched directly: copy the link behind the download button on the page and get it with wget.
- Install the repository.
  - For the local NCCL repository:
    sudo dpkg -i nccl-repo-<version>.deb
  - For the network repository:
    sudo dpkg -i nvidia-machine-learning-repo-<version>.deb
- Update the APT database:
  sudo apt update
- Install the libnccl2 package with APT. Additionally, if you need to compile applications with NCCL, install the libnccl-dev package as well.
  - If you are using the network repository, the following command will also upgrade CUDA to the latest version:
    sudo apt install libnccl2 libnccl-dev
  - To install NCCL for a CUDA version other than the latest one, pin the package versions explicitly (a command to list the available versions follows this list), e.g.:
    sudo apt install libnccl2=2.0.0-1+cuda8.0 libnccl-dev=2.0.0-1+cuda8.0
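To see which NCCL builds the repository offers for each CUDA version (so you can pick the right pin), one option is:
apt-cache madison libnccl2        # lists every available libnccl2 version in the configured repositories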
Testing NCCL
Check the version:
python -c 'import torch; print(torch.cuda.nccl.version())'
Dedicated test programs
These tests check both the performance and the correctness of NCCL operations.
Download:
git clone https://github.com/NVIDIA/nccl-tests.git
Build:
cd nccl-tests
make clean
make        # (this is what we run on our servers)
If CUDA is not installed in /usr/local/cuda, you may specify CUDA_HOME. Similarly, if NCCL is not installed in /usr, you may specify NCCL_HOME.
make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
NCCL tests rely on MPI to work on multiple processes, hence multiple nodes. If you want to compile the tests with MPI support, you need to set MPI=1 and set MPI_HOME to the path where MPI is installed. (Not needed on our servers.)
make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
On success, make prints:
make -C src build
make[1]: Entering directory '/home/haoyu/nccl-tests/src'
Compiling all_reduce.cu > ../build/all_reduce.o
Compiling common.cu > ../build/common.o
Linking ../build/all_reduce.o > ../build/all_reduce_perf
Compiling all_gather.cu > ../build/all_gather.o
Linking ../build/all_gather.o > ../build/all_gather_perf
Compiling broadcast.cu > ../build/broadcast.o
Linking ../build/broadcast.o > ../build/broadcast_perf
Compiling reduce_scatter.cu > ../build/reduce_scatter.o
Linking ../build/reduce_scatter.o > ../build/reduce_scatter_perf
Compiling reduce.cu > ../build/reduce.o
Linking ../build/reduce.o > ../build/reduce_perf
make[1]: Leaving directory '/home/haoyu/nccl-tests/src'
Usage
NCCL tests can run on multiple processes, multiple threads, and multiple CUDA devices per thread. The number of processes is managed by MPI and is therefore not passed to the tests as an argument. The total number of ranks (= CUDA devices) equals (number of processes) x (number of threads) x (number of GPUs per thread).
Quick examples
Run on 8 GPUs (-g 8), scanning from 8 bytes to 128 MB (this is what we run on our servers):
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
Run with MPI on 40 processes (potentially on multiple nodes) with 4 GPUs each :
mpirun -np 40 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
The test prints results like:
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid  12164 on   jungpu30 device  0 [0x1a] GeForce RTX 2080 Ti
#   Rank  1 Pid  12164 on   jungpu30 device  1 [0x1b] GeForce RTX 2080 Ti
#   Rank  2 Pid  12164 on   jungpu30 device  2 [0x3d] GeForce RTX 2080 Ti
#   Rank  3 Pid  12164 on   jungpu30 device  3 [0x3e] GeForce RTX 2080 Ti
#   Rank  4 Pid  12164 on   jungpu30 device  4 [0x88] GeForce RTX 2080 Ti
#   Rank  5 Pid  12164 on   jungpu30 device  5 [0x89] GeForce RTX 2080 Ti
#   Rank  6 Pid  12164 on   jungpu30 device  6 [0xb1] GeForce RTX 2080 Ti
#   Rank  7 Pid  12164 on   jungpu30 device  7 [0xb2] GeForce RTX 2080 Ti
#
#                                                 out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                      (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2   float     sum    35.93    0.00    0.00  1e-07    35.72    0.00    0.00  1e-07
          16             4   float     sum    37.01    0.00    0.00  1e-07    36.77    0.00    0.00  1e-07
          32             8   float     sum    36.74    0.00    0.00  6e-08    36.44    0.00    0.00  6e-08
          64            16   float     sum    37.21    0.00    0.00  6e-08    36.61    0.00    0.00  6e-08
         128            32   float     sum    36.98    0.00    0.01  6e-08    36.97    0.00    0.01  6e-08
         256            64   float     sum    37.26    0.01    0.01  3e-08    36.42    0.01    0.01  3e-08
         512           128   float     sum    37.47    0.01    0.02  3e-08    37.28    0.01    0.02  3e-08
        1024           256   float     sum    37.54    0.03    0.05  1e-07    36.81    0.03    0.05  1e-07
        2048           512   float     sum    38.50    0.05    0.09  2e-07    37.57    0.05    0.10  2e-07
        4096          1024   float     sum    38.97    0.11    0.18  2e-07    38.46    0.11    0.19  2e-07
        8192          2048   float     sum    39.49    0.21    0.36  2e-07    38.23    0.21    0.37  2e-07
       16384          4096   float     sum    41.20    0.40    0.70  2e-07    40.58    0.40    0.71  2e-07
       32768          8192   float     sum    67.22    0.49    0.85  2e-07    67.61    0.48    0.85  2e-07
       65536         16384   float     sum    124.9    0.52    0.92  2e-07    126.4    0.52    0.91  2e-07
      131072         32768   float     sum    237.0    0.55    0.97  2e-07    236.7    0.55    0.97  2e-07
      262144         65536   float     sum    207.2    1.27    2.21  2e-07    204.7    1.28    2.24  2e-07
      524288        131072   float     sum    325.5    1.61    2.82  2e-07    325.4    1.61    2.82  2e-07
     1048576        262144   float     sum    615.4    1.70    2.98  2e-07    613.0    1.71    2.99  2e-07
     2097152        524288   float     sum   1289.4    1.63    2.85  2e-07   1290.3    1.63    2.84  2e-07
     4194304       1048576   float     sum   2740.9    1.53    2.68  2e-07   2740.6    1.53    2.68  2e-07
     8388608       2097152   float     sum   5830.5    1.44    2.52  2e-07   5829.7    1.44    2.52  2e-07
    16777216       4194304   float     sum    11991    1.40    2.45  2e-07    11981    1.40    2.45  2e-07
    33554432       8388608   float     sum    23923    1.40    2.45  2e-07    23913    1.40    2.46  2e-07
    67108864      16777216   float     sum    47784    1.40    2.46  2e-07    47781    1.40    2.46  2e-07
   134217728      33554432   float     sum    95470    1.41    2.46  2e-07    95477    1.41    2.46  2e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.20304
#
A complete result like this means the test passed.
Managing multiple CUDA versions
Setting the default CUDA version
- When several CUDA versions are installed, to make one of them the system default, simply re-point the /usr/local/cuda symlink:
  sudo rm /usr/local/cuda && sudo ln -s /usr/local/cuda-x.x /usr/local/cuda
- Whenever the /usr/local/cuda link changes, refresh the shared-library cache:
  sudo ldconfig /usr/local/cuda/lib64
ldconfig is the command that manages the dynamic linker's shared-library cache so that shared libraries can be found system-wide. Linux caches shared-library information in the /etc/ld.so.cache file. Running the command above scans /usr/local/cuda/lib64 for lib*.so.* files and adds their paths to /etc/ld.so.cache.
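To confirm the cache now points at the intended libraries, you can query it (a quick sanity check, not required):
ldconfig -p | grep libcudart        # should show the libcudart of the chosen CUDA version
ldconfig -p | grep libcudnn         # should show the cuDNN libraries copied earlier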
Using a non-default CUDA version
If a user wants to use a non-default CUDA version, run the following commands in the current shell to adjust the environment variables:
export PATH=/usr/local/cuda-x.x/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-x.x/lib64:$LD_LIBRARY_PATH
This shell will then run programs against cuda-x.x; running nvcc -V in it now reports that version.
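If you switch versions often, a small helper in ~/.bashrc can save typing. This is only a sketch; the function name use_cuda is our own invention:
use_cuda () {
    local v=$1                                           # e.g. use_cuda 10.1
    export PATH=/usr/local/cuda-$v/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-$v/lib64:$LD_LIBRARY_PATH
}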
Error logs and fixes
CUDA 8.0 installation
cd /home/xxx/NVIDIA_CUDA-8.0_Samples
make
fails with:
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
In file included from /usr/local/cuda-8.0/bin/..//include/cuda_runtime.h:78:0,
                 from <command-line>:0:
/usr/local/cuda-8.0/bin/..//include/host_config.h:119:2: error: #error -- unsupported GNU version! gcc versions later than 5 are not supported!
 #error -- unsupported GNU version! gcc versions later than 5 are not supported!
 ^~~~~
Makefile:250: recipe for target 'simplePrintf.o' failed
make[1]: *** [simplePrintf.o] Error 1
make[1]: Leaving directory '/usr/local/cuda-8.0/samples/0_Simple/simplePrintf'
Makefile:52: recipe for target '0_Simple/simplePrintf/Makefile.ph_build' failed
make: *** [0_Simple/simplePrintf/Makefile.ph_build] Error 2
Fix: manually give cuda-8.0 symlinks to a gcc and g++ older than version 5:
sudo apt-get install gcc-4.9 g++-4.9
sudo ln -s /usr/bin/g++-4.9 /usr/local/cuda-8.0/bin/g++
sudo ln -s /usr/bin/gcc-4.9 /usr/local/cuda-8.0/bin/gcc
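After the symlinks are in place, nvcc should pick up the older compiler; a quick check and rebuild (sketch):
/usr/local/cuda-8.0/bin/gcc --version        # should now report gcc 4.9.x
cd /home/xxx/NVIDIA_CUDA-8.0_Samples && make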
cuDNN test
Build:
sudo make clean && sudo make
fails with:
/usr/bin/ld:/usr/local/cuda/lib64/libcudnn.so: file format not recognized; treating as linker script
Fix: /usr/local/cuda/lib64/libcudnn.so is not a proper symlink to the real library here, so recreate the symlink chain by hand:
cd /usr/local/cuda/lib64
ls -l | grep libcudnn.so
sudo rm -rf libcudnn.so libcudnn.so.7
sudo ln -s libcudnn.so.7.x.x libcudnn.so.7
sudo ln -s libcudnn.so.7 libcudnn.so
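To confirm the chain is right, each .so should now point one level down:
ls -l /usr/local/cuda/lib64/libcudnn*        # libcudnn.so -> libcudnn.so.7 -> libcudnn.so.7.x.x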
Then go back to the test-code directory and rebuild:
sudo make clean && sudo make
and this time the build succeeds.
Run:
./mnistCUDNN
It fails with:
./mnistCUDNN: error while loading shared libraries: libcudart.so.9.0: cannot open shared object file: No such file or directory
This happens when several CUDA versions are installed and the /usr/local/cuda -> cuda-x.x symlink has changed; run the following again:
sudo ldconfig /usr/local/cuda/lib64
/sbin/ldconfig.real: /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7 is not a symbolic link
/sbin/ldconfig.real: /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.5 is not a symbolic link
./mnistCUDNN
and the test now passes.