Appendix G - Setting Up and Testing a GPGPU
Requirements for GPGPU testing
- SUT prepared for testing as described in this document
- NVIDIA or AMD GPGPU(s) installed in the SUT (at this time, only NVIDIA and AMD GPGPUs are supported for Certification Testing)
- Internet connection (the SUT must be able to reach the Internet in order to download a significant number of packages from the NVIDIA repositories)
- Installation of the checkbox-provider-gpgpu package: type sudo apt install checkbox-provider-gpgpu after deploying the node. This package is installed from the Certification PPA, which should have been enabled when you deployed the node or installed Checkbox manually. Installing it pulls in the snapped tools used for certification: cuda-samples, gpu-burn, and rocm-validation-suite. cuda-samples and gpu-burn are used for NVIDIA GPGPUs, while rocm-validation-suite is used for AMD GPGPUs.
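The vendor-to-tool mapping above can be sketched as a small shell snippet. The vendor string here is a hypothetical placeholder for illustration; on a real SUT you would derive it, for example, from lspci output:

```shell
# Map the GPU vendor to the certification snaps named in this appendix.
# "vendor" is a hypothetical placeholder value, not a live query.
vendor="NVIDIA"

case "$vendor" in
  NVIDIA) tools="cuda-samples gpu-burn" ;;
  AMD)    tools="rocm-validation-suite" ;;
  *)      tools="" ;;   # not supported for Certification Testing
esac

echo "$tools"
```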
Setting Up GPGPU(s) for Testing
New test cases have been added to verify that NVIDIA and AMD GPGPUs work with Ubuntu. With this addition, GPGPUs can be certified on any Ubuntu LTS release or point release starting with Ubuntu 18.04 LTS using the 4.15 kernel.
AMD GPGPUs should work with the default drivers installed by Ubuntu.
If you’re using an NVIDIA GPGPU, install the ubuntu-drivers utility if it isn’t already installed. To install it, type sudo apt install ubuntu-drivers-common. This tool automates the setup of the NVIDIA drivers on Ubuntu. To use it, type ubuntu-drivers install.
Once the tool completes the installation, you must reboot the SUT to ensure the correct driver is loaded.
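After rebooting, it is worth confirming that the driver actually loaded before moving on. A minimal sketch of the check follows; the lsmod output is stubbed with hypothetical values so the logic is visible, and on the SUT you would replace the stub with lsmod_output=$(lsmod):

```shell
# Check that the nvidia kernel module shows up in lsmod output.
# The output here is a stubbed, hypothetical example; on the SUT use:
#   lsmod_output=$(lsmod)
lsmod_output="nvidia_uvm 1511424 0
nvidia 56717312 10"

if printf '%s\n' "$lsmod_output" | grep -q '^nvidia '; then
  echo "nvidia driver loaded"
else
  echo "nvidia driver NOT loaded"
fi
```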
GPGPUs that use NVLink
Some NVIDIA GPUs use NVLink for inter-device communication. NVLink is a high-bandwidth, energy-efficient interconnect technology developed by NVIDIA, aimed at replacing the traditional PCIe method of data transfer between the CPU and GPU or between multiple GPUs. Server configurations that use NVLink to connect multiple GPUs require extra configuration before testing can be performed. Failure to configure NVLink on systems where it is in use will cause the GPU tests to fail.
You must configure NVLink before launching tests. The following steps are provided as a guideline and a general reference for configuring NVLink. They are not guaranteed to work in all cases, as they depend on specific driver versions, tool versions, and so on, which can change over time. You are expected to understand how to configure your own hardware prior to testing.
Documentation and downloads for NVIDIA’s Data Center GPU Manager can be found at https://developer.nvidia.com/dcgm/.
The following steps should be performed after running ubuntu-drivers and rebooting the machine, to ensure that the correct NVIDIA driver has been loaded and the GPUs are accessible.
Determine which driver version you are using:
# modinfo nvidia | grep -i ^version
version: 525.105.17

You’re looking for the major version; in this example, it is 525.
Install the datacenter-gpu-manager, fabricmanager, and libnvidia-nscq packages appropriate for your driver version:

# sudo apt install nvidia-fabricmanager-525 libnvidia-nscq-525 datacenter-gpu-manager
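The link between the driver’s major version and the versioned package names can be expressed in a few lines of shell. The version string is the example value from the modinfo output above, and the derivation is plain POSIX parameter expansion:

```shell
# Derive the major driver version and build the matching package names.
# "version" is the example value from this appendix, not a live query.
version="525.105.17"
major="${version%%.*}"   # drop everything after the first dot -> 525

packages="nvidia-fabricmanager-${major} libnvidia-nscq-${major} datacenter-gpu-manager"
echo "sudo apt install ${packages}"
```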
Start the fabricmanager service:
# sudo systemctl start nvidia-fabricmanager.service
Start the persistence daemon:
# sudo service nvidia-persistenced start
Start nv-hostengine:
# sudo nv-hostengine
Set up a group:
# dcgmi group -c GPU_Group
# dcgmi group -l
The output will show you the GPU groups and the ID number for each.
Discover GPUs:
# dcgmi discovery -l
The output will show you the GPUs on the machine and the ID number for each.
Add GPUs to group:
# dcgmi group -g 2 -a 0,1,2,3
# dcgmi group -g 2 -i
Set up health monitoring:
# dcgmi health -g 2 -s mpi
Run the diag to check:

# dcgmi diag -g 2 -r 1
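The bring-up sequence above can be collected into one helper script. This is a sketch under the example values used in this appendix (group ID 2, GPU IDs 0-3); substitute the IDs that dcgmi reports on your SUT. The script defaults to a preview mode that only prints each command; set DRY_RUN=0 to actually execute them:

```shell
# Sketch of the NVLink setup sequence from this appendix.
# DRY_RUN=1 (the default) prints each command instead of running it,
# so the sequence can be reviewed on a machine without DCGM installed.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

run systemctl start nvidia-fabricmanager.service
run service nvidia-persistenced start
run nv-hostengine
run dcgmi group -c GPU_Group
run dcgmi group -l
run dcgmi discovery -l
run dcgmi group -g 2 -a 0,1,2,3   # example group and GPU IDs
run dcgmi group -g 2 -i
run dcgmi health -g 2 -s mpi
run dcgmi diag -g 2 -r 1
```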
At this point, NVLink should be configured and ready to go. You can also verify this by quickly running one of the NVIDIA sample tests, such as p2pBandwidthLatencyTest, which is provided by the cuda-samples snap. To run it, type cuda-samples p2pBandwidthLatencyTest.
Alternatively, you can run a quick stress test with gpu-burn like so:
# gpu-burn 10
Testing the GPGPU(s)
To test the GPGPU, you only need to run the test-gpgpu command as a normal user, much in the same manner as you run any of the certify-* or test-* commands provided by the canonical-certification-server package.
Running test-gpgpu will execute some automated tests plus a stress test that runs for approximately 4 hours against all discovered GPGPUs in the SUT in parallel. Once testing is complete, the tool will upload the results to the SUT’s hardware entry on the Certification Portal. You do not need to create a separate certificate request for GPGPU test results; simply add a note to the certificate created from the main test results with a link to the GPGPU submission, and the certification team will review them together.