Appendix G - Setting Up and Testing a GPGPU
Requirements for GPGPU testing
- SUT prepared for testing as described in this document
- NVIDIA or AMD GPGPU(s) installed in the SUT (at this time, only NVIDIA and AMD GPGPUs are supported for Certification Testing)
- Internet connection (the SUT must be able to reach the Internet in order to download a significant number of packages from the NVIDIA repositories)
- Installation of the checkbox-provider-gpgpu package: type sudo apt install checkbox-provider-gpgpu after deploying the node. This package is installed from the Certification PPA, which should have been enabled when you deployed the node or installed Checkbox manually. Installing it pulls in the snapped tools used for certification: cuda-samples, gpu-burn, and rocm-validation-suite. cuda-samples and gpu-burn are used for NVIDIA GPGPUs, while rocm-validation-suite is used for AMD GPGPUs.
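The vendor-to-tool mapping above can be sketched as a small shell snippet. The vendor string here is a hypothetical placeholder for illustration; on a real SUT you would derive it, for example, from lspci output:

```shell
# Map the GPU vendor to the certification snaps named in this appendix.
# "vendor" is a hypothetical placeholder value, not a live query.
vendor="NVIDIA"

case "$vendor" in
  NVIDIA) tools="cuda-samples gpu-burn" ;;
  AMD)    tools="rocm-validation-suite" ;;
  *)      tools="" ;;   # not supported for Certification Testing
esac

echo "$tools"
```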
Setting Up GPGPU(s) for Testing
New test cases have been added to verify that NVIDIA and AMD GPGPUs work with Ubuntu. With this addition, GPGPUs can be certified on any Ubuntu LTS release or point release starting with Ubuntu 18.04 LTS using the 4.15 kernel.
AMD GPGPUs should work with the default drivers installed by Ubuntu.
If you’re using an NVIDIA GPGPU, install the ubuntu-drivers utility if it isn’t already installed. To install it, type sudo apt install ubuntu-drivers-common. This tool automates the setup of the NVIDIA drivers on Ubuntu. To use it, type ubuntu-drivers install.
Once the tool completes the installation, you must reboot the SUT to ensure the correct driver is loaded.
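After rebooting, it is worth confirming that the driver actually loaded before moving on. A minimal sketch of the check follows; the lsmod output is stubbed with hypothetical values so the logic is visible, and on the SUT you would replace the stub with lsmod_output=$(lsmod):

```shell
# Check that the nvidia kernel module shows up in lsmod output.
# The output here is a stubbed, hypothetical example; on the SUT use:
#   lsmod_output=$(lsmod)
lsmod_output="nvidia_uvm 1511424 0
nvidia 56717312 10"

if printf '%s\n' "$lsmod_output" | grep -q '^nvidia '; then
  echo "nvidia driver loaded"
else
  echo "nvidia driver NOT loaded"
fi
```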
GPGPUs that use NVLink
Some NVIDIA GPUs use NVLink for inter-device communication. NVLink is a high-bandwidth, energy-efficient interconnect technology developed by NVIDIA, aimed at replacing the traditional PCIe method of data transfer between the CPU and GPU or between multiple GPUs. Server configurations that use NVLink to connect multiple GPUs require extra configuration before testing can be performed. Failure to configure NVLink on systems where it is in use will cause the GPU tests to fail.
You must configure NVLink before launching tests. The following steps are provided as a guideline and a general reference for configuring NVLink. They are not guaranteed to work in all cases, as they depend on specific driver versions, tool versions, and so on, which can change over time. You are expected to understand how to configure your own hardware prior to testing.
Documentation and downloads for NVIDIA’s Data Center GPU Manager can be found at https://developer.nvidia.com/dcgm/.
The following steps should be performed after running ubuntu-drivers and rebooting the machine, to ensure that the correct NVIDIA driver has been loaded and the GPUs are accessible.
Determine which driver version you are using:
# modinfo nvidia | grep -i ^version
version: 525.105.17

You’re looking for the major version; in this example, it is 525.
Install the datacenter-gpu-manager, fabricmanager, and libnvidia-nscq packages appropriate for your driver version:

# sudo apt install nvidia-fabricmanager-525 libnvidia-nscq-525 datacenter-gpu-manager
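The link between the driver’s major version and the versioned package names can be expressed in a few lines of shell. The version string is the example value from the modinfo output above, and the derivation is plain POSIX parameter expansion:

```shell
# Derive the major driver version and build the matching package names.
# "version" is the example value from this appendix, not a live query.
version="525.105.17"
major="${version%%.*}"   # drop everything after the first dot -> 525

packages="nvidia-fabricmanager-${major} libnvidia-nscq-${major} datacenter-gpu-manager"
echo "sudo apt install ${packages}"
```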
Start the fabricmanager service:
# sudo systemctl start nvidia-fabricmanager.service
Start the persistence daemon:
# sudo service nvidia-persistenced start
Start nv-hostengine:
# sudo nv-hostengine
Set up a group:
# dcgmi group -c GPU_Group
# dcgmi group -l
The output will show you the GPU groups and the ID number for each.
Discover GPUs:
# dcgmi discovery -l
The output will show you the GPUs on the machine and the ID number for each.
Add GPUs to group:
# dcgmi group -g 2 -a 0,1,2,3
# dcgmi group -g 2 -i
Set up health monitoring:
# dcgmi health -g 2 -s mpi
Run the diag to check:

# dcgmi diag -g 2 -r 1
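The bring-up sequence above can be collected into one helper script. This is a sketch under the example values used in this appendix (group ID 2, GPU IDs 0-3); substitute the IDs that dcgmi reports on your SUT. The script defaults to a preview mode that only prints each command; set DRY_RUN=0 to actually execute them:

```shell
# Sketch of the NVLink setup sequence from this appendix.
# DRY_RUN=1 (the default) prints each command instead of running it,
# so the sequence can be reviewed on a machine without DCGM installed.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

run systemctl start nvidia-fabricmanager.service
run service nvidia-persistenced start
run nv-hostengine
run dcgmi group -c GPU_Group
run dcgmi group -l
run dcgmi discovery -l
run dcgmi group -g 2 -a 0,1,2,3   # example group and GPU IDs
run dcgmi group -g 2 -i
run dcgmi health -g 2 -s mpi
run dcgmi diag -g 2 -r 1
```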
At this point, NVLink should be configured and ready to go. You can also verify this by quickly running one of the NVIDIA sample tests, such as p2pBandwidthLatencyTest, which is provided by the cuda-samples snap. To run it, type cuda-samples p2pBandwidthLatencyTest.
Alternatively, you can run a quick stress test with gpu-burn like so:
# gpu-burn 10
Testing the GPGPU(s)
To test the GPGPU, you only need to run the test-gpgpu command as a normal user, much in the same manner as you run any of the certify-* or test-* commands provided by the canonical-certification-server package.
Running test-gpgpu will execute some automated tests plus a stress test that runs for approximately 4 hours against all discovered GPGPUs in the SUT in parallel. Once testing is complete, the tool will upload the results to the SUT’s hardware entry on the Certification Portal. You do not need to create a separate certificate request for GPGPU test results; simply add a note to the certificate created from the main test results with a link to the GPGPU submission, and the certification team will review them together.