
IMPORTANT NOTE: This site is not official Red Hat documentation and is provided for informational purposes only. These guides may be experimental, proof of concept, or early adoption. Officially supported documentation is available at docs.openshift.com and access.redhat.com.

ARO with Nvidia GPU Workloads

ARO guide to running Nvidia GPU workloads.

Author: Byron Miller

Prerequisites

If you need to install an ARO cluster, please read our ARO Quick start guide. Whether you're installing a new cluster or using an existing one, make sure it is version 4.10.x or higher.

As of OpenShift 4.10, it is no longer necessary to set up entitlements to use the Nvidia GPU Operator. This has greatly simplified the setup of the cluster for GPU workloads.

This guide makes use of jq, sponge (from moreutils), and envsubst (from gettext). Install them as follows.

Linux:

sudo dnf install jq moreutils gettext

macOS:

brew install jq moreutils gettext

GPU Quota

All GPU quotas in Azure are 0 by default. You will need to log in to the Azure portal and request GPU quota. There is a lot of competition for GPU workers, so you may have to provision your ARO cluster in a region where you can actually reserve GPU capacity. ARO supports several GPU worker types, including the NCasT4_v3 series used in this guide.

Please remember that Azure quota is counted per vCPU core. To request a single NC4as T4 v3 node, you will need to request quota in groups of 4. If you wish to request an NC16as T4 v3, you will need to request a quota of 16.
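
You can check your current usage and limits from the command line before filing the request. A quick sketch, assuming the Azure CLI (az) is installed and you are logged in; substitute your cluster's region for southcentralus:

    # Show vCPU usage and limits per VM family in the target region,
    # then filter for the T4 families used in this guide.
    az vm list-usage --location southcentralus --output table | grep -i "T4"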

  1. Log in to Azure

    Log in to portal.azure.com, type "quotas" in the search bar, click Compute, and in the search box type "NCAsv3_T4". Select the region your cluster is in (select the checkbox), then click Request quota increase and ask for quota (I chose 8 so I can build two demo clusters of NC4as T4s).

  2. Configure quota

    GPU Quota Request on Azure

Log in to your ARO cluster

  1. Log in to OpenShift. We'll use the kubeadmin account here, but you can log in with your user account as long as you have cluster-admin privileges.

    oc login <apiserver> -u kubeadmin -p <kubeadminpass>
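
    An optional sanity check to confirm the login and your privileges (the second command should print "yes" for a cluster-admin account):

    oc whoami
    oc auth can-i '*' '*' --all-namespaces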
    

Pull secret (Conditional)

We’ll update our pull secret to make sure that we can install operators as well as connect to cloud.redhat.com.

If you have already re-created a full pull secret with cloud.redhat.com enabled, you can skip this step.

  1. Log in to cloud.redhat.com

  2. Browse to https://cloud.redhat.com/openshift/install/azure/aro-provisioned

  3. Click the Download pull secret button and save it as pull-secret.txt

    The following steps will need to be run in the same working directory as your pull-secret.txt.

  4. Export existing pull secret

    oc get secret pull-secret -n openshift-config -o json | jq -r '.data.".dockerconfigjson"' | base64 --decode > export-pull.json
    
  5. Merge downloaded pull secret with system pull secret to add cloud.redhat.com

    jq -s '.[0] * .[1]' export-pull.json pull-secret.txt | tr -d "\n\r" > new-pull-secret.json
    
  6. Upload new secret file

    oc set data secret/pull-secret -n openshift-config --from-file=.dockerconfigjson=new-pull-secret.json
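
    Optionally, confirm the updated secret contains the merged registry entries, reusing the same jq pattern as the export step:

    oc get secret pull-secret -n openshift-config -o json | jq -r '.data.".dockerconfigjson"' | base64 --decode | jq '.auths | keys'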
    

You may need to wait about an hour for everything to sync up with cloud.redhat.com.

  7. Delete the local secret files

    rm pull-secret.txt export-pull.json new-pull-secret.json 
    

GPU Machine Set

ARO uses Kubernetes MachineSets to manage worker machines. I'm going to export the first machine set in my cluster (availability zone 1) and use it as a template to build a single GPU machine set in southcentralus zone 1.

  1. View existing machine sets

    For ease of setup, I'm going to grab the first machine set and use it as the template for the GPU machine set.

    MACHINESET=$(oc get machineset -n openshift-machine-api -o jsonpath='{.items[0].metadata.name}')
    
  2. Save a copy of example machine set

    oc get machineset -n openshift-machine-api $MACHINESET -o json > gpu_machineset.json
    
  3. Change the .metadata.name field to a new unique name

    I'll name this single-node machine set nvidia-worker-southcentralus1, following the naming pattern of the existing machine sets.

    jq '.metadata.name = "nvidia-worker-southcentralus1"' gpu_machineset.json | sponge gpu_machineset.json
    
  4. Ensure spec.replicas matches the desired replica count for the MachineSet

    jq '.spec.replicas = 1' gpu_machineset.json | sponge gpu_machineset.json
    
  5. Change the .spec.selector.matchLabels.machine.openshift.io/cluster-api-machineset field to match the .metadata.name field

    jq '.spec.selector.matchLabels."machine.openshift.io/cluster-api-machineset" = "nvidia-worker-southcentralus1"' gpu_machineset.json | sponge gpu_machineset.json
    
  6. Change the .spec.template.metadata.labels.machine.openshift.io/cluster-api-machineset to match the .metadata.name field

    jq '.spec.template.metadata.labels."machine.openshift.io/cluster-api-machineset" = "nvidia-worker-southcentralus1"' gpu_machineset.json | sponge gpu_machineset.json
    
  7. Change the spec.template.spec.providerSpec.value.vmSize to match the desired GPU instance type from Azure.

    The machine we’re using is Standard_NC4as_T4_v3.

    jq '.spec.template.spec.providerSpec.value.vmSize = "Standard_NC4as_T4_v3"' gpu_machineset.json | sponge gpu_machineset.json
    
  8. Change the spec.template.spec.providerSpec.value.zone to match the desired zone from Azure

    jq '.spec.template.spec.providerSpec.value.zone = "1"' gpu_machineset.json | sponge gpu_machineset.json
    
  9. Delete the .status section of the JSON file

    jq 'del(.status)' gpu_machineset.json | sponge gpu_machineset.json

  10. Verify the other data in the JSON file; a quick spot-check is shown below.
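
    A single jq filter can print the fields changed above for review (illustrative only):

    jq '{name: .metadata.name,
         replicas: .spec.replicas,
         vmSize: .spec.template.spec.providerSpec.value.vmSize,
         zone: .spec.template.spec.providerSpec.value.zone}' gpu_machineset.json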

Create GPU machine set

These steps will create the new GPU machine. It may take 10-15 minutes to provision. If this step fails, log in to the Azure portal and make sure you didn't run into availability issues. You can go to Virtual Machines and search for the worker name you created above to see the status of the VMs.

  1. Create GPU Machine set

    oc create -f gpu_machineset.json
    

    The machine set is created immediately; the machine itself will take several minutes to provision.

  2. Verify GPU machine set

    Machines should be getting deployed. You can view the status of the machine set with the following commands:

    oc get machineset -n openshift-machine-api
    oc get machine -n openshift-machine-api
    

    Once the machines are provisioned, which could take 5-15 minutes, machines will show as nodes in the node list.

    oc get nodes
    

    You should see a node with the "nvidia-worker-southcentralus1" name that we created earlier.
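
    If you'd rather wait from the CLI, a simple poll loop works (a sketch, using the machine set name chosen above):

    until oc get nodes | grep -q nvidia-worker-southcentralus1; do
      echo "Waiting for GPU node..."
      sleep 30
    done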

Install Nvidia GPU Operator

This will create the nvidia-gpu-operator namespace, set up the operator group, and install the Nvidia GPU Operator.

  1. Create Nvidia namespace

    cat <<EOF | oc apply -f -
    apiVersion: v1
    kind: Namespace
    metadata:
      name: nvidia-gpu-operator
    EOF
    
  2. Create Operator Group

    cat <<EOF | oc apply -f -
    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      name: nvidia-gpu-operator-group
      namespace: nvidia-gpu-operator
    spec:
      targetNamespaces:
      - nvidia-gpu-operator
    EOF
    
  3. Get the latest Nvidia channel

    CHANNEL=$(oc get packagemanifest gpu-operator-certified -n openshift-marketplace -o jsonpath='{.status.defaultChannel}')
    
  4. Get the latest Nvidia package

    PACKAGE=$(oc get packagemanifests/gpu-operator-certified -n openshift-marketplace -ojson | jq -r '.status.channels[] | select(.name == "'$CHANNEL'") | .currentCSV')
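
    Sanity-check both values before creating the Subscription:

    echo "CHANNEL=$CHANNEL"
    echo "PACKAGE=$PACKAGE"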
    
  5. Create Subscription

    envsubst <<EOF | oc apply -f -
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: gpu-operator-certified
      namespace: nvidia-gpu-operator
    spec:
      channel: "$CHANNEL"
      installPlanApproval: Automatic
      name: gpu-operator-certified
      source: certified-operators
      sourceNamespace: openshift-marketplace
      startingCSV: "$PACKAGE"
    EOF
    
  6. Wait for Operator to finish installing

    Don’t proceed until you have verified that the operator has finished installing. This is also a good point to ensure that your GPU worker is online.

    Verify Operator
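
    You can also verify from the CLI; the ClusterServiceVersion in the nvidia-gpu-operator namespace should eventually show a PHASE of Succeeded:

    oc get csv -n nvidia-gpu-operator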

Install Node Feature Discovery Operator

The Node Feature Discovery (NFD) Operator discovers the GPUs on your nodes and labels them appropriately so you can target them for workloads. We'll install the NFD Operator into the openshift-nfd namespace, create the Subscription that installs it, and then create a NodeFeatureDiscovery instance to configure it.

Official Documentation for Installing Node Feature Discovery Operator

  1. Set up Namespace

    cat <<EOF | oc apply -f -
    apiVersion: v1
    kind: Namespace
    metadata:
      name: openshift-nfd
    EOF
    
  2. Create OperatorGroup

    cat <<EOF | oc apply -f -
    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      generateName: openshift-nfd-
      name: openshift-nfd
      namespace: openshift-nfd
    EOF
    
  3. Create Subscription

    cat <<EOF | oc apply -f -
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: nfd
      namespace: openshift-nfd
    spec:
      channel: "stable"
      installPlanApproval: Automatic
      name: nfd
      source: redhat-operators
      sourceNamespace: openshift-marketplace
    EOF
    
  4. Wait for Node Feature Discovery to complete installation

    You can log in to your OpenShift console and view the operators, or simply wait a few minutes. The next step will error until the operator has finished installing.
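
    One way to wait from the CLI is to poll for the NodeFeatureDiscovery CRD (a sketch; the CRD name follows from the apiVersion and kind used in the next step):

    until oc get crd nodefeaturediscoveries.nfd.openshift.io &>/dev/null; do
      echo "Waiting for the NFD CRD..."
      sleep 10
    done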

  5. Create NFD Instance

    cat <<EOF | oc apply -f -
    kind: NodeFeatureDiscovery
    apiVersion: nfd.openshift.io/v1
    metadata:
      name: nfd-instance
      namespace: openshift-nfd
    spec:
      customConfig:
        configData: |
          #    - name: "more.kernel.features"
          #      matchOn:
          #      - loadedKMod: ["example_kmod3"]
          #    - name: "more.features.by.nodename"
          #      value: customValue
          #      matchOn:
          #      - nodename: ["special-.*-node-.*"]
      operand:
        image: >-
          registry.redhat.io/openshift4/ose-node-feature-discovery@sha256:07658ef3df4b264b02396e67af813a52ba416b47ab6e1d2d08025a350ccd2b7b
        servicePort: 12000
      workerConfig:
        configData: |
          core:
          #  labelWhiteList:
          #  noPublish: false
            sleepInterval: 60s
          #  sources: [all]
          #  klog:
          #    addDirHeader: false
          #    alsologtostderr: false
          #    logBacktraceAt:
          #    logtostderr: true
          #    skipHeaders: false
          #    stderrthreshold: 2
          #    v: 0
          #    vmodule:
          ##   NOTE: the following options are not dynamically run-time 
          ##          configurable and require a nfd-worker restart to take effect
          ##          after being changed
          #    logDir:
          #    logFile:
          #    logFileMaxSize: 1800
          #    skipLogHeaders: false
          sources:
          #  cpu:
          #    cpuid:
          ##     NOTE: whitelist has priority over blacklist
          #      attributeBlacklist:
          #        - "BMI1"
          #        - "BMI2"
          #        - "CLMUL"
          #        - "CMOV"
          #        - "CX16"
          #        - "ERMS"
          #        - "F16C"
          #        - "HTT"
          #        - "LZCNT"
          #        - "MMX"
          #        - "MMXEXT"
          #        - "NX"
          #        - "POPCNT"
          #        - "RDRAND"
          #        - "RDSEED"
          #        - "RDTSCP"
          #        - "SGX"
          #        - "SSE"
          #        - "SSE2"
          #        - "SSE3"
          #        - "SSE4.1"
          #        - "SSE4.2"
          #        - "SSSE3"
          #      attributeWhitelist:
          #  kernel:
          #    kconfigFile: "/path/to/kconfig"
          #    configOpts:
          #      - "NO_HZ"
          #      - "X86"
          #      - "DMI"
            pci:
              deviceClassWhitelist:
                - "0200"
                - "03"
                - "12"
              deviceLabelFields:
          #      - "class"
                - "vendor"
          #      - "device"
          #      - "subsystem_vendor"
          #      - "subsystem_device"
          #  usb:
          #    deviceClassWhitelist:
          #      - "0e"
          #      - "ef"
          #      - "fe"
          #      - "ff"
          #    deviceLabelFields:
          #      - "class"
          #      - "vendor"
          #      - "device"
          #  custom:
          #    - name: "my.kernel.feature"
          #      matchOn:
          #        - loadedKMod: ["example_kmod1", "example_kmod2"]
          #    - name: "my.pci.feature"
          #      matchOn:
          #        - pciId:
          #            class: ["0200"]
          #            vendor: ["15b3"]
          #            device: ["1014", "1017"]
          #        - pciId :
          #            vendor: ["8086"]
          #            device: ["1000", "1100"]
          #    - name: "my.usb.feature"
          #      matchOn:
          #        - usbId:
          #          class: ["ff"]
          #          vendor: ["03e7"]
          #          device: ["2485"]
          #        - usbId:
          #          class: ["fe"]
          #          vendor: ["1a6e"]
          #          device: ["089a"]
          #    - name: "my.combined.feature"
          #      matchOn:
          #        - pciId:
          #            vendor: ["15b3"]
          #            device: ["1014", "1017"]
          #          loadedKMod : ["vendor_kmod1", "vendor_kmod2"]
    EOF
    
  6. Verify NFD is ready

    The operator should show Available in its status.

    NFD Operator Ready
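
    You can also check from the CLI; once the instance is reconciled you should see an nfd-master pod plus an nfd-worker pod per worker node:

    oc get pods -n openshift-nfd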

Apply Nvidia Cluster Config

We'll now apply the Nvidia cluster config. Please read the Nvidia documentation on customizing this if you have your own private repos or specific settings. This is another process that takes a few minutes to complete.

  1. Apply cluster config

    cat <<EOF | oc apply -f -
    apiVersion: nvidia.com/v1
    kind: ClusterPolicy
    metadata:
      name: gpu-cluster-policy
    spec:
      migManager:
        enabled: true
      operator:
        defaultRuntime: crio
        initContainer: {}
        runtimeClass: nvidia
        deployGFD: true
      dcgm:
        enabled: true
      gfd: {}
      dcgmExporter:
        config:
          name: ''
      driver:
        licensingConfig:
          nlsEnabled: false
          configMapName: ''
        certConfig:
          name: ''
        kernelModuleConfig:
          name: ''
        repoConfig:
          configMapName: ''
        virtualTopology:
          config: ''
        enabled: true
        use_ocp_driver_toolkit: true
      devicePlugin: {}
      mig:
        strategy: single
      validator:
        plugin:
          env:
            - name: WITH_WORKLOAD
              value: 'true'
      nodeStatusExporter:
        enabled: true
      daemonsets: {}
      toolkit:
        enabled: true
    EOF
    
  2. Verify Cluster Policy

    Log in to the OpenShift console, browse to the installed operators, and make sure you're in the nvidia-gpu-operator namespace. You should see State: Ready once everything is complete.

    cluster policy
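
    You can also query the ClusterPolicy state from the CLI; it should eventually report ready:

    oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'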

Validate GPU

It may take some time for the Nvidia Operator and NFD to completely install and identify the GPU machines. These commands can be run to help validate that everything is running as expected.

  1. Verify NFD can see your GPU(s)

    NFD labels nodes with the PCI vendor IDs it finds; 10de is Nvidia's PCI vendor ID.

    oc describe node | egrep 'Roles|pci-10de' | grep -v master
    

    You should see output like:

    Roles:              worker
                    feature.node.kubernetes.io/pci-10de.present=true
    
  2. Verify node labels

    You can see the node labels by logging into the OpenShift console -> Compute -> Nodes -> nvidia-worker-southcentralus1-. You should see a number of Nvidia GPU labels and the pci-10de device from above.

    NFD Node labels
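
    To check the same labels from the CLI (the nvidia.com/gpu.present label is applied by GPU Feature Discovery, which the ClusterPolicy above enables via deployGFD: true):

    oc describe node -l nvidia.com/gpu.present=true | grep -i 'nvidia.com/'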

  3. Create Pod to run a GPU workload

    oc project nvidia-gpu-operator
    cat <<EOF | oc apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vector-add
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
        - name: cuda-vector-add
          image: "quay.io/giantswarm/nvidia-gpu-demo:latest"
          resources:
            limits:
              nvidia.com/gpu: 1
    EOF
    
  4. View logs

    oc logs cuda-vector-add --tail=-1
    

    Please note: if you get the error “Error from server (BadRequest): container “cuda-vector-add” in pod “cuda-vector-add” is waiting to start: ContainerCreating”, try running “oc delete pod cuda-vector-add” and then re-run the create statement above. I’ve seen cases where, if this step is run before the operator has finished reconciling, the pod may just sit there.

    Output:

    Begin
    Allocating device memory on host..
    Copying to device..
    Doing GPU Vector add
    Doing CPU Vector add
    10000000 0.000078 0.044028
    
  5. If successful, the pod can be deleted

    oc delete pod cuda-vector-add