
ROSA with Nvidia GPU Workloads

This content is authored by Red Hat experts, but has not yet been tested on every supported configuration.

ROSA guide to running Nvidia GPU workloads.

Prerequisites

  • ROSA Cluster (4.10+)
  • rosa CLI (logged in)
  • oc CLI (logged in as cluster-admin)
  • jq

If you need to install a ROSA cluster, please read our ROSA Quickstart Guide. Please be sure the cluster you are installing or reusing is version 4.10.x or higher.
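
If you want to confirm the version of an existing cluster before proceeding, you can check it with the rosa CLI (the exact output formatting may vary between rosa versions):

rosa describe cluster --cluster=<YOUR-CLUSTER> | grep -i version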

As of OpenShift 4.10, it is no longer necessary to set up entitlements to use the Nvidia GPU Operator. This has greatly simplified the setup of the cluster for GPU workloads.

Log in to the cluster with the oc login command, using a cluster-admin username and password (for example, from the output of rosa create admin):

Example login:

oc login https://api.cluster_name.t6k4.i1.organization.org:6443 \
> --username cluster-admin \
> --password mypa55w0rd
Login successful.
You have access to 77 projects, the list has been suppressed. You can list all projects with 'oc projects'
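
Before continuing, it's worth confirming that you are logged in with cluster-admin rights. A minimal check (the second command should print "yes" for a cluster-admin user):

oc whoami

oc auth can-i '*' '*' --all-namespaces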

Install jq if you don't already have it.

Linux:

sudo dnf install jq

MacOS:

brew install jq
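
Verify that jq is available on your PATH:

jq --version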

Helm Prerequisites

If you plan to use Helm to deploy the GPU operator, you will need to do the following:

  1. Add the MOBB chart repository to your Helm repositories. (A quick verification command follows this list.)

    helm repo add mobb https://rh-mobb.github.io/helm-charts/
    
  2. Update your repositories

    helm repo update
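
To confirm the repository was added and its charts are visible (the verification mentioned above):

helm search repo mobb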
    

GPU Quota

  1. View the list of supported GPU instance types in ROSA

    rosa list instance-types | grep accelerated
    
  2. Select a GPU instance type

    The guide uses g5.xlarge as an example. Please be mindful of the GPU cost of the type you choose.

    export GPU_INSTANCE_TYPE='g5.xlarge'
    
  3. Login to AWS

    Log in to the AWS Console, type “quotas” in the search bar, then click “Service Quotas” -> “AWS services” -> “Amazon Elastic Compute Cloud (Amazon EC2)”. Search for “Running On-Demand [instance-family] instances” (e.g. Running On-Demand G and VT instances).

Please remember that AWS quota for these instance families is counted per vCPU rather than per instance. For example, to run a single g5.xlarge you will need to request quota in groups of 4; to run a single g5.8xlarge you will need to request quota in groups of 32. You can also check your current quota from the CLI, as shown after this list.

  4. Verify quota and request an increase if necessary

    GPU Quota Request on AWS
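
If you have the AWS CLI configured, you can also check the current quota from the command line, as mentioned above. The quota code below is assumed to be the one for “Running On-Demand G and VT instances”; confirm it in the Service Quotas console before relying on it:

aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-DB2E81BA \
  --query 'Quota.Value'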

GPU Machine Pool

  1. Set environment variables

    export CLUSTER_NAME=<YOUR-CLUSTER>
    export MACHINE_POOL_NAME=nvidia-gpu-pool
    export MACHINE_POOL_REPLICA_COUNT=1
    
  2. Create GPU machine pool

    rosa create machinepool --cluster=$CLUSTER_NAME \
      --name=$MACHINE_POOL_NAME \
      --replicas=$MACHINE_POOL_REPLICA_COUNT \
      --instance-type=$GPU_INSTANCE_TYPE
    
  3. Verify GPU machine pool

It may take 10-15 minutes to provision a new GPU machine. If this step fails, please log in to the AWS Console and ensure you didn’t run into capacity or availability issues. You can go to EC2 and search for instances by cluster name to see the instance state.

oc wait --for=jsonpath='{.status.readyReplicas}'=1 machineset \
  -l hive.openshift.io/machine-pool=$MACHINE_POOL_NAME \
  -n openshift-machine-api --timeout=600s
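
You can also list the cluster's machine pools to confirm the new GPU pool and its replica count:

rosa list machinepools --cluster=$CLUSTER_NAME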

Install and Configure Nvidia GPU

This section configures the Node Feature Discovery Operator (to allow OpenShift to discover the GPU nodes) and the Nvidia GPU Operator.

There are two ways to do this: using Helm, or manually.

Helm

  1. Create namespaces

    oc create namespace openshift-nfd
    oc create namespace nvidia-gpu-operator
    
  2. Use the mobb/operatorhub chart to deploy the needed operators

    helm upgrade -n nvidia-gpu-operator nvidia-gpu-operator \
      mobb/operatorhub --install \
      --values https://raw.githubusercontent.com/rh-mobb/helm-charts/main/charts/nvidia-gpu/files/operatorhub.yaml
    
  3. Wait until the two operators are running

    oc rollout status deploy/nfd-controller-manager -n openshift-nfd --timeout=300s
    
    oc rollout status deploy/gpu-operator -n nvidia-gpu-operator --timeout=300s
    
  4. Install the Nvidia GPU Operator chart

    helm upgrade --install -n nvidia-gpu-operator nvidia-gpu \
      mobb/nvidia-gpu --disable-openapi-validation
    
  5. Wait until NFD instances are ready

    NOTE: If you are deploying ROSA into a single AZ, change the expected nfd-master replicas in the command below from 3 to 1.

    oc wait --for=jsonpath='{.status.availableReplicas}'=3 -l app=nfd-master deployment -n openshift-nfd
    
    oc wait --for=jsonpath='{.status.numberReady}'=5 -l app=nfd-worker ds -n openshift-nfd
    
  6. Wait until Cluster Policy is ready

    oc wait --for=jsonpath='{.status.state}'=ready clusterpolicy \
      gpu-cluster-policy -n nvidia-gpu-operator --timeout=600s
    
  7. Skip to Validate GPU
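
Before skipping ahead, you can optionally confirm which operators the charts installed; the exact ClusterServiceVersion names will vary by release:

oc get csv -n openshift-nfd

oc get csv -n nvidia-gpu-operator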

Manually

Install Nvidia GPU Operator

  1. Create Nvidia namespace

    oc create namespace nvidia-gpu-operator
    
  2. Create Operator Group

    cat <<EOF | oc apply -f -
    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      name: nvidia-gpu-operator-group
      namespace: nvidia-gpu-operator
    spec:
      targetNamespaces:
      - nvidia-gpu-operator
    EOF
    
  3. Get latest nvidia channel

    CHANNEL=$(oc get packagemanifest gpu-operator-certified -n openshift-marketplace -o jsonpath='{.status.defaultChannel}')
    
  4. Get latest nvidia package

    PACKAGE=$(oc get packagemanifests/gpu-operator-certified -n openshift-marketplace -ojson | jq -r '.status.channels[] | select(.name == "'$CHANNEL'") | .currentCSV')
    
  5. Create Subscription

    cat <<EOF | oc apply -f -
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: gpu-operator-certified
      namespace: nvidia-gpu-operator
    spec:
      channel: "$CHANNEL"
      installPlanApproval: Automatic
      name: gpu-operator-certified
      source: certified-operators
      sourceNamespace: openshift-marketplace
      startingCSV: "$PACKAGE"
    EOF
    
  6. Wait for Operator to finish installing

    oc rollout status deploy/gpu-operator -n nvidia-gpu-operator --timeout=300s
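
Optionally, double-check that the Subscription resolved and the operator's ClusterServiceVersion reached the Succeeded phase:

oc get sub,csv,installplan -n nvidia-gpu-operator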
    

Install Node Feature Discovery Operator

The Node Feature Discovery (NFD) Operator will discover the GPUs on your nodes and label the nodes appropriately so you can target them for workloads. We’ll install the NFD Operator into the openshift-nfd namespace and create the Subscription that installs and configures it.

Official Documentation for Installing Node Feature Discovery Operator

  1. Set up namespace

    oc create namespace openshift-nfd
    
  2. Create OperatorGroup

    cat <<EOF | oc apply -f -
    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      generateName: openshift-nfd-
      name: openshift-nfd
      namespace: openshift-nfd
    EOF
    
  3. Create Subscription

    cat <<EOF | oc apply -f -
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: nfd
      namespace: openshift-nfd
    spec:
      channel: "stable"
      installPlanApproval: Automatic
      name: nfd
      source: redhat-operators
      sourceNamespace: openshift-marketplace
    EOF
    
  4. Wait for Node Feature discovery to complete installation

    oc rollout status deploy/nfd-controller-manager -n openshift-nfd --timeout=300s
    
  5. Create NFD Instance

    cat <<EOF | oc apply -f -
    kind: NodeFeatureDiscovery
    apiVersion: nfd.openshift.io/v1
    metadata:
      name: nfd-instance
      namespace: openshift-nfd
    spec:
      customConfig:
        configData: |
          #    - name: "more.kernel.features"
          #      matchOn:
          #      - loadedKMod: ["example_kmod3"]
          #    - name: "more.features.by.nodename"
          #      value: customValue
          #      matchOn:
          #      - nodename: ["special-.*-node-.*"]      
      operand:
        image: >-
          registry.redhat.io/openshift4/ose-node-feature-discovery@sha256:07658ef3df4b264b02396e67af813a52ba416b47ab6e1d2d08025a350ccd2b7b      
        servicePort: 12000
      workerConfig:
        configData: |
          core:
          #  labelWhiteList:
          #  noPublish: false
            sleepInterval: 60s
          #  sources: [all]
          #  klog:
          #    addDirHeader: false
          #    alsologtostderr: false
          #    logBacktraceAt:
          #    logtostderr: true
          #    skipHeaders: false
          #    stderrthreshold: 2
          #    v: 0
          #    vmodule:
          ##   NOTE: the following options are not dynamically run-time
          ##          configurable and require a nfd-worker restart to take effect
          ##          after being changed
          #    logDir:
          #    logFile:
          #    logFileMaxSize: 1800
          #    skipLogHeaders: false
          sources:
          #  cpu:
          #    cpuid:
          ##     NOTE: whitelist has priority over blacklist
          #      attributeBlacklist:
          #        - "BMI1"
          #        - "BMI2"
          #        - "CLMUL"
          #        - "CMOV"
          #        - "CX16"
          #        - "ERMS"
          #        - "F16C"
          #        - "HTT"
          #        - "LZCNT"
          #        - "MMX"
          #        - "MMXEXT"
          #        - "NX"
          #        - "POPCNT"
          #        - "RDRAND"
          #        - "RDSEED"
          #        - "RDTSCP"
          #        - "SGX"
          #        - "SSE"
          #        - "SSE2"
          #        - "SSE3"
          #        - "SSE4.1"
          #        - "SSE4.2"
          #        - "SSSE3"
          #      attributeWhitelist:
          #  kernel:
          #    kconfigFile: "/path/to/kconfig"
          #    configOpts:
          #      - "NO_HZ"
          #      - "X86"
          #      - "DMI"
            pci:
              deviceClassWhitelist:
                - "0200"
                - "03"
                - "12"
              deviceLabelFields:
          #      - "class"
                - "vendor"
          #      - "device"
          #      - "subsystem_vendor"
          #      - "subsystem_device"
          #  usb:
          #    deviceClassWhitelist:
          #      - "0e"
          #      - "ef"
          #      - "fe"
          #      - "ff"
          #    deviceLabelFields:
          #      - "class"
          #      - "vendor"
          #      - "device"
          #  custom:
          #    - name: "my.kernel.feature"
          #      matchOn:
          #        - loadedKMod: ["example_kmod1", "example_kmod2"]
          #    - name: "my.pci.feature"
          #      matchOn:
          #        - pciId:
          #            class: ["0200"]
          #            vendor: ["15b3"]
          #            device: ["1014", "1017"]
          #        - pciId :
          #            vendor: ["8086"]
          #            device: ["1000", "1100"]
          #    - name: "my.usb.feature"
          #      matchOn:
          #        - usbId:
          #          class: ["ff"]
          #          vendor: ["03e7"]
          #          device: ["2485"]
          #        - usbId:
          #          class: ["fe"]
          #          vendor: ["1a6e"]
          #          device: ["089a"]
          #    - name: "my.combined.feature"
          #      matchOn:
          #        - pciId:
          #            vendor: ["15b3"]
          #            device: ["1014", "1017"]
          #          loadedKMod : ["vendor_kmod1", "vendor_kmod2"]      
    EOF
    
  6. Wait until NFD instances are ready

    oc wait --for=jsonpath='{.status.numberReady}'=3 -l app=nfd-master ds -n openshift-nfd
    
    oc wait --for=jsonpath='{.status.numberReady}'=5 -l app=nfd-worker ds -n openshift-nfd
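
Once the NFD instances are ready, you can spot-check that nodes are being labeled; the jq filter below is just one way to list the feature labels:

oc get nodes -o json | jq -r '.items[].metadata.labels | keys[] | select(startswith("feature.node.kubernetes.io"))' | sort -u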
    

Apply Nvidia Cluster Config

We’ll now apply the Nvidia cluster config (a ClusterPolicy resource). Please read the Nvidia documentation on customizing this if you have your own private repos or specific settings. This will be another process that takes a few minutes to complete.

  1. Create cluster config

    cat <<EOF | oc create -f -
    apiVersion: nvidia.com/v1
    kind: ClusterPolicy
    metadata:
      name: gpu-cluster-policy
    spec:
      migManager:
        enabled: true
      operator:
        defaultRuntime: crio
        initContainer: {}
        runtimeClass: nvidia
        deployGFD: true
      dcgm:
        enabled: true
      gfd: {}
      dcgmExporter:
        config:
          name: ''
      driver:
        licensingConfig:
          nlsEnabled: false
          configMapName: ''
        certConfig:
          name: ''
        kernelModuleConfig:
          name: ''
        repoConfig:
          configMapName: ''
        virtualTopology:
          config: ''
        enabled: true
        use_ocp_driver_toolkit: true
      devicePlugin: {}
      mig:
        strategy: single
      validator:
        plugin:
          env:
            - name: WITH_WORKLOAD
              value: 'true'
      nodeStatusExporter:
        enabled: true
      daemonsets: {}
      toolkit:
        enabled: true
    EOF
    
  2. Wait until Cluster Policy is ready

    oc wait --for=jsonpath='{.status.state}'=ready clusterpolicy \
     gpu-cluster-policy -n nvidia-gpu-operator --timeout=600s
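
While the cluster policy is being applied, you can watch the driver, container toolkit, and validator pods come up in the nvidia-gpu-operator namespace:

oc get pods -n nvidia-gpu-operator -w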
    

Validate GPU

  1. Verify NFD can see your GPU(s)

    oc describe node -l node.kubernetes.io/instance-type=$GPU_INSTANCE_TYPE \
      | egrep 'Roles|pci-10de' | grep -v master
    

    You should see output like:

    Roles:              worker
                        feature.node.kubernetes.io/pci-10de.present=true
    
  2. Verify that the GPU Operator added the node label to your GPU nodes

    oc get node -l nvidia.com/gpu.present
    
  3. [Optional] Test GPU access using Nvidia SMI

    oc project nvidia-gpu-operator
    
    for i in $(oc get pod -lopenshift.driver-toolkit=true --no-headers |awk '{print $1}'); do echo $i; oc exec -it $i -- nvidia-smi ; echo -e '\n' ;  done
    

    You should see output that shows the GPUs available on the host such as this example screenshot. (Varies depending on GPU worker type)

    Nvidia SMI
  4. Create Pod to run a GPU workload

    oc project nvidia-gpu-operator
    
    cat <<EOF | oc create -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vector-add
    spec:
      restartPolicy: OnFailure
      containers:
        - name: cuda-vector-add
          image: "nvidia/samples:vectoradd-cuda11.2.1"
          resources:
            limits:
              nvidia.com/gpu: 1
      nodeSelector:
        nvidia.com/gpu.present: "true"
    EOF
    
  5. View logs

    oc logs cuda-vector-add --tail=-1
    

    Please note: if you get the error “Error from server (BadRequest): container “cuda-vector-add” in pod “cuda-vector-add” is waiting to start: ContainerCreating”, try running “oc delete pod cuda-vector-add” and then re-run the create statement above. We’ve seen cases where, if this step is run before the operators have finished installing and reconciling, the pod can sit in this state indefinitely.

    You should see output like the following (may vary depending on the GPU):

    [Vector addition of 5000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done
    
  6. If successful, the pod can be deleted

    oc delete pod cuda-vector-add
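
As an additional check, you can confirm that the GPUs are now advertised as allocatable resources on the GPU nodes (the jq expression is just an illustration):

oc get node -l nvidia.com/gpu.present -o json \
  | jq '.items[] | {name: .metadata.name, gpus: .status.allocatable["nvidia.com/gpu"]}'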
    
