Smart MIG Placement #932

@Pavel-Okruhlica-SZN

Description

TL;DR:
When starting multiple pods that each use a MIG device, the devices are not packed onto one GPU first and then the next; instead they are placed randomly across the GPUs.

Full story:
I have a deployment that requests 7 × 1g.10gb MIG devices on a node with NVIDIA H100 GPUs:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: abstract-mig-claim
  namespace: nvidia-dra-driver-gpu
  labels:
    app: abstract-mig-claim
spec:
  replicas: 7
  selector:
    matchLabels:
      app: abstract-mig-claim
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: abstract-mig-claim
    spec:
      restartPolicy: Always
      containers:
        - name: abstract-mig-claiming-pod
          image: docker.ops.iszn.cz/ftxt-gpu/cuda:13.1.1-runtime-ubuntu24.04
          command: ["sleep", "6000"]
          resources:
            claims:
            - name: mig-device
              request: mig-10gb
      resourceClaims:
        - name: mig-device
          resourceClaimTemplateName: at-least-10gb-mig-template

with ResourceClaimTemplate:

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: at-least-10gb-mig-template
spec:
  spec:
    devices:
      requests:
      - name: mig-10gb
        exactly:
          deviceClassName: mig.nvidia.com
          selectors:
          - cel:
              expression: |
                device.capacity['gpu.nvidia.com'].multiprocessors.isGreaterThan(quantity("10"))
                &&
                device.capacity['gpu.nvidia.com'].memory.isGreaterThan(quantity("9Gi"))

      constraints:
      - requests: []
        matchAttribute: "gpu.nvidia.com/parentUUID"
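For clarity, the CEL selector above roughly corresponds to the following filter, sketched in Python over a hypothetical device dict (the real driver evaluates the CEL expression against the device's published capacity, so the field names here are stand-ins):

```python
GIB = 1024 ** 3  # 1 Gi in bytes

def matches_selector(device: dict) -> bool:
    """Rough Python equivalent of the CEL selector above.

    `device` is a hypothetical dict standing in for a DRA device;
    `multiprocessors` and `memory_bytes` mirror the capacity keys
    used in the CEL expression.
    """
    cap = device["capacity"]["gpu.nvidia.com"]
    # multiprocessors > 10 AND memory > 9Gi
    return cap["multiprocessors"] > 10 and cap["memory_bytes"] > 9 * GIB
```

Any 1g.10gb slice (just under 10 GiB of memory) passes this filter, which is why all seven replicas match the template.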

When I apply the deployment, the MIG devices are assigned to GPUs randomly instead of filling up one GPU before moving on to the next:

GPU 0: NVIDIA H100 PCIe (UUID: GPU-c5dc08af-14bd-8444-4b8c-807a2b927bfc)
  MIG 1g.10gb     Device  0: (UUID: MIG-7bfd1ff3-36c9-5943-801e-ee320b69b080)
  MIG 1g.10gb     Device  1: (UUID: MIG-289084bd-9d24-58a1-b108-bd165d812ba0)
  MIG 1g.10gb     Device  2: (UUID: MIG-d5da8058-8391-52ad-9a82-4d71b789ff88)
GPU 1: NVIDIA H100 PCIe (UUID: GPU-26feb2ed-98d7-c1b0-85cc-9742bde90813)
GPU 2: NVIDIA H100 PCIe (UUID: GPU-406c8304-24fb-4e0f-ac82-8cc39e5deabe)
  MIG 1g.10gb     Device  0: (UUID: MIG-0eb84170-f5b5-5480-bb0b-1a68ec838b22)
GPU 3: NVIDIA H100 PCIe (UUID: GPU-75578d92-6462-98be-dc5e-9d90ec4d5fed)
  MIG 1g.10gb     Device  0: (UUID: MIG-ca1bcc65-fefc-5bb1-9009-76b161bc870b)
  MIG 1g.10gb     Device  1: (UUID: MIG-bc74463a-73bc-5c66-bb5e-78d95757f444)
  MIG 1g.10gb     Device  2: (UUID: MIG-83a994de-db16-5943-9cc4-92eb9951048e)

This effectively makes most of the GPUs unusable for pods that require a full GPU (only 1 full GPU remains available when it could have been 3), so you would need to separate "MIG-able" pods from full-GPU pods.

I was thinking the logic behind this could be:

When a new MIG device request comes in, check whether any GPU already has a MIG device on it. If yes, and the requested device fits on that GPU, create the device there. If the answer is no at any point, move on to the next GPU, checking the whole cluster before creating a MIG device on a GPU that has no MIG devices yet.
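The logic described above is essentially a best-fit packing heuristic. A minimal sketch, assuming each H100 exposes 7 × 1g slices and that per-GPU free/used slice counts are tracked (the dict fields here are made up for illustration, not the driver's actual data model):

```python
def pick_gpu(gpus: list, requested_slices: int):
    """Pick a GPU for a new MIG request, packing onto already-MIG'd GPUs.

    `gpus` is a list of hypothetical dicts with `free_slices` and
    `used_slices` counts. Prefer GPUs that already host MIG devices,
    and among those the one with the least remaining free capacity
    that still fits (best fit), so untouched GPUs stay available
    for full-GPU workloads.
    """
    candidates = [g for g in gpus if g["free_slices"] >= requested_slices]
    if not candidates:
        return None  # request cannot be placed anywhere
    used = [g for g in candidates if g["used_slices"] > 0]
    pool = used if used else candidates
    # best fit: smallest free capacity that still satisfies the request
    return min(pool, key=lambda g: g["free_slices"])
```

With one GPU at 3 used / 4 free slices and another completely free, a 1-slice request lands on the partially used GPU, leaving the other one fully available.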
