No need to run device-plugin manually anymore. #25

rohitagarwal003
Contributor

The device plugin is an addon in GCP now, so this DaemonSet no longer needs to run it. The addon manifest is here:
https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml

In place of the device plugin, a pause container is run, so that the installer can remain an init container.
Also added node affinity so that the pods only run on nodes that have a label with the key 'cloud.google.com/gke-accelerator'.
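
For illustration, a minimal sketch of the layout described above: the installer as an init container, a pause container as the only long-running container, and node affinity on the accelerator label key. This is not the manifest from this PR. The `apiVersion`, namespace, labels, and both image names are placeholders assumed for the example; only the container name, the label key, and the CPU request appear in this thread.

```yaml
# Illustrative sketch only; values marked "assumed" are not taken from this PR.
apiVersion: apps/v1                # assumed; use whatever your cluster version supports
kind: DaemonSet
metadata:
  name: nvidia-driver-installer
  namespace: kube-system           # assumed
spec:
  selector:
    matchLabels:
      k8s-app: nvidia-driver-installer   # assumed label
  template:
    metadata:
      labels:
        k8s-app: nvidia-driver-installer
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # Schedule only on nodes carrying this label key, whatever its value.
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      initContainers:
      # Runs to completion before the pause container starts.
      - name: nvidia-driver-installer
        image: example.com/nvidia-driver-installer:latest   # hypothetical image
        resources:
          requests:
            cpu: 0.15
      containers:
      # Keeps the pod alive after the installer has finished.
      - name: pause
        image: example.com/pause:latest                     # hypothetical image
```

A DaemonSet pod needs at least one regular container, which is why the pause container is there: it does nothing, but it lets the pod stay in the Running state once the init container has completed.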


There are three files now:

  1. device-plugin-daemonset.yaml
  2. daemonset.yaml
  3. daemonset-preloaded.yaml

(1) already existed, but its name no longer makes sense. (2) is a copy of (1).
(3) is a copy of (2), except that the installer image is assumed to be present on
the node. (1) will be deleted once the tests in kubernetes/kubernetes are updated
to point to (2) or (3).
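
One way a manifest can express "the image is assumed to be present on the node" is to set `imagePullPolicy: Never` on the installer container. The fragment below is a sketch of that idea under stated assumptions, not a quote from daemonset-preloaded.yaml; the image tag is hypothetical.

```yaml
# Fragment of a pod spec; illustrative only.
initContainers:
- name: nvidia-driver-installer
  image: nvidia-driver-installer:preloaded   # hypothetical tag of an image already on the node
  # Assumption: with Never, the kubelet does not attempt a pull and the
  # container fails to start if the image is missing from the node.
  imagePullPolicy: Never
```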

name: nvidia-driver-installer
resources:
  requests:
    cpu: 0.15

Collaborator

Nit: should we increase this, based on what we measured in PR kubernetes/kubernetes#53541, where the installer sometimes took up to 2 cores?

Contributor Author

I am not sure. That would in effect reduce the capacity of all GPU nodes by 2 cores.

Collaborator

I have a similar concern. What I am worried about is that the installer may take much longer when the node is short of CPU. Perhaps we can discuss this later, outside this PR.

/lgtm

@rohitagarwal003 rohitagarwal003 merged commit 8dcc32e into GoogleCloudPlatform:master Nov 14, 2017
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request Nov 18, 2017
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Update URLs for nvidia gpu device plugin and nvidia driver installer.

The device plugin is now an addon and its manifest is now in kubernetes/kubernetes. The manifest in
GoogleCloudPlatform/container-engine-accelerators no longer contains the device plugin.

This is needed after #54826 and GoogleCloudPlatform/container-engine-accelerators#25

**Release note**:
```release-note
NONE
```

/sig scheduling