feat(nvidia-device-plugin): add nvidia-device-plugin (#20132)

**Description**
Hello,

This adds the nvidia-device-plugin, preconfigured for 5 vGPUs per physical GPU.

⚒️ Fixes  # <!--(issue)-->

**⚙️ Type of change**

- [X] ⚙️ Feature/App addition
- [ ] 🪛 Bugfix
- [ ] ⚠️ Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [ ] 🔃 Refactor of current code

**🧪 How Has This Been Tested?**
<!--
Please describe the tests that you ran to verify your changes. Provide
instructions so we can reproduce. Please also list any relevant details
for your test configuration
-->

**📃 Notes:**
<!-- Please enter any other relevant information here -->

**✔️ Checklist:**

- [X] ⚖️ My code follows the style guidelines of this project
- [X] 👀 I have performed a self-review of my own code
- [X] #️⃣ I have commented my code, particularly in hard-to-understand
areas
- [X] 📄 I have made corresponding changes to the documentation
- [ ] ⚠️ My changes generate no new warnings
- [ ] 🧪 I have added tests to this description that prove my fix is
effective or that my feature works
- [X] ⬆️ I increased versions for any altered app according to semantic
versioning
- [X] I made sure the title starts with `feat(chart-name):`,
`fix(chart-name):` or `chore(chart-name):`

**App addition**

If this PR is an app addition please make sure you have done the
following.

- [ ] 🖼️ I have added an icon in the Chart's root directory called
`icon.png`

---

_Please don't blindly check all the boxes. Read them and only check
those that apply.
Those checkboxes are there for the reviewer to see what is this all
about and
the status of this PR with a quick glance._

---------

Signed-off-by: bitpushr <91350598+bitpushr@users.noreply.github.com>
Signed-off-by: Kjeld Schouten <info@kjeldschouten.nl>
Co-authored-by: bitpushr <91350598+bitpushr@users.noreply.github.com>
Co-authored-by: Kjeld Schouten <info@kjeldschouten.nl>
ばか雪 2024-04-09 16:51:26 +09:00 committed by GitHub
parent 41f12da1d8
commit 52deaaac46
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
8 changed files with 239 additions and 0 deletions

@ -0,0 +1,30 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
# OWNERS file for Kubernetes
OWNERS
# helm-docs templates
*.gotmpl
# docs folder
/docs
# icon
icon.png

@ -0,0 +1,44 @@
annotations:
  max_scale_version: 24.04.0
  min_scale_version: 23.10.0
  truecharts.org/SCALE-support: "false"
  truecharts.org/category: operators
  truecharts.org/max_helm_version: "3.14"
  truecharts.org/min_helm_version: "3.11"
  truecharts.org/train: system
apiVersion: v2
appVersion: 0.0.1
dependencies:
  - name: common
    version: 20.2.10
    repository: oci://tccr.io/truecharts
    condition: ""
    alias: ""
    tags: []
    import-values: []
  - name: nvidia-device-plugin
    version: 0.14.5
    repository: https://nvidia.github.io/k8s-device-plugin
    condition: ""
    alias: nvdp
    tags: []
    import-values: []
deprecated: false
description: NVIDIA device plugin for Kubernetes
home: https://truecharts.org/charts/system/nvidia-device-plugin
icon: https://truecharts.org/img/hotlink-ok/chart-icons/nvidia-device-plugin.png
keywords:
  - nvidia
  - plugins
kubeVersion: ">=1.24.0-0"
maintainers:
  - name: TrueCharts
    email: info@truecharts.org
    url: https://truecharts.org
name: nvidia-device-plugin
sources:
  - https://cert-manager.io/
  - https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#deployment-via-helm
  - https://github.com/truecharts/charts/tree/master/charts/system/nvidia-device-plugin
type: application
version: 0.14.5

@ -0,0 +1,28 @@
---
title: README
---
## General Info
TrueCharts can be installed as both _normal_ Helm Charts or as Apps on TrueNAS SCALE.
However, only installations using the TrueNAS SCALE Apps system are supported.
For more information about this App, please check the docs on the TrueCharts [website](https://truecharts.org/charts/system/nvidia-device-plugin)
**This chart is not maintained by the upstream project and any issues with the chart should be raised [here](https://github.com/truecharts/charts/issues/new/choose)**
## Support
- Please check our [quick-start guides for TrueNAS SCALE](https://truecharts.org/manual/SCALE/guides/scale-intro).
- See the [Website](https://truecharts.org)
- Check our [Discord](https://discord.gg/tVsPTHWTtr)
- Open an [issue](https://github.com/truecharts/charts/issues/new/choose)
---
## Sponsor TrueCharts
TrueCharts can only exist due to the incredible effort of our staff.
Please consider making a [donation](https://truecharts.org/sponsor) or contributing back to the project any way you can!
_All Rights Reserved - The TrueCharts Project_

@ -0,0 +1,93 @@
# Talos Linux Setup
## Enable NVIDIA kernel modules
Before installing the device plugin, complete the initial setup described in the
[Talos documentation][1]. Make sure the correct system extensions are installed,
using a combination of machine config patches and the correct [factory image][2]
for your use case.
Example `gpu-worker-patch.yaml`:
```yaml
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1
```
### Quick Sanity Check
If the following commands do not produce output similar to this, the base system
is not set up completely:
```
talosctl read /proc/modules
nvidia_uvm 1482752 - - Live 0xffffffffc3b4e000 (PO)
nvidia_drm 73728 - - Live 0xffffffffc3b3b000 (PO)
nvidia_modeset 1290240 - - Live 0xffffffffc39dc000 (PO)
nvidia 56602624 - - Live 0xffffffffc03e0000 (PO)

talosctl get extensions
NODE            NAMESPACE   TYPE              ID            VERSION   NAME                       VERSION
192.168.2.104   runtime     ExtensionStatus   0             1         nonfree-kmod-nvidia        535.129.03-v1.6.7
192.168.2.104   runtime     ExtensionStatus   2             1         nvidia-container-toolkit   535.129.03-v1.13.5
192.168.2.104   runtime     ExtensionStatus   4             1         schematic                  a22f54cdf137d9d058e9a399adecf4bab2f3cc74b15b5bee00005811433e06b0
192.168.2.104   runtime     ExtensionStatus   modules.dep   1         modules.dep                6.1.82-talos
```
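As a rough aid for the check above, the following sketch scans `/proc/modules`-style text for the four kernel modules named in the patch. The function name `check_nvidia_modules` is hypothetical and the sample input is copied from the output above; it is a convenience sketch, not part of Talos or of this chart.

```shell
# Hypothetical helper: reads /proc/modules-style text on stdin and reports
# any of the four NVIDIA modules from the patch that are not listed.
check_nvidia_modules() {
  modules=$(cat)
  missing=0
  for mod in nvidia nvidia_uvm nvidia_drm nvidia_modeset; do
    # Each /proc/modules line starts with the module name followed by a space.
    if ! printf '%s\n' "$modules" | grep -q "^${mod} "; then
      echo "missing module: ${mod}"
      missing=1
    fi
  done
  return "$missing"
}

# Sample input taken from the output shown above.
printf '%s\n' \
  'nvidia_uvm 1482752 - - Live 0xffffffffc3b4e000 (PO)' \
  'nvidia_drm 73728 - - Live 0xffffffffc3b3b000 (PO)' \
  'nvidia_modeset 1290240 - - Live 0xffffffffc39dc000 (PO)' \
  'nvidia 56602624 - - Live 0xffffffffc03e0000 (PO)' \
  | check_nvidia_modules && echo "all NVIDIA modules loaded"
```

On a real node the input would presumably come from `talosctl read /proc/modules | check_nvidia_modules`.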
## Create NVIDIA runtime class:
You will need to set this runtime class on every pod that should receive GPU resources.
```
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
```
### Adding runtimeClass to pods with common
```yaml
workload:
  main:
    podSpec:
      runtimeClassName: "nvidia"
      containers:
        ...
```
## Create nvidia-device-plugin namespace & enable privileged podsecurity
*Note: Privileged pod security is only required if you want multiple GPU resources per physical GPU. If you are happy with a 1-to-1 GPU-to-pod mapping, you can just create the namespace without the privileged label. In that case you will also need to turn off the `gfd` setting described below.*
```
kubectl create namespace nvidia-device-plugin
kubectl label namespace nvidia-device-plugin pod-security.kubernetes.io/enforce=privileged
```
## Install nvidia-device-plugin from kubeapps
There are notes in values.yaml, but the following defines how many resources are made per GPU:
```yaml
resources:
  - name: nvidia.com/gpu
    replicas: 5
```
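To make the effect of `replicas` concrete, here is a minimal arithmetic sketch. It assumes every physical GPU on the node is shared with the same `replicas` value, as in the `values.yaml` example (2 physical GPUs with 5 replicas give 10 schedulable GPU resources). The function name `expected_gpu_resources` is hypothetical.

```shell
# Hypothetical helper: expected nvidia.com/gpu count advertised by a node,
# assuming every physical GPU is shared with the same replicas value.
expected_gpu_resources() {
  physical_gpus=$1
  replicas=$2
  echo $((physical_gpus * replicas))
}

expected_gpu_resources 2 5   # prints 10
```

After installation, the result can be compared with what the node reports, for example with `kubectl get node <node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'`, where `<node-name>` is a placeholder for your GPU node.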
*Note: If you do not want multiple GPU resources per physical GPU, set `replicas` to 1 and change `enabled` in the following block to `false`.*
```yaml
gfd:
  enabled: true
```
### Enable GPU in values.yaml
```yaml
resources:
  limits:
    nvidia.com/gpu: 1
```
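For charts that do not use common, the same two settings (the `nvidia` runtime class and the `nvidia.com/gpu` limit) can be combined in a plain pod manifest. The sketch below only writes such a manifest to a file. The pod name `gpu-test`, the file name, and the CUDA image tag are assumptions chosen for illustration and are not part of this chart.

```shell
# Write a hypothetical smoke-test pod manifest that combines the runtime
# class and the GPU resource limit described above.
cat > gpu-test-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda
      # Assumption: any CUDA base image that ships nvidia-smi should work here.
      image: nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

echo "wrote gpu-test-pod.yaml"
```

The manifest could then be applied with `kubectl apply -f gpu-test-pod.yaml`, and `kubectl logs gpu-test` should show the `nvidia-smi` table if the GPU was assigned.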
[1]: https://www.talos.dev/v1.6/talos-guides/configuration/nvidia-gpu-proprietary/
[2]: https://factory.talos.dev

Binary file not shown.

@ -0,0 +1 @@
{{- include "tc.v1.common.lib.chart.notes" $ -}}

@ -0,0 +1,5 @@
{{/* Make sure all variables are set properly */}}
{{- include "tc.v1.common.loader.init" . }}
{{/* Render the templates */}}
{{ include "tc.v1.common.loader.apply" . }}

@ -0,0 +1,38 @@
image:
  repository: tccr.io/tccr/scratch
  pullPolicy: IfNotPresent
  tag: latest
# don't install this unless you've followed our docs or talos docs!
# ref - https://www.talos.dev/v1.6/talos-guides/configuration/nvidia-gpu-proprietary/
# need to override naming
configmap:
  nvidia-device-plugin-configs:
    enabled: true
    expandObjectName: false
    data:
      # set "replicas" key to number of shares per original resource
      # example: 2 physical GPU * 5 replica = 10vGPU
      config: |
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 5
nvdp:
  runtimeClassName: nvidia
  gfd:
    enabled: true
# don't edit below here
service:
  main:
    enabled: false
    ports:
      main:
        enabled: false
workload:
  main:
    enabled: false