feat(nvidia-device-plugin): add nvidia-device-plugin (#20132)
**Description**
Hello,
This adds the nvidia-device-plugin, preconfigured for 5 vGPUs per physical GPU.

⚒️ Fixes # <!--(issue)-->

**⚙️ Type of change**
- [X] ⚙️ Feature/App addition
- [ ] 🪛 Bugfix
- [ ] ⚠️ Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] 🔃 Refactor of current code

**🧪 How Has This Been Tested?**

**📃 Notes:**

**✔️ Checklist:**
- [X] ⚖️ My code follows the style guidelines of this project
- [X] 👀 I have performed a self-review of my own code
- [X] #️⃣ I have commented my code, particularly in hard-to-understand areas
- [X] 📄 I have made corresponding changes to the documentation
- [ ] ⚠️ My changes generate no new warnings
- [ ] 🧪 I have added tests to this description that prove my fix is effective or that my feature works
- [X] ⬆️ I increased versions for any altered app according to semantic versioning
- [X] I made sure the title starts with `feat(chart-name):`, `fix(chart-name):` or `chore(chart-name):`

**➕ App addition**
If this PR is an app addition please make sure you have done the following.
- [ ] 🖼️ I have added an icon in the Chart's root directory called `icon.png`

---
_Please don't blindly check all the boxes. Read them and only check those that apply. Those checkboxes are there for the reviewer to see what is this all about and the status of this PR with a quick glance._

---------
Signed-off-by: bitpushr <91350598+bitpushr@users.noreply.github.com>
Signed-off-by: Kjeld Schouten <info@kjeldschouten.nl>
Co-authored-by: bitpushr <91350598+bitpushr@users.noreply.github.com>
Co-authored-by: Kjeld Schouten <info@kjeldschouten.nl>
parent 41f12da1d8
commit 52deaaac46

@ -0,0 +1,30 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
# OWNERS file for Kubernetes
OWNERS
# helm-docs templates
*.gotmpl
# docs folder
/docs
# icon
icon.png

@ -0,0 +1,44 @@
annotations:
  max_scale_version: 24.04.0
  min_scale_version: 23.10.0
  truecharts.org/SCALE-support: "false"
  truecharts.org/category: operators
  truecharts.org/max_helm_version: "3.14"
  truecharts.org/min_helm_version: "3.11"
  truecharts.org/train: system
apiVersion: v2
appVersion: 0.0.1
dependencies:
  - name: common
    version: 20.2.10
    repository: oci://tccr.io/truecharts
    condition: ""
    alias: ""
    tags: []
    import-values: []
  - name: nvidia-device-plugin
    version: 0.14.5
    repository: https://nvidia.github.io/k8s-device-plugin
    condition: ""
    alias: nvdp
    tags: []
    import-values: []
deprecated: false
description: NVIDIA device plugin for Kubernetes
home: https://truecharts.org/charts/system/nvidia-device-plugin
icon: https://truecharts.org/img/hotlink-ok/chart-icons/nvidia-device-plugin.png
keywords:
  - nvidia
  - plugins
kubeVersion: ">=1.24.0-0"
maintainers:
  - name: TrueCharts
    email: info@truecharts.org
    url: https://truecharts.org
name: nvidia-device-plugin
sources:
  - https://cert-manager.io/
  - https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#deployment-via-helm
  - https://github.com/truecharts/charts/tree/master/charts/system/nvidia-device-plugin
type: application
version: 0.14.5

@ -0,0 +1,28 @@
---
title: README
---

## General Info

TrueCharts can be installed as both _normal_ Helm Charts or as Apps on TrueNAS SCALE.
However, only installations using the TrueNAS SCALE Apps system are supported.

For more information about this App, please check the docs on the TrueCharts [website](https://truecharts.org/charts/system/nvidia-device-plugin)

**This chart is not maintained by the upstream project and any issues with the chart should be raised [here](https://github.com/truecharts/charts/issues/new/choose)**

## Support

- Please check our [quick-start guides for TrueNAS SCALE](https://truecharts.org/manual/SCALE/guides/scale-intro).
- See the [Website](https://truecharts.org)
- Check our [Discord](https://discord.gg/tVsPTHWTtr)
- Open an [issue](https://github.com/truecharts/charts/issues/new/choose)

---

## Sponsor TrueCharts

TrueCharts can only exist due to the incredible effort of our staff.
Please consider making a [donation](https://truecharts.org/sponsor) or contributing back to the project any way you can!

_All Rights Reserved - The TrueCharts Project_

@ -0,0 +1,93 @@
# Talos Linux Setup

## Enable NVIDIA kernel modules
Before installing the device plugin, complete the initial steps described in the
[Talos documentation][1]. Make sure you have installed the correct system
extensions through a combination of patches and the correct [factory image][2] for your
use case.

Example `gpu-worker-patch.yaml`:
```yaml
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1
```

### Quick Sanity Check
If running these commands does not produce similar output, the base
system is not set up completely:
```
❯ talosctl read /proc/modules
nvidia_uvm 1482752 - - Live 0xffffffffc3b4e000 (PO)
nvidia_drm 73728 - - Live 0xffffffffc3b3b000 (PO)
nvidia_modeset 1290240 - - Live 0xffffffffc39dc000 (PO)
nvidia 56602624 - - Live 0xffffffffc03e0000 (PO)

❯ talosctl get extensions
NODE            NAMESPACE   TYPE              ID            VERSION   NAME                       VERSION
192.168.2.104   runtime     ExtensionStatus   0             1         nonfree-kmod-nvidia        535.129.03-v1.6.7
192.168.2.104   runtime     ExtensionStatus   2             1         nvidia-container-toolkit   535.129.03-v1.13.5
192.168.2.104   runtime     ExtensionStatus   4             1         schematic                  a22f54cdf137d9d058e9a399adecf4bab2f3cc74b15b5bee00005811433e06b0
192.168.2.104   runtime     ExtensionStatus   modules.dep   1         modules.dep                6.1.82-talos
```

## Create NVIDIA runtime class
You will need to add this runtime class to every pod that should receive GPU resources.
```
❯ cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
```

### Adding runtimeClass to pods with common
```yaml
workload:
  main:
    podSpec:
      runtimeClassName: "nvidia"
      containers:
        ...
```
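
For workloads that are not deployed through the common library, the same idea can be sketched as a plain Pod manifest. This is only a minimal sketch: the pod name and the image are hypothetical placeholders, and the GPU limit assumes the device plugin is already advertising `nvidia.com/gpu` on the node.

```yaml
# Minimal sketch; the pod name and image are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                # hypothetical name
spec:
  runtimeClassName: nvidia      # must match the RuntimeClass created above
  restartPolicy: Never
  containers:
    - name: main
      image: registry.example.com/cuda-app:latest   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1     # resource advertised by the device plugin
```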

## Create nvidia-device-plugin namespace & enable privileged podsecurity

*Note: The privileged label is only required if you want multiple GPU resources per physical GPU. If you are happy with a 1-to-1 GPU-to-pod mapping, you can just create the namespace; it won't need privileges. In that case you will also need to turn off a setting described below.*

```
❯ kubectl create namespace nvidia-device-plugin
❯ kubectl label namespace nvidia-device-plugin pod-security.kubernetes.io/enforce=privileged
```
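
If you prefer to keep the namespace under version control, the two commands above can be approximated with a single declarative manifest. This is a sketch of the equivalent object, not something the chart ships:

```yaml
# Declarative equivalent of the create + label commands above.
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-device-plugin
  labels:
    # Only needed when sharing one physical GPU between several pods.
    pod-security.kubernetes.io/enforce: privileged
```

Apply it with `kubectl apply -f` in the same way as the RuntimeClass.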

## Install nvidia-device-plugin from kubeapps
There are notes in values.yaml, but the following setting defines how many GPU resources are advertised per physical GPU:
```yaml
resources:
  - name: nvidia.com/gpu
    replicas: 5
```
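
In the chart's values.yaml this list sits inside the `config` key of the `nvidia-device-plugin-configs` configmap. The fragment below shows it in that context, with the arithmetic from the values.yaml comment spelled out; treat it as an illustration of where the setting lives rather than a complete values file.

```yaml
configmap:
  nvidia-device-plugin-configs:
    data:
      config: |
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                # shares per physical GPU:
                # 2 physical GPUs * 5 replicas = 10 nvidia.com/gpu resources
                replicas: 5
```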

*Note: If you do not want multiple GPU resources per physical GPU, set replicas to 1 and change the following line to false.*
```yaml
gfd:
  enabled: true
```
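
Put together, a values override for a 1-to-1 mapping might look like the sketch below. It assumes the key layout of this chart's values.yaml (the `nvdp` alias and the `nvidia-device-plugin-configs` configmap) and relies on Helm merging these keys with the chart defaults.

```yaml
# Sketch of a values override for 1 GPU resource per physical GPU.
configmap:
  nvidia-device-plugin-configs:
    data:
      config: |
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 1     # one resource per physical GPU
nvdp:
  gfd:
    enabled: false              # the line referred to in the note above
```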

### Enable GPU in values.yaml
```yaml
resources:
  limits:
    nvidia.com/gpu: 1
```
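
Combined with the runtime class from the earlier section, the values of a chart that consumes the GPU could look roughly like this. It is a sketch that assumes `resources` sits at the top level of that chart's values, which the snippet above suggests but does not state explicitly.

```yaml
# Sketch: values for a common-based chart that should receive one GPU resource.
workload:
  main:
    podSpec:
      runtimeClassName: "nvidia"   # RuntimeClass created earlier
resources:
  limits:
    nvidia.com/gpu: 1              # one (possibly time-sliced) GPU resource
```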

[1]: https://www.talos.dev/v1.6/talos-guides/configuration/nvidia-gpu-proprietary/
[2]: https://factory.talos.dev

Binary file not shown. Size: 5.0 KiB

@ -0,0 +1 @@
{{- include "tc.v1.common.lib.chart.notes" $ -}}

@ -0,0 +1,5 @@
{{/* Make sure all variables are set properly */}}
{{- include "tc.v1.common.loader.init" . }}

{{/* Render the templates */}}
{{ include "tc.v1.common.loader.apply" . }}

@ -0,0 +1,38 @@
image:
  repository: tccr.io/tccr/scratch
  pullPolicy: IfNotPresent
  tag: latest

# don't install this unless you've followed our docs or talos docs!
# ref - https://www.talos.dev/v1.6/talos-guides/configuration/nvidia-gpu-proprietary/

# need to override naming
configmap:
  nvidia-device-plugin-configs:
    enabled: true
    expandObjectName: false
    data:
      # set "replicas" key to number of shares per original resource
      # example: 2 physical GPU * 5 replica = 10vGPU
      config: |
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 5

nvdp:
  runtimeClassName: nvidia
  gfd:
    enabled: true
# don't edit below here
service:
  main:
    enabled: false
    ports:
      main:
        enabled: false
workload:
  main:
    enabled: false