Comprehensive Guide: Running GPU Workloads on Kubernetes

Kubernetes
GPU
Cloud Computing
DevOps
Deep dive into running and managing GPU workloads on Kubernetes clusters, including setup, optimization, and best practices
Published

December 6, 2024

Kunal Das, Author


Understanding GPU Computing in Kubernetes

Architectural Overview

GPU (Graphics Processing Unit) computing leverages specialized processors originally designed for rendering graphics to perform parallel computations. Unlike CPUs, which are optimized for sequential processing with a small number of powerful cores, GPUs contain thousands of smaller cores optimized for handling many tasks simultaneously.

Key Components in Kubernetes GPU Architecture:

  1. Hardware Layer
    • Physical GPU cards (e.g., NVIDIA Tesla V100, A100)
    • PCIe interface connection
    • GPU memory (VRAM)
    • CUDA cores for parallel processing
  2. Driver Layer
    • NVIDIA drivers: Interface between hardware and software
    • CUDA toolkit: Programming interface for GPU computing
    • Container runtime hooks: Enable GPU access from containers
  3. Container Runtime Layer
    • NVIDIA Container Toolkit (successor to nvidia-docker2)
    • Container runtime (containerd/Docker)
    • Device plugin interface
  4. Kubernetes Layer
    • NVIDIA Device Plugin
    • Kubernetes scheduler
    • Resource allocation system

How GPU Scheduling Works in Kubernetes

  1. Resource Advertisement

    • The NVIDIA Device Plugin runs as a DaemonSet
    • Discovers GPUs on each node
    • Advertises GPU resources to Kubernetes API server
    • Updates node capacity with nvidia.com/gpu resource
  2. Resource Scheduling Process

    Pod Request → Scheduler → Device Plugin → GPU Allocation
    1. Pod makes GPU request through resource limits
    2. Kubernetes scheduler finds eligible nodes
    3. Device Plugin handles GPU assignment
    4. Container runtime configures GPU access
  3. GPU Resource Isolation

    • Each container gets exclusive access to assigned GPUs
    • GPU memory is not oversubscribed
    • NVIDIA driver enforces hardware-level isolation
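You can confirm the advertisement step above by checking what each node reports as allocatable (a quick sketch; the dotted resource name is escaped for JSONPath, so adjust the quoting to your shell):

# Show how many nvidia.com/gpu devices each node advertises as allocatable
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'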

Deep Dive: NVIDIA Device Plugin Workflow

  1. Initialization Phase

    Device Plugin Start → GPU Discovery → Resource Registration
    • Scans system for NVIDIA GPUs
    • Creates socket for Kubernetes communication
    • Registers as device plugin with kubelet
  2. Operation Phase

    List GPUs → Monitor Health → Handle Allocation
    • Maintains list of available GPUs
    • Monitors GPU health and status
    • Handles allocation requests from kubelet
  3. Resource Management

    Pod Request → Allocation → Environment Setup → Container Start
    • Maps GPU devices to containers
    • Sets up NVIDIA runtime environment
    • Configures container GPU access

Multi-Instance GPU (MIG) Technology

For NVIDIA A100 GPUs, MIG allows:

  1. Partitioning
    • A single GPU can be split into up to 7 instances
    • Each instance has dedicated compute resources, memory, memory bandwidth, and cache
  2. Isolation Levels
    • Hardware-level isolation
    • Memory protection
    • Error containment
    • QoS guarantee
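On a node with MIG-capable GPUs, partitioning is done with nvidia-smi. The following is a sketch for GPU 0 and the 1g.5gb profile; in Kubernetes this is usually automated by the NVIDIA GPU Operator's MIG manager rather than run by hand:

# Enable MIG mode on GPU 0 (takes effect after a GPU reset / node drain)
sudo nvidia-smi -i 0 -mig 1

# Create GPU instances from the 1g.5gb profile and matching compute instances
sudo nvidia-smi mig -cgi 1g.5gb -C

# List the resulting MIG devices
nvidia-smi -L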

GPU Memory Management

  1. Allocation Modes

    • Exclusive process mode
    • Time-slicing mode (shared)
    • MIG mode (for A100)
  2. Memory Hierarchy

    Global Memory → L2 Cache → L1 Cache → CUDA Cores
  3. Resource Limits

    • GPU memory is not oversubscribed
    • Containers see full GPU memory
    • Memory limits enforced by NVIDIA driver
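To see how much of a GPU's memory a workload is actually using, nvidia-smi can be queried from inside the pod or on the node:

# Per-GPU memory usage
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

# Per-process memory usage on the GPU
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv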


Understanding the GPU Workflow in Kubernetes

1. Pod Scheduling Flow

Pod Creation → Resource Check → Node Selection → GPU Binding → Container Start
  1. Pod Creation
    • User creates pod with GPU requirements
    • Scheduler receives pod specification
  2. Resource Check
    • Scheduler checks GPU availability
    • Validates node constraints and taints
    • Considers topology requirements
  3. Node Selection
    • Scheduler selects optimal node
    • Considers GPU availability and type
    • Evaluates other scheduling constraints
  4. GPU Binding
    • Device plugin assigns specific GPUs
    • Sets up environment variables
    • Configures container runtime
  5. Container Start
    • Container runtime initializes with GPU access
    • NVIDIA driver provides GPU interface
    • Application gains GPU access
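One practical way to watch this flow is to follow the events and final placement of a GPU pod (assuming a pod named gpu-pod, as in the example later in this post):

# Follow scheduling, image pull, and container start events for the pod
kubectl get events --field-selector involvedObject.name=gpu-pod --watch

# Confirm which node the pod landed on and which GPU it can see
kubectl get pod gpu-pod -o wide
kubectl exec gpu-pod -- nvidia-smi -L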

2. Data Flow in GPU Computing

Application → CUDA API → Driver → GPU → Memory → Results
  1. Application Layer
    • Makes CUDA API calls
    • Manages data transfers
    • Orchestrates computations
  2. CUDA Layer
    • Translates API calls to driver commands
    • Manages memory transfers
    • Handles kernel execution
  3. Driver Layer
    • Controls GPU hardware
    • Manages memory allocation
    • Schedules operations
  4. Hardware Layer
    • Executes CUDA kernels
    • Performs memory operations
    • Returns results

3. Resource Lifecycle

Allocation → Usage → Release → Cleanup
  1. Resource Allocation
    • Kubernetes reserves GPU
    • Device plugin configures access
    • Container gets exclusive use
  2. Resource Usage
    • Application uses GPU
    • Monitoring tracks utilization
    • Resource limits enforced
  3. Resource Release
    • Pod termination triggers release
    • GPU returned to pool
    • Resources cleaned up
  4. Cleanup Process
    • Memory cleared
    • GPU state reset
    • Resources marked available
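You can observe the release step by comparing a node's allocated GPUs before and after deleting a pod (substitute your node name):

# GPUs currently allocated to pods on the node
kubectl describe node <gpu-node-name> | grep -A 8 "Allocated resources"

# Delete the GPU pod and check that the nvidia.com/gpu allocation drops back
kubectl delete pod gpu-pod
kubectl describe node <gpu-node-name> | grep -A 8 "Allocated resources"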


Best Practices and Guidelines

Resource Optimization

  1. Batch Processing
    • Group similar workloads
    • Use job queues effectively
    • Implement proper backoff strategies
  2. Memory Management
    • Monitor GPU memory usage
    • Implement proper cleanup
    • Use appropriate batch sizes
  3. Compute Optimization
    • Use optimal CUDA algorithms
    • Balance CPU and GPU work
    • Minimize data transfers
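For the batch-processing and backoff points above, a Kubernetes Job with an explicit retry budget is a reasonable pattern. This is a sketch that reuses the taint/toleration convention used later in this post:

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-batch
spec:
  backoffLimit: 3              # retry failed pods at most 3 times
  template:
    spec:
      restartPolicy: Never
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      containers:
      - name: worker
        image: nvidia/cuda:11.8.0-base-ubuntu22.04
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1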

Enough talk, let’s dig in

Prerequisites

Before starting, ensure you have:

  • An active Kubernetes cluster (e.g., an AKS cluster)
  • The kubectl command-line tool installed and configured
  • Access to create or modify node pools
  • Helm v3 installed (for the NVIDIA device plugin installation)

GPU Node Pool Setup

1. Select appropriate GPU VM size

Choose a GPU-enabled VM size based on your workload requirements. Common options include:

  • Standard_NC6s_v3 (1 NVIDIA Tesla V100)
  • Standard_NC24rs_v3 (4 NVIDIA Tesla V100 with RDMA)
  • Standard_ND96asr_v4 (8 NVIDIA A100)

2. Create GPU node pool

az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunodepool \
    --node-count 1 \
    --node-vm-size Standard_NC6s_v3 \
    --node-taints sku=gpu:NoSchedule \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3

Key parameters explained:

  • --node-taints: Ensures only pods with a matching toleration (i.e., GPU workloads) are scheduled on these expensive nodes
  • --enable-cluster-autoscaler: Automatically scales the node pool based on demand
  • --min-count and --max-count: Define the scaling boundaries

NVIDIA Driver Installation

Option 1: Default AKS Installation

By default, AKS automatically installs NVIDIA drivers on GPU-capable nodes. This is the recommended approach for most users.

Option 2: Manual Installation (if needed)

If you need to manually install or update drivers:

  1. Connect to the node:
# Get node name
kubectl get nodes

# Connect to node (requires SSH access)
ssh username@node-ip
  2. Install NVIDIA drivers:
# Update package list
sudo apt update

# Install ubuntu-drivers utility
sudo apt install -y ubuntu-drivers-common

# Install recommended NVIDIA drivers
sudo ubuntu-drivers install

# Reboot node
sudo reboot

NVIDIA Device Plugin Installation

The NVIDIA device plugin is required for Kubernetes to recognize and schedule GPU resources.
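One common way to install it is via the upstream Helm chart (a sketch; pin a chart version in production and adjust the namespace to your conventions):

# Add the NVIDIA device plugin Helm repository and deploy the plugin DaemonSet
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvdp nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --create-namespace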

Verification Steps

  1. Verify node GPU capability:
kubectl get nodes -o wide
kubectl describe node <gpu-node-name>

Look for the following in the output:

Capacity:
  nvidia.com/gpu: 1
Allocatable:
  nvidia.com/gpu: 1
  2. Test GPU detection:
# Create a test pod
kubectl run nvidia-smi --rm -it \
    --image=nvidia/cuda:11.8.0-base-ubuntu22.04 \
    --limits=nvidia.com/gpu=1 \
    --command -- nvidia-smi

Note: because the node pool above is tainted with sku=gpu:NoSchedule, this one-off pod needs a matching toleration to schedule; on newer kubectl versions the --limits flag is also deprecated and may be ignored. If either is an issue, apply a small manifest like the one in the next section instead.

Running GPU Workloads

Example GPU Workload YAML

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["nvidia-smi", "-l", "30"]
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"

Key components explained:

  • resources.limits.nvidia.com/gpu: Specifies the GPU requirement
  • tolerations: Matches the node taint so the pod can be scheduled
  • image: Use a CUDA-compatible container image

Deploy the workload:

kubectl apply -f gpu-workload.yaml

Monitoring and Troubleshooting

Monitor GPU Usage

Using Container Insights, you can monitor:

  • GPU duty cycle
  • GPU memory usage
  • Number of GPUs allocated/available

Key metrics:

  • containerGpuDutyCycle
  • containerGpumemoryUsedBytes
  • nodeGpuAllocatable

Common Troubleshooting Steps

  1. Check GPU driver status:
kubectl exec -it <pod-name> -- nvidia-smi
  2. Verify the NVIDIA device plugin:
kubectl get pods -n nvidia-device-plugin
  3. Check pod events:
kubectl describe pod <pod-name>

Best Practices and Advanced Considerations

1. Resource Management Strategy

Optimal Resource Allocation

  • Define explicit GPU resource requests and limits in pod specifications
  • Implement resource quotas at namespace level to control GPU allocation
  • Use pod priority classes for critical GPU workloads
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  priorityClassName: high-priority-gpu
  containers:
  - name: gpu-container
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
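The high-priority-gpu class referenced above has to exist; a minimal definition looks like this (the value is arbitrary and only meaningful relative to your other priority classes):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-gpu
value: 1000000
globalDefault: false
description: "Priority class for critical GPU workloads"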

Node Management

  • Implement node taints and tolerations for GPU nodes
# Node taint example
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule

# Pod toleration example
tolerations:
- key: "gpu"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
  • Use node labels for GPU-specific workload targeting
  • Configure node affinity rules for specialized GPU workloads
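For the label and affinity points above, here is a sketch using a hypothetical gpu-sku label that you apply yourself:

# Label GPU nodes by SKU (gpu-sku is a made-up label key)
# kubectl label nodes gpu-node-1 gpu-sku=v100

# Pod spec fragment: require a node with one of the listed GPU SKUs
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: gpu-sku
          operator: In
          values: ["v100", "a100"]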

2. Container Optimization

Image Management

  • Base image selection:
    • Use official NVIDIA CUDA images as base
    • Choose appropriate CUDA version for your workload
    • Consider slim variants for reduced image size

Build Optimization

  • Implement multi-stage builds to minimize image size
  • Include only necessary CUDA libraries
  • Cache commonly used data in persistent volumes
# Example multi-stage build
FROM nvidia/cuda:11.8.0-base-ubuntu22.04 as builder
# Build steps...

FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04
COPY --from=builder /app/binary /app/

3. Performance Monitoring and Optimization

Metrics Collection

  • Implement comprehensive monitoring:
    • GPU utilization percentage
    • Memory usage patterns
    • Temperature and power consumption
    • Error rates and throttling events
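For Prometheus-based collection, NVIDIA's DCGM exporter is a common choice. Below is a sketch of its upstream Helm install; note that the metric names it exposes (e.g., DCGM_FI_DEV_GPU_UTIL) differ from the nvidia_gpu_duty_cycle metric used in the alert below, so adjust alert expressions to whichever exporter you deploy:

# Install NVIDIA's DCGM exporter to expose GPU metrics to Prometheus
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter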

Alert Configuration

# Example PrometheusRule for GPU alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: HighGPUUsage
      expr: nvidia_gpu_duty_cycle > 90
      for: 10m

4. Cost Management

Resource Scheduling

  • Implement spot instances for fault-tolerant workloads
  • Use node auto-scaling based on GPU demand
  • Configure pod disruption budgets for critical workloads
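A PodDisruptionBudget for a critical GPU deployment might look like this (assuming pods labeled app: gpu-worker):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gpu-worker-pdb
spec:
  minAvailable: 1              # keep at least one GPU worker running during voluntary disruptions
  selector:
    matchLabels:
      app: gpu-worker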

Workload Optimization

# Example CronJob for off-peak processing
apiVersion: batch/v1
kind: CronJob
metadata:
  name: gpu-batch-job
spec:
  schedule: "0 2 * * *"  # Run at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: gpu-processor
            image: nvidia/cuda:11.8.0-base-ubuntu22.04
            resources:
              limits:
                nvidia.com/gpu: 1

5. Security Implementation

Access Control

  • Implement RBAC to control who can create GPU workloads; RBAC cannot restrict access by resource type such as nvidia.com/gpu, so pair it with a namespace ResourceQuota (see the sketch below) to cap actual GPU consumption
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gpu-user
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create", "get", "list"]
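The GPU count itself is capped with a namespace ResourceQuota (a sketch; the namespace name is illustrative):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team             # illustrative namespace
spec:
  hard:
    requests.nvidia.com/gpu: 4   # at most 4 GPUs requested across the namespace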

Runtime Security

  • Enable SecurityContext for GPU containers
  • Implement network policies for GPU workloads
  • Regular security scanning of GPU container images

6. Advanced Scenarios

Multi-Instance GPU (MIG) Configuration

# Example MIG profile configuration
apiVersion: v1
kind: Pod
metadata:
  name: mig-workload
spec:
  containers:
  - name: gpu-container
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1

RDMA for High-Performance Computing

  • Configure RDMA-capable networks
  • Optimize for low-latency communication
  • Implement proper RDMA security controls

Development Environment Optimization

  • Set up GPU sharing for development teams
  • Implement resource quotas per development namespace
  • Create development-specific GPU profiles
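One way to share GPUs across a development team is the device plugin's time-slicing configuration, which advertises each physical GPU as several schedulable replicas. The sketch below shows only the config file format; how it is passed to the plugin (for example via the Helm chart's config options) depends on your setup, and time-slicing does not isolate GPU memory between pods:

# Device plugin config: expose each physical GPU as 4 nvidia.com/gpu replicas
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4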

Finishing Points

  1. Regular Maintenance Checklist
    • Weekly driver updates verification
    • Monthly performance baseline checking
    • Quarterly capacity planning review
  2. Documentation Requirements
    • Maintain GPU allocation policies
    • Document troubleshooting procedures
    • Keep upgrade procedures updated
  3. Future Considerations
    • Plan for GPU architecture upgrades
    • Evaluate emerging GPU technologies
    • Monitor Kubernetes GPU feature development
  4. Emergency Procedures
    • GPU failure handling protocol
    • Workload failover procedures
    • Emergency contact information