k8s_gpu_pod_troubleshooter

Official

Diagnose Kubernetes GPU pod issues.

AuthorProject-HAMi
Version1.0.0
Installs0

System Documentation

What problem does it solve?

This Skill helps diagnose and resolve complex issues related to GPU pod scheduling, allocation, and runtime failures within Kubernetes clusters managed by HAMi.

Core Features & Use Cases

  • Component Health Check: Verifies the status of HAMi scheduler, device plugins, and webhooks.
  • Resource Analysis: Assesses GPU node capacity, allocatable resources, and pending pod requests.
  • Troubleshooting Workflow: Guides users through steps to identify root causes of GPU-related CrashLoopBackOff, pending pods, and scheduling conflicts.
  • Use Case: A data science team is experiencing CrashLoopBackOff errors for their GPU-accelerated training pods. This Skill can systematically identify whether the issue stems from HAMi component failures, insufficient GPU resources, incorrect webhook configurations, or driver mismatches on the nodes.

Quick Start

Use the k8s_gpu_pod_troubleshooter skill to diagnose why my GPU pods are stuck in pending status in the 'ai-ml' namespace.

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: k8s_gpu_pod_troubleshooter
Download link: https://github.com/Project-HAMi/HAMi/archive/main.zip#k8s-gpu-pod-troubleshooter

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 223,000+ vetted skills library on demand.