k8s_gpu_pod_troubleshooter
OfficialDiagnose Kubernetes GPU pod issues.
AuthorProject-HAMi
Version1.0.0
Installs0
System Documentation
What problem does it solve?
This Skill helps diagnose and resolve complex issues related to GPU pod scheduling, allocation, and runtime failures within Kubernetes clusters managed by HAMi.
Core Features & Use Cases
- Component Health Check: Verifies the status of HAMi scheduler, device plugins, and webhooks.
- Resource Analysis: Assesses GPU node capacity, allocatable resources, and pending pod requests.
- Troubleshooting Workflow: Guides users through steps to identify root causes of GPU-related
CrashLoopBackOff, pending pods, and scheduling conflicts. - Use Case: A data science team is experiencing
CrashLoopBackOfferrors for their GPU-accelerated training pods. This Skill can systematically identify whether the issue stems from HAMi component failures, insufficient GPU resources, incorrect webhook configurations, or driver mismatches on the nodes.
Quick Start
Use the k8s_gpu_pod_troubleshooter skill to diagnose why my GPU pods are stuck in pending status in the 'ai-ml' namespace.
Dependency Matrix
Required Modules
None requiredComponents
references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: k8s_gpu_pod_troubleshooter Download link: https://github.com/Project-HAMi/HAMi/archive/main.zip#k8s-gpu-pod-troubleshooter Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
Agent Skills Search Helper
Install a tiny helper to your Agent, search and equip skill from 223,000+ vetted skills library on demand.