volcano-gang-scheduling
Official
Debug Volcano Gang Scheduling issues
Author: scitix
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill helps diagnose and resolve issues where Volcano PodGroups fail to schedule because their Gang Scheduling constraints (like minMember or minResources) cannot be met simultaneously by the cluster.
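To make the "simultaneously" part concrete, here is a minimal sketch of an all-or-nothing placement check. It is a simplified model, not Volcano's actual scheduling algorithm: it assumes identical pods, a single resource dimension (CPU cores), and ignores affinity, taints, and queues. The function name `gang_can_schedule` and its parameters are hypothetical.

```python
def gang_can_schedule(node_free_cpu, pod_cpu, min_member):
    """Return True if at least min_member identical pods fit at the same time.

    Simplified illustration only: one resource dimension, identical pods.
    """
    # A pod cannot span nodes, so each node contributes floor(free / request) slots.
    slots = sum(int(free // pod_cpu) for free in node_free_cpu)
    # Gang semantics: either min_member pods can all be placed, or none are.
    return slots >= min_member


# Three nodes with 6, 6 and 3 free CPU cores; each pod requests 4 cores.
print(gang_can_schedule([6, 6, 3], pod_cpu=4, min_member=2))  # True
print(gang_can_schedule([6, 6, 3], pod_cpu=4, min_member=3))  # False
```

In the second call the cluster has 15 free cores in total, more than the 12 requested, yet only two pods fit because the free capacity is split across nodes. Under this model the whole group would stay pending.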
Core Features & Use Cases
- Gang Scheduling Diagnosis: Identifies why PodGroups remain pending due to unmet simultaneous scheduling requirements.
- Resource Analysis: Checks cluster and queue resources against PodGroup demands.
- Event Interpretation: Parses Kubernetes and Volcano events for specific Gang scheduling errors.
- Use Case: When your distributed training jobs (e.g., PyTorch, TensorFlow) using Volcano are stuck with Pending pods, this Skill guides you through finding out whether the cause is insufficient simultaneous resources, resource fragmentation, or queue limitations.
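The three causes named in the use case can be told apart with a rough decision procedure, sketched below. This is an illustration under stated assumptions, not the Skill's actual scripts or Volcano's logic: it models a single resource (GPUs), identical pods, and a queue with a known amount of remaining capacity. The function `diagnose` and all parameter names are hypothetical.

```python
def diagnose(node_free_gpu, pod_gpu, min_member, queue_free_gpu):
    """Classify why a gang of identical pods might stay pending.

    Simplified illustration: one resource dimension, integer GPU counts.
    """
    demand = pod_gpu * min_member

    # 1. The queue's remaining capacity is smaller than the gang's demand.
    if demand > queue_free_gpu:
        return "queue limit"

    # 2. Even the sum of all free GPUs in the cluster is too small.
    if demand > sum(node_free_gpu):
        return "insufficient resources"

    # 3. Enough GPUs exist in total, but not in large enough per-node chunks.
    slots = sum(free // pod_gpu for free in node_free_gpu)
    if slots < min_member:
        return "fragmentation"

    return "schedulable"


print(diagnose([4, 4], pod_gpu=4, min_member=2, queue_free_gpu=4))      # queue limit
print(diagnose([4, 2], pod_gpu=4, min_member=2, queue_free_gpu=16))     # insufficient resources
print(diagnose([6, 6, 4], pod_gpu=8, min_member=2, queue_free_gpu=16))  # fragmentation
print(diagnose([8, 8], pod_gpu=8, min_member=2, queue_free_gpu=16))     # schedulable
```

The order of the checks is a design choice in this sketch: the queue check comes first because it can fail even when the cluster itself has plenty of free capacity.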
Quick Start
Use the volcano-gang-scheduling skill to diagnose why a PodGroup named 'my-training-pg' in the 'ai-jobs' namespace is stuck in pending.
Dependency Matrix
Required Modules: None required
Components: scripts, references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: volcano-gang-scheduling
Download link: https://github.com/scitix/siclaw/archive/main.zip#volcano-gang-scheduling
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.