volcano-gang-scheduling

Official

Debug Volcano Gang Scheduling issues

Author: scitix
Version: 1.0.0

System Documentation

What problem does it solve?

This Skill helps diagnose and resolve cases where Volcano PodGroups stay unscheduled because the cluster cannot satisfy their gang scheduling constraints (such as minMember or minResources) all at once.
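As a rough illustration of the minResources side of that check, the sketch below compares a PodGroup's minResources against a map of idle resources and reports every resource that falls short. It assumes you have already collected the idle amounts yourself (for example, node allocatable minus current requests, or the target queue's remaining capacity). The function names are hypothetical, and the quantity parser handles only a common subset of Kubernetes suffixes (no exponents, no E/P suffixes).

```python
from fractions import Fraction

# Multipliers for a subset of Kubernetes quantity suffixes (illustrative, not exhaustive).
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "m": Fraction(1, 1000),
}


def parse_quantity(q: str) -> Fraction:
    """Parse a simple quantity such as '500m', '32Gi' or '4' into an exact number."""
    q = q.strip()
    # Try longer suffixes first so a two-letter suffix is never shadowed by a one-letter one.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if q.endswith(suffix):
            return Fraction(q[: -len(suffix)]) * _SUFFIXES[suffix]
    return Fraction(q)


def unmet_min_resources(min_resources: dict, idle: dict) -> dict:
    """Return {resource: (requested, idle)} for every resource where demand exceeds supply.

    Gang semantics: the whole minResources amount must be available at once,
    so any single shortfall is enough to keep the PodGroup pending.
    """
    shortfalls = {}
    for name, requested in min_resources.items():
        available = idle.get(name, "0")
        if parse_quantity(requested) > parse_quantity(available):
            shortfalls[name] = (requested, available)
    return shortfalls


# Hypothetical values for illustration only.
min_resources = {"cpu": "16", "memory": "64Gi", "nvidia.com/gpu": "8"}
idle = {"cpu": "24", "memory": "96Gi", "nvidia.com/gpu": "6"}
print(unmet_min_resources(min_resources, idle))  # {'nvidia.com/gpu': ('8', '6')}
```

Using Fraction keeps millicore values such as "500m" exact, so comparisons never suffer from floating-point rounding.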

Core Features & Use Cases

  • Gang Scheduling Diagnosis: Identifies why PodGroups remain pending due to unmet simultaneous scheduling requirements.
  • Resource Analysis: Checks cluster and queue resources against PodGroup demands.
  • Event Interpretation: Parses Kubernetes and Volcano events for specific Gang scheduling errors.
  • Use Case: When distributed training jobs (e.g., PyTorch, TensorFlow) running on Volcano are stuck with Pending pods, this Skill guides you through determining whether the cause is insufficient simultaneous resources, resource fragmentation, or queue limits.
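Resource fragmentation is the case where the cluster has enough free resources in total but no node layout can hold all minMember pods at the same time. The minimal sketch below separates the three outcomes ("schedulable", "fragmented", "insufficient"). It assumes every pod in the gang requests the same resources and that per-node free capacity is already expressed as plain numbers in consistent units; the Node type and function names are hypothetical, and it ignores taints, affinity, and the other predicates the real scheduler applies.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free: dict  # resource name -> free amount, in units consistent with the pod request


def pods_that_fit(node: Node, pod_request: dict) -> int:
    """Number of identical pods with `pod_request` that fit in this node's free capacity."""
    counts = [
        int(node.free.get(res, 0) // amount)
        for res, amount in pod_request.items()
        if amount > 0
    ]
    if not counts:
        raise ValueError("pod_request must contain at least one positive amount")
    # The scarcest resource on the node limits how many pods it can host.
    return min(counts)


def diagnose_gang_fit(nodes: list, pod_request: dict, min_member: int) -> str:
    """Classify whether `min_member` identical pods can be placed simultaneously."""
    # With identical pods, each node's capacity is independent, so the sum is the true maximum.
    placeable = sum(pods_that_fit(n, pod_request) for n in nodes)
    if placeable >= min_member:
        return "schedulable"
    total_free = {res: sum(n.free.get(res, 0) for n in nodes) for res in pod_request}
    aggregate_ok = all(
        total_free[res] >= amount * min_member for res, amount in pod_request.items()
    )
    # Enough resources in total but not on any valid layout means fragmentation.
    return "fragmented" if aggregate_ok else "insufficient"


# Hypothetical cluster: three nodes with 3 free GPUs each; each pod needs 2 GPUs.
nodes = [Node(f"node-{i}", {"nvidia.com/gpu": 3, "cpu": 32}) for i in range(3)]
pod = {"nvidia.com/gpu": 2, "cpu": 8}
print(diagnose_gang_fit(nodes, pod, 4))  # fragmented
```

Here the cluster has 9 free GPUs against a gang demand of 8, yet each node can host only one 2-GPU pod, so at most 3 of the 4 required pods can run together.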

Quick Start

Use the volcano-gang-scheduling skill to diagnose why a PodGroup named 'my-training-pg' in the 'ai-jobs' namespace is stuck in pending.
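For the Quick Start scenario, a common first step is to dump the PodGroup with `kubectl get podgroup my-training-pg -n ai-jobs -o json` and compare spec.minMember with status.running. The helper below is a hypothetical sketch that condenses that JSON. The field names follow the Volcano PodGroup API as we understand it (spec.minMember, spec.queue, status.phase, status.running, status.conditions with type "Unschedulable"); verify them against your Volcano version. The sample object and its message are made up for illustration.

```python
import json


def summarize_podgroup(pg: dict) -> dict:
    """Condense a PodGroup object into the fields that matter for gang diagnosis."""
    spec = pg.get("spec", {})
    status = pg.get("status", {})
    min_member = spec.get("minMember", 0)
    running = status.get("running", 0)
    # Collect scheduler messages from conditions reporting the gang as unschedulable.
    unschedulable = [
        c.get("message", "")
        for c in status.get("conditions", [])
        if c.get("type") == "Unschedulable"
    ]
    return {
        "name": pg.get("metadata", {}).get("name"),
        "queue": spec.get("queue"),
        "phase": status.get("phase"),
        "minMember": min_member,
        "running": running,
        "missing": max(min_member - running, 0),  # pods still needed to satisfy the gang
        "unschedulable_messages": unschedulable,
    }


# Hypothetical PodGroup, shaped like `kubectl get podgroup ... -o json` output.
sample = json.loads("""
{
  "metadata": {"name": "my-training-pg", "namespace": "ai-jobs"},
  "spec": {"minMember": 4, "queue": "default"},
  "status": {
    "phase": "Pending",
    "conditions": [
      {"type": "Unschedulable", "status": "True", "message": "(scheduler message)"}
    ]
  }
}
""")
summary = summarize_podgroup(sample)
print(summary["phase"], summary["missing"])  # Pending 4
```

The queue name from this summary is the natural next lookup, since a queue's capability can keep a PodGroup pending even when nodes have room.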

Dependency Matrix

Required Modules

None required

Components

scripts, references

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: volcano-gang-scheduling
Download link: https://github.com/scitix/siclaw/archive/main.zip#volcano-gang-scheduling

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
