ml-debug

Official

Debug ML failures with precision.

Author: Leeroo-AI
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill systematically diagnoses and resolves failures in ML/AI workflows: out-of-memory (OOM) errors, NaN values, divergence, crashes, poor throughput, incorrect outputs, and dependency conflicts. It draws on framework-specific knowledge and grounds its answers in documentation.

Core Features & Use Cases

  • Root Cause Analysis: Identifies the underlying cause of ML failures through systematic diagnosis.
  • Framework-Specific Debugging: Uses knowledge bases and web fetching to provide accurate, context-aware solutions for ML frameworks such as PyTorch, DeepSpeed, vLLM, and Hugging Face Transformers.
  • Guided Fixes: Provides step-by-step instructions, including specific configuration changes, code patches, and verification scripts, to resolve identified issues.
  • Prevention Strategies: Offers actionable advice and runnable guardrails to prevent similar issues in the future.
  • Use Case: When a distributed training job fails with an OOM error on a specific GPU, this Skill can pinpoint whether it's due to activation memory, optimizer states, or KV cache, and provide a precise configuration adjustment to fix it.
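To make the OOM use case concrete, the sketch below estimates two of the memory consumers named above: optimizer states (together with weights and gradients) and the KV cache. It is a back-of-the-envelope illustration, not the Skill's own diagnostic code. The function names are hypothetical, the defaults assume mixed-precision Adam (2-byte weights, 2-byte gradients, 12 bytes per parameter of fp32 master weights plus two moments), and activation memory is left out because it depends heavily on architecture and checkpointing settings.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes per GiB


@dataclass
class TrainingStateEstimate:
    """Per-replica memory for model state, in GiB (no sharding assumed)."""
    weights_gib: float
    gradients_gib: float
    optimizer_gib: float

    @property
    def total_gib(self) -> float:
        return self.weights_gib + self.gradients_gib + self.optimizer_gib


def estimate_training_state(
    num_params: int,
    weight_bytes: int = 2,      # assumption: fp16/bf16 weights
    grad_bytes: int = 2,        # assumption: fp16/bf16 gradients
    optimizer_bytes: int = 12,  # assumption: fp32 master copy + Adam m and v
) -> TrainingStateEstimate:
    """Estimate weight, gradient, and optimizer-state memory."""
    return TrainingStateEstimate(
        weights_gib=num_params * weight_bytes / GIB,
        gradients_gib=num_params * grad_bytes / GIB,
        optimizer_gib=num_params * optimizer_bytes / GIB,
    )


def kv_cache_gib(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_value: int = 2,  # assumption: fp16/bf16 cache
) -> float:
    """Estimate KV-cache memory for inference, in GiB."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bytes_per_value / GIB


if __name__ == "__main__":
    # Hypothetical 7B-parameter model, chosen only for illustration.
    state = estimate_training_state(7_000_000_000)
    print(f"weights:   {state.weights_gib:.1f} GiB")
    print(f"gradients: {state.gradients_gib:.1f} GiB")
    print(f"optimizer: {state.optimizer_gib:.1f} GiB")
    print(f"total:     {state.total_gib:.1f} GiB")
```

Comparing these figures with the failing GPU's capacity gives a first hint: if model state alone is close to the limit, optimizer sharding or offloading is the likelier fix, whereas a large gap points toward activations or the KV cache.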

Quick Start

Use the ml-debug skill to diagnose and fix an OOM error encountered during LLM fine-tuning.

Dependency Matrix

Required Modules

None required

Components

scripts, references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: ml-debug
Download link: https://github.com/Leeroo-AI/superml/archive/main.zip#ml-debug

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
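For a manual install, the extraction step can be sketched as below. This is an assumption-laden illustration rather than an official installer: extract_skill is a hypothetical helper, it assumes the archive has already been downloaded to disk, and it assumes the Skill lives in a directory named ml-debug somewhere inside the zip (the archive's actual layout is not described here).

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_skill(zip_path, skill_name, dest_root):
    """Copy files under a directory named `skill_name` into dest_root/skill_name.

    Hypothetical helper: the location of the skill directory inside the
    archive is discovered by name, not taken from any documented layout.
    Returns the list of files written.
    """
    dest = Path(dest_root) / skill_name
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            parts = PurePosixPath(info.filename).parts
            # Skip entries that try to escape the destination directory.
            if ".." in parts:
                continue
            # Keep only files that sit below a directory called skill_name.
            if skill_name not in parts[:-1]:
                continue
            idx = parts.index(skill_name)
            target = dest / Path(*parts[idx + 1:])
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(zf.read(info))
            written.append(target)
    return written


if __name__ == "__main__":
    # Assumes main.zip was downloaded from the link above beforehand.
    files = extract_skill("main.zip", "ml-debug", Path(".claude") / "skills")
    print(f"installed {len(files)} files")
```

Matching on the directory name keeps the sketch independent of the top-level folder name that the archive happens to use.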
