ml-debug

Name: ml-debug
Author: Leeroo-AI

Official

Debug ML failures with precision.

Category: Software Engineering
Tags: performance optimization, OOM error, ML debugging, AI troubleshooting, NaN values, framework configuration

Version: 1.0.0

Installs: 0

System Documentation

What problem does it solve?

This Skill systematically diagnoses and resolves failures in ML/AI workflows, including Out-of-Memory (OOM) errors, NaN values, divergence, crashes, poor throughput, incorrect outputs, and dependency conflicts. It draws on framework-specific knowledge and grounds its diagnoses in the frameworks' documentation.
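To make two of the failure classes above concrete (NaN values and divergence), here is a minimal sketch of a loss check in plain Python. It is not the Skill's own implementation: the `LossGuard` and `DivergenceError` names, the window size, and the spike factor are all hypothetical choices for illustration, and the spike test assumes a positive-valued loss.

```python
import math
from collections import deque


class DivergenceError(RuntimeError):
    """Raised when the loss jumps far above its recent average."""


class LossGuard:
    """Hypothetical guardrail: fail fast on NaN/inf or a sudden loss spike."""

    def __init__(self, window: int = 20, spike_factor: float = 5.0):
        self.spike_factor = spike_factor
        # Bounded history: only the last `window` healthy losses are kept.
        self.history = deque(maxlen=window)

    def check(self, step: int, loss: float) -> None:
        # NaN and +/-inf both fail math.isfinite.
        if not math.isfinite(loss):
            raise FloatingPointError(f"non-finite loss {loss!r} at step {step}")
        # Compare against the rolling mean only once the window is full.
        # Assumes a positive loss; a zero or negative baseline would need
        # a different criterion.
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            if loss > self.spike_factor * baseline:
                raise DivergenceError(
                    f"loss {loss:.4g} at step {step} exceeds "
                    f"{self.spike_factor}x the recent mean {baseline:.4g}"
                )
        self.history.append(loss)
```

In a training loop, `guard.check(step, loss_value)` would be called once per step with the loss as a Python float (for a PyTorch tensor, `loss.item()`). Stopping at the first bad step keeps the failing batch and optimizer state available for inspection.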

Core Features & Use Cases

Root Cause Analysis: Identifies the underlying cause of ML failures through systematic diagnosis.
Framework-Specific Debugging: Draws on knowledge bases and fetched web documentation to provide accurate, context-aware solutions for ML frameworks such as PyTorch, DeepSpeed, vLLM, and Hugging Face Transformers.
Guided Fixes: Provides step-by-step instructions, including specific configuration changes, code patches, and verification scripts, to resolve identified issues.
Prevention Strategies: Offers actionable advice and runnable guardrails to prevent similar issues in the future.
Use Case: When a distributed training job fails with an OOM error on a specific GPU, this Skill can pinpoint whether it's due to activation memory, optimizer states, or KV cache, and provide a precise configuration adjustment to fix it.
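The memory categories named in the use case can be compared with back-of-envelope arithmetic. The sketch below is an illustration, not the Skill's actual procedure. It assumes unsharded mixed-precision Adam training (fp16 weights and gradients, fp32 master weights, momentum, and variance) and a standard per-layer key/value cache. The `ModelShape` fields and the 7B-style numbers are hypothetical, and activation memory is left out because it depends on architecture and checkpointing settings.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelShape:
    """Hypothetical model description; the values used below are illustrative."""
    n_params: int
    n_layers: int
    n_kv_heads: int
    head_dim: int


def training_state_bytes(n_params: int) -> dict:
    """Per-GPU bytes for unsharded mixed-precision Adam training.

    Assumes 2 bytes each for fp16 weights and gradients, plus 12 bytes of
    fp32 optimizer state (master weights, momentum, variance) per parameter.
    """
    return {
        "weights_fp16": 2 * n_params,
        "grads_fp16": 2 * n_params,
        "optimizer_fp32": 12 * n_params,
    }


def kv_cache_bytes(shape: ModelShape, batch_size: int, seq_len: int,
                   bytes_per_elem: int = 2) -> int:
    # One key tensor and one value tensor per layer, hence the factor of 2.
    return (2 * shape.n_layers * shape.n_kv_heads * shape.head_dim
            * seq_len * batch_size * bytes_per_elem)


def largest_consumer(breakdown: dict) -> str:
    """Name of the category that uses the most bytes."""
    return max(breakdown, key=breakdown.get)


if __name__ == "__main__":
    # Illustrative 7B-parameter shape (assumed, not taken from the source).
    shape = ModelShape(n_params=7_000_000_000, n_layers=32,
                       n_kv_heads=32, head_dim=128)
    breakdown = training_state_bytes(shape.n_params)
    breakdown["kv_cache"] = kv_cache_bytes(shape, batch_size=8, seq_len=4096)
    for name, size in breakdown.items():
        print(f"{name:>15}: {size / GIB:6.1f} GiB")
    print("largest:", largest_consumer(breakdown))
```

An estimate like this only narrows the search. Measured numbers from the framework's own memory reporting should confirm which category dominates before any configuration is changed.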

Quick Start

Use the ml-debug skill to diagnose and fix an OOM error encountered during LLM fine-tuning.

