pytorch-distributed

Community

Scale PyTorch training across GPUs.

Author: cuba6112
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill enables efficient training of large PyTorch models across multiple GPUs and nodes, overcoming single-GPU memory limitations and accelerating training times.

Core Features & Use Cases

  • Distributed Training: Implements DistributedDataParallel (DDP) for multi-process data parallelism and Fully Sharded Data Parallel (FSDP) for memory-intensive models.
  • Launch Utility: Integrates with torchrun for simplified job launching and fault tolerance.
  • Checkpointing: Provides guidance on safe and efficient checkpointing in distributed environments.
  • Use Case: Train a massive language model that requires more GPU memory than available on a single machine by distributing its parameters and computations across multiple GPUs using FSDP.

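For the checkpointing guidance, the usual pattern is to write from rank 0 only and synchronize the other ranks with a barrier. The sketch below assumes a DDP-wrapped (or plain) module and is illustrative, not the Skill's own helper:

```python
import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, path):
    """Write the checkpoint from rank 0 only, then synchronize all ranks."""
    if dist.get_rank() == 0:
        # Unwrap DDP so the state_dict keys carry no "module." prefix
        net = model.module if hasattr(model, "module") else model
        torch.save({"model": net.state_dict(),
                    "optimizer": optimizer.state_dict()}, path)
    dist.barrier()  # non-zero ranks wait until the file is fully written

def load_checkpoint(model, optimizer, path, device="cpu"):
    """Every rank loads the same file, mapped onto its own device."""
    ckpt = torch.load(path, map_location=device)
    net = model.module if hasattr(model, "module") else model
    net.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
```

With FSDP, saving is more involved because parameters are sharded across ranks; the Skill's references presumably cover that case separately.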
Quick Start

Use the pytorch-distributed skill to launch a DDP training script using torchrun.
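As a concrete illustration (the script name and GPU count below are hypothetical), a script like the following, saved as `train.py`, could be launched on one node with `torchrun --nproc_per_node=4 train.py`; torchrun spawns one process per GPU and sets the rendezvous environment variables for each:

```python
import os
import torch
import torch.distributed as dist

def main(backend=None):
    # torchrun exports MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK and WORLD_SIZE,
    # so init_process_group() can use its default env:// rendezvous
    if backend is None:
        backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend)
    rank, world = dist.get_rank(), dist.get_world_size()
    print(f"rank {rank}/{world} initialized")
    # ... build model, wrap in DDP, and train here ...
    dist.destroy_process_group()
    return rank, world

if __name__ == "__main__" and "RANK" in os.environ:
    main()
```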

Dependency Matrix

Required Modules

torch, nccl, gloo

Components

scripts, references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: pytorch-distributed
Download link: https://github.com/cuba6112/skillfactory/archive/main.zip#pytorch-distributed

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
