pytorch-distributed
Scale PyTorch training across GPUs.
Author: cuba6112
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill enables efficient training of large PyTorch models across multiple GPUs and nodes, overcoming single-GPU memory limits and reducing training time.
Core Features & Use Cases
- Distributed Training: Implements DistributedDataParallel (DDP) for multi-process data parallelism and Fully Sharded Data Parallel (FSDP) for memory-intensive models.
- Launch Utility: Integrates with torchrun for simplified job launching and fault tolerance.
- Checkpointing: Provides guidance on safe and efficient checkpointing in distributed environments.
- Use Case: Train a massive language model that requires more GPU memory than available on a single machine by distributing its parameters and computations across multiple GPUs using FSDP.
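The DDP pattern the features above describe can be sketched in a single script. This is a minimal illustration, not the Skill's own code: the toy linear model, hyperparameters, and the `ddp_ckpt.pt` filename are made up for the example; only `torch`, `torch.distributed`, and `DistributedDataParallel` are real PyTorch APIs. Under torchrun the rank/world-size environment variables are set for you; the fallback below lets the same file run as one local CPU process.

```python
import os
import socket
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def train(steps: int = 3) -> float:
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT for each worker.
    # If they are absent, fall back to a single local process (illustrative only).
    single = "RANK" not in os.environ
    if single:
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["RANK"] = "0"
        os.environ["WORLD_SIZE"] = "1"
        s = socket.socket()
        s.bind(("127.0.0.1", 0))  # grab a free port so repeated runs don't collide
        os.environ["MASTER_PORT"] = str(s.getsockname()[1])
        s.close()

    dist.init_process_group(backend="gloo")  # use "nccl" for GPU training

    model = torch.nn.Linear(10, 1)  # toy model standing in for a real network
    ddp_model = DDP(model)          # gradients are synchronized on backward()
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(ddp_model(torch.randn(8, 10)), torch.randn(8, 1))
        loss.backward()  # all-reduce of gradients happens here
        opt.step()

    # Checkpoint from rank 0 only, saving the unwrapped module's weights
    if dist.get_rank() == 0:
        torch.save(ddp_model.module.state_dict(), "ddp_ckpt.pt")

    dist.destroy_process_group()
    if single:
        for k in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE"):
            os.environ.pop(k)
    return loss.item()


if __name__ == "__main__":
    print(f"final loss: {train():.4f}")
```

Saving only on rank 0, and saving `module.state_dict()` rather than the DDP wrapper's, are the two checkpointing details that most often trip up new distributed code.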
Quick Start
Use the pytorch-distributed skill to launch a DDP training script using torchrun.
Dependency Matrix
Required Modules
torch, nccl, gloo
Components
scripts, references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: pytorch-distributed
Download link: https://github.com/cuba6112/skillfactory/archive/main.zip#pytorch-distributed
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.