pytorch-distributed

Community

Scale PyTorch training across GPUs.

Author: cuba6112
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill enables efficient training of large PyTorch models across multiple GPUs and nodes, overcoming single-GPU memory limitations and accelerating training times.

Core Features & Use Cases

  • Distributed Training: Implements DistributedDataParallel (DDP) for multi-process data parallelism and Fully Sharded Data Parallel (FSDP) for memory-intensive models.
  • Launch Utility: Integrates with torchrun for simplified job launching and fault tolerance.
  • Checkpointing: Provides guidance on safe and efficient checkpointing in distributed environments.
  • Use Case: Train a massive language model that requires more GPU memory than available on a single machine by distributing its parameters and computations across multiple GPUs using FSDP.

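For the checkpointing guidance, the usual pattern is to write from rank 0 only and synchronize the other ranks with a barrier. The sketch below assumes a DDP-wrapped (or plain) module and is illustrative, not the Skill's own helper:

```python
import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, path):
    """Write the checkpoint from rank 0 only, then synchronize all ranks."""
    if dist.get_rank() == 0:
        # Unwrap DDP so the state_dict keys carry no "module." prefix
        net = model.module if hasattr(model, "module") else model
        torch.save({"model": net.state_dict(),
                    "optimizer": optimizer.state_dict()}, path)
    dist.barrier()  # non-zero ranks wait until the file is fully written

def load_checkpoint(model, optimizer, path, device="cpu"):
    """Every rank loads the same file, mapped onto its own device."""
    ckpt = torch.load(path, map_location=device)
    net = model.module if hasattr(model, "module") else model
    net.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
```

With FSDP, saving is more involved because parameters are sharded across ranks; the Skill's references presumably cover that case separately.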
Quick Start

Use the pytorch-distributed skill to launch a DDP training script using torchrun.
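As a concrete illustration (the script name and GPU count below are hypothetical), a script like the following, saved as `train.py`, could be launched on one node with `torchrun --nproc_per_node=4 train.py`; torchrun spawns one process per GPU and sets the rendezvous environment variables for each:

```python
import os
import torch
import torch.distributed as dist

def main(backend=None):
    # torchrun exports MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK and WORLD_SIZE,
    # so init_process_group() can use its default env:// rendezvous
    if backend is None:
        backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend)
    rank, world = dist.get_rank(), dist.get_world_size()
    print(f"rank {rank}/{world} initialized")
    # ... build model, wrap in DDP, and train here ...
    dist.destroy_process_group()
    return rank, world

if __name__ == "__main__" and "RANK" in os.environ:
    main()
```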

Dependency Matrix

Required Modules

torch, nccl, gloo

Components

scripts, references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: pytorch-distributed
Download link: https://github.com/cuba6112/skillfactory/archive/main.zip#pytorch-distributed

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
