Distributed Training Patterns
Scale ML training across GPUs.
Author: HermeticOrmus
Version: 1.0.0
Category: System Documentation
What problem does it solve?
This Skill addresses the challenge of training large machine learning models that exceed the memory or computational capacity of a single GPU, enabling efficient scaling across multiple GPUs and nodes.
Core Features & Use Cases
- Distributed Data Parallel (DDP): Standard PyTorch DDP setup for multi-GPU training.
- Fully Sharded Data Parallel (FSDP): Advanced memory optimization for massive models, sharding parameters, gradients, and optimizer states.
- DeepSpeed ZeRO-3: Configuration for extreme-scale training with advanced memory optimization and communication strategies.
- Mixed Precision Training: Utilizes AMP and GradScaler for FP16 training to reduce memory usage and speed up computation.
- Gradient Checkpointing: Trades compute for memory by recomputing activations during the backward pass.
- Efficient Data Loading: Strategies for optimized data loading across distributed ranks.
- Use Case: Train a multi-billion parameter LLM by distributing its layers and optimizer states across dozens of GPUs using FSDP or DeepSpeed ZeRO-3, while leveraging mixed precision and gradient checkpointing to fit within hardware constraints.
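As an illustration of the DeepSpeed ZeRO-3 use case above, here is a minimal configuration sketch. The numeric values (batch sizes, clipping, offload choices) are placeholder assumptions rather than tuned settings; `deepspeed.initialize` is the standard entry point the config would be passed to.

```python
# Minimal DeepSpeed ZeRO-3 config sketch. All numeric values are
# illustrative placeholders; tune them for your model and cluster.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "gradient_clipping": 1.0,
    "fp16": {"enabled": True},           # mixed precision (see AMP above)
    "zero_optimization": {
        "stage": 3,                       # shard params, grads, optimizer states
        "overlap_comm": True,             # overlap all-gather with compute
        "contiguous_gradients": True,
        "offload_optimizer": {"device": "cpu"},  # optional CPU offload
        "offload_param": {"device": "cpu"},
    },
}

# Typical usage (requires the `deepspeed` package and a distributed launch):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```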
Quick Start
Use this Skill to set up a PyTorch DDP training loop for a custom model and dataset.
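A minimal sketch of such a DDP training loop, assuming a `torchrun` launch. The environment-variable defaults below merely allow a single-process CPU smoke test, and the tiny linear model and random tensors are placeholders for your own model and dataset:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# torchrun normally sets these; the defaults let the script run single-process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("LOCAL_RANK", "0")

def main() -> float:
    use_cuda = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")

    # Placeholder model and dataset; substitute your own.
    model = torch.nn.Linear(10, 1).to(device)
    ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)

    dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
    sampler = DistributedSampler(dataset)      # shards data across ranks
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    loss = torch.zeros((), device=device)
    for epoch in range(2):
        sampler.set_epoch(epoch)               # reshuffle differently each epoch
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()                    # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()
    return loss.item()

final_loss = main()
```

Launched as, for example, `torchrun --nproc_per_node=4 train.py`, torchrun's own environment variables override the single-process defaults above.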
Dependency Matrix
Required Modules: None required
Components: references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: Distributed Training Patterns
Download link: https://github.com/HermeticOrmus/LibreMLOps-Claude-Code/archive/main.zip#distributed-training-patterns
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.