Distributed Training Patterns


Scale ML training across GPUs.

Author: HermeticOrmus
Version: 1.0.0

System Documentation

What problem does it solve?

This Skill addresses the challenge of training large machine learning models that exceed the memory or computational capacity of a single GPU, enabling efficient scaling across multiple GPUs and nodes.

Core Features & Use Cases

  • Distributed Data Parallel (DDP): Standard PyTorch DDP setup for multi-GPU training.
  • Fully Sharded Data Parallel (FSDP): Advanced memory optimization for massive models, sharding parameters, gradients, and optimizer states.
  • DeepSpeed ZeRO-3: Configuration for extreme-scale training with advanced memory optimization and communication strategies.
  • Mixed Precision Training: Uses PyTorch automatic mixed precision (autocast) with GradScaler for FP16 training, reducing memory usage and speeding up computation.
  • Gradient Checkpointing: Trades compute for memory by recomputing activations during the backward pass.
  • Efficient Data Loading: Strategies for optimized data loading across distributed ranks.
  • Use Case: Train a multi-billion parameter LLM by distributing its layers and optimizer states across dozens of GPUs using FSDP or DeepSpeed ZeRO-3, while leveraging mixed precision and gradient checkpointing to fit within hardware constraints.
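For the DeepSpeed ZeRO-3 item above, a minimal configuration might look like the sketch below. It is illustrative, not taken from this Skill's reference files; batch size, offload targets, and clipping values depend entirely on your hardware and model. The dict can be passed directly to `deepspeed.initialize(model=model, config=ds_config, ...)`.

```python
# Illustrative DeepSpeed ZeRO-3 configuration (values are placeholders).
ds_config = {
    "train_batch_size": 64,
    "gradient_accumulation_steps": 4,
    # FP16 pairs naturally with ZeRO to cut memory further.
    "fp16": {"enabled": True},
    "zero_optimization": {
        # Stage 3 shards parameters, gradients, and optimizer states.
        "stage": 3,
        # Optional CPU offload trades host RAM for GPU memory.
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
        # Overlap communication with backward computation.
        "overlap_comm": True,
    },
    "gradient_clipping": 1.0,
}
```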
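The mixed-precision and gradient-checkpointing items can be combined in a single training step. The sketch below is a toy illustration under stated assumptions (the `CheckpointedMLP` model and `train_step` helper are hypothetical names, not part of this Skill): `torch.utils.checkpoint` recomputes block activations in the backward pass, while autocast plus GradScaler handles reduced-precision training.

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(torch.nn.Module):
    """Toy model whose hidden blocks recompute activations in backward."""
    def __init__(self, dim=32, n_blocks=4):
        super().__init__()
        self.blocks = torch.nn.ModuleList(
            torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU())
            for _ in range(n_blocks)
        )
        self.head = torch.nn.Linear(dim, 1)

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are not stored; they are
            # recomputed during backward (trading compute for memory).
            x = checkpoint(block, x, use_reentrant=False)
        return self.head(x)

def train_step(model, optimizer, scaler, x, y, device_type="cuda"):
    optimizer.zero_grad()
    # Forward pass in reduced precision (fp16 on GPU, bf16 on CPU).
    dtype = torch.float16 if device_type == "cuda" else torch.bfloat16
    with torch.autocast(device_type=device_type, dtype=dtype):
        loss = torch.nn.functional.mse_loss(model(x), y)
    # GradScaler scales the loss so fp16 gradients do not underflow.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

On CPU-only machines, construct the scaler with `enabled=False` so it becomes a transparent pass-through.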

Quick Start

Use this Skill to set up a PyTorch DDP training loop for a custom model and dataset.
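A minimal DDP loop of the kind this quick start describes might look like the following. This is a sketch, not the Skill's own template: the model and dataset are toy stand-ins, and the `gloo` backend with single-process environment defaults is used so the script also runs without GPUs (swap in `nccl` and real data for GPU training).

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun normally sets these; defaults allow a single-process dry run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    dist.init_process_group(backend="gloo")  # use "nccl" for GPU training

    # DDP replicates the model and all-reduces gradients across ranks.
    model = DDP(torch.nn.Linear(10, 1))

    dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
    # DistributedSampler gives each rank a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    last_loss = 0.0
    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # DDP synchronizes gradients here
            optimizer.step()
            last_loss = loss.item()

    dist.destroy_process_group()
    return last_loss

if __name__ == "__main__":
    main()
```

Launch across GPUs with, e.g., `torchrun --nproc_per_node=4 train_ddp.py`, which sets the rank and world-size environment variables for each worker process.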

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: Distributed Training Patterns
Download link: https://github.com/HermeticOrmus/LibreMLOps-Claude-Code/archive/main.zip#distributed-training-patterns

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
