distributed-llm-pretraining-torchtitan
Community
Scale LLM pretraining with 4D parallelism.
Author: Aum08Desai
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill addresses the challenge of efficiently pretraining large language models (LLMs) at scale, enabling users to train models like Llama 3.1 from 8 to 512+ GPUs.
Core Features & Use Cases
- 4D Parallelism: Supports FSDP2, Tensor Parallelism (TP), Pipeline Parallelism (PP), and Context Parallelism (CP) for optimal resource utilization.
- Advanced Training Techniques: Integrates Float8 precision and torch.compile for significant speedups on H100 GPUs.
- Use Case: Pretraining a custom LLM from scratch on a large dataset, with distributed training across multiple nodes and GPUs, leveraging advanced optimization techniques for faster convergence and a reduced memory footprint.
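As an illustrative sketch of how the four parallelism dimensions compose, a torchtitan-style TOML job config might declare one degree per dimension; the product of the degrees must equal the total GPU count. The section and key names below follow torchtitan's config conventions but are assumptions here, not copied from this Skill, and may differ across versions:

```toml
# Hypothetical torchtitan-style parallelism config (key names are assumptions).
# Degrees multiply to the world size: 8 x 2 x 2 x 2 = 64 GPUs.
[parallelism]
data_parallel_shard_degree = 8   # FSDP2 parameter/gradient sharding
tensor_parallel_degree     = 2   # TP, typically within one node (NVLink)
pipeline_parallel_degree   = 2   # PP across node groups
context_parallel_degree    = 2   # CP for long-sequence attention
```

A common design choice is to keep TP inside a node (where interconnect bandwidth is highest) and use FSDP/PP across nodes.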
Quick Start
Launch distributed LLM pretraining for Llama 3.1 8B on 8 GPUs using a custom configuration file.
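A minimal launch sketch for the quick start above. The entrypoint name, config path, and flag spelling are assumptions based on typical torchtitan usage, not taken from this Skill, so check them against your installed version:

```shell
# Illustrative only: script and flag names may differ by torchtitan version.
# torchrun (ships with PyTorch) spawns 8 local ranks, one per GPU.
torchrun --nproc_per_node=8 \
  train.py --job.config_file ./llama3_8b_custom.toml
```

For multi-node runs the same command is repeated per node with rendezvous flags (e.g. `--nnodes` and a shared rendezvous endpoint) added.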
Dependency Matrix
Required Modules: None required
Components: scripts, references
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Simply copy and paste the text below into Claude Code.
Please help me install this Skill:
Name: distributed-llm-pretraining-torchtitan
Download link: https://github.com/Aum08Desai/hermes-research-agent/archive/main.zip#distributed-llm-pretraining-torchtitan
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
Agent Skills Search Helper
Install a tiny helper for your Agent to search and equip skills on demand from a library of 223,000+ vetted skills.