distributed-llm-pretraining-torchtitan

Community

Scale LLM pretraining with 4D parallelism.

Author: Aum08Desai
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill addresses the challenge of efficiently pretraining large language models (LLMs) at scale, enabling users to train models such as Llama 3.1 on anywhere from 8 to 512+ GPUs.

Core Features & Use Cases

  • 4D Parallelism: Composes FSDP2, Tensor Parallelism (TP), Pipeline Parallelism (PP), and Context Parallelism (CP) for optimal resource utilization (see the sketch after this list).
  • Advanced Training Techniques: Integrates Float8 precision and torch.compile for significant speedups on H100 GPUs.
  • Use Case: Pretraining a custom LLM from scratch on a large dataset, where training must be distributed across multiple nodes and GPUs, and advanced optimizations are needed for faster convergence and a smaller memory footprint.
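torchtitan's training scripts handle this composition internally; the following is only a minimal, hand-written sketch of how two of the four dimensions (FSDP2 data-parallel sharding and tensor parallelism) compose on a PyTorch DeviceMesh, with torch.compile applied on top. The mesh shape, toy model, and module names are assumptions for illustration, not the Skill's actual code; pipeline and context parallelism would add further mesh dimensions.

```python
# Minimal sketch (not the Skill's bundled scripts): composing FSDP2
# data-parallel sharding with a tensor-parallel dimension on a 2D device
# mesh, plus torch.compile. Mesh sizes and the toy model are assumptions.
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # FSDP2 API (recent PyTorch; older versions expose it elsewhere)
from torch.distributed.tensor.parallel import (
    parallelize_module,
    ColwiseParallel,
    RowwiseParallel,
)

def build_parallel_model():
    # 8 GPUs arranged as 4-way data parallel x 2-way tensor parallel.
    mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp", "tp"))

    model = nn.Sequential(
        nn.Linear(4096, 16384, bias=False),
        nn.Linear(16384, 4096, bias=False),
    ).cuda()

    # Tensor parallelism: shard the two projections across the "tp" dimension.
    parallelize_module(
        model,
        mesh["tp"],
        {"0": ColwiseParallel(), "1": RowwiseParallel()},
    )

    # FSDP2: shard parameters across the "dp" dimension.
    fully_shard(model, mesh=mesh["dp"])

    # torch.compile for kernel fusion and speedups on H100-class GPUs.
    return torch.compile(model)
```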

Quick Start

Launch distributed LLM pretraining for Llama 3.1 8B on 8 GPUs using a custom configuration file.
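The exact launch command depends on the scripts this Skill installs, but a typical torchtitan-style launch uses torchrun with a TOML config. The sketch below wraps such a launch in Python; the training module and config path are assumptions and may differ from the Skill's actual files.

```python
# Minimal launch sketch, assuming a torchtitan-style entry point and a
# Llama 3.1 8B TOML config. The module and config paths are placeholders;
# adjust them to match the files this Skill actually installs.
import subprocess

NUM_GPUS = 8
CONFIG_FILE = "train_configs/llama3_8b.toml"  # hypothetical path

subprocess.run(
    [
        "torchrun",
        "--standalone",                    # single-node rendezvous
        f"--nproc_per_node={NUM_GPUS}",    # one process per GPU
        "-m", "torchtitan.train",          # assumed training module
        "--job.config_file", CONFIG_FILE,  # assumed config flag
    ],
    check=True,
)
```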

Dependency Matrix

Required Modules

None required

Components

scripts, references

💻 Claude Code Installation

Recommended: let Claude install it automatically. Simply copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: distributed-llm-pretraining-torchtitan
Download link: https://github.com/Aum08Desai/hermes-research-agent/archive/main.zip#distributed-llm-pretraining-torchtitan

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
