optimizing-attention-flash
Community | Accelerate transformer training & inference.
Category: Software Engineering
Tags: pytorch, gpu memory, flash attention, long context, transformer optimization, inference speed
Author: Aum08Desai
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill speeds up transformer attention and sharply reduces its memory footprint, enabling longer sequences and larger models on the same hardware.
Core Features & Use Cases
- Speed & Memory Optimization: Achieves 2-4x speedup and 10-20x memory reduction for attention mechanisms.
- Use Cases: Ideal for training or running transformers on long sequences (>512 tokens), for working around GPU memory bottlenecks in the attention layer, or when faster inference is needed. Supports PyTorch native SDPA, the flash-attn library, H100 FP8, and sliding window attention.
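Sliding window attention, mentioned above, restricts each query to a fixed-size window of recent keys so cost grows linearly with sequence length. A minimal sketch using PyTorch's native SDPA with a boolean mask (the shapes and window size here are illustrative, not from the Skill itself):

```python
import torch
import torch.nn.functional as F

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # True = attend; each query position sees at most `window`
    # preceding keys (itself included), i.e. causal + local.
    idx = torch.arange(seq_len)
    rel = idx[:, None] - idx[None, :]
    return (rel >= 0) & (rel < window)

# Toy tensors: (batch, heads, seq_len, head_dim)
q = torch.randn(1, 4, 32, 16)
k = torch.randn(1, 4, 32, 16)
v = torch.randn(1, 4, 32, 16)

mask = sliding_window_mask(32, window=8)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

With a boolean `attn_mask`, SDPA fills masked (False) positions with negative infinity before the softmax, so each output row mixes only the keys inside its window.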
Quick Start
Use the optimizing-attention-flash skill to enable Flash Attention in your PyTorch model by replacing standard attention with torch.nn.functional.scaled_dot_product_attention.
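As a hedged sketch of what that replacement looks like: manual attention materializes the full seq_len x seq_len score matrix, while `torch.nn.functional.scaled_dot_product_attention` computes the same result through a fused kernel, dispatching to Flash Attention on supported GPUs (the tensor shapes below are illustrative):

```python
import torch
import torch.nn.functional as F

# Toy shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# Manual attention: builds the full (seq_len x seq_len) score matrix in memory.
scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
manual_out = torch.softmax(scores, dim=-1) @ v

# Fused SDPA: PyTorch selects the fastest available backend
# (Flash Attention on supported CUDA GPUs; memory-efficient or math otherwise).
sdpa_out = F.scaled_dot_product_attention(q, k, v)

assert torch.allclose(manual_out, sdpa_out, atol=1e-5)
```

The two outputs match numerically; the speed and memory savings come from the backend never materializing the full attention matrix.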
Dependency Matrix
Required Modules
flash-attn, torch, transformers
Components
references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: optimizing-attention-flash
Download link: https://github.com/Aum08Desai/hermes-research-agent/archive/main.zip#optimizing-attention-flash
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.